2022-03-15

Likelihood Estimation

Motivation

The goal of maximum likelihood estimation (MLE) is to find the parameters of a distribution that best fit the data.
Fitting a distribution lets us generalize beyond the observed samples and makes the dataset easier to work with.
We are looking for a PDF of the form $p(x; \theta)$, where $\theta$ denotes the model parameters and $x$ is the dataset.
We maximize this function over the parameters of the distribution, such as its mean and variance (not the mean and variance of the data).
Likelihood estimation means finding the optimal values of the mean, standard deviation (or other parameters) of a distribution given a set of observed measurements.
One way to interpret MLE is as minimizing the dissimilarity between the training data distribution $p_{data}(\textbf{x})$ and the model distribution $p_{model}(\textbf{x}; \boldsymbol{\theta})$. A standard way to quantify this dissimilarity is the KL divergence.
Maximizing the likelihood is equivalent to minimizing the KL divergence
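
To make the idea of fitting a distribution concrete, here is a minimal sketch that fits a normal distribution to synthetic measurements. The data, the seed, and the helper names `gaussian_log_likelihood` and `gaussian_mle` are assumptions made for illustration. It relies on the standard closed-form result that the Gaussian MLE is the sample mean and the divide-by-N standard deviation.

```python
import math
import random


def gaussian_log_likelihood(data, mu, sigma):
    """Log-likelihood of i.i.d. data under a normal distribution N(mu, sigma^2)."""
    n = len(data)
    squared_error = sum((x - mu) ** 2 for x in data)
    return (
        -n * math.log(sigma)
        - 0.5 * n * math.log(2 * math.pi)
        - squared_error / (2 * sigma ** 2)
    )


def gaussian_mle(data):
    """Closed-form MLE: sample mean and the biased (divide-by-N) standard deviation."""
    n = len(data)
    mu = sum(data) / n
    variance = sum((x - mu) ** 2 for x in data) / n
    return mu, math.sqrt(variance)


random.seed(0)
# Synthetic "observed measurements" (illustrative only): true mean 5, true std 2.
data = [random.gauss(5.0, 2.0) for _ in range(1000)]

mu_hat, sigma_hat = gaussian_mle(data)
best = gaussian_log_likelihood(data, mu_hat, sigma_hat)
print("MLE estimate:", mu_hat, sigma_hat)

# Moving away from the MLE in any direction lowers the log-likelihood.
for d_mu, d_sigma in [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.5), (0.0, -0.5)]:
    shifted = gaussian_log_likelihood(data, mu_hat + d_mu, sigma_hat + d_sigma)
    print(d_mu, d_sigma, shifted < best)
```

The estimates should land close to the true parameters used to generate the data, and every perturbed parameter pair should score a lower log-likelihood than the MLE.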

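$I(x) = -\log{P(x)}$ This is the information content of an event $x$

$H(P) = -\sum_{x} P(x) \log{P(x)}$ Entropy measures the average number of bits required to encode events drawn from $P$ (with base-2 logarithms)

$H(P, Q) = -\sum_{x} P(x) \log{Q(x)}$ Cross entropy is the average number of bits needed to encode $x$ drawn from $P$ using an encoding scheme optimized for the wrong distribution $Q$

$D_{KL}(P \| Q) = \sum_{x} P(x) \log{\frac{P(x)}{Q(x)}}$ KL divergence measures the difference between two distributions $P$ and $Q$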

Using the three equations above for KL divergence, cross entropy, and entropy, we get

$H(P, Q) = H(P) + D_{KL}(P \| Q)$
When minimizing this expression with respect to the model parameters, $H(P)$ can be treated as a constant because it does not depend on the model distribution $Q$. Minimizing the cross entropy therefore reduces to minimizing the KL divergence.
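
The decomposition can be checked numerically. The sketch below uses two made-up discrete distributions `P` and `Q` (illustrative values, not from any dataset) and the standard discrete definitions of entropy, cross entropy, and KL divergence in bits. It assumes `Q` is non-zero wherever `P` is non-zero.

```python
import math


def entropy(p):
    """H(P) = -sum_x P(x) log2 P(x), in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) log2 Q(x), in bits."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) log2(P(x) / Q(x)), in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Illustrative distributions over three outcomes.
P = [0.5, 0.25, 0.25]
Q = [0.7, 0.2, 0.1]

lhs = cross_entropy(P, Q)
rhs = entropy(P) + kl_divergence(P, Q)
print(lhs, rhs, math.isclose(lhs, rhs))
```

Because `entropy(P)` does not change when `Q` changes, any `Q` that lowers the cross entropy lowers the KL divergence by the same amount.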

Maximizing the likelihood is equivalent to minimizing the KL divergence

MLE maximizes $p(y \mid x; \theta)$
$\hat{\theta} = \arg\max_{\theta} \prod_{i=1}^{N} p(x_{i} \mid \theta)$
Instead of maximizing this product, we minimize its negative logarithm, which is equivalent because the logarithm is monotonically increasing
$\hat{\theta} = \arg\min_{\theta} -\sum_{i=1}^{N} \log p(x_{i} \mid \theta)$ -> This is called the negative log likelihood (NLL)
Minimizing the NLL is equivalent to minimizing the cross entropy (see reference 3) - $\hat{\theta} = \arg\min_{\theta} H(p, q)$, where $p$ is the data distribution and $q$ is the model distribution
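
As a small illustration of these equivalences, the sketch below estimates the bias of a coin from ten made-up flips with a simple grid search. The data, the grid, and the function names are assumptions for this example. It compares the maximizer of the likelihood product with the minimizers of the NLL and of the cross entropy between the empirical distribution and the Bernoulli model.

```python
import math

# Illustrative data: 1 = heads, 0 = tails (7 heads out of 10 flips).
flips = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]


def likelihood(theta):
    """Product of per-sample Bernoulli probabilities p(x_i | theta)."""
    return math.prod(theta if x == 1 else 1 - theta for x in flips)


def nll(theta):
    """Negative log likelihood: -sum_i log p(x_i | theta)."""
    return -sum(math.log(theta if x == 1 else 1 - theta) for x in flips)


def cross_entropy(theta):
    """Cross entropy (in nats) between the empirical distribution and the model."""
    p_heads = sum(flips) / len(flips)
    return -(p_heads * math.log(theta) + (1 - p_heads) * math.log(1 - theta))


# Candidate parameter values, excluding 0 and 1 where log is undefined.
grid = [i / 100 for i in range(1, 100)]

theta_likelihood = max(grid, key=likelihood)
theta_nll = min(grid, key=nll)
theta_ce = min(grid, key=cross_entropy)
print(theta_likelihood, theta_nll, theta_ce)
```

All three criteria select the same grid point, the fraction of heads in the data. The NLL is simply the cross entropy multiplied by the number of samples, which is why the two share a minimizer.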

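In a regression problem, we assume the targets are normally distributed around the model's prediction and define the log likelihood function accordingly
Plugging the normal density into the likelihood equation and dropping the terms that do not depend on the parameters leaves the mean squared error (MSE)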
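
The link between the Gaussian likelihood and MSE can be sketched numerically. The example below assumes a one-parameter linear model `y = w * x`, a fixed noise standard deviation `SIGMA`, and a handful of made-up data points. None of these come from the text above. Under these assumptions the NLL equals a constant plus the MSE scaled by `N / (2 * SIGMA**2)`, so both objectives pick the same `w`.

```python
import math

# Illustrative data, roughly following y = 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 1.9, 4.2, 5.8, 8.1]
SIGMA = 1.0  # assumed fixed noise standard deviation


def mse(w):
    """Mean squared error of the linear model y = w * x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gaussian_nll(w):
    """NLL of the targets under y_i ~ N(w * x_i, SIGMA^2)."""
    n = len(xs)
    const = 0.5 * n * math.log(2 * math.pi * SIGMA ** 2)  # independent of w
    squared_error = sum((y - w * x) ** 2 for x, y in zip(xs, ys))
    return const + squared_error / (2 * SIGMA ** 2)


# Grid search over candidate slopes between 1.0 and 3.0.
grid = [i / 1000 for i in range(1000, 3001)]
w_mse = min(grid, key=mse)
w_nll = min(grid, key=gaussian_nll)

# Closed-form least-squares slope for comparison.
w_closed_form = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
print(w_mse, w_nll, w_closed_form)
```

The grid minimizers of the MSE and of the Gaussian NLL coincide and sit next to the closed-form least-squares slope, up to the grid resolution.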