Maximum likelihood estimation is an essential approach to parameter estimation in statistics. It requires statistical assumptions about the error distribution. I will introduce this technique with a really simple statistical model: linear regression.

We usually want to infer the relationship between input and output. They can be the cause and effect of a physical phenomenon or some indirect relationship in social science. We want to uncover the truth about the quantities we care about, and statistical tools help us do that. The relationship between input and output is formulated as a mathematical function as follows:

$$
y = f(x)
$$

In linear regression, a linear model is assumed. The relationship between $x$ and $y$ is assumed to be linear, and two unknown parameters $w$ and $b$ are introduced into the model: $w$ denotes the extent to which $x$ contributes to $y$, while $b$ denotes an inherent bias in $y$.

$$
y = wx + b
$$
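As a minimal sketch, the deterministic part of this model is just a one-line Python function; the parameter values `w_true = 2.0` and `b_true = 0.5` below are arbitrary choices for illustration, not anything fixed by the model.

```python
def linear_model(x, w, b):
    """Deterministic part of the model: the predicted y for a given x."""
    return w * x + b

# Hypothetical parameter values, chosen only for illustration
w_true, b_true = 2.0, 0.5
print(linear_model(3.0, w_true, b_true))  # prints 6.5
```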

On the data side, samples are collected as pairs $(x^i, y^i)$. The superscript $i$ denotes the index of each sample, and $n$ is the total number of samples.

$$
\mathcal{D} = \{(x^1, y^1), (x^2, y^2), \cdots, (x^n, y^n)\}
$$
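To make the later computations concrete, here is a hedged sketch of generating such a dataset with NumPy. The sample size `n = 50`, the noise scale `sigma = 1.0`, and the Gaussian noise itself (which anticipates the error model introduced below) are all assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 50                        # assumed sample size
w_true, b_true = 2.0, 0.5     # assumed ground-truth parameters
sigma = 1.0                   # assumed noise standard deviation

x = rng.uniform(-3, 3, size=n)                            # inputs x^1, ..., x^n
y = w_true * x + b_true + rng.normal(0.0, sigma, size=n)  # outputs y^1, ..., y^n

data = list(zip(x, y))        # the dataset D as a list of (x^i, y^i) pairs
```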

Maximum likelihood estimation starts from the likelihood function. A likelihood function quantifies how likely it is that the data were generated from the model we specified; that is, how likely the samples $(x^i, y^i)$ are to have been generated from the model. All parameters are collected into $\theta$. Precisely, $\mathcal{L}(\theta \mid \mathcal{D})$ represents the likelihood of the model with parameter $\theta$ given data $\mathcal{D}$. Notably, the data are fixed here. Thus, we want the most likely model that fits our samples: the model with maximum likelihood is estimated.

$$
\begin{aligned}
\mathcal{L}(\theta \mid \mathcal{D}) &= p(\mathcal{D} \mid \theta) \\
&= p((x^1, y^1), (x^2, y^2), \cdots, (x^n, y^n) \mid \theta)
\end{aligned}
$$

A probability distribution is used to evaluate the likelihood of the model $\theta$. The interpretation is the probability that the data were generated from the model $\theta$.

$$
\begin{aligned}
\mathcal{L}(\theta \mid \mathcal{D}) &= p((x^1, y^1), (x^2, y^2), \cdots, (x^n, y^n) \mid \theta) \\
&= p((x^1, y^1) \mid \theta) \cdot p((x^2, y^2) \mid \theta) \cdots p((x^n, y^n) \mid \theta)
\end{aligned}
$$

The likelihood is built from the joint probability distribution of the samples. A critical statistical assumption is introduced here: the samples $(x^i, y^i)$ are generated independently, so the joint probability factorizes into a product of individual probabilities. Further, all samples are generated from the same population/model $\theta$. We therefore say the data are independently and identically distributed (i.i.d.) samples from some model.

$$
\begin{aligned}
\mathcal{L}(\theta \mid \mathcal{D}) &= p((x^1, y^1) \mid \theta) \cdot p((x^2, y^2) \mid \theta) \cdots p((x^n, y^n) \mid \theta) \\
&= \prod_{i=1}^n p((x^i, y^i) \mid \theta)
\end{aligned}
$$

A normal distribution is used as the error model in linear regression. A linear regression model infers the mean value of $y$ given $x$, and the uncertainty of $y$ around that mean is described by a normal distribution. We therefore add an error term to the linear model.

$$
y = wx + b + \epsilon
$$

$$
\epsilon = y - wx - b \sim \mathcal{N}(0, \sigma^2)
$$

Equivalently, the density of the error term can be written as

$$
p(\epsilon) = \mathcal{N}(y - wx - b \mid \mu = 0, \sigma^2)
$$

where $\mathcal{N}(z \mid \mu, \sigma^2) = \frac{1}{\sigma \sqrt{2 \pi}} \exp\left(-\frac{1}{2} \left(\frac{z-\mu}{\sigma}\right)^2\right)$.
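This density is easy to code directly; a small sketch, where the helper name `normal_pdf` and the test value `0.3` are purely illustrative:

```python
import numpy as np

def normal_pdf(z, mu=0.0, sigma=1.0):
    """N(z | mu, sigma^2): the Gaussian density used as the error model."""
    return np.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Density of a single residual epsilon = y - w*x - b, here 0.3 as an example
print(normal_pdf(0.3))
```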

The likelihood function then becomes

$$
\begin{aligned}
\mathcal{L}(\theta \mid \mathcal{D}) &= \prod_{i=1}^n p((x^i, y^i) \mid \theta) \\
&= \prod_{i=1}^n \mathcal{N}(y^i - wx^i - b \mid \mu = 0, \sigma^2)
\end{aligned}
$$
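A direct, if numerically fragile, sketch of this product, reusing the Gaussian density and the same assumed synthetic data as in the earlier snippets:

```python
import numpy as np

rng = np.random.default_rng(0)
n, w_true, b_true, sigma = 50, 2.0, 0.5, 1.0              # assumed values
x = rng.uniform(-3, 3, size=n)
y = w_true * x + b_true + rng.normal(0.0, sigma, size=n)

def normal_pdf(z, mu=0.0, s=1.0):
    return np.exp(-0.5 * ((z - mu) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

def likelihood(w, b, x, y, s):
    """L(theta | D) = prod_i N(y^i - w*x^i - b | 0, s^2)."""
    return np.prod(normal_pdf(y - w * x - b, 0.0, s))

print(likelihood(2.0, 0.5, x, y, sigma))  # likelihood near the true parameters
print(likelihood(0.0, 0.0, x, y, sigma))  # much smaller for a poor guess
```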

The negative log-likelihood is

$$
\begin{aligned}
\ell(\theta \mid \mathcal{D}) &= - \ln \mathcal{L}(\theta \mid \mathcal{D}) \\
&= - \ln \prod_{i=1}^n \mathcal{N}(y^i - wx^i - b \mid \mu = 0, \sigma^2) \\
&= - \sum_{i=1}^n \ln \mathcal{N}(y^i - wx^i - b \mid \mu = 0, \sigma^2) \\
&= - \sum_{i=1}^n \left[ - \frac{1}{2} \left(\frac{y^i - wx^i - b - 0}{\sigma}\right)^2 - \ln(\sigma \sqrt{2 \pi}) \right] \\
&= \sum_{i=1}^n \left[ \frac{1}{2} \left(\frac{y^i - wx^i - b}{\sigma}\right)^2 + \ln(\sigma) + \ln(\sqrt{2 \pi}) \right]
\end{aligned}
$$
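Working with the sum of log terms avoids the tiny numbers produced by the raw product. A sketch of the NLL on the same assumed synthetic data (note $\ln\sqrt{2\pi} = \tfrac{1}{2}\ln 2\pi$):

```python
import numpy as np

rng = np.random.default_rng(0)
n, w_true, b_true, sigma = 50, 2.0, 0.5, 1.0              # assumed values
x = rng.uniform(-3, 3, size=n)
y = w_true * x + b_true + rng.normal(0.0, sigma, size=n)

def nll(w, b, x, y, s):
    """Negative log-likelihood, matching the final line above term by term."""
    r = y - w * x - b                                     # residuals y^i - w*x^i - b
    return np.sum(0.5 * (r / s) ** 2 + np.log(s) + 0.5 * np.log(2.0 * np.pi))

print(nll(2.0, 0.5, x, y, sigma))  # small NLL near the true parameters
print(nll(0.0, 0.0, x, y, sigma))  # larger NLL for a poor fit
```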

The primary goal is to maximize the likelihood function, $\arg\max_{\theta} \mathcal{L}(\theta \mid \mathcal{D})$, which is equivalent to minimizing the negative log-likelihood (NLL) function

$$
\arg\min_{\theta} \ell(\theta \mid \mathcal{D})
$$
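Before deriving the closed form, the NLL can also be minimized numerically, for example with `scipy.optimize.minimize`; this is only a sanity-check sketch on the assumed synthetic data, with `sigma` treated as known.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, w_true, b_true, sigma = 50, 2.0, 0.5, 1.0              # assumed values
x = rng.uniform(-3, 3, size=n)
y = w_true * x + b_true + rng.normal(0.0, sigma, size=n)

def nll(params):
    w, b = params
    r = y - w * x - b
    return np.sum(0.5 * (r / sigma) ** 2 + np.log(sigma) + 0.5 * np.log(2.0 * np.pi))

result = minimize(nll, x0=[0.0, 0.0])   # start from an arbitrary initial guess
w_hat, b_hat = result.x
print(w_hat, b_hat)                     # should land close to 2.0 and 0.5
```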

To solve this problem, we calculate the first-order derivatives of the NLL, treating $\sigma$ as a known constant. The minimum is attained when both $\frac{\partial \ell}{\partial w} = 0$ and $\frac{\partial \ell}{\partial b} = 0$ are satisfied.

$$
\begin{aligned}
\frac{\partial \ell}{\partial w} &= \frac{\partial}{\partial w} \sum_{i=1}^n \left[ \frac{1}{2} \left(\frac{y^i - wx^i - b}{\sigma}\right)^2 + \ln(\sigma) + \ln(\sqrt{2 \pi}) \right] \\
&= \sum_{i=1}^n \frac{y^i - wx^i - b}{\sigma} \cdot \frac{\partial}{\partial w} \frac{y^i - wx^i - b}{\sigma} \\
&= \sum_{i=1}^n \frac{y^i - wx^i - b}{\sigma} \cdot \frac{-x^i}{\sigma} \\
&= \frac{-1}{\sigma^2} \sum_{i=1}^n x^i (y^i - wx^i - b) \\
&= 0
\end{aligned}
$$

$$
\begin{aligned}
\frac{-1}{\sigma^2} \sum_{i=1}^n x^i (y^i - wx^i - b) &= 0 \\
\sum_{i=1}^n x^i (y^i - wx^i - b) &= 0 \\
\sum_{i=1}^n{x^i y^i} - w \sum_{i=1}^n{(x^i)^2} - b \sum_{i=1}^n{x^i} &= 0
\end{aligned}
$$
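As a sanity check on this derivative, the analytic expression $\frac{-1}{\sigma^2} \sum_{i} x^i (y^i - wx^i - b)$ can be compared against a central finite difference of the NLL at an arbitrary test point; a sketch under the same assumed data, where the test point $(w_0, b_0) = (1.0, 0.0)$ is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, w_true, b_true, sigma = 50, 2.0, 0.5, 1.0              # assumed values
x = rng.uniform(-3, 3, size=n)
y = w_true * x + b_true + rng.normal(0.0, sigma, size=n)

def nll(w, b):
    r = y - w * x - b
    return np.sum(0.5 * (r / sigma) ** 2 + np.log(sigma) + 0.5 * np.log(2.0 * np.pi))

w0, b0, eps = 1.0, 0.0, 1e-6                              # arbitrary test point
analytic = -np.sum(x * (y - w0 * x - b0)) / sigma**2      # expression derived above
numeric = (nll(w0 + eps, b0) - nll(w0 - eps, b0)) / (2 * eps)  # central difference
print(analytic, numeric)   # the two values should agree to several digits
```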

In the same way, we have

$$
\sum_{i=1}^n{y^i} - w \sum_{i=1}^n{x^i} - nb = 0
$$

derived from $\frac{\partial \ell}{\partial b} = 0$; note the factor $n$ on $b$, since the sum over $i$ contains $n$ copies of $b$.

Finally, we can estimate the parameters from the following system of two equations:

$$
\begin{cases}
(\sum_{i=1}^n{(x^i)^2}) w + (\sum_{i=1}^n{x^i}) b = \sum_{i=1}^n{x^i y^i} \\
(\sum_{i=1}^n{x^i}) w + nb = \sum_{i=1}^n{y^i}
\end{cases}
$$
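This $2 \times 2$ linear system can be solved numerically with `np.linalg.solve`; again a sketch on the assumed synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, w_true, b_true, sigma = 50, 2.0, 0.5, 1.0              # assumed values
x = rng.uniform(-3, 3, size=n)
y = w_true * x + b_true + rng.normal(0.0, sigma, size=n)

# Coefficient matrix and right-hand side of the two equations above
A = np.array([[np.sum(x**2), np.sum(x)],
              [np.sum(x),    n       ]])
rhs = np.array([np.sum(x * y), np.sum(y)])

w_hat, b_hat = np.linalg.solve(A, rhs)
print(w_hat, b_hat)   # maximum likelihood estimates of w and b
```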

Solving the system symbolically gives the closed-form estimates

$$
\begin{cases}
w = \dfrac{n \sum_{i=1}^n{x^i y^i} - \sum_{i=1}^n{x^i} \sum_{i=1}^n{y^i}}{n \sum_{i=1}^n{(x^i)^2} - (\sum_{i=1}^n{x^i})^2} \\
b = \dfrac{1}{n} \left( \sum_{i=1}^n{y^i} - w \sum_{i=1}^n{x^i} \right)
\end{cases}
$$

which are the familiar least-squares estimates of the slope and intercept.
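The closed-form expressions can be checked against `np.polyfit` with degree 1, which computes the same least-squares line; once more a sketch on the assumed synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, w_true, b_true, sigma = 50, 2.0, 0.5, 1.0              # assumed values
x = rng.uniform(-3, 3, size=n)
y = w_true * x + b_true + rng.normal(0.0, sigma, size=n)

# Closed-form maximum likelihood estimates
w_hat = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x)**2)
b_hat = (np.sum(y) - w_hat * np.sum(x)) / n

print(w_hat, b_hat)
print(np.polyfit(x, y, 1))   # NumPy's degree-1 least-squares fit: [slope, intercept]
```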