Analysis of variance (ANOVA) starts from splitting the variance of the data into the part that is caught by the model and the part that is not. If the variance is modeled, it is explained, or “caught”, by the model; if not, it is left unexplained as noise. These two parts form the explained variance (the between-group variance when the factor is categorical) and the unexplained variance (the within-group variance when the factor is categorical).

Sum of squares

The sum of squares comes from the idea of measuring (squared Euclidean) distances among data points. Let’s consider the total sum of squares $SS_{total}$.

$$
SS_{total} = \sum_i (y_i - \bar{y})^2
$$

Here $y_i$ represents an observed data value and $\bar{y}$ the mean of the $y_i$. We can obtain the variance from $SS_{total}$ by dividing it by the degrees of freedom ($df$).

$$
\begin{align}
Var[Y] &= \frac{SS_{total}}{df} \\
&= \frac{1}{n-1} \sum_i (y_i - \bar{y})^2
\end{align}
$$

The equation above shows the relation between the sum of squares and the variance.
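
As a quick sanity check, here is a minimal NumPy sketch (the data values are made up for illustration) confirming that $SS_{total}$ divided by the degrees of freedom equals the sample variance:

```python
import numpy as np

# Hypothetical data, just for illustration.
y = np.array([4.2, 5.1, 3.8, 6.0, 5.5])

ss_total = np.sum((y - y.mean()) ** 2)   # total sum of squares
var = ss_total / (len(y) - 1)            # divide by degrees of freedom

# np.var with ddof=1 computes the same sample variance.
print(np.isclose(var, np.var(y, ddof=1)))  # True
```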

We always attempt to model phenomena from data: we construct a hypothesis or model and validate it by testing its significance. If our model is true, it must catch some of the variance and extract information from the data for us. If the model is not true, it catches little variance, no more than noise would.

So we define two kinds of sums of squares to measure how much variance is caught by the model and how much is not. We usually model our data with a regression, so we have the regression sum of squares $SS_{reg}$, which represents the amount of the sum of squares explained by the model. What remains is the residual sum of squares $SS_{res}$, which represents the amount of the sum of squares left unexplained by the model.

$$
SS_{reg} = \sum_i (\hat{y}_i - \bar{y})^2
$$

$\hat{y}_i$ represents the value predicted by the model. The part of the sum of squares that is modeled, the distance between the mean and the model prediction, is explained by the model.

$$
SS_{res} = \sum_i (y_i - \hat{y}_i)^2
$$

The part of the sum of squares that is not modeled, the distance between the data and the model prediction, is left unexplained as noise.
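
To make these concrete, here is a small sketch, with hypothetical data and `np.polyfit` as an arbitrary choice of least-squares fitter, that computes both sums of squares for a simple linear regression:

```python
import numpy as np

# Hypothetical data for a simple linear regression.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Ordinary least-squares fit of a line: y_hat = b1 * x + b0.
b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b1 * x + b0

ss_reg = np.sum((y_hat - y.mean()) ** 2)  # explained by the model
ss_res = np.sum((y - y_hat) ** 2)         # left as residuals
print(ss_reg, ss_res)
```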

There is a relationship among them, which we are going to prove. To do so, we need the following two properties of the least-squares residuals $e_i = y_i - \hat{y}_i$:

$$
\sum_i e_i = \sum_i (y_i - \hat{y}_i) = 0
$$

$$
\sum_i \hat{y}_i e_i = 0
$$
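
Both identities hold for ordinary least squares with an intercept; we can verify them numerically on the same hypothetical data as above:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
b1, b0 = np.polyfit(x, y, deg=1)
e = y - (b1 * x + b0)                              # residuals e_i

print(np.isclose(e.sum(), 0.0))                    # residuals sum to zero
print(np.isclose(((b1 * x + b0) * e).sum(), 0.0))  # y_hat orthogonal to e
```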

We start from $SS_{total}$.

$$
\begin{align}
SS_{total} &= \sum_i (y_i - \bar{y})^2 \\
&= \sum_i (y_i - \hat{y}_i + \hat{y}_i - \bar{y})^2 \\
&= \sum_i \left[ (y_i - \hat{y}_i)^2 + (\hat{y}_i - \bar{y})^2 + 2(y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) \right] \\
&= \sum_i (y_i - \hat{y}_i)^2 + \sum_i (\hat{y}_i - \bar{y})^2 + 2 \sum_i \hat{y}_i(y_i - \hat{y}_i) - 2\bar{y} \sum_i (y_i - \hat{y}_i) \\
\end{align}
$$

Now you may see $SS_{reg}$ and $SS_{res}$ in the formula.

$$
SS_{total} = SS_{reg} + SS_{res} + 2 \sum_i \hat{y}_i e_i - 2\bar{y} \sum_i e_i
$$

Because the last two summation terms are zero, we have:

$$
SS_{total} = SS_{reg} + SS_{res}
$$
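
A quick numerical check of this decomposition, again on the hypothetical data from above:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b1 * x + b0

ss_total = np.sum((y - y.mean()) ** 2)
ss_reg = np.sum((y_hat - y.mean()) ** 2)
ss_res = np.sum((y - y_hat) ** 2)
print(np.isclose(ss_total, ss_reg + ss_res))  # True
```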

So now we have it! When we face a dataset and know nothing about the data, all we can measure are the mean and the variance of the data.

The mean tells us the “location” of the data. If I told you the average height of the people in a class, you could probably form an idea of, or even guess, where the data lie.

The variance tells us the “range”, or formally the “dispersion”, of the dataset. If the variance is large, we have high uncertainty about the data. Even when we know the mean, we still cannot guess values accurately, because we have no idea of the scale of the data. In some sense, the variance is associated with the sum of squares (and, as shown above, it truly is).

If we knew the model in advance, we would know the inner structure of the phenomenon and how the data are generated. That’s why we usually want to model things. If we introduce a model, it can help us extract the inner structure (or information) from the uncertainty. It reduces the uncertainty and gives us information.

The total sum of squares acts as the total uncertainty we face, and it can be decomposed into two parts. If we have a model to help us, the model extracts information, represented by the regression sum of squares. The rest of the uncertainty remains in the residuals.

One-way ANOVA

ANOVA is a way to decompose the sum of squares so that we can quantify how well the model performs.

We can further build the ANOVA table as follows:

$$
\begin{array}{l c c c c}
 & SS & df & MS & F \\
\hline
\text{model} & SS_{reg} & k-1 & MS_{reg} & \frac{MS_{reg}}{MS_{res}} \\
\text{error} & SS_{res} & n-k & MS_{res} & \\
\hline
\text{total} & SS_{total} & n-1 & MS_{total} & \\
\end{array}
$$

Fill in the corresponding cells. The degrees of freedom relate to how many parameters you used to estimate your model, although this is not always the case; you may want to consult a statistics textbook for the theoretical details. Here $k$ is the number of parameters in your model (for a categorical factor, the number of groups).

$$
\begin{align}
MS_{reg} &= \frac{SS_{reg}}{k-1} \\
MS_{res} &= \frac{SS_{res}}{n-k}
\end{align}
$$

$MS$ means mean squares, which is $SS$ divided by $df$. From these we can calculate the F statistic for the F test, and this F test is the so-called one-way ANOVA.

$$
F = \frac{\text{explained variance}}{\text{unexplained variance}}
$$

The F-value has its own meaning. It measures how much variance is caught by our model ($MS_{reg}$), and since it is a relative measurement, this is divided by the variance not caught by our model ($MS_{res}$). Put another way, the model explains part of the variance, and the rest is left unexplained.
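
Continuing the regression sketch, the table entries could be computed as follows (here $k = 2$ estimated parameters, slope and intercept, so $df_{reg} = k-1$ and $df_{res} = n-k$; the p-value uses `scipy.stats.f.sf`, assuming SciPy is available):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n, k = len(y), 2                          # k parameters: slope and intercept
b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b1 * x + b0

ms_reg = np.sum((y_hat - y.mean()) ** 2) / (k - 1)   # explained variance
ms_res = np.sum((y - y_hat) ** 2) / (n - k)          # unexplained variance
F = ms_reg / ms_res
p = stats.f.sf(F, k - 1, n - k)           # survival function: P(F' > F)
print(F, p)
```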

If the model we use is not something like a linear regression but instead separates the data into different categories, we use the following formula:

$$
F = \frac{\text{between-group variance}}{\text{within-group variance}}
$$

We can then test whether the F-value is significant by comparing it with the $F$ distribution with $(k-1, n-k)$ degrees of freedom.
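
For the categorical case, here is a minimal sketch with made-up group data that computes the between-group and within-group variances by hand and checks the F-value against `scipy.stats.f_oneway`:

```python
import numpy as np
from scipy import stats

# Hypothetical measurements for three groups.
groups = [np.array([4.1, 5.0, 4.6]),
          np.array([5.9, 6.3, 6.1]),
          np.array([4.8, 5.2, 5.0])]

all_y = np.concatenate(groups)
n, k = len(all_y), len(groups)            # k = number of groups
grand_mean = all_y.mean()

ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

F = (ss_between / (k - 1)) / (ss_within / (n - k))
F_scipy, p = stats.f_oneway(*groups)
print(np.isclose(F, F_scipy))             # True
```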