Analysis of variance

# Analysis of variance
## Research methods
### Jüri Lillemets
### 2021-11-30

---

# Are differences between groups larger than differences within groups?

---

# Two sources of variation in data

---

## Why analysis of variance?

Variance expresses mean squared deviation from average:

`$$Var(X) = \frac{1}{N} \sum^{N}_{i=1}{(X_i-\bar{X})^2},$$`

where `$N$` is the number of observations, `$X_i$` is the value of observation `$i$`, and `$\bar{X}$` is the mean value.
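
As a minimal sketch, the formula can be checked in R on the built-in `iris` data. The variable names below are illustrative. Note that R's `var()` divides by `$N - 1$` (sample variance), so it is slightly larger than the `$1/N$` version above.

```r
# Sepal lengths of all 150 iris flowers (built-in data set)
x <- iris$Sepal.Length
N <- length(x)

# Variance as mean squared deviation from the average (divides by N)
var_pop <- sum((x - mean(x))^2) / N

# R's var() uses the sample formula (divides by N - 1)
var_sample <- var(x)

c(population = var_pop, sample = var_sample)
```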

---

Let's use data of some iris flowers for illustration.

![](../img/iris.png)

---

Variance measures how different, on average, **individual values are from the mean value**.

![](04_anova_files/figure-html/unnamed-chunk-1-1.png)

These differences represent **variation** in data.

???
Average squared difference from the mean.

---

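When observations belong to groups, we can write the same variance as a sum over groups and over observations within each group. This is the starting point for separating variation into two parts:

`$$Var(Y) = \frac{1}{N} \sum^{G}_{j=1}{\sum^{N_j}_{i=1}{(Y_{ij} - \bar{Y})^2}},$$`

where 
- `$N$` is the number of all observations, 
- `$G$` is the number of groups and `$N_j$` the number of observations in group `$j$`, 
- `$Y_{ij}$` is the value of variable `$Y$` for observation `$i$` in group `$j$`, and 
- `$\bar{Y}$` the overall mean of variable `$Y$` of all observations in all groups.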

---

![](04_anova_files/figure-html/unnamed-chunk-2-1.png)

---

## The two parts as a linear model

We can also express each value as a model that distinguishes between the two sources of variation:

`$$Y_{ij}=\bar{x}+\alpha_{j}+\varepsilon_{ij},$$`

where 
- `$Y_{ij}$` is the value of `$Y$` for observation `$i$` from group `$j$`, 
- `$\bar{x}$` the overall mean, 
- `$\alpha_{j}$` the difference between the mean of group `$j$` and the overall mean `$\bar{x}$`, and 
- `$\varepsilon_{ij}$` the difference between the value for observation `$i$` and the mean of its group `$j$`.

---

Total variation can be expressed as the deviation of each observation from the mean.

`$$Y_{ij}= \bar{x} + \varepsilon_{i}$$`

|    |Species    | Sepal.Length|        x|  epsilon_i|
|:---|:----------|------------:|--------:|----------:|
|1   |setosa     |          5.1| 6.022222| -0.9222222|
|2   |setosa     |          4.9| 6.022222| -1.1222222|
|3   |setosa     |          4.7| 6.022222| -1.3222222|
|51  |versicolor |          7.0| 6.022222|  0.9777778|
|52  |versicolor |          6.4| 6.022222|  0.3777778|
|53  |versicolor |          6.9| 6.022222|  0.8777778|
|101 |virginica  |          6.3| 6.022222|  0.2777778|
|102 |virginica  |          5.8| 6.022222| -0.2222222|
|103 |virginica  |          7.1| 6.022222|  1.0777778|

---

We can also separate the group effect `$\alpha_j$` from total variation:

`$$Y_{ij}=\bar{x}+\alpha_{j}+\varepsilon_{ij},$$`

- `$Y_{ij}$` is the value for observation `$i$` from group `$j$`, 
- `$\bar{x}$` the overall mean, 
- `$\alpha_{j}$` the difference between the mean of group `$j$` and the overall mean `$\bar{x}$`, and 
- `$\varepsilon_{ij}$` the difference between the value for observation `$i$` and the mean of its group `$j$`.

---

`$$Y_{ij}=\bar{x}+\alpha_{j}+\varepsilon_{ij}$$`

|    |Species    | Sepal.Length|        x|  epsilon_i|    alpha_j| epsilon_ij|
|:---|:----------|------------:|--------:|----------:|----------:|----------:|
|1   |setosa     |          5.1| 6.022222| -0.9222222| -1.1222222|  0.2000000|
|2   |setosa     |          4.9| 6.022222| -1.1222222| -1.1222222|  0.0000000|
|3   |setosa     |          4.7| 6.022222| -1.3222222| -1.1222222| -0.2000000|
|51  |versicolor |          7.0| 6.022222|  0.9777778|  0.7444444|  0.2333333|
|52  |versicolor |          6.4| 6.022222|  0.3777778|  0.7444444| -0.3666667|
|53  |versicolor |          6.9| 6.022222|  0.8777778|  0.7444444|  0.1333333|
|101 |virginica  |          6.3| 6.022222|  0.2777778|  0.3777778| -0.1000000|
|102 |virginica  |          5.8| 6.022222| -0.2222222|  0.3777778| -0.6000000|
|103 |virginica  |          7.1| 6.022222|  1.0777778|  0.3777778|  0.7000000|
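
The table can be reproduced with a short sketch in base R. It assumes the nine rows shown are rows 1 to 3, 51 to 53 and 101 to 103 of the built-in `iris` data, and that the overall mean is taken over these nine rows only (which matches the value 6.022222 in the table). The object name `d` is illustrative.

```r
# The nine observations shown in the table
rows <- c(1:3, 51:53, 101:103)
d <- iris[rows, c("Species", "Sepal.Length")]

# Overall mean and total deviation of each observation
d$x <- mean(d$Sepal.Length)
d$epsilon_i <- d$Sepal.Length - d$x

# Group means, repeated for every observation in the group
group_mean <- ave(d$Sepal.Length, d$Species)

# Between-group part and within-group part of each deviation
d$alpha_j <- group_mean - d$x
d$epsilon_ij <- d$Sepal.Length - group_mean

d
```

Adding `x`, `alpha_j` and `epsilon_ij` in each row gives back `Sepal.Length`, which is exactly what the model states.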

---

For

`$$Y_{ij}=\bar{x}+\alpha_{j}+\varepsilon_{ij},$$`

there are two sources of variation: 
- variation **between groups**, expressed by `$\alpha_{j}$`
- random variation **within a group**, expressed by `$\varepsilon_{ij}$`

The idea behind ANOVA is comparing these two sources of variation.

If the variation between groups is much larger than the variation within groups, we can conclude that the group means differ not only in our sample but also in the population.

---

# Comparing the two sources of variation

---

## One way ANOVA

Assumptions:

- normality within each group
- independence
- homogeneity of variance (not required for Welch's test)
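
A rough sketch of how these assumptions relate to the tests available in base R. The built-in `iris` data is used here as a stand-in, not the sorghum data discussed later. `oneway.test()` runs the classical (Fisher's) ANOVA when `var.equal = TRUE` and Welch's version, which does not assume equal variances, when `var.equal = FALSE`.

```r
# Shapiro-Wilk normality test within each group (p-values)
shapiro_p <- sapply(
  split(iris$Sepal.Length, iris$Species),
  function(v) shapiro.test(v)$p.value
)
shapiro_p

# Classical one-way ANOVA: assumes equal variances in all groups
fisher <- oneway.test(Sepal.Length ~ Species, data = iris, var.equal = TRUE)

# Welch's one-way ANOVA: does not assume equal variances
welch <- oneway.test(Sepal.Length ~ Species, data = iris, var.equal = FALSE)

fisher
welch
```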

???
I will discuss the homogeneity of variance later.

---

Let's use the data set "Multi-environment trial of sorghum at 3 locations across 5 years". We wish to compare yields between locations in 2004.

|    |gen |trial |env | yield| year|loc      |
|:---|:---|:-----|:---|-----:|----:|:--------|
|150 |G01 |T1    |E09 |  6275| 2004|Melkassa |
|151 |G02 |T1    |E09 |  6375| 2004|Melkassa |
|152 |G03 |T1    |E09 |  5925| 2004|Melkassa |
|153 |G04 |T1    |E09 |  7125| 2004|Melkassa |
|154 |G05 |T1    |E09 |  6000| 2004|Melkassa |
|155 |G06 |T1    |E09 |  5950| 2004|Melkassa |

|loc      | Freq|
|:--------|----:|
|Kobo     |   28|
|Melkassa |   28|
|Mieso    |   28|

---

### Hypotheses

Are mean values equal?

`$H_0: \mu_1 = \mu_2 = \mu_3$`  
`$H_1: \mu_i \neq \mu_j$` for at least one pair of groups `$i$` and `$j$`

Here `$\mu_j$` is the population mean of group `$j$`. In other words, our `$H_1$` is that at least one group mean is different from the others.

---

Is mean yield different among locations? Does location have an effect on yield?

![](04_anova_files/figure-html/unnamed-chunk-8-1.png)

???
Do we even need ANOVA? Yes, because this is only a sample.

Let's set significance level `$\alpha = 0.05$`.

---

### Test statistic `$F$`

Test statistic `$F$` is calculated as

`$F = MSA / MSE,$`

where `$MSA$` expresses variation between groups and `$MSE$` represents random variation within groups.

---

Sums of squares of variable `$y$` for observations `$i$` in groups `$j$`, where `$k$` is the number of groups, `$n_j$` the number of observations in group `$j$` and `$n$` the total number of observations:

- **within-group** sum of squares, i.e. sum of squares of errors (SSE)  
  `$SSE=\sum_{j=1}^{k}\sum_{i=1}^{n_j} (y_{ij}-\bar{y}_{j})^{2}$`, `$df = n - k$`
- **between-group** sum of squares, i.e. sum of squares between groups (SSA)  
  `$SSA=\sum_{j=1}^{k} n_j (\bar{y}_{j}-\bar{y})^{2}$`, `$df = k - 1$`
- **total** sum of squares (SST=SSE+SSA)  
  `$SST=\sum_{j=1}^{k}\sum_{i=1}^{n_j} (y_{ij}-\bar{y})^{2}$`, `$df = n - 1$`

---

We can divide these sums of squares by their degrees of freedom to find `$MSA$` and `$MSE$`, and then calculate `$F = MSA / MSE$`:

- mean squares for SSE (MSE)  
  `$MSE = SSE / (n - k)$`
- mean squares for SSA (MSA)  
  `$MSA = SSA / (k - 1)$`
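
The whole calculation can be sketched by hand in base R. This example uses the built-in `iris` data (sepal length by species) rather than the sorghum data, and all object names are illustrative. The last lines compare the result with R's own `aov()`.

```r
y <- iris$Sepal.Length   # response variable
g <- iris$Species        # grouping factor
n <- length(y)           # total number of observations
k <- nlevels(g)          # number of groups

# Group mean repeated for every observation in that group
group_mean <- ave(y, g)

# Sums of squares; summing over observations weights each group by its size
SSE <- sum((y - group_mean)^2)        # within groups
SSA <- sum((group_mean - mean(y))^2)  # between groups
SST <- sum((y - mean(y))^2)           # total

# Mean squares and the F statistic
MSE <- SSE / (n - k)
MSA <- SSA / (k - 1)
F_stat <- MSA / MSE
p_value <- pf(F_stat, k - 1, n - k, lower.tail = FALSE)

c(SSA = SSA, SSE = SSE, SST = SST, F = F_stat, p = p_value)

# The same test with the built-in function
aov_table <- summary(aov(Sepal.Length ~ Species, data = iris))[[1]]
aov_table
```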
  
---

> Why are we taking **sums of squares** instead of using just sums?

---

If we simply summed the differences, positive and negative deviations would cancel each other out and the result would be zero, suggesting no variation.

|    |gen |trial |env | yield| year|loc      |        x|  epsilon_i|
|:---|:---|:-----|:---|-----:|----:|:--------|--------:|----------:|
|161 |G12 |T1    |E09 |  5300| 2004|Melkassa | 4414.667|   885.3333|
|172 |G23 |T2    |E09 |  4569| 2004|Melkassa | 4414.667|   154.3333|
|183 |G06 |T1    |E10 |  3375| 2004|Mieso    | 4414.667| -1039.6667|

```r
# Sum of raw deviations from the mean: positive and negative values cancel out
(sorg3$yield - mean(sorg3$yield)) %>% sum %>% round
```

```
## [1] 0
```

```r
# Sum of squared deviations: every term is non-negative, so nothing cancels
(sorg3$yield - mean(sorg3$yield))^2 %>% sum %>% round
```

```
## [1] 1888541
```

???
Does it have to be this complicated?

---

The test statistic `$F = MSA / MSE$` simply compares between-group differences to within-group differences.

![](04_anova_files/figure-html/unnamed-chunk-11-1.png)

---

.footnote[Navarro D.J., Foxcroft D.R. (2018). Learning statistics with jamovi: a tutorial for psychology students and other beginners. Danielle J. Navarro and David R. Foxcroft. doi:10.24384/HGC3-7P15]

---

### Test statistic `$F$` on F-distribution

The value of the test statistic is 27.364. The test statistic is evaluated on the F-distribution with `$k - 1$` and `$n - k$` degrees of freedom.

![](04_anova_files/figure-html/unnamed-chunk-13-1.png)
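
A small sketch of this evaluation in base R, using only the numbers from the slides: `$F = 27.364$`, `$k = 3$` locations and `$n = 84$` observations (28 per location). The object names are illustrative.

```r
F_obs <- 27.364  # observed test statistic from the slide
k <- 3           # number of groups (locations)
n <- 84          # total number of observations (3 x 28)

df1 <- k - 1     # numerator degrees of freedom
df2 <- n - k     # denominator degrees of freedom

# Critical value: F must exceed this to reject H0 at alpha = 0.05
crit <- qf(0.95, df1, df2)

# P-value: probability of an F at least this large if H0 is true
p_value <- pf(F_obs, df1, df2, lower.tail = FALSE)

c(critical_value = crit, p_value = p_value)
```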

---

The p-value is practically 0 (p < 0.001) and we chose the significance level to be `$\alpha = 0.05$`.

> Is mean yield different among locations?

> Does location have an effect on yield?

---

# Post-hoc tests and checking homogeneity of variance

---

## Tukey HSD

We found that at least one group is different from others. But which one(s)?

We can use post-hoc tests to see pairwise differences. An example test is **Tukey HSD** (honest significant differences).

```
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = yield ~ loc, data = sorg)
## 
## $loc
##                      diff        lwr        upr     p adj
## Melkassa-Kobo    752.2500   252.5353  1251.9647 0.0016007
## Mieso-Kobo      -795.9286 -1295.6433  -296.2138 0.0007997
## Mieso-Melkassa -1548.1786 -2047.8933 -1048.4638 0.0000000
```
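
Output of this kind comes from `TukeyHSD()` applied to a fitted `aov` model. A minimal sketch on the built-in `iris` data, used here as a stand-in for the sorghum data:

```r
# Fit the one-way ANOVA model first
fit <- aov(Sepal.Length ~ Species, data = iris)

# Pairwise comparisons of group means with family-wise 95% confidence
tukey <- TukeyHSD(fit, conf.level = 0.95)

# One row per pair: difference, CI bounds and adjusted p-value
tukey$Species
```

A pair of groups differs significantly when its adjusted p-value is below `$\alpha$`, or equivalently when its confidence interval does not contain 0.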

---

We can also plot differences in means along with their **confidence intervals**.

![](04_anova_files/figure-html/unnamed-chunk-15-1.png)

???
None of the CIs include 0.

---

## Levene's test of homogeneity of variance

For each observation we 
1. calculate its absolute difference from its group center (mean or median), and then
2. use ANOVA to test whether the averages of these absolute differences differ between groups.

`$H_0: \sigma^2_1 = \sigma^2_2 = \sigma^2_3$`  
`$H_1$`: at least one group variance is different from the others

If we fail to reject `$H_0$` (p-value is `$\ge \alpha$`), we have no evidence that the variances differ and treat them as homogeneous.
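
The two steps can be sketched by hand in base R. This example uses the built-in `iris` data and the group median as the center (the variant sometimes called the Brown-Forsythe test). The object names are illustrative.

```r
y <- iris$Sepal.Length   # response variable
g <- iris$Species        # grouping factor

# Step 1: absolute difference of each observation from its group median
abs_dev <- abs(y - ave(y, g, FUN = median))

# Step 2: one-way ANOVA on these absolute differences
levene <- anova(lm(abs_dev ~ g))
levene
```

A small p-value in the first row suggests that the spread of values is not the same in all groups.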

---

```
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value  Pr(>F)   
## group  2  6.4342 0.00255 **
##       81                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

> Are variances homogeneous?

> What to do if they are not?

---

## Practice

1. Download the data set `chickwts` from the course notes.
2. Go to cloud.jamovi.org.
3. Open the data set in jamovi.

> Are the weights normally distributed in each diet?

> Are the samples independent?

> Are variances of diets homogeneous?

> Does diet have an impact on weight?

---