class: center, middle, inverse, title-slide

# Principal component analysis
## Research methods
### Jüri Lillemets
### 2021-12-13

---
class: center middle clean

# How to summarize variables?

---
class: center middle inverse

# Intuition

---

## Multivariate statistics

In general, we do not test any hypotheses or estimate statistical significance. There are no response or predictor variables: a response variable is not predicted from a model. Instead, we examine the overall **structure of the data to find patterns**.

*Principal component analysis* (*PCA*) is one of the methods in the field of multivariate statistics.

---

## Principal component analysis

The aim is to **summarize continuous variables** into one, a few or several new variables. This is done by determining a set of standardized **linear combinations** that explain a maximum amount of variation in the data.

A **dimensionality reduction** technique.

The maximum number of PCs we can determine is `\(\min(p, n - 1)\)`.

---

Let's use data on top performances in the *Decathlon* from 1985 to 2006 and examine results for each event in 2000.

--

Below are the first rows in the data set (times in seconds, distances in metres).

.smaller[

|                 | m100| Longjump| Shotput| Highjump| m400| m110hurdles| Discus| Polevault| Javelin| m1500|
|:----------------|----:|--------:|-------:|--------:|----:|-----------:|------:|---------:|-------:|-----:|
|Tomas Dvorak     | 10.5|     8.03|    16.7|     2.09| 48.4|        13.9|   47.9|      4.85|    67.2|   282|
|Roman Sebrle     | 10.6|     7.88|    15.2|     2.15| 49.0|        14.0|   47.2|      4.75|    67.2|   275|
|Erki Nool        | 10.7|     7.78|    14.1|     1.97| 47.2|        14.4|   44.2|      5.55|    69.1|   276|
|Chris Huffins    | 10.5|     7.71|    15.3|     2.09| 48.3|        13.9|   49.5|      4.70|    56.6|   279|
|Aleksandr Yurkov | 10.7|     7.93|    15.3|     2.03| 49.7|        14.6|   47.9|      5.15|    58.9|   272|
|Dean Macey       | 10.8|     7.77|    14.6|     2.09| 46.4|        14.5|   43.4|      4.80|    60.4|   263|
|Attila Zsivoczky | 10.6|     7.24|    15.7|     2.18| 48.1|        14.9|   45.6|      4.65|    63.6|   263|
|Frank Busemann   | 10.8|     7.92|    14.5|     2.03| 48.9|        14.7|   43.7|      5.05|    62.8|   263|
|Stefan Schmid    | 10.8|     7.59|    14.1|     2.01| 49.0|        14.2|   44.2|      5.06|    67.6|   272|
|Thomas Pappas    | 10.7|     7.41|    15.3|     2.14| 48.9|        14.3|   45.8|      5.10|    63.5|   300|

]

--

Why would we want to summarize these variables?

---

Variables are linearly correlated.

.smaller[

|            |   m100| Longjump| Shotput| Highjump|   m400| m110hurdles| Discus| Polevault| Javelin|  m1500|
|:-----------|------:|--------:|-------:|--------:|------:|-----------:|------:|---------:|-------:|------:|
|m100        |  1.000|   -0.440|  -0.130|   -0.097|  0.644|       0.493| -0.130|    -0.154|  -0.010| -0.018|
|Longjump    | -0.440|    1.000|   0.269|    0.355| -0.259|      -0.412|  0.186|     0.295|   0.199| -0.044|
|Shotput     | -0.130|    0.269|   1.000|    0.171| -0.023|      -0.268|  0.705|     0.311|   0.493|  0.131|
|Highjump    | -0.097|    0.355|   0.171|    1.000| -0.105|      -0.240|  0.137|     0.100|   0.031| -0.008|
|m400        |  0.644|   -0.259|  -0.023|   -0.105|  1.000|       0.454| -0.068|    -0.132|   0.004|  0.418|
|m110hurdles |  0.493|   -0.412|  -0.268|   -0.240|  0.454|       1.000| -0.239|    -0.293|  -0.125|  0.086|
|Discus      | -0.130|    0.186|   0.705|    0.137| -0.068|      -0.239|  1.000|     0.300|   0.455|  0.033|
|Polevault   | -0.154|    0.295|   0.311|    0.100| -0.132|      -0.293|  0.300|     1.000|   0.249| -0.074|
|Javelin     | -0.010|    0.199|   0.493|    0.031|  0.004|      -0.125|  0.455|     0.249|   1.000| -0.031|
|m1500       | -0.018|   -0.044|   0.131|   -0.008|  0.418|       0.086|  0.033|    -0.074|  -0.031|  1.000|

]

---

<!-- -->

--

> How could we summarize all of this variation into a single score?

---

### Linear combinations

Each linear combination is called a principal component (PC).
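In R terms this is simply a weighted sum of the columns of the data. Below is a minimal sketch, assuming the decathlon results are stored in a numeric data frame `dec` (a made-up name); the weights `w` are arbitrary illustrative values, not actual PCA loadings. The formula that follows makes the idea precise.

```r
# One arbitrary weight per event, one summary score per athlete
w <- c(-1, 0.8, 0.7, 0.5, -0.9, -0.8, 0.6, 0.6, 0.5, -0.3)
scores <- scale(as.matrix(dec)) %*% w  # standardize, then take the weighted sum
head(scores)
```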
The `\(j\)`th PC can be expressed as a linear combination `\(\xi_j\)`:

`$$\xi_j = b_{j1}x_1 + b_{j2}x_2 + ... + b_{jp}x_p,$$`

where `\(b_{jp}\)` is the weight of variable `\(x_p\)` in a particular `\(\xi_j\)`.

.smaller[

|                 | m100| Longjump| Shotput| Highjump| m400| m110hurdles| Discus| Polevault| Javelin| m1500|
|:----------------|----:|--------:|-------:|--------:|----:|-----------:|------:|---------:|-------:|-----:|
|Tomas Dvorak     | 10.5|     8.03|    16.7|     2.09| 48.4|        13.9|   47.9|      4.85|    67.2|   282|
|Roman Sebrle     | 10.6|     7.88|    15.2|     2.15| 49.0|        14.0|   47.2|      4.75|    67.2|   275|
|Erki Nool        | 10.7|     7.78|    14.1|     1.97| 47.2|        14.4|   44.2|      5.55|    69.1|   276|
|Chris Huffins    | 10.5|     7.71|    15.3|     2.09| 48.3|        13.9|   49.5|      4.70|    56.6|   279|
|Aleksandr Yurkov | 10.7|     7.93|    15.3|     2.03| 49.7|        14.6|   47.9|      5.15|    58.9|   272|
|Dean Macey       | 10.8|     7.77|    14.6|     2.09| 46.4|        14.5|   43.4|      4.80|    60.4|   263|

]

???
PCs are thus essentially weighted sums of the variables. We need to give each sports event some weight.

---

## Estimation of PCs

We can think of the estimation as a process that starts from the first PC. The first PC is chosen so that

1. it follows the direction of the largest variance in the data cloud and
2. it is the line with the smallest orthogonal distance to all data points.

Each following PC is chosen so that it is orthogonal to the preceding PCs.

---

Suppose we have two variables:

<!-- -->

---

The line that maximizes variation and has the smallest orthogonal distance to the data points is the first PC.

<!-- -->

--

> What does this line remind you of?

---

PCA is somewhat similar to least squares estimation.

<!-- -->

OLS minimizes residuals with respect to the response, whereas PC1 minimizes orthogonal distances to the data points.

???
Except that in PCA we consider variation in all variables, not just the response.

---

Each following PC is orthogonal to the previous one.

<!-- -->

Each PC explains as much as possible of the variation left unexplained by the previous PCs.

---

The process involves transforming the data points into a new coordinate system:

1. centering the data points,
2. scaling the axes so that they are on a comparable scale,
3. rotating the axes into the directions of the PCs.

---

### Center

<!-- -->

We subtract the mean from all variables so that the mean value of each PC is 0.

---

### Scale

<!-- -->

We divide each variable by its standard deviation so that all variables, and thus the PCs, are on a similar scale.

---

### Rotate

<!-- -->

We rotate the data so that each PC represents an axis.

---

How do we know exactly how to scale and rotate the axes?

We use parameters derived from the data:

- for scaling we use *eigenvalues* and
- for rotation we use *eigenvectors*.

---

## Calculation

There are two methods for calculating eigenvalues and eigenvectors from data, depending on whether we use

1. the covariance or correlation matrix `\(\Sigma\)` of the data matrix `\(X : n \times p\)` or
2. the raw data matrix `\(X : n \times p\)` itself.

---

### Spectral decomposition

This calculation uses the correlation or covariance matrix `\(\Sigma\)`.

Because `\(\Sigma = V D V^T\)`, we can find the eigenvalues from the diagonal of `\(D\)` and the corresponding eigenvectors from the columns of `\(V\)`.

Computationally simpler.

---

### Singular value decomposition

This calculation uses the (centered) data matrix `\(X : n \times p\)` itself.

Because `\(X = UDV^T\)`, we can find the eigenvalues from the squared singular values on the diagonal of `\(D\)` and the corresponding eigenvectors from the columns of `\(V\)`.

Numerically more stable.

---

Because PCA involves maximizing variation, variables with higher variance are given more importance.

Consider scaling variables ( `\(x' = (x - \mu) / \sigma\)` ) prior to estimation.
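In R, both calculations can be sketched as below, again assuming the decathlon results are in the numeric data frame `dec`: the spectral decomposition of the correlation matrix and the SVD-based `prcomp()` on centered and scaled data give the same eigenvalues and eigenvectors, up to the sign of the eigenvectors.

```r
ev <- eigen(cor(dec))                  # spectral decomposition of the correlation matrix
ev$values                              # eigenvalues: the variance of each PC
ev$vectors                             # eigenvectors: one column of weights per PC

pca <- prcomp(dec, center = TRUE, scale. = TRUE)  # SVD of the centered and scaled data
pca$sdev^2                             # same as ev$values
pca$rotation                           # same as ev$vectors, up to sign
```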
<!-- -->

---

## Number of PCs

Aside from theoretical considerations, there are also several possible criteria for determining the number of PCs to retain:

1. choose a threshold of variation in the data that the PCs should cumulatively explain, e.g. 50%;
2. use PCs that have an eigenvalue above the average (which is one when variables are standardized);
3. use PCs that are above an "elbow" in the scree plot.

---

The choice can be made by looking at scree plots.

<!-- -->

---
class: center middle inverse

# Interpretation

---

## Variation explained

The proportion of total *variation explained* by the `\(j\)`th PC can be found from the eigenvalues: it is that PC's share of the sum of all eigenvalues, i.e. `\(\lambda_j / \sum^p_{k=1} \lambda_k\)`.

The total variation explained by the PCs illustrates how much information is preserved when summarizing the data into a particular number of PCs.

--

|                        | PC1| PC2| PC3| PC4| PC5| PC6| PC7| PC8| PC9| PC10|
|:-----------------------|---:|---:|---:|---:|---:|---:|---:|---:|---:|----:|
|Variation explained, %  |  31|  19|  11|  10|   8|   6|   5|   5|   3|    2|
|Cumulative variation, % |  31|  50|  61|  71|  79|  86|  91|  95|  98|  100|

--

> How much of the total variation in the data do the first two PCs explain?

---

## Loadings

Variables "load" PCs and thus each variable has a particular *loading* on each PC. These can be understood as *weights*.

.smaller[

|             |    PC1|    PC2|    PC3|    PC4|    PC5|    PC6|    PC7|    PC8|    PC9|   PC10|
|:------------|------:|------:|------:|------:|------:|------:|------:|------:|------:|------:|
|run100       | -0.349| -0.369|  0.158| -0.443|  0.070| -0.107| -0.250| -0.227| -0.098|  0.617|
|longjump     | -0.379| -0.114|  0.301|  0.203| -0.170| -0.574| -0.248|  0.488| -0.054| -0.224|
|shotput      | -0.370|  0.420| -0.019| -0.098|  0.202|  0.164| -0.135|  0.192|  0.735|  0.114|
|highjump     | -0.216| -0.035|  0.459|  0.689|  0.268|  0.196| -0.004| -0.384| -0.017|  0.092|
|run400       | -0.303| -0.481| -0.264| -0.128|  0.202|  0.088| -0.143| -0.302|  0.174| -0.632|
|run110hurdle | -0.401| -0.213|  0.102| -0.094| -0.050|  0.241|  0.793|  0.285| -0.061|  0.016|
|polevault    | -0.354|  0.390| -0.148| -0.095|  0.264|  0.327| -0.257|  0.170| -0.638| -0.114|
|discus       | -0.308|  0.111| -0.138|  0.122| -0.849|  0.244| -0.131| -0.249|  0.012|  0.010|
|javelin      | -0.269|  0.384| -0.295|  0.016|  0.099| -0.601|  0.348| -0.440| -0.077|  0.016|
|run1500      | -0.070| -0.291| -0.682|  0.478|  0.101| -0.009| -0.060|  0.263|  0.014|  0.366|

]

--

> How would you describe these PCs?

???
Loadings can be used to give a theoretical meaning to each PC. This is a creative exercise.

---

### Scores

Principal component scores are the original variables weighted by the loadings of the PCs.

Recall the representation of linear combinations `\(\xi_j = b_{j1}x_1 + b_{j2}x_2 + ... + b_{jp}x_p\)`.
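Continuing the earlier sketch (with the assumed objects `dec` and `pca` from before), the scores can also be computed by hand as the standardized data multiplied by the matrix of loadings; `prcomp()` stores the same values in `pca$x`. The table below shows such scores for the first athletes.

```r
scores <- scale(dec) %*% pca$rotation  # standardized data weighted by the loadings
head(scores)                           # the same values as head(pca$x)
```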
.smaller[

|                 |   PC1|    PC2|    PC3|   PC4|    PC5|    PC6|    PC7|    PC8|    PC9|   PC10|
|:----------------|-----:|------:|------:|-----:|------:|------:|------:|------:|------:|------:|
|Tomas Dvorak     | -6.59|  0.420|  0.754| 0.278|  0.705| -1.065| -0.114|  0.340|  0.389|  0.218|
|Roman Sebrle     | -5.67|  0.090|  0.727| 1.223|  0.852| -1.027|  0.229| -0.056| -0.424|  0.516|
|Erki Nool        | -5.33| -0.469| -0.982| 0.112| -1.545| -1.310| -0.220| -0.901| -0.275| -0.513|
|Chris Huffins    | -5.42| -0.651|  0.825| 0.094|  0.968|  0.341| -0.442|  0.533| -0.536|  0.434|
|Aleksandr Yurkov | -4.90|  0.259|  0.236| 0.704| -0.580| -0.530| -1.142|  0.848| -0.348|  0.562|
|Dean Macey       | -4.65| -1.451| -0.435| 1.306|  0.611| -0.658| -0.681| -0.211|  0.353| -0.644|
|Attila Zsivoczky | -4.40| -0.108| -0.332| 1.450|  1.731|  0.116| -0.752| -1.239|  0.487|  0.961|
|Frank Busemann   | -4.41| -0.484| -0.407| 1.187| -0.492| -1.367| -0.828|  0.307| -0.080|  0.324|
|Stefan Schmid    | -4.35| -0.037| -0.601| 0.592| -0.480| -1.039|  0.519| -0.316| -0.439|  0.226|
|Thomas Pappas    | -4.78|  0.673|  1.348| 0.300|  0.009|  0.235|  0.087| -1.294|  0.045|  0.083|

]

---

## Biplot

<!-- -->

--

> How would you interpret these first two PCs?

---

How are our results related to the actual points given to athletes?

<!-- -->

???
It seems that technical events are given higher importance. The methodology for calculating total points dates from 1985.

---
class: center middle inverse

# Practical application

---

Use the data set `UN98`.

> Are the variables linearly correlated?

> Summarize the social indicators into a single index. What are the weights of these variables?

> Use the index to describe the social situation in world regions.

---
class: inverse
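---

For reference, a possible starting point for the exercise — a minimal sketch, assuming `UN98` comes from the *carData* package and contains a grouping factor named `region`:

```r
library(carData)                       # assumed source of the UN98 data

un <- na.omit(UN98)                    # keep countries with complete records
soc <- un[sapply(un, is.numeric)]      # only the numeric social indicators

round(cor(soc), 2)                     # are the variables linearly correlated?

pca <- prcomp(soc, center = TRUE, scale. = TRUE)  # PCA on standardized variables
summary(pca)                           # variation explained by each PC
pca$rotation[, 1]                      # loadings of PC1, i.e. the weights of the index
index <- pca$x[, 1]                    # PC1 scores used as a single social index

tapply(index, un$region, mean)         # compare the index across world regions
```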