class: center, middle, inverse, title-slide

# Cluster analysis
## Research methods
### Jüri Lillemets
### 2021-12-15

---

class: center middle clean

# How to group objects?

---

class: center middle inverse

# What is clustering?

---

## The idea behind clustering

The purpose of cluster analysis is to **categorize objects into some homogeneous groups**

--

so that **objects within the same groups are more closely related than objects in different groups**.

--

The categorization is based on **similarities between objects** according to a set of variables.

---

There are various methods for categorizing objects:

- classification,
- **clustering**,
- model-based (e.g. mixture models) methods,
- **distance-based (combinatorial) methods**,
- **hierarchical clustering**,
- **partitioning (K-means clustering)**.

---

### Clustering and classification

In clustering we do not have any information on possible existing classes and we cannot compare clusters to classes.

In classification we already know the classes and use that information to determine classification rules.

--

We used classification after estimating a logistic regression model.

---

### Model- and distance-based clustering

Model-based methods assume that an underlying model can explain clusters in data (e.g. mixture models).

Distance-based methods apply distances between objects and clusters to separate objects into clusters.

---

Mixture models assume that objects follow a mixture of distributions. Thus objects can be clustered by locating the densities in the data. Distance-based methods use only the distances between objects.

<!-- -->

---

### Hierarchical clustering and partitioning

In hierarchical clustering clusters are constructed **incrementally** according to similarities between objects and clusters.

In partitioning objects are assigned to a particular number of groups and the optimal clustering is determined **iteratively**.

--

While partitioning is intrinsically a *divisive* process, *hierarchical clustering* can be applied *agglomeratively* as well as *divisively*.

---

## Why objects?

It is more common to assign observations to clusters. However, we can cluster either observations or variables. That's why we refer to **objects** as the phenomena to be assigned to clusters.

---

## Standardization

Variables with higher variances have a higher influence on how the objects are clustered. If this is not desired, variables should be standardized prior to clustering.

Conversely, sometimes it might be desirable to give more weight to particular variables.

---

Why should we standardize?

<!-- -->

--

The distances between objects are scale-dependent.

---

## Application

Clustering can be applied in practice for various purposes.

Marketing and sales - find homogeneous groups of customers so that promotional campaigns can be addressed more accurately and thus more efficiently.

Medicine - cluster patients with similar symptoms or predispositions for treatment or for the discovery of risks.

Finance - categorize enterprises into different types based on some financial or other characteristics.

Biology - assign plants to species depending on characteristics they share.

--

We can cluster anything we want.

---

## Distance measures

The assignment of objects into groups should be such that objects are more similar within groups than between groups.

--

.pull-left[
We need to somehow measure the distances between all objects.
]

.pull-right.small[
| Length| Flaws|
|------:|-----:|
|   1.22|     1|
|   1.70|     4|
|   2.71|     5|
|   3.71|    14|
|   3.72|     7|
|   3.75|     9|
|   4.17|     2|
|   4.41|     8|
|   4.58|     4|
|   4.91|     7|
]

---

### Distance matrix

For a data matrix `\(X : n \times p\)` with `\(n\)` observations and `\(p\)` variables, the distances `\(d\)` between objects `\(i\)` and `\(j\)` can be described as a *proximity matrix* or *distance matrix* `\(D : n \times n\)` where `\(d_{ij} = d(x_i,x_j)\)`.

--

We usually need to calculate this.

---

A distance matrix contains pairwise distances between all objects.

|     1|     2|    3|     4|    5|    6|     7|    8|     9|   10|
|-----:|-----:|----:|-----:|----:|----:|-----:|----:|-----:|----:|
|  0.00|  3.04| 4.27| 13.24| 6.50| 8.39|  3.12| 7.69|  4.50| 7.04|
|  3.04|  0.00| 1.42| 10.20| 3.62| 5.40|  3.18| 4.83|  2.88| 4.39|
|  4.27|  1.42| 0.00|  9.05| 2.24| 4.13|  3.34| 3.45|  2.12| 2.97|
| 13.24| 10.20| 9.05|  0.00| 7.00| 5.00| 12.01| 6.04| 10.04| 7.10|
|  6.50|  3.62| 2.24|  7.00| 0.00| 2.00|  5.02| 1.22|  3.12| 1.19|
|  8.39|  5.40| 4.13|  5.00| 2.00| 0.00|  7.01| 1.20|  5.07| 2.31|
|  3.12|  3.18| 3.34| 12.01| 5.02| 7.01|  0.00| 6.00|  2.04| 5.05|
|  7.69|  4.83| 3.45|  6.04| 1.22| 1.20|  6.00| 0.00|  4.00| 1.12|
|  4.50|  2.88| 2.12| 10.04| 3.12| 5.07|  2.04| 4.00|  0.00| 3.02|
|  7.04|  4.39| 2.97|  7.10| 1.19| 2.31|  5.05| 1.12|  3.02| 0.00|

---

We look at the most common measures for continuous variables, the Euclidean and Manhattan distances:

`$$d_{Euclidean}(x_i,x_j) = [\sum^p_{k=1}(x_{ik} - x_{jk})^2]^{1/2},$$`

`$$d_{Manhattan}(x_i,x_j) = \sum^p_{k=1} |x_{ik} - x_{jk}|.$$`

--

Distance measures for ordinal and nominal variables also exist but are not explained here.

---

Example data on the number of flaws in 32 pieces of cloth.

<!-- -->

---

class: center middle inverse

# Hierarchical clustering

---

Clusters are constructed **incrementally**. This can be done

- **divisive**ly ("top-down"), where we begin with a single cluster and divide it into smaller clusters and eventually into objects, or
- **agglomerative**ly ("bottom-up"), where we start by combining objects into clusters and eventually have a single cluster.

We will explore agglomerative hierarchical clustering (*Agnes* - agglomerative nesting).

---

The example data we use contains 908 measurements on three hawk species.

.small[
|    | Wing| Weight| Culmen| Hallux| Tail| StandardTail| Tarsus| WingPitFat| KeelFat| Crop|
|:---|----:|------:|------:|------:|----:|------------:|------:|----------:|-------:|----:|
|899 |  200|    185|   12.8|   15.2|  158|          166|     NA|         NA|     4.0| 1.00|
|900 |  360|   1325|   26.2|   30.6|  224|          230|     NA|         NA|     4.0| 0.75|
|901 |  366|    945|   25.3|   27.2|  199|          205|     NA|         NA|     2.0| 0.00|
|902 |  402|   1350|   28.7|   31.0|  219|          214|     NA|         NA|     3.0| 0.00|
|903 |  366|    805|   23.5|   25.7|  217|          222|     NA|         NA|     1.5| 0.25|
|904 |  380|   1525|   26.0|   27.6|  224|          227|     NA|         NA|     3.0| 0.00|
|905 |  190|    175|   12.7|   15.4|  150|          153|     NA|         NA|     4.0| 0.00|
|906 |  360|    790|   21.9|   27.6|  211|          215|     NA|         NA|     2.0| 0.00|
|907 |  369|    860|   25.2|   28.0|  207|          210|     NA|         NA|     2.0| 0.00|
|908 |  199|   1290|   28.7|   32.1|  222|          226|     NA|         NA|     1.0| 0.00|
]

---

We will use the following variables to cluster the hawks.
- `Wing` Length (in mm) of primary wing feather from tip to wrist it attaches to
- `Weight` Body weight (in gm)
- `Culmen` Length (in mm) of the upper bill from the tip to where it bumps into the fleshy part of the bird
- `Hallux` Length (in mm) of the killing talon
- `Tail` Measurement (in mm) related to the length of the tail (invented at the MacBride Raptor Center)
- `StandardTail` Standard measurement of tail length (in mm)
- `Tarsus` Length of the basic foot bone (in mm)
- `WingPitFat` Amount of fat in the wing pit
- `KeelFat` Amount of fat on the breastbone (measured by feel)
- `Crop` Amount of material in the crop, coded from 1=full to 0=empty

---

## Process

Agglomerative hierarchical clustering has the following steps.

1. The initial number of clusters is `\(n\)`, so each cluster contains one object.
2. Calculate the distance matrix `\(D\)` that expresses pairwise distances between clusters (objects).
3. Find the smallest distance and merge the two closest clusters into a single cluster.
4. Calculate a new distance matrix `\(D\)` that now includes the distances between the new cluster and all other clusters, using a linkage method (see below).
5. Repeat the previous two steps until all objects are in a single cluster.

---

Let's illustrate the process with two variables and 5 hawks.

<!-- -->

---

Here's the part of the respective initial Euclidean distance matrix `\(D\)` that represents pairwise distances between hawks.

|   |    6|    7|    8|    9|   10|   11|
|:--|----:|----:|----:|----:|----:|----:|
|6  |  0.0| 45.7| 39.2| 20.0| 10.6| 20.6|
|7  | 45.7|  0.0| 31.4| 42.0| 43.6| 25.1|
|8  | 39.2| 31.4|  0.0| 49.6| 30.4| 27.7|
|9  | 20.0| 42.0| 49.6|  0.0| 28.9| 22.5|
|10 | 10.6| 43.6| 30.4| 28.9|  0.0| 20.0|
|11 | 20.6| 25.1| 27.7| 22.5| 20.0|  0.0|

---

What if we have more than two variables? For example three?
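---

As a reference, here is a minimal sketch of how such an agglomerative clustering could be run in R; the data frame name `hawks`, the selected variables and the linkage method are assumptions for illustration, not part of the original analysis.

```r
# Sketch: agglomerative hierarchical clustering of the hawk measurements
vars <- c("Wing", "Weight", "Culmen", "Hallux", "Tail")
X <- scale(na.omit(hawks[, vars]))    # standardize so no variable dominates

D  <- dist(X, method = "euclidean")   # pairwise distance matrix
hc <- hclust(D, method = "complete")  # linkage: "single", "complete", "average", "ward.D2"

plot(hc)                              # dendrogram
clusters <- cutree(hc, k = 3)         # cut into 3 clusters (or cut at a height with h = ...)
table(clusters)
```

The same distance matrix works for any number of variables, so nothing in the procedure changes when we move from two variables to three or more.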
---

## Linkage methods

How do we calculate the distance between a merged cluster `\(IJ\)` and another cluster `\(K\)`?

- single linkage: `\(d_{IJ,K} = \min(d_{I,K}, d_{J,K})\)`;
- complete linkage: `\(d_{IJ,K} = \max(d_{I,K}, d_{J,K})\)`;
- average linkage: `\(d_{IJ,K} = \sum_{i \in IJ} \sum_{k \in K} d_{ik} / (n_{IJ}n_K)\)`;
- Ward's method: compares the within-cluster and between-cluster squared distances.

---

How to think about single, complete and average linkage?

<!-- -->

???
Draw linkage results.

---

Single linkage tends to link objects serially, resulting in **clusters with large diameter** where objects within a cluster are not similar.

Complete linkage has the tendency to produce **clusters with small diameter** and as a result, an object can be closer to members of another cluster.

Average linkage is a **compromise between the two** but is sensitive to the scale on which distances are measured.

---

There is no correct linkage method.

---

## Dendrogram

The clusters in hierarchical clustering are estimated incrementally, resulting in a **nested structure**. This tree-shaped structure can be visualized by a **dendrogram**.

A dendrogram is highly interpretable and provides a complete description of the clustering process.

---

<!-- -->

---

A dendrogram also allows us to illustrate the differences between linkage methods.

<!-- -->

---

## Number of clusters

In hierarchical clustering we can decide the suitable number of clusters after the clustering procedure.

The decision can be made by examining the dendrogram. We can look for the longest vertical stretch without merges and cut the tree within it, at a height where the other branches are also long.

---

To determine the clusters we can choose either the **height of the cut** or the **number of clusters**.

<!-- -->

---

If we cut at a height of 1000, we would obtain 5 clusters.

<!-- -->

--

> At what height would we have to cut if we wished to obtain 3 clusters?

---

Actually, the species of each hawk is already known. How does it coincide with our clusters?

|   | CH|  RT|  SS|
|:--|--:|---:|---:|
|1  |  3| 398|   1|
|2  | 67|   8| 260|
|3  |  0| 171|   0|

---

class: center middle inverse

# K-means clustering

---

Clusters are constructed by *partitioning* objects *iteratively*. The **number of clusters `\(K\)` has to be defined before estimation**.

The goal is to partition objects `\(x\)` into `\(K\)` clusters so that distances between objects within a cluster are small compared to distances to points outside the cluster.

We can achieve this by assigning each object to the closest **centroid**, i.e. cluster mean.

---

We thus need optimal cluster means. The optimal mean vectors `\(\bar x_1, \dots, \bar x_K\)` can be found by minimizing the following function:

`$$ESS = \sum^K_{k = 1} \sum_{c(i)=k} (x_i - \bar x_k)^T(x_i - \bar x_k),$$`

where `\(c(i)\)` is the cluster containing `\(x_i\)`.

--

An alternative is **K-medoids clustering**, in which case the centroids are not mean values but actual objects.

???
The number of clusters has to be defined before clustering. We attempt to minimize the sum of distances within all clusters.

---

Let's attempt to cluster 392 vehicles.

| mpg| cylinders| displacement| horsepower| weight| acceleration| year|
|---:|---------:|------------:|----------:|------:|------------:|----:|
|  18|         8|          307|        130|   3504|         12.0|   70|
|  15|         8|          350|        165|   3693|         11.5|   70|
|  18|         8|          318|        150|   3436|         11.0|   70|
|  16|         8|          304|        150|   3433|         12.0|   70|

--

We should scale the variables.
|    mpg| cylinders| displacement| horsepower| weight| acceleration|  year|
|------:|---------:|------------:|----------:|------:|------------:|-----:|
| -0.698|      1.48|         1.08|      0.663|  0.620|        -1.28| -1.62|
| -1.082|      1.48|         1.49|      1.573|  0.842|        -1.47| -1.62|
| -0.698|      1.48|         1.18|      1.183|  0.540|        -1.65| -1.62|
| -0.954|      1.48|         1.05|      1.183|  0.536|        -1.28| -1.62|

---

We have 7 variables, so 7 dimensions. Let's use PCs to represent the data.

<!-- -->

---

What do we mean by "partitioning"?

<!-- -->

---

## Process

K-means clustering involves the following steps:

1. We start with a distance matrix `\(D\)` based on
    - random assignment of objects to `\(K\)` clusters with cluster means, or
    - some (random) cluster means.
2. Calculate the squared Euclidean distance between each object and each cluster mean. Reassign each object to its nearest cluster mean, resulting in a decreased `\(ESS\)`.
3. Update the cluster means.
4. Repeat the previous two steps until objects cannot be reassigned, so each object is closest to its own cluster mean.

---

Can we see any clusters if we summarize the data into PCs?

<!-- -->

--

> How many clusters would you distinguish?

---

## Number of clusters (Gap statistic)

The Gap statistic is a technique used to determine the optimal number of clusters. The measure compares the sum of average within-cluster distances to the same sum obtained from uniformly distributed reference data.

The optimal number of clusters is the smallest `\(k\)` for which the Gap statistic plus its standard error, `\(Gap(k) + se(k)\)`, is higher than or equal to the Gap statistic for the next number of clusters, `\(Gap(k+1)\)`.

---

<!-- -->

> How many clusters should we estimate?

---

We get a better picture the more dimensions we look at.
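---

A minimal sketch of how the Gap statistic could be computed in R with the `cluster` package; the scaled data matrix `X` (the vehicle variables above) and the settings are assumptions for illustration.

```r
library(cluster)

set.seed(1)
gap <- clusGap(X, FUNcluster = kmeans, K.max = 10, B = 50, nstart = 10)

plot(gap)                                     # Gap statistic against the number of clusters
maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"])  # suggested number of clusters
```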
---

Let's estimate 4 clusters as suggested by the Gap statistic. Depending on the initial centers we choose, we obtain different clusters.

<!-- -->

???
Cluster numbering is arbitrary. It should not converge, to illustrate that we may get different results!

---

We can use pairwise scatterplots to assess the results of clustering.

<!-- -->

---

The results are usually plotted using the first PCs instead of all variables.
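One way to produce such a view is sketched below, assuming the scaled matrix `X` from before; the choice of 4 centers follows the Gap statistic example and is otherwise arbitrary.

```r
set.seed(1)
km <- kmeans(X, centers = 4, iter.max = 10, nstart = 10)

pca <- prcomp(X)                          # principal components for a two-dimensional view
plot(pca$x[, 1:2], col = km$cluster,      # objects coloured by cluster
     xlab = "PC1", ylab = "PC2")
points(predict(pca, newdata = km$centers)[, 1:2],  # centroids projected onto the PCs
       pch = 8, cex = 2)
```

The `clusplot()` function in the `cluster` package draws a similar plot directly from the data and the cluster assignments.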
---

## Cluster plot

When we apply K-means clustering we do not obtain a nested structure and therefore cannot express the clustering as a dendrogram.

We can create a plot that shows the locations of objects and cluster centroids on a two-dimensional plot. The dimensions can be found via PCA or multidimensional scaling.

---

Here is the final estimate if we run 10 iterations and 10 different starting values.

<!-- -->

---

class: center middle inverse

# How to use clusters?

---

Save clusters as a new variable.

.smaller[
| mpg| cylinders| displacement| horsepower| weight| acceleration| year| origin|name                      | Cluster|
|---:|---------:|------------:|----------:|------:|------------:|----:|------:|:-------------------------|-------:|
|  18|         8|          307|        130|   3504|         12.0|   70|      1|chevrolet chevelle malibu |       3|
|  15|         8|          350|        165|   3693|         11.5|   70|      1|buick skylark 320         |       3|
|  18|         8|          318|        150|   3436|         11.0|   70|      1|plymouth satellite        |       3|
|  16|         8|          304|        150|   3433|         12.0|   70|      1|amc rebel sst             |       3|
|  17|         8|          302|        140|   3449|         10.5|   70|      1|ford torino               |       3|
|  15|         8|          429|        198|   4341|         10.0|   70|      1|ford galaxie 500          |       3|
]

---

You can describe clusters using descriptive statistics, e.g. the mean values below.

.small[
| Cluster|  mpg| cylinders| displacement| horsepower| weight| acceleration| year|
|-------:|----:|---------:|------------:|----------:|------:|------------:|----:|
|       1| 19.8|      6.05|          219|      103.1|   3222|         16.2| 76.2|
|       2| 25.3|      3.99|          108|       82.0|   2301|         16.4| 73.5|
|       3| 14.5|      8.00|          349|      161.8|   4151|         12.6| 73.6|
|       4| 32.6|      4.04|          112|       74.8|   2327|         16.8| 80.0|
]

---

Or create some visualizations.

<!-- -->

---

class: center middle inverse

# Practical application

---

Use the data set `UN98`. Cluster countries according to social indicators using the hierarchical clustering method.

> How many clusters seem natural?

--

> Can clusters be explained by world regions?

---

Use the data set `HousePrices`. Cluster houses using the K-means method.

> How many clusters should be extracted?

--

> Describe each cluster.

---

class: inverse
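---

A possible starting point for these exercises is sketched below; the package sources assumed for the data sets are `carData` for `UN98` and `AER` for `HousePrices`, and the chosen settings are only illustrative.

```r
library(carData); library(AER)

# Hierarchical clustering of countries on social indicators
data("UN98", package = "carData")
un <- scale(na.omit(UN98[sapply(UN98, is.numeric)]))
hc <- hclust(dist(un), method = "ward.D2")
plot(hc)                                   # inspect the dendrogram for a natural number of clusters

# K-means clustering of houses
data("HousePrices", package = "AER")
hp <- scale(HousePrices[sapply(HousePrices, is.numeric)])
km <- kmeans(hp, centers = 3, nstart = 10)
aggregate(HousePrices[sapply(HousePrices, is.numeric)],
          by = list(Cluster = km$cluster), FUN = mean)   # describe each cluster
```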