Section 4 Descriptive statistics

Any data analysis should begin with an exploration of the structure of and values in dataset. Description of values should also be part of the reporting of analysis. This can be done using some simple methods.

When we use the methods of descriptive statistics, we need to be aware of that we merely describe the dataset at hand. If observations in the data set constitute a sample, we cannot draw any wider conclusions on population using these methods. Below we are thus defining the statistics for populations and not samples.

Description of nominal or ordinal variables is simple and limited to just counting unique values and finding the mode and/or median value. But suppose we have variables that contain sequences of numbers. How can we describe these?

4.1 Central tendency

A concise way to represent numeric values is expressing the center of values. This can also be understood the typical value or expected value (expectation of \(x\), i.e. \(E(x)\)). Note that estimating a central tendency is only meaningful if there is a natural center. i.e. the distribution of values is unimodal.

See Navarro and Foxcroft (2018) section 4.1.

4.1.1 Mode

Mode is simply the most frequent value in a sequence. This statistic can be useful for nominal and ordinal variables but it’s rarely used for numeric variables, especially on a continuous scale.

In Jamovi: Exploration > Descriptives > Statistics > Central Tendency | Mode

4.1.2 Median

When sequence of values is ordered, the middle value is median. In case there are even number of values (\(N\)) in a sequence, median is the average of the two values in the middle. Thus, there are equal number of values above and below the median value. When \(N\) is an even number, half of the values are above and half below the median value.

In Jamovi: Exploration > Descriptives > Statistics > Central Tendency | Median
In R: median(x)

4.1.3 Mean

While there are several forms of mean, arithmetic mean is most useful in statistics and also considered here. Mean is in other words the average value: it’s the sum of values (\(X\)) divided by number of values:

\[\bar{X} = \frac{1}{N} \sum{^N_{i=1}}{X_i}\]

A feature of mean value is that if you only know the \(\bar{X}\) and \(N\), you can still calculate \(\sum{^N_{i=1}}{X_i}\).

In Jamovi: Exploration > Descriptives > Statistics > Central Tendency | Mean
In R: mean(x)

4.2 Variability

Central tendency describes values only partially as it tells nothing about the deviation of values from that mean. Estimating also the variability of values provides a more manifold understanding of values.

See Navarro and Foxcroft (2018) section 4.2.

4.2.1 Range

Most simple way to express variability is to indicate the range of all values, i.e. the difference between minimum and maximum values.

In Jamovi: Exploration > Descriptives > Statistics > Dispersion | Range, Minimum, Maximum
In R: range(x)

4.2.2 Quantiles and interquartile range

Quantiles are in essence calculated just like median, because median is the lowest quantile, the 2-quantile. In addition to dividing sequence of ordered values into two halves separated by median, such sequence can be diveded into any number of equal groups as long as the number of groups (\(k\)) is smaller than or equal to the number of values, i.e. \(k \le N\). Some common quantiles are quartiles (\(k = 4\)), deciles (\(k = 10\)) and centiles/percentiles (\(k = 100\)).

In Jamovi: Exploration > Descriptives > Statistics > Percentile Values
In R: quantile(x)

Arguably quartile is most frequently used among quantiles because the 2nd and 4th quartiles contain half of all values. This interquartile range (IQR) thus contains the middle half of all values.

In Jamovi: Exploration > Descriptives > Statistics > Dispersion | IQR
In R: IQR(x)

4.2.3 Variance

Although difficult to interpret, variance is used in many statistical calculations due to its mathematical properties. Variance is mean squared difference from the mean value:

\[Var(X) = \frac{1}{N} \sum{^N_{i=1}{(X_i-\bar{X})^2}}\],

where \(X_i\) is the value of observation \(i\) and \(\bar{X}\) the mean value.

In Jamovi: Exploration > Descriptives > Statistics > Dispersion | Variance
In R: var(x)

4.2.4 Standard deviation

Interpretability issues of variance arise from the fact that the calculation includes squaring and thus the value of the statistic is not comparable to the initial scale. A solution would be to find a square root of variance. This is what the standard deviation is: the square root of variance. To simplify, standard deviation is the average difference from the mean.

\[\sigma = \sqrt{\frac{1}{N} \sum{^N_{i=1}{(X_i-\bar{X})^2}}}\]

In Jamovi: Exploration > Descriptives > Statistics > Dispersion | Std. deviation
In R: sd(x)

4.3 Plots

Previously explained measures allow us to only summarize values into fewer numbers. Plotting allows to present these values more intuitively or even depict all of the values at once.

4.4 Scatterplot

A simple way to examine a relationship between two continuous variables is a scatterplot. It simply depicts all data points on a two-dimensional space where location of each points is determined by it’s value of each variable. If the relationship between the two variables is causal, then it is conventional to use have predictor variable on the horizontal (x-)axis and response variable on vertical (y-)axis. Additional variables can be shown by alternating the size, shape or color of data points.

In Jamovi: Regression > Correlation Matrix | Plot
In R: plot(x)

4.4.1 Boxplot

Boxplots illustrate the quartiles of values (including interquartile range) and extreme values (outliers). Outliers are usually values that are below the 2nd quartile or above the 4th quartile by 1.5 times the interquartile range. On boxplots, half of all the values lie within the box and another half on the lines, outliers excluded.

See Navarro and Foxcroft (2018) section 5.2.

In Jamovi: Exploration > Descriptives > Plots | Box Plots
In R: boxplot(x)

4.4.2 Histogram and density plot

Histograms are essentially barplots where all values are divided between ranges of equal with, i.e. bins.

Instead of bins, distribution of values can also be expressed continuously by smoothing the bins. This results in a density plot.

See Navarro and Foxcroft (2018) section 5.1.

In Jamovi: Exploration > Descriptives > Plots | Histograms
In R: histogram(x) and plot(density(x))

References

Navarro, Danielle J, and David R Foxcroft. 2018. Learning Statistics with Jamovi: A Tutorial for Psychology Students and Other Beginners. Danielle J. Navarro; David R. Foxcroft. https://doi.org/10.24384/HGC3-7P15.