Section 10 Simple linear regression
Regression analysis is a statistical procedure that allows us to model relationships between variables: a response variable is expressed as a function of one or more explanatory variables. It can be considered the main statistical technique of econometrics, i.e. the application of statistics to economics. There is a large variety of regression models, depending on the estimation method, model specification and assumed distributions. This section introduces the most basic of these, the simple linear regression model, i.e. ordinary least squares (OLS) with one independent variable.
10.1 Ordinary least squares
See Navarro and Foxcroft (2018) section 12.3 for an introduction.
10.1.1 Model specification
Regression models represent how one or more predictor variables affect a response variable. The simplest regression model has only one predictor, which is why it is called “simple linear regression” (SLR). An SLR model is mathematically expressed as follows:
\[y = \beta_0 + \beta_1 x + \varepsilon,\]
where
- \(y\) is the dependent or explained variable, also called the response, regressand or output,
- \(\beta_0\) is the intercept or constant,
- \(\beta_1\) is a coefficient of \(x\),
- \(x\) is the independent or explanatory variable, also called the predictor or regressor, and
- \(\varepsilon\) is the model error or residual.
The values of \(x\) and \(y\) come from our data, whereas the \(\beta\) coefficients need to be estimated from these data. The \(\varepsilon\) is then obtained using both the estimated \(\beta\) coefficients and the data.
10.1.2 Calculation of an SLR model
The underlying idea behind OLS regression is the minimization of squared residuals, thus the name “least squares”. We simply find the coefficients \(\beta_0\) and \(\beta_1\) that give us the smallest possible sum of squared differences between the observed and predicted values of the response variable \(y\).
Model parameters are calculated directly (unlike maximum likelihood estimation, which is typically iterative). To estimate the model \(y = \beta_0 + \beta_1 x + \varepsilon\) for variables \(x\) and \(y\) we estimate the parameters \(\hat{\beta_1}\) and \(\hat{\beta_0}\) as follows:
\(\hat{\beta_1} = \frac{\sum{x_{i} y_{i}} - \frac{1}{n} \sum{x_{i}}\sum{y_{i}}}{\sum{x_{i}^{2}} - \frac{1}{n} (\sum{x_{i}})^{2}} = \frac{Cov[x,y]} {Var[x]},\)
\(\hat{\beta_0} = \overline{y} - \hat{\beta_1} \overline{x}.\)
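As a minimal sketch (using a small made-up data set, not one from this course), the formulas above can be computed directly in R and compared against lm():

```r
# Sketch: OLS coefficients by hand for made-up data, then compared with lm().
set.seed(1)
x <- 1:20
y <- 3 + 0.5 * x + rnorm(20)    # hypothetical data

b1 <- cov(x, y) / var(x)        # beta_1 = Cov[x, y] / Var[x]
b0 <- mean(y) - b1 * mean(x)    # beta_0 = mean(y) - beta_1 * mean(x)

c(b0, b1)
coef(lm(y ~ x))                 # should give the same two numbers
```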
For a simple model without an intercept, \(y = \beta_1 x + \varepsilon\), we can use matrix algebra on the values of \(x\) and \(y\) to find \(\hat{\beta_1}\):
\[\hat{\beta_1} = (X^{T} X)^{-1} X^{T} Y,\]
where \(X\) is the column matrix of predictor values and \(Y\) the column matrix of response values.
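A corresponding sketch in R, reusing the made-up x and y from above (the same \((X^{T} X)^{-1} X^{T} Y\) formula also covers models with an intercept if a column of ones is added to \(X\)):

```r
# Sketch: the matrix formula for a model without an intercept.
X <- matrix(x, ncol = 1)            # column matrix of the predictor
Y <- matrix(y, ncol = 1)            # column matrix of the response
solve(t(X) %*% X) %*% t(X) %*% Y    # (X'X)^(-1) X'Y
coef(lm(y ~ 0 + x))                 # lm() without an intercept, for comparison
```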
See Navarro and Foxcroft (2018) section 12.4 for illustrations of the idea.
In Jamovi: Regression > Linear regression.
In R: lm(y ~ x, data) or summary(lm(y ~ x, data))
10.2 Elements of a regression model
10.2.1 Intercept
In the equation this is the \(\beta_0\), often also referred to as the constant. The intercept is the value of \(y\) where the regression line crosses the Y-axis, i.e. the value of \(y\) when \(x\) is zero ( \(y|x=0\) ). The intercept does not need to have a meaningful interpretation, although it sometimes does. In particular, the intercept is not meaningful when the range of our observed values of \(x\) does not cover 0.
For example, the model below has an intercept.
\[finalwt_{i} = 195.135 + 1.176 * initialwt_{i} + \varepsilon_{i}\]
When initialwt is 0 units, then the value of finalwt is 195.135 units.
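A hedged sketch of the same idea in R: the data below is made up (the variable names follow the example, but the numbers will not reproduce the coefficients shown above).

```r
# Sketch: reading the intercept off a fitted model (made-up data).
set.seed(2)
weights <- data.frame(initialwt = runif(30, min = 40, max = 60))
weights$finalwt <- 195 + 1.2 * weights$initialwt + rnorm(30, sd = 5)

m_wt <- lm(finalwt ~ initialwt, data = weights)
coef(m_wt)["(Intercept)"]                           # estimated intercept (beta_0)
predict(m_wt, newdata = data.frame(initialwt = 0))  # fitted finalwt when initialwt = 0
```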
10.2.2 Coefficient(s)
In the equation, \(\beta_1\) represents the coefficient of \(x\). It indicates by how many units \(y\) increases when \(x\) increases by one unit.
We usually estimate models on a sample and wish to make generalizations or inferences about the population. It is thus necessary to test whether or not the coefficients are different from 0. This is done by calculating a t-statistic from the coefficient and its standard error. Coefficients are only relevant if their difference from 0 is statistically significant. If a coefficient is not statistically significant, then we conclude that there is no evidence that the predictor has an effect on the response.
The interpretation of \(\beta\) coefficients is different for interval or ratio (numeric) predictors and ordinal or nominal (categorical) predictors.
In the following example we estimate the effect of a ratio (numeric) predictor initialwt on the response finalwt.
\[finalwt_{i} = 195.135 + 1.176 * initialwt_{i} + \varepsilon_{i}\]
When initialwt increases by 1 unit, then finalwt increases by 1.176 units.
Suppose that a categorical variable called diet has the following levels: Low, Medium, High. Then in regression analysis the first level (Low) is considered the base level. All other levels are compared against this base level.
\[finalwt_{i} = 1019.406 + 1.898 * dietMedium_{i} + 12.01 * dietHigh_{i} + \varepsilon_{i}\]
When diet is Low, then finalwt is 1019.406. When diet is Medium, then finalwt is higher by 1.898 units compared to when diet is Low, i.e. 1021.304.
Suppose now that the coefficient for dietHigh is not statistically significant. When diet is High, finalwt is no different compared to when diet is Low. If it was statistically significant, we could say that when diet is High, then finalwt is higher by 12.01 units compared to when diet is Low.
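A sketch with made-up data illustrating how the base level works in R (the numbers are arbitrary, chosen only to resemble the example above):

```r
# Sketch: a categorical predictor; 'Low' is the first factor level and thus the base level.
set.seed(3)
diet <- factor(rep(c("Low", "Medium", "High"), each = 20),
               levels = c("Low", "Medium", "High"))
finalwt <- 1000 + 2 * (diet == "Medium") + 12 * (diet == "High") + rnorm(60, sd = 3)

m_diet <- lm(finalwt ~ diet)
summary(m_diet)$coefficients   # intercept = Low; dietMedium and dietHigh are differences from Low
```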
10.2.3 Fitted values
These are the estimated values of the response ( \(\hat{y}\) ) that are calculated by applying the model to the \(x\) value of every observation in the data. In other words, fitted values are predictions of \(y\) based on the model. After we have estimated \(\hat{\beta_0}\) and \(\hat{\beta_1}\) we can calculate the fitted value for each value of \(x\) as follows: \(\hat{y} = \hat{\beta_0} + \hat{\beta_1} x\).
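Continuing the hypothetical weights example, fitted values can be computed by hand or extracted with fitted():

```r
# Sketch: fitted values by hand vs. fitted(), reusing the made-up model m_wt.
b <- coef(m_wt)
head(b["(Intercept)"] + b["initialwt"] * weights$initialwt)   # beta_0 + beta_1 * x
head(fitted(m_wt))                                            # the same values
```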
10.2.4 Residuals
Residuals are model errors, represented by the \(\varepsilon\) in the equation \(y = \beta_0 + \beta_1 x + \varepsilon\). Residuals are the differences between the observed and fitted (predicted) values of the response variable. So after we have estimated \(\hat{\beta_0}\) and \(\hat{\beta_1}\) we can find the residual for each observation using its values of \(x\) and \(y\) as follows: \(\varepsilon = y - \hat{\beta_0} - \hat{\beta_1} x\).
We use residuals to evaluate how well the model fits the data. Large residuals mean that fitted values are far from observed values, which indicates that the model does not fit the data well.
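The residuals of the hypothetical m_wt can be obtained in the same spirit:

```r
# Sketch: residuals are observed minus fitted values.
head(weights$finalwt - fitted(m_wt))   # y - (beta_0 + beta_1 * x)
head(residuals(m_wt))                  # the same values via residuals()
```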
10.2.5 The \(R^2\)
This is a way to measure the goodness of fit of a model, i.e. how well the model fits the data. It is sometimes called the coefficient of determination. \(R^2\) expresses the proportion of variation in the response variable that is explained by the model:
\[R^{2} = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS},\]
where the elements are defined as follows:
- explained sum of squares: \(ESS = \sum_{i = 1}^{n} (\hat{y}_{i} - \bar{y})^2\),
- residual sum of squares: \(RSS = \sum_{i = 1}^{n} (y_{i} - \hat{y}_{i})^2\),
- total sum of squares: \(TSS = \sum_{i = 1}^{n} (y_{i} - \bar{y})^2\).
In the above equations the following notation is used:
- \(\hat{y}_{i}\) are fitted values,
- \(\overline{y}\) is the mean value of \(y\), and
- \(y_{i}\) are the observed values of \(y\).
Simply put, the \(R^2\) measures how much better the model is at explaining the variance of \(y\) than the mean of \(y\) alone.
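A sketch of the sums of squares and \(R^2\) for the hypothetical m_wt, compared with the value reported by summary():

```r
# Sketch: R^2 from the sums of squares.
y_obs <- weights$finalwt
y_fit <- fitted(m_wt)
TSS <- sum((y_obs - mean(y_obs))^2)   # total sum of squares
ESS <- sum((y_fit - mean(y_obs))^2)   # explained sum of squares
RSS <- sum((y_obs - y_fit)^2)         # residual sum of squares

ESS / TSS                             # R^2
1 - RSS / TSS                         # the same
summary(m_wt)$r.squared               # R^2 reported by lm()
```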
10.2.6 The adjusted \(R^2\)
Each additional predictor is likely to improve the model fit. So the more predictors we add, the more of the variance of \(y\) the model explains. \(R^2\) can thus be inflated just by adding more predictors to the model, even if the predictors are not very meaningful. To penalize a model for the number of estimated parameters ( \(k\), i.e. the predictors plus the intercept ) while considering the number of observations ( \(n\) ), the adjusted \(R^{2}\) can also be used, particularly for model comparison:
\[\overline{R^{2}} = 1 - \frac{RSS/(n-k)}{TSS/(n-1)}\]
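Continuing the sketch above, the adjusted \(R^2\) uses the same pieces (here \(k = 2\): the intercept plus one predictor):

```r
# Sketch: adjusted R^2 for the hypothetical m_wt.
n <- nrow(weights)                        # number of observations
k <- 2                                    # estimated parameters: intercept + 1 predictor
1 - (RSS / (n - k)) / (TSS / (n - 1))     # adjusted R^2
summary(m_wt)$adj.r.squared               # value reported by lm()
```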
10.3 Assumptions and diagnostics
See Navarro and Foxcroft (2018) section 12.9 for an explanation of assumptions and section 12.10 on model checking.
After the estimation of a regression model it should be diagnosed to make sure that it meets at least the following assumptions:
- residuals are normally distributed, and
- residuals have equal variance, i.e. the variance of residuals does not depend on the values of \(x\).
In Jamovi: Regression > Linear regression > Assumption Checks.
In R: plot(lm(formula, data))
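A sketch of the standard diagnostic plots in R, using the hypothetical m_wt from earlier:

```r
# Sketch: the four default diagnostic plots (residuals vs fitted, Q-Q plot, etc.).
par(mfrow = c(2, 2))   # show the four plots in a 2 x 2 grid
plot(m_wt)
par(mfrow = c(1, 1))   # restore the default layout
```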
10.3.1 Gauss-Markov assumptions
In addition to the practical considerations outlined above, a theoretical way of expressing the assumptions of OLS is via the Gauss–Markov theorem. It relies on the following assumptions:
- linear in parameters: \(y = \beta_0 + \beta_1 x + \varepsilon\)
- expected error is zero: \(E(\varepsilon) = 0\)
- homoscedasticity: \(var(\varepsilon_{i}) = E(\varepsilon_{i}^{2}) = \sigma^{2}\) for all \(i\)
- no autocorrelation: \(cov(\varepsilon_{i}, \varepsilon_{j}) = 0,\ i \neq j\)
- independence of predictor(s) and residuals: \(cov(x,\varepsilon) = 0\)
If these assumptions hold, then OLS is the best linear unbiased estimator (BLUE).
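Some of these quantities have sample analogues that hold by construction for OLS residuals when the model includes an intercept; a numerical sketch with the hypothetical m_wt (an illustration, not a test of the assumptions):

```r
# Sketch: OLS residuals (with an intercept) have zero mean and zero sample
# covariance with the predictor, up to numerical precision.
mean(residuals(m_wt))                      # approximately 0
cov(weights$initialwt, residuals(m_wt))    # approximately 0
```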