Section 13 Logistic regression

An important assumption of linear regression models is the normality and constant variance of residuals. For some types of response variable these assumptions about the model errors cannot be met. An example is a binary variable, i.e. a variable that can take only two values (\(0\) and \(1\)). A common way to model relationships with a binary response variable is logistic regression.

13.1 Generalized linear models

See the introduction of section 13 in Crawley (2013).

The reason why the assumptions regarding the residuals are often not met by an ordinary linear model is the structure of the errors. Even transforming variables or altering the functional form often does not solve the problem. Thus, it is necessary to define an error structure prior to estimation. In the case of binary data (proportions) we can use the binomial distribution to obtain a suitable error structure (binomial errors). The binomial distribution represents the number (or proportion) of successes in a given number of trials. Such an error distribution cannot be accommodated by least squares estimation. Instead, the estimation needs to be carried out more generally, hence the name generalized linear models (GLM).

13.2 Maximum likelihood estimation

See Crawley (2013) section 7.3.3 for an idea behind the maximum likelihood estimation (MLE).

The parameters of a GLM cannot simply be calculated as in the case of the least squares method. Instead, we need to use MLE. The idea is somewhat involved but can be summarized as follows: models with various parameter values are fitted iteratively until the parameter values are found under which the observed data are most likely. While the value of this likelihood cannot be directly interpreted, it is used to calculate measures of model fit (see below).
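To illustrate the idea, the sketch below uses simulated toy data (all variable names are assumptions, not part of any real dataset), searches numerically for the intercept and slope that maximize the binomial log-likelihood, and compares the result to `glm()`.

```r
# Minimal sketch of MLE for logistic regression on simulated toy data
set.seed(1)
x <- rnorm(100)
y <- rbinom(100, size = 1, prob = plogis(-0.5 + 1.2 * x))

# Negative log-likelihood of a model with intercept b[1] and slope b[2]
negloglik <- function(b) {
  p <- plogis(b[1] + b[2] * x)          # predicted probabilities
  -sum(dbinom(y, size = 1, prob = p, log = TRUE))
}

# Numerical search for the parameter values under which the data are most likely
optim(c(0, 0), negloglik)$par

# glm() finds the same maximum likelihood estimates
# (via iteratively reweighted least squares)
coef(glm(y ~ x, family = binomial(link = "logit")))
```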

In Jamovi: Regression > 2 Outcomes.
In R: glm(y ~ x, data, family = binomial(link = 'logit'))
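As a concrete example, the minimal sketch below fits and summarizes a logistic regression in R; the data frame and variable names are assumptions carried over from the simulated data above.

```r
# Fit a logistic regression with a binomial error structure and logit link
# (simulated toy data; variable names are assumptions)
set.seed(1)
dat <- data.frame(x = rnorm(100))
dat$y <- rbinom(100, size = 1, prob = plogis(-0.5 + 1.2 * dat$x))

m <- glm(y ~ x, data = dat, family = binomial(link = "logit"))
summary(m)   # coefficients are on the log-odds scale
```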

13.3 Interpretation of coefficients

Due to the transformation of the response variable, the interpretation of the coefficients of a logistic regression with the logit link is somewhat obscure. Recall that the form of the response variable is \(\ln(\frac{p}{1 - p})\), where \(p\) is the probability that \(y = 1\) and hence \(1-p\) the probability that \(y = 0\).

Therefore, the coefficients of a logistic regression model represent the increase in the log odds of \(y=1\) when the predictor variable increases by one unit. We can also exponentiate both sides of the model equation \(\ln(\frac{p}{1-p}) = \alpha + \beta x\). Then \(\frac{p}{1-p} = e^{\alpha + \beta x}\), which means that a one-unit increase in the predictor multiplies the odds of \(y = 1\) by \(e^\beta\) (the odds ratio). As this demonstrates, the raw coefficients have no intuitive interpretation, but we can still interpret their direction and statistical significance.

In Jamovi: Regression > 2 Outcomes > Model Coefficients | Odds ratio.
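In R, one way to obtain odds ratios is to exponentiate the coefficients of the fitted model; a minimal sketch, reusing the hypothetical simulated data and model from the earlier examples, is:

```r
# Odds ratios as exponentiated coefficients
# (same simulated toy data and model as in the earlier sketches)
set.seed(1); x <- rnorm(100); y <- rbinom(100, 1, plogis(-0.5 + 1.2 * x))
m <- glm(y ~ x, family = binomial(link = "logit"))

coef(m)                  # log odds: additive effect of one unit of x
exp(coef(m))             # odds ratios: multiplicative effect of one unit of x
exp(confint.default(m))  # Wald confidence intervals on the odds-ratio scale
```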

13.4 Model fit

13.4.1 Deviance

See Crawley (2013) section 13.6.

When fitted values are determined as a result of MLE, using residuals is not straightforward. The MLE equivalent of the RSS in OLS is the deviance, which can be used to assess the goodness of fit of a GLM. Deviance expresses twice the difference in log-likelihood between the current model and the saturated model. For the binomial error structure the deviance is

\[D = 2 \sum \left[ y \ln\left(\frac{y}{\mu}\right) + (n - y) \ln\left(\frac{n - y}{n - \mu}\right) \right],\]

where \(y\) are the observed values of the response, \(\mu\) the fitted values and \(n\) the binomial sample size (number of trials).

In Jamovi: Regression > 2 Outcomes > Model Fit | Deviance.
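In R, the residual and null deviance are stored in the fitted glm object. The sketch below (same hypothetical simulated data as before) also shows that for binary 0/1 data the residual deviance equals \(-2\) times the maximized log-likelihood, because the saturated model has log-likelihood zero.

```r
# Deviance of a fitted logistic regression
# (same simulated toy data and model as in the earlier sketches)
set.seed(1); x <- rnorm(100); y <- rbinom(100, 1, plogis(-0.5 + 1.2 * x))
m <- glm(y ~ x, family = binomial(link = "logit"))

deviance(m)                   # residual deviance of the fitted model
m$null.deviance               # deviance of the intercept-only model
-2 * as.numeric(logLik(m))    # equals the residual deviance for binary 0/1 data
```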

13.4.2 Pseudo \(R^2\)

Because MLE does not involve the minimization of squared residuals, the residual-based \(R^2\) is not a suitable measure of model fit. Still, there are equivalent pseudo-\(R^2\) measures for MLE that are calculated using likelihoods. While there are several such measures, the most common is McFadden's pseudo-\(R^2\):

\[R^2 = 1 - \frac{\ln(\theta_{full})}{\ln(\theta_{reduced})},\]

where \(\theta_{reduced}\) is the likelihood of the model with only an intercept and \(\theta_{full}\) the likelihood of the model with predictors. As such, the measure illustrates how much the addition of predictors improves the model. The values lie between 0 and 1, and usually values \(>0.4\) are considered to indicate a good fit.

In Jamovi: Regression > 2 Outcomes > Model Fit | McFadden's R^2.
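McFadden's pseudo-\(R^2\) can also be computed by hand in R from the log-likelihoods of the fitted model and the intercept-only model; a minimal sketch with the same hypothetical data as above:

```r
# McFadden's pseudo-R^2 from the log-likelihoods of the full and null models
# (same simulated toy data as in the earlier sketches)
set.seed(1); x <- rnorm(100); y <- rbinom(100, 1, plogis(-0.5 + 1.2 * x))
m  <- glm(y ~ x, family = binomial(link = "logit"))   # model with predictor
m0 <- glm(y ~ 1, family = binomial(link = "logit"))   # intercept-only model

1 - as.numeric(logLik(m)) / as.numeric(logLik(m0))
```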

13.4.3 Likelihood ratio test

This test can be used to determine whether a more complex model fits the data better than a simpler nested one. Comparing the fitted model to the intercept-only model in this way also gives an assessment of overall model fit. The test statistic is

\[LR = -2\ln\left(\frac{\theta_{reduced}}{\theta_{full}}\right),\]

where \(\theta_{reduced}\) is the likelihood of the simpler model and \(\theta_{full}\) the likelihood of the more complex model. The hypotheses are as follows:

\(H_0: \theta_{reduced} = \theta_{full}\)
\(H_1: \theta_{reduced} \neq \theta_{full}.\)

If no difference can be shown, the simpler model should be preferred.

In Jamovi: Regression > 2 Outcomes > Model Coefficients | Likelihood ratio tests.
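In R, nested GLMs can be compared with a likelihood ratio (chi-squared) test using `anova()`; the sketch below compares the hypothetical model with a predictor to the intercept-only model from the earlier examples.

```r
# Likelihood ratio test comparing the intercept-only model to the model with x
# (same simulated toy data as in the earlier sketches)
set.seed(1); x <- rnorm(100); y <- rbinom(100, 1, plogis(-0.5 + 1.2 * x))
m  <- glm(y ~ x, family = binomial(link = "logit"))
m0 <- glm(y ~ 1, family = binomial(link = "logit"))

anova(m0, m, test = "Chisq")   # LR statistic = difference in deviances

# The same statistic computed directly from the log-likelihoods
-2 * (as.numeric(logLik(m0)) - as.numeric(logLik(m)))
```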

13.5 Classification

The fitted values of the response variable lie between 0 and 1 and can be interpreted as the probability that \(y = 1\) for each observation. If we wish to have predictions on the original binary scale, we need to choose a cutoff point (commonly 0.5) that determines whether a probability should be classified as 0 or 1.

The classified predictions can be tabulated against the true classes, which results in a classification table or confusion matrix. Such a table can then be used to calculate measures such as accuracy, specificity, sensitivity, AUC and others that illustrate how well the model predicts.

In Jamovi: Regression > 2 Outcomes > Prediction.
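A minimal sketch of classification in R with the hypothetical model from the earlier examples, using a 0.5 cutoff (the cutoff value is an assumption and can be chosen differently):

```r
# Classification table (confusion matrix) using a 0.5 cutoff
# (same simulated toy data and model as in the earlier sketches)
set.seed(1); x <- rnorm(100); y <- rbinom(100, 1, plogis(-0.5 + 1.2 * x))
m <- glm(y ~ x, family = binomial(link = "logit"))

p_hat  <- predict(m, type = "response")   # fitted probabilities that y = 1
y_pred <- ifelse(p_hat >= 0.5, 1, 0)      # classify with a 0.5 cutoff

conf_mat <- table(predicted = y_pred, observed = y)
conf_mat
sum(diag(conf_mat)) / sum(conf_mat)       # accuracy
```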

References

Crawley, Michael J. 2013. The R Book. Second edition. Chichester, West Sussex, UK: Wiley. https://www.cs.upc.edu/~robert/teaching/estadistica/TheRBook.pdf.