Section 11 Data transformations

Sometimes the response variable is not best expressed as a linear function of the predictors. This often shows up as residuals that are not normally distributed or whose variance is not constant. In such cases estimating the model on transformed data might improve model fit because (1) many real-world variables become approximately normal after taking the logarithm of their values and (2) some associations fit a linear model better when a predictor is raised to a power. These are the two main ways of transforming data to improve the fit of a linear model: using natural logarithms or polynomials.

11.1 Log-transformations

Particularly when the distribution of a variable ( \(x\) ) has a strong right skew, a linear model can often be improved by using naturally logged ( \(\log_e x\) ) values of that variable. The transformation essentially brings the distribution of the variable closer to normal. Depending on whether the response, the predictor(s), or both are log-transformed, the resulting model is one of the three transformed models explained below (the untransformed linear model is shown first for comparison).

It’s important to note that with such transformations the coefficients represent (semi-)elasticities, i.e. not absolute changes but per cent changes in the response (log-linear), per cent changes in the predictor (linear-log), or per cent changes in both the predictor and the response (log-log). To illustrate the interpretation of the coefficients, each model below is accompanied by an example where fuel consumption (mpg) is predicted from horsepower (hp).

11.1.1 Linear

When our data are approximately normally distributed, we should start modelling with untransformed variables. When we use untransformed values of all variables in a regression model, we have a linear model:

\[y = \alpha + \beta_{1} x + \varepsilon.\]

11.1.1.1 Interpretation

An increase of 1 unit in predictor coincides with a \(\beta_1\) unit increase in response.

\(mpg_{i} = 30.099 - 0.068 * hp_{i} + \varepsilon_{i}\)

When hp increases by 1 unit, mpg changes by -0.068228, i.e. decreases by about 0.068 units.

In Jamovi: Regression > Linear regression
In R: lm(y ~ x, data)
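
The example coefficients above appear to match the mtcars data set that ships with R; assuming that, a minimal sketch of the model in R is:

# Linear model: mpg as a linear function of hp (mtcars is built into R)
fit_linear <- lm(mpg ~ hp, data = mtcars)
coef(fit_linear)  # intercept about 30.099, slope about -0.068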

11.1.2 Log-linear

When we log-transform the response variable, we have a log-linear model:

\[ln(y) = \alpha + \beta_{1} x + \varepsilon.\]

Keep in mind that measures of model fit (e.g. \(R^2\), AIC) are only valid for model comparison when the models have the same response variable. Thus, such measures should not be used to compare a log-linear model to a model with an untransformed response.

11.1.2.1 Interpretation

An increase of 1 unit in the predictor coincides with a \((e^{\beta_1} - 1) \times 100\) per cent increase in the response. This can be approximated by \(\beta_1 \times 100\), which is close to the exact value for \(-0.1 < \beta_1 < 0.1\).

\(ln(mpg_{i}) = 3.46 - 0.003 * hp_{i} + \varepsilon_{i}\)

When hp increases by 1 unit, mpg changes by approximately -0.003429 \(* 100 =\) -0.343 per cent.

In Jamovi: Regression > Linear regression after transforming response variable
In R: lm(log(y) ~ x, data)
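
Assuming the same mtcars data, a sketch of the log-linear model together with the approximate and exact percentage interpretations of the slope:

# Log-linear model: log of mpg as a linear function of hp
fit_loglinear <- lm(log(mpg) ~ hp, data = mtcars)
b1 <- coef(fit_loglinear)["hp"]   # about -0.003429
b1 * 100                          # approximate per cent change in mpg per unit of hp
(exp(b1) - 1) * 100               # exact per cent change, here nearly identical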

11.1.3 Linear-log

When one or more predictors are used in logged form, then the model is referred to as a linear-log model:

\[y = \alpha + \beta_{1} ln(x) + \varepsilon\]

11.1.3.1 Interpretation

An increase of 1 per cent in predictor coincides with a \(\beta_1 /100\) unit increase in response.

\(mpg_{i} = 72.64 - 10.764 * ln(hp_{i}) + \varepsilon_{i}\)

When hp increases by 1 per cent, mpg changes by -10.764 \(/ 100 =\) -0.108 units.

In Jamovi: Regression > Linear regression after transforming predictor variable
In R: lm(y ~ log(x), data)
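
Again assuming mtcars, a sketch of the linear-log model:

# Linear-log model: mpg as a linear function of log(hp)
fit_linearlog <- lm(mpg ~ log(hp), data = mtcars)
coef(fit_linearlog)["log(hp)"] / 100  # unit change in mpg per 1 per cent increase in hp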

11.1.4 Log-log

When neither of the previous transformations has improved the fit of a linear model, we can also transform both the response and the predictor(s) as follows:

\[ln(y) = \alpha + \beta_{1} ln(x) + \varepsilon\]

11.1.4.1 Interpretation

An increase of 1 per cent in predictor coincides with a \(\beta_1\) per cent increase in response.

\(ln(mpg_{i}) = 5.545 - 0.53 * ln(hp_{i}) + \varepsilon_{i}\)

When hp increases by 1 per cent, mpg changes by -0.53 per cent.

In Jamovi: Regression > Linear regression after transforming both the response and the predictor variable
In R: lm(log(y) ~ log(x), data)
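
And a sketch of the log-log model, once more assuming mtcars:

# Log-log model: log(mpg) as a linear function of log(hp)
fit_loglog <- lm(log(mpg) ~ log(hp), data = mtcars)
coef(fit_loglog)["log(hp)"]  # elasticity: per cent change in mpg per 1 per cent increase in hp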

11.2 Polynomials

See Crawley (2013) sections 7.1.4 on polynomial functions and 10.3 on polynomial regression.

Applying polynomials usually improves the fit of a model when the relationship is described by a curve rather than a straight line. We choose the degree of the polynomial according to how many bends the relationship has. For example, if there is a single bend, the relationship is quadratic and we should use a 2nd degree polynomial as follows:

\[y = \alpha + \beta_{1} x + \beta_{2} x^2 + \varepsilon.\]

11.2.0.1 Interpretation

Note that when polynomials are included, the predictor enters the model more than once (e.g. as \(x\) and \(x^2\)), so the effect of a unit change in the predictor depends on its value and the individual coefficients no longer have a simple interpretation.

In Jamovi: Regression > Linear regression after transforming predictor variable
In R: lm(y ~ x + I(x^2), data)
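
As a sketch, assuming mtcars again, the quadratic model can be fitted by wrapping the squared term in I() so that ^ is treated arithmetically inside the formula:

# 2nd degree polynomial of hp
fit_poly <- lm(mpg ~ hp + I(hp^2), data = mtcars)
summary(fit_poly)
# An equivalent fit using orthogonal polynomial terms:
fit_poly_orth <- lm(mpg ~ poly(hp, 2), data = mtcars)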

11.2.1 Ramsey RESET test

The test can be used to determine whether adding polynomials to a model improves its fit. We simply use an F-test to determine whether there is a difference between a reduced model \(y = \alpha + \beta x + \varepsilon\) and a full model \(y = \alpha + \beta_1 x + \beta_2 x^2 + \varepsilon\). Empirically, we are testing whether the residual sums of squares (\(RSS\)) of the two models differ:

\(H_0: RSS_1 = RSS_2\)
\(H_1: RSS_1 \neq RSS_2.\)
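
A minimal sketch of this comparison in R, assuming mtcars: anova() performs the F-test between the reduced and the full model, and resettest() from the lmtest package (if installed) offers a ready-made version of the test.

# Reduced model (linear) and full model (with the squared term)
fit_reduced <- lm(mpg ~ hp, data = mtcars)
fit_full    <- lm(mpg ~ hp + I(hp^2), data = mtcars)
anova(fit_reduced, fit_full)  # F-test of H0: RSS_1 = RSS_2
# Alternative, using powers of the regressor as in the full model above:
# lmtest::resettest(fit_reduced, power = 2, type = "regressor")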

References

Crawley, Michael J. 2013. The R Book. Second edition. Chichester, West Sussex, UK: Wiley. https://www.cs.upc.edu/~robert/teaching/estadistica/TheRBook.pdf.