Section author: Danielle J. Navarro and David R. Foxcroft
Quantifying the fit of the regression model
So we now know how to estimate the coefficients of a linear regression model. The problem is, we don’t yet know if this regression model is any good. For example, the regression model claims that every hour of sleep will improve my mood by quite a lot, but it might just be rubbish. Remember, the regression model only produces a prediction Ŷi about what my mood is like, but my actual mood is Yi. If these two are very close, then the regression model has done a good job. If they are very different, then it has done a bad job.
The R² (R-squared) value
Once again, let’s wrap a little bit of mathematics around this. Firstly, we’ve got the sum of the squared residuals:

SSres = Σi (Yi − Ŷi)²

which we would hope to be pretty small. Specifically, what we’d like is for it to be very small in comparison to the total variability in the outcome variable:

SStot = Σi (Yi − Ȳ)²
While we’re here, let’s calculate these values ourselves, though not by hand. Let’s use jamovi instead. Open up the parenthood data set in jamovi so that we can work with it. The first thing to do is calculate the Ŷ values, and for the simple model that uses only a single predictor we would do the following:
- Go to an empty column (at the end of the data set) and double click on the column header, choose New computed variable, enter Y_pred in the first line, and enter the formula 125.97 + (-8.94 * dani.sleep) in the line starting with = (next to the fx).
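The computed-variable step above can be sketched in plain Python. This is only an illustration: the coefficients 125.97 and −8.94 come from the fitted model in the text, but the sleep values below are hypothetical stand-ins for the dani.sleep column.

```python
# Sketch of the Y_pred computed variable, using the fitted coefficients
# from the text. The sleep values are hypothetical, not the real data.
INTERCEPT = 125.97
SLOPE = -8.94

def predict_grump(sleep_hours):
    """Regression prediction: 125.97 + (-8.94 * dani.sleep)."""
    return INTERCEPT + SLOPE * sleep_hours

sleep = [5.0, 7.0, 9.0]  # hypothetical dani.sleep values
y_pred = [predict_grump(s) for s in sleep]
```

Each prediction is just the intercept plus the slope times the number of hours slept, exactly as the jamovi formula specifies.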
Okay, now that we’ve got a variable which stores the regression model predictions for how grumpy I will be on any given day, let’s calculate our sum of squared residuals. We would do that using the following steps:
- Calculate the squared residuals by creating a new column called sq_resid using the formula (dani.grump - Y_pred) ^ 2. The values in this column are later summed up to obtain SSres.
- Calculate the squared deviations from the mean by creating yet another column called sq_total using the formula (dani.grump - VMEAN(dani.grump)) ^ 2. The values in this column are later summed up to obtain SStot.
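The two computed columns can also be sketched in plain Python, using short hypothetical lists in place of the dani.grump and Y_pred columns (the real parenthood data set is much longer):

```python
# Sketch of the sq_resid and sq_total columns. The grumpiness and
# prediction values here are hypothetical stand-ins for the real columns.
grump = [85.0, 70.0, 62.0]      # hypothetical dani.grump values
y_pred = [81.25, 72.31, 63.37]  # hypothetical Y_pred values

# (dani.grump - Y_pred) ^ 2
sq_resid = [(g - p) ** 2 for g, p in zip(grump, y_pred)]

# (dani.grump - VMEAN(dani.grump)) ^ 2
mean_grump = sum(grump) / len(grump)
sq_total = [(g - mean_grump) ** 2 for g in grump]

ss_res = sum(sq_resid)  # SSres
ss_tot = sum(sq_total)  # SStot
```

VMEAN and the column arithmetic in jamovi correspond to the plain mean and list comprehensions here; summing each column gives SSres and SStot.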
To calculate the sum of these values, click Descriptives → Descriptive Statistics and move sq_resid and sq_total across to the Variables box. You’ll then need to select Sum from the Statistics drop-down menu below. The sum of sq_resid has a value of 1838.722. This is a big number, but on its own it doesn’t mean very much. The sum of sq_total has a value of 9998.590. Well, that’s a much bigger number, about five times bigger than the last one, so this does suggest that our regression model was making good predictions (that is, it has greatly reduced the residual error compared to the model that uses the mean as its single predictor). But it’s not very interpretable.
To fix this, we’d like to convert these two fairly meaningless numbers into one number. A nice, interpretable number, which for no particular reason we’ll call R². What we would like is for the value of R² to be equal to 1 if the regression model makes no errors in predicting the data. In other words, if it turns out that the residual errors are zero. That is, if SSres = 0 then we expect R² = 1. Similarly, if the model is completely useless, we would like R² to be equal to 0. What do I mean by “useless”? Tempting as it is to demand that the regression model move out of the house, cut its hair and get a real job, I’m probably going to have to pick a more practical definition. In this case, all I mean is that the residual sum of squares is no smaller than the total sum of squares, so that SSres = SStot. Wait, why don’t we do exactly that? The formula that provides us with our R² value is pretty simple to write down, and equally simple to calculate by hand:[1]

R² = 1 − SSres / SStot
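Plugging the two sums from jamovi into this formula is a one-liner. A quick sketch, using the values reported above:

```python
# R² = 1 − SSres / SStot, using the two sums reported in the text.
ss_res = 1838.722  # sum of the sq_resid column
ss_tot = 9998.590  # sum of the sq_total column

r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 3))  # 0.816
```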
This gives a value for R² of 0.816. The R² value, sometimes called the coefficient of determination,[2] has a simple interpretation: it is the proportion of the variance in the outcome variable that can be accounted for by the predictor. So, in this case the fact that we have obtained R² = 0.816 means that the predictor (dani.sleep) explains 81.6% of the variance in the outcome (dani.grump).[3]
Naturally, you don’t actually need to do all these calculations yourself if you want to obtain the R² value for your regression model. As we’ll see later on in Running the hypothesis tests in jamovi, all you need to do is specify this as an option in jamovi. However, let’s put that to one side for the moment. There’s another property of R² that I want to point out.
The relationship between regression and correlation
At this point we can revisit my earlier claim that regression, in this very simple form that I’ve discussed so far, is basically the same thing as a correlation. Previously, we used the symbol r to denote a Pearson correlation. Might there be some relationship between the value of the correlation coefficient r and the R² value from linear regression? Of course there is: the squared correlation coefficient r² is identical to the R² value for a linear regression with only a single predictor. In other words, running a Pearson correlation is more or less equivalent to running a linear regression model that uses only one predictor variable.
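This equivalence is easy to check numerically. The sketch below computes the Pearson correlation and the single-predictor regression R² from first principles on a small made-up data set; the squared correlation and the regression R² agree to machine precision:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient, computed from first principles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def regression_r_squared(x, y):
    """R² = 1 − SSres/SStot from a least-squares fit with one predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1 - ss_res / ss_tot

x = [4.0, 5.0, 6.0, 7.0, 8.0]       # made-up hours of sleep
y = [91.0, 84.0, 74.0, 61.0, 55.0]  # made-up grumpiness
```

Here pearson_r(x, y) ** 2 and regression_r_squared(x, y) give the same number, which is exactly the identity claimed above.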
The adjusted R² (R-squared) value
One final thing to point out before moving on. It’s quite common for people to report a slightly different measure of model performance, known as “adjusted R²”. The motivation behind calculating the adjusted R² value is the observation that adding more predictors into the model will always cause the R² value to increase (or at least not decrease).
The adjusted R² value introduces a slight change to the calculation, as follows. For a regression model with K predictors, fit to a data set containing N observations, the adjusted R² is:

adj. R² = 1 − (SSres / (N − K − 1)) / (SStot / (N − 1))
This adjustment is an attempt to take the degrees of freedom into account. The big advantage of the adjusted R² value is that when you add more predictors to the model, the adjusted R² value will only increase if the new variables improve the model performance more than you’d expect by chance. The big disadvantage is that the adjusted R² value can’t be interpreted in the elegant way that R² can. R² has a simple interpretation as the proportion of variance in the outcome variable that is explained by the regression model. To my knowledge, no equivalent interpretation exists for adjusted R².
An obvious question then is whether you should report R² or adjusted R². This is probably a matter of personal preference. If you care more about interpretability, then R² is better. If you care more about correcting for bias, then adjusted R² is probably better. Speaking just for myself, I prefer R². My feeling is that it’s more important to be able to interpret your measure of model performance. Besides, as we’ll see in section Hypothesis tests for regression models, if you’re worried that the improvement in R² that you get by adding a predictor is just due to chance and not because it’s a better model, well we’ve got hypothesis tests for that.
[1] If you don’t want to do these calculations by hand, just create another computed variable called, e.g., R2, containing the formula 1 - VSUM(sq_resid) / VSUM(sq_total). But then you have a whole column containing R².

[2] And by “sometimes” I mean “almost never”. In practice everyone just calls it “R-squared”.

[3] If you made a mistake or could not follow the explanations, you can simply download and open the parenthood_r2 data set.