Section author: Danielle J. Navarro and David R. Foxcroft
Quantifying the fit of the regression model
So we now know how to estimate the coefficients of a linear regression model. The problem is, we don’t yet know if this regression model is any good. For example, the regression model claims that every hour of sleep will improve my mood by quite a lot, but it might just be rubbish. Remember, the regression model only produces a prediction Ŷi about what my mood is like, but my actual mood is Yi. If these two are very close, then the regression model has done a good job. If they are very different, then it has done a bad job.
The R² (R-squared) value
Once again, let’s wrap a little bit of mathematics around this. Firstly, we’ve got the sum of the squared residuals:

SSres = Σi (Yi − Ŷi)²

which we would hope to be pretty small. Specifically, what we’d like is for it to be very small in comparison to the total variability in the outcome variable:

SStot = Σi (Yi − Ȳ)²
While we’re here, let’s calculate these values ourselves, though not by hand. Let’s use jamovi instead. Open up the parenthood data set in jamovi so that we can work with it. The first thing to do is calculate the Ŷ values, and for the simple model that uses only a single predictor we would do the following:
- Go to an empty column (at the end of the data set) and double click on the column header, choose New computed variable, enter Y_pred in the first line, and enter the formula 125.97 + (-8.94 * dani.sleep) in the line starting with = (next to the fx).
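The computed-variable step above can be sketched in plain Python. This is only an illustration: the coefficients 125.97 and −8.94 come from the fitted model in the text, but the sleep values below are hypothetical stand-ins for the dani.sleep column.

```python
# Sketch of the Y_pred computed variable, using the fitted coefficients
# from the text. The sleep values are hypothetical, not the real data.
INTERCEPT = 125.97
SLOPE = -8.94

def predict_grump(sleep_hours):
    """Regression prediction: 125.97 + (-8.94 * dani.sleep)."""
    return INTERCEPT + SLOPE * sleep_hours

sleep = [5.0, 7.0, 9.0]  # hypothetical dani.sleep values
y_pred = [predict_grump(s) for s in sleep]
```

Each prediction is just the intercept plus the slope times the number of hours slept, exactly as the jamovi formula specifies.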
Okay, now that we’ve got a variable which stores the regression model predictions for how grumpy I will be on any given day, let’s calculate our sum of squared residuals. We would do that using the following steps:
- Calculate the squared residuals by creating a new column called sq_resid using the formula (dani.grump - Y_pred) ^ 2. The values in this column are later summed up to obtain SSres.
- Calculate the squared deviations from the mean by creating yet another column called sq_total using the formula (dani.grump - VMEAN(dani.grump)) ^ 2. The values in this column are later summed up to obtain SStot.
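The two computed columns can also be sketched in plain Python, using short hypothetical lists in place of the dani.grump and Y_pred columns (the real parenthood data set is much longer):

```python
# Sketch of the sq_resid and sq_total columns. The grumpiness and
# prediction values here are hypothetical stand-ins for the real columns.
grump = [85.0, 70.0, 62.0]      # hypothetical dani.grump values
y_pred = [81.25, 72.31, 63.37]  # hypothetical Y_pred values

# (dani.grump - Y_pred) ^ 2
sq_resid = [(g - p) ** 2 for g, p in zip(grump, y_pred)]

# (dani.grump - VMEAN(dani.grump)) ^ 2
mean_grump = sum(grump) / len(grump)
sq_total = [(g - mean_grump) ** 2 for g in grump]

ss_res = sum(sq_resid)  # SSres
ss_tot = sum(sq_total)  # SStot
```

VMEAN and the column arithmetic in jamovi correspond to the plain mean and list comprehensions here; summing each column gives SSres and SStot.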
To calculate the sum of these values, click Descriptives → Descriptive Statistics and move sq_resid and sq_total across to the Variables box. You’ll then need to select Sum from the Statistics drop-down menu below. The sum of sq_resid has a value of 1838.722. This is a big number, but on its own it doesn’t mean very much. The sum of sq_total has a value of 9998.590. Well, that’s a much bigger number, about five times bigger than the last one, so this does suggest that our regression model was making good predictions (that is, it has greatly reduced the residual error compared to the model that uses the mean as its single predictor). But it’s not very interpretable.
To fix this, we’d like to convert these two fairly meaningless numbers into one number. A nice, interpretable number, which for no particular reason we’ll call R². What we would like is for the value of R² to be equal to 1 if the regression model makes no errors in predicting the data. In other words, if it turns out that the residual errors are zero. That is, if SSres = 0 then we expect R² = 1. Similarly, if the model is completely useless, we would like R² to be equal to 0. What do I mean by “useless”? Tempting as it is to demand that the regression model move out of the house, cut its hair and get a real job, I’m probably going to have to pick a more practical definition. In this case, all I mean is that the residual sum of squares is no smaller than the total sum of squares, so that SSres = SStot. Wait, why don’t we do exactly that? The formula that provides us with our R² value is pretty simple to write down, and equally simple to calculate by hand:[1]

R² = 1 − SSres / SStot
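Plugging the two sums from jamovi into this formula is a one-liner. A quick sketch, using the values reported above:

```python
# R² = 1 − SSres / SStot, using the two sums reported in the text.
ss_res = 1838.722  # sum of the sq_resid column
ss_tot = 9998.590  # sum of the sq_total column

r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 3))  # 0.816
```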
This gives a value for R² of 0.816. The R² value, sometimes called the coefficient of determination,[2] has a simple interpretation: it is the proportion of the variance in the outcome variable that can be accounted for by the predictor. So, in this case the fact that we have obtained R² = 0.816 means that the predictor (dani.sleep) explains 81.6% of the variance in the outcome (dani.grump).[3]
Naturally, you don’t actually need to do all these calculations yourself if you want to obtain the R² value for your regression model. As we’ll see later on in Running the hypothesis tests in jamovi, all you need to do is specify this as an option in jamovi. However, let’s put that to one side for the moment. There’s another property of R² that I want to point out.
The relationship between regression and correlation
At this point we can revisit my earlier claim that regression, in this very simple form that I’ve discussed so far, is basically the same thing as a correlation. Previously, we used the symbol r to denote a Pearson correlation. Might there be some relationship between the value of the correlation coefficient r and the R² value from linear regression? Of course there is: the squared correlation coefficient r² is identical to the R² value for a linear regression with only a single predictor. In other words, running a Pearson correlation is more or less equivalent to running a linear regression model that uses only one predictor variable.
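This equivalence is easy to check numerically. The sketch below computes the Pearson correlation and the single-predictor regression R² from first principles on a small made-up data set; the squared correlation and the regression R² agree to machine precision:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient, computed from first principles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def regression_r_squared(x, y):
    """R² = 1 − SSres/SStot from a least-squares fit with one predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1 - ss_res / ss_tot

x = [4.0, 5.0, 6.0, 7.0, 8.0]       # made-up hours of sleep
y = [91.0, 84.0, 74.0, 61.0, 55.0]  # made-up grumpiness
```

Here pearson_r(x, y) ** 2 and regression_r_squared(x, y) give the same number, which is exactly the identity claimed above.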
The adjusted R² (R-squared) value
One final thing to point out before moving on. It’s quite common for people to report a slightly different measure of model performance, known as “adjusted R²”. The motivation behind calculating the adjusted R² value is the observation that adding more predictors into the model will always cause the R² value to increase (or at least not decrease).
The adjusted R² value introduces a slight change to the calculation, as follows. For a regression model with K predictors, fit to a data set containing N observations, the adjusted R² is:

adj. R² = 1 − (SSres / (N − K − 1)) / (SStot / (N − 1))
This adjustment is an attempt to take the degrees of freedom into account. The big advantage of the adjusted R² value is that when you add more predictors to the model, the adjusted R² value will only increase if the new variables improve the model performance more than you’d expect by chance. The big disadvantage is that the adjusted R² value can’t be interpreted in the elegant way that R² can. R² has a simple interpretation as the proportion of variance in the outcome variable that is explained by the regression model. To my knowledge, no equivalent interpretation exists for adjusted R².
An obvious question then is whether you should report R² or adjusted R². This is probably a matter of personal preference. If you care more about interpretability, then R² is better. If you care more about correcting for bias, then adjusted R² is probably better. Speaking just for myself, I prefer R². My feeling is that it’s more important to be able to interpret your measure of model performance. Besides, as we’ll see in section Hypothesis tests for regression models, if you’re worried that the improvement in R² that you get by adding a predictor is just due to chance and not because it’s a better model, well we’ve got hypothesis tests for that.
[1] If you don’t want to do these calculations by hand, just create another computed variable called, e.g., R2, containing the formula 1 - VSUM(sq_resid) / VSUM(sq_total). But then you have a whole column containing R².

[2] And by “sometimes” I mean “almost never”. In practice everyone just calls it “R-squared”.

[3] If you made a mistake or could not follow the explanations, you can simply download and open the parenthood_r2 data set.