Section authors: Danielle J. Navarro and David R. Foxcroft

The χ² (chi-square) goodness-of-fit test

The χ² goodness-of-fit test is one of the oldest hypothesis tests around. It was invented by Karl Pearson (1900), with some corrections made later by Sir Ronald Fisher (1922a). It tests whether an observed frequency distribution of a nominal variable matches an expected frequency distribution. For example, suppose a group of patients has undergone an experimental treatment and has had their health assessed, to see whether their condition has improved, stayed the same or worsened. A goodness-of-fit test could be used to determine whether the numbers in each category (improved, no change, worse) match the numbers that would be expected given the standard treatment option. Let's think about this a little more, with some psychology.

The cards data

Over the years, many studies have shown that people find it very hard to simulate randomness. Try as we might to "act" randomly, we think in terms of patterns and structure, and so when asked to "do something at random" what people actually do is anything but random. As a consequence, the study of human randomness (or non-randomness, as the case may be) opens up a lot of deep psychological questions about how we think about the world. With this in mind, let's consider a very simple study. Suppose I asked people to imagine a shuffled deck of cards, and to mentally pick one card from this imaginary deck "at random". After they have picked one card, I ask them to mentally pick a second one. For both choices, what we're going to look at is the suit (hearts, clubs, spades or diamonds) that people chose. After asking, say, N = 200 people to do this, I'd like to look at the data and work out whether or not the cards that people pretended to select were really random. The data are contained in the randomness data set, in which, when you open it up in jamovi and take a look at the spreadsheet view, you will see three variables. These are: an id variable that assigns a unique identifier to each participant, and the two variables choice_1 and choice_2 that indicate the card suits that people chose.

For the moment, let's just focus on the first choice that people made. We'll use the Frequency tables option under Exploration → Descriptives to count the number of times we observed people choosing each suit. This is what we get:

clubs diamonds   hearts   spades
   35       51       64       50

That little frequency table is quite helpful. Looking at it, there's a bit of a hint that people may be more likely to select hearts than clubs, but it's not completely obvious just from looking at it whether that's really true or whether it's just due to chance. So we'll probably have to do some sort of statistical analysis to find out, which is what I'm going to talk about in the next section.
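
If you want to reproduce this count outside of jamovi, a minimal sketch in R would look something like the following (the file name randomness.csv is an assumption about how you exported the data; the variable name choice_1 is the one described above):

# read the exported data and tabulate the first choices
randomness <- read.csv("randomness.csv")
table(randomness$choice_1)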

Excellent. From this point on, we'll treat this table as the data that we're looking to analyse. However, since I'm going to have to talk about these data in mathematical terms (sorry!), it might be a good idea to be clear about what the notation is. In mathematical notation, we shorten the human-readable word "observed" to the letter O, and we use subscripts to indicate the position of the observation. So the second observation in our table is written as O2 in maths. The relationship between the English descriptions and the mathematical symbols is illustrated below:

label           index, i    math. symbol    the value
clubs (♣)           1            O1             35
diamonds (♢)        2            O2             51
hearts (♡)          3            O3             64
spades (♠)          4            O4             50

Hopefully that's pretty clear. It's also worth noting that mathematicians prefer to talk about general rather than specific things, so you'll also see the notation Oi, which refers to the number of observations that fall within the i-th category (where i could be 1, 2, 3 or 4). Finally, if we want to refer to the set of all observed frequencies, statisticians group all of the observed values into a vector,[1] which I'll refer to as O.

O = (O1, O2, O3, O4)

Again, there's nothing new or interesting here. It's just notation. If I say that O = (35, 51, 64, 50), all I'm doing is describing the table of observed frequencies (i.e., observed), but I'm referring to it using mathematical notation.

The null hypothesis and the alternative hypothesis

As the last section indicated, our research hypothesis is that "people don't choose cards randomly". What we're going to do now is translate this into some statistical hypotheses and then construct a statistical test of those hypotheses. The test that I'm going to describe to you is Pearson's χ² (chi-square) goodness-of-fit test, and as is so often the case we have to begin by carefully constructing our null hypothesis. In this case, it's pretty easy. First, let's state the null hypothesis in words:

H0: All four suits are chosen with equal probability

Since we're talking statistics here, we should be able to say the same thing in a mathematical way. One way to do this is to use the notation Pj to refer to the true probability that the j-th suit is chosen. If the null hypothesis is true, then each of the four suits has a 25% chance of being selected. In other words, our null hypothesis claims that P1 = 0.25, P2 = 0.25, P3 = 0.25 and finally that P4 = 0.25. However, in the same way that we can group our observed frequencies into a vector O that summarises the entire data set, we can use P to refer to the probabilities that correspond to our null hypothesis. So if I let the vector P = (P1, P2, P3, P4) refer to the collection of probabilities that describe our null hypothesis, then we have:

H0: P = (0.25, 0.25, 0.25, 0.25)

In this particular instance, our null hypothesis corresponds to a vector of probabilities P in which all of the probabilities are equal to one another. But this doesn't have to be the case. For example, if the experimental task were for people to imagine that they were drawing from a deck that had twice as many clubs as any other suit, then the null hypothesis would correspond to something like P = (0.4, 0.2, 0.2, 0.2). As long as the probabilities are all positive numbers and they all sum to 1, it's a perfectly legitimate choice for the null hypothesis. However, the most common use of the goodness-of-fit test is to test a null hypothesis that all of the categories are equally likely, so we'll stick to that for our example.

What about our alternative hypothesis, H1? All we're really interested in is demonstrating that the probabilities involved aren't all identical (that is, people's choices weren't completely random). As a consequence, the "human friendly" versions of our hypotheses look like this:

H0: All four suits are chosen with equal probability
H1: At least one of the suit-choice probabilities isn't 0.25

and the "mathematician friendly" version is:

H0: P = (0.25, 0.25, 0.25, 0.25)
H1: P ≠ (0.25, 0.25, 0.25, 0.25)

The "goodness-of-fit" test statistic

At this point, we have our observed frequencies O and a collection of probabilities P corresponding to the null hypothesis that we want to test. What we now want to do is construct a test of the null hypothesis. As always, if we want to test H0 against H1, we're going to need a test statistic. The basic trick that a goodness-of-fit test uses is to construct a test statistic that measures how "close" the data are to the null hypothesis. If the data don't resemble what you'd "expect" to see if the null hypothesis were true, then it probably isn't true. Okay, if the null hypothesis were true, what would we expect to see? Or, to use the correct terminology, what are the expected frequencies? There are N = 200 observations, and (if the null is true) the probability of any one of them choosing a heart is P3 = 0.25, so I guess we're expecting 200 · 0.25 = 50 hearts, right? Or, more specifically, if we let Ei refer to "the number of category i responses that we're expecting if the null is true", then:

Ei = N · Pi

This is pretty easy to calculate. If there are 200 observations that can fall into four categories, and we think that all four categories are equally likely, then on average we'd expect to see 50 observations in each category, right?
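
If you'd like to check that arithmetic, here is a short sketch in R (the object names observed, probabilities and expected are mine, purely for illustration):

# observed frequencies, taken from the table above
observed <- c(clubs = 35, diamonds = 51, hearts = 64, spades = 50)
N <- sum(observed)                   # 200 observations in total
probabilities <- rep(0.25, 4)        # the null hypothesis
expected <- N * probabilities        # Ei = N * Pi, i.e. 50 per suit
expected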

Now, how do we translate this into a test statistic? Clearly, what we want to do is compare the expected number of observations in each category (Ei) with the observed number of observations in that category (Oi). And on the basis of this comparison we ought to be able to come up with a good test statistic. To start with, let’s calculate the difference between what the null hypothesis expected us to find and what we actually did find. That is, we calculate the “observed minus expected” difference score, Oi - Ei. This is illustrated in the following table:

                                clubs   diamonds   hearts   spades
expected frequency   Ei            50         50       50       50
observed frequency   Oi            35         51       64       50
difference score     Oi - Ei      -15          1       14        0

So, based on our calculations, it's clear that people chose more hearts and fewer clubs than the null hypothesis predicted. However, a moment's thought suggests that these raw differences aren't quite what we're looking for. Intuitively, it feels like it's just as bad when the null hypothesis predicts too few observations (which is what happened with hearts) as it is when it predicts too many (which is what happened with clubs). So it's a bit weird that we have a negative number for clubs and a positive number for hearts. One easy way to fix this is to square everything, so that we now calculate the squared differences, (Oi - Ei)². As before, we can do this by hand:

(observed - expected) ^ 2
   clubs diamonds   hearts   spades
     225        1      196        0

Now we’re making progress. What we’ve got now is a collection of numbers that are big whenever the null hypothesis makes a bad prediction (clubs and hearts), but are small whenever it makes a good one (diamonds and spades). Next, for some technical reasons that I’ll explain in a moment, let’s also divide all these numbers by the expected frequency Ei, so we’re actually calculating \(\frac{(E_i-O_i)^2}{E_i}\). Since Ei = 50 for all categories in our example, it’s not a very interesting calculation, but let’s do it anyway:

(observed - expected) ^ 2 / expected
   clubs diamonds   hearts   spades
    4.50     0.02     3.92     0.00

In effect, what we’ve got here are four different “error” scores, each one telling us how big a “mistake” the null hypothesis made when we tried to use it to predict our observed frequencies. So, in order to convert this into a useful test statistic, one thing we could do is just add these numbers up. The result is called the goodness-of-fit statistic, conventionally referred to either as χ² (chi-square) or GOF. We can calculate it as follows:

sum((observed - expected) ^ 2 / expected)

This gives us a value of 8.44.
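
If you want to run that line yourself, here is a self-contained version of the whole calculation (again, the vector names are just for illustration):

observed <- c(clubs = 35, diamonds = 51, hearts = 64, spades = 50)
expected <- c(50, 50, 50, 50)
sum((observed - expected) ^ 2 / expected)   # gives 8.44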

If we let k refer to the total number of categories (i.e., k = 4 for our cards data), then the χ² statistic is given by:

\[\chi^2 = \sum_{i=1}^k \frac{(O_i - E_i)^2}{E_i}\]

Intuitively, it’s clear that if χ² is small, then the observed data Oi are very close to what the null hypothesis predicted Ei, so we’re going to need a large χ² statistic in order to reject the null.

As we’ve seen from our calculations, in our cards data set we’ve got a value of χ² = 8.44. So now the question becomes is this a big enough value to reject the null?

The sampling distribution of the GOF statistic

To determine whether or not a particular value of χ² is large enough to justify rejecting the null hypothesis, we’re going to need to figure out what the sampling distribution for χ² would be if the null hypothesis were true. So that’s what I’m going to do in this section. I’ll show you in a fair amount of detail how this sampling distribution is constructed, and then, in the next section, use it to build up a hypothesis test. If you want to cut to the chase and are willing to take it on faith that the sampling distribution is a χ²-distribution with k - 1 degrees of freedom, you can skip the rest of this section. However, if you want to understand why the goodness-of-fit test works the way it does, read on.

Okay, let's suppose that the null hypothesis is actually true. If so, then the true probability that an observation falls in the i-th category is Pi. After all, that's pretty much the definition of our null hypothesis. Let's think about what this actually means. This is kind of like saying that "nature" makes the decision about whether or not the observation ends up in category i by flipping a weighted coin (i.e., one where the probability of getting a head is Pi). And therefore we can think of our observed frequency Oi by imagining that nature flipped N of these coins (one for each observation in the data set), and exactly Oi of them came up heads. Obviously, this is a pretty weird way to think about the experiment. But what it does (I hope) is remind you that we've actually seen this scenario before. It's exactly the same set-up that gave rise to the binomial distribution. In other words, if the null hypothesis is true, then it follows that our observed frequencies were generated by sampling from a binomial distribution:

Oi ~ Binomial(Pi, N)

Now, if you remember from our discussion of the central limit theorem the binomial distribution starts to look pretty much identical to the normal distribution, especially when N is large and when Pi isn’t too close to 0 or 1. In other words as long as N · Pi is large enough. Or, to put it another way, when the expected frequency Ei is large enough then the theoretical distribution of Oi is approximately normal. Better yet, if Oi is normally distributed, then so is \((O_i - E_i)/\sqrt{E_i}\). Since Ei is a fixed value, subtracting off Ei and dividing by \(\sqrt{E_i}\) changes the mean and standard deviation of the normal distribution but that’s all it does. Okay, so now let’s have a look at what our goodness-of-fit statistic actually is. What we’re doing is taking a bunch of things that are normally-distributed, squaring them, and adding them up. Wait. We’ve seen that before too! As we discussed in Other useful distributions, when you take a bunch of things that have a standard normal distribution (i.e., mean 0 and standard deviation 1), square them and then add them up, the resulting quantity has a χ²-distribution. So now we know that the null hypothesis predicts that the sampling distribution of the goodness-of-fit statistic is a χ²-distribution. Cool.
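
If you would rather see this result than take it on faith, a small simulation makes the same point: generate many fake data sets under the null hypothesis, compute the goodness-of-fit statistic for each one, and compare the histogram of those statistics to a χ²-distribution with 3 degrees of freedom. A sketch in R (the object names are mine, and 10,000 replications is an arbitrary choice):

# simulate the sampling distribution of the GOF statistic under the null
set.seed(1)
N <- 200
P <- rep(0.25, 4)
E <- N * P
gof <- replicate(10000, {
  O <- as.vector(rmultinom(1, size = N, prob = P))  # one fake data set
  sum((O - E) ^ 2 / E)                              # its GOF statistic
})
hist(gof, breaks = 50, freq = FALSE)
curve(dchisq(x, df = 3), add = TRUE)   # theoretical chi-square density, df = 3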

There’s one last detail to talk about, namely the degrees of freedom. If you remember back to Other useful distributions, I said that if the number of things you’re adding up is k, then the degrees of freedom for the resulting χ²-distribution is k. Yet, what I said at the start of this section is that the actual degrees of freedom for the χ²-goodness-of-fit test is k - 1. What’s up with that? The answer here is that what we’re supposed to be looking at is the number of genuinely independent things that are getting added together. And, as I’ll go on to talk about in the next section, even though there are k things that we’re adding only k - 1 of them are truly independent, and so the degrees of freedom is actually only k - 1. That’s the topic of the next section.[2]

Degrees of freedom

When I introduced the χ²-distribution in Other useful distributions, I was a bit vague about what "degrees of freedom" actually means. Obviously, it matters. Looking at Figure 75, you can see that if we change the degrees of freedom then the χ²-distribution changes shape quite substantially. But what exactly is it? Again, when I introduced the distribution and explained its relationship to the normal distribution, I did offer an answer: it's the number of "normally distributed variables" that I'm squaring and adding together. But, for most people, that's kind of abstract and not entirely helpful. What we really need to do is try to understand degrees of freedom in terms of our data. So here goes.

Figure 75: χ² (chi-square) distributions with different values for the "degrees of freedom"

The basic idea behind degrees of freedom is quite simple. You calculate it by counting up the number of distinct “quantities” that are used to describe your data and then subtracting off all of the “constraints” that those data must satisfy.[3] This is a bit vague, so let’s use our cards data as a concrete example. We describe our data using four numbers, O1, O2, O3 and O4 corresponding to the observed frequencies of the four different categories (hearts, clubs, diamonds, spades). These four numbers are the random outcomes of our experiment. But my experiment actually has a fixed constraint built into it: the sample size N.[4] That is, if we know how many people chose hearts, how many chose diamonds and how many chose clubs, then we’d be able to figure out exactly how many chose spades. In other words, although our data are described using four numbers, they only actually correspond to 4 - 1 = 3 degrees of freedom. A slightly different way of thinking about it is to notice that there are four probabilities that we’re interested in (again, corresponding to the four different categories), but these probabilities must sum to one, which imposes a constraint. Therefore the degrees of freedom is 4 - 1 = 3. Regardless of whether you want to think about it in terms of the observed frequencies or in terms of the probabilities, the answer is the same. In general, when running the χ² (chi-square) goodness-of-fit test for an experiment involving k groups, then the degrees of freedom will be k - 1.
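
To make the constraint concrete: once the sample size N is fixed, any three of the observed frequencies determine the fourth, so only three of them are free to vary. A trivial sketch:

# with N fixed at 200, the fourth count is not free to vary
N <- 200
clubs <- 35; diamonds <- 51; hearts <- 64
spades <- N - (clubs + diamonds + hearts)   # necessarily 50
spades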

Testing the null hypothesis

The final step in the process of constructing our hypothesis test is to figure out what the rejection region is. That is, what values of χ² would lead us to reject the null hypothesis? As we saw earlier, large values of χ² imply that the null hypothesis has done a poor job of predicting the data from our experiment, whereas small values of χ² imply that it's actually done pretty well. Therefore, a pretty sensible strategy would be to say there is some critical value such that if χ² is bigger than the critical value we reject the null, but if χ² is smaller than this value we retain the null. In other words, to use the language we introduced in the chapter on Hypothesis testing, the χ²-goodness-of-fit test is always a one-sided test. Right, so all we have to do is figure out what this critical value is. And it's pretty straightforward. If we want our test to have a significance level of α = 0.05 (that is, we are willing to tolerate a Type I error rate of 5%), then we have to choose our critical value so that there is only a 5% chance that χ² could get to be that big if the null hypothesis is true. This is illustrated in Figure 76.

Figure 76: Illustration of how hypothesis testing works for the χ² (chi-square) goodness-of-fit test

Ah, but, I hear you ask, how do I find the critical value of a χ²-distribution with k - 1 degrees of freedom? Many, many years ago, when I first took a psychology statistics class, we used to look up these critical values in a book of critical value tables, like the one in Table 11. From that table we can see that the critical value for a χ²-distribution with 3 degrees of freedom, and p = 0.05, is 7.815.

Table 11: Table of critical values for the χ² (chi-square) distribution

                                  Probability
       -------------- non-significant --------------    ------ significant ------
 df     0.95     0.90     0.70     0.50     0.30     0.10     0.05     0.01    0.001
  1    0.004    0.016    0.148    0.455    1.074    2.706    3.841    6.635   10.828
  2    0.103    0.211    0.713    1.386    2.408    4.605    5.991    9.210   13.816
  3    0.352    0.584    1.424    2.366    3.665    6.251    7.815   11.345   16.266
  4    0.711    1.064    2.195    3.357    4.878    7.779    9.488   13.277   18.467
  5    1.145    1.610    3.000    4.351    6.064    9.236   11.070   15.086   20.515
  6    1.635    2.204    3.828    5.348    7.231   10.645   12.592   16.812   22.458
  7    2.167    2.833    4.671    6.346    8.383   12.017   14.067   18.475   24.322
  8    2.733    3.490    5.527    7.344    9.524   13.362   15.507   20.090   26.124
  9    3.325    4.168    6.393    8.343   10.656   14.684   16.919   21.666   27.877
 10    3.940    4.865    7.267    9.342   11.781   15.987   18.307   23.209   29.588

So, if our calculated χ² statistic is bigger than the critical value of 7.815, then we can reject the null hypothesis (remember that the null hypothesis, H0, is that all four suits are chosen with equal probability). Since we actually already calculated that before (i.e., χ² = 8.44) we can reject the null hypothesis. And that’s it, basically. You now know “Pearson’s χ² test for the goodness-of-fit”. Lucky you.
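
These days you don't really need the printed table: both the critical value and the exact p-value can be read off the χ²-distribution directly. For example, in R (a sketch, not part of the jamovi analysis):

# critical value for a 5% test with 3 degrees of freedom
qchisq(p = 0.95, df = 3)                       # 7.815
# exact p-value for the statistic we calculated
pchisq(q = 8.44, df = 3, lower.tail = FALSE)   # roughly 0.038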

Doing the test in jamovi

Not surprisingly, jamovi provides an analysis that will do these calculations for you. From the main Analyses toolbar select Frequencies → One Sample Proportion Tests → N Outcomes. Then, in the options panel that appears, move the variable you want to analyse (choice_1) across into the Variable box. Also, click on the Expected counts check box so that these are shown in the results table. When you have done all this, you should see the analysis results in jamovi as in Figure 77. No surprise, then, that jamovi reports the same expected counts and statistics that we calculated by hand above: a χ² value of 8.44, with df = 3 and p = 0.038. Note that we no longer need to look up a critical value in a table, as jamovi gives us the exact p-value of the calculated χ² for df = 3.
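
As a cross-check outside of jamovi, base R's chisq.test() function runs the same Pearson goodness-of-fit test, with equal expected proportions as its default null hypothesis:

# same observed frequencies as before; equal proportions are the default null
chisq.test(c(clubs = 35, diamonds = 51, hearts = 64, spades = 50))
# reports X-squared = 8.44, df = 3, p roughly 0.038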

Figure 77: χ² One Sample Proportion Test in jamovi, with table showing both observed and expected frequencies and proportions

Specifying a different null hypothesis

At this point you might be wondering what to do if you want to run a goodness-of-fit test but your null hypothesis is not that all categories are equally likely. For instance, let's suppose that someone had made the theoretical prediction that people should choose red cards 60% of the time and black cards 40% of the time (I've no idea why you'd predict that), but had no other preferences. If that were the case, the null hypothesis would be to expect 30% of the choices to be hearts, 30% to be diamonds, 20% to be spades and 20% to be clubs. In other words, we would expect hearts and diamonds to appear 1.5 times more often than spades and clubs (the ratio 30% : 20% is the same as 1.5 : 1). This seems like a silly theory to me, and it's pretty easy to test this explicitly specified null hypothesis with the data in our jamovi analysis. In the analysis window (labelled Proportion Test (N Outcomes) in Figure 77) you can expand the options for Expected Proportions. When you do this, there are options for entering different ratio values for the variable you have selected, in our case choice_1. Change the ratios to reflect the new null hypothesis, as in Figure 78, and see how the results change.

The expected counts are now:

                                clubs   diamonds   hearts   spades
expected frequency   Ei            40         60       60       40

and the χ² statistic is 4.74, df = 3, p = 0.192. Now, the results of our updated hypotheses and the expected frequencies are different from what they were last time. As a consequence our χ² test statistic is different, and our p-value is different too. Annoyingly, the p-value is 0.192, so we can’t reject the null hypothesis (look back at section The p-value of a test to remind yourself why). Sadly, despite the fact that the null hypothesis corresponds to a very silly theory, these data don’t provide enough evidence against it.
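
The same numbers can be reproduced by plugging the new expected proportions into the goodness-of-fit formula by hand, or with a few lines of R (a sketch):

observed <- c(clubs = 35, diamonds = 51, hearts = 64, spades = 50)
P <- c(0.2, 0.3, 0.3, 0.2)                  # clubs, diamonds, hearts, spades
expected <- 200 * P                         # 40, 60, 60, 40
chi2 <- sum((observed - expected) ^ 2 / expected)
chi2                                        # 4.74
pchisq(chi2, df = 3, lower.tail = FALSE)    # roughly 0.192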

Figure 78: Changing the expected proportions in the χ² One Sample Proportion Test in jamovi

How to report the results of the test

So now you know how the test works, and you know how to do the test using a wonderful jamovi flavoured magic computing box. The next thing you need to know is how to write up the results. After all, there’s no point in designing and running an experiment and then analysing the data if you don’t tell anyone about it! So let’s now talk about what you need to do when reporting your analysis. Let’s stick with our card-suits example. If I wanted to write this result up for a paper or something, then the conventional way to report this would be to write something like this:

Of the 200 participants in the experiment, 64 selected hearts for their first choice, 51 selected diamonds, 50 selected spades, and 35 selected clubs. A χ²-goodness-of-fit test was conducted to test whether the choice probabilities were identical for all four suits. The results were significant (χ²(3) = 8.44, p < 0.05), suggesting that people did not select suits purely at random.

This is pretty straightforward and hopefully it seems pretty unremarkable. That said, there are a few things that you should note about this description:

  • The statistical test is preceded by the descriptive statistics. That is, I told the reader something about what the data look like before going on to do the test. In general, this is good practice. Always remember that your reader doesn’t know your data anywhere near as well as you do. So, unless you describe it to them properly, the statistical tests won’t make any sense to them and they’ll get frustrated and cry.

  • The description tells you what the null hypothesis being tested is. To be honest, writers don't always do this, but it's often a good idea in those situations where some ambiguity exists, or when you can't rely on your readership being intimately familiar with the statistical tools that you're using. Quite often the reader might not know (or remember) all the details of the test that you're using, so it's a kind of politeness to "remind" them! As far as the goodness-of-fit test goes, you can usually rely on a scientific audience knowing how it works (since it's covered in most intro stats classes). However, it's still a good idea to be explicit about stating the null hypothesis (briefly!) because the null hypothesis can be different depending on what you're using the test for. For instance, in the cards example my null hypothesis was that all four suit probabilities were identical (i.e., P1 = P2 = P3 = P4 = 0.25), but there's nothing special about that hypothesis. I could just as easily have tested the null hypothesis that P1 = 0.7 and P2 = P3 = P4 = 0.1 using a goodness-of-fit test. So it's helpful to the reader if you explain to them what your null hypothesis was. Also, notice that I described the null hypothesis in words, not in maths. That's perfectly acceptable. You can describe it in maths if you like, but since most readers find words easier to read than symbols, most writers tend to describe the null using words if they can.

  • A “stat block” is included. When reporting the results of the test itself, I didn’t just say that the result was significant, I included a “stat block” (i.e., the dense mathematical-looking part in the parentheses) which reports all the “key” statistical information. For the χ²-goodness-of-fit test, the information that gets reported is the test statistic (that the goodness-of-fit statistic was 8.44), the information about the distribution used in the test (χ² with 3 degrees of freedom which is usually shortened to “χ²(3)”), and then the information about whether the result was significant (in this case p < 0.05). The particular information that needs to go into the stat block is different for every test, and so each time I introduce a new test I’ll show you what the stat block should look like.[5] However, the general principle is that you should always provide enough information so that the reader could check the test results themselves if they really wanted to.

  • The results are interpreted. In addition to indicating that the result was significant, I provided an interpretation of the result (i.e., that people didn’t choose randomly). This is also a kindness to the reader, because it tells them something about what they should believe about what’s going on in your data. If you don’t include something like this, it’s really hard for your reader to understand what’s going on.[6]

As with everything else, your overriding concern should be that you explain things to your reader. Always remember that the point of reporting your results is to communicate to another human being. I cannot tell you just how many times I’ve seen the results section of a report or a thesis or even a scientific article that is just gibberish, because the writer has focused solely on making sure they’ve included all the numbers and forgotten to actually communicate with the human reader.

A comment on statistical notation

Satan delights equally in statistics and in quoting scripture
– H.G. Wells

If you've been reading very closely, and are as much of a mathematical pedant as I am, there is one thing about the way I wrote up the χ²-test in the last section that might be bugging you a little bit. There's something that feels a bit wrong with writing "χ²(3) = 8.44", you might be thinking. After all, it's the goodness-of-fit statistic that is equal to 8.44, so shouldn't I have written χ² = 8.44 or maybe GOF = 8.44? This seems to be conflating the sampling distribution (i.e., χ² with df = 3) with the test statistic (i.e., χ²). Odds are you figured it was a typo, since χ and X look pretty similar. Oddly, it's not. Writing χ²(3) = 8.44 is essentially a highly condensed way of saying "the sampling distribution of the test statistic is χ²(3), and the value of the test statistic is 8.44".

In one sense, this is kind of stupid. There are lots of different test statistics out there that turn out to have a χ²-sampling-distribution. The χ²-statistic that we've used for our goodness-of-fit test is only one of many (albeit one of the most commonly encountered ones). In a sensible, perfectly organised world we'd always have a separate name for the test statistic and the sampling distribution. That way, the stat block itself would tell you exactly what it was that the researcher had calculated. Sometimes this happens. For instance, the test statistic used in the Pearson goodness-of-fit test is written χ², but there's a closely related test known as the G-test (Sokal & Rohlf, 2011),[7] in which the test statistic is written as G. As it happens, the Pearson goodness-of-fit test and the G-test both test the same null hypothesis, and the sampling distribution is exactly the same (i.e., a χ²-distribution with k - 1 degrees of freedom). If I'd done a G-test for the cards data rather than a goodness-of-fit test, then I'd have ended up with a test statistic of G = 8.65, which is slightly different from the χ² = 8.44 value that I got earlier and which produces a slightly smaller p-value of p = 0.034. Suppose that the convention were to report the test statistic, then the sampling distribution, and then the p-value. If that were true, then these two situations would produce different stat blocks: my original result would be written χ² = 8.44, χ²(3), p = 0.038, whereas the new version using the G-test would be written as G = 8.65, χ²(3), p = 0.034. However, using the condensed reporting standard, the original result is written χ²(3) = 8.44, p = 0.038, and the new one is written χ²(3) = 8.65, p = 0.034, and so it's actually unclear which test I actually ran.
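
For the curious, the G statistic for the cards data can be computed from the same observed and expected frequencies using the standard likelihood-ratio formula G = 2 Σ Oi ln(Oi / Ei). A sketch in R (this is not part of the jamovi output):

observed <- c(35, 51, 64, 50)
expected <- rep(50, 4)
G <- 2 * sum(observed * log(observed / expected))
G                                        # 8.65
pchisq(G, df = 3, lower.tail = FALSE)    # roughly 0.034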

So why don't we live in a world in which the contents of the stat block uniquely specify which test was run? The deep reason is that life is messy. We (as users of statistical tools) want it to be nice and neat and organised. We want it to be designed, as if it were a product, but that's not how life works. Statistics is an intellectual discipline just as much as any other one, and as such it's a massively distributed, partly-collaborative and partly-competitive project that no-one really understands completely. The things that you and I use as data analysis tools weren't created by an Act of the Gods of Statistics. They were invented by lots of different people, published as papers in academic journals, implemented, corrected and modified by lots of other people, and then explained to students in textbooks by someone else. As a consequence, there are a lot of test statistics that don't even have names, and so they're just given the same name as the corresponding sampling distribution. As we'll see later, any test statistic that follows a χ²-distribution is commonly called a "χ²-statistic", anything that follows a t-distribution is called a "t-statistic", and so on. But, as the χ² versus G example illustrates, two different things with the same sampling distribution are still, well, different.

As a consequence, it’s sometimes a good idea to be clear about what the actual test was that you ran, especially if you’re doing something unusual. If you just say “χ²-test” it’s not actually clear what test you’re talking about. Although, since the two most common χ² tests are the goodness-of-fit test and the test of independence (or association), most readers with stats training can probably guess. Nevertheless, it’s something to be aware of.