Section author: Danielle J. Navarro and David R. Foxcroft

The normal distribution

While the binomial distribution is conceptually the simplest distribution to understand, it is not the most important one. That particular honour goes to the normal distribution, also referred to as “the bell curve” or a “Gaussian distribution”. A normal distribution is described using two parameters: the mean of the distribution µ and the standard deviation of the distribution σ. The notation that we sometimes use to say that a variable X is normally distributed is as follows:

X ~ Normal(µ, σ)

Of course, that is just notation. It does not tell us anything interesting about the normal distribution itself. As was the case with the binomial distribution, I have included the formula for the normal distribution in this book, because I think it is important enough that everyone who learns statistics should at least look at it, but since this is an introductory text I do not want to focus on it, so I have tucked it away in Table 7.

Normal distribution with mean μ = 0 and standard deviation σ = 1

Fig. 61 The normal distribution with mean μ = 0 and standard deviation σ = 1. The x-axis corresponds to the value of some variable, and the y-axis tells us something about how likely we are to observe that value. However, notice that the y-axis is labelled “Probability Density” and not “Probability”.

Instead of focusing on the maths, let us try to get a sense for what it means for a variable to be normally distributed. To that end, have a look at Fig. 61, which plots a normal distribution with mean µ = 0 and standard deviation σ = 1. You can see where the name “bell curve” comes from; it looks a bit like a bell. Notice that, unlike the plots that I drew to illustrate the binomial distribution, the picture of the normal distribution in Fig. 61 shows a smooth curve instead of “histogram-like” bars. This is not an arbitrary choice: the normal distribution is continuous whereas the binomial is discrete.[1] For instance, in the dice rolling example from the last section it was possible to get three skulls or four skulls, but impossible to get 3.9 skulls. The figures that I drew in the previous section reflected this fact. In Fig. 59, for instance, there is a bar located at X = 3 and another one at X = 4 but there is nothing in between. Continuous quantities do not have this constraint. For instance, suppose we are talking about the weather. The temperature on a pleasant spring day could be 23 degrees, 24 degrees, 23.9 degrees, or anything in between, since temperature is a continuous variable. And so a normal distribution might be quite appropriate for describing spring temperatures.[2]

Normal distribution: σ = 1 and µ = 4 (solid) or µ = 7 (dashed)

Fig. 62 An illustration of what happens when you change the mean of a normal distribution. The solid line depicts a normal distribution with a mean of μ = 4. The dashed line shows a normal distribution with a mean of μ = 7. In both cases, the standard deviation is σ = 1. Not surprisingly, the two distributions have the same shape, but the dashed line is shifted to the right.

With this in mind, let us see if we can get an intuition for how the normal distribution works. First, let us have a look at what happens when we play around with the parameters of the distribution. To that end, Fig. 62 plots normal distributions that have different means but the same standard deviation. As you might expect, all of these distributions have the same “width”. The only difference between them is that they have been shifted to the left or to the right. In every other respect they are identical. In contrast, if we increase the standard deviation while keeping the mean constant, the peak of the distribution stays in the same place but the distribution gets wider, as you can see in Fig. 63.

Normal distribution: µ = 5 and σ = 1 (solid) or σ = 2 (dashed)

Fig. 63 An illustration of what happens when you change the standard deviation of a normal distribution. Both distributions plotted in this figure have a mean of μ = 5, but they have different standard deviations. The solid line plots a distribution with standard deviation σ = 1, and the dashed line shows a distribution with standard deviation σ = 2. As a consequence, both distributions are “centred” on the same spot, but the dashed line is wider than the solid one.

Notice, though, that when we widen the distribution the height of the peak shrinks. This has to happen: in the same way that the heights of the bars that we used to draw a discrete binomial distribution have to sum to 1, the total area under the curve for the normal distribution must equal 1.
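
To make the trade-off between width and height concrete, here is a minimal sketch in Python. It assumes the standard-library class `statistics.NormalDist` (Python 3.8 or later). The helper `area_under_curve` is a hypothetical name introduced only for this illustration, and it approximates the area numerically rather than computing it exactly.

```python
from statistics import NormalDist

def area_under_curve(dist, lo, hi, steps=20_000):
    """Approximate the area under dist.pdf between lo and hi (midpoint rule)."""
    width = (hi - lo) / steps
    return sum(dist.pdf(lo + (i + 0.5) * width) for i in range(steps)) * width

# The two distributions from Fig. 63: same mean, different standard deviations.
narrow = NormalDist(mu=5, sigma=1)
wide = NormalDist(mu=5, sigma=2)

# Height of each curve at its peak, which sits at the mean.
peak_narrow = narrow.pdf(5)
peak_wide = wide.pdf(5)

# Area over ten standard deviations either side of the mean,
# which covers essentially the whole curve.
area_narrow = area_under_curve(narrow, 5 - 10, 5 + 10)
area_wide = area_under_curve(wide, 5 - 20, 5 + 20)

print(f"peaks: {peak_narrow:.4f} vs {peak_wide:.4f}")
print(f"areas: {area_narrow:.4f} vs {area_wide:.4f}")
```

Doubling the standard deviation halves the height of the peak, while both areas come out at (very nearly) 1.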

Normal distribution: area under the curve for 1 and 2 SD

Fig. 64 The area under the curve tells you the probability that an observation falls within a particular range. The solid lines plot normal distributions with mean μ = 0 and standard deviation σ = 1. The shaded areas illustrate “areas under the curve” for two important cases. In the left panel, we can see that there is a 68.3% chance that an observation will fall within one standard deviation of the mean. In the right panel, we can see that there is a 95.4% chance that an observation will fall within two standard deviations of the mean.

Area under the curve for 1 SD adjacent to the mean and in the tail

Fig. 65 Two more examples of the “area under the curve” idea. There is a 15.9% chance that an observation is one standard deviation below the mean or smaller (left panel), and a 34.1% chance that the observation is somewhere between one standard deviation below the mean and the mean (right panel). Notice that if you add these two numbers together you get 15.9% + 34.1% = 50%. For normally distributed data, there is a 50% chance that an observation falls below the mean. And of course that also implies that there is a 50% chance that it falls above the mean.

Before moving on, I want to point out one important characteristic of the normal distribution. Irrespective of what the actual mean and standard deviation are, 68.3% of the area falls within one standard deviation of the mean. Similarly, 95.4% of the distribution falls within two standard deviations of the mean, and 99.7% of the distribution is within three standard deviations. This idea is illustrated in Fig. 64; see also Fig. 65.
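
These percentages can be checked with a short Python sketch. It assumes the standard-library class `statistics.NormalDist` (Python 3.8 or later), whose `cdf(x)` method returns the probability of observing a value at or below `x`. The function name `prob_within` and the second and third example distributions are hypothetical choices made only for illustration.

```python
from statistics import NormalDist

def prob_within(dist, k):
    """Probability of falling within k standard deviations of the mean."""
    upper = dist.mean + k * dist.stdev
    lower = dist.mean - k * dist.stdev
    return dist.cdf(upper) - dist.cdf(lower)

# Three normal distributions with quite different means and standard deviations.
examples = [NormalDist(0, 1), NormalDist(23, 1), NormalDist(100, 15)]

for dist in examples:
    shares = [round(prob_within(dist, k), 3) for k in (1, 2, 3)]
    print(dist, shares)   # each row shows [0.683, 0.954, 0.997]
```

The same three proportions appear for every distribution, which is the sense in which the rule holds irrespective of the mean and standard deviation.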

Probability density

There is something I have been trying to hide throughout my discussion of the normal distribution, something that some introductory textbooks omit completely. They might be right to do so. This “thing” that I am hiding is weird and counter-intuitive even by the admittedly distorted standards that apply in statistics. Fortunately, it is not something that you need to understand at a deep level in order to do basic statistics. Rather, it is something that starts to become important later on when you move beyond the basics. So, if it does not make complete sense, do not worry too much, but try to make sure that you follow the gist of it.

Throughout my discussion of the normal distribution there have been one or two things that do not quite make sense. Perhaps you noticed that the y-axis in these figures is labelled “Probability Density” rather than “Probability”. Maybe you noticed that I used p(X) instead of P(X) when giving the formula for the normal distribution.

As it turns out, what is presented here is not actually a probability; it is something else. To understand what that something is, you have to spend a little time thinking about what it really means to say that X is a continuous variable. Let us say we are talking about the temperature outside. The thermometer tells me it is 23 degrees, but I know that is not really true. It is not exactly 23 degrees. Maybe it is 23.1 degrees, I think to myself. But I know that that is not really true either because it might actually be 23.09 degrees. But I know that… well, you get the idea. The tricky thing with genuinely continuous quantities is that you never really know exactly what they are.

Now think about what this implies when we talk about probabilities. Suppose that tomorrow’s maximum temperature is sampled from a normal distribution with mean 23 and standard deviation 1. What is the probability that the temperature will be exactly 23 degrees? The answer is “zero”, or possibly “a number so close to zero that it might as well be zero”. Why is this? It is like trying to throw a dart at an infinitely small dart board. No matter how good your aim, you will never hit it. In real life you will never get a value of exactly 23. It will always be something like 23.1 or 22.99998 or suchlike. In other words, it is completely meaningless to talk about the probability that the temperature is exactly 23 degrees. However, in everyday language if I told you that it was 23 degrees outside and it turned out to be 22.9998 degrees you probably would not call me a liar. Because in everyday language “23 degrees” usually means something like “somewhere between 22.5 and 23.5 degrees”. And while it does not feel very meaningful to ask about the probability that the temperature is exactly 23 degrees, it does seem sensible to ask about the probability that the temperature lies between 22.5 and 23.5, or between 20 and 30, or any other range of temperatures.

The point of this discussion is to make clear that when we are talking about continuous distributions it is not meaningful to talk about the probability of a specific value. However, what we can talk about is the probability that the value lies within a particular range of values. To find out the probability associated with a particular range, what you need to do is calculate the “area under the curve”. We have seen this concept already: in Fig. 64 the shaded areas depict genuine probabilities (e.g., the left panel of Fig. 64 shows the probability of observing a value that falls within one standard deviation of the mean).
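
As a rough illustration of the temperature example, the following Python sketch computes a few such areas. It assumes the standard-library class `statistics.NormalDist` (Python 3.8 or later), whose `cdf(x)` method gives the area under the curve to the left of `x`. The helper name `prob_between` is hypothetical.

```python
from statistics import NormalDist

# Tomorrow's maximum temperature, as assumed in the text: mean 23, sd 1.
temperature = NormalDist(mu=23, sigma=1)

def prob_between(dist, lo, hi):
    """Area under the curve between lo and hi, i.e. P(lo < X < hi)."""
    return dist.cdf(hi) - dist.cdf(lo)

# "23 degrees" in the everyday sense of "between 22.5 and 23.5".
p_about_23 = prob_between(temperature, 22.5, 23.5)

# A much wider range of temperatures.
p_20_to_30 = prob_between(temperature, 20, 30)

print(f"between 22.5 and 23.5: {p_about_23:.3f}")
print(f"between 20 and 30:     {p_20_to_30:.3f}")

# Shrinking the range around 23 drives the probability towards zero.
for half_width in (0.5, 0.05, 0.0005, 0.0):
    p = prob_between(temperature, 23 - half_width, 23 + half_width)
    print(f"23 +/- {half_width}: {p:.6f}")
```

The last loop mirrors the dart board argument: the narrower the range around 23, the smaller the probability, and a range of zero width has probability zero.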

That explains part of the story. I have explained a little bit about how continuous probability distributions should be interpreted (i.e., area under the curve is the key thing). But what does the formula for p(x) that I described earlier actually mean? Obviously, p(x) does not describe a probability, but what is it? The name for this quantity p(x) is a probability density, and in terms of the plots we have been drawing it corresponds to the height of the curve. The densities are not meaningful in and of themselves, but they are “rigged” to ensure that the area under the curve is always interpretable as a genuine probability. To be honest, that is about as much as you really need to know for now.[3]
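
One way to see that a density is not a probability is that it can be larger than 1. The sketch below is only an illustration, assuming Python's standard-library `statistics.NormalDist` (Python 3.8 or later). The narrow distribution with standard deviation 0.1 and the strip width of 0.01 are hypothetical values chosen to make the point visible.

```python
from statistics import NormalDist

# A hypothetical, very narrow distribution: mean 23, standard deviation 0.1.
narrow = NormalDist(mu=23, sigma=0.1)

# The density at the mean is the height of the curve at its peak.
height_at_mean = narrow.pdf(23)
print(height_at_mean > 1)   # a density may exceed 1; a probability cannot

# Height times a small width approximates the area of a thin strip
# under the curve, and that area is a genuine probability.
width = 0.01
strip_estimate = height_at_mean * width
strip_exact = narrow.cdf(23 + width / 2) - narrow.cdf(23 - width / 2)

print(f"height x width: {strip_estimate:.5f}")
print(f"exact area:     {strip_exact:.5f}")
```

The two printed values agree closely, which is the sense in which the heights are “rigged” so that areas come out as probabilities.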