Section author: Danielle J. Navarro and David R. Foxcroft

Assessing the reliability of a measurement¶

At this point we’ve thought a little bit about how to operationalise a theoretical construct and thereby create a psychological measure. And we’ve seen that by applying psychological measures we end up with variables, which can come in many different types. At this point, we should start discussing the obvious question: is the measurement any good? We’ll do this in terms of two related ideas: reliability and validity. Put simply, the reliability of a measure tells you how precisely you are measuring something, whereas the validity of a measure tells you how accurate the measure is. In this section I’ll talk about reliability; we’ll talk about validity in section Experimental and non-experimental research.

Reliability is actually a very simple concept. It refers to the repeatability or consistency of your measurement. The measurement of my weight by means of a “bathroom scale” is very reliable. If I step on and off the scales over and over again, it’ll keep giving me the same answer. Measuring my intelligence by means of “asking my mum” is very unreliable. Some days she tells me I’m a bit thick, and other days she tells me I’m a complete idiot. Notice that this concept of reliability is different to the question of whether the measurements are correct (the correctness of a measurement relates to it’s validity). If I’m holding a sack of potatos when I step on and off the bathroom scales the measurement will still be reliable: it will always give me the same answer. However, this highly reliable answer doesn’t match up to my true weight at all, therefore it’s wrong. In technical terms, this is a reliable but invalid measurement. Similarly, whilst my mum’s estimate of my intelligence is a bit unreliable, she might be right. Maybe I’m just not too bright, and so while her estimate of my intelligence fluctuates pretty wildly from day to day, it’s basically right. That would be an unreliable but valid measure. Of course, if my mum’s estimates are too unreliable it’s going to be very hard to figure out which one of her many claims about my intelligence is actually the right one. To some extent, then, a very unreliable measure tends to end up being invalid for practical purposes; so much so that many people would say that reliability is necessary (but not sufficient) to ensure validity.

Okay, now that we’re clear on the distinction between reliability and validity, let’s have a think about the different ways in which we might measure reliability:

Test-retest reliability. This relates to consistency over time. If we repeat the measurement at a later date do we get a the same answer?
Inter-rater reliability. This relates to consistency across people. If someone else repeats the measurement (e.g., someone else rates my intelligence) will they produce the same answer?
Parallel forms reliability. This relates to consistency across theoretically-equivalent measurements. If I use a different set of bathroom scales to measure my weight does it give the same answer?
Internal consistency reliability. If a measurement is constructed from lots of different parts that perform similar functions (e.g., a personality questionnaire result is added up across several questions) do the individual parts tend to give similar answers. We’ll look at this particular form of reliability later in the book, in section Internal consistency reliability analysis.

Not all measurements need to possess all forms of reliability. For instance, educational assessment can be thought of as a form of measurement. One of the subjects that I teach, Computational Cognitive Science, has an assessment structure that has a research component and an exam component (plus other things). The exam component is intended to measure something different from the research component, so the assessment as a whole has low internal consistency. However, within the exam there are several questions that are intended to (approximately) measure the same things, and those tend to produce similar outcomes. So the exam on its own has a fairly high internal consistency. Which is as it should be. You should only demand reliability in those situations where you want to be measuring the same thing!