Section author: Danielle J. Navarro and David R. Foxcroft

Box plots

Another alternative to histograms is the box plot, sometimes called a “box and whiskers” plot. Like histograms, box plots are best suited to interval or ratio scale data. The idea behind a box plot is to provide a simple visual depiction of the median, the interquartile range, and the range of the data. Because they do this in a fairly compact way, box plots have become a very popular statistical graphic, especially during the exploratory stage of data analysis when you’re trying to understand the data yourself. Let’s have a look at how they work, again using the afl.margins variable from the aflsmall_margins data set as our example.

Box plot of the ``afl.margins`` variable — Fig. 23 Box plot of the `afl.margins` variable from the `aflsmall_margins` data set plotted in jamovi

The easiest way to describe what a box plot looks like is just to draw one. Click on the Box plot check box and you will get the plot shown at the lower right of Fig. 23. jamovi has drawn the most basic box plot possible. When you look at this plot, this is how you should interpret it: the thick line in the middle of the box is the median; the box itself spans the range from the 25th percentile to the 75th percentile; and the “whiskers” extend to the most extreme data point that doesn’t exceed a certain bound. By default, this bound is set at 1.5 times the interquartile range (IQR) beyond the box, calculated as 25th percentile - (1.5 * IQR) for the lower boundary and 75th percentile + (1.5 * IQR) for the upper boundary. Any observation whose value falls outside this range is plotted as a circle or dot instead of being covered by the whiskers, and is commonly referred to as an outlier. For our AFL margins data there are two observations that fall outside this range, and these observations are plotted as dots (the upper boundary is 107, and looking over the data column in the spreadsheet there are two observations with values higher than this, 108 and 116, so these are the dots).
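If it helps to see the whisker rule as arithmetic, here is a minimal Python sketch of the calculation described above. It is not how jamovi is implemented: the function name `box_plot_summary` and the demo values are invented for illustration, and NumPy's default percentile interpolation is an assumption that may differ slightly from the quantile rule jamovi uses.

```python
import numpy as np

def box_plot_summary(values, k=1.5):
    """Compute the quantities a basic box plot displays (illustrative helper)."""
    x = np.asarray(values, dtype=float)
    # 25th, 50th and 75th percentiles (NumPy's default linear interpolation)
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    # Boundaries: 25th percentile - k*IQR and 75th percentile + k*IQR
    lower_bound = q1 - k * iqr
    upper_bound = q3 + k * iqr
    inside = x[(x >= lower_bound) & (x <= upper_bound)]
    outside = x[(x < lower_bound) | (x > upper_bound)]
    return {
        "median": float(median),
        "q1": float(q1),
        "q3": float(q3),
        "iqr": float(iqr),
        "lower_bound": float(lower_bound),
        "upper_bound": float(upper_bound),
        # Whiskers stop at the most extreme points still inside the bounds
        "whisker_low": float(inside.min()),
        "whisker_high": float(inside.max()),
        # Points beyond the bounds are the ones drawn as dots
        "outliers": sorted(outside.tolist()),
    }

# Made-up values for illustration only; these are NOT the afl.margins data
demo = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
summary = box_plot_summary(demo)
print(summary)
```

With these toy values the single point far above the rest lands beyond the upper boundary, so it would be drawn as a dot while the upper whisker stops at the largest remaining value.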

Violin plots

A variation on the traditional box plot is the violin plot. Violin plots are similar to box plots, except that they also show the kernel probability density of the data at different values. Typically, violin plots include a marker for the median of the data and a box indicating the interquartile range, as in standard box plots. In jamovi you can achieve this sort of functionality by checking both the Violin and the Box plot check boxes. See Fig. 24, which also has the Data check box turned on to show the actual data points on the plot. In my opinion, though, this makes the graph a bit too busy. Clarity is simplicity, so in practice it might be better to just use a simple box plot.

Violin plot of the ``afl.margins`` variable — Fig. 24 Violin plot of the `afl.margins` variable from the `aflsmall_margins` file plotted in jamovi, also showing a box plot and data points
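To give a rough feel for what the “kernel probability density” in a violin plot is, here is a small Python sketch of a kernel density estimate. The kernel and bandwidth jamovi uses are not described here, so the Gaussian kernel and the normal-reference rule-of-thumb bandwidth below are assumptions, and the function name `gaussian_kde_curve` and the demo values are invented for illustration.

```python
import numpy as np

def gaussian_kde_curve(values, grid, bandwidth=None):
    """Evaluate a Gaussian kernel density estimate on a grid (illustrative helper)."""
    x = np.asarray(values, dtype=float)
    grid = np.asarray(grid, dtype=float)
    if bandwidth is None:
        # Assumed bandwidth: a normal-reference rule of thumb, 1.06 * sd * n^(-1/5)
        bandwidth = 1.06 * x.std(ddof=1) * len(x) ** (-1 / 5)
    # One standardised distance per (grid point, observation) pair
    z = (grid[:, None] - x[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    # Average the bumps centred on each observation
    return kernels.sum(axis=1) / (len(x) * bandwidth)

# Made-up values for illustration only; these are NOT the afl.margins data
demo = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], dtype=float)
grid = np.linspace(-40, 80, 2401)
density = gaussian_kde_curve(demo, grid)

# A density should integrate to roughly 1 over a wide enough grid
area = float(np.sum(density) * (grid[1] - grid[0]))
print(round(area, 3))
```

A violin plot mirrors a curve like this on either side of a central axis, which is why the violin is fat where observations are dense and thin where they are sparse.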

Drawing multiple box plots

One last thing. What if you want to draw multiple box plots at once? Suppose, for instance, that I wanted separate box plots showing the AFL margins not just for 2010 but for every year between 1987 and 2010. To do that, the first thing we have to do is find the data. These are stored in the aflmarginbyyear data set. So let’s load it into jamovi and see what is in it. You will see that it is a pretty big data set. It contains 4296 games and the variables that we’re interested in. What we want is for jamovi to draw box plots for the margin variable, but plotted separately for each year. The way to do this is to move the year variable across into the Split by box, as in Fig. 25.

``Split by`` box — Fig. 25 jamovi screen shot showing the `Split by` box

The result is shown in Fig. 26. This version of the box plot, split by year, gives a sense of why it’s sometimes useful to choose box plots instead of histograms. It’s possible to get a good sense of what the data look like from year to year without getting overwhelmed with too much detail. Now imagine what would have happened if I’d tried to cram 24 histograms into this space: no chance at all that the reader is going to learn anything useful.

Multiple box plots: ``margin`` split by ``year`` from ``aflmarginbyyear`` — Fig. 26 Multiple box plots created in jamovi, for the variable `margin` split by `year` in the `aflmarginbyyear` data set
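Conceptually, splitting by a variable just means computing the box plot quantities separately within each group. The Python sketch below shows that idea on a handful of invented (year, margin) records; it is not the aflmarginbyyear data, the helper name `summarise_by_group` is hypothetical, and NumPy's default percentile rule is an assumption that may not match jamovi exactly.

```python
from collections import defaultdict

import numpy as np

# Invented (year, margin) records for illustration; NOT the aflmarginbyyear data
records = [
    (1987, 10), (1987, 40), (1987, 25),
    (1988, 5), (1988, 60), (1988, 33), (1988, 12),
]

def summarise_by_group(rows):
    """Group margins by year and summarise each group (illustrative helper)."""
    groups = defaultdict(list)
    for year, margin in rows:
        groups[year].append(margin)
    summaries = {}
    for year in sorted(groups):
        q1, median, q3 = np.percentile(groups[year], [25, 50, 75])
        summaries[year] = {
            "n": len(groups[year]),
            "q1": float(q1),
            "median": float(median),
            "q3": float(q3),
        }
    return summaries

by_year = summarise_by_group(records)
for year, stats in by_year.items():
    print(year, stats)
```

Each entry corresponds to one box in a split plot: one median line and one box per year, drawn side by side on a shared axis.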

Using box plots to detect outliers

Because the box plot automatically separates out those observations that lie outside a certain range, depicting them with a dot in jamovi, people often use them as an informal method for detecting outliers: observations that are “suspiciously” distant from the rest of the data. Here’s an example. Suppose that I’d drawn the box plot for the afl.margins variable and it came up looking like Fig. 27.

Box plot of the ``afl.margins`` variable with outliers — Fig. 27 Box plot of the `afl.margins` variable showing two very suspicious outliers

It’s pretty clear that something funny is going on with two of the observations. Apparently, there were two games in which the margin was over 300 points! That doesn’t sound right to me. Now that I’ve become suspicious it’s time to look a bit more closely at the data. In jamovi you can quickly find out which of these observations are suspicious and then you can go back to the raw data to see if there has been a mistake in data entry. To do this you need to set up a filter so that only those observations with values over a certain threshold are included. In our example, the threshold is over 300, so that is the filter we will create. First, click on the Filters button at the top of the jamovi window, and then type margin > 300 into the filter field, as in Fig. 28.

This filter creates a new column in the spreadsheet view where only those observations that pass the filter are included. One neat way to quickly identify which observations these are is to tell jamovi to produce a Frequency table (in the Exploration → Descriptives window) for the ID variable (which must be a nominal variable, otherwise the Frequency table is not produced). In Fig. 29 you can see that the ID values for the observations where the margin was over 300 are 14 and 134. These are suspicious cases, or observations, where you should go back to the original data source to find out what is going on.

Fig. 29 Frequency table for ID showing the ID numbers for the two suspicious outliers: 14 and 134
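The same filter-then-list logic is easy to express outside jamovi. Here is a minimal Python sketch, assuming each row carries an ID and a margin; the rows, their values, and the helper name `flag_suspicious` are all invented for illustration and are not taken from the data set discussed above.

```python
# Invented rows for illustration only; NOT the data set from the text
rows = [
    {"ID": 1, "margin": 56},
    {"ID": 2, "margin": 31},
    {"ID": 3, "margin": 333},
    {"ID": 4, "margin": 12},
    {"ID": 5, "margin": 301},
    {"ID": 6, "margin": 78},
]

THRESHOLD = 300  # same cut-off as the filter "margin > 300"

def flag_suspicious(rows, threshold):
    """Return the IDs of rows whose margin exceeds the threshold (illustrative helper)."""
    return [row["ID"] for row in rows if row["margin"] > threshold]

suspicious_ids = flag_suspicious(rows, THRESHOLD)
print(suspicious_ids)
```

The list of IDs plays the role of the Frequency table: it tells you which rows to look up in the original data source.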

Usually you find that someone has just typed in the wrong number. Whilst this might seem like a silly example, I should stress that this kind of thing actually happens a lot. Real world data sets are often riddled with stupid errors, especially when someone had to type something into a computer at some point. In fact, there’s actually a name for this phase of data analysis and in practice it can take up a huge chunk of our time: data cleaning. It involves searching for typing mistakes (“typos”), missing data and all sorts of other obnoxious errors in raw data files.

For less extreme values, even if they are flagged in a box plot as outliers, the decision about whether to include or exclude them in any analysis depends heavily on why you think the data look the way they do and what you want to use the data for. You really need to exercise good judgement here. If the outlier looks legitimate to you, then keep it. In any case, I’ll return to the topic again in section Model checking.