Section author: Danielle J. Navarro and David R. Foxcroft
Box plots
Another alternative to histograms is a box plot, sometimes called a “box and whiskers” plot. Like histograms, box plots are most suited to interval or ratio scale data. The idea behind a box plot is to provide a simple visual depiction of the median, the interquartile range, and the range of the data. And because they do so in a fairly compact way, box plots have become a very popular statistical graphic, especially during the exploratory stage of data analysis when you’re trying to understand the data yourself. Let’s have a look at how they work, again using the afl.margins variable from the aflsmall_margins data set as our example.
The easiest way to describe what a box plot looks like is just to draw one. Click on the Box plot check box and you will get the plot shown on the lower right of Fig. 23. jamovi has drawn the most basic box plot possible. When you look at this plot, this is how you should interpret it: the thick line in the middle of the box is the median; the box itself spans the range from the 25th percentile to the 75th percentile; and the “whiskers” go out to the most extreme data point that doesn’t exceed a certain bound. By default, this bound lies 1.5 times the interquartile range (IQR) beyond the edges of the box, calculated as 25th percentile - (1.5 * IQR) for the lower boundary and 75th percentile + (1.5 * IQR) for the upper boundary. Any observation whose value falls outside this range is plotted as a circle or dot instead of being covered by the whiskers, and is commonly referred to as an outlier. For our AFL margins data there are two observations that fall outside this range, and these observations are plotted as dots (the upper boundary is 107, and looking over the data column in the spreadsheet there are two observations with values higher than this, 108 and 116, so these are the dots).
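If you'd like to see the arithmetic behind the box and whiskers spelled out, here is a minimal sketch in Python. It is not how jamovi itself does it: the function name `box_plot_summary` and the data are made up for illustration, and the quartiles come from Python's `statistics.quantiles` with the "inclusive" rule, which may differ slightly from the percentile rule jamovi uses.

```python
import statistics

def box_plot_summary(values, k=1.5):
    """Return the quantities that a basic box plot displays."""
    data = sorted(values)
    # Quartiles via the "inclusive" interpolation rule (an assumption;
    # other software may use a slightly different percentile definition).
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_bound = q1 - k * iqr   # 25th percentile - (1.5 * IQR)
    upper_bound = q3 + k * iqr   # 75th percentile + (1.5 * IQR)
    # Whiskers stop at the most extreme data points inside the bounds.
    inside = [x for x in data if lower_bound <= x <= upper_bound]
    # Anything beyond the bounds is drawn as a separate dot.
    outliers = [x for x in data if x < lower_bound or x > upper_bound]
    return {
        "median": median, "q1": q1, "q3": q3, "iqr": iqr,
        "lower_bound": lower_bound, "upper_bound": upper_bound,
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outliers": outliers,
    }

# Made-up values for illustration only (not the AFL margins data).
toy = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
summary = box_plot_summary(toy)
print(summary)
```

Notice that the whiskers end at actual data points (here 1 and 9), not at the bounds themselves. The bounds only decide which observations count as dots.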
Violin plots
A variation on the traditional box plot is the violin plot. Violin plots are similar to box plots except that they also show the kernel probability density of the data at different values. Typically, violin plots include a marker for the median of the data and a box indicating the interquartile range, as in standard box plots. In jamovi you can achieve this sort of functionality by checking both the Violin and the Box plot check boxes. See Fig. 24, which also has the Data check box turned on so that the actual data points are shown on the plot. This does tend to make the graph a bit too busy though, in my opinion. Clarity is simplicity, so in practice it might be better to just use a simple box plot.
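To give a rough feel for what “kernel probability density” means, here is a small sketch of a Gaussian kernel density estimate in Python. Treat it as an illustration only: the kernel, the bandwidth of 1.0, the helper name `gaussian_kde` and the data are all my own assumptions, and I am not claiming this matches the settings jamovi uses when it draws a violin.

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a function that estimates the density of `data` at a point x."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Each observation contributes a small Gaussian bump centred on it.
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return density

# Made-up values for illustration only.
toy = [2, 3, 3, 4, 4, 4, 5, 5, 6, 12]
density = gaussian_kde(toy, bandwidth=1.0)

# Evaluate the density on a grid from 0.0 to 15.0. A violin plot mirrors
# this curve on both sides of a central axis.
grid = [i * 0.5 for i in range(31)]
half_widths = [density(x) for x in grid]
```

The violin is widest where the observations pile up (around 4 in this toy example) and pinches in where the data are sparse.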
Drawing multiple box plots
One last thing. What if you want to draw multiple box plots at once? Suppose, for instance, that I wanted separate box plots showing the AFL margins not just for 2010 but for every year between 1987 and 2010. To do that, the first thing we have to do is find the data. These are stored in the aflmarginbyyear data set. So let’s load it into jamovi and see what is in it. You will see that it is a pretty big data set. It contains 4296 games and the variables that we’re interested in. What we want is for jamovi to draw box plots for the margin variable, but plotted separately for each year. The way to do this is to move the year variable across into the Split by box, as in Fig. 25.
The result is shown in Fig. 26. This version of the box plot, split by year, gives a sense of why it’s sometimes useful to choose box plots instead of histograms. It’s possible to get a good sense of what the data look like from year to year without getting overwhelmed with too much detail. Now imagine what would have happened if I’d tried to cram 24 histograms into this space: no chance at all that the reader is going to learn anything useful.
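If it helps to think about what “split by” is doing behind the scenes, here is a sketch in Python: group the margins by year, then compute the box plot ingredients separately for each group. The records and the function name `summaries_by_group` are hypothetical, and the quartile rule is the same assumed "inclusive" method from Python's `statistics` module, not necessarily the one jamovi uses.

```python
import statistics
from collections import defaultdict

# Hypothetical (year, margin) records, for illustration only.
games = [
    (1987, 45), (1987, 12), (1987, 60), (1987, 33),
    (1988, 71), (1988, 5), (1988, 28), (1988, 90), (1988, 20),
]

def summaries_by_group(records):
    """Compute quartiles of the margin separately for each year."""
    groups = defaultdict(list)
    for year, margin in records:
        groups[year].append(margin)
    result = {}
    for year in sorted(groups):
        q1, median, q3 = statistics.quantiles(
            groups[year], n=4, method="inclusive"
        )
        result[year] = {"q1": q1, "median": median, "q3": q3}
    return result

by_year = summaries_by_group(games)
for year, stats in by_year.items():
    print(year, stats)
```

Each year ends up with its own median and box, which is exactly what gets drawn side by side in the split plot.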
Using box plots to detect outliers
Because the box plot automatically separates out those observations that lie
outside a certain range, depicting them with a dot in jamovi, people often use
them as an informal method for detecting outliers: observations that are
“suspiciously” distant from the rest of the data. Here’s an example. Suppose
that I’d drawn the box plot for the afl.margins
variable and it came up
looking like Fig. 27.
It’s pretty clear that
something funny is going on with two of the observations. Apparently,
there were two games in which the margin was over 300 points! That
doesn’t sound right to me. Now that I’ve become suspicious it’s time to
look a bit more closely at the data. In jamovi you can quickly find out
which of these observations are suspicious and then you can go back to
the raw data to see if there has been a mistake in data entry. To do
this you need to set up a filter so that only those observations with
values over a certain threshold are included. In our example, the
threshold is over 300, so that is the filter we will create. First,
click on the Filters
button at the top of the jamovi window, and then
type margin > 300
into the filter field, as in Fig. 28.
This filter creates a new column in the spreadsheet view where only those
observations that pass the filter are included. One neat way to quickly
identify which observations these are is to tell jamovi to produce a
Frequency table
(in the Exploration
→ Descriptives
window) for the
ID
variable (which must be a nominal variable otherwise the
Frequency table is not produced). In Fig. 29 you can see that the
ID values for the observations where the margin was over 300 are 14 and
134. These are suspicious cases, or observations, where you should go back
to the original data source to find out what is going on.
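The same filter-then-look-up logic can be sketched outside jamovi too. In the Python snippet below the column names `ID` and `margin` follow the text, but the rows themselves are invented for illustration (including the two margins over 300), so don't read anything into the actual numbers.

```python
# Hypothetical rows for illustration only; the margins are invented.
records = [
    {"ID": 13, "margin": 27},
    {"ID": 14, "margin": 333},
    {"ID": 15, "margin": 48},
    {"ID": 133, "margin": 9},
    {"ID": 134, "margin": 321},
    {"ID": 135, "margin": 56},
]

threshold = 300

# Equivalent of the filter `margin > 300`: keep only rows that pass.
suspicious = [row for row in records if row["margin"] > threshold]

# Equivalent of the frequency table for ID: which cases survived the filter?
suspicious_ids = [row["ID"] for row in suspicious]
print(suspicious_ids)
```

Once you have the IDs, the next step is the same as before: go back to the original data source and check those cases by hand.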
Usually you find that someone has just typed in the wrong number. Whilst this might seem like a silly example, I should stress that this kind of thing actually happens a lot. Real world data sets are often riddled with stupid errors, especially when someone had to type something into a computer at some point. In fact, there’s actually a name for this phase of data analysis and in practice it can take up a huge chunk of our time: data cleaning. It involves searching for typing mistakes (“typos”), missing data and all sorts of other obnoxious errors in raw data files.
For less extreme values, even if they are flagged in a box plot as outliers, the decision about whether to include or exclude them in any analysis depends heavily on why you think the data look the way they do and what you want to use the data for. You really need to exercise good judgement here. If the outlier looks legitimate to you, then keep it. In any case, I’ll return to the topic again in section Model checking.