Often, we want to see whether a variable is distributed normally. For example, we might want to check whether the residuals from a linear model are normally distributed.
There are two general ways to do this: numerically and graphically. Unfortunately, the numerical methods all lack an intuitive effect size, and their p-values (like all p-values) are affected by sample size. Among graphical methods, some recommend the histogram, but it is not great; there are better methods. First, let’s look at two distributions that are both bell shaped: one is normal and the other is a t distribution with 5 degrees of freedom (t5). We will look at cases with N = 10, 100 and 10,000.
One statistical test of normality is the Kolmogorov-Smirnov test. When we test t5 vs. normal, the p-value is 0.95 for N = 10, 0.21 for N = 100 and 0.000008 for N = 10,000. But the departure from normality is the same in each case; only the sample size has changed.
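To make the sample-size effect concrete, here is a minimal sketch using only the Python standard library. It is not the code behind the numbers above: the helper names (`t_sample`, `ks_statistic`, `ks_pvalue`) are made up for illustration, the test is assumed to be a one-sample test against the standard normal, and the p-value uses the asymptotic Kolmogorov series, which is only rough for small N. The exact values will differ from those quoted above.

```python
import math
import random
from statistics import NormalDist


def t_sample(df, n, rng):
    """Draw n values from a t distribution as Z / sqrt(chi-square / df)."""
    out = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))
        out.append(z / math.sqrt(chi2 / df))
    return out


def ks_statistic(sample, cdf):
    """Largest vertical gap between the empirical CDF and the reference CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def ks_pvalue(d, n, terms=100):
    """Asymptotic Kolmogorov p-value (an approximation, rough for small n)."""
    lam = math.sqrt(n) * d
    if lam < 0.2:
        return 1.0  # the series is numerically awkward here; p is essentially 1
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


rng = random.Random(1)  # arbitrary seed, chosen only for reproducibility
normal_cdf = NormalDist().cdf
for n in (10, 100, 10_000):
    d = ks_statistic(t_sample(5, n, rng), normal_cdf)
    print(f"N = {n:>6}: D = {d:.3f}, p ~ {ks_pvalue(d, n):.3g}")
```

Note that `ks_pvalue` depends on the data only through sqrt(N) times D, so holding the gap D fixed and increasing N always pushes the p-value down.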
Histograms with N = 10 make little sense. However, we can look at the histogram of the t5 distribution for N = 100:
Is that normal? Now, if we change the breaks and start point so that there are breaks every 2 units from -10 to 10 we get:
That certainly doesn’t look normal, but it’s the same data. And if we simply change the limits to -7 and 7, all of a sudden it looks pretty normal:
But if we make a lot of breaks, then it stops looking normal and even seems to have some pattern:
Similar problems occur if the sample size is large.
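The dependence on breaks can be seen without drawing anything, just by counting how many values land in each bin. The sketch below is an illustration under assumptions: `t_sample` and `bin_counts` are hypothetical helpers, the seed is arbitrary, the limits of -7 and 7 are treated as break limits (they may have been axis limits in the plots above), and the bin width of 0.25 is my stand-in for "a lot of breaks".

```python
import math
import random


def t_sample(df, n, rng):
    """Draw n values from a t distribution as Z / sqrt(chi-square / df)."""
    out = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))
        out.append(z / math.sqrt(chi2 / df))
    return out


def bin_counts(data, start, stop, width):
    """Count values in equal-width bins on [start, stop); values outside are dropped."""
    n_bins = round((stop - start) / width)
    counts = [0] * n_bins
    for x in data:
        if start <= x < stop:
            index = min(int((x - start) // width), n_bins - 1)
            counts[index] += 1
    return counts


rng = random.Random(1)  # arbitrary seed
data = t_sample(5, 100, rng)

# Same data, three different sets of breaks.
for start, stop, width in [(-10, 10, 2), (-7, 7, 2), (-7, 7, 0.25)]:
    counts = bin_counts(data, start, stop, width)
    print(f"breaks every {width} from {start} to {stop}:")
    print("  " + " ".join(str(c) for c in counts))
```

The three rows of counts describe the same 100 values, yet they suggest quite different shapes.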
Other people recommend overlaid density plots. Here is a graph of the t-distribution and the normal distribution with N = 100:
That is certainly more informative than the histograms, but, just as histograms have problems with bin width, density plots have problems with smoothness. If we make the plot less smooth we get:
where even the normal doesn’t look normal!
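The smoothness problem can be sketched the same way. Below is a hand-rolled Gaussian kernel density estimate, with the bandwidth playing the role of the smoothness setting. The function names, the two bandwidths (0.5 and 0.1) and the seed are assumptions for illustration, not the settings used for the plots above. Counting local maxima is a crude proxy for how bumpy the curve looks.

```python
import math
import random


def gaussian_kde(data, bandwidth):
    """Return a function that estimates the density with a Gaussian kernel."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2)
                          for xi in data)

    return density


def count_modes(density, lo, hi, steps=400):
    """Count local maxima of the estimate on an evenly spaced grid."""
    ys = [density(lo + (hi - lo) * i / steps) for i in range(steps + 1)]
    return sum(1 for i in range(1, steps) if ys[i - 1] < ys[i] > ys[i + 1])


rng = random.Random(1)  # arbitrary seed
normal = [rng.gauss(0.0, 1.0) for _ in range(100)]

# The same truly normal sample, smoothed more and then less.
for bw in (0.5, 0.1):
    modes = count_modes(gaussian_kde(normal, bw), -4.0, 4.0)
    print(f"bandwidth {bw}: {modes} local maxima")
```

A smaller bandwidth follows individual data points more closely, so the curve for a truly normal sample can pick up extra bumps.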
What to do? I suggest two possibilities. The one that is easier to understand is parallel box plots. The one that is most informative is the quantile normal plot. Here’s a bare-bones parallel box plot with N = 100.
(The normal distribution is on the right). The same plot with larger N looks quite similar. We can clearly see that the t5 distribution is much more spread out than the normal. This wasn’t so clear in the other plots.
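The numbers a box plot draws can also be computed directly, which makes the difference in spread easy to compare. This is a sketch under assumptions: `t_sample` and `box_stats` are illustrative helpers, the whiskers follow the common 1.5 × IQR convention (the plot above may have used a different rule), and the quartile method is one of several in use.

```python
import math
import random
from statistics import quantiles


def t_sample(df, n, rng):
    """Draw n values from a t distribution as Z / sqrt(chi-square / df)."""
    out = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))
        out.append(z / math.sqrt(chi2 / df))
    return out


def box_stats(data):
    """Quartiles plus the most extreme points within 1.5 * IQR of the box."""
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    inside = [x for x in data if q1 - 1.5 * iqr <= x <= q3 + 1.5 * iqr]
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "lo_whisker": min(inside),
        "hi_whisker": max(inside),
        "n_outliers": len(data) - len(inside),
    }


rng = random.Random(1)  # arbitrary seed
samples = {
    "t5": t_sample(5, 100, rng),
    "normal": [rng.gauss(0.0, 1.0) for _ in range(100)],
}
for name, data in samples.items():
    s = box_stats(data)
    print(f"{name:>6}: box [{s['q1']:.2f}, {s['q3']:.2f}], "
          f"median {s['median']:.2f}, "
          f"whiskers [{s['lo_whisker']:.2f}, {s['hi_whisker']:.2f}], "
          f"outliers {s['n_outliers']}")
```

Comparing the whisker ranges and the number of points beyond them is the numeric counterpart of comparing the two boxes side by side.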
The quantile normal plot for N = 100 looks like this:
If the t5 distribution were normal, all the points would be very close to the line. Reading these plots, though, takes a bit of practice.
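For readers who want to see what goes into such a plot, the points are just the sorted sample values paired with the matching quantiles of a standard normal. Here is a minimal sketch, assuming the plotting positions (i - 0.5) / N, which is one common choice among several. `t_sample` and `normal_qq_points` are hypothetical names, and a plotting library would be needed to actually draw the result.

```python
import math
import random
from statistics import NormalDist


def t_sample(df, n, rng):
    """Draw n values from a t distribution as Z / sqrt(chi-square / df)."""
    out = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))
        out.append(z / math.sqrt(chi2 / df))
    return out


def normal_qq_points(sample):
    """Pair each sorted value with the standard normal quantile at (i - 0.5) / n."""
    xs = sorted(sample)
    n = len(xs)
    std_normal = NormalDist()
    return [(std_normal.inv_cdf((i - 0.5) / n), x)
            for i, x in enumerate(xs, start=1)]


rng = random.Random(1)  # arbitrary seed
points = normal_qq_points(t_sample(5, 100, rng))

# Show both tails and the middle; the tails are where a t5 sample tends to stray.
for theo, obs in points[:3] + points[48:52] + points[-3:]:
    print(f"normal quantile {theo:+.2f}   sample value {obs:+.2f}")
```

When reading the output, compare the sample values with the normal quantiles in the tails: heavy tails show up as sample values that are more extreme than the corresponding normal quantiles.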