In any form of regression model, we often think of the effects as *additive*. That is, we suppose that the effect of one variable can be added to the effect of another to get an accurate model. This is never strictly true, but how true is it? Is it true enough? How can we tell? Read more!

I will be giving a 4-hour course at SESUG in Savannah this fall.

The course is titled: Lies, damn lies and…. SAS to the Rescue!

It is designed for people who don’t know a lot of statistics but have to read statistics, interpret statistics and/or supervise statisticians and data analysts.

Cluster analysis is a set of methods for finding subjects (people, corporations, drugs, whatever) that “go together” in terms of some set of variables. There are a lot of different methods and it can be hard to know when you have good clusters. There are various statistical measures that attempt to do this, but they aren’t very intuitive.

Rather than use one of these, I prefer the following:

Do a lot of different clusterings. Look at the clusters from each and try to name them. Now, if your colleagues say “Yeah! That’s right!” to the naming scheme, you have a good clustering. If they say something like “well… I dunno… that doesn’t seem right, somehow,” then you still have work to do.
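As a sketch of that workflow in Python (the data, variable meanings, and choices of k below are made up purely for illustration), you might run k-means for several values of k and then inspect each cluster's size and means to see whether the clusters suggest names:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(0)
# Toy data: two made-up variables with three loosely separated groups
data = np.vstack([
    rng.normal([0, 0], 1.0, size=(50, 2)),
    rng.normal([5, 1], 1.0, size=(50, 2)),
    rng.normal([2, 6], 1.0, size=(50, 2)),
])

# Try several different clusterings; for each, look at the cluster
# sizes and means and ask whether you could name the clusters
for k in (2, 3, 4):
    centroids, labels = kmeans2(data, k, seed=1, minit="++")
    print(f"k = {k}")
    for i, c in enumerate(centroids):
        print(f"  cluster {i}: n = {np.sum(labels == i)}, mean = {np.round(c, 2)}")
```

If one value of k gives clusters your colleagues can name without hesitation, that is the informal "fit statistic" described above.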

One question that sometimes arises in doing statistical analysis is whether to use a sophisticated method that is (in one way or another) more appropriate than a more typical method. The reason for its appropriateness might be that the usual method violates assumptions (e.g. we should use robust regression rather than OLS regression in some cases), answers the question better (e.g. we might use quantile regression instead of OLS regression in some cases), or is more efficient.

But the reviewers and editors at a journal may not know of the new method and may have issues with it. It might even lead to the paper being rejected.

What are your thoughts on this?

Often, we want to see whether a variable is distributed normally. For example, we might want to check whether the residuals from a linear model are normally distributed.

There are two general ways to do this: numerically and graphically. Unfortunately, the numerical methods all suffer from not having an intuitive effect size and from having p-values that (like all p-values) are affected by sample size. Among graphical methods, some recommend the histogram, but it is not great – there are better methods. First, let’s look at two distributions that are both bell shaped, but one is normal and the other is a t distribution with 5 degrees of freedom. We will look at cases with N = 10, 100 and 10,000.

One statistical test of normality is the Kolmogorov-Smirnov test. When we test t5 vs. normal, the p-value is 0.95 for N = 10, 0.21 for N = 100 and 0.000008 for N = 10,000. But the departure from normality is the same in each case.
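This dependence on sample size is easy to reproduce (a Python sketch using scipy; the seed is arbitrary, so the exact p-values will differ from those quoted above, but the pattern is the same):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Draw from a t distribution with 5 df and test against the standard normal.
# The distribution never changes -- only N does.
p_values = {}
for n in (10, 100, 10_000):
    sample = rng.standard_t(df=5, size=n)
    stat, p = stats.kstest(sample, "norm")
    p_values[n] = p
    print(f"N = {n:>6}: KS p-value = {p:.6f}")
```

At N = 10,000 the test rejects decisively, even though the data are exactly as non-normal as at N = 10.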

Histograms with N = 10 make little sense. However, we can look at the histogram of the t5 distribution for N = 100:

Is that normal? Now, if we change the breaks and start point so that there are breaks every 2 units from -10 to 10 we get:

That certainly doesn’t look normal! But it’s the same data. But if we simply change the limits to -7 and 7, all of a sudden it looks pretty normal:

But if we make a **lot** of breaks, then it stops looking normal and even seems to have some pattern:
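The sensitivity to break points is easy to confirm numerically (a numpy sketch; the break choices mirror the ones described above, and the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_t(df=5, size=100)

# Same data, three different sets of breaks
coarse, _ = np.histogram(x, bins=np.arange(-10, 12, 2))  # every 2 units, -10 to 10
shifted, _ = np.histogram(x, bins=np.arange(-7, 9, 2))   # limits changed to -7 and 7
fine, _ = np.histogram(x, bins=50)                       # a lot of breaks

print("coarse counts: ", coarse)
print("shifted counts:", shifted)
print("fine counts:   ", fine)
```

The counts, and hence the visual shape of the histogram, change with the breaks even though the data do not.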

Similar problems occur if the sample size is large.

Other people recommend overlaid density plots. Here is a graph of the t5 distribution and the normal distribution with N = 100:

That is certainly more informative than the histograms, but, just as histograms have problems with bin width, density plots have problems with smoothness. If we make the plot less smooth we get:

where even the normal doesn’t look normal!
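The smoothness trade-off can be sketched with scipy's kernel density estimator, where the bandwidth (`bw_method`) plays the role that bin width plays for histograms (the bandwidth value below is chosen just to exaggerate the effect):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)
x = rng.standard_normal(100)  # genuinely normal data

grid = np.linspace(-5, 5, 200)
smooth = gaussian_kde(x)                 # default (Scott's rule) bandwidth
rough = gaussian_kde(x, bw_method=0.05)  # much less smoothing

# Both are density estimates of the same normal sample,
# but the under-smoothed one will look far from a normal curve
smooth_vals = smooth(grid)
rough_vals = rough(grid)
```

With too little smoothing, even truly normal data produces a bumpy curve that "doesn't look normal."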

What to do? I suggest two possibilities. The one that is easier to understand is parallel box plots. The one that is most informative is the quantile normal plot. Here’s a bare-bones parallel box plot with N = 100.

(The normal distribution is on the right). The same plot with larger N looks quite similar. We can clearly see that the t5 distribution is much more spread out than the normal. This wasn’t so clear in the other plots.
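The extra spread is easy to confirm from the five-number summary that a box plot draws (a numpy sketch; a larger sample than N = 100 is used here only so the quartiles are stable):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
t5 = rng.standard_t(df=5, size=n)
norm = rng.standard_normal(n)

def five_number(x):
    """Min, Q1, median, Q3, max -- the ingredients of a box plot."""
    return np.percentile(x, [0, 25, 50, 75, 100])

iqr_t5 = np.subtract(*np.percentile(t5, [75, 25]))
iqr_norm = np.subtract(*np.percentile(norm, [75, 25]))
print("t5 five-number summary:    ", np.round(five_number(t5), 2))
print("normal five-number summary:", np.round(five_number(norm), 2))
print(f"IQR: t5 = {iqr_t5:.2f}, normal = {iqr_norm:.2f}")
```

The t5 box is wider and its whiskers reach much further, which is exactly what the parallel box plots show at a glance.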

The quantile normal plot for N = 100 looks like this:

If the t5 distribution were normal, all the points would be very close to the line. Reading these plots, though, takes a bit of practice.
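In Python, `scipy.stats.probplot` computes the points and fitted line for a quantile-normal plot, and also returns a correlation r that summarizes how close the points lie to the line (a sketch of the computation only, without the plotting; the seed is arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 100
t5 = rng.standard_t(df=5, size=n)
norm = rng.standard_normal(n)

# probplot returns (theoretical quantiles, ordered sample values)
# plus the least-squares fit (slope, intercept, r)
(osm_t, osr_t), (slope_t, int_t, r_t) = stats.probplot(t5, dist="norm")
(osm_n, osr_n), (slope_n, int_n, r_n) = stats.probplot(norm, dist="norm")

print(f"normal sample: r = {r_n:.4f}")
print(f"t5 sample:     r = {r_t:.4f}")
```

For the t5 sample, the departures from the line show up mostly in the tails, which is where the t5 distribution differs from the normal.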