When we have quantitative data, one thing we often want to know is where the center is. For that, we can look at the mean, median, mode, trimmed mean, and other measures. But we also want to know how spread out the numbers are. Are they all clustered near the center? Or are they all over the place? This can be very important. For example, if you were a 9th grade math teacher, you would be facing two very different classes if one had scores on a previous test like this:
9.1 9.0 8.9 8.9 9.1 9.0 9.1 8.9 9.0 8.9 9.2 8.8 8.8 9.1
And another had
10.0 8.0 9.0 10.2 7.8 9.0 9.0 7.2 10.8 10.0 8.0 7.5 10.5
Even though both have means right around 9.0.
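To make the comparison concrete, here is a small sketch in Python using only the standard library. The variable names are mine, and it uses the standard deviation (dividing by N) as the yardstick for spread, which is only one of several reasonable choices.

```python
from statistics import mean, pstdev

# Test scores for the two classes listed above
class_a = [9.1, 9.0, 8.9, 8.9, 9.1, 9.0, 9.1, 8.9, 9.0, 8.9, 9.2, 8.8, 8.8, 9.1]
class_b = [10.0, 8.0, 9.0, 10.2, 7.8, 9.0, 9.0, 7.2, 10.8, 10.0, 8.0, 7.5, 10.5]

mean_a, mean_b = mean(class_a), mean(class_b)
sd_a, sd_b = pstdev(class_a), pstdev(class_b)  # pstdev divides by N

print(f"first class:  mean = {mean_a:.2f}, standard deviation = {sd_a:.2f}")
print(f"second class: mean = {mean_b:.2f}, standard deviation = {sd_b:.2f}")
```

The two means agree to within a couple of hundredths, while the second class's standard deviation comes out roughly ten times as large as the first's.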
The most common measures of spread
There are several good measures of spread, and they are good in different situations, and for different purposes. Among them are the variance, the standard deviation, the range, and the interquartile range. Let’s look at each.
The variance is defined by the formula
$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2$$
where $\bar{x}$ is the mean and $N$ is the sample size.
In words, it is the average squared deviation from the mean.
The standard deviation is simply the square root of the variance.
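As a rough sketch of those two definitions, the Python below computes the variance as a literal average of squared deviations and then takes its square root. The function name variance_from_definition is my own. It assumes the divide-by-N reading of "average"; many packages divide by N - 1 by default when treating the data as a sample, and the last lines show both conventions from Python's statistics module for comparison.

```python
from math import sqrt
from statistics import pvariance, variance

def variance_from_definition(values):
    """Average squared deviation from the mean (divides by N)."""
    n = len(values)
    m = sum(values) / n
    return sum((x - m) ** 2 for x in values) / n

# Scores from the second class in the example above
scores = [10.0, 8.0, 9.0, 10.2, 7.8, 9.0, 9.0, 7.2, 10.8, 10.0, 8.0, 7.5, 10.5]

var_n = variance_from_definition(scores)
sd_n = sqrt(var_n)  # the standard deviation is the square root of the variance

print(f"variance (divide by N):     {var_n:.3f}")
print(f"standard deviation:         {sd_n:.3f}")
print(f"statistics.pvariance:       {pvariance(scores):.3f}")  # also divides by N
print(f"statistics.variance (N-1):  {variance(scores):.3f}")   # slightly larger
```

With 13 scores the two conventions differ by a factor of 13/12, so the choice matters little here, but it is worth knowing which one your software reports.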
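The range is the difference between the largest and smallest values, although it is often reported simply as that pair of values. The interquartile range (IQR) is the range from the 1st to the 3rd quartile, that is, from the number that is larger than 25% of the numbers to the number that is larger than 75% of them. Like the range, it can be reported either as the two quartiles or as the difference between them.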
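Both are easy to compute by hand or in code. The sketch below uses Python's statistics.quantiles on the second class's scores from the example above. One caveat: there are several conventions for locating quartiles in a small data set, so the "inclusive" method chosen here is an assumption, and other software may give slightly different quartiles for the same data.

```python
from statistics import quantiles

# Scores from the second class in the example above
scores = [10.0, 8.0, 9.0, 10.2, 7.8, 9.0, 9.0, 7.2, 10.8, 10.0, 8.0, 7.5, 10.5]

# Range: reported as the pair (min, max) or as their difference
lowest, highest = min(scores), max(scores)
data_range = highest - lowest

# Quartiles: n=4 yields the three cut points Q1, median, Q3
q1, q2, q3 = quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1

print(f"range: {lowest} to {highest} (width {data_range:.1f})")
print(f"interquartile range: {q1} to {q3} (width {iqr:.1f})")
```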
All of these can easily be calculated with any statistical package, such as R, SAS, Stata, or SPSS. It’s a little trickier to decide when each should be used. The variance is rarely used in descriptive statistics, because it is in different units than the values themselves. For example, if you have data on the weight of 100 people, the variance will not be in pounds, but in pounds squared. That’s not intuitive for most people. Variances, though, are essential in some other statistical procedures. Much more common is the standard deviation, which is in the same units as the data. The standard deviation is good when the distribution of the data is roughly symmetric, which usually shows up as the median being close to the mean. A general rule is that if the mean is meaningful, so is the standard deviation.
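One way to see the units issue is to convert the same weights from pounds to kilograms. In the sketch below the weights are hypothetical values made up for illustration, not real data. The standard deviation scales by the conversion factor, just as the weights themselves do, while the variance scales by the square of that factor.

```python
from math import isclose
from statistics import pstdev, pvariance

LB_TO_KG = 0.45359237  # kilograms per pound

# Hypothetical weights in pounds, invented for this illustration
weights_lb = [128.0, 142.0, 150.0, 155.0, 161.0, 168.0, 174.0, 180.0, 195.0]
weights_kg = [w * LB_TO_KG for w in weights_lb]

sd_lb, var_lb = pstdev(weights_lb), pvariance(weights_lb)
sd_kg, var_kg = pstdev(weights_kg), pvariance(weights_kg)

print(f"standard deviation: {sd_lb:.2f} lb   = {sd_kg:.2f} kg")
print(f"variance:           {var_lb:.2f} lb^2 = {var_kg:.2f} kg^2")

# The standard deviation converts like a weight; the variance does not
print(isclose(sd_kg, sd_lb * LB_TO_KG))
print(isclose(var_kg, var_lb * LB_TO_KG ** 2))
```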
The range is almost always a little bit useful … at least to check that it makes sense. For example, if you look at some data on the weights of American adults and find that one of your subjects has a weight of 2,000 pounds, you know there is a problem! But the range is usually not that useful, beyond that. The interquartile range is underused. When you report the median, you should probably report the interquartile range as well. The most common example is income, where the median is almost always reported, but the IQR rarely is.
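Here is a sketch of both points in Python. The weights are hypothetical, with a 2,000-pound entry standing in for a data-entry error, and the summarize helper is my own. It uses the "inclusive" quartile method from statistics.quantiles, which is one convention among several.

```python
from statistics import mean, median, pstdev, quantiles

def summarize(values):
    """Return a few measures of center and spread for a list of numbers."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "sd": pstdev(values),
        "median": median(values),
        "iqr": q3 - q1,
    }

# Hypothetical adult weights in pounds; the last value is a deliberate error
clean = [128.0, 142.0, 150.0, 155.0, 161.0, 168.0, 174.0, 180.0, 195.0]
with_error = clean + [2000.0]

for label, data in (("clean", clean), ("with error", with_error)):
    s = summarize(data)
    print(f"{label:>10}: min={s['min']:.0f} max={s['max']:.0f} "
          f"mean={s['mean']:.1f} sd={s['sd']:.1f} "
          f"median={s['median']:.1f} iqr={s['iqr']:.1f}")
```

The maximum flags the bad value immediately. The mean and standard deviation are thrown far off by it, while the median and the interquartile range barely move, which is part of why the two make a natural pair.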