Category: Blog

Scatterplots and enhancements

By , January 21, 2011 7:19 am

When you have two numeric variables and are interested in the relationship between them, the basic statistical graph is the scatterplot. Scatterplots can work well, but there are ways to enhance them, and there are alternatives that are better in some circumstances. Scatterplots can also be problematic in some situations, and there are ways to deal with those problems. In this post, I show SAS code to create a basic scatterplot and some enhanced versions.

How to go wrong with the mean

By , October 28, 2010 9:35 am

Some Things to Avoid

The average, or mean, is one of the simplest statistics there is. You have a bunch of numbers, you add them up, divide by how many there are, and …. That’s it! How could you go wrong with the average (mean)? Well…. It’s surprisingly easy to do so.

First, a good example, to set the stage. If you weigh 5 people (Peter, John, Mary, Ed and Sally) and they weigh 180, 190, 130, 186 and 100 pounds, you can just add that up, divide by 5 and you’ve got the average weight for those 5 people. That’s the mean.

Now, what can go wrong?

Averaging rates is a bad idea. For example, suppose I drive to work at a constant speed of 60 miles per hour and drive home at 40 miles per hour, over the same route. What’s my average speed? EASY! The mean of the two numbers is (60 + 40)/2 = 50. So, my average speed is 50, right? Wrong. Let’s say it’s 60 miles to work. Then the trip to work takes me 1 hour and the trip home takes me 1.5 hours, so the total time is 2.5 hours to drive 120 miles. 120/2.5 = 48, not 50.

Or let’s say that Bob is a professional baseball player. He bats .200 for the first half of the season, and .400 for the second half. So, his average for the whole season must be …. .300, right? WRONG. In fact, there is not enough information. You can’t find his overall average from the information given, because it depends on how many at bats he had in each half. For example, maybe in the first half he comes to bat 100 times and gets 20 hits; in the second half, he comes to bat 500 times and gets 200 hits. Then, for the season, he has 600 at bats and 220 hits, and his average is .367.
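Here is a minimal Python sketch of the two calculations above. The function names `average_speed` and `pooled_rate` are my own, made up for illustration. The idea in both cases is the same: add up the numerators and the denominators separately, then divide, rather than averaging the rates themselves.

```python
from statistics import harmonic_mean, mean


def average_speed(legs):
    """Overall speed for a trip given (distance_miles, speed_mph) legs.

    Hypothetical helper: total distance divided by total time.
    """
    total_distance = sum(distance for distance, _ in legs)
    total_time = sum(distance / speed for distance, speed in legs)
    return total_distance / total_time


def pooled_rate(parts):
    """Overall rate given (successes, attempts) pairs, e.g. (hits, at bats).

    Hypothetical helper: total successes divided by total attempts.
    """
    total_successes = sum(successes for successes, _ in parts)
    total_attempts = sum(attempts for _, attempts in parts)
    return total_successes / total_attempts


# The commute: 60 miles at 60 mph, then 60 miles at 40 mph.
commute = [(60, 60), (60, 40)]
print(mean([60, 40]))          # the naive answer
print(average_speed(commute))  # total distance / total time

# When every leg covers the same distance, this is the harmonic mean.
print(harmonic_mean([60, 40]))

# Bob's season: 20 hits in 100 at bats, then 200 hits in 500 at bats.
print(round(pooled_rate([(20, 100), (200, 500)]), 3))
```

Note that `pooled_rate` needs the counts, not just the two batting averages, which is exactly why .200 and .400 alone are not enough information.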

Averaging times is also problematic. Let’s say you want to find out the average time you went to bed in the last week, and you record: 10 PM, 10 PM, 11 PM, 1 AM, 2 AM, 10 PM and 10 PM. How to find the average? (10 + 10 + 11 + 1 + 2 + 10 + 10)/7 = 7.71? Huh? Between 7 and 8 o’clock? Maybe the problem is AM and PM. Let’s go to a 24 hour clock. (22 + 22 + 23 + 1 + 2 + 22 + 22)/7 = 16.29 … around 4 PM !?!?

The right way to solve this is to measure each time as (e.g.) hours past the previous noon. That gives 10, 10, 11, 13, 14, 10 and 10, and now the average is 11.14 hours past noon, or just about 11:10 PM. That makes sense.
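The hours-past-noon idea can be sketched in a few lines of Python. The helper names `hours_past_noon` and `mean_bedtime` are hypothetical, and the sketch assumes every bedtime falls between one noon and the next; it is not a general method for averaging clock times.

```python
def hours_past_noon(hour_24):
    """Convert an hour on the 24 hour clock to hours past the previous noon."""
    return (hour_24 - 12) % 24


def mean_bedtime(hours_24):
    """Average bedtime as an 'HH:MM' string on the 24 hour clock.

    Assumes all bedtimes lie in a single noon-to-noon window.
    """
    offsets = [hours_past_noon(h) for h in hours_24]
    mean_offset = sum(offsets) / len(offsets)
    # Work in whole minutes so rounding can never produce ':60'.
    minutes_past_noon = round(mean_offset * 60)
    minutes_past_midnight = (minutes_past_noon + 12 * 60) % (24 * 60)
    hours, minutes = divmod(minutes_past_midnight, 60)
    return f"{hours:02d}:{minutes:02d}"


# The week of bedtimes from the example, on the 24 hour clock.
bedtimes = [22, 22, 23, 1, 2, 22, 22]
print([hours_past_noon(h) for h in bedtimes])
print(mean_bedtime(bedtimes))
```

Shifting the origin to noon works here because nobody in the example goes to bed in the afternoon; if your times wrap around noon as well, you need a different reference point.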

If there are extreme values, often called outliers, then the mean can be, if not exactly wrong, then certainly misleading. If you are figuring the average height of a group of college students, and your sample happens to include the center on the basketball team, who is 7’2″ tall, then your average won’t be a very good representation of the real average height at your school.
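To see how much one outlier can move the mean, here is a small Python sketch. The five student heights are made-up values for illustration only; the 86 inches is the 7’2″ center. The median is shown as one common alternative that is less sensitive to extreme values, not as the only fix.

```python
from statistics import mean, median

# Hypothetical heights in inches for five students (illustrative values only).
students = [66, 67, 68, 69, 70]
center = 86  # 7 feet 2 inches

with_center = students + [center]

print(mean(students))       # mean without the outlier
print(mean(with_center))    # mean with the outlier
print(median(with_center))  # median with the outlier
```

With these numbers, a single tall player pulls the mean up by three inches, while the median moves by only half an inch.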

So, even with the mean, you can go wrong.

Multinomial and ordinal logistic regression using PROC LOGISTIC

By , October 4, 2010 10:32 am

This is a paper for NESUG (NorthEast SAS Users’ Group) 2010, which you can see as a PDF article: NESUG2010

Super simple macros to make a statistician’s life easier

By , September 4, 2010 6:46 pm

I will be presenting this at NESUG in November in Baltimore.

Macros can be a very complex topic, but some very simple macros can make life easier for a data analyst or statistician. I give a very basic introduction to macros from the perspective of a data analyst, and present some macros I have found useful. I include only certain types of macros, deliberately choosing the options I find easiest to understand and use. Again, this is a paper intended for statisticians and data analysts, not programmers. I am following the KISS principle: Keep It Simple, Statistician!

SAS tip: Why you always should use a RUN statement

By , July 18, 2010 5:23 pm

OK, lots of places will tell you that RUN statements make code look cleaner, but that invoking another PROC statement submits the previous PROC anyway. So…. it sounds like the RUN statement is a sort of esthetic extra.

But leaving it out can bite you.
