Category: Basic Statistics

Some thoughts on observational studies

By , January 21, 2012 7:46 am

Note: This is a brief introduction to observational designs. For more on this type of study, see the books by Paul Rosenbaum: Observational Studies and, for a less mathematical approach, Design of Observational Studies.

In statistics and research design, there are two types of study: experiments and observational studies. Some people also use the term “quasi-experiment,” but I do not like it. In an experiment, the key thing is randomization: we assign subjects (e.g., people) to different conditions (e.g., drug and placebo) randomly. Often, though, such assignment is not possible or not ethical. In the social sciences, it is rarely possible. We cannot, for example, randomly assign people to different levels of education. We can only observe relationships between (say) education and political party.

How to go wrong with the mean

By , October 28, 2010 9:35 am

Some Things to Avoid

The average, or mean, is one of the simplest statistics there is. You have a bunch of numbers, you add them up, divide by how many there are, and …. That’s it! How could you go wrong with the average (mean)? Well…. It’s surprisingly easy to do so.

First, a good example, to set the stage. If you weigh 5 people (Peter, John, Mary, Ed and Sally) and they weigh 180, 190, 130, 186 and 100 pounds, you can just add that up, divide by 5 and you’ve got the average weight for those 5 people. That’s the mean.
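As a quick check of that arithmetic, here is a minimal sketch in Python. The variable names are just for illustration.

```python
# Weights in pounds for the five people in the example.
weights = [180, 190, 130, 186, 100]

# The mean: add them up, divide by how many there are.
mean_weight = sum(weights) / len(weights)
print(f"Mean weight: {mean_weight:.1f} pounds")
```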

Now, what can go wrong?

Averaging rates is a bad idea. For example, suppose I drive to work at a constant speed of 60 miles per hour and drive home at 40 miles per hour, over the same route. What’s my average speed? EASY! The mean of the two numbers is (60+40)/2 = 50. So, my average speed is 50, right? Wrong. Let’s say it’s 60 miles to work. Then the trip to work takes me 1 hour, the trip home takes me 1.5 hours, and the total time is 2.5 hours to drive 120 miles. 120/2.5 = 48, not 50.

Or let’s say that Bob is a professional baseball player. He bats .200 for the first half of the season, and .400 for the second half. So, his average for the whole season must be …. .300, right? WRONG. In fact, there is not enough information. You can’t find his overall average from the information given. For example, maybe in the first half he comes to bat 100 times and gets 20 hits; in the second half, he comes to bat 500 times and gets 200 hits. Then, for the season, he has 600 at bats and 220 hits, and his average is .367.
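Both examples can be checked with a few lines of Python. This is only a sketch using the numbers above. The use of `statistics.harmonic_mean` is my addition: when the two legs cover the same distance, the correct average speed happens to equal the harmonic mean of the two speeds.

```python
from statistics import harmonic_mean

# Driving example: same 60-mile route each way, at 60 mph and 40 mph.
distance_each_way = 60
speeds = [60, 40]

total_distance = distance_each_way * len(speeds)
total_time = sum(distance_each_way / s for s in speeds)  # hours
average_speed = total_distance / total_time

naive_speed = sum(speeds) / len(speeds)  # the tempting (wrong) answer
equal_distance_speed = harmonic_mean(speeds)  # agrees with average_speed

print(f"Naive mean of speeds: {naive_speed:.1f} mph")
print(f"Total distance / total time: {average_speed:.1f} mph")
print(f"Harmonic mean of speeds: {equal_distance_speed:.1f} mph")

# Batting example: the season average needs hits and at bats, not the two averages.
hits = [20, 200]
at_bats = [100, 500]

naive_batting = (hits[0] / at_bats[0] + hits[1] / at_bats[1]) / 2
season_batting = sum(hits) / sum(at_bats)

print(f"Naive mean of the two halves: {naive_batting:.3f}")
print(f"Season average: {season_batting:.3f}")
```

With different at-bat counts in each half, the same two half-season averages would give a different season average, which is why the averages alone are not enough information.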

Averaging times is also problematic. Let’s say you want to find the average time you went to bed in the last week, and you record: 10 PM, 10 PM, 11 PM, 1 AM, 2 AM, 10 PM and 10 PM. How to find the average? (10 + 10 + 11 + 1 + 2 + 10 + 10)/7 = 7.71? Huh? Between 7 and 8 o’clock? Maybe the problem is AM and PM. Let’s go to a 24-hour clock. (22 + 22 + 23 + 1 + 2 + 22 + 22)/7 = 16.29 … around 4 PM!?!?

The right way to solve this is to take (e.g.) hours past the previous noon. So 10, 10, 11, 13, 14, 10, 10, and now the average is 11.14 hours past noon, or just about 11:10 PM. That makes sense.
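Here is a small Python sketch of the hours-past-noon approach, using the bedtimes from the example. The helper `to_hhmm` is a hypothetical name of my own, used only to turn the result back into a clock time.

```python
# Bedtimes on a 24-hour clock: 10 PM, 10 PM, 11 PM, 1 AM, 2 AM, 10 PM, 10 PM.
bedtimes_24h = [22, 22, 23, 1, 2, 22, 22]

# Naive mean on the 24-hour clock lands in the middle of the afternoon.
naive_mean = sum(bedtimes_24h) / len(bedtimes_24h)

# Re-express each time as hours past the previous noon
# (1 AM is 13 hours past noon, 2 AM is 14).
hours_past_noon = [(h - 12) % 24 for h in bedtimes_24h]
mean_offset = sum(hours_past_noon) / len(hours_past_noon)

# Convert the mean offset back to a 24-hour clock time.
mean_clock = (mean_offset + 12) % 24


def to_hhmm(hours):
    """Format fractional hours on a 24-hour clock as HH:MM (illustrative helper)."""
    total_minutes = round(hours * 60) % (24 * 60)
    return f"{total_minutes // 60:02d}:{total_minutes % 60:02d}"


print(f"Naive 24-hour mean: {naive_mean:.2f}")
print(f"Hours past noon: {hours_past_noon}")
print(f"Mean hours past noon: {mean_offset:.2f}")
print(f"Average bedtime: {to_hhmm(mean_clock)}")
```

Noon works as the reference point here only because none of the bedtimes falls near noon. For times spread around the whole clock, a different reference point (or a different method) would be needed.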

If there are extreme values, often called outliers, then the mean can be, if not exactly wrong, then certainly misleading. If you are figuring the average height of a group of college students, and your sample happens to include the center on the basketball team, who is 7’2″ tall, then your average won’t be a very good representation of the real average height at your school.
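To see how much one extreme value can move the mean, here is a sketch with made-up heights. The five student heights are invented for illustration; only the 7’2″ (86 inch) center comes from the example. The median is shown purely as a point of comparison, since it barely moves.

```python
from statistics import mean, median

# Hypothetical heights in inches for five students (invented numbers).
students = [66, 67, 68, 69, 70]

# The basketball center from the example: 7'2" = 86 inches.
center = 7 * 12 + 2
with_center = students + [center]

print(f"Mean without the center: {mean(students)}")
print(f"Mean with the center: {mean(with_center)}")
print(f"Median without the center: {median(students)}")
print(f"Median with the center: {median(with_center)}")
```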

So, even with the mean, you can go wrong.

Why grant writers need statisticians

By , April 23, 2010 1:35 pm

There are many reasons to write a grant, and many places to apply for one – from small grants for a few thousand dollars, to multi-year grants for many millions of dollars.  If your grant involves any sort of data analysis or data collection, even something very simple, it can be worth your while to consult with a statistician.  It is better to consult early in the process.  Although consulting costs money in the short term, it can save you a lot of time and money in the long term, and can improve your chances of getting a grant.

Some ways a statistician can help a grant writer -

1) Finding instruments – not all statisticians can do this, but many (including myself) can.  There is a huge array of psychological instruments out there.

2) Making data collection appropriate – when people come to me with data, it’s often collected in ways that make it hard to analyze.  Then I spend hours manipulating the data into the proper format.  If they had come to me before starting, it would have taken me a lot less time to show them a better way.

3) Power analysis.  Many federal agencies, such as the National Institutes of Health, actually require a power analysis.  Even if you aren’t required to do one, it can be very helpful to do so – to see how many subjects you will need to detect various effects.

4) Analysis plan.  If you come to the statistician (such as me) early, then he or she or I can help you answer the questions you want to ask, rather than the questions that the statistical techniques you are familiar with can answer.  There is a wide range of statistical techniques out there, and it’s better to let the substantive questions drive the analysis than the other way round.  A good carpenter has a big set of tools; but if you are not a carpenter, you may only have a few.

5) Doing the actual analysis – Once you get your grant and start collecting data, you’ll want to analyze it. A good statistician can do it accurately and quickly, and show you the results in ways you understand.
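To give a flavor of what a power analysis (item 3) produces, here is a rough Python sketch. It assumes the simplest possible case, which is my assumption and not a general recipe: a two-sided comparison of two group means, a standardized effect size, and a normal approximation. The function name `n_per_group` is hypothetical. A real power analysis depends on the actual design and planned analysis.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate subjects needed per group (normal approximation).

    effect_size is the difference in means divided by the common
    standard deviation. This is an illustrative sketch only.
    """
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # two-sided test
    z_power = standard_normal.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# Smaller effects need many more subjects to detect.
for d in (0.2, 0.5, 0.8):
    print(f"Effect size {d}: about {n_per_group(d)} subjects per group")
```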

p-values and modus tollens

By , April 14, 2010 1:29 pm

Modus tollens in logic
In logic, there is an argument style called modus tollens:

If  H0 then R
Not R
therefore
Not H0

This is a valid argument.

Modus tollens misapplied to p-values
Some people mis-apply this to p-values, saying:

If H0 then probably R
Not R
therefore
Probably not H0

This is not valid.
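One way to see why the probabilistic version fails is a small calculation with made-up numbers. All three probabilities below are invented for illustration, and the alternative hypothesis H1 is my own addition. The point is only that a result which is improbable under H0 does not, by itself, make H0 improbable.

```python
# All three numbers are invented for illustration.
prior_h0 = 0.90         # how plausible H0 was before seeing the data
p_data_given_h0 = 0.05  # the observed result is unlikely if H0 is true...
p_data_given_h1 = 0.10  # ...but not very likely under the alternative either

prior_h1 = 1 - prior_h0

# Bayes' rule: probability of H0 after seeing the result.
posterior_h0 = (prior_h0 * p_data_given_h0) / (
    prior_h0 * p_data_given_h0 + prior_h1 * p_data_given_h1
)

print(f"P(data | H0) = {p_data_given_h0:.2f}")
print(f"P(H0 | data) = {posterior_h0:.2f}")
```

With these numbers, H0 is still more likely than not after seeing a result that had only a 5% chance under H0.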

My own rules of data analysis

By , April 14, 2010 1:04 pm

The answer you get depends on the question you ask

In many substantive fields, students take one, two, or perhaps three statistics courses during graduate school.  These typically cover things such as descriptive statistics, ANOVA, regression, and perhaps a couple of variants of regression such as logistic regression.  These are good tools for many purposes, but it’s a very limited toolbox.  This limits the number of questions you can ask.  Perhaps the really interesting substantive question is one that you can’t answer with those methods.  But if you ask a statistician or data analyst, you may find that the right method to answer your question does exist.

You can’t see something you’re not looking for

The more specific your question, the better you will be able to answer it; but if it’s too specific, you may miss something else.  Researchers need to learn to adapt the focus of their investigations.

If you’re not surprised, you haven’t learned anything (well, not much, anyway)

Isaac Asimov once said “The most exciting phrase to hear in science, the one that heralds new discoveries, is not ‘Eureka!’ (I found it!) but ‘That’s funny …’”. That is, surprising.  It’s fine to confirm what you already suspected, but the real advances are made when you find things you did not expect.

Any analysis worth doing can be done in more than one way

This gets back to the toolbox: which method should I use? But even within a method, there are often options.  Should I transform variables?  Which covariates should I include?  How complex should my model be? What effect sizes should I report?

Often, these and other related questions do not have simple answers, but rather a range of reasonable choices.