Dependent and independent data

By , February 21, 2010 1:41 pm

Often, when reading a statistics book, you will see some variation on the phrase “independent data”. Many models assume that the data are independent. Sometimes this appears as part of the abbreviation iid, which means independent and identically distributed.

You may get confused between this and the case of independent and dependent variables, which I discussed here. But the two ideas are quite different.
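To make "independent data" a little more concrete, here is a minimal Python sketch. It is only an illustration under assumptions of my own: the simulated series, the seed, the 0.9 carry-over coefficient and the helper name lag1_correlation are all hypothetical. It compares independent draws with a series in which each value depends on the previous one. A near-zero lag-1 correlation does not prove independence, but a large one is a clear sign of dependence.

```python
import random


def lag1_correlation(series):
    """Pearson correlation between each value and the value that follows it."""
    x = series[:-1]
    y = series[1:]
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(2010)  # fixed seed so the run is repeatable
n = 2000

# Independent data: each draw ignores every other draw.
independent = [rng.gauss(0, 1) for _ in range(n)]

# Dependent data: each value carries over 90% of the previous value
# (an assumed coefficient, chosen only to make the dependence obvious).
dependent = [0.0]
for _ in range(n - 1):
    dependent.append(0.9 * dependent[-1] + rng.gauss(0, 1))

iid_r = lag1_correlation(independent)
dep_r = lag1_correlation(dependent)
print("lag-1 correlation, independent draws:", round(iid_r, 3))
print("lag-1 correlation, dependent series: ", round(dep_r, 3))
```

Repeated measurements on the same person, or observations taken close together in time, tend to look like the second series rather than the first.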

How to ask a statistics question

By , February 20, 2010 2:37 pm

To get a good answer, you must write a good question. Answering a statistics question without context is like boxing blindfolded. You might knock your opponent out, or you might break your hand on the ring post.

What goes into a good question?

1. Tell us the PROBLEM you are trying to solve. That is, the substantive problem, not the statistical aspects.

2. Tell us what math and statistics you know. If you’ve had one course in Introductory Stat, then it won’t make sense for us to give you an answer full of mixed model theory and matrix algebra. On the other hand, if you’ve got several courses or lots of experience, then we can assume you know some basics.

3. Tell us what data you have, where it came from, what is missing, how many variables there are, which are the Dependent Variables (DVs) and Independent Variables (IVs), if any, and anything else we need to know about the data. Also tell us which statistical software you use, if any.

4. Are you thinking of hiring a consultant, or do you just want pointers in some direction?

5. THEN, and ONLY THEN tell us what you’ve tried, why you aren’t happy, and so on.

Quantile regression

By , February 19, 2010 11:18 am

In ordinary regression, we are interested in modeling the mean of a continuous dependent variable as a linear function of one or more independent variables.  This is often what we do, in fact, want, and this form of regression is extremely common.

Sometimes, though, we want something else. Sometimes the dependent variable isn’t continuous and we turn to logistic regression or some form of count regression. Sometimes the dependent variable is a time to an event, which may be censored, and we turn to survival analysis.

But sometimes even though the dependent variable is continuous, we are not interested in the mean, but in some other statistic about the population.  One such situation is when we want to model some quantile (also known as percentile) of the population.  That is, we might be interested not in what affects the mean, but in what affects (say) the 3rd quartile, or the 95th percentile, or some other percentile.
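One way to see what "modeling a quantile" means is through the loss function that quantile regression minimizes, usually called the check or pinball loss. The sketch below is a toy illustration in plain Python, not a full regression: it fits only a single constant to made-up data, and the function names and numbers are my own assumptions. Quantile regression proper replaces that constant with a linear function of the independent variables.

```python
def pinball_loss(residual, tau):
    """Check (pinball) loss for residual = observed - predicted at quantile level tau."""
    if residual >= 0:
        return tau * residual        # under-prediction is weighted by tau
    return (tau - 1) * residual      # over-prediction is weighted by 1 - tau


def total_loss(values, candidate, tau):
    """Sum of pinball losses when every value is predicted by one constant."""
    return sum(pinball_loss(y - candidate, tau) for y in values)


def best_constant(values, tau):
    """Constant minimizing the total pinball loss.

    A minimizer always lies at one of the data points, so it is enough
    to search over the observed values.
    """
    return min(sorted(set(values)), key=lambda c: total_loss(values, c, tau))


# Hypothetical data with one extreme value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100]

mean_value = sum(data) / len(data)
median_fit = best_constant(data, 0.5)   # tau = 0.5 targets the median
upper_fit = best_constant(data, 0.9)    # tau = 0.9 targets the 90th percentile

print("mean:", round(mean_value, 2))
print("tau = 0.5 fit:", median_fit)
print("tau = 0.9 fit:", upper_fit)
```

With tau = 0.5 the loss treats over- and under-prediction equally and the best constant is the median; raising tau penalizes under-prediction more heavily and pushes the fit toward the upper part of the distribution.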

When might we want this?

Suppose our dependent variable is bimodal or multimodal, that is, it has multiple “humps”. If we knew what caused the bimodality, we could separate on that variable and do a stratified analysis; if we don’t, quantile regression may be a good alternative, because it does not summarize the whole distribution with a single mean.

If our DV is highly skewed – as, for example, income is in many countries – we might be interested in what predicts the median (which is the 50th percentile) or some other quantile.
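A quick sketch of why the median can be the more useful target for skewed data. The incomes below are invented for illustration and do not come from any real survey; the code uses only Python's standard statistics module.

```python
import statistics

# Hypothetical annual incomes: most are modest, one is very large.
incomes = [22_000, 25_000, 28_000, 30_000, 32_000,
           35_000, 40_000, 48_000, 60_000, 1_500_000]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

print("mean income:  ", mean_income)
print("median income:", median_income)
```

The single very large income pulls the mean far above what a typical person in this made-up sample earns, while the median stays near the middle of the bulk of the data.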

One more example is where our substantive interest is in people at the highest or lowest quantiles. For example, if studying the spread of sexually transmitted diseases, we might record the number of sexual partners that a person had in a given time period. We might then be most interested in what predicts having a great many partners, since those people play a key role in spreading the disease.