Suppose your dependent variable (DV) is a Likert scale or something similar. That is, it’s some sort of rating, from 1 to 5 or 1 to 7 or some such. And suppose you want to regress that on several independent variables. What should you do?
There are three broad categories of regression models that might be applicable. A lot of people routinely use linear regression (often simply called regression). Others routinely say this is incorrect, and that you should use ordinal logistic regression. Still others use multinomial logistic regression, or collapse the DV into two categories and then do binary logistic regression. Which is right?
The short answer to this is to quote Sir David Cox:
“There are no routine statistical questions, only questionable statistical routines.”
Let’s get more specific. Suppose you are a doctor studying back pain, and suppose your DV is response to a scale:
How much pain are you in on a typical day?
1 – None
2 – Barely noticeable
3 – Moderate
4 – Severe
5 – Excruciating
and your independent variables are things like age, sex, injury status, time since injury and so on.
If one is strict about it, linear regression requires a continuous DV, and we do not have one, at least as we’ve measured it, although it could be argued that there is a latent underlying variable here that is continuous. But you’d be hard pressed to prove that the difference between “none” and “barely noticeable” is the same as that between (say) “moderate” and “severe”. Technically, if you follow Stevens’s categories of nominal, ordinal, interval and ratio, your DV is ordinal, and should be analyzed with some form of ordinal logistic regression.
But the most common type (by far) of ordinal logistic regression is the proportional odds model, which assumes proportional odds: each independent variable has the same odds ratio at every cutpoint of the scale. That assumption might be violated, in which case you might want to use multinomial logistic regression.
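To make the proportional odds assumption concrete, here is a minimal Python sketch. It assumes one common parameterization, logit P(Y <= j) = cutpoint_j - linear predictor; the cutpoints, the coefficient beta_injury and the function names are all hypothetical, chosen only for illustration and not estimated from any data.

```python
import math


def category_probabilities(linear_predictor, cutpoints):
    """Return P(Y = 1), ..., P(Y = K) under a proportional odds model.

    Assumed parameterization: logit P(Y <= j) = cutpoints[j - 1] - linear_predictor.
    cutpoints must be strictly increasing; K = len(cutpoints) + 1.
    """
    def logistic(z):
        return 1.0 / (1.0 + math.exp(-z))

    # Cumulative probabilities P(Y <= j); the last category closes at 1.
    cumulative = [logistic(c - linear_predictor) for c in cutpoints] + [1.0]
    probs = []
    previous = 0.0
    for cum in cumulative:
        probs.append(cum - previous)
        previous = cum
    return probs


def cumulative_odds(probs):
    """Return the odds of Y <= j for j = 1, ..., K - 1."""
    odds = []
    running = 0.0
    for p in probs[:-1]:
        running += p
        odds.append(running / (1.0 - running))
    return odds


# Hypothetical values: four cutpoints for the five pain categories,
# and a made-up log odds ratio for "injured" vs. "not injured".
cutpoints = [-1.0, 0.5, 2.0, 3.5]
beta_injury = 0.8

p_not_injured = category_probabilities(0.0, cutpoints)
p_injured = category_probabilities(beta_injury, cutpoints)

# Under proportional odds, this ratio is the same at every cutpoint
# (it equals exp(beta_injury)).
ratios = [
    a / b
    for a, b in zip(cumulative_odds(p_not_injured), cumulative_odds(p_injured))
]
```

If the odds ratios estimated separately at each cutpoint differ a lot in real data, that is the violation the text refers to.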
Since those are relatively unusual methods, some people just collapse the categories into (say) “severe” or “excruciating” vs. anything less than that.
Which is right?
The great advantages of linear regression are its ease of interpretation and its familiarity. But it might be wrong.
Ordinal logistic is more likely to be correct, but is less well known and harder to understand.
Multinomial logistic is even harder to understand, and is a very complex model, with many parameters to estimate.
Collapsing the variable will only very rarely be correct. It throws away information, and that’s rarely a good thing to do.
So, here’s what I recommend:
Do ordinal logistic regression and test the assumptions. Then, if the assumptions are met, also do linear regression and compare the results by making a scatterplot of one set of predicted values vs. the other. If they are very similar (YOU decide; statistical analysis requires thought and judgment), then go with linear regression. If the assumptions are NOT met, then also do multinomial logistic regression and compare those two sets of results, opting for the simpler ordinal model if the results are very similar.
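The comparison step can be backed up with a couple of numbers alongside the scatterplot. The sketch below is one possible approach, not a prescribed one. It assumes you already have predicted values from a linear model and predicted category probabilities from an ordinal model; it turns the probabilities into an expected score on the 1 to 5 scale (my choice of "predicted value" for the ordinal model, which is an assumption) and then reports the correlation and the largest gap. All numbers are made up for illustration, and the function names are hypothetical.

```python
import math


def expected_score(probs):
    """Expected rating when probs[k] is the probability of category k + 1."""
    return sum((k + 1) * p for k, p in enumerate(probs))


def compare_predictions(pred_a, pred_b):
    """Return (Pearson correlation, largest absolute difference)."""
    n = len(pred_a)
    if n != len(pred_b) or n < 2:
        raise ValueError("need two equally long lists with at least 2 values")
    mean_a = sum(pred_a) / n
    mean_b = sum(pred_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(pred_a, pred_b))
    var_a = sum((a - mean_a) ** 2 for a in pred_a)
    var_b = sum((b - mean_b) ** 2 for b in pred_b)
    r = cov / math.sqrt(var_a * var_b)
    max_gap = max(abs(a - b) for a, b in zip(pred_a, pred_b))
    return r, max_gap


# Made-up predictions for four hypothetical patients.
linear_pred = [1.8, 2.4, 3.1, 3.9]
ordinal_probs = [
    [0.40, 0.40, 0.15, 0.04, 0.01],
    [0.20, 0.35, 0.30, 0.10, 0.05],
    [0.08, 0.22, 0.35, 0.25, 0.10],
    [0.03, 0.10, 0.22, 0.35, 0.30],
]
ordinal_pred = [expected_score(p) for p in ordinal_probs]

r, max_gap = compare_predictions(linear_pred, ordinal_pred)
print(f"correlation = {r:.3f}, largest gap = {max_gap:.2f} scale points")
```

These numbers do not replace looking at the plot or your own judgment about what counts as "very similar"; they only give you something concrete to report.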