PROC LOGISTIC: Reference coding and effect coding
Description of the problem with effect coding
When you have a categorical independent variable with more than 2 levels, you need to define it with a CLASS statement. In PROC GLM the default coding for this is dummy coding. In PROC LOGISTIC, it’s effect coding. To me, effect coding is quite unnatural.
Effect coding compares each level to the grand mean, and mirrors ANOVA coding; this seems natural to me in ANOVA, but very counterintuitive here. Reference (or dummy) coding compares each level to one “reference” level. I find this easier to understand, and it mirrors what I see in most reports of logistic regression results.
Example of the problem of effect coding
Continuing with the same example of modeling the probability of infection, suppose you now have race/ethnicity as an independent variable (IV) with six categories, as defined by the Census Bureau: White, Black/African American, Hispanic/Latino, American Indian/Alaskan Native, Native Hawaiian or other Pacific Islander, and Asian. You run
proc logistic data = today2;
class race;
model disease(event = '1') = race;
weight weight;
run;
and get parameter estimates that include (among much else):
| Parameter | df | Estimate |
| Intercept | 1 | -0.8527 |
| Race AIAN | 1 | -0.0636 |
| Race AfrA | 1 | -0.7568 |
| Race Asian | 1 | 0.1595 |
| Race Latino | 1 | 0.5650 |
| Race NHPI | 1 | 0.1565 |
and odds ratio (OR) estimates:
| Effect | Point estimate |
| race AIAN vs White | 1.000 |
| race AfrA vs White | 0.500 |
| race Asian vs White | 1.250 |
| race Lat vs White | 1.875 |
| race NHPI vs White | 1.250 |
but exponentiating the AIAN parameter estimate gives exp(-0.0636) ≈ 0.94, not 1.
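The mismatch can be traced with a short worked calculation from the estimates printed above. This is a sketch that assumes the default effect coding with White as the omitted (last) level, so that White's implied coefficient is minus the sum of the other five; the subscripted symbols are notation introduced here, not SAS output.

```latex
\begin{align*}
\beta_{\text{White}} &= -\left(\beta_{\text{AIAN}} + \beta_{\text{AfrA}} + \beta_{\text{Asian}} + \beta_{\text{Latino}} + \beta_{\text{NHPI}}\right) \\
                     &= -\left(-0.0636 - 0.7568 + 0.1595 + 0.5650 + 0.1565\right) = -0.0606 \\[4pt]
\mathrm{OR}_{\text{AIAN vs White}} &= \exp\left(\beta_{\text{AIAN}} - \beta_{\text{White}}\right)
                                    = \exp(-0.0636 + 0.0606) = \exp(-0.0030) \approx 0.997 \\
\mathrm{OR}_{\text{AfrA vs White}} &= \exp(-0.7568 + 0.0606) = \exp(-0.6962) \approx 0.498 \\[4pt]
\exp\left(\beta_{\text{AIAN}}\right) &= \exp(-0.0636) \approx 0.938
\end{align*}
```

The first two odds ratios agree with the reported 1.000 and 0.500 to about the precision of the printed estimates. The last line is the comparison of AIAN with the grand mean, which is why it does not match the "AIAN vs White" row.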
Evidence of effect coding problems
The design matrix. With the default, the design matrix looks like this:
| race | 1 | 2 | 3 | 4 | 5 |
| AIAN | 1 | 0 | 0 | 0 | 0 |
| AfrA | 0 | 1 | 0 | 0 | 0 |
| Asian | 0 | 0 | 1 | 0 | 0 |
| Latino | 0 | 0 | 0 | 1 | 0 |
| NHPI | 0 | 0 | 0 | 0 | 1 |
| White | -1 | -1 | -1 | -1 | -1 |
and each parameter estimates the difference between that level and the average over all six levels (the grand mean of the log odds), not the difference from White.
On the other hand, with dummy (or reference) coding, it looks like
| race | 1 | 2 | 3 | 4 | 5 |
| AIAN | 1 | 0 | 0 | 0 | 0 |
| AfrA | 0 | 1 | 0 | 0 | 0 |
| Asian | 0 | 0 | 1 | 0 | 0 |
| Latino | 0 | 0 | 0 | 1 | 0 |
| NHPI | 0 | 0 | 0 | 0 | 1 |
| White | 0 | 0 | 0 | 0 | 0 |
and each parameter estimates the difference between that level and the reference group (in this case, White).
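If you want to check which coding SAS actually used, PROC LOGISTIC prints a Class Level Information table with the design variables for each level. As a rough sketch, you can also write the design matrix to a data set with the OUTDESIGN= option; the output data set names design_effect and design_ref below are made up for illustration.

```sas
/* Write out the design matrix under the default (effect) coding.  */
/* OUTDESIGNONLY skips the model fit and only creates the data set. */
proc logistic data = today2 outdesign = design_effect outdesignonly;
  class race;                          /* default: param = effect */
  model disease(event = '1') = race;
run;

/* Same thing under reference coding, for comparison. */
proc logistic data = today2 outdesign = design_ref outdesignonly;
  class race / param = ref;
  model disease(event = '1') = race;
run;

/* Look at the first few rows of each design matrix. */
proc print data = design_effect (obs = 10);
run;

proc print data = design_ref (obs = 10);
run;
```

The WEIGHT statement is left out here because it does not change how the CLASS variable is coded.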
Solution to the effect coding problem in PROC LOGISTIC
Use the param = reference option on the class statement:
proc logistic data = today2;
class race/param = ref;
model disease(event = '1') = race;
weight weight;
run;
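By default the reference level is the last one in sorted order, which is White in this example. A small variation, offered as a sketch rather than a recommendation, is to name the reference level explicitly with the REF= option so that the choice does not depend on sort order:

```sas
/* Reference coding with the reference level named explicitly. */
proc logistic data = today2;
  class race (ref = 'White') / param = ref;   /* White = comparison group */
  model disease(event = '1') = race;
  weight weight;
run;
```

Replacing 'White' with another level (for example 'Latino') would make every parameter and odds ratio a comparison with that level instead.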
For more information (and other possible parameterizations), see the SAS documentation for PROC LOGISTIC, in particular the section CLASS variable parameterization in DETAILS.
Is your effect coding matrix right? I think you’ve missed the -1s.
Thanks Jeremy, you are right. I will fix that
I think you should explain the difference between the coding schemes. Effect (or trinary) coding compares each level to the grand mean instead of to a reference category and basically mirrors the ANOVA model.
Paul: Good point. I will add that.
What happens when you don’t specify Param=ref, and merely include the reference level in the CLASS statement? For example:
CLASS race (Ref="White");
This document: http://www.nesug.org/proceedings/nesug07/sa/sa11.pdf says that the reference level is then assigned -1s as coefficients. The odds ratios presented are all in reference to this level. Is the interpretation of these odds any different than the odds presented had the Param=ref option also been included in the CLASS statement?
Hi Jyoti
Yes, if you don’t include the PARAM = REF coding, you will get different results than if you do include it. That is
proc logistic data = today2;
class race (ref = "White");
model disease(event = '1') = race;
weight weight;
run;
gives different results for the parameter estimates than
proc logistic data = today2;
class race/param = ref;
model disease(event = '1') = race;
weight weight;
run;
but they give the same results for the odds ratios. In the former code, the ORs are not equal to exp(parameter).
Peter
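A sketch of why the two parameterizations give different parameters but the same odds ratios. The symbols are notation introduced here: α and β for the effect-coded fit, α* and γ for the reference-coded fit, p_i for the event probability in level i, and "ref" for the reference level.

```latex
\begin{align*}
\text{Effect coding:}    \quad & \operatorname{logit}(p_i) = \alpha + \beta_i \;\; (i \neq \text{ref}),
                                 \qquad \operatorname{logit}(p_{\text{ref}}) = \alpha - \sum_{j \neq \text{ref}} \beta_j \\
\text{Reference coding:} \quad & \operatorname{logit}(p_i) = \alpha^{*} + \gamma_i \;\; (i \neq \text{ref}),
                                 \qquad \operatorname{logit}(p_{\text{ref}}) = \alpha^{*} \\[4pt]
\text{Equating the two:} \quad & \alpha^{*} = \alpha - \sum_{j \neq \text{ref}} \beta_j,
                                 \qquad \gamma_i = \beta_i + \sum_{j \neq \text{ref}} \beta_j \\[4pt]
\mathrm{OR}_{i \text{ vs ref}} &= \exp(\gamma_i) = \exp\Bigl(\beta_i + \sum_{j \neq \text{ref}} \beta_j\Bigr) \neq \exp(\beta_i) \text{ in general}
\end{align*}
```

So the odds ratio against the reference level is exp(parameter) only under reference coding; under effect coding it has to be built from the whole set of parameters.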
Hello,
I really liked your post.
I would like to ask you something. When running a logistic regression where the dependent variable can take only two values, do the results change, in terms of the efficiency of the model and its interpretation, depending on whether the event is ‘A’ or ‘B’?
Thank you very much,
Claire
Hi Claire
The signs of the parameters change, and the OR of one is the inverse of the other, but the meaning, efficiency, etc. will be identical.
Peter
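The sign change follows from a one-line identity. In this sketch p is the probability of event ‘A’, so 1 − p is the probability of event ‘B’, and β is any slope in the model for ‘A’ (notation introduced here):

```latex
\begin{align*}
\operatorname{logit}(1 - p) &= \log\frac{1 - p}{p} = -\log\frac{p}{1 - p} = -\operatorname{logit}(p) \\
\beta^{(B)} &= -\beta^{(A)}, \qquad
\mathrm{OR}^{(B)} = \exp\bigl(-\beta^{(A)}\bigr) = \frac{1}{\mathrm{OR}^{(A)}}
\end{align*}
```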
Hi Peter,
Your post is very useful.
I have a query. I’m building a model using PROC LOGISTIC at the moment and I want to score each level of every characteristic without any reference level. Is there any way that I could get the result I want using SAS?
Thank you
Aida
You need to have SOME reference level for a categorical variable.
Peter
Hi Peter,
Great post, it’s very helpful.
I have a query. In a logistic model for data with more than one observation per individual, the SAS code I’m running is able to estimate the parameters for time-variant variables but fails for time-invariant ones, which are removed, according to the SAS notes, “because of its redundancy”. Is it possible to obtain those parameters? Which option should I call in the PROC LOGISTIC statement? Thank you.
Daniel