**Description of the problem with effect coding**

When you have a categorical independent variable with more than 2 levels, you need to define it with a CLASS statement. In PROC GLM the default coding for this is dummy coding. In PROC LOGISTIC, it’s effect coding. To me, effect coding is quite unnatural.

Effect coding compares each level to the grand mean (see my reply to Jennifer’s comment for more detail), and mirrors ANOVA coding; this seems natural to me in ANOVA, but very counterintuitive here. Reference (or dummy) coding compares each level to one “reference” level. I find this easier to understand, and it mirrors what I see in most reports of logistic regression results.

**Example of the problem of effect coding**

Continuing with the same example of modeling probability of infection, suppose you now have race/ethnicity as an IV, with 6 categories, as defined by the Census Bureau: White, Black/African American, Hispanic/Latino, American Indian/Alaskan Native, Native Hawaiian or other Pacific Islander, and Asian. You run

```sas
proc logistic data = today2;
  class race;
  model disease(event = '1') = race;
  weight weight;
run;
```

and get parameter estimates that include (among much else):

| Parameter | df | Estimate |
|---|---|---|
| Intercept | 1 | -0.8527 |
| Race AIAN | 1 | -0.0636 |
| Race AfrA | 1 | -0.7568 |
| Race Asian | 1 | 0.1595 |
| Race Latino | 1 | 0.5650 |
| Race NHPI | 1 | 0.1565 |

and OR estimates

| Effect | Point estimate |
|---|---|
| race AIAN vs White | 1.000 |
| race AfrA vs White | 0.500 |
| race Asian vs White | 1.250 |
| race Lat vs White | 1.875 |
| race NHPI vs White | 1.250 |

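but exp(-0.0636) is about 0.94, not the 1.000 reported for AIAN vs White. Under effect coding, the odds ratios are not simply exp(parameter estimate).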

**Evidence of effect coding problems**

The evidence is in the design matrix. With the default (effect coding), it looks like this:

| race | | | | | |
|---|---|---|---|---|---|
| AIAN | 1 | 0 | 0 | 0 | 0 |
| AfrA | 0 | 1 | 0 | 0 | 0 |
| Asian | 0 | 0 | 1 | 0 | 0 |
| Latino | 0 | 0 | 0 | 1 | 0 |
| NHPI | 0 | 0 | 0 | 0 | 1 |
| White | -1 | -1 | -1 | -1 | -1 |

and each parameter estimate is the difference between that level's log odds and the grand mean (the unweighted average of the log odds over all six levels), not a comparison with White.
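To see what this coding implies for the odds ratios, here is a worked calculation using the estimates from the example above. It assumes White is the level coded -1 in every column (as in the matrix) and uses only the rounded estimates shown, so the results are approximate. The $\hat\beta$ notation is introduced here for illustration; it does not appear in the SAS output.

$$
\hat\beta_{\text{White}} = -\sum_{k \neq \text{White}} \hat\beta_k
= -(-0.0636 - 0.7568 + 0.1595 + 0.5650 + 0.1565) = -0.0606
$$

$$
\text{OR}_{\text{AIAN vs White}} = \exp\!\left(\hat\beta_{\text{AIAN}} - \hat\beta_{\text{White}}\right)
= \exp(-0.0636 + 0.0606) = \exp(-0.0030) \approx 0.997
$$

$$
\text{OR}_{\text{AfrA vs White}} = \exp\!\left(\hat\beta_{\text{AfrA}} - \hat\beta_{\text{White}}\right)
= \exp(-0.7568 + 0.0606) = \exp(-0.6962) \approx 0.498
$$

These are close to the reported odds ratios of 1.000 and 0.500, whereas exponentiating the AIAN estimate on its own gives exp(-0.0636) ≈ 0.938.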

On the other hand, with dummy (or reference) coding, the design matrix looks like

| race | | | | | |
|---|---|---|---|---|---|
| AIAN | 1 | 0 | 0 | 0 | 0 |
| AfrA | 0 | 1 | 0 | 0 | 0 |
| Asian | 0 | 0 | 1 | 0 | 0 |
| Latino | 0 | 0 | 0 | 1 | 0 |
| NHPI | 0 | 0 | 0 | 0 | 1 |
| White | 0 | 0 | 0 | 0 | 0 |

and each parameter estimates the difference between that level and the reference group (in this case, White).
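Under reference coding the link between parameters and odds ratios is direct. A short derivation, writing $\beta_0$ for the intercept and $\beta_k$ for the coefficient of level $k$ (notation introduced here for illustration, not taken from the SAS output):

$$
\operatorname{logit} P(\text{disease} \mid k) = \beta_0 + \beta_k,
\qquad
\operatorname{logit} P(\text{disease} \mid \text{White}) = \beta_0
$$

$$
\text{OR}_{k \text{ vs White}} = \exp\!\big((\beta_0 + \beta_k) - \beta_0\big) = \exp(\beta_k)
$$

So each odds ratio against White is simply the exponentiated parameter estimate for that level.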

**Solution to the effect coding problem in PROC LOGISTIC**

Use the `param = ref` option (short for `param = reference`) on the CLASS statement:

```sas
proc logistic data = today2;
  class race / param = ref;
  model disease(event = '1') = race;
  weight weight;
run;
```
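By default the reference level is the last level in sort order, which here happens to be White. If you would rather not rely on that default, the `ref =` and `param =` options can be combined on the same CLASS statement. A minimal sketch, assuming the same data set and variable names used in this post:

```sas
/* Sketch only: today2, disease, race and weight are the names used in this post. */
proc logistic data = today2;
  /* ref = names the baseline level explicitly;          */
  /* param = ref requests reference (dummy) coding.      */
  class race (ref = "White") / param = ref;
  model disease(event = '1') = race;
  weight weight;
run;
```

Naming the level explicitly also keeps the comparison group stable if a new category that sorts after White is later added to the data.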

For more information (and other possible parameterizations), see the SAS documentation for PROC LOGISTIC, in particular the section “CLASS Variable Parameterization” under Details.


Is your effect coding matrix right? I think you’ve missed the -1s.

Thanks Jeremy, you are right. I will fix that

I think you should explain the difference between the coding schemes. Effect (or trinary) coding compares each level to the grand mean instead of to a reference category and basically mirrors the ANOVA model.

Paul: Good point. I will add that.


What happens when you don’t specify Param=ref, and merely include the reference level in the CLASS statement? For example:

`CLASS race (Ref="White");`

This document: http://www.nesug.org/proceedings/nesug07/sa/sa11.pdf says that the reference level is then assigned -1s as coefficients. The odds ratios presented are all in reference to this level. Is the interpretation of these odds any different than the odds presented had the Param=ref option also been included in the CLASS statement?

Hi Jyoti

Yes, if you don’t include the PARAM = REF coding, you will get different results than if you do include it. That is

```sas
proc logistic data = today2;
  class race (ref = "White");
  model disease(event = '1') = race;
  weight weight;
run;
```

gives different results for the parameter estimates than

```sas
proc logistic data = today2;
  class race / param = ref;
  model disease(event = '1') = race;
  weight weight;
run;
```

but they give the same results for the odds ratios. In the former code, the ORs are not equal to exp(parameter).

Peter

Hello,

I really liked your post.

I would like to ask you something. When running a logistic regression where the dependent variable can only take two values, does the result change depending on whether the event is ‘A’ or ‘B’, in terms of the efficiency of the model and its interpretation?

Thank you very much,

Claire

Hi Claire

The signs of the parameters change, and the OR of one is the inverse of the other, but the meaning, efficiency, etc. will be identical.

Peter

Hi Peter,

Your post is very useful.

I have a query. I’m building a model using PROC LOGISTIC at the moment and I want to score each level of every characteristic without any reference level. Is there any way that I could get the result I want using SAS?

Thank you

Aida

You need to have SOME reference level for a categorical variable.

Peter

Hi Peter,

Great post, it’s very helpful.

I have a query. In a logistic model for data with more than one observation per individual, the SAS code I’m running is able to estimate the parameters for time-variant variables but fails for the time-invariant ones, which are removed, according to the SAS notes, “because of its redundancy”. Is it possible to obtain those parameters, and which option should I call in the PROC LOGISTIC statement? Thank you.

Daniel

Are you sure that you need some reference level for the variable if you are effect coding? While it’s useful to compare categories to the grand mean, it seems problematic that an entire category has to be eliminated. At least with dummy coding, you can compare the other categories to the reference to get a sense of how the reference category impacts the outcome in relation to other categories, but with effect coding, you have no idea about the impact of the reference category whatsoever it seems. Is there another way to do this?

Hi Jennifer

Good point.

If you look at the SAS documentation, it says that the REF = option specifies the reference level for effect or dummy coding. Where I went a bit wrong is in my description of effect coding. It should be that, when effect coding is used, the constant is equal to the grand mean and the coefficient of each of the other levels is equal to the difference between the mean of the group coded 1 and the grand mean. This still seems a very unintuitive way to do things.