What is multiple linear regression?

November 2, 2009 4:06 pm

Introduction to multiple linear regression

In an earlier article, we looked at simple linear regression, which involves one independent variable (IV) and one dependent variable (DV).
When there is more than one IV, the method is quite similar, but instead of a scatterplot in two dimensions, we have to imagine a space with as many dimensions as there are variables, and then minimize the squared distances between the observed and predicted values of the DV in that space. Fortunately, the computer takes care of all this and gives us output. The only difference that need concern us is that now, if there are p IVs, the equation looks like $y = b_0 + b_1x_1 + b_2x_2 + \dots + b_px_p$. That is, each of the IVs has an associated parameter.
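To make the fitting step concrete, here is a minimal sketch of how a computer might estimate the parameters by least squares. The data are simulated, and the variable names (`x1`, `x2`) and the "true" parameter values are assumptions chosen for illustration; they are not from any real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Two hypothetical IVs (simulated, not real data)
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# Simulated DV with known parameters b0 = 1.0, b1 = 2.0, b2 = -0.5, plus noise
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.1, size=n)

# Design matrix: a column of ones for the intercept, then one column per IV
X = np.column_stack([np.ones(n), x1, x2])

# Least squares: pick the parameters that minimize the sum of squared
# differences between observed y and predicted y
b, *_ = np.linalg.lstsq(X, y, rcond=None)

print(b)  # estimates of b0, b1, b2, close to the values used in the simulation
```

Statistical packages do essentially this and also report standard errors and tests, which this sketch leaves out.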

How multiple linear regression controls for the effects of other variables

One interesting feature of multiple linear regression is that the effect of each IV is “controlled” for the other IVs. That is, the parameter for variable $X_1$ accounts for the effect of $X_1$ on $Y$, assuming that $X_2$, $X_3$, and so on stay the same. If, for example, we were interested in people’s weights as a function of their age, sex, and height, then the resulting equation would show how men and women of a given age and height differ; how age is related to weight if sex and height are kept constant; and how height is related to weight if age and sex are kept constant.
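A small simulation can illustrate what "controlling" means. In the sketch below, all numbers are invented: women are made shorter on average, and weight is made to depend on height only, so sex has no direct effect. Regressing weight on sex alone shows a large difference, while adding height to the model shrinks the sex parameter toward zero. The helper `fit` is a hypothetical convenience function, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

female = rng.integers(0, 2, size=n)  # 1 = female, 0 = male
# Simulated heights in inches: women 5 inches shorter on average (invented)
height = 70.0 - 5.0 * female + rng.normal(scale=2.5, size=n)
# In this simulation weight depends on height only; sex has no direct effect
weight = -140.0 + 4.0 * height + rng.normal(scale=10.0, size=n)


def fit(*columns, y):
    """Least-squares fit of y on an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


b_simple = fit(female, y=weight)         # weight on sex alone
b_multi = fit(female, height, y=weight)  # weight on sex and height

print("sex parameter, sex only:        ", round(b_simple[1], 2))
print("sex parameter, height held fixed:", round(b_multi[1], 2))
print("height parameter:                ", round(b_multi[2], 2))
```

The first parameter mixes the effect of sex with the effect of height; the second answers the question of how men and women of the same height differ.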

Assumptions of multiple linear regression

Multiple linear regression (and simple linear regression as well) makes certain assumptions about the data.

1. Linearity

As discussed in the previous article, the model assumes that the relationship between the DV and each of the IVs can be well approximated by a straight line.

2. Normality of residuals

Residuals are the vertical distances between the observed points and the fitted line, that is, the observed values of the DV minus the predicted values. Multiple linear regression assumes that these distances are normally distributed with a mean of 0.

3. Homoscedasticity and independence of residuals

Not only must the residuals be normally distributed, they must have equal variance (that’s called homoscedasticity), and they must be independent of one another and unrelated to the IVs.
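These assumptions are usually checked by examining the residuals after fitting. The following is a rough numerical sketch on simulated data (all values invented); in practice such checks are more often done by looking at residual plots, and the thresholds a reader might apply to these numbers are a matter of judgment.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500

# Simulated data that satisfy the assumptions by construction
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 3.0 + 1.5 * x1 - 2.0 * x2 + rng.normal(scale=1.0, size=n)

X = np.column_stack([np.ones(n), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

fitted = X @ b
residuals = y - fitted

# With an intercept in the model, the residual mean is zero by construction,
# so this confirms the fit rather than testing an assumption
mean_resid = residuals.mean()

# Rough homoscedasticity check: compare the spread of the residuals
# for the lower and upper halves of the fitted values
order = np.argsort(fitted)
low = residuals[order[: n // 2]]
high = residuals[order[n // 2:]]
spread_ratio = high.std() / low.std()  # near 1 suggests equal variance

# Rough normality check: sample skewness, near 0 for a symmetric distribution
z = (residuals - mean_resid) / residuals.std()
skewness = np.mean(z ** 3)

print(mean_resid, spread_ratio, skewness)
```

A spread ratio far from 1, or strong skewness, would be a signal to look more closely at the residuals.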

Example of multiple linear regression

Here is what happened in an old dataset when I regressed weight on height, sex, age, marijuana use, cocaine or heroin use, crack use, and drug injection. This was part of a project on drug-using young adults in a neighborhood in Brooklyn, NY. Drug use was coded so that people could only be in one category: the ‘hardest’ drug used.

 WT = -139.74 + 4.11*Ht - 0.20*Fem + 1.31*age - 0.57*MJ - 9.66*C/H - 18.57*crack - 25.75*IDU

So, the predicted weight for a 20-year-old female who was 62 inches tall and used marijuana but no harder drug was

 -139.74 + 4.11*62 - 0.20 + 1.31*20 - 0.57 = 140.5 pounds

For a person with all the same characteristics who instead injected drugs, the predicted weight is

 -139.74 + 4.11*62 - 0.20 + 1.31*20 - 25.75 = 115.3 pounds
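The same arithmetic can be written as a small function. This is only a sketch: the parameters are copied from the fitted equation above, while the function name `predict_weight`, the category labels, and the treatment of "no drug use" as the reference category (contributing 0) are assumptions made for illustration.

```python
# Parameters copied from the fitted equation
INTERCEPT = -139.74
B_HEIGHT = 4.11   # per inch of height
B_FEMALE = -0.20  # female = 1, male = 0
B_AGE = 1.31      # per year of age

# Drug categories are mutually exclusive (the 'hardest' drug used).
# None stands for no drug use, assumed here to be the reference category.
B_DRUG = {None: 0.0, "MJ": -0.57, "C/H": -9.66, "crack": -18.57, "IDU": -25.75}


def predict_weight(height_in, female, age, drug=None):
    """Predicted weight in pounds from the fitted equation."""
    return (INTERCEPT
            + B_HEIGHT * height_in
            + B_FEMALE * female
            + B_AGE * age
            + B_DRUG[drug])


# The 20-year-old, 62-inch-tall female marijuana user from the text
print(round(predict_weight(62, 1, 20, "MJ"), 1))  # 140.5

# The same person under each drug category in turn
for drug in B_DRUG:
    print(drug, round(predict_weight(62, 1, 20, drug), 1))
```

Taking the drug category as a single argument, rather than as four separate 0/1 flags, keeps a person from being placed in more than one category, which matches how the data were coded.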
