Linear Regression
The simplest case is when there is only one IV, and it is continuous. In this case, we can make a scatterplot of the DV and the IV.
Above is a scatterplot of the heights and weights of a group of young adults in Brooklyn, New York. (It's from a project I worked on long ago.)
It is traditional to put the IV along the X axis and the DV along the Y axis. Just by looking, it is clear that there is some relationship between height and weight.
There are various ways to model this relationship, and these can be represented by various lines; see below.
OLS regression assumes that the relationship is linear; that is, it fits a straight line to represent the relationship. Algebra tells us that any straight line can be represented by an equation like y = a + bx.
Here, y is height, x is weight, and a and b are parameters which we attempt to estimate (hence simple linear regression, and regression generally, is a parametric method). Various lines might be fit to these points; we need a method to choose one of them, or, in other words, to select a and b. Ideally, the points would lie exactly on the line, but there is no such line for these points. Any line will miss most of the points, so we need a way to say how badly a line misses them. The most common criterion is ordinary least squares (OLS), which chooses the line that minimizes the sum of the squared vertical distances from the points to the line.
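As a minimal sketch of how OLS picks a and b, the closed-form estimates can be computed directly with NumPy. (The numbers below are made up for illustration, not the Brooklyn data; x and y follow the assignment in the text, with weight as x and height as y.)

```python
import numpy as np

# Hypothetical data: x = weight in pounds (IV), y = height in inches (DV)
x = np.array([120.0, 136.0, 145.0, 158.0, 170.0, 180.0, 195.0])
y = np.array([62.0, 64.0, 66.0, 68.0, 70.0, 72.0, 74.0])

# Closed-form OLS estimates for the line y = a + b*x:
# b = cov(x, y) / var(x), and a makes the line pass through the means
b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
a = y.mean() - b * x.mean()

print(f"height = {a:.2f} + {b:.3f} * weight")
```

Any other choice of a and b would give a larger sum of squared distances; one consequence of the formulas is that the fitted line always passes through the point of means (mean of x, mean of y).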
When there is more than one IV, the method is quite similar, but instead of a scatterplot in two dimensions, we have to imagine a space with as many dimensions as there are variables, and then minimize the distances in that space. Fortunately, the computer takes care of all this and gives us output. The only difference that need concern us is that now, if there are p IVs, the equation looks like y = a + b1x1 + b2x2 + ... + bpxp. That is, each of the IVs has an associated parameter.
How multiple linear regression controls for the effects of other variables
One interesting feature of multiple linear regression is that the effect of each IV is “controlled” for the other IVs. That is, the parameter for x1 accounts for the effect of x1 on y, assuming that x2, x3, and so on stay the same. If, for example, we were interested in people’s weights as effects of their age, sex, and height, then the resulting equation would show how men and women of a given age and height differ; how age is related to weight, if sex and height are kept constant; and how height is related to weight, if age and sex are kept constant.
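A small simulation can show this controlling in action. Here weight is generated from age, sex, and height with known coefficients (all numbers hypothetical), and the model is fit with NumPy's least-squares routine rather than any particular statistics package. The fitted height coefficient recovers the true value of 3.5, even though height and sex are correlated, because age and sex are in the model:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical IVs: age in years, sex (0 = female, 1 = male), height in inches.
# Height depends on sex, so the IVs are correlated with each other.
age = rng.uniform(20, 60, n)
sex = rng.integers(0, 2, n).astype(float)
height = 64 + 5 * sex + rng.normal(0, 2, n)

# DV generated with known parameters: weight = -100 + 0.3*age + 10*sex + 3.5*height + noise
weight = -100 + 0.3 * age + 10 * sex + 3.5 * height + rng.normal(0, 5, n)

# Design matrix: a column of 1s for the intercept, then the three IVs
X = np.column_stack([np.ones(n), age, sex, height])
params, *_ = np.linalg.lstsq(X, weight, rcond=None)
intercept, b_age, b_sex, b_height = params

print(f"height coefficient (controlling for age and sex): {b_height:.2f}")
```

Each coefficient answers a "holding the other IVs constant" question; a simple regression of weight on height alone would give a different (larger) coefficient, because height would also be picking up part of the sex effect.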
Assumptions of multiple linear regression
Multiple linear regression (and simple linear regression as well) makes certain assumptions about the data.
1. Linearity. As discussed in the previous diary, the model assumes that the relationship between the DV and the IVs can be well estimated by a straight line.
2. Normality of residuals.
Residuals are the distances between the points and the line. Multiple linear regression assumes that these distances are normally distributed with a mean of 0.
3. Homoscedasticity and independence of residuals
Not only must the residuals be normally distributed, they must have equal variance across the range of the IVs (that's called homoscedasticity), and they must not be related to the IVs.
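A quick way to start checking these assumptions (sketched here with simulated data) is to fit the line, compute the residuals, and examine them. Two of the checks are automatic: with an intercept in the model, OLS forces the residuals to have mean zero and to be uncorrelated with the IV, so those serve only as sanity checks. Normality and homoscedasticity are usually assessed graphically, e.g. with a normal Q-Q plot of the residuals or a plot of residuals against fitted values:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data satisfying the assumptions: y = 2 + 3x + normal noise
x = rng.uniform(0, 10, 100)
y = 2 + 3 * x + rng.normal(0, 1, 100)

# Fit simple OLS, then compute the residuals
b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
a = y.mean() - b * x.mean()
residuals = y - (a + b * x)

# These two properties hold by construction for OLS with an intercept
print(f"residual mean: {residuals.mean():.2e}")
print(f"corr(residuals, x): {np.corrcoef(residuals, x)[0, 1]:.2e}")
```

To look for heteroscedasticity, plot the residuals against the fitted values and watch for a fan or funnel shape; to check normality, a histogram or Q-Q plot of the residuals does the job.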