Linear Regression
The simplest case is when there is only one IV, and it is continuous. In this case, we can make a scatterplot of the DV and the IV.
Above is a scatterplot of the heights and weights of a group of young adults in Brooklyn, New York. (It’s from a project I worked on, long ago).
It is traditional to put the IV along the X axis and the DV along the Y axis. Just by looking, it is clear that there is some relationship between height and weight.
There are various ways to model this relationship, and these can be represented by various lines; see below.
OLS regression assumes that the relationship is linear; that is, it fits a straight line to represent the relationship. Algebra tells us that any straight line can be represented by an equation like y = a + bx.
Here, y is weight, x is height, and a and b are parameters which we attempt to estimate (hence, simple linear regression, and regression generally, is a parametric method). Various lines might be fit to these points; we need a method to choose one of them, or, in other words, to select a and b. Ideally, the points would lie exactly on the line, but there is no such line for these points. Any line will miss most of the points, so we need a method to say how badly a line misses them. The most common is ordinary least squares (OLS), which uses the sum of the squared vertical distances from the points to the line.
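As a sketch of what OLS does, the closed-form estimates of a and b can be computed in a few lines of Python. The height and weight numbers here are made up for illustration; they are not the Brooklyn data.

```python
import numpy as np

# Hypothetical heights (inches) and weights (pounds); illustrative values only
height = np.array([62, 64, 66, 68, 70, 72, 74], dtype=float)
weight = np.array([120, 130, 142, 151, 165, 172, 185], dtype=float)

# Closed-form OLS estimates for y = a + b*x:
#   b = cov(x, y) / var(x),  a = mean(y) - b * mean(x)
b = np.cov(height, weight, bias=True)[0, 1] / np.var(height)
a = weight.mean() - b * height.mean()

# The sum of squared residuals -- the quantity OLS minimizes
ssr = np.sum((weight - (a + b * height)) ** 2)
```

No real line passes through all seven points, but this (a, b) pair makes the squared misses as small as possible.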
When there is more than one IV, the method is quite similar, but instead of a scatterplot in two dimensions, we have to imagine a space with as many dimensions as there are variables, and then minimize the distances in that space. Fortunately, the computer takes care of all this, and gives us output. The only difference that need concern us is that now, if there are p IVs, the equation looks like y = a + b1x1 + b2x2 + … + bpxp. That is, each of the IVs has an associated parameter.
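The computer's part of the job can be sketched as a least-squares solve. The data below are hypothetical (two made-up IVs predicting a made-up DV); the point is only the mechanics of estimating a, b1, and b2 at once.

```python
import numpy as np

# Hypothetical data: two IVs (height in inches, age in years) and a DV (weight)
X = np.array([[62, 25], [64, 30], [66, 35],
              [68, 40], [70, 45], [72, 50]], dtype=float)
y = np.array([120, 131, 140, 153, 164, 176], dtype=float)

# Prepend a column of ones so the intercept a is estimated along with b1, b2
X1 = np.column_stack([np.ones(len(X)), X])

# Least squares in p+1 dimensions -- this is the minimization the text describes
coefs, *_ = np.linalg.lstsq(X1, y, rcond=None)
a, b1, b2 = coefs
```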
How multiple linear regression controls for the effects of other variables
One interesting feature of multiple linear regression is that the effect of each IV is “controlled” for the other IVs. That is, the parameter for variable x1 accounts for the effect of x1 on y, assuming that x2, x3, and so on stay the same. If, for example, we were interested in people’s weights as effects of their age, sex, and height, then the resulting equation would show how men and women of a given age and height differ; how age is related to weight, if sex and height are kept constant; and how height is related to weight, if age and sex are kept constant.
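This “controlling” can be seen numerically. In the simulated (entirely made-up) data below, sex affects both height and weight, so the simple regression of weight on height absorbs part of sex’s effect, while the multiple regression recovers the height coefficient with sex held constant.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated data: sex (0/1) shifts both height and weight,
# so height and weight are confounded by sex
sex = rng.integers(0, 2, n)
height = 64 + 5 * sex + rng.normal(0, 2, n)
weight = 100 + 1.0 * height + 20 * sex + rng.normal(0, 5, n)

# Simple regression: the height slope picks up part of sex's effect
b_simple = np.polyfit(height, weight, 1)[0]

# Multiple regression: the height slope with sex kept constant
X = np.column_stack([np.ones(n), height, sex])
b_height = np.linalg.lstsq(X, weight, rcond=None)[0][1]
```

Here b_simple comes out noticeably larger than b_height, because in the simple model height is doing double duty as a stand-in for sex.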
Assumptions of multiple linear regression
Multiple linear regression (and simple linear regression as well) makes certain assumptions about the data.
1. Linearity. As discussed in the previous diary, the model assumes that the relationship between the DV and the IVs can be well estimated by a straight line.
2. Normality of residuals.
Residuals are the vertical distances between the points and the line. Multiple linear regression assumes that these distances are normally distributed with a mean of 0.
3. Homoscedasticity and independence of residuals
Not only must the residuals be normally distributed, they must have equal variance (that’s called homoscedasticity), and they must not be related to the IVs.
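A quick residual check can be sketched as follows. The data are again hypothetical; the comparison of residual spread in the lower and upper halves of x is only a rough, eyeball-style stand-in for a formal homoscedasticity test.

```python
import numpy as np

# Hypothetical data for a fitted simple regression; illustrative values only
x = np.array([62, 64, 66, 68, 70, 72, 74], dtype=float)
y = np.array([121, 129, 143, 150, 166, 171, 186], dtype=float)

b, a = np.polyfit(x, y, 1)            # slope, intercept
residuals = y - (a + b * x)

# OLS forces the residuals to average to (numerically) zero
mean_resid = residuals.mean()

# Rough homoscedasticity check: residual spread in the lower vs. upper half
# of x; a large imbalance would suggest unequal variance
lower_spread = residuals[x <= np.median(x)].std()
upper_spread = residuals[x > np.median(x)].std()
```

For serious work one would plot the residuals against the fitted values and against each IV, rather than rely on a two-number summary like this.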