The simplest case is when there is only one IV, and it is continuous. In this case, we can make a scatterplot of the DV and the IV.
Above is a scatterplot of the heights and weights of a group of young adults in Brooklyn, New York. (It’s from a project I worked on long ago.)
It is traditional to put the IV along the X axis and the DV along the Y axis. Just by looking, it is clear that there is some relationship between height and weight.
There are various ways to model this relationship, and these can be represented by various lines; see below.
OLS regression assumes that the relationship is linear; that is, it fits a straight line to represent the relationship. Algebra tells us that any straight line can be represented as an equation like $y = a + bx$.
Here, y is height, x is weight, and a and b are parameters that we attempt to estimate (hence simple linear regression, and regression generally, is a parametric method). Various lines might be fit to these points; we need a method to choose one of them or, in other words, to select a and b. Ideally, the points would lie exactly on the line, but there is no such line for these points. Any line will miss most of the points, so we need a way to say how badly a line misses them. The most common way is ordinary least squares (OLS), which measures the miss by the sum of the squared vertical distances from the points to the line and chooses the a and b that make that sum as small as possible.
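To make the least squares idea concrete, here is a minimal sketch in plain Python. The five (x, y) pairs are made up for illustration (they are not the Brooklyn data), and the function names `sum_squared_errors` and `ols_fit` are my own. It uses the standard closed-form solution for one IV and then checks that a slightly different line misses the points by more.

```python
# Made-up data for illustration only (not the data in the scatterplot).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]


def sum_squared_errors(a, b, xs, ys):
    """Sum of squared vertical distances from the points to the line y = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(xs, ys))


def ols_fit(xs, ys):
    """Return the (a, b) that minimize the sum of squared errors for one IV."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(xs, ys))
    sxx = sum((xi - mean_x) ** 2 for xi in xs)
    b = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    a = mean_y - b * mean_x
    return a, b


a_hat, b_hat = ols_fit(x, y)
best = sum_squared_errors(a_hat, b_hat, x, y)
# A nearby line, chosen arbitrarily, for comparison.
other = sum_squared_errors(a_hat + 0.5, b_hat - 0.1, x, y)

print(f"a = {a_hat:.2f}, b = {b_hat:.2f}")
print(f"SSE of OLS line = {best:.3f}, SSE of other line = {other:.3f}")
```

Whatever other line you try in place of the arbitrary one, its sum of squared errors will not be smaller than that of the OLS line.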
When there is more than one IV, the method is quite similar, but instead of a scatterplot in two dimensions, we have to imagine a space with as many dimensions as there are variables, and then minimize the squared distances in that space. Fortunately, the computer takes care of all this and gives us output. The only difference that need concern us is that now, if there are p IVs, the equation looks like $y = a + b_1 x_1 + b_2 x_2 + \dots + b_p x_p$. That is, each of the IVs has an associated parameter.
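For the curious, here is a rough sketch of what "the computer takes care of" might look like with two IVs. It is written in plain Python so it runs without extra libraries; in practice you would use a statistics package. The helper names `solve` and `ols_multiple` are hypothetical, the data are made up with no noise so the true parameters (1, 2 and -3) are known, and solving the normal equations directly is only one of several ways software can do this.

```python
def solve(A, b):
    """Solve the linear system A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        z[i] = (M[i][n] - sum(M[i][j] * z[j] for j in range(i + 1, n))) / M[i][i]
    return z


def ols_multiple(rows, ys):
    """Least squares with p IVs. Returns [a, b_1, ..., b_p]."""
    X = [[1.0] + list(r) for r in rows]  # leading 1 is for the intercept a
    k = len(X[0])
    # Normal equations: (X'X) coefficients = X'y
    XtX = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    Xty = [sum(x[i] * yi for x, yi in zip(X, ys)) for i in range(k)]
    return solve(XtX, Xty)


# Made-up data: y is exactly 1 + 2*x1 - 3*x2, so we know what to expect.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
ys = [1 + 2 * x1 - 3 * x2 for x1, x2 in rows]

coefficients = ols_multiple(rows, ys)
print([round(c, 6) for c in coefficients])
```

Because the made-up data contain no noise, the fit recovers the intercept and both slopes; with real data the estimates would only approximate any underlying values.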
How multiple linear regression controls for the effects of other variables
One interesting feature of multiple linear regression is that the effect of each IV is “controlled” for the other IVs. That is, the parameter for variable $x_1$ accounts for the effect of $x_1$ on $y$, assuming that $x_2$, $x_3$, and so on stay the same. If, for example, we were interested in people’s weights as effects of their age, sex, and height, then the resulting equation would show how men and women of a given age and height differ; how age is related to weight if sex and height are kept constant; and how height is related to weight if age and sex are kept constant.
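A small sketch may help show what "controlling" does. The numbers below are invented (eight hypothetical people, with weight built as exactly -10 + 0.4 × height + 8 × male, and age left out to keep it short), and the variable names are mine. Because the men in this toy data are also taller, a regression of weight on height alone mixes the two effects; adding sex separates them. The two-IV estimates use the standard closed-form solution based on centered sums.

```python
# Invented data: heights in cm, male = 1 for men and 0 for women.
height = [155, 160, 165, 170, 170, 175, 180, 185]
male = [0, 0, 0, 0, 1, 1, 1, 1]
# Weight is constructed from a known rule so we can see what is recovered.
weight = [-10 + 0.4 * h + 8 * m for h, m in zip(height, male)]


def centered_cross_sum(u, v):
    """Sum of (u - mean(u)) * (v - mean(v)) over all observations."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return sum((ui - mean_u) * (vi - mean_v) for ui, vi in zip(u, v))


s_hh = centered_cross_sum(height, height)
s_mm = centered_cross_sum(male, male)
s_hm = centered_cross_sum(height, male)
s_hw = centered_cross_sum(height, weight)
s_mw = centered_cross_sum(male, weight)

# Simple regression: weight on height only (sex is ignored).
simple_slope = s_hw / s_hh

# Multiple regression: weight on height and sex together.
det = s_hh * s_mm - s_hm ** 2
b_height = (s_mm * s_hw - s_hm * s_mw) / det
b_male = (s_hh * s_mw - s_hm * s_hw) / det

print(f"height slope, ignoring sex:        {simple_slope:.3f}")
print(f"height slope, controlling for sex: {b_height:.3f}")
print(f"male vs. female at a given height: {b_male:.3f}")
```

In this toy example the height slope that ignores sex is inflated, because some of the difference between men and women is being credited to height.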
Assumptions of multiple linear regression
Multiple linear regression (and simple linear regression as well) makes certain assumptions about the data.
1. Linearity.
As discussed in the previous diary, the model assumes that the relationship between the DV and the IVs can be well-estimated by a straight line.
2. Normality of residuals.
The residuals are the distances between the line and the points. Multiple linear regression assumes that these distances are normally distributed with a mean of 0.
3. Homoscedasticity and independence of residuals
Not only must the residuals be normally distributed, they must also have equal variance (that’s called homoscedasticity), and they must not be related to the IVs.
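As a rough illustration of what looking at residuals involves, here is a sketch in plain Python. The data are simulated (my assumption, chosen so that the assumptions hold by construction), and the variance comparison is a crude informal check, not a formal test; in real work you would typically plot the residuals and use the diagnostics in your statistics package.

```python
import random

random.seed(0)  # fixed seed so the simulated data are reproducible

# Simulated data: a truly linear relationship with normally
# distributed errors of constant variance.
n = 200
x = [random.uniform(0, 10) for _ in range(n)]
y = [2.0 + 3.0 * xi + random.gauss(0, 1) for xi in x]

# Fit y = a + b*x by ordinary least squares.
mean_x = sum(x) / n
mean_y = sum(y) / n
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
b = sxy / sxx
a = mean_y - b * mean_x

# Residuals: observed value minus the value on the fitted line.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
mean_residual = sum(residuals) / n


def variance(values):
    """Sample variance."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


# Crude homoscedasticity check: compare the spread of the residuals
# for the lower half and the upper half of x.
order = sorted(range(n), key=lambda i: x[i])
low = [residuals[i] for i in order[: n // 2]]
high = [residuals[i] for i in order[n // 2:]]
variance_ratio = variance(high) / variance(low)

print(f"mean residual: {mean_residual:.2e}")
print(f"variance ratio (upper half / lower half): {variance_ratio:.2f}")
```

A variance ratio far from 1 would hint that the spread of the residuals changes with the IV. Note that when the model includes an intercept, the mean of the OLS residuals is zero by construction, so that part of the output is a check on the arithmetic and not on the data.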
This is a talk developed by David Cassell and me, and given at NESUG and SGF and WUSS