The simplest case is when there is only one IV, and it is continuous. In this case, we can make a scatterplot of the DV and the IV.
Above is a scatterplot of the heights and weights of a group of young adults in Brooklyn, New York. (It’s from a project I worked on, long ago).
It is traditional to put the IV along the X axis, and the DV along the Y axis. Just by looking, it is clear that there is some relationship between height and weight.
There are various ways to model this relationship, and these can be represented by various lines; see below.
OLS regression assumes that the relationship is linear; that is, it fits a straight line to represent the relationship. Algebra tells us that any straight line can be represented by an equation like

$y = a + bx$
Here, y is height, x is weight, and a and b are parameters which we attempt to estimate (hence simple linear regression, and regression generally, is a parametric method). Various lines might be fit to these points; we need a method to choose one of them or, in other words, to select a and b. Ideally, the points would lie exactly on the line, but there is no such line for these points. Any line will miss most of the points, so we need a way to say how badly a line misses them. The most common way is ordinary least squares (OLS), which measures the miss as the sum of the squared vertical distances from the line to the points and chooses the line that makes this sum as small as possible.
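To make the idea of "choosing a and b" concrete, here is a minimal sketch in Python. It assumes the line is written as y = a + bx, with a the intercept and b the slope. The five data points are invented for illustration and are not the Brooklyn sample; the function names are my own.

```python
# Hypothetical data, made up for illustration only.
weights = [50.0, 60.0, 70.0, 80.0, 90.0]       # x
heights = [155.0, 165.0, 170.0, 180.0, 185.0]  # y


def fit_line(x, y):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # intercept
    return a, b


def sum_squared_errors(x, y, a, b):
    """Sum of squared vertical distances from the points to the line."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


a, b = fit_line(weights, heights)
best = sum_squared_errors(weights, heights, a, b)
# Any other line, for example one shifted up by 1, misses by more.
other = sum_squared_errors(weights, heights, a + 1.0, b)
print(a, b, best, other)
```

The point of the comparison at the end is that the fitted line has a smaller sum of squared errors than a nearby alternative; that is all "least squares" means.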
When there is more than one IV, the method is quite similar, but instead of a scatterplot in two dimensions, we have to imagine a space with as many dimensions as there are variables, and then minimize the distances in that space. Fortunately, the computer takes care of all this, and gives us output. The only difference that need concern us is that now, if there are p IVs, the equation looks like $y = a + b_1x_1 + b_2x_2 + \dots + b_px_p$. That is, each of the IVs has an associated parameter.
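For readers who want to see what the computer is doing, here is a rough sketch of one way to fit several IVs at once: solve the so-called normal equations. This is only an illustration under my own assumptions; real statistical software uses more numerically careful methods. The data are hypothetical and were generated exactly from y = 2 + 3*x1 - 1*x2, so that we can check the answer. The names `solve` and `fit_ols` are mine.

```python
def solve(A, v):
    """Solve A z = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    M = [row[:] + [v[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - tail) / M[i][i]
    return z


def fit_ols(rows, y):
    """rows: one tuple of IV values per subject. Returns [a, b1, ..., bp]."""
    X = [[1.0] + list(r) for r in rows]  # leading 1 carries the intercept a
    k = len(X[0])
    XtX = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    Xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)


# Hypothetical data: y = 2 + 3*x1 - 1*x2 exactly, no noise.
rows = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8), (6, 4)]
y = [2 + 3 * x1 - 1 * x2 for x1, x2 in rows]
coefs = fit_ols(rows, y)
print(coefs)  # approximately [2, 3, -1], up to rounding error
```

With real data the points will not fall exactly on the fitted surface, but the recipe is the same: one parameter per IV, plus the intercept.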
How multiple linear regression controls for the effects of other variables
One interesting feature of multiple linear regression is that the effect of each IV is “controlled” for the other IVs. That is, the parameter for variable $x_1$ accounts for the effect of $x_1$ on $y$, assuming that $x_2$, $x_3$, and so on stay the same. If, for example, we were interested in people’s weights as effects of their age, sex, and height, then the resulting equation would show how men and women of a given age and height differ; how age is related to weight, if sex and height are kept constant; and how height is related to weight, if age and sex are kept constant.
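A small numerical sketch may help show what "controlled" means. The data below are entirely made up: eight people whose weight is exactly height minus 100, plus 5 for men (age is left out to keep it short). Because the men in this toy sample are also taller, the raw difference between the sexes is much bigger than the difference at a given height. The formulas are the standard closed-form least-squares solution for two IVs; the function names are my own.

```python
def centered_sum(u, v):
    """Sum of products of deviations from the means."""
    mu = sum(u) / len(u)
    mv = sum(v) / len(v)
    return sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))


def simple_slope(x, y):
    """Slope from a regression of y on a single IV."""
    return centered_sum(x, y) / centered_sum(x, x)


def two_iv_slopes(x1, x2, y):
    """Slopes (b1, b2) from a regression of y on two IVs together."""
    s11, s22 = centered_sum(x1, x1), centered_sum(x2, x2)
    s12 = centered_sum(x1, x2)
    s1y, s2y = centered_sum(x1, y), centered_sum(x2, y)
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


# Hypothetical data: four women followed by four men.
heights = [155.0, 160.0, 165.0, 170.0, 170.0, 175.0, 180.0, 185.0]
male = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
weights = [h - 100.0 + 5.0 * m for h, m in zip(heights, male)]

raw_gap = simple_slope(male, weights)               # sex alone
b_height, b_sex = two_iv_slopes(heights, male, weights)
print(raw_gap, b_height, b_sex)
```

In this toy sample the raw gap between men and women is 20, but once height is in the model the parameter for sex is 5: the difference between a man and a woman of the same height.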
Assumptions of multiple linear regression
Multiple linear regression (and simple linear regression as well) makes certain assumptions about the data.
1. Linearity
As discussed in the previous diary, the model assumes that the relationship between the DV and the IVs can be well-estimated by a straight line.
2. Normality of residuals.
Residuals are the vertical distances between the line and the points. Multiple linear regression assumes that these distances are normally distributed with a mean of 0.
3. Homoscedasticity and independence of residuals
Not only must the residuals be normally distributed, they must have equal variance (that’s called homoscedasticity) and they must not be related to the IVs.
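As a rough illustration of what looking at residuals involves, here is a sketch that fits a line to a few made-up points, computes the residuals, and compares their spread for small and large values of the IV. This is a crude stand-in for a residual plot, not a formal test, and the data and names are hypothetical. Note that for a least-squares line with an intercept the residuals average to zero automatically, so that part tells us nothing about whether the assumptions hold.

```python
# Hypothetical data: roughly y = 2x with small wiggles.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 14.2, 15.8]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
     / sum((xi - mean_x) ** 2 for xi in x))
a = mean_y - b * mean_x

# Residual = observed value minus the value on the fitted line.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
mean_residual = sum(residuals) / n  # zero by construction, up to rounding

# Crude check of equal variance: mean squared residual for the
# lower half of x versus the upper half (x is already sorted here).
half = n // 2
spread_low = sum(r ** 2 for r in residuals[:half]) / half
spread_high = sum(r ** 2 for r in residuals[half:]) / (n - half)
ratio = spread_high / spread_low
print(mean_residual, spread_low, spread_high, ratio)
```

A ratio far from 1 would suggest that the residuals fan out or narrow as the IV grows, which is worth following up with a proper plot of residuals against fitted values.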