Regression refers to a collection of techniques for modeling one variable (the dependent variable or DV) as a function of some other variables (the independent variables or IVs). Different regression techniques should be applied for different types of DVs. If the DV is a dichotomy (like living vs. dead), then the most common method is logistic regression. If the DV has multiple categories, then the usual method is multinomial logistic regression when the categories are unordered (e.g. Republican, Democrat, Independent) or ordinal logistic regression when they are ordered. If the DV is a count (such as the number of times something happens), then there are Poisson regression and negative binomial regression. If the DV is a time to an event (such as time to death), then there is a range of techniques known as survival analysis. There are other varieties too. But the most common type of DV is one that is continuous, or nearly so, such as weight, IQ, income, and so on; for these, the usual starting point is linear regression.
What is simple linear regression?
The simplest case is when there is only one IV, and it is continuous. In this case, we can make a scatterplot of the DV and the IV. Here is a scatterplot of the heights and weights of a group of young adults in Brooklyn, New York. (It’s from a project I worked on long ago.)
It is traditional to put the IV along the X axis, and the DV along the Y axis. Just by looking, it is clear that there is some relationship between height and weight.
There are various ways to model this relationship, and these can be represented by various lines. OLS regression assumes that the relationship is linear, that is, it fits a straight line to represent the relationship. Algebra tells us that any straight line can be represented as an equation like
y = a + bx
Here, y is height, x is weight, and a and b are parameters which we attempt to estimate (hence, simple linear regression, and regression generally, is a parametric method). Various lines might be fit to these points; we need a method to choose one of them, or, in other words, to select a and b. Ideally, the points would lie exactly on the line, but there is no such line for these points. Any line will miss most of the points; we need a method to say how badly a line misses the points. The most common way is through ordinary least squares (OLS), which chooses the line that minimizes the sum of the squared vertical distances from the line to the points.
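To make the choice of a and b concrete, here is a minimal sketch of the standard closed-form least squares solution for one IV, written for a line of the form y = a + bx. The function names and the five data points are made up for illustration; they are not the Brooklyn data.

```python
def fit_ols(xs, ys):
    """Return (a, b) for the line y = a + b*x that minimizes the
    sum of squared vertical distances to the points."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of squares of x around its mean, and cross-products of x and y
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b = sxy / sxx              # slope
    a = y_mean - b * x_mean    # intercept: the line passes through the means
    return a, b


def sum_squared_errors(xs, ys, a, b):
    """How badly the line y = a + b*x misses the points."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_ols(xs, ys)
best = sum_squared_errors(xs, ys, a, b)
print(f"intercept a = {a:.3f}, slope b = {b:.3f}, sum of squares = {best:.4f}")

# Nudging either parameter away from the fitted value makes the fit worse
print(sum_squared_errors(xs, ys, a + 0.1, b) > best)
print(sum_squared_errors(xs, ys, a, b - 0.1) > best)
```

In practice one would use a statistics package rather than hand-rolled code, but the arithmetic is the same.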
This seems, at first, needlessly complex: Why squared? If one simply summed the distances, some would be negative (i.e., the point is below the line) and some positive (i.e., the point is above the line), and they would cancel. In fact, the total is exactly 0 for every line that passes through the point given by the mean of x and the mean of y, so the plain sum cannot choose among those lines. By squaring the distances, all are positive. One could, instead, take the sum of the absolute values of the distances; this is, in fact, a good method. The reason it is used less often is historical and technical: Without computers, it is much easier to estimate the line based on least squares.
Another way of thinking about this is to imagine that the scatterplot was a piece of wood, and that each of the points was a nail sticking up from that piece of wood. Then, if we got a rod with a lot of nails, and tied rubber bands from each nail on the board to each nail on the rod, the rod would show the least squares line. In the figure, this is the black line. The red line is a nonparametric curve, which simply attempts to get close to the points, without being too bumpy and without assuming anything.
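The argument about signed, squared, and absolute distances can be checked numerically. The sketch below uses made-up points and a handful of arbitrary slopes; each candidate line is forced through the point of means, which is an assumption of this example rather than anything taken from the figure.

```python
# Hypothetical data, for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.5, 5.0, 8.5, 10.0]

x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)


def distances(a, b):
    """Signed vertical distances from the line y = a + b*x to each point."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]


results = []
for b in (-1.0, 0.0, 2.0, 5.0):
    a = y_mean - b * x_mean          # force the line through the point of means
    d = distances(a, b)
    signed = sum(d)                  # positives and negatives cancel
    squared = sum(v * v for v in d)  # the least squares criterion
    absolute = sum(abs(v) for v in d)  # the absolute-value alternative
    results.append((b, signed, squared, absolute))
    print(f"slope {b:5.1f}: signed {signed:8.4f}  squared {squared:8.2f}  absolute {absolute:6.2f}")

# The slope with the smallest sum of squares among the candidates
best_slope = min(results, key=lambda r: r[2])[0]
print("best slope by least squares:", best_slope)
```

The signed totals come out as zero (up to rounding) for every slope, so they say nothing about which line is better, while the squared and absolute totals both change with the slope.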
What can go wrong in simple linear regression?
F.J. Anscombe came up with four sets of data, each of which he fit with simple linear regression; each has essentially the same slope (b), the same intercept (a), and a lot of other things in common. But three show things that can go wrong in simple linear regression. Here’s the plot. The graph in the upper left is fine. The one in the upper right is a nonlinear relationship, and a straight line fits very badly; simple linear regression is a bad choice here. The ones on the bottom left and bottom right show the effects of outliers or extreme points that have too much influence on the results.
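The claim that the four data sets share a slope and intercept is easy to check. The sketch below fits each set with the closed-form least squares formulas; the values are the quartet as it is commonly reproduced (for example, in statistical software), so treat them as an assumption and check them against Anscombe's paper before relying on them. The helper name fit_line is made up.

```python
def fit_line(xs, ys):
    """Least squares intercept a and slope b for y = a + b*x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_mean - b * x_mean, b


# Anscombe's quartet as commonly reproduced; the first three sets share x
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

fits = {}
for name, (xs, ys) in quartet.items():
    a, b = fit_line(xs, ys)
    # Largest vertical miss: one number that does differ between the sets
    worst = max(abs(y - (a + b * x)) for x, y in zip(xs, ys))
    fits[name] = (a, b, worst)
    print(f"set {name:>3}: intercept {a:.2f}, slope {b:.2f}, largest miss {worst:.2f}")
```

The intercepts and slopes agree to two decimal places, which is exactly why the numbers alone are not enough and the plots are needed.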
There are other potential problems, too. No statistical method should be applied without knowing about its assumptions and what happens when they are violated.