Peter Flom’s statistics 101: Dependent and independent data
Often, when reading a statistics book, you will see some variation on the phrase “independent data”. Many models assume that the data are independent. Sometimes this appears as part of the abbreviation iid, which stands for independent and identically distributed.
It is easy to confuse this with the idea of independent and dependent variables, but the two ideas are quite different.
When we say data are independent, we mean that the data for different subjects do not depend on each other. When we say a variable is independent, we mean that it does not depend on another variable for the same subject.
For instance, if we are trying to predict the weight of adult humans, we might gather a sample of adults, and collect various bits of information – height, weight, sex, age, and perhaps many others. Weight is a dependent variable because it depends on the other variables – taller people tend to be heavier; men tend to be heavier than women, and so on. But the data are independent if the weight and other variables for one person aren’t related to those for another.
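To make the difference concrete, here is a minimal sketch in Python. All the numbers (average height, the slope of 0.9 kg per cm, the amount of noise) are made up for illustration and are not estimates from real data. Weight is built to depend on height for the same person, while each person is generated separately from everyone else.

```python
import random

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical sample of 2000 adults; heights in cm, weights in kg.
heights = [rng.gauss(170, 10) for _ in range(2000)]
# Assumed relationship: taller people tend to be heavier, plus personal noise.
weights = [70 + 0.9 * (h - 170) + rng.gauss(0, 10) for h in heights]

# Dependent VARIABLE: weight is related to height for the same person.
within_person = pearson(heights, weights)

# Independent DATA: one person's weight tells us nothing about the next person's.
between_people = pearson(weights[:-1], weights[1:])

print(f"height vs. weight, same person:      {within_person:.2f}")
print(f"weight of person i vs. person i + 1: {between_people:.2f}")
```

The first correlation should come out clearly positive and the second close to zero. Weight is a dependent variable, yet the data are independent.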
Sometimes, though, the data are dependent. One example is if we measured some variables on a bunch of children, but chose kids who were in particular classes in particular schools. Kids in a class are likely to be more similar to each other than kids in different classes. If we wanted to look at the effectiveness of teaching methods, we would have a problem, because kids in a particular class are not only taught by the same method, they are also taught by the same teacher. That makes it dependent data. If we only looked at one child from each class, that would be independent data, but it would be a very inefficient way to gather the data.
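One way to see this kind of dependence is to simulate it. The sketch below is only an illustration under made-up assumptions: every child's score is a baseline of 50, plus an effect shared by the whole class (teacher, classroom, and so on), plus the child's own noise. The function names and the standard deviations are hypothetical. It then estimates the intraclass correlation, a common measure of how alike members of the same group are, using the usual one-way ANOVA formula for equal-sized groups.

```python
import random
from statistics import mean

def simulate_classes(n_classes, n_per_class, class_sd, child_sd, seed=0):
    """Return a list of classes, each a list of simulated child scores."""
    rng = random.Random(seed)
    classes = []
    for _ in range(n_classes):
        # One draw per class, shared by every child in that class.
        class_effect = rng.gauss(0, class_sd)
        classes.append(
            [50 + class_effect + rng.gauss(0, child_sd) for _ in range(n_per_class)]
        )
    return classes

def intraclass_correlation(classes):
    """One-way ANOVA estimate of the ICC; assumes equal-sized classes."""
    k = len(classes)      # number of classes
    n = len(classes[0])   # children per class
    grand_mean = mean(x for c in classes for x in c)
    class_means = [mean(c) for c in classes]
    ms_between = n * sum((m - grand_mean) ** 2 for m in class_means) / (k - 1)
    ms_within = sum(
        (x - m) ** 2 for c, m in zip(classes, class_means) for x in c
    ) / (k * (n - 1))
    return (ms_between - ms_within) / (ms_between + (n - 1) * ms_within)

# Classes that share an effect (dependent data) vs. classes that do not.
clustered = simulate_classes(200, 20, class_sd=5, child_sd=10)
unclustered = simulate_classes(200, 20, class_sd=0, child_sd=10)

icc_clustered = intraclass_correlation(clustered)
icc_unclustered = intraclass_correlation(unclustered)

print(f"ICC with a shared class effect: {icc_clustered:.2f}")
print(f"ICC with no class effect:       {icc_unclustered:.2f}")
```

With these settings the true intraclass correlation is 25 / (25 + 100) = 0.2 in the first case and 0 in the second, so the estimates should land near those values.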
Another example is if we look at both members of a married couple. If we ask husbands and wives how long they have been married, the numbers are likely to be related (they probably won’t match exactly, but they will be close). So, what one subject says (say, Jill, wife of Jim) is related to what another subject says (Jim, husband of Jill). That’s dependent data again.
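Here is a small sketch of the married-couple example. The data are simulated, not real: I assume each couple has a true length of marriage and that each spouse reports it with a small error of about half a year. The comparison is between answers from the same couple and answers paired up at random across couples.

```python
import random

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(2)  # fixed seed so the sketch is repeatable

# Hypothetical sample of 500 couples, married between 1 and 40 years.
true_years = [rng.uniform(1, 40) for _ in range(500)]
# Each spouse reports the true value plus a small recall error (assumed).
wives = [t + rng.gauss(0, 0.5) for t in true_years]
husbands = [t + rng.gauss(0, 0.5) for t in true_years]

# Same couple: Jill's answer is closely tied to Jim's.
same_couple = pearson(wives, husbands)

# Break the pairing: each wife is matched with a randomly chosen husband.
shuffled_husbands = husbands[:]
rng.shuffle(shuffled_husbands)
different_couples = pearson(wives, shuffled_husbands)

print(f"wife vs. her own husband:      {same_couple:.2f}")
print(f"wife vs. someone else's husband: {different_couples:.2f}")
```

The answers within a couple should be almost perfectly correlated, while the randomly matched answers should show a correlation near zero.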
Another example is when we measure the same person (or other subject) more than once. If I give a bunch of students a midterm and a final, each student’s grade on the final is likely to depend on that student’s midterm grade, not just because of a general relationship between the two grades, but because the same person took both exams. That’s the other common kind of dependent data. If we looked only at some kids for the midterm and others for the final, that would be at least somewhat independent data, but it would be a silly design.
There are ways to deal with dependent data, but first you have to recognize them.