Scatterplots – dealing with overplotting

By , January 24, 2011 5:20 pm

In a previous post, I dealt with some SAS code for scatterplots. Various problems can arise when using scatterplots. One of them is overplotting, where two or more observations have the same values and so are drawn on top of each other at a single point.

There are a variety of ways of dealing with this.

First, let’s create some data on heights and weights. These aren’t real, and aren’t exactly accurate, but they will give you the idea.


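data htwt;
do i = 1 to 50000;
/* height in inches, rounded to the nearest inch */
ht = round(rannor(1828282)*3) + 66;
/* weight in pounds, rounded to the nearest pound */
wt = round(ht*2 + ht**2*.01 + 10*rannor(12802194),1);
/* jittered copies: the same values plus standard normal noise */
jitht = ht+rannor(1818282);
jitwt = wt+rannor(199328282);
output;
end;
run;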

The variables ht and wt reflect height and weight as most people report them in the USA – to the nearest inch and pound, respectively. We will get to jitwt and jitht later.
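To get a feel for how much overplotting to expect, one option is to count how many observations share each (ht, wt) pair. This is only a sketch and not part of the original code: the data set name dupcount is made up for illustration, and the cutoff of 500 rows is an arbitrary choice.

```sas
* Count observations per distinct (ht, wt) pair among the first 500 rows;
proc freq data = htwt noprint;
tables ht*wt / out = dupcount;
where i <= 500;
run;

* N = number of distinct points, MAX = largest stack on a single point;
proc means data = dupcount n max mean;
var count;
run;
```

If the maximum of count is well above 1, many markers in a plain scatterplot will be hiding other observations.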

When you have only 100 data points, a plain scatterplot is OK:


title 'Simple scatter plot where N = 100';
ods graphics on / imagefmt=jpg;
ods html path="c:\personal\Graphics" (url=none) file="simplescatter.html";
proc sgplot data = htwt;
scatter x = ht y = wt;
where i <= 100;
run;
ods html close;
ods graphics off;

produces
A simple scatterplot

When you have more points, say 500, overplotting becomes a problem:
Scatterplot with overplotting
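The code for this plot is not shown in the post. Presumably it is the same as the first plot with a wider WHERE clause, so a sketch under that assumption (the title text is also assumed) would be:

```sas
* Assumed: same plot as before, restricted to the first 500 rows;
title 'Simple scatter plot where N = 500';
proc sgplot data = htwt;
scatter x = ht y = wt;
where i <= 500;
run;
```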

We can deal with this by jittering – adding small amounts of random noise to the data. That’s what jitht and jitwt are.

Scatterplot with jitter
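The code for the jittered plot is not shown either. A minimal sketch, assuming the same 500 rows and simply swapping in the jittered variables created in the data step (the title text is assumed), might look like this:

```sas
* Assumed: plot the jittered copies instead of the rounded values;
title 'Jittered scatter plot where N = 500';
proc sgplot data = htwt;
scatter x = jitht y = jitwt;
where i <= 500;
run;
```

Because jitht and jitwt add standard normal noise to the rounded values, points that shared one (ht, wt) pair spread out into a small cloud around it.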

But if we have even more data, say N = 10,000, jittering is not enough:
Scatterplot with jitter not enough

We can change the plotting character and its size:
ods html path="c:\personal\Graphics" (url=none) file="jitscatter3.html";
proc sgplot data = htwt;
scatter x = jitht y = jitwt/ markerattrs = (size = 2 symbol = circlefilled);
where i <= 10000;
run;
ods html close;

producing
Scatterplot with small dots

Or, we can go to a parallel box plot:

ods html ;
proc sgplot data = htwt;
vbox wt/category = ht spread;
where i < 10000;
run;
ods html close;

One problem with this plot is that you lose information on the frequency of the different heights. But you can get that too, with a bit of fancy footwork:


* Getting fancy;
proc template;
define statgraph fancybox;
begingraph;
entrytitle "Box plot w/histogram";
layout lattice/rows = 2 columns = 1 order = columnmajor rowweights = (.8 .2);
columnaxes;
columnaxis /griddisplay = on;
columnaxis /label = '' griddisplay = on;
endcolumnaxes;
boxplot x = ht y = wt;
histogram ht;
endlayout;
endgraph;
end;
run;

ods html ;
title 'Fancy box plot';
proc sgrender data = htwt template = fancybox;
run;
ods html close;

producing

Box plot with histogram

7 Responses to “Scatterplots – dealing with overplotting”

  1. That’s an awesome demonstration. I don’t think I would have thought of the boxplot like that. I’m guessing that only works because the heights are whole numbers only. Or is there a way to round them if not?

    Karen

    PS. I think your second graph is identical to the first.

  2. [...] This post was mentioned on Twitter by Karen Grace-Martin, David Napoli. David Napoli said: Great insight on jitter and lattice technique RT @PeterFlomStat Scatterplots – dealing with overplotting #SAS http://ow.ly/3Jr8W [...]

  3. Peter Flom says:

    Karen

    Thanks!
    You’re right that the box plot only works with round numbers. You can definitely round numbers in SAS – see the ROUND function. I will check if I inserted the same graph twice.

    Peter

  4. Rick Wicklin says:

    A quick-and-dirty way to see what is going on:

    proc sgscatter data = htwt;
    matrix ht wt / diagonal=(histogram kernel);
    run;

    If the density of the points is a primary concern, then PROC KDE is your friend. You can get a contour plot of the 2D density, with or without a scatter plot overlay:

    proc kde data = htwt;
    bivar ht wt / plots=(contour contourscatter);
    run;

  5. Peter Flom says:

    Rick

    Very useful to know about. I wouldn’t have thought of a scatterplot matrix with only 2 variables, but, yes, it lets you use that /diagonal.

    PROC KDE is another thing I need to learn more about. For this post, I was concentrating on the SG PROCs, but, of course, there are lots of SAS PROCs that have built in ODS graphics.

  6. AnnMaria says:

    I really like the jittering with the smaller points. I don’t know why I never thought of that. Awesome tip

  7. Peter Flom says:

    Thanks AnnMaria!
