In a previous post, I dealt with some SAS code for scatterplots. Various problems can arise when using scatterplots. One of them is overplotting, where two or more observations fall on the same point and hide one another.

There are a variety of ways of dealing with this.

First, let’s create some data on heights and weights. These aren’t real, and aren’t exactly accurate, but they will give you the idea.

data htwt;

do i = 1 to 50000;

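* Height to the nearest inch, centered on 66;

ht = round(rannor(1828282)*3) + 66;

* Weight to the nearest pound;

wt = round(ht*2 + ht**2*.01 + 10*rannor(12802194),1);

* Jittered copies of each, with standard normal noise added;

jitht = ht+rannor(1818282);

jitwt = wt+rannor(199328282);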

output;

end;

run;

The variables ht and wt reflect height and weight as most people report them in the USA – to the nearest inch and pound, respectively. We will get to jitwt and jitht later.
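Because both variables are rounded, many observations share exactly the same (ht, wt) pair, and those are the points that will overplot. A minimal sketch for counting the ties, assuming the `htwt` data set above (the output data set name `tiecounts` and the cutoff of 500 observations are only illustrative):

```sas
/* Count how many of the first 500 observations share each (ht, wt) pair */
proc freq data = htwt noprint;
  tables ht*wt / out = tiecounts;   /* tiecounts is a hypothetical name */
  where i <= 500;
run;

/* Summarize the counts: a maximum above 1 means some points coincide */
proc means data = tiecounts n mean max;
  var count;
run;
```

The N reported by PROC MEANS is the number of distinct points a plain scatterplot could show, which will be smaller than the number of observations whenever ties exist.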

When you have only 100 data points, a plain scatterplot is OK:

title 'Simple scatter plot where N = 100';

ods graphics on / imagefmt=jpg;

ods html path="c:\personal\Graphics" (url=none) file="simplescatter.html";

proc sgplot data = htwt;

scatter x = ht y = wt;

where i <= 100;

run;

ods html close;

ods graphics off;


When you have more points, say 500, overplotting becomes a problem:

We can deal with this by jittering – adding small amounts of random noise to the data. That’s what jitht and jitwt are.
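The post does not show the code for the jittered plot. A minimal sketch, assuming the `htwt` data set created above and the first 500 observations, simply swaps in the jittered variables:

```sas
title 'Jittered scatter plot where N = 500';

proc sgplot data = htwt;
  /* jitht and jitwt are ht and wt plus standard normal noise, */
  /* so tied points are spread apart instead of stacking up    */
  scatter x = jitht y = jitwt;
  where i <= 500;
run;
```

The noise here has a standard deviation of one inch and one pound. That is a choice made in the data step, not a rule; smaller noise keeps points closer to their recorded values at the cost of more overlap.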

But if we have even more data, say N = 10,000, jittering is not enough:

We can change the plotting character and its size:

ods html path="c:\personal\Graphics" (url=none) file="jitscatter3.html";

proc sgplot data = htwt;

scatter x = jitht y = jitwt/ markerattrs = (size = 2 symbol = circlefilled);

where i <= 10000;

run;

ods html close;


Or, we can go to a parallel box plot, with one box of weights for each height:

ods html ;

proc sgplot data = htwt;

vbox wt/category = ht spread;

where i <= 10000;

run;

ods html close;

One problem with this plot is that you lose information on the frequency of the different heights. But you can get that back with a bit of fancy footwork: a template that stacks a histogram of height beneath the box plot.

* Getting fancy;

proc template;

define statgraph fancybox;

begingraph;

entrytitle "Box plot w/histogram";

layout lattice/rows = 2 columns = 1 order = columnmajor rowweights = (.8 .2);

columnaxes;

columnaxis /griddisplay = on;

columnaxis /label = '' griddisplay = on;

endcolumnaxes;

boxplot x = ht y = wt;

histogram ht;

endlayout;

endgraph;

end;

run;

ods html ;

title 'Fancy box plot';

proc sgrender data = htwt template = fancybox;

run;

ods html close;




That’s an awesome demonstration. I don’t think I would have thought of the boxplot like that. I’m guessing that only works because the heights are whole numbers only. Or is there a way to round them if not?

Karen

PS. I think your second graph is identical to the first.

Karen

Thanks!

You’re right that the box plot only works with round numbers. You can definitely round numbers in SAS – see the ROUND function. I will check if I inserted the same graph twice.

Peter
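As a rough illustration of the rounding idea, here is a minimal sketch that treats `jitht` from the post's `htwt` data set as a stand-in for a height measured on a continuous scale, rounds it to the nearest inch, and uses the result as the box plot category. The names `htwt_round` and `ht_round` are made up for this example.

```sas
/* jitht stands in for a height recorded on a continuous scale */
data htwt_round;                 /* hypothetical data set name */
  set htwt;
  ht_round = round(jitht, 1);    /* round to the nearest whole inch */
run;

proc sgplot data = htwt_round;
  /* one box of weights per rounded height */
  vbox wt / category = ht_round;
  where i <= 10000;
run;
```

The second argument of ROUND sets the rounding unit, so `round(jitht, 2)` would give two-inch bins and fewer, wider boxes.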

A quick-and-dirty way to see what is going on:

proc sgscatter data = htwt;

matrix ht wt / diagonal=(histogram kernel);

run;

If the density of the points is a primary concern, then PROC KDE is your friend. You can get a contour plot of the 2D density, with or without a scatter plot overlay:

proc kde data = htwt;

bivar ht wt / plots=(contour contourscatter);

run;

Rick

Very useful to know about. I wouldn’t have thought of a scatterplot matrix with only 2 variables, but, yes, it lets you use that /diagonal.

PROC KDE is another thing I need to learn more about. For this post, I was concentrating on the SG PROCs, but, of course, there are lots of SAS PROCs that have built-in ODS graphics.

I really like the jittering with the smaller points. I don’t know why I never thought of that. Awesome tip.

Thanks AnnMaria!

Hexbinning is also useful when there are a lot of point collisions (SPSS Statistics supports this natively). If the variables are categorical, heat maps and flux plots can be useful.
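For readers staying in SAS, a related idea can be sketched with the HEATMAP statement in PROC SGPLOT. Note the substitution: this bins the points into rectangles, not hexagons, and it assumes a SAS 9.4 maintenance release recent enough to include HEATMAP. The bin counts below are arbitrary choices for illustration.

```sas
/* Rectangular binning as a stand-in for hexbinning:            */
/* each cell is colored by the number of observations inside it */
proc sgplot data = htwt;
  heatmap x = ht y = wt / nxbins = 25 nybins = 40;
run;
```

Because every observation contributes to a cell count, all 50,000 points can be plotted without jittering or shrinking the markers.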