Question: Is the method of determining outliers by flagging data that are more than 3 standard deviations away from the mean an effective method?
My answer: There is no very good automatic method of finding outliers.
Let’s look at your proposed method: anything more than 3 SD from the mean is an outlier. Well, if your sample is very large (as, today, many samples are) then even if the population is normally distributed you would expect some cases more than 3 SD away. About 0.27% of a normal distribution lies beyond 3 SD in either direction, so if you have a sample of 10,000,000 (which isn’t that big, by the standards of big data) you would expect about 27,000 cases more than 3 SD away. It would be absolutely freakishly amazing to find no cases that far away. It would mean there were inliers.
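The arithmetic behind that estimate can be checked with a few lines of Python. This is only a sketch: it assumes an exactly normal population and independent draws, and the function and variable names are made up for illustration.

```python
import math

def two_sided_tail_prob(k: float) -> float:
    """P(|Z| > k) for a standard normal Z, via the complementary error function."""
    return math.erfc(k / math.sqrt(2))

n = 10_000_000                    # sample size used in the text
p = two_sided_tail_prob(3.0)      # share of a normal population beyond 3 SD
expected_flagged = n * p          # expected number of cases the rule flags

# Chance that none of n independent draws lands beyond 3 SD, on a log10 scale
# (the probability itself is far too small to represent as a float).
log10_p_none = n * math.log1p(-p) / math.log(10)

print(f"P(|Z| > 3) = {p:.5f}")
print(f"expected cases beyond 3 SD: {expected_flagged:,.0f}")
print(f"log10 P(no cases beyond 3 SD) = {log10_p_none:,.0f}")
```

The last line is the "freakishly amazing" part: the probability of seeing no such cases has an exponent in the negative thousands.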
But, of course, many distributions are not normal. Many have heavier tails, or are skewed to the right or left. That would typically increase the number of cases your rule flags as outliers.
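To put rough numbers on that, here is a small comparison of the share of cases beyond 3 SD for the normal distribution and for two textbook alternatives. The choice of a Laplace distribution (heavier tails) and an exponential distribution (right skew) is mine, purely as an illustration; the tail formulas are the standard closed forms for those distributions.

```python
import math

K = 3.0  # the "3 SD" cutoff from the question

# Normal: P(|Z| > 3).
normal_tail = math.erfc(K / math.sqrt(2))

# Laplace (double exponential) with scale b: SD = b * sqrt(2) and
# P(|X - mean| > t) = exp(-t / b), so at t = K * SD the tail is exp(-K * sqrt(2)).
laplace_tail = math.exp(-K * math.sqrt(2))

# Exponential with rate 1: mean = 1 and SD = 1. The lower cutoff (mean - 3 SD = -2)
# is below the support, so only the upper tail counts: P(X > 1 + K) = exp(-(1 + K)).
exponential_tail = math.exp(-(1 + K))

for name, tail in [("normal", normal_tail),
                   ("Laplace", laplace_tail),
                   ("exponential", exponential_tail)]:
    print(f"{name:12s} share beyond 3 SD: {tail:.4f} "
          f"({tail / normal_tail:.1f}x the normal share)")
```

Under these assumptions both alternatives flag several times as many cases as the normal distribution does, without a single one of them being an error.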
And that’s just univariate outliers. What about bivariate cases (not to mention multivariate)? In a sample of the USA population, 12-year-olds are not outliers and widows are not outliers, but 12-year-old widows are definitely outliers.
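The widow example is categorical, but the same thing happens with continuous variables. One common way to see it (not the only one, and not something the question asked about) is the Mahalanobis distance, which measures how far a point is from the center after accounting for correlation. The sketch below assumes two standardized variables with a known correlation of 0.95; the function name and the two example points are hypothetical.

```python
import math

def mahalanobis_2d(x: float, y: float, rho: float) -> float:
    """Mahalanobis distance from the origin for two standardized variables
    (mean 0, SD 1) whose correlation is rho."""
    q = (x * x - 2 * rho * x * y + y * y) / (1 - rho * rho)
    return math.sqrt(q)

rho = 0.95            # assumed strong positive correlation
typical = (2.0, 2.0)  # high on both variables, as the correlation suggests
odd = (2.0, -2.0)     # high on one, low on the other

d_typical = mahalanobis_2d(*typical, rho)
d_odd = mahalanobis_2d(*odd, rho)

# Both points are only 2 SD from the mean on each variable taken alone,
# so a univariate 3 SD rule flags neither of them.
print(f"typical point: distance {d_typical:.2f}")
print(f"odd point:     distance {d_odd:.2f}")
```

The second point passes the 3 SD check on each variable separately, yet it sits far from the bulk of the data once the two are considered together.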