Lies, BLEEP lies, and Statistics
Mark Twain is often credited with saying that there are "lies, damn lies, and statistics," but as someone who tutors statistics, I see it differently: there are people who tell lies, and people who tell lies with statistics. Statistics themselves are only numbers. Calculations can be mistaken, the wrong formulas can be used, and yes, numbers can be used to mislead people, but the numbers themselves do not lie. The problem most people have with statistics is that it is an unfamiliar way to think about and manipulate numbers.
This week, I have been helping a student prepare for an upcoming standardized test by better understanding the implications of the average of a sample, also referred to as the mean. Generally, a sample consists of individuals 1, 2, 3, …, n, who each have some numerical characteristic x1, x2, x3, …, xn. For example, a sample of individuals' resting heart rates (measured in beats per minute, bpm) could be as follows:
Individual 1 has a rate of 43 bpm, 2 has 47, 3 has 52, 4 has 33, …, and n has z. An average, then, is the sum of all the individual rates (Sum) divided by the count of individuals (Count). So, given the numbers above, n = 5, and an average of 45, I can calculate z. Sum is equal to 43+47+52+33+z, Count is equal to 5, and the average is equal to 45, so Sum is equal to 5 × 45 = 225, and z = 225-43-47-52-33 = 50.
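For anyone who would rather check that arithmetic by machine, here is a minimal Python sketch of the same calculation. The variable names (known_rates, total, z) are my own choices for illustration; the numbers are the ones from the heart rate example.

```python
# Recover the missing heart rate z from the average of the sample.
known_rates = [43, 47, 52, 33]   # the four rates given in the example (bpm)
n = 5                            # Count: number of individuals in the sample
average = 45                     # the stated mean (bpm)

# Average = Sum / Count, so Sum = Count * Average.
total = n * average

# Whatever is left after subtracting the known rates must be z.
z = total - sum(known_rates)

# Sanity check: the completed sample really does average to 45.
rates = known_rates + [z]
assert sum(rates) / len(rates) == average

print("Sum =", total, "and z =", z)
```

The same rearrangement works for any sample with exactly one missing value, as long as the count and the average are known.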
Notice, however, some properties of the average. First of all, unless all the sample values are equal, there is always a smallest and a largest number, and the average is always between them. I can always rearrange the sample values in order from smallest to largest (e.g. 33, 43, 47, 50, 52). No matter how many 33's I add to the sample, as long as there is one number greater than 33, the average will always be between 33 and the largest number. Similarly, no matter how many 52's I add to the sample, the average will always be between 52 and the smallest number. Thus, by definition, it is impossible for everyone to be above average, or for everyone to be below it. Qualitatively, then, being above average, at average, or below average is not in itself good or bad. It is merely a mathematical way to compare an individual sample value against a value that summarizes a central point of all the sample values.
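A quick way to convince yourself of the "always in between" property is to try it. The sketch below pads the ordered sample with more and more copies of 33 (and then of 52) and checks that the mean never reaches either end. The helper mean_after_padding is a name I made up for this illustration; mean comes from Python's standard statistics module.

```python
from statistics import mean

sample = [33, 43, 47, 50, 52]  # the ordered heart rate sample (bpm)


def mean_after_padding(values, filler, copies):
    """Return the mean of values after appending `copies` copies of `filler`."""
    return mean(values + [filler] * copies)


for copies in (0, 10, 1000, 100000):
    low_heavy = mean_after_padding(sample, 33, copies)   # many extra 33's
    high_heavy = mean_after_padding(sample, 52, copies)  # many extra 52's

    # The mean creeps toward the repeated value but stays strictly
    # between the smallest (33) and largest (52) values in the sample.
    assert 33 < low_heavy < 52
    assert 33 < high_heavy < 52

    print(copies, low_heavy, high_heavy)
```

This is a demonstration for one sample, not a proof, but it shows the pattern: piling on copies of one extreme pulls the mean toward it without ever letting the mean escape the range.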
No sooner had I worked on this than the local paper ran an editorial showing a complete lack of understanding of this very principle. The complaint was that Wisconsin's reported drunk driving rate was over 20% (i.e. 1 out of 5 people surveyed admitted to driving while intoxicated in the last year) and that this was the highest such rate among all 50 states in the U.S. Now, to be clear, a 20% rate is qualitatively scary. But out of 50 states, some state has to have the highest rate, so saying that the 20% rate was the worst among the 50 states adds nothing by itself, because some rate was going to be the worst. The ranking only means something in comparison to the other values, or to their summary, the mean. After all, how would you evaluate that result if the 50 states reported values between 18 and 20 percent, versus values between 5 and 20 percent? In the first case, it would mean very little to say that 20% was the worst score, because it is at most two percentage points above the mean (which must be at least 18%). In the second case, the difference could be much larger, and by implication much more improvement would be possible. Thus, the 20% value only provides additional information in the context of the average and other measures of how the data are spread out.
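To make the two scenarios concrete, here is a small Python sketch. The state rates in it are invented: I simply spread 50 values evenly across each range, which is an assumption for illustration and not the actual survey data. The function names (evenly_spaced_rates, gap_above_mean) are likewise my own.

```python
from statistics import mean, pstdev


def evenly_spaced_rates(low, high, count=50):
    """Hypothetical data: `count` rates spread evenly from low to high (percent)."""
    return [low + (high - low) * i / (count - 1) for i in range(count)]


def gap_above_mean(rates):
    """How far the worst (highest) rate sits above the mean, in percentage points."""
    return max(rates) - mean(rates)


narrow = evenly_spaced_rates(18, 20)  # every state between 18% and 20%
wide = evenly_spaced_rates(5, 20)     # states anywhere between 5% and 20%

for label, rates in (("18-20%", narrow), ("5-20%", wide)):
    print(
        label,
        "worst:", round(max(rates), 2),
        "mean:", round(mean(rates), 2),
        "gap:", round(gap_above_mean(rates), 2),
        "spread (std dev):", round(pstdev(rates), 2),
    )
```

In both made-up scenarios the worst rate is the same 20%, yet the gap above the mean is much larger in the wide scenario. That gap, together with a measure of spread such as the standard deviation, is what tells you whether "worst in the nation" is worth an editorial.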