Significance testing is often seen as the point at which statistics courses turn difficult. Usually, this is because people are given a poor grounding in the logic behind statistics, and especially the logic of significance testing. This post is intended to provide a basic understanding of that logic.

Suppose we have a population with a mean IQ of 100 and a standard deviation of 15. If I take a sample of 100 people at random, I would expect their mean to be relatively close to 100, but not necessarily exactly equal to 100. It may be slightly higher (say, 100.4) or slightly lower (98.9), just due to random chance variation, measurement error, sampling error, and the like. In other words, when we sample, we expect some variability around the population mean.
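To make this variability concrete, here is a minimal simulation sketch. It assumes IQ scores are normally distributed (an assumption added for illustration), and the constant names and the choice of 2,000 repeated samples are mine. It draws many samples of 100 people and looks at how much their means wander around 100.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

POP_MEAN = 100     # population mean IQ (from the text)
POP_SD = 15        # population standard deviation (from the text)
SAMPLE_SIZE = 100  # people per sample (from the text)
N_SAMPLES = 2000   # number of repeated samples (arbitrary choice)

def sample_mean():
    """Draw one random sample and return its mean IQ."""
    scores = [random.gauss(POP_MEAN, POP_SD) for _ in range(SAMPLE_SIZE)]
    return statistics.fmean(scores)

sample_means = [sample_mean() for _ in range(N_SAMPLES)]

mean_of_means = statistics.fmean(sample_means)
spread = statistics.stdev(sample_means)

# Theory says the sample means should vary with a standard deviation
# of about POP_SD / sqrt(SAMPLE_SIZE), which is 1.5 points here.
theoretical_se = POP_SD / SAMPLE_SIZE ** 0.5

print(f"average of the sample means: {mean_of_means:.2f}")
print(f"spread of the sample means:  {spread:.2f}")
print(f"theoretical standard error:  {theoretical_se:.2f}")
```

Under these assumptions, most sample means should land within a few points of 100, which is why values like 100.4 or 98.9 are unremarkable.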

When we do significance testing, we are asking whether a sample mean differs from the population mean by more than random variation alone would plausibly produce. For instance, if I expect my educational intervention to enhance IQ, I'd be looking to see if a sample subjected to that intervention would have a significantly higher IQ than average after receiving it. But, as I just said, we expect some variability between measurements just due to random variation. My class may have a mean IQ of 102.3 after receiving my educational intervention, but is that due to the intervention or just due to random variation? I have no way of knowing just by looking at it.

What most significance tests amount to is simply this:

Test Statistic = (Size of effect) ÷ (Size of effect due to random error).

The numerator (Size of effect) is sometimes called the "effect size," while the denominator (Size of effect due to random error) is often called the "standard error," or something along those lines. So the test statistic is, essentially, a ratio: it tells you how much larger your effect is than can reasonably be attributed just to random-chance variation.

Here's an example. Suppose my class's mean IQ goes up from 100 to 106, so I have an effect size of 6 points. Suppose, furthermore, that on the basis of the central limit theorem, I can deduce that the standard error is about 2.8 points; in other words, 2.8 points is the typical amount by which the mean of a sample this size varies from one sample to the next. If I divide my effect size by the standard error, I get 6 ÷ 2.8, or approximately 2.14. This tells me that my effect is 2.14 times as large as the variation typically expected from random chance alone. If this ratio is sufficiently large, we can conclude that the effect is not just due to random-chance variation, but due to some real effect of the intervention.
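The arithmetic in this example can be sketched in a few lines of Python. Two things here are my own assumptions rather than part of the example: the ratio is treated as a z-score (compared against a standard normal curve) to turn it into a probability, and the standard error is assumed to come from the usual formula, standard deviation ÷ √n, with the standard deviation of 15 mentioned earlier.

```python
from statistics import NormalDist

effect = 106 - 100    # size of effect: 6 IQ points
standard_error = 2.8  # size of effect due to random error (from the example)

test_statistic = effect / standard_error

# Assumption: treat the ratio as a z-score on a standard normal curve.
std_normal = NormalDist()
p_one_sided = 1 - std_normal.cdf(test_statistic)  # chance of a ratio this high
p_two_sided = 2 * p_one_sided                     # ... or this far from zero

# Assumption: standard error = SD / sqrt(n) with SD = 15, so the class
# size implied by a 2.8-point standard error is (15 / 2.8) squared.
implied_class_size = (15 / standard_error) ** 2

print(f"test statistic:     {test_statistic:.2f}")
print(f"one-sided p-value:  {p_one_sided:.3f}")
print(f"two-sided p-value:  {p_two_sided:.3f}")
print(f"implied class size: {implied_class_size:.1f}")
```

Read this way, a ratio of 2.14 corresponds to a result that random variation alone would produce only a few percent of the time, and a 2.8-point standard error corresponds to a class of roughly 29 students.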

How large does this ratio have to be before we can conclude that the effect is real and not just an illusion produced by random sampling error? That depends mainly on how confident you want to be that your results are real and not just a product of sampling error, and, for some tests, on the size of the sample. A greater desired level of confidence requires a larger test statistic: for a two-sided test with a large sample, the ratio needs to exceed about 1.96 for 95% confidence and about 2.58 for 99% confidence. Smaller effects do not require a larger test statistic; they require a larger sample, because the standard error shrinks as the sample grows, so the same effect produces a larger ratio. Thus, for instance, to detect an effect of about one-half a standard deviation with 95% confidence, we'd need 128 participants, but for 99% confidence, we'd need 192 people. (These figures are consistent with comparing two equal-sized groups with roughly an 80% chance of detecting the effect.)
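As a rough illustration of how the required ratio and the required sample size move with the confidence level, here is a sketch using the normal approximation. The function names, the two-group design, and the 80% power setting are assumptions of mine, not something stated above. Because it uses the normal curve rather than the t distribution, it gives totals of 126 and 188, slightly below the 128 and 192 quoted in the text.

```python
from math import ceil
from statistics import NormalDist

def critical_value(confidence):
    """Two-sided cutoff the test statistic must exceed (normal approximation)."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

def total_sample_size(effect_in_sds, confidence, power=0.80):
    """Approximate total N for comparing two equal-sized groups.

    Assumptions: two independent groups, a two-sided test, and the
    normal approximation (a t-based calculation gives slightly more).
    """
    z_alpha = critical_value(confidence)
    z_power = NormalDist().inv_cdf(power)
    per_group = 2 * ((z_alpha + z_power) / effect_in_sds) ** 2
    return 2 * ceil(per_group)

for confidence in (0.95, 0.99):
    cutoff = critical_value(confidence)
    total = total_sample_size(0.5, confidence)
    print(f"{confidence:.0%} confidence: cutoff {cutoff:.2f}, total N about {total}")
```

The pattern is the point rather than the exact numbers: moving from 95% to 99% confidence raises the cutoff, and the sample has to grow by about half to keep the same chance of detecting a half-standard-deviation effect.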

Most test statistics are special cases of this principle: the size of the effect divided by the size of the effect that can be attributed to random error. In some cases, these ratios can be negative, but the sign only indicates the direction of the effect (e.g., that the mean IQ decreased, rather than increased); what matters for significance is how far the ratio is from zero.
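A small sketch of the point about negative ratios, again assuming the ratio is compared against a standard normal curve (my assumption; the helper name is hypothetical): a two-sided test looks only at the distance from zero, so a 6-point decrease and a 6-point increase with the same standard error are equally significant.

```python
from statistics import NormalDist

def two_sided_p(test_statistic):
    """Two-sided p-value; abs() discards the direction of the effect."""
    return 2 * (1 - NormalDist().cdf(abs(test_statistic)))

p_increase = two_sided_p(6 / 2.8)   # mean IQ rose by 6 points
p_decrease = two_sided_p(-6 / 2.8)  # mean IQ fell by 6 points

print(f"increase: p = {p_increase:.3f}")
print(f"decrease: p = {p_decrease:.3f}")
```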