[Edit, Sept. 25, 2016: In retrospect, this is a confusing post. The overall point is fine, but my contrived illustration is not a good one.]
At an otherwise excellent talk some time ago, the speaker put up a graph like this (look below — not the cheetah)…
…and said that the two sets of data points, from two different experimental treatments, “weren’t statistically different.” If I had been drinking coffee, I would have spit it out in shock. Obviously, the two sets are different. Why would anyone think otherwise? As usual, the answer relates to the nonsense of p-values and significance testing.
The speaker, following the usual rituals, performed a t-test on the two sets of data points. This gave a p-value greater than 0.05, for which the customary (and wrong) interpretation is that there’s less than a 95% chance that treatment #2 has a discernibly different effect on the property being measured than treatment #1 does, and hence they’re not “significantly different.” Or, in other words, that there’s less than a 95% probability that the average effect of treatment #2 is different from the average effect of treatment #1. (I’m being deliberately opaque about the experiment both because it’s irrelevant and to spare the speaker some embarrassment if he/she finds this post.) There are many problems with this reasoning, but to illustrate one that’s especially important here, and that doesn’t get mentioned enough in discussions of flawed statistics, here’s a little puzzle:
The graph I showed above has two sets of points. Comparing the two with a two-sample t-test gives p = 0.086. (Really, it does; see note [1] if you want the numbers.) I generated each set of points with a simple draw of random numbers. Repeating this 100 times, we find p < 0.05 in 19% of instances. 81% of the time, we get the “insignificant” result that p > 0.05. However, the sample mean of set #2 is greater than the sample mean of set #1 in 99% of instances! How could this be? How could calculating the p-value imply that measuring these two quantities should, most of the time, show “no effect,” while in actuality set #2 is almost always greater on average than set #1? (If you’re suspicious of my numbers, see note [1].) Everyone who uses p-values or similar statistics should be able to come up with a plausible resolution to this puzzle. (I would think this would make a good homework or discussion problem in a biostatistics class, but I don’t really know much about those sorts of classes.) You, dear reader, can sit back and ponder this, and then read on…

What’s the solution to our mystery? [Edit: item (i) was added Sept. 25, 2016.] (i) First of all, we should keep in mind that the interpretation above is (deliberately) wrong: there’s no general relationship between the p-value and the range of outcomes under the true distribution. Rather, the p-value would tell us about the range of outcomes if set #2 had the same distribution as set #1 (i.e. the null hypothesis). So, there’s nothing surprising or unsurprising about the number of p < 0.05 cases compared with the number of set 2 > set 1 cases. You may wonder, as I do, why people bother testing the null hypothesis so much, when it doesn’t tell us much of anything, but that’s a separate question. (ii) A more subtle point, less often discussed, is that I never said the random numbers were drawn from Gaussian distributions. Both sets come from log-normal distributions (i.e. I generated Gaussian-distributed random numbers and then took the exponential of each) [2]. “That’s not fair!” you might say, “Distributions are supposed to be Gaussian.” No, I’ll reply; nature doesn’t make such promises. In fact, in many contexts that I run into, the data involve such things as counts of small numbers of cells, whose distributions, due for example to the impossibility of negative numbers, can’t possibly be Gaussian. Nonetheless, people plug away with t-tests and p-values, even though there’s no way they could mean what we want them to mean.
You might think that the solution to this is to use a test that doesn’t assume some form for the distribution, like a Kolmogorov-Smirnov test. This doesn’t really help; for the above simulated data, a Kolmogorov-Smirnov test gives 29% of the instances having p < 0.05, an improvement over 19%, but still silly. So what should one do?
The message of this post is that one’s data always has some noise distribution: we’re sampling from some probability distribution set by the underlying biology or physics, convolved with the uncertainty of our measurement, and the properties of this distribution determine what we can infer from the data. We might not know what this distribution is, but it’s there nonetheless! One of our tasks as scientists is to figure out this distribution, either through an understanding of the underlying processes or via measurement. If we can’t do that, we should at least be aware of how the inferences we’re drawing depend on the noise model, because they always do, whether we’re aware of it or not.
Noise distributions guide everything in nature — the variances, the randomness, the rare events and the subtle deviations from singular values shape the course of events as much as the average values and deterministic cartoons. As Trotsky said, “You may not be interested in the dialectic, but the dialectic is interested in you.” The same holds for noise.
Today’s illustration…
… a King Cheetah, a drawing from a drawing in the wonderful “Cats of Africa” by Paul Bosman. You may be wondering: why does it have stripes? Apparently, this rare variety of cheetah does. You may also be wondering: do King Cheetahs also have funny looking noses? No; that’s a flaw of my drawing.
Notes
[1] I’ve tabulated all the outputs of my random draws in this .csv file. Column 1 is the first instance of treatment 1, column 2 is the first instance of treatment 2, column 3 is the second instance of treatment 1, column 4 is the second instance of treatment 2, and so on. The data shown in the plot above, therefore, are from the first two columns of this array.
The data are the exponentials of random numbers drawn from a Gaussian distribution with {mean, std. dev.} = {-0.1,0.5} and {0.4, 1.5} for treatments 1 and 2, respectively, with 10 datapoints in each instance.
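For readers who want to try this themselves, here is a rough Python sketch of the simulation described above. It is not the code used for this post. It assumes the t-test is the pooled-variance (equal-variance) two-sample version, so that 10 + 10 points give 18 degrees of freedom, and instead of computing a p-value it compares |t| against the tabulated two-sided 5% critical value for 18 degrees of freedom (2.101), which is equivalent to asking whether p < 0.05. The function names and the seed are made up for illustration, and because the draws are random the fractions will not exactly match the 19% and 99% quoted in the post.

```python
import math
import random
import statistics

# Two-sided 5% critical value of Student's t with 18 degrees of freedom
# (10 + 10 points, pooled variance). |t| > T_CRIT_DF18 corresponds to p < 0.05.
# Assumption: a pooled-variance test; this constant is only valid for n_points=10.
T_CRIT_DF18 = 2.101


def lognormal_sample(rng, mu, sigma, n):
    """Draw n Gaussian numbers with mean mu and std. dev. sigma; exponentiate each."""
    return [math.exp(rng.gauss(mu, sigma)) for _ in range(n)]


def lognormal_mean(mu, sigma):
    """Population mean of a log-normal variable: exp(mu + sigma^2 / 2)."""
    return math.exp(mu + sigma ** 2 / 2)


def pooled_t_statistic(a, b):
    """Two-sample t statistic (mean of b minus mean of a) with pooled variance."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    std_err = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (statistics.mean(b) - statistics.mean(a)) / std_err


def simulate(n_instances=100, n_points=10, seed=0):
    """Return (fraction with p < 0.05, fraction with mean of set 2 > mean of set 1)."""
    rng = random.Random(seed)
    significant = 0
    mean2_larger = 0
    for _ in range(n_instances):
        set1 = lognormal_sample(rng, -0.1, 0.5, n_points)  # treatment 1
        set2 = lognormal_sample(rng, 0.4, 1.5, n_points)   # treatment 2
        if abs(pooled_t_statistic(set1, set2)) > T_CRIT_DF18:
            significant += 1
        if statistics.mean(set2) > statistics.mean(set1):
            mean2_larger += 1
    return significant / n_instances, mean2_larger / n_instances


if __name__ == "__main__":
    frac_sig, frac_larger = simulate()
    print(f"fraction of instances with p < 0.05: {frac_sig:.2f}")
    print(f"fraction with mean(set 2) > mean(set 1): {frac_larger:.2f}")
    print(f"population means: {lognormal_mean(-0.1, 0.5):.2f} "
          f"and {lognormal_mean(0.4, 1.5):.2f}")
```

The `lognormal_mean` helper is there because the mean of a log-normal variable depends on the width of the underlying Gaussian as well as its center, which is why the true averages of the two treatments are further apart than the {mean, std. dev.} pairs alone might suggest.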
[2] In case you’re curious, here are the histograms of all the values of each set…
… and here’s some quickly written MATLAB code.