[Note: a long post of interest only to people who care about data analysis and bad statistics, and maybe about the distant stars influencing your life.]
By now, we should all be able to list the many reasons that p-values (or null-hypothesis-significance-testing, NHST) are awful:
- that “statistical significance” has nothing to do with effect size or actual significance
- that “p < 0.05” tells one nothing about the likelihood of a scientific hypothesis being true
- that “statistical significance” is simply a measure of measurability, or sample size
- that the p-value is itself a stochastic variable and is therefore especially unreliable for small sample sizes
… and so on. From a recent news piece in Nature:
P values have always had critics. In their almost nine decades of existence, they have been likened to mosquitoes (annoying and impossible to swat away), the emperor’s new clothes (fraught with obvious problems that everyone ignores) and the tool of a “sterile intellectual rake” who ravishes science but leaves it with no progeny. One researcher suggested rechristening the methodology “statistical hypothesis inference testing”, presumably for the acronym it would yield. — Regina Nuzzo, 12 February 2014.
Yet another problem with p-values / NHST, that struck me during a really terrible talk I endured at a recent conference, is that p-values promote nonsensical binary thinking. What do I mean by this?
A sort-of-made-up case study
Consider, for example, some tumorous cells that we can treat with drugs 1 and 2, either alone or in combination. We can make measurements of growth under our various drug treatment conditions. Suppose our measurements give us the following graph:
… from which we tell the following story: When administered on their own, drugs 1 and 2 are ineffective — tumor growth isn’t statistically different from that of the control cells (p > 0.05, two-sample t-test). However, when the drugs are administered together, they clearly affect the cancer (p < 0.05); in fact, the p-value is very small (0.002!). This indicates a clear synergy between the two drugs: together they have a much stronger effect than each alone does. (And that, of course, is what the speaker claimed.)
I’ll pause while you ponder why this is nonsense.
Another interpretation of this graph is that the “treatments 1 and 2” data are exactly what we’d expect for drugs that don’t interact at all. Treatment 1 and Treatment 2 alone each increase growth by some factor relative to the control, and there’s noise in the measurements. The two drugs together give a larger, simply multiplicative effect, and the signal relative to the noise is higher (and the p-value is lower) simply because the product of 1’s and 2’s effects is larger than each of their effects alone.
I made up the graph above, but it looks just like the “important” graphs in the talk. How did I make it up? The control dataset is random numbers drawn from a normal distribution with mean 1.0 and standard deviation 0.75, with N = 10 measurements. Drug 1 and drug 2’s “data” are also drawn from normal distributions with the same N, but with a mean of 2.0 and a standard deviation of 2.0 x 0.75. (In other words, each drug enhances the growth by a factor of 2.0, and has a simple multiplicative effect on the spread of values; there are, of course, other noise models one could imagine.) The combined treatment is drawn from a distribution with mean 4.0 (= 2 x 2), again with the same number of measurements, and with standard deviation 4.0 x 0.75. In other words, the simplest model of a simple, non-interacting effect. One can simulate this ad nauseam to get a sense of how the measurements might be expected to look. [Corrections made Jan. 9, 2017; the graphs and the overall point of the post are unaffected.]
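The recipe above takes only a few lines to simulate. Here is a minimal sketch in Python (my own code, not the speaker's analysis; the seed is arbitrary, and using scipy's two-sample t-test is my choice):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility

N = 10   # measurements per condition
s = 0.75 # control standard deviation

control = rng.normal(1.0, s, N)
treatments = {
    "drug 1": rng.normal(2.0, 2.0 * s, N),  # 2x growth, multiplicative noise
    "drug 2": rng.normal(2.0, 2.0 * s, N),
    "both":   rng.normal(4.0, 4.0 * s, N),  # 2 x 2 = 4: no interaction at all
}

pvals = {}
for name, data in treatments.items():
    # two-sample t-test of each treatment against the control
    pvals[name] = stats.ttest_ind(data, control).pvalue
    print(f"{name}: p = {pvals[name]:.3f}")
```

Rerunning with different seeds gives a feel for how often a graph like the one above pops out of a model with no drug interaction whatsoever.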
Did I pick a particular outcome of this simulation to make a dramatic graph? Of course, but it’s not unrepresentative. In fact, of the simulated cases in which Treatment 1 and Treatment 2 each have p > 0.05, over 70% have p < 0.05 for Treatment 1 x Treatment 2! Put differently, conditional on each drug having an “insignificant” effect alone, there’s a 70% chance of the two together having a “significant” effect, not because they’re acting together, but just because multiplying two numbers greater than one gives a larger number, and a larger number is more easily distinguished from 1!
(Of course, I could have made different assumptions about noise, made up different effect sizes, etc. The reader can feel free to play with this…)
The real point here, though, is not that one must be aware of the subtleties of p-values, or that one should simulate one’s data (both true), but that one shouldn’t make binary (true/false) statements about data. Sadly, I see this all the time in biology — a perverse desire to say “this treatment has an effect,” or “… doesn’t have an effect”, rather than “the magnitude of the effect is [x +/- y]”. Why is this so terrible?
On binary statements
First, because “binary” statements can’t be sensibly combined. If this were computer science, or mathematical logic, they could — we all know how to apply ANDs and ORs and NOTs, etc., to boolean variables — but for measurements with noise, especially poorly characterized noise, this is hopeless.
Second, it is almost never necessary to combine boolean statements. Only at the “end” of some scientific inquiry, deciding, for example, whether we use what we know to prescribe or not prescribe some drug, could we need a binary, true/false value. Before that, to build from one study to another, we necessarily lose information in going from a quantitative measure (like an effect size) to a binary one (p < 0.05, “there is an effect”).
Third, everything always has an effect. Over a hundred years ago, Émile Borel showed that the motion of a gram mass a few light years away perturbs the trajectories of molecules in a gas enough to prevent prediction of their paths for more than a few seconds. It’s absurd to think that anything exists in isolation, or that any treatment really has “zero” effect, certainly not in the messy world of living things. Our task, always, is to quantify the size of an effect, or the value of a parameter, whether this is the resistivity of a metal or the toxicity of a drug.
I can understand the temptation for binary statements — they make things simpler, and it would be easier to wade through the billions of papers published every week if we could distill them into blunt bullet points — but they’ll lead, necessarily, to nonsense.
It would have been interesting to discuss this with the speaker, but he was one of those who dash into a week-long meeting just before his talk and dash off immediately after it. Why chat with us when there’s cancer to be cured!
Today’s illustration: A pepper (watercolor and, mostly, colored pencil.) Based on an illustration from a book whose title and author I can’t remember. Peppers start with “p” — this is a true statement.
Update: Andrew Gelman nicely notes that he has written about similar issues of binary thinking. (I would bet that I read these long ago, and internalized their messages!) See:
I think the second post is particularly interesting; I disagree with a very interesting sentence in it, that “We live in an additive world that our minds try to model Booleanly,” but I won’t bother commenting on that for now.
E. Borel, Le Hasard, 1914. The discussion is quite unclear. The topic has been dealt with many times; see e.g. L. Brillouin, “Poincaré’s theorem and uncertainty in classical mechanics,” Inf. Control 5, 223–245 (1962).
11 thoughts on “How do I hate p-values? Let me count the ways…”
Of course the real horror is that p = .06 is considered a horror while p = .05 is great.
I was thinking exactly that when I saw the pumpkin photo!
Thanks – just found this (via Andrew Gelman’s blog). I make many of these points to clinicians (and statisticians) that I work with on a regular basis, so it’s great to have somewhere else that I can point them, if only so they can see that it isn’t just me that thinks these things!
Thanks! A lot of points like this come up in things like journal clubs that I’m involved with, but they’re never written down, I think both because it’s time-consuming, and also because it’s hard to criticize papers publicly (i.e. outside something like a journal club) without seeming mean-spirited.