It’s again time for The Year in Books! (Previous posts: 2019, 2018, 2017, 2016, and 2015.) 2020 was challenging in many ways. One consequence of the Covid-19 pandemic was the difficulty of getting books, very minor of course compared to many misfortunes both personal and global, but annoying nonetheless. The Eugene Public Library has been closed to visitors since March, putting a long pause in the kids’ and my ritual of visiting almost every Saturday for the past fourteen years. (Libraries in other parts of the country are open, and one could have imagined policies here that would allow browsing without sitting, perhaps for limited numbers of people, especially given that Oregon’s Covid-19 cases are the fourth-lowest per capita of the fifty states. It has become clear to me, though, that I place a much higher value on libraries and schools than most people do. And the downtown library faces challenges common to urban libraries, which I’m sure complicate opening.) The university libraries have been closed as well for most of the time, though they were open for parts of Fall. It’s possible to request books for pickup; we’ve made good use of this, especially from the public library. Still, it’s not the same as browsing the stacks, especially to leaf through possibilities or pick up books at random. There are e-books of course, but they’re not as satisfying as books on paper. Despite all this, the number of books I read this year is about the same as in past years, and many have been excellent.

I think I read more dull books than usual this year — things that I’d rate as 3 stars out of 5, not terrible but not great. Like last year, there are several books that other people like a lot that I don’t: Elena Ferrante’s *My Brilliant Friend* (2012), which blandly lists events happening to interesting characters; *The Good Soldier* (1915) by Ford Madox Ford, conversely a stylish book about boring people; *Rabbit, Run* by John Updike (1960), which allows me to cross John Updike off my list of writers I should read; Saul Bellow’s *Herzog* (1964), which I read two-thirds of before deciding that there are many better ways to spend time; and *The Bridge of San Luis Rey* by Thornton Wilder (1927), which was wildly popular when published and won the Pulitzer Prize.

But it wasn’t all dull! There were some excellent books as well. Two highlights were written recently: *A Gentleman in Moscow* by Amor Towles (2016), about a Russian aristocrat held captive in a hotel when the revolution comes, watching four decades of history and becoming an adoptive father to a young girl, and *Circe* by Madeline Miller (2018), a fresh retelling of the Greek myth from the witch’s perspective, managing to be grand but also intimate.

I re-read Raymond Chandler’s *The Big Sleep,* which was as great as I remembered, then realized that I hadn’t read the subsequent novels featuring hard-boiled detective Philip Marlowe. The next in the series, *Farewell, My Lovely* (1940), is even better — beautifully written, puzzling but not needlessly convoluted, and with insights into corruption and friendship. The next, *The High Window,* isn’t as good.

Two other favorites of the past year are both set in India:

Many years ago I read *Swami and Friends* by R. K. Narayan (1935), but this year I read it aloud to my younger son. It’s about a ten-year-old boy in a small town in Tamil Nadu — his experiences, friends, yearnings, adventures, and misadventures. The first few chapters are a bit dull, but after that it’s wonderful, capturing not only the odd logic of childhood, but the characteristics of school and life in a very different era. It’s often very funny, sometimes shocking, and always warm. It reminds me a lot of Tom Sawyer, both in its spirited but sometimes delusional protagonist and in illustrating how different the past is from the present, with childhoods of vastly greater freedom than is now the case, but also with far more violence and cruelty.

Just yesterday, I finished Rudyard Kipling’s *Kim* (1901) — a remarkable book: exciting, dense, and colorful. It’s the story of teenage Kim, an orphan in end-of-nineteenth-century Lahore, who wanders India with a Tibetan Buddhist lama he befriends, and also becomes a spy enmeshed in the “great game” of colonial powers. The contrast between the worldly tactics of espionage and the ethereal concerns of the lama is striking and often comical. Throughout, the novel has a remarkable warmth towards all its characters and towards India. In this it’s rather like a children’s novel, which it is often classified as, but it’s an excellent read for adults as well; I’m not surprised that it’s on many “best novels” lists. (See for example this, and the others noted in it.) The e-book edition I read has a 30-page glossary at the end, which I didn’t notice until I swiped past the last page of the text — another drawback of not having a physical book in hand.

The first of my two non-fiction favorites of the past year is comedian Trevor Noah’s *Born a Crime* (2016), a memoir about growing up in South Africa during the end of Apartheid and the time shortly after. It’s an amazing book — fast-moving, insightful, and a record of a childhood and adolescence packed with far more experiences, good, bad, and bizarre, than a hundred “normal” lives. It’s structured as several short essays, roughly chronological. I’ve never watched or heard Trevor Noah, and I picked this up because my older son was assigned it in high school (Grade 9). (He liked it as well.)

It’s possible to find e-books randomly, too, and from a spy genre list I came across *I Was a Spy!* by Marthe McKenna (1932). It’s the riveting, fast-paced memoir of a Belgian woman who becomes a spy for the Allies during WWI. It deserves to be better known. Not only does it paint a picture of the hardships of life during occupation by enemy troops, it sketches how an ordinary, though clever and brave, person ends up in espionage. I wouldn’t be so fond of the book if it weren’t true — in several places I would have been incredulous about the plot if it were fiction! The writing is very spare and could benefit from more description of places and people, but it serves its purpose of telling a fascinating story.

I can’t predict what the coming year will bring, but hopefully it won’t be too long before I’m back inside the library. I should avoid dull books, and perhaps be more zealous about leaving dull books unfinished — there are plenty of good ones out there!

I’ve been revising book illustrations and so haven’t painted anything else decent recently. Here’s a quickly painted zebrafish.

— Raghuveer Parthasarathy, January 1, 2021

A while ago I announced that I’ve been working on a popular science book about biophysics, and a few months ago I noted that the first draft is done. Since people have asked about the present state of the book, I thought I’d make a quick post about it. Princeton University Press sent the manuscript to external readers, and I was delighted to learn a few days ago that the readers as well as the Editorial Board were unanimously positive in their assessments. They like the text as well as the illustrations, which is heartening. All the lights are green, though there are some revisions to do.

If you haven’t read the earlier posts or heard from me about the book, **here’s a short summary.** Physical principles govern and unify the dazzling variety of forms, behaviors, and interactions that living things display. The book explores this biophysical perspective, not only illustrating how it lets us make sense of natural phenomena like the packaging of DNA in viruses or the proper placement of your limbs, but also how it enables technologies like DNA sequencing and genome editing (e.g. CRISPR). We also learn why elephants can’t jump. By the end, we’ve hopefully gained a deeper appreciation of the wonders of life, and life’s connections to more general laws of nature.

The section that will take the most thought is the introduction, which was one of the hardest parts to write. Ideally, an introduction should hook the reader, explain the rationale for writing, and give a sense of the structure of the book without being tedious or opaque. I find this especially challenging because I generally don’t care about introductions; I often skip them. (My editor was scandalized by this admission.) Nonetheless, the introduction will need some re-thinking in the coming weeks — more by far than any other part of the book.

A change I’ve started, though I don’t know how it will end, is to scrap the title of the book. It was *Building Life,* which I was never very fond of. (This was actually the third or fourth title I came up with over the years.) By the time of the first complete draft this summer, I had changed it to *Sculpting Life,* or technically *Sculpting Life: How universal rules and random chance shape the living world.* I find it annoying that every book these days has a subtitle, but trying to change this is hopeless. Neither my editor nor I really like *Sculpting Life*, though. What’s something catchy but elegant that conveys “biophysics”? Suggestions are welcome!

* See here for the reference

What’s the current timeline? About nine months separates the final version of the manuscript from the book’s debut as something people can buy. The two possible targets for publication are early November 2021 and January 2022. The former requires me to get all my revisions done in the next few weeks, which is daunting. The latter gives me a bit more breathing room. We’ll see what happens…

As noted in the original announcement, if you’d like to get an email when the book is published, or be notified if anything significant changes, you can **enter your email address** in this form.

If you can’t remember if you signed up: if you did, you got an email about this post!

There are a handful of illustrations I will revise. My least favorite of them all was a drawing I made of DNA wrapped around histone proteins. I re-did this recently, and I like the new version much better. Here’s a photo, not a nice scan, of the old (colored pencil, above) and new (watercolor, below) versions.

**— Raghuveer Parthasarathy, December 18, 2020**

This week: the last of my commentaries on Makin and Orban de Xivry’s Common Statistical Mistakes! (Previous posts: #1-2, #3, #4, #5, #6, #7, #8.) I’m lumping together comments on “Mistake #9” (Over-interpreting non-significant results) and “Mistake #10” (Correlation and causation), as well as concluding remarks, writing one long post instead of two or three smaller ones for a variety of reasons: next week’s journal club meeting is the last of the term, Mistake #10 doesn’t require much commentary, and I’m growing tired of painting seashores.

The mistake is believing that the failure of a hypothesis test (i.e. p > 0.05) indicates the absence of an effect. This is a bizarre thing to believe, yet it is sadly common. To illustrate, here’s some real data on children’s heart rates, specifically the difference in heart rate from the resting value when the kids were anticipating, but hadn’t yet started, exercising. (See Note [1] for the source.)

Comparing the data to zero with a t-test, we find p = 0.116. Would we conclude that there is no effect of anticipation on heart rate? The above data, however, are just a random subset of all the kids’ measurements. Here’s the complete data:

Now, p = 0.0015, i.e. it’s “significant.” Do we believe that an anticipatory effect on heart rate has magically appeared? Of course not; it was there all along. Once again, we must keep at the forefront of our thoughts that a p-value tells us **nothing** about effect sizes, or about the absence of effects. In fact, the mean of the subset’s values is 6.9 beats per minute, close to the full set’s 7.2. (The uncertainty of the mean is, of course, lower in the latter case.) If our goal were, as it should be, to estimate the effect size, rather than make unsound statements about whether effects “exist” or not, we would not trip ourselves into illogical and contradictory conclusions.
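If you’d like to tinker with this yourself, here’s a minimal Python sketch. The numbers are a simulated stand-in for the heart-rate data (Gaussian, true effect 7 beats per minute, noise standard deviation 12; illustrative values, not the real dataset), and I’ve used a sign-flipping permutation test in place of the t-test so that the example needs only the standard library:

```python
import random
import statistics

def sign_flip_p(data, n_perm=5000, seed=0):
    """Two-sided p-value for 'mean = 0' via a sign-flipping
    permutation test (no t-distribution machinery needed)."""
    perm_rng = random.Random(seed)
    observed = abs(statistics.fmean(data))
    hits = 0
    for _ in range(n_perm):
        flipped = [x * perm_rng.choice((-1, 1)) for x in data]
        if abs(statistics.fmean(flipped)) >= observed:
            hits += 1
    return hits / n_perm

# Simulated stand-in for the heart-rate data: true effect 7 bpm,
# noise sd 12 bpm (illustrative numbers, not the real dataset).
rng = random.Random(42)
full = [rng.gauss(7, 12) for _ in range(60)]
subset = full[:12]              # a smaller sample of the same effect

p_sub, p_full = sign_flip_p(subset), sign_flip_p(full)
print(statistics.fmean(subset), p_sub)
print(statistics.fmean(full), p_full)
```

The two means come out similar; the p-values need not, which is the whole point.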

As a second illustration, let’s simulate some data in which the true effect sizes are 4, 2, and 1 for the blue, red, and yellow points, respectively (left to right in the plots). In other words, the blue set demonstrates the largest effect. The data are drawn from Gaussian distributions with standard deviations 4, 1.5, and 0.4, respectively.

In the instance shown above, with N = 7 data points each, the p-values (for comparison to zero) are 0.17, 0.06, and 0.0007, respectively — only the yellow set gets the “*” of p<0.05. Suppose each set measures some phenotype following the knockout of some gene; would we conclude that only the yellow gene is important for the process we’re studying? Of course not… but I’ve seen this *often* in papers. In addition to mis-understanding what a p-value means, this mistake presumes we’re at a special point in time, where exactly this amount of data is the right amount. What would happen if we were to take more data? Think about this, while I put up a random draw with the same parameters but N = 30:

Here, p = 3.1 x 10^{-8}, 1.0 x 10^{-8}, and 1.4 x 10^{-11}, respectively. All are “significant,” which shouldn’t surprise us because the p-value is always a decreasing function of N. If we take more data, p will *always* drop below 0.05. (Pedants may point out that this is only true if the true effect size is nonzero, but in real life, the effect size is always nonzero. Even the motion of stars a billion light years away pushes blocks on earth; surely, whatever treatment you apply has some indirect, n-th order effect on your system that’s at least as large.)
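The shrinking of p with N is easy to see numerically. A sketch with a single true effect of 1 buried in noise of standard deviation 4 (illustrative numbers, and a normal-approximation test standing in for the t-test):

```python
import math
import random

def z_test_p(data):
    """Two-sided p-value comparing the sample mean to zero,
    using a normal approximation (fine for large N)."""
    n = len(data)
    m = sum(data) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in data) / (n - 1))
    return math.erfc(abs(m) * math.sqrt(n) / sd / math.sqrt(2))

rng = random.Random(1)
ps = {}
for n in (10, 100, 1000, 10000):
    sample = [rng.gauss(1, 4) for _ in range(n)]  # true effect: 1
    ps[n] = z_test_p(sample)
    print(n, ps[n])  # p generally marches downward as n grows
```

Swapping in the blue/red/yellow parameters from the plots shows the same behavior; only the rate of descent differs.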

Returning to our smaller *N* case, it would be fine to say that the most precisely measurable effect is the yellow one, and therefore focus further attention on that, but it’s wrong to say that it’s the only effect, or the dominant effect, or the most significant effect (in the conventional use of the term).

Makin and Orban de Xivry write as a solution to this mistake that “an important first step is to report effect sizes together with p-values in order to provide information about the magnitude of the effect.” And, of course, “researchers should not over-interpret non-significant results and only describe them as non-significant.” It’s worth noting that the opposite of this mistake is also a mistake: over-interpreting statistically significant results. “Statistical significance” isn’t the same as “significance.”

I think we all know that correlation doesn’t equal causation. (See also #4, Spurious Correlations.) Sometimes people seem to ignore this. This isn’t so much a statistical mistake as a failure of scientific reasoning.

The question of how one *can* infer causation is a difficult one. Thankfully, in experimental fields, one can at least try to perform controlled experiments, altering only the variable of interest and examining its effects. Some fields aren’t so fortunate, relying solely on observation, though sometimes “natural experiments” in which some stimulus is added or removed are possible. (For a thought-provoking recent example on car seats and birth rates, see here.) There’s a lot out there on inferring causation in more challenging contexts, intersecting philosophical questions about what causality means. This looks fascinating, but I’ve never had time to look into it. Perhaps someday… (On the list: *What If* by Hernán & Robins, and works by Judea Pearl.)

I wrote in the first post that statistical mistakes, at least of the sort that plague a lot of contemporary research, are almost never about math. As we’ve seen, discussion of these errors hasn’t required equations or derivations. Rather, the issues involve more fundamental relationships between signals, noise, and how we draw conclusions from them.

As Makin and Orban de Xivry note, their list is very closely tied to the process of null-hypothesis significance testing, and “a key assumption that underlies this list is that significance testing (as indicated by the p-value) is meaningful for scientific inferences.” Many would argue that significance testing is *not* meaningful. Some would counter that by arguing that if used properly, p-values can be informative. Others (including me) would counter that by noting first that they convey no information that isn’t better conveyed by effect sizes and confidence intervals, and second that they are so amenable to mis-use that they *invite* unreliable and incorrect science.

More broadly, I wonder whether the underlying problem is a desire for certainty that is unwarranted by the messiness of real data. We seek strong conclusions and turn to statistical tools that seem to promise clear-cut answers, not thinking enough about what the tools really mean, and what the data are really like. Perhaps thinking about the ways these tools can be mis-applied can foster a broader understanding of how to tackle subtle, complex problems. (See also Discussion Topic #3.)

**Discussion topic 1.** I claim that one of the reasons Mistake #9 is so prevalent is because of the unfortunate term “significant” to denote data with p<0.05. Do you agree? What would be a better term?

**Discussion topic 2.** Suppose null-hypothesis significance testing didn’t exist. Which of the 10 common mistakes would still exist, perhaps in some modified form?

**Discussion topic 3.** My conclusions section is pretty minimal. What’s missing in it? Perhaps comment on the publication process, how we teach statistics, etc. (Do we foster in students a clear understanding that noise is real? I suggest activities that involve measuring things.)

**Exercise 9.1** Download the heart rate data noted in [1] below. Repeatedly take random subsets of the data of size *N*, and calculate the mean change in heart rate from anticipating exercise, and the mean p-value, as a function of *N*. Plot this, with error bars indicating the standard deviation.

**Exercise 9.2** As in 9.1, plot p-values vs. *N*, but this time for simulated data as in the blue/red/yellow plots, with the mean and standard deviation values given in the post.

This seashore painting is based on a photo I took last week at Heceta Head — the photo is below. Though I’m done painting seashores for a while, it’s been enjoyable, and I’ve learned that I can use white ink.

— **Raghuveer Parthasarathy,** November 29, 2020

[1] From Louise Robson, University of Sheffield, very nicely described and made available at https://www.sheffield.ac.uk/polopoly_fs/1.33424!/file/Heart-teachers-notes.doc

This week’s installment of comments on Makin and Orban de Xivry’s Common Statistical Mistakes deals with #8: **Failure to Correct for Multiple Comparisons.** (Previous posts: #1-2, #3, #4, #5, #6, #7.)

Makin and Orban de Xivry’s description is rather complex, but the error is a simple one. To illustrate: suppose we have a control group (gray in the plot below) and several different test groups, for example populations exposed to different drugs, or different genes whose expression levels we’re measuring, or pixels of different intensities in an MRI image. We see one of these (orange) is “significantly different” if it’s compared to the control set, with p < 0.05, and we conclude that we’ve found something meaningful.

The problem is essentially the same as the previous topic of p-hacking. (I would have grouped these into one item if I were writing the *eLife* article.) This is convenient, since I spent too much time on last week’s post and so can spend less time on this one! I’ll phrase the underlying issue this way: A p-value isn’t simply a property of the data alone, but also carries with it aspects of the *process.*

The process, in the example above, is looking at multiple measurements. We can’t just ask, “is the probability less than 0.05 that *this dataset* would show a particular value?” Rather, we have to ask, “is the probability less than 0.05 that *any of several datasets* would show a particular value?” The odds of rolling a die and getting a “6” are 1/6; the odds of rolling 6 dice and finding at least one “6” are much higher (67%).
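The dice arithmetic, and its analogue for multiple hypothesis tests, takes only a couple of lines. (The second loop assumes independent tests, which is only approximately true when groups share a control.)

```python
# Probability of at least one "6" when rolling six fair dice:
p_dice = 1 - (5 / 6) ** 6
print(round(p_dice, 3))  # about 0.665

# The same complement trick for k independent null tests at alpha = 0.05:
for k in (1, 5, 20):
    print(k, round(1 - 0.95 ** k, 3))
```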

Here’s another way to think about this: In the plot above, the mean and standard deviation of the orange points are 1.46 and 0.82, respectively; the control’s are 0.64 and 0.75. If I had only measured the control and orange datasets…

… the mean and standard deviation of the orange points are 1.46 and 0.82, respectively; the control’s are 0.64 and 0.75. Obviously, nothing has changed. However, in the latter case, p = 0.03; there’s a 3% chance that two draws from an identical random number generator (as I used) would give values as different as we see. In the former case, *the numbers are exactly the same,* but the true p-value, not a pairwise comparison but the actual probability of finding a dataset that’s “significantly different” than the control, is much more than 3%. (See Exercise 2.) For the p-value to mean anything, we have to account for the multiple comparisons we could make.

There are statistical treatments for correcting p-values for multiple comparisons, but less abstractly one could simulate the properties of multiple random processes as occur in one’s experiment (e.g. the 5 test groups above) and calculate what the likelihood of various observations is. What does p < 0.05 actually look like? Even better, one can use the multiple comparisons as a starting point for new experiments that take away this freedom, experiments that look specifically at the group that may be demonstrating an effect.

**Discussion topic 1.** I claim that this is the same issue as p-hacking. Do you agree or disagree? Explain.

**Discussion topic 2.** Why is it difficult to detect whether researchers have committed this error when reading papers?

**Exercise 8.1** You roll twenty-one twenty-one-sided dice. The probability that any particular one of them shows “12” is 1/21, which is less than 0.05. What’s the probability that at least one of the dice shows “12”? (You should be able to quickly write an equation for the exact solution.)

**Exercise 8.2** Recreate the graphs above, where each set of points is 10 values drawn from a Gaussian distribution of mean = std. dev. = 1. Simulating this many times, what’s the probability that at least one dataset has a mean that differs from the control set’s mean by 0.82 or more (the difference between orange and gray above)?

Water, shore. You know the routine.

— Raghuveer Parthasarathy, November 20, 2020

This week’s commentary on Makin and Orban de Xivry’s Common Statistical Mistakes covers #7: **Flexibility of Analysis: p-Hacking.** (Previous posts: #1-2, #3, #4, #5, #6.)

I feel like this has been discussed *ad nauseam,** yet the problem still exists. The issue is that flexibility in how one analyzes data, even seemingly innocuous flexibility, can falsely lead to a “p < 0.05” claim of statistical significance. Even if applied properly, p-values are a mediocre measure, an easily mis-interpreted statistic that reveals nothing about the effect sizes and uncertainties one should care about. But it’s worse than that: p-values are easy to apply improperly. (* See e.g. Kerr 1998, Simmons 2011, Head 2015, Gelman 2013 and many, many more.)

Suppose, for concreteness, that one measures a bunch of values as shown (blue points), perhaps the change in people’s heart rate after taking some drug, and one wants to assess whether the signal is “significantly” different from zero. Suppose further that p > 0.05 for the dataset — a “non-significant result.” I did, in fact, draw the simulated points from a distribution with zero mean, so the true effect size is zero. How might one massage the points to yield significance?

An obviously awful approach would be to simply remove the low points, perhaps arguing that they’re outliers or otherwise bad, shifting the mean above zero and p to something less than 0.05. But one doesn’t have to be so deliberate to get a similar result.

Suppose instead that one decides, in retrospect, that one should have considered men and women separately. Slicing the data this way, one finds that for women (x’s in the plot), the subset of data compared to zero gives p<0.05, and one publishes the result that there’s a significant effect of the drug on women.

This is deeply wrong. Why? The probability that a bunch of random points with zero mean gives p<0.05 is 5%, by definition. The probability that *either* those points or a *random* subset of those points gives p<0.05 is something more than 5%. (It couldn’t be less!) Again using our simulated example we could ask: what’s the probability that if we generate 30 random points, and then pick half the datapoints at random, *either* the full set or the subset gives p < 0.05 when compared to zero in a simple t-test? The answer is about 8%. (It’s very easy to simulate.) Seeing “p < 0.05” in a paper, in other words, only means p<0.05 if the researchers didn’t have the flexibility to decide on their analysis after the fact. (Or, if they described and corrected for the more flexible analysis.)
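The “full set or a random half-subset” rate is indeed very easy to simulate. A sketch, using a normal-approximation test rather than a t-test (slightly anti-conservative at these sample sizes, so the rate comes out a bit higher than the 8% quoted above):

```python
import math
import random

def p_mean_zero(data):
    """Two-sided p-value for 'mean = 0' (normal approximation)."""
    n = len(data)
    m = sum(data) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in data) / (n - 1))
    return math.erfc(abs(m) * math.sqrt(n) / sd / math.sqrt(2))

rng = random.Random(0)
sims, hits = 3000, 0
for _ in range(sims):
    data = [rng.gauss(0, 1) for _ in range(30)]   # true mean: zero
    subset = rng.sample(data, 15)                  # after-the-fact "slice"
    if p_mean_zero(data) < 0.05 or p_mean_zero(subset) < 0.05:
        hits += 1
print(hits / sims)  # noticeably more than 0.05
```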

There are many ways to p-hack. One that I wrote about a long time ago (2013!) is to keep running your experiment! Suppose p > 0.05; why not keep collecting data, and then stop if the new, larger set of data gives p < 0.05? After all, it’s definitely true that more data, and more information, are good! It’s worth pausing and thinking about this, if you haven’t before.

The answer is that the p-value is itself a random number, and fluctuates with the fluctuations of the data. Again by chance, we might find p < 0.05, and the freedom to stop whenever we want makes it more than 5% likely that somewhere along the way we’ll find p < 0.05. It’s easy to imagine, however, a researcher not realizing this, and happily publishing a “discovery.” It’s also easy to imagine the analysis method not being transparent to other researchers — when’s the last time you read a paper that described the authors’ decisions about when to stop collecting data? The answer is never. The reader is simply presented with a p-value.
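This, too, can be sketched in a few lines: zero true effect, peeking after every ten new points up to a hundred, stopping at the first p < 0.05 (again with a normal-approximation test, itself a bit loose at small N):

```python
import math
import random

def p_mean_zero(data):
    """Two-sided p-value for 'mean = 0' (normal approximation)."""
    n = len(data)
    m = sum(data) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in data) / (n - 1))
    return math.erfc(abs(m) * math.sqrt(n) / sd / math.sqrt(2))

rng = random.Random(0)
sims, hits = 2000, 0
for _ in range(sims):
    data = []
    for _ in range(10):                     # up to ten "looks"
        data += [rng.gauss(0, 1) for _ in range(10)]
        if p_mean_zero(data) < 0.05:        # stop as soon as "significant"
            hits += 1
            break
print(hits / sims)  # well above 0.05, despite zero true effect
```

The freedom to stop whenever p dips below 0.05 turns a nominal 5% false-positive rate into something several times larger.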

Again, this problem of p-hacking is well known. There’s even an xkcd comic about it! (It’s a great comic, though one should realize that the p-hacking is there even if we don’t search through all 20 M&M colors…) It goes by many names: data dredging, data slicing, HARKing, and my favorite, the Garden of Forking Paths (from Andrew Gelman; see e.g. here, or many blog posts). There are always many paths the researcher analyzing data can take; the *existence* of these paths must influence our interpretation of likelihoods.

As Makin and Orban de Xivry note, one solution to the problem of p-hacking is to pre-register one’s experimental design and analysis methods, and stick to the plan. This avoids the possibility of forking paths. It does, unfortunately, also make it harder to discover (true) unexpected things.

Another approach, one that Andrew Gelman often notes, is to account for the multiple paths in one’s analysis. In other words, actually calculate the probability that the process you used — a flexible stopping time, for example — shows an effect even if the true effect size is zero.

Yet another approach is to do a new experiment, with the analysis path suggested by previous exploration now fixed. (This is similar to pre-registration.)

And, of course, one can do science that doesn’t rely on p-values, and push (somehow) for a world in which publication isn’t tied to silly significance tests.

**Discussion topic 1.** Here we have two examples of flexibility in data analysis. Think of others.

**Discussion topic 2.** If you read p = 0.04 in a paper, do you believe that there’s only a 4% chance that the observed effect size could arise by chance? Why or why not?

**Exercise 7.1** In the first example above, I stated that there’s an 8% chance that either a random subset of half the datapoints or the full dataset shows p < 0.05, for a t-test of Gaussian points of mean zero and standard deviation 1. Reproduce this. Then: what if I allow myself to take two random subsets? Three? Four? Show how this probability increases with the number of slices I allow myself to make.

**Exercise 7.2** Collecting more data gives more information, and allows better estimates of whatever it is one is trying to measure. As noted above, collecting more data can make it easier to reach *false* conclusions of statistical significance. Explain how both of these statements can be true.

Another rocky shore. This one took much longer than I expected, without a great payoff. I should have made use of the flexibility of stopping times and left it unfinished. On the other hand, spending bits of time on it offered some respite during an extremely packed week. I based the painting on this video of a rocky shore. It’s amazing that so much technology is being harnessed to bring a ten-hour video of a shoreline to us, for free.

— Raghuveer Parthasarathy, November 13, 2020

Next in our series of commentaries on Makin and Orban de Xivry’s Common Statistical Mistakes, #6: **Circular Analysis.** (Previous posts: #1-2, #3, #4, #5.)

I was thinking of skipping this one entirely. It’s less dramatic than #5 or the upcoming #7, I’m not sure I fully understand the authors’ intent, and my seashore painting is a step down from last week’s. But, having spent more time than I should mulling it over, I think there’s an important general point of which this mistake is a manifestation. And besides, I should be consistent in my commenting.

The mistake of “Circular Analysis” lies in using overlapping criteria for selecting data to analyze and for the analysis itself. To illustrate this cryptic statement, consider the following datapoints, let’s say representing immune cell concentration in the blood of 40 individuals:

Each individual is treated with some drug, and then the immune cell concentration is again measured; we wish to know how much it has changed. (Please, dear reader, don’t ask if the concentration “is different.”) The new, post-treatment values are shown in the right hand plot. If they look similar to the old ones, that’s because the effect size is zero; I simulated 40 points with exactly the same mean (100) as the other set.

Perhaps, one imagines, there are individuals with normally high immune cell levels, or we care about values above some threshold. We take the top 25% of the pre-treatment set (orange borders, below), assess what happens to *these* individuals post-treatment (orange lines)…

… and conclude (**wrongly**) that there’s a large effect! In the instance shown, the difference between the subset’s pre- and post-values is about 20, with p = 0.006 (i.e. “significant”).
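A few lines of simulation reproduce this delusion. Both measurements are drawn from the same Gaussian (mean 100, standard deviation 20, as in the example), so the true effect is exactly zero; selecting the top quarter by pre-treatment value manufactures an apparent response. This sketch averages over many runs to show the typical size:

```python
import random

rng = random.Random(0)
n, sims = 40, 500
total = 0.0
for _ in range(sims):
    pre  = [rng.gauss(100, 20) for _ in range(n)]
    post = [rng.gauss(100, 20) for _ in range(n)]   # true change: zero
    # Select the top 25% by *pre*-treatment value, as in the example:
    top = sorted(range(n), key=lambda i: pre[i])[-n // 4:]
    total += sum(pre[i] - post[i] for i in top) / len(top)
print(total / sims)  # a large apparent "drop," purely regression to the mean
```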

If you’re thinking that there’s nothing particularly “circular” about this mistake, I agree. The term apparently precedes Makin and Orban de Xivry; they cite a 2009 neuroscience paper that uses this phrasing. That paper, *cited over 2000 times to date*, argues that this major flaw is common in neuroscience, especially in imaging-based analysis (e.g. fMRI).

More importantly, you may be thinking: “Isn’t this just regression to the mean?” I agree with that, too. Makin and Orban de Xivry comment that this is similar, and also note a resemblance to overfitting. Overfitting is applying a model that’s more complex than the data warrants. The orange curve below, for example, is terrible, even though it fits the data points better than the red line. The orange curve follows the wiggles of noise as if they were real, and so will be less robust to new data.
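Overfitting, too, is easy to demonstrate numerically: a degree-7 polynomial threaded exactly through 8 noisy points (zero error on the data it was fit to!) predicts fresh data worse than a plain straight line. This is an illustrative sketch, not the curves plotted above:

```python
import random

def fit_line(xs, ys):
    """Least-squares slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def interp(xs, ys, x):
    """Evaluate the unique degree-(n-1) polynomial through all points."""
    total = 0.0
    for i in range(len(xs)):
        term = ys[i]
        for j in range(len(xs)):
            if j != i:
                term *= (x - xs[j]) / (xs[i] - xs[j])
        total += term
    return total

rng = random.Random(0)
xs = [i / 7 for i in range(8)]             # 8 fit points on [0, 1]
new_x = [(i + 0.5) / 7 for i in range(7)]  # fresh points between them
line_err = interp_err = 0.0
for _ in range(200):
    ys = [x + rng.gauss(0, 0.2) for x in xs]        # truth: y = x, plus noise
    new_y = [x + rng.gauss(0, 0.2) for x in new_x]  # fresh noisy data
    slope, b = fit_line(xs, ys)
    line_err   += sum((slope * x + b - y) ** 2 for x, y in zip(new_x, new_y))
    interp_err += sum((interp(xs, ys, x) - y) ** 2 for x, y in zip(new_x, new_y))
print(line_err / 200, interp_err / 200)  # the wiggly "perfect" fit does worse
```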

That, in essence, is the deeper message of this “statistical mistake” and those like it: the flawed analysis acts as if noise doesn’t exist, as if each bump in the data were a pure reflection of the signal we’re trying to measure, rather than being influenced by all the randomness present in reality.

The solution to all these problems isn’t some better statistical algorithm, but a better understanding of what science and data *are.* Even in the best-controlled context, noise exists. There’s an undergraduate physics lab here at Oregon in which students measure the position of a particular brick on a wall. It should shock no one that there is scatter in the measurements, despite bricks being immobile. In anything more complex, whether it’s neuroimaging, the microbiome, or whatever your field is, innumerable subtle forces give even more variation. One’s measurements incorporate this noise, and one should never forget this.

*Off topic:* I’ll throw in a pointer to my book announcement page, which I should update at some point with news and a detailed chapter list. One task: thinking of a better title! Feel free to weigh in, or sign up for announcements.

**Discussion topic.** Comment on similarities and differences between “circular analysis,” regression to the mean, and overfitting. Do you agree that the underlying cause of each of these is the same?

**Exercise 6.1** In the simulated example above, the true effect size is zero; both sets of numbers are drawn from a Gaussian distribution with mean 100 and standard deviation 20. Suppose we really do want to quantify the treatment response of the “high level” individuals, say those with levels above 110, without being deluded by noise. Design an experiment that allows this. Simulate its outcomes, and comment on the sample sizes needed to uncover true effects of various sizes. (Yes, this is vague, and unlike the other exercises I haven’t worked through it myself.)

A rocky shore, with a circular pool.

— Raghuveer Parthasarathy, November 5, 2020

The microscopic world is deeply unintuitive, in large part due to the incessant randomness of Brownian motion. I was reminded of this when a colleague (Ben McMorran) pointed me to the timely topic of face masks and the physics that governs how they work. (Ben is collecting readings on this subject to discuss with undergraduates.) **How does the capture efficiency of a facemask depend on the size of the particles moving through it?** The answer is not at all obvious, though it’s simple in retrospect, and the question led to a nice discussion plus a homework problem I wrote for my biophysics class. I thought it worth writing up, in case it’s of use or of interest to others.

A cloth face mask is a fabric of fibers that passing particles might adhere to. Or, the particles might pass through. How does the likelihood of capture depend on a particle’s size?

First, as I asked my students to do, make a guess. Sketch a graph of what you imagine the capture efficiency vs. particle radius looks like. (*Do this!*)

Here are some shapes people sketched:

Before thinking about this, I probably would have guessed “*A*,” a monotonic function of size with larger objects being easier to capture than smaller ones.

In fact, the true curve is like “*B*,” with a minimum in the middle. I’ll show actual data below, but here’s a schematic diagram from a document by 3M:

Remarkably, the capture efficiency is non-monotonic. As particles get large, they’re easier to capture, but as they get very small, they’re also easier to capture.

The reason is **Brownian motion:** the unavoidable random meandering of particles is greater the smaller they are, enhancing their likelihood of hitting a fiber. Particles can also hit via simple impact, and this is more likely for larger particles. Therefore very small particles are well-captured, very large particles are well-captured, and there’s a regime in between in which the capture efficiency has a minimum. There’s a lot of physics in the details of the actual capture, but that doesn’t matter for the basic picture sketched here. These mechanisms, plus some neat long-range electrostatics, are qualitatively described in this excellent “Minute Physics” video.

Here is a great place to pause: given the statement above, try **constructing a model yourself** that gives an equation for capture efficiency vs. particle radius, as a function of whatever parameters of masks, flows, and properties of nature you think are relevant. Then plot your equation, calculate where any minima are located, and plug in numbers to estimate the particle size that’s hardest to catch. (I did this; it was fun.) Assume that fibers are perfectly sticky; anything that hits them irrevocably adheres.

To encourage you to **pause and do this**, before sketching a model myself and unduly influencing your thoughts, I’ll break up the page with (*1*) a pointer to my book announcement page, which I should update with news and a detailed chapter list at some point, and (*2*) this wonderful video of non-random motion, from Yoann Bourgeois & CCN2:

This exercise spurred me to write a homework problem that provides a bit more structure. It goes like this:

Consider the diagram shown below, a schematic cross section of a mask with a particle approaching it from the left. The large circles are the fibers of the mask; suppose their diameter and edge-to-edge separation are both *L*. The small black circle is a spherical particle of radius *a*, moving to the right with velocity *v*. The mask extends to the right, to a total thickness *NL*. The dashes indicate streamlines (i.e. how a massless point would flow).

Let’s *very roughly* estimate the capture efficiency of the mask — i.e. the probability that the particle will stick to the mask, rather than passing through. Suppose that any contact leads to irreversible adhesion [*]. There are two ways the particle can hit a fiber:

(1) **Impact:** leaving a streamline, since real particles have mass, and running into a fiber. Roughly, this probability p_{1} is proportional to the ratio of the cross-sectional area of the particle to the cross-sectional area of the fiber. (Ask yourself why this is reasonable.)

(2) **Diffusion:** meandering to a fiber via Brownian motion. Imagine a particle as illustrated; t_{a} is the time needed to travel “forward” by one fiber spacing; t_{D} is the typical time needed to diffuse “sideways” by one fiber spacing, thereby hitting a fiber. The probability for being captured is roughly p_{2} = t_{a} / t_{D} — i.e. if this ratio is small, we’re likely to pass through before wandering laterally and hitting a fiber.

**(a)** Does *N* matter? Why or why not? Suggestion: think separately about whether the particle-size dependence should involve *N*, and about whether the overall probability of capture by the whole mask should involve *N*.

**(b)** Roughly, the total capture probability p = p_{1} + p_{2}. (Obviously, this is a decent approximation only for small p, and it has to “max out” at p=1; you can impose this ceiling by hand.) Write an expression for p as a function of *a, L*, the viscosity of air, temperature, and flow velocity. Keep in mind that this is a rough, order-of-magnitude estimate. Assess whether p has a maximum or minimum as a function of particle radius, and write an expression for this value of *a*.

**(c)** Using 2 x 10^{-5} kg m^{-1} s^{-1} for the viscosity of air, *v* = 0.1 m/s as a typical flow velocity of one’s breath, and *L* = 10 microns for the fiber diameter, plot p_{1}, p_{2}, and the total p vs. *a,* and calculate the numerical value of *a* at any extremal point. (I suggest a log-log plot.)
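If you want to check your work numerically (and, fair warning, this gives away part of the answer), here’s a rough sketch of the model in Python. I’ve set all order-one prefactors to 1, taken p_{1} ~ (a/L)² and p_{2} = t_{a}/t_{D} with the Stokes-Einstein diffusion coefficient, and assumed room temperature; your own model may reasonably differ by such factors:

```python
import math

kT  = 1.38e-23 * 293      # thermal energy at room temperature, J
eta = 2e-5                # viscosity of air, kg m^-1 s^-1
v   = 0.1                 # flow speed of breath, m/s
L   = 10e-6               # fiber diameter and spacing, m

def capture(a):
    """Rough capture probability for particle radius a, with all
    order-one prefactors set to 1 and the result capped at 1."""
    p_impact = (a / L) ** 2                   # cross-section ratio
    D = kT / (6 * math.pi * eta * a)          # Stokes-Einstein diffusivity
    p_diffuse = (L / v) / (L ** 2 / (2 * D))  # t_a / t_D
    return min(p_impact + p_diffuse, 1.0)

# scan radii from 1 nm to 10 microns on a log grid; find the minimum
radii = [10 ** (-9 + 4 * i / 400) for i in range(401)]
a_worst = min(radii, key=capture)
print(f"hardest-to-catch radius: about {a_worst * 1e6:.2f} microns")
```

With these numbers the minimum lands at roughly a tenth of a micron, satisfyingly close to the most-penetrating particle sizes reported for real filter materials.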

**(d)** Compare your answer with Figures 2 and 3 of

C. D. Zangmeister, J. G. Radney, E. P. Vicenzi, J. L. Weaver, Filtration Efficiencies of Nanoscale Aerosol by Cloth Mask Materials Used to Slow the Spread of SARS-CoV-2. ACS Nano. 14, 9188–9200 (2020). Link

Your comparison shouldn’t be detailed; note the shape of the curve, any extrema, and whether it depends on *N*. (Photos of fibers are in Figure 1, by the way.)

Here’s Figure 3B, filtration efficiency (FE) vs. particle size:

If you’ve done the exercise correctly, you should be amazed!

If you want to know more about the physics of filtration, see the references cited in this paper. “Contact” involves a lot of neat physics, much of which is described in the book by Israelachvili noted in the footnote. Our point here is simply to illustrate how dominant diffusion is in setting the overall behavior of this system, as is the case for nearly everything microscopic!

I thought it was a fun exercise!

A facemask, of course! It’s not a self-portrait; my colored pencil sketch is based on this photograph of some random guy.

— Raghuveer Parthasarathy, November 2, 2020

[*] **Footnote:** van der Waals forces are very strong at short range, and are always attractive. For much more on this topic, see *Intermolecular and Surface Forces* by Jacob Israelachvili, an excellent book.

Continuing our series of commentaries on Makin and Orban de Xivry’s article on common Statistical Mistakes, let’s look at #5: **Small Samples.** (Previous posts: #1–2, #3, #4.)

This issue is simple but profound, and its prevalence is, I’ll argue, tied to more fundamental problems with how we do science. The mistake: drawing conclusions from studies with small numbers of samples (“small *N*”). This comes up all the time in one of my own areas of interest, the gut microbiome, the literature on which is filled with studies involving very few subjects, whether animals or people.

The reason small samples are a problem should be obvious: with few samples, our confidence in any estimate must be weak; fluctuations are large for small *N*. If we flip a fair coin 10 times, the probability of getting exactly 6 heads is about 20%; if we flip it 100 times, the probability of getting exactly 60 heads is about 1%. It would be foolish to conclude from the former case that the coin is unfair. There are related, less obvious problems with small samples. For example, it seems common for people to forget that p-values themselves are random variables, i.e. they are statistical estimates with an uncertainty associated with them; these uncertainties are large when *N* is small.
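The coin-flip numbers above are easy to verify with the exact binomial probabilities; here’s a two-line check in Python:

```python
from math import comb

def p_exact(heads, flips):
    """Probability of exactly `heads` heads in `flips` fair-coin flips."""
    return comb(flips, heads) / 2 ** flips

print(f"P(exactly 6 heads in 10 flips):   {p_exact(6, 10):.3f}")
print(f"P(exactly 60 heads in 100 flips): {p_exact(60, 100):.4f}")
```

The same proportion of heads, 60%, is twenty times less likely under the fair-coin hypothesis when *N* is ten times larger.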

Makin and Orban de Xivry focus on another, also not well appreciated, problem with small samples: false-positive effect sizes are *larger* than they would be for larger *N*. They note the significance fallacy: “If the effect size is that big with a small sample, it can only be true.” This is easy to see by writing a simple simulation. Here’s mine, in which we consider a random Gaussian variable *A* of mean 0 (i.e. zero true effect size) and standard deviation 1, and simulate 1000 instances each at *N* from 3 to 100. For each instance we calculate a p-value from a t-test, and for those that give p < 0.05 (about 5% of the instances, of course), we calculate the mean of *A.* We then calculate the average of the absolute value of all these means as a measure of the magnitude of (false) effect size. In other words, if by chance we found p<0.05, here’s the average effect magnitude we’d think we’ve found, as a function of the number of samples. The resulting plot:

We can easily find a large, spurious signal when our datapoints are few.
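Here’s a compact Python sketch along the lines of my simulation, comparing just two sample sizes rather than the whole range. (One simplification, flagged loudly: I use a z-test via the normal distribution rather than a proper t-test, to keep the code standard-library-only; the qualitative conclusion, that false-positive effect sizes shrink as *N* grows, is the same either way.)

```python
import random
import statistics

random.seed(2)
norm = statistics.NormalDist()

def mean_false_effect(n, trials=2000):
    """Average |mean| over trials that look 'significant', for a
    Gaussian variable whose TRUE mean is zero (z-test as a
    stand-in for the t-test)."""
    sig = []
    for _ in range(trials):
        a = [random.gauss(0, 1) for _ in range(n)]
        m = statistics.mean(a)
        s = statistics.stdev(a)
        z = m / (s / n ** 0.5)
        p = 2 * (1 - norm.cdf(abs(z)))
        if p < 0.05:
            sig.append(abs(m))
    return statistics.mean(sig)

effect_small = mean_false_effect(5)    # small N: big spurious "effects"
effect_large = mean_false_effect(100)  # large N: small spurious "effects"
print(effect_small, effect_large)
```

At *N* = 5, the trials that happen to cross the significance threshold show apparent effect sizes several times larger than at *N* = 100, even though the true effect is zero in both cases.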

Makin and Orban de Xivry note that a solution to this problem is to calculate what statistical power is possible given one’s sample size. That’s true, but an even better solution is to use more samples in your study.

Everyone knows this. (Well, perhaps not, but let’s be optimistic.) Increasing *N* is much easier said than done, however. Experiments are very, very hard. Running a research lab, I am constantly reminded of this, and I think it’s something that’s difficult for those doing purely theoretical or computational work to grasp. A thousand things can go wrong during each experiment; each datapoint is the result of lots of hard work, careful thought, and luck.

But it is nonetheless true that there is always an alternative to doing a small *N* study and drawing unjustifiable conclusions from it, and that’s not to do the study at all. There are plenty of questions that are unanswerable, or whose answers must await experiments that are capable of tackling them. Many decades elapsed between the hypothesis of gravitational waves and their detection, not because no one wanted to announce a discovery, but because detection had to wait for capable tools to exist. It doesn’t help science to do a mediocre, inconclusive study in the interim.

This runs into the unfortunate reality of how a lot of science is done. We’re influenced by limitations of money and time — for example, needing to get a dissertation written or a grant renewal submitted this year rather than ten years from now, with a university lab of a few people rather than a CERN-sized 10,000. I often wonder if many of the current replication problems in science are due to a mismatch of scale. Important, interesting questions in psychology, for example, involve such complex, noisy systems (i.e. us) that one would expect to need *N*s of tens of thousands, rather than a handful of volunteer subjects, but this is not what our structure is set up to study.

Another solution would be to retain the small *N* studies, but be much more restrained about claims drawn from them. One can imagine a world in which stating: “here are a few well-collected datapoints, perhaps eventually they’ll contribute to a clear story” would be valued, though it would perhaps be a world even more flooded with data than ours is.

**Discussion topic.** Would your favorite question related to the gut microbiome (or whatever your field is) be better tackled by ten research groups of ten people each, or one group of a hundred? One can make a good case for either — try debating the question!

**Exercise 5.1** Write your own simulation that reproduces my graph above. Or write something else that illustrates the *N*-dependence of false “effect sizes.”

**Exercise 5.2** Describe additional problems with small *N* studies, not noted above.

Instead of waves on a beach, I painted waves on a rocky shore, inspired by this. It turned out better than the sand and foam of the earlier ones in this series. Will I stick to rocks, or revisit sand for the next painting? Stay tuned!

— Raghuveer Parthasarathy, October 27, 2020

Continuing our series — see here for Part 1, and Part 2 — let’s look at Makin and Orban de Xivry’s Statistical Mistake #4: **Spurious Correlations.**

This one is easy to understand, though nonetheless common. The authors refer to situations like the one illustrated in their Figure 2, shown below, in which the correlation calculated between two variables is driven solely by a few, or even one, data points. In other words, the correlation coefficient doesn’t really represent the relationship among the data.

I’d call this “non-robust” rather than “spurious” — I associate spurious correlations with relationships that are entirely coincidental — but I don’t know if there’s a real definition of the term. When I hear “spurious correlations,” the first thing that pops to mind is a wonderful collection with that name of true, but absurd, correlations like this one (r=0.9545):

In any case, the point is that one should be wary of correlation values that rely on a small fraction of the data. Equivalently, one should be aware that simply calculating a correlation coefficient doesn’t in any way guard against this mistake.

Makin and Orban de Xivry note that there’s not necessarily anything wrong with the far-out points — they may be perfectly valid observations and not “outliers” to be removed. I was reminded of this general issue when discussing the classic Luria-Delbrück experiment in my biophysics class this week, in which rare data points far from the mean are the “smoking gun” of spontaneous mutations driving selection in a population. The “mistake” of spurious correlations arises not from the data, but from the meaning one assigns to a correlation coefficient, which is just a particular way of characterizing the data.

There are two solutions to this problem. One is to use more robust assessments of correlations, some of which are listed in the article. The more important, and even simpler, solution is to **plot your data.** This seems obvious, and thankfully it is quite common throughout the natural sciences. Seeing the spatial arrangement of the points lets us grasp their relationship much better than seeing an *r* value. I’ve been surprised, however, to routinely see lists of correlation coefficients, with no graphs to be found, in papers in economics and other social sciences.

I don’t have a lot to say about this mistake. The solution of plotting things is a powerful and general one. Of course, making plots doesn’t stop people from irresponsible analyses, but at least it makes the flaws more evident.

Not long ago I learned that “regression discontinuity analysis” is a thing — separately fitting curves to either side of some datapoint to infer the magnitude of a step that occurs at that point. In principle this is fine; in practice, noise will make it ripe for overfitting, in a way that is reminiscent of these spurious correlations. See here for an example. Again, plotting makes the flaws clear.

**Discussion topic:** What are we trying to convey when stating a (Pearson) correlation coefficient? Can one go “backwards” from an *r* value to an accurate picture of what the data look like?

**Exercises:** I couldn’t come up with any non-boring simulation exercises. I sketched one in which we generate a handful of datapoints of the form *y = 1 – x + noise* with x between 0 and 1, toss in another datapoint at (1, *y_outlier*), and plot the resulting *r* vs *y_outlier.* It’s not that exciting, however.
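For what it’s worth, here’s a minimal Python version of that sketched exercise, with the noise omitted so the numbers are exact (the eight-point layout and the outlier value of 10 are just illustrative choices of mine):

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed from its definition."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# eight points exactly on the line y = 1 - x
xs = [i / 7 for i in range(8)]
ys = [1 - x for x in xs]
r_before = pearson_r(xs, ys)   # exactly -1: perfect anticorrelation

# one extra point at (1, 10) is enough to flip the sign of r
r_after = pearson_r(xs + [1.0], ys + [10.0])
print(r_before, r_after)
```

A single high-leverage point turns a perfect negative correlation into a positive one — which is exactly why one should plot the data rather than trust *r* alone.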

**Exercise 4.1** Come up with a good simulation exercise related to this point. (Maybe see above.)

**Exercise 4.2** Find examples of non-robust correlations in papers. It has definitely come up in our journal club, but I’ve neglected to make notes of when.

I warned you that some of these posts would be rather minimal!

Yet another attempt at waves on a shore. I still can’t get the foam to look decent, but at least this time there are turtles!

— Raghuveer Parthasarathy, October 22, 2020

Continuing from the first post in this series, let’s look at Makin and Orban de Xivry’s Statistical Mistake #3: **Inflating the units of analysis.** The issue: What is *N*? In other words, how many independent data points are there, for whatever statistical analysis one wants to do? *N* is often mistakenly made larger than it should legitimately be. How?

Makin and Orban de Xivry give an example using pre- and post- values for some treatment applied to individuals. I’ll give a different, and perhaps even simpler, example: Suppose there are two groups of people, A and B. We want to see if the two groups’ weights “differ significantly” from each other. Again, this is a terrible question to ask, since every group is different from every other. Really, the only meaningful question to ask is *by how much* do they differ, but let’s ignore that in this post. For each of the 20 people in each group we take three measurements, perhaps on different days to average over variation. The results are shown here, where each individual’s measurements are the same color, and are connected by lines. The individuals’ averages are denoted by solid circles. If we perform a two-sample t-test using each measurement separately, so *N* = 60, we find *p* = 0.03, i.e. “significant.” If we use each person’s mean value, so *N* = 20, we find *p* = 0.20, i.e. “not significant.” Which is correct?

The latter analysis, of course, with *N = 20,* is correct. The repeated measurements of the same individual, while they help pin down that individual’s value, do not contribute *independent* data points with which to compare A and B.
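A simulation makes the inflation vivid. Here’s a Python sketch of my scenario (the group means, 100 and 85, and the standard deviations, 20 between subjects and 4 within, match the setup above; one flagged simplification is that I use a normal-approximation z-test in place of the t-test, to keep the code standard-library-only):

```python
import random
import statistics

random.seed(3)
norm = statistics.NormalDist()

def two_sample_p(a, b):
    """Two-sample z-test p-value (normal approximation to the t-test)."""
    se = (statistics.variance(a) / len(a) +
          statistics.variance(b) / len(b)) ** 0.5
    z = (statistics.mean(a) - statistics.mean(b)) / se
    return 2 * (1 - norm.cdf(abs(z)))

def trial(n_people=20, n_meas=3):
    groups = {}
    for name, mu in [("A", 100.0), ("B", 85.0)]:
        # between-subject sd 20; within-subject sd 4
        people = [random.gauss(mu, 20) for _ in range(n_people)]
        groups[name] = [[random.gauss(p, 4) for _ in range(n_meas)]
                        for p in people]
    flat = {k: [m for person in v for m in person]
            for k, v in groups.items()}
    means = {k: [statistics.mean(person) for person in v]
             for k, v in groups.items()}
    return (two_sample_p(flat["A"], flat["B"]),    # wrong: N = 60 per group
            two_sample_p(means["A"], means["B"]))  # right: N = 20 per group

wrong, right = zip(*(trial() for _ in range(200)))
frac_inflated = sum(w < r for w, r in zip(wrong, right)) / 200
print(f"fraction of trials with inflated significance: {frac_inflated}")
```

In essentially every trial, treating the 60 non-independent measurements as *N* = 60 gives a smaller p-value than the correct per-person analysis, because the repeated measurements shrink the apparent standard error without adding independent information.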

Makin and Orban de Xivry point out that the fix to this mistake is to “consider the appropriate unit of analysis. If a study aims to understand group effects, then the unit of analysis should reflect the variance across subjects, not within subjects.” They also note that mixed-effect models are useful, “where researchers can define the variability within subjects as a fixed effect, and the between-subject variability as a random effect.”

About “the appropriate unit of analysis,” I’ll note that this isn’t always obvious. A good philosophy, however, is to be conservative, using the smallest and most robust measure of how much data you have. Does your conclusion actually depend on stretching *N* as much as you can? If so, you may want to rethink your study.

**Discussion topic:** As with mistake #2, why might someone make this mistake? What is its appeal? (Suggestion: note how *p* depends on *N*.)

**Here are some exercises to help develop intuition:**

**Exercise 3.1** Create simulated datasets to illustrate this mistake. More concretely, reproduce the scenario of my example above, in which the mean value for the people in group A is 100 and group B is 85; the standard deviation of the individuals’ means is 20 in each case; and the standard deviation within individuals is 4 in each case. Simulate this 1000 times; what does the distribution of p-values look like for the “correct” and “incorrect” interpretations described above? Is the average p-value over the 1000 trials “significant” or “insignificant” in each case? In what fraction of trials is p<0.05 for one interpretation and >0.05 for the other? (Did I cherry pick my illustration from my 1000 trials?)

**Exercise 3.2** An interesting place this issue arises is in the comparison of growth measurements. Imagine, for example, assessing whether some treatment applied at day 0 alters the growth rate of an animal, with the animals’ size measured at days 1, 2, 3, 4, … How does one assess the treatment effect? Come up with scenarios in which the researcher makes “Mistake #3.”

Another attempt at waves on a shore.

— Raghuveer Parthasarathy, October 15, 2020
