I work mostly with Biologists and MDs who seem to love their work. A typical project I’ve consulted for involves the collection of something called single-cell RNAseq data, which basically means collecting information about gene expression for each of the many cells (often greater than 10,000 cells) in a given sample. This technology still isn’t cheap, so getting samples from multiple people is far from trivial. If the goal was to compare the mean gene expression of a particular gene between a healthy control group and some diseased population, would 10,000 cells from one healthy patient and one patient with a disease be sufficient to make this comparison?
The first time I encountered this question, I thought the answer was straightforward (one sample per group does not a meaningful statistical analysis make), but I wasn’t sure how to convince my non-statistician colleagues that this was the case. The hypothesis was about patients, not cells: the question targeted the patient level of our data, so you needed more patients, not just more cells per patient. That’s when I decided to run a simple simulation and “prove” to them that my intuition was right!
To show the issue, imagine I have two patients, one from a healthy control group and one from a disease group, and I’m only interested in one particular gene. In this scenario, let’s say that the true mean expression of this gene is the same in both groups. To simulate this, I draw two means from the same Poisson($\lambda$) distribution; call these observed draws HC for the healthy control patient’s mean and D for the disease patient’s mean. Then for each person I “sequence” a bunch of cells, where each cell’s gene expression is simulated by a draw from a Poisson(HC) distribution for patient one or a Poisson(D) distribution for patient two. Lastly, I take the mean observed expression values among the cells in each patient and do an unpaired t-test, where the sample sizes are the total numbers of cells simulated above. Our goal was to test whether the disease and healthy control groups have different mean gene expression, and since these two populations share the same mean, we would hope that our test fails to reject the null hypothesis that the means are equal. However, if you repeat the simulation above many times with a reasonable number of cells (see code, but I think even 1,000 cells is on the low end of most studies), the null hypothesis is rejected in somewhere around 80% of all tests! Since the null hypothesis is true in every simulation run, this fraction is the false positive rate (FPR), and 80% is well above the intended FPR of 5%.
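The post’s own code isn’t reproduced here, but a minimal sketch of this naive cell-level simulation might look like the following. The specific values of $\lambda$, the number of cells, and the number of simulation runs are my own assumptions, not the post’s.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)

lam = 10        # assumed true mean expression, identical for both groups
n_cells = 1000  # cells "sequenced" per patient
n_sims = 500    # number of simulated experiments
alpha = 0.05

rejections = 0
for _ in range(n_sims):
    # One patient per group; both true means come from the SAME distribution,
    # so the null hypothesis (equal group means) holds in every run.
    hc_mean = rng.poisson(lam)
    d_mean = rng.poisson(lam)

    # "Sequence" the cells: each cell is a Poisson draw around its patient's mean.
    hc_cells = rng.poisson(hc_mean, size=n_cells)
    d_cells = rng.poisson(d_mean, size=n_cells)

    # Naive unpaired t-test that treats every cell as an independent sample.
    _, p = ttest_ind(hc_cells, d_cells)
    if p < alpha:
        rejections += 1

fpr = rejections / n_sims
print(f"Cell-level false positive rate: {fpr:.2f}")  # far above the nominal 0.05
```

Because the two patients’ true means almost never coincide exactly, the huge per-patient cell counts make even that patient-to-patient noise look like a significant group difference, which is why the rejection rate is so inflated.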
This seemed like a good start towards proving my intuition, but I also wanted to show how adding more samples would bring the FPR back to its intended level. The first thing I did was to take 20 samples from each population and repeat the same process as above, using all the cells from these far more representative samples of each group to rerun the tests… but the FPR went up to 90%! Shame and despair crept in, and I fantasized about pursuing a professional badminton career about a decade too late in life. Then I realized I had forgotten my own intuition: what I needed was more measurements at the patient level, not at the cell level. I needed to compute a mean for each patient within each group, then run my t-test on those patient-level means. The true sample size was the number of patients, not the number of cells; the cells were merely used to form the observations I needed to run the right test for the question I truly wanted to answer. And now… an FPR of 0.05 (well, sometimes in the range of 0.04–0.06 depending on how many simulations I ran, but close enough to partially remove my earlier shame)!
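The corrected, patient-level analysis might be sketched as follows, again under my own assumed settings ($\lambda$ = 10, 1,000 cells per patient, 20 patients per group). The key change is that each patient is collapsed to a single mean, and the t-test’s sample size becomes the number of patients.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

lam = 10          # assumed true mean expression, identical for both groups
n_patients = 20   # patients per group
n_cells = 1000    # cells "sequenced" per patient
n_sims = 1000     # number of simulated experiments
alpha = 0.05

rejections = 0
for _ in range(n_sims):
    # Each patient's true mean is drawn from the same Poisson(lam) distribution,
    # so the null hypothesis of equal group means is true in every run.
    hc_true = rng.poisson(lam, size=n_patients)
    d_true = rng.poisson(lam, size=n_patients)

    # Summarize each patient by the mean expression across their cells.
    hc_patient_means = [rng.poisson(m, size=n_cells).mean() for m in hc_true]
    d_patient_means = [rng.poisson(m, size=n_cells).mean() for m in d_true]

    # Unpaired t-test where n = number of patients, not number of cells.
    _, p = ttest_ind(hc_patient_means, d_patient_means)
    if p < alpha:
        rejections += 1

fpr = rejections / n_sims
print(f"Patient-level false positive rate: {fpr:.3f}")  # near the nominal 0.05
```

Averaging within each patient first means the between-patient variability (the dominant source of noise here) is properly reflected in the test’s standard error, which is what restores the FPR to roughly its nominal level.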
AK Statistical/Data Science Journey
"We simulate things, not because it is hard, but because Math is sometimes harder to do or explain to others." – JFK if he had been an applied statistician