Sunday, October 31, 2010

The Hazard of High Throughput

We live in a high throughput age. Science is no exception. Microarrays, high throughput sequencing, and spectrophotometers generate data on a scale and scope that would have taken years, decades, or centuries with the old generation of technology. We are generating data on a scale that could not have been conceived of a decade ago.

Great power... let's see what comes next....

The challenge, of course, is the analysis. The data now generated are so massive that they cannot be visualized or processed all at once in raw form. Nor can simple t-tests carry the statistical analysis, given how many comparisons are being made.

Enter the world of the Bonferroni correction and the Benjamini-Hochberg false discovery rate. These statistical methods allow us to sift through such enormous data sets and focus on results that differ significantly from random expectation.
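To make the two procedures concrete, here is a minimal sketch of each, assuming only a list of raw p-values from many independent tests (the p-values below are made up for illustration):

```python
def bonferroni(pvals, alpha=0.05):
    """Bonferroni: reject only p-values below alpha divided by the
    total number of tests. Controls the family-wise error rate, but
    is very conservative when the number of tests is large."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg: controls the false discovery rate.
    Sort the p-values; find the largest rank k such that
    p_(k) <= (k / m) * alpha; reject the k smallest p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            k_max = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            reject[i] = True
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.042,
         0.060, 0.074, 0.205, 0.212, 0.216]
print(sum(bonferroni(pvals)))          # 1 rejection
print(sum(benjamini_hochberg(pvals)))  # 2 rejections
```

Note the trade-off: on the same ten p-values, Bonferroni rejects only the smallest (its threshold is 0.05/10 = 0.005), while Benjamini-Hochberg also rejects the second, at the cost of tolerating a controlled fraction of false discoveries.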

The hazard comes with these methods' complexity and somewhat obscure statistical assumptions. Many scientists are very well versed in the hypotheses of their discipline, but less so in the mathematics. There are so many ways to go wrong when applying these methods in a cookie-cutter fashion that it boggles the mind. Alongside the classic fallacy "correlation implies causation" there are other such gems as "difference in significance does not imply significant difference."
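That last fallacy is easy to fall into, so here is a small numeric sketch of it, using made-up effect estimates and standard errors under a normal approximation (two treatments measured against a common control):

```python
import math

def is_significant(estimate, stderr, z=1.96):
    """True if the 95% confidence interval excludes zero."""
    return abs(estimate) > z * stderr

# Hypothetical effects versus a shared control group:
effect_a, se_a = 0.25, 0.10   # "significant" on its own
effect_b, se_b = 0.10, 0.10   # "not significant" on its own

# The A-vs-B comparison needs its own standard error:
diff = effect_a - effect_b
se_diff = math.sqrt(se_a**2 + se_b**2)

print(is_significant(effect_a, se_a))   # True:  A differs from control
print(is_significant(effect_b, se_b))   # False: B does not
print(is_significant(diff, se_diff))    # False: A and B do not differ
```

Treatment A clears the significance bar and treatment B does not, yet the direct comparison of A to B is nowhere near significant. Concluding that A "works" and B "doesn't" from the first two tests alone is exactly the mistake in question.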

This last mistake was featured prominently in stage 1 of a statistical analysis in a manuscript from a good lab that just passed my boss' desk. This lab has produced prominent publications in the past, and I was surprised to see this in their analysis.

What most surprised me was that the statistical method, buried deep within the methods section at the end of the manuscript, did not arouse the ire of my boss or the other lab member who read the paper. It was viewed as a good enough answer to a tough problem. That, plus the prominence of the last author, led to a minor note somewhere in the review.

Reviews don't have dissenting opinions, but let me put one here. Statistical methods are important. So important that, in a paper built on a high throughput method, they often form the backbone of every follow-up experiment. They should not be relegated to a footnote in the back, and they should, wherever possible, be declared before the data are even generated, to avoid the nefarious problem of overfitting.

Papers that use statistics need statistically minded reviewers. If we aren't careful, we'll be fooled by randomness.
