Wednesday, December 23, 2009

Paradigm 4

If there's one thing I hate, it's the monotony of generating data. I'm working on running sequencing on 48 samples right now, and let me tell you, that is not a barrel of laughs. I am all too aware that what I am doing could probably be done faster and better by a halfway decently designed machine, and consequently I just sit there daydreaming of a world in which I have that machine. This leads to pipetting errors, which lead to more frustration, which, as you can imagine, leads to more daydreaming. It's a cycle of violence on the microliter scale.

Which is all just an intro to explaining why I loved this article in the New York Times on the data deluge. Sequencing machines (among other things) are generating so much data that it's actually the analysis that becomes the limiting step. Eventually there will be so much data output that there will be little need for pipetters and a great need for analyzers. That's right, the limiting reagent is actually human brain hours.

Mind you, not just any human brain hours will do. To understand this data we need people well versed in computer science. Computers, after all, are the only things capable of reading off the billions of bases involved in anything approaching a reasonable time. A human reading of just a single human genome at one base per second (no breaks, no sleeping) would take over 95 years to complete.
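That 95-year figure is easy to sanity-check. A quick sketch, assuming a haploid human genome of roughly 3 billion bases (a standard approximation, not a figure from this post):

```python
# Back-of-the-envelope check: reading one base per second,
# no breaks, no sleeping, how long does a genome take?
GENOME_BASES = 3_000_000_000          # ~3 billion bases, haploid
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

years = GENOME_BASES / SECONDS_PER_YEAR
print(f"{years:.1f} years")  # about 95.1 years
```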

Our limiting reagent brain also needs to be versed in statistics, to allow for the fact that any comparisons are made on the genomic scale, and possibly between large populations. Signal is well hidden by noise, and the noise isn't even necessarily as random as we would like it to be. After all, this is a living, breathing genome we're talking about, not a string of A's, C's, G's and T's as we often imagine it.
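To make the genome-scale comparison problem concrete, here is an illustrative simulation (my own toy example, not anything from the post): run a million tests on pure noise and a naive p < 0.05 cutoff still flags tens of thousands of "hits", which is why genome-wide analyses need corrections like Bonferroni's.

```python
# Toy illustration of multiple testing at genomic scale.
# Under the null hypothesis, p-values are uniform on [0, 1),
# so ~5% of a million pure-noise tests clear p < 0.05 by chance.
import random

random.seed(0)
n_tests = 1_000_000
alpha = 0.05

p_values = [random.random() for _ in range(n_tests)]

naive_hits = sum(p < alpha for p in p_values)
bonferroni_hits = sum(p < alpha / n_tests for p in p_values)

print(naive_hits)       # on the order of 50,000 false positives
print(bonferroni_hits)  # typically zero
```

Bonferroni is the bluntest possible correction; the point is only that at this scale, uncorrected significance thresholds are worthless.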

Which brings us to the third necessity. The analytical brain that we need also, ideally, should have a strong understanding of molecular biology and the biology of any disease in question. Cells are complicated. Ridiculously complicated. But we do know a pretty enormous amount about how they operate. A thorough understanding of this prior knowledge helps us ask more pointed questions of the data in hand.

Have I overdetermined the system yet? Probably. In the end, deciphering this data is going to take a lot of collaboration. I've seen a lot of attempts at all-in-one prepackaged analysis engines for sequencing data. None of them, so far, looks very impressive. Moreover, understanding the output of such packages is its own special challenge, since their inner workings are often closed source or poorly documented. Thus it's often hard to trust or interpret results that you didn't generate yourself.

So will this data flood answer the big questions of our age? Are we going to find the cure for cancer? Perhaps at least the cure for a cancer? If we do it will be not only because of our ability to design and execute good experiments, but also our creativity in sifting the results.
