Friday, February 5, 2010

Tool Time

To start my note off on a tangent, I want to recommend the "getting to work early" paradigm. Arriving at work before 7am finds very few souls clogging the arteries of the building, and very few distractions to sidetrack this grad student's taxed neural network. It is a time for settling in and thinking about the big picture. It is also a time for using all those adverbs that your PI has stricken from your science writing. Even the word "abrogate" gets boring if repeated endlessly like a sitcom laugh track.

So let me expound, veritably explicate, upon the following question: What are the bioinformatic tools that I wish I had for my research? The answer comes like a dam burst. There is simply too much material to stay above water.

  • Multiple Reference Short Read Mapping: By this I mean a tool that corrals the multiplicity of reference human genomes and variant annotations and links them together for read mapping. This might sound silly, since most reference mappers can handle a SNP here or there. But there are a number of "everything that can go wrong will" sorts of scenarios, where a common SNP variant or two can lead to horrifically erroneous mapping. This propagates into very confusing results down the line. Such results have to be carefully untangled by hand before they reveal their fundamentally invalid core. With the tools we now have available in the human genome, multiple reference mapping is becoming a must-have app. So if you're out there, WashU, Broad, Sanger, BGI, hear my prayer.
  • Base Quality Retouching: Like the photographs that grace the covers of the latest supermarket magazines, the output that spools off of an Illumina GA needs a little retouching. Sometimes it needs a lot of retouching. There are the issues of PCR duplicates, nucleotide chemistry, and failed cycles. These are technical problems that may or may not go away any time soon. In the meantime, we need better base quality numbers. Is that set of 8 reads calling an A instead of a G a SNP? Hard to say if you can't believe your base call qualities. MarkDuplicates (Picard) and the GATK coming out of the Broad might have this problem mostly solved, but they remain to be packaged into a neat little bundle and handed out like candy to the rest of us. (I've sketched the recalibration idea after this list.)
  • The Mapping Quality Problem: Anyone who has played with high throughput sequencing technology should know about this problem. What does it mean that a read maps to a given location? Suppose it maps to one location perfectly, but to 25 with one mismatch. Suppose instead that it had mapped to one location with one mismatch and to only two with two mismatches. Which gets the better mapping quality? How are these situations even comparable? I have my own thoughts on a Bayesian way of handling this situation (sketched after this list). Maybe just saying the word Bayesian is enough to conjure my solution, and maybe it's too naive to be useful in implementation. Regardless, we need an answer sooner rather than later, lest interesting loci perish for want of a good sequencing read to feed them.
  • The SNP Caller to End All SNP Callers: This really does not deserve an explanation. (1) We want SNPs. (2) We want confidence scores for those SNPs that are remotely close to correct. The first part is done; we can call SNPs, but I'll be damned if I believe the kinds of confidence scores we assign to them. (A sketch of what such a score might look like also follows this list.) Getting this problem solved really requires getting the three problems above solved first.
  • Structural Variants for the Rest of Us: The gsMapper has a nice little tool for calling structural variants. Of course, 454 reads are quite amenable to this kind of work due to their length. Paired-end Illumina reads should be perfectly functional too, though (the basic idea is sketched after this list). I have yet to see an easy-to-use structural variant caller whose results I can sink my teeth into. I've seen a number of ad hoc tools, and some very high level tools that are nearly impossible to use. To get this problem solved properly we probably need the mapping quality problem solved first.
  • De PseudoNovo RefSembly: No, I am not just trying to smash words together to sound smart. I bring this tool up because, in my ideal world, the tools are bountiful, the data overfloweth, and every grad student is above average. In this imagined world we also have this little gem of a tool for particular problems. Sometimes you have a reference. Sometimes you have multiple references. Sometimes you have some reads that map to the reference, and some reads that you think might represent new genetic material. You'd like to map to the genome, but you'd also like to put together those delicious additional morsels into something that approximates a meal. For this you want Ref-Sembly, a tool that uses the a priori information from the genome you are working from but also openly allows and embraces the possibility of additional sequence. Such a tool should make a best guess at what that underlying sequence is and provide information about how it might connect to the reference you've dutifully provided. Currently, I think people map reads to the genome and just cram the unmapped refuse into a de novo assembler. I'm not going to say that this is wrong, but, in my heart of hearts, I don't feel that it is fully right. Assembly off of a reference needs to be more nuanced than a garbage compactor.
  • A Visualization Suite that Doesn't Crash My Computer When I Try to Look at Tens of Thousands of Reads: Does my request defy the bounds of computer science? Is my measly 8 gigs of RAM insufficient for your hungry Java app? All I know is this: there is currently no replacement for putting eyes on data. I can see an indel coming from a mile away if I can visualize my reads. IGV is my current tool of choice, but it craps out (for me) when the coverage gets deep. Unfortunately, this is where I need the tool the most. Maybe the answer is that I should get some more sticks of RAM, but I have to imagine that the coverage is only going to get deeper, and the problem will continue to mount.
  • A Visualization Suite that Produces Poster-Ready Images: The UCSC genome browser comes close. Very, very close. But the difficulty in customizing the visualization and the granularity of the images (with their horrific font) make this a step down from my ideal. If only there were a "whimsical" button to enhance the graphic appeal of the data it already displays, then I think we would be there. If I am looking across 100 kb of sequence, I need my exons to have a little more flair than a vertical line one pixel thick. My guess is that my hypothetical reader is now laughing that I didn't notice the "Visualize, with Feeling" button, tagged with the infamous 'blink' HTML, that sits dead center on the home page at genome.ucsc.edu. Maybe that person will email me.
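
Since I brought up base quality retouching: here is a toy sketch of the empirical recalibration idea. This is my own illustration under simplifying assumptions, not how Picard or the GATK actually do it. The notion is to bin bases by the quality the machine reported, count how often each bin actually disagrees with the reference at sites you believe are non-variant, and report the quality each bin has earned.

```python
# Toy sketch of empirical base-quality recalibration (my own illustration,
# not GATK's actual algorithm): bin bases by their reported quality, count
# how often each bin mismatches the reference at sites assumed to be
# non-variant, and report the empirical Phred quality for each bin.
import math
from collections import defaultdict

def recalibrate(observations):
    """observations: iterable of (reported_quality, is_mismatch) pairs
    taken from sites with no known variation."""
    counts = defaultdict(lambda: [0, 0])  # quality -> [mismatches, total]
    for q, is_mismatch in observations:
        counts[q][0] += int(is_mismatch)
        counts[q][1] += 1
    table = {}
    for q, (err, total) in counts.items():
        # pseudocounts keep a bin with zero observed errors finite
        p_err = (err + 1.0) / (total + 2.0)
        table[q] = -10.0 * math.log10(p_err)  # empirical Phred quality
    return table

# Example: bases reported at Q30 that actually mismatch 1% of the time
# come back closer to Q20.
obs = [(30, i < 10) for i in range(1000)]
print(recalibrate(obs))
```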
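
And here is roughly what I mean by a Bayesian way of handling the mapping quality problem. This is my own naive sketch, not how any particular mapper computes its MAPQ: treat each candidate location's mismatch count as a likelihood, put a flat prior over the candidate locations, and Phred-scale the posterior probability that the best hit is the true one.

```python
# Toy sketch of a Bayesian mapping quality (my own naive illustration, not
# any published mapper's method): each mismatch costs a fixed error
# probability, the prior over candidate locations is flat, and the score is
# the Phred-scaled probability that the best-scoring location is wrong.
import math

def mapping_quality(mismatch_counts, per_base_error=0.01, read_length=75):
    """mismatch_counts: number of mismatches at each candidate location."""
    likelihoods = []
    for m in mismatch_counts:
        # likelihood of the read given this candidate location
        lik = (per_base_error ** m) * ((1 - per_base_error) ** (read_length - m))
        likelihoods.append(lik)
    total = sum(likelihoods)
    posterior_best = max(likelihoods) / total   # flat prior over locations
    p_wrong = max(1.0 - posterior_best, 1e-30)  # avoid log10(0)
    return -10.0 * math.log10(p_wrong)

# One perfect hit versus 25 one-mismatch hits ...
print(mapping_quality([0] + [1] * 25))
# ... versus one one-mismatch hit and two two-mismatch hits.
print(mapping_quality([1] + [2] * 2))
```

Run on the two scenarios from the bullet above, the one-mismatch-versus-two-two-mismatch read comes out with the better mapping quality, which at least feels right. Whether this toy holds up against real repeat structure is another question entirely.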
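
In the same spirit, here is a toy version of a SNP confidence score for the "8 reads say A, the reference says G" question. Again, this is an illustration under simplifying assumptions (a single average error rate, a flat prior over diploid genotypes), not anybody's production caller.

```python
# Toy sketch of a SNP confidence score (an illustration, not any caller's
# actual model): given counts of reference and alternate bases and an
# average per-base error rate, compute likelihoods for the three diploid
# genotypes and Phred-scale the posterior probability that a variant exists.
import math

def snp_confidence(ref_count, alt_count, error_rate=0.01):
    n = ref_count + alt_count

    def binom_lik(p_alt):
        # probability of seeing this many alternate bases if each read
        # shows the alternate allele with probability p_alt
        return (math.comb(n, alt_count)
                * (p_alt ** alt_count) * ((1 - p_alt) ** ref_count))

    lik_hom_ref = binom_lik(error_rate)        # all alt bases are errors
    lik_het     = binom_lik(0.5)               # half the reads carry the alt
    lik_hom_alt = binom_lik(1 - error_rate)    # all ref bases are errors
    total = lik_hom_ref + lik_het + lik_hom_alt  # flat prior over genotypes
    p_variant = (lik_het + lik_hom_alt) / total
    p_no_variant = max(1.0 - p_variant, 1e-300)
    return -10.0 * math.log10(p_no_variant)

# Eight reads showing A over a reference G, none agreeing with the
# reference: very confident. Two of eight: much less so.
print(snp_confidence(ref_count=0, alt_count=8))
print(snp_confidence(ref_count=6, alt_count=2))
```

Of course the numbers this spits out are only as believable as the base and mapping qualities that feed it, which is exactly my point about needing the earlier problems solved first.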
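
Finally, for the structural variant item, the paired-end idea boils down to something like the following sketch, my own simplification rather than any published caller: read pairs whose mapped distance falls far outside the library's expected fragment size hint at a deletion (too far apart) or an insertion (too close together).

```python
# Toy sketch of paired-end structural-variant evidence (an illustration of
# the general idea, not a real caller): flag read pairs whose mapped insert
# size falls far outside the expected fragment-length distribution.
def discordant_pairs(pairs, mean_insert=300.0, sd_insert=30.0, n_sd=4.0):
    """pairs: iterable of (pair_name, observed_insert_size) tuples.
    Returns pairs suggesting a deletion (too long) or insertion (too short)."""
    calls = []
    for name, insert in pairs:
        deviation = (insert - mean_insert) / sd_insert
        if deviation > n_sd:
            calls.append((name, "possible deletion", insert))
        elif deviation < -n_sd:
            calls.append((name, "possible insertion", insert))
    return calls

# A pair spanning a 2 kb deletion maps roughly 2 kb too far apart.
print(discordant_pairs([("pair1", 310), ("pair2", 2310), ("pair3", 150)]))
```
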
That's it for now. I think I have exorcised the adverb demon that haunts my scientific writing. I return to the keyboard and the pipette knowing that my salvation is temporary, and that the thirst for flowery exposition shall rise again.