Of the many interesting subjects at the genome informatics meeting at Cold Spring Harbor Labs, one of the most notable was the release of the latest human genome reference. I'll call it hg19.
The issue surrounding the release is as follows. Many people have been using the old reference, hg18, for years. Not only is it more familiar, but its coordinates have become a sort of canonical framework on which we have draped massive amounts of auxiliary information. No doubt some programmers (myself included, unfortunately) have hard-coded coordinates from the old genome into their utilities. Converting will be a bit like the switch away from two-digit dates at Y2K: a much smaller scale, but no less of a pain for those who have to do it.
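To make the conversion problem concrete, here is a minimal sketch of the kind of coordinate lift-over that a move from hg18 to hg19 entails. The block table below is entirely made up (these are not real chain data), and the chromosome and offsets are just for illustration:

```python
# Toy lift-over: map an old-assembly coordinate to the new assembly through a
# block table. Each block is (old_start, old_end, new_start). The offsets are
# invented for illustration and are NOT the real hg18 -> hg19 mapping.
BLOCKS = {
    "chr17": [
        (0, 22000000, 0),                  # region unchanged between assemblies
        (22000000, 50000000, 22300000),    # region shifted by a hypothetical 300 kb insertion
    ],
}

def lift_over(chrom, pos, blocks=BLOCKS):
    """Return the new-assembly position for an old coordinate, or None if unmapped."""
    for old_start, old_end, new_start in blocks.get(chrom, []):
        if old_start <= pos < old_end:
            return new_start + (pos - old_start)
    return None  # falls in a gap or rearranged region

print(lift_over("chr17", 25000000))  # -> 25300000 under these made-up blocks
```

Every hard-coded coordinate, BED file, and annotation track has to pass through something like this, and anything landing in a gap or rearranged region needs manual attention.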
What's more interesting is that the new genome, and increasingly future genomes, includes so-called "alternate assemblies". These are regions of a chromosome for which different well-sequenced individuals have chromosomal rearrangements and mutations large enough, and different enough, to require a completely different reference sequence. This, too, is likely to create a new mess of complications on the informatics side. Sure, your sequence maps to chromosome 17, but whose chromosome 17? Is it the "alternate" or the "reference" chr17? Both?
The plot will only thicken as the 1000 Genomes Project becomes more accessible to public use. Imagine not 2 or 3 alternates, but thousands! Moreover, what sorts of changes warrant an alternate assembly? And at what population frequency? How will sequences be mapped to these assemblies, in their ever-expanding multiplicity? How will the assemblies be stored locally? It seems impractical to have a thousand 3 GB files floating around, one per individual genome.
Of course, an alternative is to keep the main reference and maintain a database of changes against it, as dbSNP does for SNPs. The question there becomes how to manage large and unusual sorts of changes, such as big indels and chromosomal rearrangements. Moreover, won't we be apt to bias ourselves against the 'non-reference' alleles, since all of our mapping algorithms will use the reference only?
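A rough sketch of what that diff-style storage looks like in practice, using a toy reference and invented variants (the data structures and names here are illustrative only, not any real dbSNP format):

```python
# Sketch: one individual's genome stored as edits against the reference
# (dbSNP-style for SNPs, plus indels). Everything here is a toy example.
reference = "ACGTACGTACGT"

# Each variant: (position, reference allele, alternate allele)
variants = [
    (3, "T", "G"),       # SNP
    (6, "G", "GAAA"),    # insertion -- shifts every downstream coordinate
    (8, "ACG", "A"),     # deletion
]

def apply_variants(ref, variants):
    """Build the individual's sequence by splicing edits into the reference."""
    out, cursor = [], 0
    for pos, ref_allele, alt_allele in sorted(variants):
        out.append(ref[cursor:pos])                        # unchanged stretch
        assert ref[pos:pos + len(ref_allele)] == ref_allele # sanity check
        out.append(alt_allele)
        cursor = pos + len(ref_allele)
    out.append(ref[cursor:])
    return "".join(out)

print(apply_variants(reference, variants))
```

Substitutions are easy to store this way; it's the large insertions, deletions, and rearrangements, which shuffle every downstream coordinate, that make a flat diff list painful.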
The appropriate path to take, in my mind, is to abandon the present model of a reference genome as a flat file listing nucleotides. We need to think of it, increasingly, as a graph, with alternate paths through that graph representing individual human genomes. Such a model would let us map more easily to variant genotypes, and could compress the amount of hard disk space required to store multiple genomes on our clusters.
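Here is a minimal sketch of that graph idea, with hypothetical node names: shared segments are stored once as nodes, branch points carry the alternate alleles, and an individual genome is just a path through the graph.

```python
# Minimal sequence-graph sketch: nodes hold sequence segments, edges connect
# them, and each individual genome is one path through the graph. Shared
# segments are stored once, which is where the space saving comes from.
nodes = {
    "n1": "ACGTACG",   # segment shared by everyone
    "n2": "T",         # reference allele at a variant site
    "n3": "G",         # alternate allele at the same site
    "n4": "ACGTTTT",   # shared tail
}
edges = {
    "n1": ["n2", "n3"],
    "n2": ["n4"],
    "n3": ["n4"],
    "n4": [],
}

# Each genome is a short list of node IDs rather than its own 3 GB string.
genomes = {
    "reference":    ["n1", "n2", "n4"],
    "individual_A": ["n1", "n3", "n4"],
}

def sequence_of(path, nodes=nodes):
    """Concatenate node segments along a path to recover a linear sequence."""
    return "".join(nodes[n] for n in path)

print(sequence_of(genomes["individual_A"]))  # ACGTACG + G + ACGTTTT
```

Mapping a read then becomes a question of which paths it is consistent with, rather than how well it matches one privileged linear sequence.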
We'd risk, of course, decreased accessibility for new users. A text file with a list of A's, T's, G's and C's is easy for people to understand and start working with. That's why any conversion to a new standard would have to be spearheaded by a well-funded group with the means to create a suite of conversion tools, making a seamless transition back to FASTA files for those who are used to them. Such a group would also need to design new utilities for alignment, variant calling, visualization and downstream analysis (among other things). Moreover, they'd have to come up with a new coordinate system (scary) to map back and forth between old and new results.
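The "back to FASTA" escape hatch is the easy part; a conversion tool only needs to flatten a chosen path and write an ordinary record. A hedged sketch, with a made-up record name:

```python
# Sketch: write a flattened sequence (e.g. one path through the graph) as a
# plain FASTA record, for people who still want a flat file.
def to_fasta(name, seq, width=60):
    lines = [">" + name]
    lines += [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join(lines) + "\n"

print(to_fasta("chr17_alt_example", "ACGTACGGACGTTTT" * 5))
```

The coordinate system, and keeping old and new results mapped to each other, is the genuinely hard part.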
This is no small task, but it needs to happen. It's time to stop treating all genomes as variants on Watson or Venter.