Saturday, October 31, 2009

Reference Genome(s?)

Of the many subjects at the Genome Informatics meeting at Cold Spring Harbor Labs, one of the more interesting surrounded the release of the latest human genome reference. I'll call it hg19.

The issue surrounding the release is as follows. Many people have been using the old reference, hg18, for years. Not only is it more familiar, but its coordinates have become a sort of canonical framework on which we have draped massive amounts of auxiliary information. No doubt some programmers (myself included, unfortunately) have hard-coded coordinates from the old genome into their utilities. Converting will be something like the switch away from two-digit dates at Y2K: a much smaller scale, but no less of a pain for those who have to do it.
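
For those of us with coordinates baked into our code, the least painful fix is probably to pull them out of the logic and into per-assembly data, so that moving to hg19 means swapping a table rather than editing utilities. A minimal sketch of the idea in Python, with a made-up region name and placeholder coordinates rather than real positions:

    # Sketch: keep genomic coordinates in data keyed by assembly, not in code.
    # The region name and coordinates below are placeholders, not real positions.
    REGIONS = {
        "hg18": {"MY_FAVORITE_GENE": ("chr17", 1000000, 1005000)},
        "hg19": {"MY_FAVORITE_GENE": ("chr17", 1000350, 1005350)},
    }

    def region(assembly, name):
        """Look up a named region (chrom, start, end) for a given assembly."""
        return REGIONS[assembly][name]

    # Switching assemblies becomes a one-argument change, not a code edit.
    print(region("hg18", "MY_FAVORITE_GENE"))
    print(region("hg19", "MY_FAVORITE_GENE"))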

What's more interesting is that the new genome, and increasingly future genomes, includes so-called "alternate assemblies". These are regions of the chromosome for which different well-sequenced individuals have chromosomal rearrangements and mutations large enough, and different enough, to require a completely different reference sequence. This, too, is likely to create a new mess of complications on the informatics side. Sure, your sequence matches chromosome 17, but whose chromosome 17? Is it the "alternate" or the "reference" chr17? Both?
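
To make that ambiguity concrete, here is a toy sketch (invented sequences, naive exact matching rather than a real aligner) of a read that places equally well on the primary chr17 and an alternate version of the same region:

    # Toy illustration of the alternate-assembly ambiguity.
    # Sequence names and bases are invented; real alignment is far more involved.
    REFERENCE = {
        "chr17":     "ACGTACGTGGCCTTAGATCGATCG",
        "chr17_alt": "ACGTACGTGGCCTTAGTTTTATCG",  # diverged copy of the same locus
    }

    def place_read(read, reference):
        """Return every (sequence name, offset) where the read matches exactly."""
        hits = []
        for name, seq in reference.items():
            start = seq.find(read)
            while start != -1:
                hits.append((name, start))
                start = seq.find(read, start + 1)
        return hits

    # The read hits both chr17 and chr17_alt; which one did the sample come from?
    print(place_read("ACGTGGCCTTAG", REFERENCE))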

The plot will only thicken as the 1000 Genomes Project becomes more accessible to public use. Imagine not 2 or 3 alternates, but thousands! Moreover, what sorts of changes warrant an alternate assembly? And at what frequency? How will sequences be mapped to these assemblies, in their ever-expanding multiplicity? How will the assemblies be stored locally? It seems impractical to have one thousand 3 GB files floating around with individual genomes.

Of course an alternative is to keep the main reference and create a database of changes to that reference, such as dbSNP. The question there becomes how to manage large and unusual sorts of changes, such as large indels and chromosomal rearrangements. Moreover, won't we be apt to bias ourselves against the 'non-reference' alleles since all of our mapping algorithms will use the reference only?
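
For point mutations and small indels the "reference plus differences" model is easy to picture. Here is a toy sketch (invented sequence and variants) that rebuilds an individual's sequence from a reference and a sorted list of changes; this is exactly the bookkeeping that gets hairy once the changes are large rearrangements:

    # Toy "reference + diffs" model. Variants are (position, ref_allele, alt_allele),
    # 0-based, sorted, and non-overlapping. Everything here is invented; real
    # variant databases are far messier, especially for structural changes.
    REFERENCE = "ACGTACGTACGTACGT"

    VARIANTS = [
        (3, "T", "C"),      # substitution
        (6, "GT", "G"),     # 1 bp deletion
        (10, "G", "GAAA"),  # 3 bp insertion
    ]

    def apply_variants(reference, variants):
        pieces, cursor = [], 0
        for pos, ref_allele, alt_allele in variants:
            assert reference[pos:pos + len(ref_allele)] == ref_allele
            pieces.append(reference[cursor:pos])  # unchanged stretch
            pieces.append(alt_allele)             # the individual's allele
            cursor = pos + len(ref_allele)
        pieces.append(reference[cursor:])
        return "".join(pieces)

    print(apply_variants(REFERENCE, VARIANTS))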

The appropriate path to take, in my mind, is to abandon the present model of a reference genome as a flat file with a list of nucleotides. We need to think of it, increasingly, as a graph, with alternate paths through that graph representing individual human genomes. Such a model would allow us to map more easily to variant genotypes, and could compress the amount of hard disk space required to store multiple genomes in our clusters.
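
A crude sketch of what I mean, on an invented toy locus: shared sequence lives in nodes, variation shows up as branches, and one individual's haplotype is simply one walk through the graph.

    # Toy sequence graph. Node names, sequences, and the variant are invented.
    NODES = {
        "left_flank":  "ACGTACGT",
        "ref_allele":  "GGCC",
        "alt_allele":  "GTTTC",    # e.g. an insertion-bearing alternate
        "right_flank": "TTAGTTAG",
    }

    EDGES = {
        "left_flank":  ["ref_allele", "alt_allele"],
        "ref_allele":  ["right_flank"],
        "alt_allele":  ["right_flank"],
        "right_flank": [],
    }

    def paths(node="left_flank", prefix=""):
        """Enumerate every haplotype the graph can spell out."""
        prefix = prefix + NODES[node]
        if not EDGES[node]:
            yield prefix
            return
        for nxt in EDGES[node]:
            for haplotype in paths(nxt, prefix):
                yield haplotype

    for haplotype in paths():
        print(haplotype)

The point is that both alleles sit in the reference on equal footing, so neither is penalized at mapping time, and the shared flanks are stored only once.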

We'd risk, of course, decreased accessibility for new users. A text file with a list of A's, T's, G's and C's is easy for people to understand and start to work with. That's why any conversion to a new standard would have to be spearheaded by a well-funded group with the means to create a suite of conversion tools, making a seamless transition back to FASTA files for those who are used to them. Such a group would also need to design new utilities for alignment, variant calling, visualization and downstream analysis (among other things). Moreover, they'd have to come up with a new coordinate system (scary) and a way to map back and forth between old and new results.
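
The "back to FASTA" escape hatch need not be exotic: a conversion tool could simply spell out one chosen path through the graph as an ordinary FASTA record. A hypothetical, toy-scale sketch:

    # Hypothetical converter: flatten one path through a sequence graph into a
    # plain FASTA record, wrapped at 60 characters per line as is conventional.
    def path_to_fasta(name, node_sequences, path, width=60):
        seq = "".join(node_sequences[node] for node in path)
        lines = [">" + name]
        for i in range(0, len(seq), width):
            lines.append(seq[i:i + width])
        return "\n".join(lines)

    # Toy nodes, same idea as the graph sketch above.
    node_sequences = {
        "left_flank":  "ACGTACGT" * 10,
        "alt_allele":  "GTTTC",
        "right_flank": "TTAGTTAG" * 10,
    }

    print(path_to_fasta("sample1_region", node_sequences,
                        ["left_flank", "alt_allele", "right_flank"]))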

This is no small task, but it needs to happen. It's time to stop treating all genomes as variants on Watson or Venter.

Thursday, October 29, 2009

First Post! The War on Cancer: Silver Bullets or Atom Bombs

Sitting here at the Genome Informatics meeting at Cold Spring Harbor Labs, I am newly awakened to the vast community working on many of the scientific problems I often feel I toil on in isolation. Moreover, I'm amazed by the amount of collective knowledge and competence in dealing with the informatics problems associated with sequencing-by-synthesis technology specifically and genomic biology broadly.

So here we go! I'm going to take a stab at creating a public space for my thoughts and perhaps even a site for dialogue with other researchers. Which is to say: comment liberally! Correct my foolishness or debate ideas. I certainly don't expect much traffic, but leave a note if you stop in.

Now for the main course.

Silver Bullets or Atom Bombs:

In conversations with one of my previous PIs earlier this week, I got the chance to muse about the oncoming deluge of cancer genomics data. I think everyone in bioinformatics, and in the biology community in general, is thrilled with the enormous number of new tools now available through sequencing, but there remains the hovering question: what are we going to do with all of this new data? More to the point, what are we looking for?

Two possibilities are well known. Each stems from a sort of overarching category of cancer therapy. On the one hand you have the atom bombs. These are the old standbys of the arsenal, the drugs that blast a cell with toxic insult and hope for the best. These often target the genomic DNA, either through antimetabolites or through damaging alkylating agents. Presumably the cancerous cells, with their rapid replication, will suffer the most, but there are many victims of the therapy and the toxicity is intense.

On the other hand you have the (usually newer) silver bullets. Here I am thinking of imatinib or trastuzumab. These therapies are intended to be trained killers, entering the body and using a cancer cell's weak points to mount a targeted assault on proteins the cancer cell depends upon. Without these signals the tumors remit, often with less toxicity to the patient.

World Maps

There are two ways in which the genomic revolution could obviously help cancer treatment. The first is a sort of world map. Right now oncologists drop their atom bomb therapies based on tumor pathology, which uses information like the tumor's location and appearance under a microscope for categorization. Some categories of tumor respond very well to some therapies, others extremely poorly. But some tumor types seem to have very heterogeneous responses to the same therapy: some patients experience full remissions while others see very little benefit. Furthermore, patients differ in their ability to withstand an atom bomb therapy, with some suffering severe toxicity and dying from the treatment rather than the disease.

For these, we need a more precise world map. If you're going to be dropping atom bombs, even as a last resort, you sure as hell would like to be dropping them in the right place. Finer-scale categorization of tumors based on their genotype or expression profiles (or both) could hone our categories and put clearer boundaries on the map. Perhaps carboplatin is a good drug for melanoma, but only if the cells have a specific cell cycle checkpoint defect. There might be no way to see such a defect from histology alone.
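
The payoff of that finer map is easy to sketch: stratify response by a candidate marker and see whether a "heterogeneous" response was really two homogeneous groups in disguise. The numbers, marker, and drug response below are invented stand-ins, not real findings:

    # Toy stratification of drug response by a candidate genomic marker.
    # Patients, marker status, and responses are all invented.
    patients = [
        # (has_checkpoint_defect, responded_to_drug)
        (True, True), (True, True), (True, False), (True, True),
        (False, False), (False, False), (False, True), (False, False),
    ]

    def response_rate(group):
        responders = sum(1 for _, responded in group if responded)
        return float(responders) / len(group)

    with_defect    = [p for p in patients if p[0]]
    without_defect = [p for p in patients if not p[0]]

    print("overall:        %.2f" % response_rate(patients))
    print("with defect:    %.2f" % response_rate(with_defect))
    print("without defect: %.2f" % response_rate(without_defect))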

Just as you need to have a map of where to bomb, you need a map of where NOT to bomb. You want to destroy Melanomatown but spare Bonemarrowburg. Some patients need to be spared the treatment due to toxicity susceptibilities of their normal, healthy cells. There might be genomic markers of such susceptibility in the genome.

This line of inquiry amounts to careful epidemiology and delineation of differences between tumors and individuals. No new therapies are necessary, just better characterizations of the patients and their disease.

The Golden Gun

Then there's the prospect of new targets. Perhaps now we can find new targets which, like BCR-ABL in CML, fall to a single, focused attack. To me this is the more tantalizing possibility. Perhaps we can find a whole host of tumors which have such pinpoint weaknesses and hit them hard. To quote Brian J. Druker et al.:

...[O]ne of the major issues is to identify appropriate targets for drug development. Although the Abl kinase inhibitor has been useful for clinical concept validation, several features of CML may make the success of a kinase inhibitor as a single agent unique for this cancer. The Bcr-Abl tyrosine kinase, present in 95% of patients, is sufficient to cause the disease, and in early disease, it may represent the sole molecular abnormality. Few other malignant diseases can be ascribed to a single molecular defect in a protein kinase. (link)

Protein kinases make nice targets because of their susceptibility to small molecule inhibitors, but they are by no means the only possible good targets. With a more detailed understanding of cancer genome architecture we may find the variants which, like CML, have a "kingpin" gene regulating the whole cancer phenotype. These may have been too subtle to see by histology, but shine out in either expression analysis or genome resequencing. Once identified, we can design a slew of new bullets to go into a sort of golden gun. Simple assays can be designed to determine whether a patient will be susceptible to one of these targeted therapies, and we can hit the tumors exactly where they are weak. By hitting the kingpin of the proliferation we could reduce the possibility of drug resistance and drug toxicity.
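
In practice that last step could be as mundane as looking a patient's lesions up in a table of known kingpin-drug pairs. The sketch below is purely illustrative; the table holds only the textbook examples already mentioned and is not treatment guidance:

    # Hypothetical lookup of targeted therapies by molecular lesion.
    # Only well-known textbook pairings are listed, purely for illustration.
    TARGETED_THERAPIES = {
        "BCR-ABL fusion":     "imatinib",
        "HER2 amplification": "trastuzumab",
    }

    def suggest_therapies(patient_lesions):
        """Return (lesion, drug) pairs for lesions with a known targeted agent."""
        return [(lesion, TARGETED_THERAPIES[lesion])
                for lesion in patient_lesions
                if lesion in TARGETED_THERAPIES]

    # The second lesion has no entry in this toy table, so nothing is suggested.
    print(suggest_therapies(["BCR-ABL fusion", "KRAS G12D"]))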

How?


This is just my conception of the field. Many people are already discussing these issues, and I don't pretend that there's a single original idea here. But it's fun to play with big ideas, and you have to start somewhere. I've sort of answered the large question of "what are we trying to do?" but I will end with a series of questions that to me have no obvious answer.

How do we focus our efforts in finding targets?
How do we pull relevant genomic/phenotypic/histologic etc. features out of an ever growing list?
How do we target kingpin changes in cancers for which there is no easy pharmacologic target?
How does cancer 'evolve' drug resistance and how do we prevent it?
How do we match patients and tumors with treatment regimens, given that N decreases the more finely we partition the individuals? (A back-of-the-envelope sketch follows this list.)
How do we make genomic information available to clinicians?
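
On the sample-size question above, here is a back-of-the-envelope sketch of how quickly stratification eats into a cohort (the starting cohort size is invented):

    # Each binary marker used for stratification roughly halves the number of
    # patients per stratum. The starting cohort size is invented.
    cohort = 1000
    for n_markers in range(7):
        strata = 2 ** n_markers
        print("%d markers -> %3d strata -> ~%6.1f patients per stratum"
              % (n_markers, strata, float(cohort) / strata))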

I'll leave it at that. If you read this far thanks, and I hope I didn't go overboard.