I arrived in Ottawa a day before the proper start of the Evolution 2012 meetings so as to attend the symposium hosted by the journal Molecular Ecology, which was almost entirely devoted to the joys of genome-scale data collected from wild populations of our favorite species—and what we can and can’t learn from it. This, readers of this blog will recall, is one of the biggest changes in our field in the last few years.
Alex Buerkle kicked things off with the interesting question of how much data, exactly, we need. It’s easy (given the funding) to obtain a lot of DNA sequence fragments from next-generation sequencing (NGS) methods, but is it better to collect lots of data from a few individuals (and thereby have high confidence in the data) or to collect less data from more individuals and accept some uncertainty in the data for any one individual? Buerkle argued that the second option is preferable; it’s possible to account for uncertainty in your analysis, but if you don’t sample enough individuals, you can miss rare gene variants.
Running through the whole symposium was a tension between confidence and uncertainty in these great big genetic datasets. Buerkle also noted that patterns of differentiation and diversity across the genomes of related species can be very complex, and in the question and answer session, it was pointed out that complexity and noise can be hard to tell apart.
Bryan Carstens made the case that hypothesis testing with NGS data is often limited by the use of overly simplistic, or unrealistic, null models. That is, using your data to reject a null model that isn’t very realistic anyway isn’t a good way to establish that your alternative hypothesis fits the data better. Carstens demonstrated an approach in which a dataset is fit to lots of different, related models, and then parameters—rates of gene flow between populations, times since the populations have diverged—are estimated not from only the best-fitting model, but from an average across all the models, weighted by their relative goodness of fit.
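One common way to do this kind of model averaging is with Akaike weights. The sketch below is purely illustrative, not Carstens’s actual analysis: the model names, AIC scores, and migration-rate estimates are all invented for the example.

```python
import math

# Hypothetical example: three competing demographic models for a pair of
# populations, each with an AIC score and a migration-rate estimate.
# All numbers are invented for illustration.
models = {
    "isolation":          {"aic": 212.4, "migration_rate": 0.00},
    "constant_migration": {"aic": 205.1, "migration_rate": 0.85},
    "secondary_contact":  {"aic": 206.7, "migration_rate": 0.62},
}

# Akaike weights: each model's relative support, given the data.
min_aic = min(m["aic"] for m in models.values())
raw = {name: math.exp(-0.5 * (m["aic"] - min_aic)) for name, m in models.items()}
total = sum(raw.values())
weights = {name: r / total for name, r in raw.items()}

# Model-averaged estimate: weight each model's migration rate by its
# Akaike weight, rather than trusting only the single best-fitting model.
averaged = sum(weights[name] * models[name]["migration_rate"] for name in models)
```

The best-fitting model still dominates the average, but plausible alternatives pull the estimate toward their values instead of being discarded outright.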
There was also some acknowledgement of the limits of big genetic datasets as purely genetic datasets. Louis Bernatchez described a hope for a future when large phenotypic datasets are used to annotate genomes—when an individual gene variant could be linked to measured phenotypes ranging from the protein sequence it produces to the physical performance that results and even the expected social ranking of individuals carrying the variant. But collecting gene products, phenotypes, and social contexts is a lot more work than collecting genotypes has become, and many of them are highly context dependent—whether that context is other genes, other traits, or other individuals in the same population.
These are topics I’ve been thinking about a lot since I started work as part of the Medicago HapMap Project: with thousands or millions of genetic markers, how do we go about identifying the ones that are potentially under selection due to climate or disease? The easiest thing to do is just find the gene regions that show the strongest associations with a climate variable, or with susceptibility to infection. But every distribution has a tail. Even if, in reality, no gene region is under particularly strong selection from ecological force X, some markers will still show the strongest observed association with X.
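A quick simulation makes the point. Everything below is hypothetical: genotypes are pure noise with no true association to the climate variable, yet the single “best” marker still looks impressively correlated just because we scanned so many of them.

```python
import random

random.seed(1)

n_markers = 10_000
n_individuals = 50

# A made-up climate variable measured for each individual.
climate = [random.gauss(0, 1) for _ in range(n_individuals)]

def correlation(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return sxy / (sx * sy)

# Every marker is noise: genotype counts drawn independently of climate.
best = max(
    abs(correlation([random.choice([0, 1, 2]) for _ in range(n_individuals)],
                    climate))
    for _ in range(n_markers)
)
# 'best' is sizeable despite zero true signal, purely because it sits in
# the tail of 10,000 null correlations.
```

With a genuine genome-scale marker set the multiple-testing problem is even worse, which is exactly why the strongest raw association is not, by itself, evidence of selection.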
The solution seems to be to tackle the data from multiple angles—not just the raw association, but association plus population genetics parameters like pairwise divergence or diversity. And to follow up with, you know, actual experiments.
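As a toy illustration of one of those population-genetic parameters, here is a minimal sketch of pairwise Fst for a single biallelic marker, using the simple Wright/Nei formulation Fst = (Ht − Hs) / Ht. The allele frequencies are invented for the example.

```python
def fst_two_pops(p1, p2):
    """Fst for one biallelic marker from its allele frequency in each of
    two equal-sized populations (Wright/Nei: (Ht - Hs) / Ht)."""
    # Mean within-population expected heterozygosity.
    hs = 0.5 * (2 * p1 * (1 - p1)) + 0.5 * (2 * p2 * (1 - p2))
    # Total expected heterozygosity from the pooled allele frequency.
    p_bar = 0.5 * (p1 + p2)
    ht = 2 * p_bar * (1 - p_bar)
    return (ht - hs) / ht if ht > 0 else 0.0

# A marker with similar frequencies in both populations -> low Fst;
# a strongly differentiated marker -> high Fst.
low = fst_two_pops(0.50, 0.55)   # nearly undifferentiated
high = fst_two_pops(0.05, 0.95)  # strongly differentiated
```

A marker that is both a strong outlier for association with an environmental variable and unusually differentiated between populations is a much better candidate for follow-up than one that clears only a single threshold.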