How many phylogenies are there in a genome? Lots!

One candidate for the original “evolutionary tree”—the only figure illustrating the first edition of The Origin of Species. Image via Wikimedia Commons.

Biologists have been constructing trees since before we knew why tree-shapes made such convenient organizing structures for living things. Since Charles Darwin (and Alfred Russel Wallace) first made the case that diverse groups of living species arise from common ancestors, we understand that tree-like relationships reflect this common descent, so that if we can infer the specific structure of those relationship-trees, or phylogenies, we can begin to draw conclusions about how individual species evolved to be what they are today.

Back in the day, we had to estimate phylogenies using directly observable characteristics: measurements of particular parts of particular bones (for mammals), or the shape of the antennae (for butterflies), or the capacity to synthesize lysine (for paramecia). If species that are more recently related tend, on average, to look more similar to each other, this kind of morphological data can be useful. But! When you’re setting out to reconstruct a phylogeny from a whole pile of morphological measurements (dietary preferences, tooth counts, fur color, wing length), what do you do when different traits support different relationship structures? Different traits will naturally change at different rates over evolutionary time, and it’s rarely obvious what those rates are.

Starting in about the 1970s, though, it became increasingly straightforward to directly compare the genetic codes of different species. That’s appealing for two reasons: first, because DNA is as direct a marker of inheritance as you can find, it provides a record that’s independent of potentially misleading morphological similarities. Second, because DNA sequences have only four character states (the good old “bases” adenine, guanine, thymine, and cytosine), it’s more tractable to estimate how they change over time.

But even DNA data doesn’t guarantee you can easily figure out a phylogeny. Genetic sequences recombine in the course of sexual reproduction, and when that sexual reproduction crosses species boundaries, it can mean that different genetic regions will support different species relationships. And even when species obligingly refrain from fraternizing, if they’ve split apart from each other relatively recently, their genetic code may not yet fully reflect the split—a phenomenon called “unresolved ancestral polymorphism.”

The figure below attempts to illustrate those possible conflicts between gene trees and species trees. On the left, gray outlines trace the true history of species A, B, and C, and the blue, red, and green lines illustrate possible histories for individual genes carried by those species. On the right, the same colors are used to illustrate the species relationships you’d get if you used only the individual genes—while the blue gene would give you the correct phylogeny, the red and green ones would mislead you if you examined them on their own.

The histories of individual gene regions don't always match the histories of species. While many sections of genetic code (blue line) may reflect true historical relationships between species A, B, and C (gray outlined phylogeny), genes that are transferred between species via hybridization (red line) or that retain unresolved polymorphism (green line) can conflict with the "species tree."

So, in order to really sort out the relationships among a collection of species, especially species that are relatively recently diverged from their common ancestor or prone to hybridization, you need to examine the relationships suggested by many genetic regions. It used to be a challenge simply to collect sequence data from multiple independent genes, but nowadays high-throughput sequencing methods are making multi-gene phylogenies possible for almost any collection of species that might interest you.

In fact, it’s very nearly possible to collect more data than we know what to do with. That was the challenge my coauthors and I dealt with in a paper that’s recently been released online ahead of print in the journal Systematic Biology. We set out to infer relationships among more than two dozen species in the genus Medicago, which includes alfalfa (Medicago sativa), and my favorite “model” legume, barrel medic (M. truncatula). The Medicago HapMap project had collected whole-genome sequence data from species across the genus as part of its efforts to understand the genetic diversity of M. truncatula. (Genetic data from closely related species can provide evolutionary context for diversity within a single focal species.)

Medicago truncatula, the star of the genus (if I do say so myself).

Using that data, we identified some 87,000 individual DNA bases that varied among the sampled species—single-nucleotide polymorphisms, or SNPs. That’s not a lot in terms of actual sequence data—but considering that every one of those 87,000 SNPs is a variable character, and that most of them were probably spread far enough across the genome to have independent evolutionary histories, it contains many more independent “gene trees” than most DNA data sets used to estimate phylogenies. There are several software packages that estimate phylogenies by comparing the relationships supported by multiple independent genes—but none of them can tackle thousands of independent gene trees without using absurd computational resources.

So here’s what we did instead: we pretended that all those SNPs would support the same phylogeny, and then examined how well that phylogeny held up to closer scrutiny. First, we estimated a “genome-wide” phylogeny by treating all 87,000 sites as one big gene sequence in the program MrBayes. Then we systematically broke the data into smaller chunks (“windows” of 500 SNPs, in order along the genome of Medicago truncatula) and estimated a phylogeny from each window. We then compared the phylogeny estimated from each window to the one we got from the whole dataset.
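To make the windowing step concrete, here is a minimal Python sketch of cutting an ordered list of SNPs into fixed-size windows. It is an illustration only, not the code used in the paper: the function name `snp_windows`, the `step` parameter, and the choice to drop a short final window are all assumptions, and whether the real windows overlapped is not something this sketch tries to settle.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def snp_windows(snps: Sequence[T], size: int = 500, step: int = 500) -> List[Sequence[T]]:
    """Cut SNPs (already sorted by genome position) into windows of `size` SNPs.

    `step` is a hypothetical knob: step == size gives back-to-back windows,
    step < size gives overlapping ones. A trailing window shorter than
    `size` is dropped (an assumption made to keep every window comparable).
    """
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    last_start = len(snps) - size
    return [snps[i:i + size] for i in range(0, last_start + 1, step)]


# Toy data: 2,000 SNP identifiers standing in for real genotype columns.
toy_snps = list(range(2000))
windows = snp_windows(toy_snps)
print(len(windows), len(windows[0]))
```

Each window would then be handed to the tree-estimation program separately, giving one phylogeny per window to compare against the genome-wide one.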

The figure below gives a taste of our results: for each window along the M. truncatula genome, it plots the number of nodes, or inferred relationships, at which the phylogeny estimated from that window differs from the whole-genome tree (in black dots) and the proportion of those “conflicting nodes” that had strong support in the MrBayes analysis. Windows with a large number of conflicting nodes, and a large proportion of strongly supported conflicting nodes, would be positively misleading if you used that window to estimate the species tree of Medicago on its own.

Measures of gene-tree conflict across the Medicago genome. From figure 4 of Yoder et al (2013).

… or would they? What we found, scanning across the whole genome this way, was what seems like a lot of variation in phylogenetic signal; 332 of the 349 windows gave strong support to at least one node that differed from the genome-wide estimate. Based on previous phylogenetic studies of Medicago, this isn’t a highly surprising result—two earlier studies (by Maureira-Butler et al. and by Steele et al.) found considerable conflict among phylogenies estimated from much smaller samples of independent genes.
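The node-by-node comparison described above can be sketched in a few lines of Python. This is a simplified illustration rather than the analysis from the paper: trees are written as nested tuples of species names and treated as rooted, a "conflicting node" is taken to be any group of species in the window tree that is missing from the genome-wide tree, and the 0.95 cutoff for "strong support" is an assumed value. The names `clades`, `conflicting_nodes`, and `strong_fraction` are hypothetical.

```python
from typing import Dict, FrozenSet, Set, Tuple, Union

Tree = Union[str, Tuple["Tree", ...]]  # a leaf name, or a tuple of subtrees
Clade = FrozenSet[str]


def clades(tree: Tree) -> Set[Clade]:
    """Return the set of species grouped under each internal node."""
    found: Set[Clade] = set()

    def walk(node: Tree) -> Clade:
        if isinstance(node, str):
            return frozenset([node])
        leaves: Clade = frozenset()
        for child in node:
            leaves = leaves | walk(child)
        found.add(leaves)
        return leaves

    walk(tree)
    return found


def conflicting_nodes(window_tree: Tree, genome_tree: Tree) -> Set[Clade]:
    """Groupings in the window tree that the genome-wide tree lacks."""
    return clades(window_tree) - clades(genome_tree)


def strong_fraction(conflicts: Set[Clade], support: Dict[Clade, float],
                    threshold: float = 0.95) -> float:
    """Share of conflicting nodes whose support meets the assumed threshold."""
    if not conflicts:
        return 0.0
    strong = sum(1 for c in conflicts if support.get(c, 0.0) >= threshold)
    return strong / len(conflicts)


# The three-species example from the gene-tree figure.
species_tree = (("A", "B"), "C")
gene_tree = ("A", ("B", "C"))
conflicts = conflicting_nodes(gene_tree, species_tree)
print(len(conflicts))
```

In the toy example, the gene tree groups B with C while the species tree groups A with B, so the only conflicting node is the B-plus-C grouping. With 29 species per tree, the same bookkeeping yields the per-window counts plotted in the figure.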

So the phylogeny we estimated by cramming together a whole genome’s worth of markers clearly oversimplifies the thousands of different evolutionary histories contained in our data. Can we do better? Maybe eventually, with more computing power, or more efficient computational methods, we can start to apply models that actually account for the possibility of hybridization and unresolved diversity. One method we tried, the program SNAPP, had a lot of promise, since it’s written for exactly the kind of data we had, but it couldn’t chew through our data set, or even smaller subsets of it, quickly enough to be practical.

Even with more advanced estimation methods, though, the phylogeny of Medicago, or any group of living things, remains just that—an estimate, an educated reconstruction of relationships among the species involved. We’re getting better and better at making these reconstructions, but even when we have thousands of genetic markers to apply to our evolutionary questions, maybe it’s good for us to keep that uncertainty in mind.


Bryant, D., Bouckaert, R., Felsenstein, J., Rosenberg, N.A. & RoyChoudhury, A. 2012. Inferring species trees directly from biallelic genetic markers: Bypassing gene trees in a full coalescent analysis. Molecular Biology and Evolution 29: 1917–1932. doi: 10.1093/molbev/mss086.

Hutner, S.H. & Corliss, J.O. 1976. Search for clues to the evolutionary meaning of ciliate phylogeny. The Journal of Protozoology 23: 48–56.

Maureira-Butler, I.J., Pfeil, B.E., Muangprom, A., Osborn, T.C. & Doyle, J.J. 2008. The reticulate history of Medicago (Fabaceae). Systematic Biology 57: 466–82. doi: 10.1080/10635150802172168.

Steele, K.P., Ickert-Bond, S.M., Zarre, S. & Wojciechowski, M.F. 2010. Phylogeny and character evolution in Medicago (Leguminosae): Evidence from analyses of plastid trnK/matK and nuclear GA3ox1 sequences. American Journal of Botany 97: 1142–55. doi: 10.3732/ajb.1000009.

Yoder, J.B., Briskine, R., Mudge, J., Farmer, A., Paape, T., Steele, K., et al. 2013. Phylogenetic signal variation in the genomes of Medicago (Fabaceae). Systematic Biology, doi: 10.1093/sysbio/syt009.

7 comments on “How many phylogenies are there in a genome? Lots!”

  1. [...] Doubts about Johns Hopkins research have gone unanswered, scientist says Virus and Genes Involved in Causation of Schizophrenia In Search of Energy Miracles Knome Inc., of Cambridge, Massachusetts, protests the award of a contract to Personalis, Inc., of Palo Alto, California, under request for proposals (RFP) No. VA-240-12-R-0154 (I find this fascinating) How many phylogenies are there in a genome? Lots! [...]

  2. [...] using data collected by the Medicago HapMap Project. (I described the results briefly over at Nothing in Biology Makes Sense! last week.) Like most genome projects, the MHP has its own infrastructure for making its data [...]

  3. [...] How many phylogenies are there in a genome? Lots! ( [...]

  4. Mike Harvey says:


    Have you looked at Pickrell and Pritchard’s (2012) model for constructing trees from SNP data, TreeMix? I have been playing around with it and it is very fast (runs in seconds on my simple 4-population tree with ~50k SNPs, and with 10,000 bootreps requiring about 1.5 hrs of CPU time). I am trying to decide whether to try SNAPP as well, but based on your comments about computation, my datasets will probably be intractable with that method.

    Mike Harvey

    • Yoder says:

      I didn’t run across TreeMix before publication—it looks like it’d be worth consideration, though I’m not sure our sampling scheme (1 sequence from each of 29 species) would be appropriate for a model based on allele frequencies?

      I suspect that the difficulty we ran into with SNAPP is as much about the number of terminal taxa (i.e., larger possible treespace) as it is about sheer size of the dataset. The SNAPP demonstration paper includes analysis of simulations with trees of four or eight taxa, and an empirical dataset of 69 individuals from 6 populations—so smaller trees, and/or more samples per terminal branch.

      Your dataset sounds more like that than it does like ours, so maybe SNAPP is worth a try, if only for comparison of performance. (For what it’s worth, implementing SNAPP wasn’t too painful, though it did require some manual editing of BEAST-style input files.)

  5. Mike Harvey says:

    Good point RE your sampling scheme, I doubt two alleles per tip would be sufficient, and it seems like artificially binning tips into “species” wouldn’t be a good idea. Thanks for the advice on SNAPP, I will definitely try it out.


  6. […] morphology, she also found lots of variation across the dataset in gene trees. That result which sounds familiar, and it probably means there’s been a lot of recent, maybe even ongoing, gene flow within the […]
