Biologists have been constructing trees since before we knew why tree-shapes made such convenient organizing structures for living things. Since Charles Darwin (and Alfred Russel Wallace) first made the case that diverse groups of living species arise from common ancestors, we understand that tree-like relationships reflect this common descent, so that if we can infer the specific structure of those relationship-trees, or phylogenies, we can begin to draw conclusions about how individual species evolved to be what they are today.
Back in the day, we had to estimate phylogenies using directly observable characteristics: measurements of particular parts of particular bones (for mammals), or the shape of the antennae (for butterflies), or the capacity to synthesize lysine (for paramecia). If species that are more recently related tend, on average, to look more similar to each other, this kind of morphological data can be useful. But! When you’re setting out to reconstruct a phylogeny from a whole pile of morphological measurements (dietary preferences, tooth counts, fur color, wing length), what do you do when different traits support different relationship structures? Different traits will naturally change at different rates over evolutionary time, and it’s rarely obvious what those rates are.
Starting in about the 1970s, though, it became increasingly straightforward to directly compare the genetic codes of different species. That’s appealing for a couple of reasons. First, because DNA is as direct a marker of inheritance as you can find, it provides a record that’s independent of potentially misleading morphological similarities. Second, because DNA sequences have only four character states (the good old “bases” adenine, guanine, thymine, and cytosine), it’s more tractable to estimate how they change over time.
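To give a sense of why four character states make the math manageable, here is a minimal sketch of the Jukes-Cantor correction, the simplest textbook model of DNA change. It assumes all four bases are equally frequent and all substitutions equally likely, and it converts the observed proportion of differing sites into an estimated number of substitutions per site. It is an illustration only, not the model used in the analysis described in this post, and the function names are made up for the example.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b) or len(seq_a) == 0:
        raise ValueError("sequences must be aligned, equal-length, and non-empty")
    diffs = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return diffs / len(seq_a)


def jukes_cantor_distance(p):
    """Estimated substitutions per site under the Jukes-Cantor model.

    p is the observed proportion of differing sites. The correction is
    undefined at p >= 0.75, the level of difference expected between two
    random sequences of four equally frequent bases.
    """
    if not 0 <= p < 0.75:
        raise ValueError("p must be in the range [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)


# Hypothetical toy sequences, differing at one of eight sites.
p = p_distance("ACGTACGT", "ACGTACGA")
d = jukes_cantor_distance(p)
```

The corrected distance is always a bit larger than the raw proportion of differences, because some sites will have changed more than once and hidden their own history.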
But even DNA data doesn’t guarantee you can easily figure out a phylogeny. Genetic sequences recombine in the course of sexual reproduction, and when that sexual reproduction crosses species boundaries, it can mean that different genetic regions will support different species relationships. And even when species obligingly refrain from fraternizing, if they’ve split apart from each other relatively recently, their genetic code may not yet fully reflect the split—a phenomenon called “unresolved ancestral polymorphism.”
The figure below attempts to illustrate those possible conflicts between gene trees and species trees. On the left, gray outlines trace the true history of species A, B, and C, and the blue, red, and green lines illustrate possible histories for individual genes carried by those species. On the right, the same colors are used to illustrate the species relationships you’d get if you used only the individual genes—while the blue gene would give you the correct phylogeny, the red and green ones would mislead you if you examined them on their own.
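How often should you expect a single gene to mislead you this way? For three species that never hybridize, standard coalescent theory gives a simple answer, sketched below. The sketch assumes the true species tree is ((A,B),C) and that `t` is the length of the internal branch (the time between the two species splits) in coalescent units, which for diploids is generations divided by twice the effective population size. Neither the function nor those assumptions come from the paper discussed here; this is just the textbook three-species result.

```python
import math


def gene_tree_probabilities(t):
    """Probability of each gene tree topology for three species.

    Assumes the species tree ((A,B),C), no gene flow, and an internal
    branch of length t in coalescent units. With probability exp(-t) the
    A and B lineages fail to coalesce on the internal branch, and then
    all three possible pairings are equally likely.
    """
    if t < 0:
        raise ValueError("branch length must be non-negative")
    no_coalescence = math.exp(-t)
    return {
        "((A,B),C)": 1 - 2 * no_coalescence / 3,  # matches the species tree
        "((A,C),B)": no_coalescence / 3,          # discordant
        "((B,C),A)": no_coalescence / 3,          # discordant
    }


# A short internal branch leaves a lot of room for discordant gene trees.
recent_split = gene_tree_probabilities(0.1)
older_split = gene_tree_probabilities(3.0)
```

When the two splits happen in quick succession, the chance that any one gene matches the species tree drops toward one in three, which is no better than guessing.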
So, in order to really sort out the relationships among a collection of species, especially species that are relatively recently diverged from their common ancestor, or prone to hybridization, you need to examine the relationships suggested by many genetic regions. It used to be a challenge simply to collect sequence data from multiple independent genes, but nowadays high-throughput sequencing methods are making multi-gene phylogenies possible for nearly any collection of species that might interest you.
In fact, it’s very nearly possible to collect more data than we know what to do with. That was the challenge my coauthors and I dealt with in a paper that’s recently been released online ahead of print in the journal Systematic Biology. We set out to infer relationships among more than two dozen species in the genus Medicago, which includes alfalfa (Medicago sativa), and my favorite “model” legume, barrel medic (M. truncatula). The Medicago HapMap project had collected whole-genome sequence data from species across the genus as part of its efforts to understand the genetic diversity of M. truncatula. (Genetic data from closely related species can provide evolutionary context for diversity within a single focal species.)
Using that data, we identified some 87,000 individual DNA bases that varied among the sampled species: single-nucleotide polymorphisms, or SNPs. That’s not a lot in terms of actual sequence data. But considering that every one of those 87,000 SNPs is a variable character, and that most of them were probably spread far enough across the genome to have independent evolutionary histories, the data set contains many more independent “gene trees” than most DNA data sets used to estimate phylogenies. There are several software packages that estimate phylogenies by comparing the relationships supported by multiple independent genes, but none of them can tackle thousands of independent gene trees without using absurd computational resources.
So here’s what we did instead: we pretended that all those SNPs would support the same phylogeny, and then examined how well that phylogeny held up to closer scrutiny. First, we estimated a “genome-wide” phylogeny by treating all 87,000 sites as one big gene sequence in the program MrBayes. Then we systematically broke the data into smaller chunks (“windows” of 500 SNPs in order along the genome of Medicago truncatula) and estimated a phylogeny from each window. We then compared the phylogeny estimated from each window to the one we got from the whole dataset.
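The windowing step can be sketched in a few lines. This is not the code from the paper: it assumes the SNPs are already sorted by genome position and stored as one string of alleles per species, and the `step` parameter (how far each window starts from the previous one) is an assumption, since the overlap between windows isn't described here. Each sliced window would then be handed to a tree-estimation program as its own small alignment.

```python
def snp_windows(n_snps, size=500, step=500):
    """Yield (start, end) index pairs for windows of `size` SNPs.

    Windows start every `step` SNPs; a trailing stretch shorter than
    `size` is dropped rather than returned as a partial window.
    """
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    for start in range(0, n_snps - size + 1, step):
        yield start, start + size


def slice_alignment(alignment, start, end):
    """Cut the same window of SNPs out of every species' sequence."""
    return {species: seq[start:end] for species, seq in alignment.items()}


# Hypothetical toy alignment: 12 SNPs, windows of 4 SNPs, no overlap.
toy = {
    "M_truncatula": "ACGTACGTACGT",
    "M_sativa": "ACGAACGTTCGT",
}
windows = [slice_alignment(toy, s, e) for s, e in snp_windows(12, size=4, step=4)]
```

Setting `step` smaller than `size` gives overlapping windows, which smooths the scan along the genome at the cost of making neighboring windows less independent.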
The figure below gives a taste of our results: for each window along the M. truncatula genome, it plots the number of nodes, or inferred relationships, at which the phylogeny estimated from that window differs from the whole-genome tree (in black dots) and the proportion of those “conflicting nodes” that had strong support in the MrBayes analysis. Windows with a large number of conflicting nodes, and a large proportion of strongly supported conflicting nodes, would be positively misleading if you used that window to estimate the species tree of Medicago on its own.
… or would they? What we found, scanning across the whole genome this way, was what seems like a lot of variation in phylogenetic signal; 332 of the 349 windows gave strong support to at least one node that differed from the genome-wide estimate. Based on previous phylogenetic studies of Medicago, this isn’t a highly surprising result—two earlier studies (by Maureira-Butler et al. and by Steele et al.) found considerable conflict among phylogenies estimated from much smaller samples of independent genes.
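For the curious, the per-window comparison described above boils down to asking which groupings in a window’s tree are missing from the genome-wide tree. Here is a simplified sketch, not the analysis code from the paper. It assumes rooted trees written as nested tuples of species names, treats a node as “conflicting” if its set of descendant species doesn’t appear in the reference tree, and uses a posterior probability cutoff of 0.95 for “strong support”, which is a common convention but an assumption here.

```python
def clades(tree):
    """Return every clade in a rooted tree as a frozenset of leaf names.

    The tree is a nested tuple, e.g. (("A", "B"), "C"); leaves are strings.
    """
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    walk(tree)
    return found


def conflicting_nodes(window_tree, reference_tree):
    """Clades in the window tree that are absent from the reference tree."""
    return clades(window_tree) - clades(reference_tree)


def strongly_supported_fraction(conflicts, support, threshold=0.95):
    """Share of conflicting clades whose support meets the threshold.

    `support` maps clades to support values; clades without a value
    are counted as unsupported.
    """
    if not conflicts:
        return 0.0
    strong = sum(1 for c in conflicts if support.get(c, 0.0) >= threshold)
    return strong / len(conflicts)


# Hypothetical example: the window groups A with C instead of A with B.
reference = (("A", "B"), "C")
window = (("A", "C"), "B")
conflicts = conflicting_nodes(window, reference)
fraction = strongly_supported_fraction(conflicts, {frozenset({"A", "C"}): 0.98})
```

Running that comparison for every window gives the two values plotted per window: the count of conflicting nodes and the proportion of them with strong support.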
So the phylogeny we estimated by cramming together a whole genome’s worth of markers clearly oversimplifies the thousands of different evolutionary histories contained in our data. Can we do better? Maybe eventually, with more computing power, or more efficient computational methods, we can start to apply models that actually account for the possibility of hybridization and unresolved diversity. One method we tried, the program SNAPP, had a lot of promise, since it’s written for exactly the kind of data we had, but it couldn’t chew through our data set, or even smaller subsets of it, quickly enough to be practical.
Even with more advanced estimation methods, though, the phylogeny of Medicago, or any group of living things, remains just that—an estimate, an educated reconstruction of relationships among the species involved. We’re getting better and better at making these reconstructions, but even when we have thousands of genetic markers to apply to our evolutionary questions, maybe it’s good for us to keep that uncertainty in mind.
Bryant, D., Bouckaert, R., Felsenstein, J., Rosenberg, N.A. & RoyChoudhury, A. 2012. Inferring species trees directly from biallelic genetic markers: Bypassing gene trees in a full coalescent analysis. Molecular Biology and Evolution 29: 1917–1932. doi: 10.1093/molbev/mss086.
Hutner, S.H. & Corliss, J.O. 1976. Search for clues to the evolutionary meaning of ciliate phylogeny. The Journal of Protozoology 23: 48–56.
Maureira-Butler, I.J., Pfeil, B.E., Muangprom, A., Osborn, T.C. & Doyle, J.J. 2008. The reticulate history of Medicago (Fabaceae). Systematic Biology 57: 466–482. doi: 10.1080/10635150802172168.
Steele, K.P., Ickert-Bond, S.M., Zarre, S. & Wojciechowski, M.F. 2010. Phylogeny and character evolution in Medicago (Leguminosae): Evidence from analyses of plastid trnK/matK and nuclear GA3ox1 sequences. American Journal of Botany 97: 1142–1155. doi: 10.3732/ajb.1000009.
Yoder, J.B., Briskine, R., Mudge, J., Farmer, A., Paape, T., Steele, K., et al. 2013. Phylogenetic signal variation in the genomes of Medicago (Fabaceae). Systematic Biology, doi: 10.1093/sysbio/syt009.