Building a phylogenetic tree
The logic behind phylogenetic trees. How to build a tree using data about features that are present or absent in a group of organisms.
- Phylogenetic trees represent hypotheses about the evolutionary relationships among a group of organisms.
- A phylogenetic tree may be built using morphological (body shape), biochemical, behavioral, or molecular features of species or other groups.
- In building a tree, we organize species into nested groups based on shared derived traits (traits different from those of the group's ancestor).
- The sequences of genes or proteins can be compared among species and used to build phylogenetic trees. Closely related species typically have few sequence differences, while less related species tend to have more.
We're all related, and I don't just mean us humans, though that's most definitely true! All living things on Earth can trace their descent back to a common ancestor. Any smaller group of species can also trace its ancestry back to a common ancestor, often a much more recent one.
Given that we can't go back in time and see how species evolved, how can we figure out how they are related to one another? In this article, we'll look at the basic methods and logic used to build phylogenetic trees, or trees that represent the evolutionary history and relationships of a group of organisms.
Overview of phylogenetic trees
In a phylogenetic tree, the species of interest are shown at the tips of the tree's branches. The branches themselves connect up in a way that represents the evolutionary history of the species—that is, how we think they evolved from a common ancestor through a series of divergence (splitting-in-two) events. At each branch point lies the most recent common ancestor shared by all of the species descended from that branch point. The lines of the tree represent long series of ancestors that extend from one species to the next.
For a more detailed explanation, check out the article on phylogenetic trees.
Even once you feel comfortable reading a phylogenetic tree, you may have the nagging question: How do you build one of these things? In this article, we'll take a closer look at how phylogenetic trees are constructed.
The idea behind tree construction
How do we build a phylogenetic tree? The underlying principle is Darwin’s idea of “descent with modification.” Basically, by looking at the pattern of modifications (novel traits) in present-day organisms, we can figure out—or at least, make hypotheses about—their path of descent from a common ancestor.
As an example, let's consider the phylogenetic tree below (which shows the evolutionary history of a made-up group of mouse-like species). We see three new traits arising at different points during the evolutionary history of the group: a fuzzy tail, big ears, and whiskers. Each new trait is shared by all of the species descended from the ancestor in which the trait arose (shown by the tick marks), but absent from the species that split off before the trait appeared.
When we are building phylogenetic trees, traits that arise during the evolution of a group and differ from the traits of the ancestor of the group are called derived traits. In our example, a fuzzy tail, big ears, and whiskers are derived traits, while a skinny tail, small ears, and lack of whiskers are ancestral traits. An important point is that a derived trait may appear through either loss or gain of a feature. For instance, if there were another change on the E lineage that resulted in loss of a tail, taillessness would be considered a derived trait.
Derived traits shared among the species or other groups in a dataset are key to helping us build trees. As shown above, shared derived traits tend to form nested patterns that provide information about when branching events occurred in the evolution of the species.
When we are building a phylogenetic tree from a dataset, our goal is to use shared derived traits in present-day species to infer the branching pattern of their evolutionary history. The trick, however, is that we can’t watch our species of interest evolving and see when new traits arose in each lineage.
Instead, we have to work backwards. That is, we have to look at our species of interest – such as A, B, C, D, and E – and figure out which traits are ancestral and which are derived. Then, we can use the shared derived traits to organize the species into nested groups like the ones shown above. A tree made in this way is a hypothesis about the evolutionary history of the species – typically, one with the simplest possible branching pattern that can explain their traits.
Example: Building a phylogenetic tree
If we were biologists building a phylogenetic tree as part of our research, we would have to pick which set of organisms to arrange into a tree. We'd also have to choose which characteristics of those organisms to base our tree on (out of their many different physical, behavioral, and biochemical features).
If we're instead building a phylogenetic tree for a class (which is probably more likely for readers of this article), odds are that we'll be given a set of characteristics, often in the form of a table, that we need to convert into a tree. For example, this table shows presence (+) or absence (0) of various features:
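| Feature | Lamprey | Antelope | Bald eagle | Alligator | Sea bass |
|---|---|---|---|---|---|
| Lungs | 0 | + | + | + | 0 |
| Jaws | 0 | + | + | + | + |
| Feathers | 0 | 0 | + | 0 | 0 |
| Gizzard | 0 | 0 | + | + | 0 |
| Fur | 0 | + | 0 | 0 | 0 |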
Next, we need to know which form of each characteristic is ancestral and which is derived. For example, is the presence of lungs an ancestral trait, or is it a derived trait? As a reminder, an ancestral trait is what we think was present in the common ancestor of the species of interest. A derived trait is a form that we think arose somewhere on a lineage descended from that ancestor.
Without the ability to look into the past (which would be handy but, alas, impossible), how do we know which traits are ancestral and which derived?
- In the context of homework or a test, the question you are solving may tell you which traits are derived vs. ancestral.
- If you are doing your own research, you may have knowledge that allows you to identify ancestral and derived traits (e.g., based on fossils).
- You may be given information about an outgroup, a species that's more distantly related to the species of interest than they are to one another.
If we are given an outgroup, the outgroup can serve as a proxy for the ancestral species. That is, we may be able to assume that its traits represent the ancestral form of each characteristic.
For instance, in our example (data repeated below for convenience), the lamprey, a jawless fish that lacks a true skeleton, is our outgroup. As shown in the table, the lamprey lacks all of the listed features: it has no lungs, jaws, feathers, gizzard, or fur. Based on this information, we will assume that absence of these features is ancestral, and that presence of each feature is a derived trait.
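| Feature | Lamprey | Antelope | Bald eagle | Alligator | Sea bass |
|---|---|---|---|---|---|
| Lungs | 0 | + | + | + | 0 |
| Jaws | 0 | + | + | + | + |
| Feathers | 0 | 0 | + | 0 | 0 |
| Gizzard | 0 | 0 | + | + | 0 |
| Fur | 0 | + | 0 | 0 | 0 |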
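The outgroup reasoning can be sketched in a few lines of Python. This is only a minimal illustration, assuming the presence and absence values described in the text; the names `SPECIES`, `table`, and `derived_carriers` are made up for this example.

```python
# Presence (True) or absence (False) of each feature, as described in the text.
# Column order follows SPECIES.
SPECIES = ["lamprey", "antelope", "bald eagle", "alligator", "sea bass"]
table = {
    "lungs":    [False, True,  True,  True,  False],
    "jaws":     [False, True,  True,  True,  True],
    "feathers": [False, False, True,  False, False],
    "gizzard":  [False, False, True,  True,  False],
    "fur":      [False, True,  False, False, False],
}
OUTGROUP = "lamprey"


def derived_carriers(table, species, outgroup):
    """Treat the outgroup's state as ancestral for every feature and
    return, per feature, the species whose state differs from it."""
    out = species.index(outgroup)
    result = {}
    for feature, states in table.items():
        ancestral = states[out]
        result[feature] = [sp for sp, st in zip(species, states) if st != ancestral]
    return result


carriers = derived_carriers(table, SPECIES, OUTGROUP)
for feature, group in carriers.items():
    kind = "shared derived" if len(group) > 1 else "derived, not shared"
    print(f"{feature}: {kind} in {', '.join(group)}")
```

The sketch relies on the same assumption the article makes: the outgroup's state is taken as ancestral, so any species that differs from the lamprey is counted as carrying the derived state.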
Now, we can start building our tree by grouping organisms according to their shared derived features. A good place to start is by looking for the derived trait that is shared between the largest number of organisms. In this case, that's the presence of jaws: all the organisms except the outgroup species (lamprey) have jaws. So, we can start our tree by drawing the lamprey lineage branching off from the rest of the species, and we can place the appearance of jaws on the branch carrying the non-lamprey species.
Next, we can look for the derived trait shared by the next-largest group of organisms. This would be lungs, shared by the antelope, bald eagle, and alligator, but not by the sea bass. Based on this pattern, we can draw the lineage of the sea bass branching off, and we can place the appearance of lungs on the lineage leading to the antelope, bald eagle, and alligator.
Following the same pattern, we can now look for the derived trait shared by the next-largest number of organisms. That would be the gizzard, which is shared by the alligator and the bald eagle (and absent from the antelope). Based on this data, we can draw the antelope lineage branching off from the alligator and bald eagle lineage, and place the appearance of the gizzard on the latter.
What about our remaining traits of fur and feathers? These traits are derived, but they are not shared, since each is found only in a single species. Derived traits that aren't shared don't help us build a tree, but we can still place them on the tree in their most likely location. For feathers, this is on the lineage leading to the bald eagle (after divergence from the alligator). For fur, this is on the antelope lineage, after its divergence from the alligator and bald eagle.
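The grouping steps above can be sketched as a short Python routine. It is a simplified illustration, not a general tree-building method: it assumes that presence of each feature is the derived state and that the shared derived traits nest cleanly inside one another, as they do in this example. The names `has_feature` and `branching_order` are invented for the sketch.

```python
# Species that show each feature, as described in the text.
has_feature = {
    "lungs": {"antelope", "bald eagle", "alligator"},
    "jaws": {"antelope", "bald eagle", "alligator", "sea bass"},
    "feathers": {"bald eagle"},
    "gizzard": {"bald eagle", "alligator"},
    "fur": {"antelope"},
}
all_species = {"lamprey", "antelope", "bald eagle", "alligator", "sea bass"}


def branching_order(has_feature, all_species):
    """Work from the most widely shared derived trait to the least.
    Returns (trait, species that branch off before the trait appears)
    pairs, plus the innermost group left at the end."""
    # Only traits shared by two or more species help place branch points.
    shared = [t for t, group in has_feature.items() if len(group) >= 2]
    shared.sort(key=lambda t: len(has_feature[t]), reverse=True)

    remaining = set(all_species)
    steps = []
    for trait in shared:
        group = has_feature[trait]
        if not group <= remaining:
            # The traits do not nest; this simple approach cannot resolve that.
            raise ValueError(f"{trait} conflicts with earlier groupings")
        steps.append((trait, sorted(remaining - group)))
        remaining = set(group)
    return steps, sorted(remaining)


steps, last_group = branching_order(has_feature, all_species)
for trait, split_off in steps:
    print(f"{', '.join(split_off)} branches off, then {trait} appears")
print("Innermost group:", ", ".join(last_group))
```

Traits found in only one species (feathers and fur here) are skipped when placing branch points, which mirrors the point that unshared derived traits do not help build the tree.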
Parsimony and pitfalls in tree construction
When we were building the tree above, we used an approach called parsimony. Parsimony essentially means that we are choosing the simplest explanation that can account for our observations. In the context of making a tree, it means that we choose the tree that requires the fewest independent genetic events (appearances or disappearances of traits) to take place.
For example, we could have also explained the pattern of traits we saw using the following tree:
This series of events also provides an evolutionary explanation for the traits we see in the five species. However, it is less parsimonious because it requires more independent changes in traits to take place. Because of where we've put the sea bass, we have to hypothesize that jaws independently arose two separate times (once in the sea bass lineage, and once in the lineage leading to antelopes, bald eagles, and alligators). This gives the tree a total of six tick marks, or trait change events, versus five in the more parsimonious tree above.
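Counting trait changes on competing trees can be sketched in Python. The version below is a minimal illustration under stated assumptions: trees are written as nested tuples, the ancestral state at the root is taken to be absence (0) for every feature, and `tree_b` is only one possible placement of the sea bass that matches the description above, since the exact alternative tree is not given in the text. The helper names are invented for the sketch.

```python
INF = float("inf")

# 1 = feature present, 0 = absent, as described in the text.
traits = {
    "jaws":     {"lamprey": 0, "sea bass": 1, "antelope": 1, "alligator": 1, "bald eagle": 1},
    "lungs":    {"lamprey": 0, "sea bass": 0, "antelope": 1, "alligator": 1, "bald eagle": 1},
    "gizzard":  {"lamprey": 0, "sea bass": 0, "antelope": 0, "alligator": 1, "bald eagle": 1},
    "feathers": {"lamprey": 0, "sea bass": 0, "antelope": 0, "alligator": 0, "bald eagle": 1},
    "fur":      {"lamprey": 0, "sea bass": 0, "antelope": 1, "alligator": 0, "bald eagle": 0},
}

# The tree built step by step in the example.
tree_a = ("lamprey", ("sea bass", ("antelope", ("alligator", "bald eagle"))))
# Assumed alternative: sea bass grouped with the lamprey (hypothetical placement).
tree_b = (("lamprey", "sea bass"), ("antelope", ("alligator", "bald eagle")))


def min_changes(tree, states, root_state=0):
    """Fewest gains or losses of one trait on the tree, given that the
    root starts in root_state (small dynamic program over the tree)."""
    def costs(node):
        if isinstance(node, str):  # a tip: its observed state costs nothing
            return {s: 0 if states[node] == s else INF for s in (0, 1)}
        total = {0: 0, 1: 0}
        for child in node:
            c = costs(child)
            for s in (0, 1):
                # Best child state, paying 1 if it differs from this node's state.
                total[s] += min(c[t] + (s != t) for t in (0, 1))
        return total
    return costs(tree)[root_state]


def tree_length(tree):
    """Total number of trait changes the tree requires across all traits."""
    return sum(min_changes(tree, states) for states in traits.values())


print(tree_length(tree_a))  # 5
print(tree_length(tree_b))  # 6
```

Under these assumptions the first tree needs 5 changes and the second needs 6, with the extra change coming from jaws having to arise twice.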
In this example, it may seem fairly obvious that there is one best tree, and counting up the tick marks may not seem very necessary. However, when researchers make phylogenies as part of their work, they often use a large number of characteristics, and the patterns of these characteristics rarely agree with one another. Instead, there are some conflicts, where one tree would fit better with the pattern of one trait, while another tree would fit better with the pattern of another trait. In these cases, the researcher can use parsimony to choose the one tree (hypothesis) that fits the data best.
You may be wondering: Why don't the trees all agree with one another, regardless of what characteristics they're built on? After all, the evolution of a group of species did happen in one particular way in the past. The issue is that, when we build a tree, we are reconstructing that evolutionary history from incomplete and sometimes imperfect data. For instance:
- We may not always be able to distinguish features that reflect shared ancestry (homologous features) from features that are similar but arose independently (analogous features arising by convergent evolution).
- Traits can be gained and lost multiple times over the evolutionary history of a species. A species may have a derived trait, but then lose that trait (revert back to the ancestral form) over the course of evolution.
Biologists often use many different characteristics to build phylogenetic trees because of sources of error like these. Even when all of the characteristics are carefully chosen and analyzed, there is still the potential for some of them to lead to wrong conclusions (because we don't have complete information about events that happened in the past).
Using molecular data to build trees
A tool that has revolutionized, and continues to revolutionize, phylogenetic analysis is DNA sequencing. With DNA sequencing, rather than using physical or behavioral features of organisms to build trees, we can instead compare the sequences of their orthologous (evolutionarily related) genes or proteins.
The basic principle of such a comparison is similar to what we went through above: there's an ancestral form of the DNA or protein sequence, and changes may have occurred in it over evolutionary time. However, a gene or protein doesn't just correspond to a single characteristic that exists in two states.
Instead, each nucleotide of a gene or amino acid of a protein can be viewed as a separate feature, one that can flip to multiple states (e.g., A, T, C, or G for a nucleotide) via mutation. So, a gene with n nucleotides in it could represent n different features, each existing in one of four states! The amount of information we get from sequence comparisons, and thus the resolution we can expect to get in a phylogenetic tree, is much higher than when we're using physical traits.
To analyze sequence data and identify the most probable phylogenetic tree, biologists typically use computer programs and statistical algorithms. In general, though, when we compare the sequences of a gene or protein between species:
- A larger number of differences corresponds to less related species
- A smaller number of differences corresponds to more related species
For example, suppose we compare the beta chain of hemoglobin (the oxygen-carrying protein in blood) between humans and a variety of other species. If we compare the human and gorilla versions of the protein, we'll find only a single amino acid difference. If we instead compare the human and dog proteins, we'll find more differences. With human versus chicken, the number of amino acid differences is larger still, and with human versus lamprey (a jawless fish), it is the largest of all. This pattern reflects that, among the species considered, humans are most closely related to the gorilla and least closely related to the lamprey.
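The difference-counting idea can be sketched in Python. The sequences below are invented toy strings, not real hemoglobin data, and they are assumed to be already aligned to the same length; `count_differences` and the species labels are hypothetical names for this illustration.

```python
def count_differences(seq_a, seq_b):
    """Number of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


# Invented toy protein fragments (NOT real data), already aligned.
reference = "MVHLTPEEKS"
others = {
    "species_1": "MVHLTPEEKT",
    "species_2": "MVHLSAEEKA",
    "species_3": "MTHWSAEDKA",
}

diffs = {name: count_differences(reference, seq) for name, seq in others.items()}

# Fewer differences suggests a closer relationship to the reference species.
ranked = sorted(diffs, key=diffs.get)
for name in ranked:
    print(f"{name}: {diffs[name]} differences")
```

Real analyses use full alignments and statistical models rather than raw counts, so this only illustrates the rule of thumb that fewer differences suggest a closer relationship.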
You can see Sal working through an example involving phylogenetic trees and sequence data in this AP biology free response question video.
- One thing that I am unsure of is regarding the idea of a common ancestor. Is a common ancestor an individual or a population? Is the last universal common ancestor an individual or a population? Or is it impossible to know?
If the common ancestor of humans is two individuals, this would mean there is a theoretical 'Adam & Eve' type situation. But that would surely be too small a population from which to develop a species; there wouldn't be enough variation. So the common ancestor must refer to a population that split off to cause speciation. Is that correct?
- Good question! A common ancestor is a species, which may consist of multiple populations. For example, the most recent common ancestor of humans and chimps was an ape species that is now extinct. It consisted of multiple populations, so there was enough genetic diversity for some populations to gradually give rise to humans while others gradually gave rise to chimps.
- In the phylogenetic tree, the finished diagram with maximum parsimony, in the last step, does it matter where you branch off the alligator and where you branch off the bald eagle?
- It's personal preference. If you wanted, you could switch the two.
- Seven species (A, B, C, D, E, F, G) and three ancestral traits: medium molar enamel, round shape of the orbits, curly tail. A = medium, round, curly. B = medium, square, curly. C = medium, round, none. D = medium, round, none. E = thick, square, curly. F = medium, round, none. G = thin, square, none. How would I construct a phylogenetic tree?
- C/D/F-A-B/E-G coming off of B. My logic is that C/D/F are the ancestral species. I then build the tree starting with the species with one derived trait, then two, and then three, where they no longer share any of the specific ancestral traits.
- What is the difference between a shared derived trait and a derived trait? Do these mean the same thing (trait that appears in a clade that is different from the ancestor of a group)? Also, what is the difference between a shared ancestral trait and a homologous trait?
- If a phylogenetic tree is meant to be a reconstruction of an evolutionary sequence, can there be more than one correct set of relationships among a group of species?
- Usually the tree you find in a publication is the one that best fits the data under the underlying mathematical model. It is the most probable solution among a set of candidate solutions.
- Can we use extinct species as an outgroup?
- Good question. It really depends on how the species are related. Current practice does not divide organisms into two separate categories, living and extinct. Instead, every species that has ever existed can be placed on a phylogenetic tree based on its physical features and, where available, its genetic relationships. A fossil of an extinct species may be more distantly related to the group of interest than any living sister species is, which supports what you proposed in your question.
- Can a phylogenetic tree be illustrated as lines branching off of other lines, like in this example, or can it be made from brackets connecting two groups and, within those groups, more brackets connecting other groups together?
- You can do it any way that illustrates the branching of the species from common ancestry points.
- If there is a change on the E lineage and the descendants of E, F and G, have no tails, since taillessness is also present in its most recent common ancestor with A, B, C, and D, should taillessness still be a derived trait? Which ancestor should I compare a species to when looking for a derived trait? Thanks in advance.
- In that case, taillessness would still be a derived trait. When looking for derived traits, the species should be compared with the species closest to it on the phylogenetic tree - in this case, it's D, so if D possesses a tail, tailless E has a derived trait.
- Which gene do we choose to create a phylogenetic tree?
- If you want to compare species, then you will choose a gene that you can find in all of them. Some genes can be found multiple times in each species' genome, so to avoid taking the "wrong" one you should pick a gene that is present only once. This is an active area of research.
- What happens to the branches when a species or organism goes extinct?