|Home | About | Journals | Submit | Contact Us | Français|
The evolutionary history of a set of species is usually described by a rooted phylogenetic tree. Although it is generally undisputed that bifurcating speciation events and descent with modifications are major forces of evolution, there is a growing belief that reticulate events also have a role to play. Phylogenetic networks provide an alternative to phylogenetic trees and may be more suitable for data sets where evolution involves significant amounts of reticulate events, such as hybridization, horizontal gene transfer, or recombination. In this article, we give an introduction to the topic of phylogenetic networks, very briefly describing the fundamental concepts and summarizing some of the most important combinatorial methods that are available for their computation.
Phylogenetic analysis aims at uncovering the evolutionary relationships between different species or taxa in order to obtain an understanding of the evolution of life on Earth. “Phylogenetic trees” are widely used to address this task and are usually computed from molecular sequences. By definition, phylogenetic trees are well suited to represent evolutionary histories in which the main events are speciations (at the internal nodes of the tree) and descent with modification (along the edges of the tree).
However, these trees are less suited to model mechanisms of “reticulate evolution” (Sneath 1975), such as horizontal gene transfer, hybridization, recombination, or reassortment. Moreover, mechanisms such as incomplete lineage sorting or complicated patterns of gene duplication and loss can lead to incompatibilities that cannot be represented on a tree. Although the analysis of individual genes or short stretches of genomic sequences often supports a single phylogenetic tree, different genes, or sequence segments usually support different trees.
“Phylogenetic networks” provide an alternative to phylogenetic trees when analyzing data sets whose evolution involves significant amounts of reticulate events (Sneath 1975; Syvanen 1985; Delwiche and Palmer 1996; Griffiths and Marjoram 1996; Rieseberg 1997; Doolittle 1999). Moreover, even for a set of taxa that have evolved according to a tree-based model of evolution, phylogenetic networks can be usefully employed to explicitly represent conflicts in a data set that may be caused by mechanisms such as incomplete lineage sorting or by the inadequacies of an assumed evolutionary model (Huson and Bryant 2006).
Although rooted phylogenetic networks can, in theory, be used to explicitly describe evolution in the presence of reticulate events, their calculation is difficult and computational methods for doing so have not yet matured into practical and widely used tools (Hein 1993; Gusfield et al. 2003; Huson et al. 2005; Song et al. 2005; Bordewich et al. 2007; Tofigh et al. 2010). In contrast, a number of established tools for computing unrooted phylogenetic networks can be used to visualize incompatible evolutionary scenarios in phylogeny and phylogeography (Bandelt and Dress 1992; Bandelt et al. 1995, 1999; Huson 1998; Clement et al. 2000; Bryant and Moulton 2004; Huson and Bryant 2006).
In this paper, we give an introduction to the topic of phylogenetic networks, very briefly describing the fundamental concepts and summarizing some of the most important methods that are available for the computation of phylogenetic networks. In practice, most currently available algorithms for computing phylogenetic networks are based on combinatorics, so we focus on these approaches. Some approaches developed within a maximum parsimony or maximum likelihood framework can be found, for example, in Hein (1993); Jin et al. (2006a, 2006b, 2007); Dessimoz et al. (2008). Figure 1 shows the relationships between some of the concepts mentioned in this paper. For ease of exposition, some of the more technical terms in this survey are defined in table 1.
The purpose of this paper is to give a short survey of the combinatorial methods used to infer phylogenetic networks. More details on the concepts and algorithms introduced in this paper as well as biological examples of their applications can be found in (Huson et al. 2011).
In the literature, the term phylogenetic network is defined and used in a number of different ways, usually focusing on the specific type of network that an author happens to be interested in (Bandelt 1994; Gusfield et al. 2003; Linder and Rieseberg 2004). We propose the following general definition:
Definition 1 (Phylogenetic network) A phylogenetic network is any graph used to represent evolutionary relationships (either “abstractly” or “explicitly”; see below) between a set of taxa that label some of its nodes (usually the leaves).
Phylogenetic networks can be computed from a wide range of data, including multiple sequence alignments, distance matrices, set of trees, clusters, splits, rooted triplets, or unrooted quartets. As with phylogenetic trees, a first major distinction is between “unrooted” and “rooted” phylogenetic networks:
Definition 2 (Unrooted phylogenetic network) Let X be a set of taxa. An “unrooted phylogenetic network” N on X is any undirected graph whose leaves are bijectively labeled by the taxa in X.
A number of different types of unrooted phylogenetic networks are in use. In this paper, we mainly focus on the important class of “split networks” (Bandelt and Dress 1992). A second important class of unrooted phylogenetic networks are “quasi-median networks,” which can be viewed as a generalization of split networks.
A “rooted Direct Acyclic Graph (DAG)” is a directed graph that is free of directed cycles and that contains precisely one node without ancestors, called the “root.” Rooted phylogenetic networks generalize rooted phylogenetic trees:
Definition 3 (Rooted phylogenetic network). Let X be a set of taxa. A “rooted phylogenetic network” N on X is a rooted DAG where the set of leaves is bijectively labeled by the taxa in X.
For an example of unrooted and rooted phylogenetic networks, see figure 2.
The envisioned role of rooted phylogenetic networks in biology is to describe the evolution of life in a way that explicitly includes reticulate events. Ultimately, the main goal is to work out the details of a rooted phylogenetic network of life, such as popularized by Doolittle (1999).
Phylogenetic networks can be used in two different ways. The first use is as a tool for visualizing incompatible data sets in a helpful manner, in which case we speak of an “abstract” phylogenetic network. The second type of usage is as a representation of a putative evolutionary history involving reticulate events, in which case, the network is called “explicit.”
By definition, most (if not all) types of unrooted phylogenetic networks are abstract networks, as evolution is inherently rooted (and thus any unrooted phylogenetic tree is also abstract, in this sense). However, rooted phylogenetic networks can be either abstract or explicit, depending on how they are constructed and interpreted.
The necessity of distinguishing between abstract and explicit networks was pointed out in Morrison (2005). They are called implicit and explicit in Huson (2007). In Morrison (2010), abstract and explicit networks are named “data-display” networks and “evolutionary” networks, respectively.
In the literature, perhaps as many as 20 different names have been defined for different types of phylogenetic networks. A closer look reveals that some networks are named by the algorithms that compute them or by mathematical properties that define them, such as “neighbor-nets” or “median networks.” Others are named by topological constraints that are imposed on them for computational reasons, such as “galled trees,” “galled networks,” or “level-k networks.” Yet others are named by the types of evolutionary events which they model, such as “hybridization networks,” “recombination networks,” or “duplication-loss-transfer (DLT) networks.”
A number of special types of unrooted phylogenetic networks are used in practice, the most important of which we consider in detail in this paper.
The foundation for split networks was laid in Bandelt and Dress (1992). Let X be a set of taxa and assume that we are given a set of “splits” S on X, usually with a “weighting” that assigns a nonnegative weight to each split, which may represent character changes or distances or may also have a more abstract interpretation. If the set of splits S is “compatible,” then it can be represented by an unrooted phylogenetic tree, and each edge in the tree corresponds to exactly one of the splits (Buneman 1971). More generally, S can always be represented by a “split network,” which is an unrooted phylogenetic network with the property that every split S in S is represented by an array of parallel edges in N.
An example is shown in figure 3, where the three central edges highlighted in bold represent the split that separates the outgroups from the Branchiopoda. Indeed, the removal of these edges produces precisely two subtrees, one which has leaves that are labeled by Branchiopoda species and the other with leaves that are labeled by outgroup species.
Two methods for constructing split networks from weighted splits are the “convex hull algorithm” and the “circular network algorithm.” The convex hull algorithm can be applied to any set of splits S and computes a split network representing S that contains an exponential number of nodes and edges in the worst case (Bandelt et al. 1995). It is also used to compute median networks, as described below. The circular network algorithm can be applied to any set of “circular splits” and produces an “outer-labeled planar” network with only a quadratic number of nodes and edges (Dress and Huson 2004).
In many cases, direct application of the convex hull algorithm leads to an overcomplicated network. In practice, a useful heuristic is first to choose an order of the taxa such that a large subset of the given set of splits is circular. This subset of splits is then processed using the circular network algorithm to obtain an outer-labeled planar network. The remaining splits are then processed using the convex hull algorithm, which will add some nonplanar parts to the network.
A split network N can be obtained from a number of different types of data. To be more precise, the algorithms mentioned below do not compute a split network directly; rather, they all compute a set of weighted splits S. A split network N is then computed from S as described above. All splits-based algorithms discussed in this article are implemented in the program SplitsTree4 (Huson and Bryant 2006).
A number of methods exist for computing a set of weighted splits for a given distance matrix D on X. The two most important are split decomposition (Bandelt and Dress 1992) and “Neighbor-Net” (Bryant and Moulton 2004).
“Split decomposition” takes a distance matrix D on X as input and produces a set of weighted splits S on X that is “weakly compatible,” a property that ensures that the corresponding split network will not be too complicated. Indeed, in practice, the resulting split networks are often quite close to being outer-labeled planar, as they usually have only a few edges crossing over each other and do not contain any “high-dimensional cubes,” which may occur for completely unrestricted sets of splits. In practice, split decomposition is a very conservative method, in the sense that a split will only be present in the output if there is global support for it in the given data set. For large or diverse data sets, the method tends to exhibit very low resolution and thus its use is limited to small data sets of less than 100 taxa, say.
Neighbor-Net takes a distance matrix D on X as input and produces a set of weighted splits S on X that is circular and can be represented by a outer-planar split network using the circular network algorithm. Neighbor-Net is more popular than split decomposition because it is less conservative and so does not lose resolution on larger data sets. Moreover, the fact that the output of the method can always be represented by a outer-planar split network and is thus easy to visualize adds to its attraction; see figure 3.
Both network methods have the attractive property that they produce the set of splits corresponding to the correct tree when given a tree-like matrix.
Let T = (T1, … , Tk) be a collection of unrooted phylogenetic trees on X. These might be different gene trees, trees for the same gene computed using different methods, or a set of trees obtained in a Bayesian analysis, for example. Split networks can be used to visualize conflicting signals present in T.
The set of majority-consensus splits is defined as the set of all splits that are present in more than 50% of the input trees. By lowering the threshold to a proportion p of 50% or less, one obtains a set of splits Sp(T) that will not necessarily be compatible. The split network N associated with Sp(T) is called a “consensus split network” and can be used to visualize conflicting signals in a set of trees (Holland and Moulton 2003).
One of the first published applications of this method was to a collection of 106 different unrooted phylogenetic trees involving eight different yeast species (Holland et al. , gene trees from Rokas et al. ).
Figure 4 shows clearly that the gene trees disagree somewhat as to where the outgroup taxon Candida albicans attaches to the phylogeny. Moreover, they also disagree on whether Saccharomyces kudriavzevii and Saccharomyces bayanus are sister taxa.
In practice, in a collection of gene trees, the set of taxa that occurs in each tree will often differ between trees, simply because some gene sequences may not be available for all taxa. To address this, methods have been developed to compute a “super split network” for a given set of unrooted phylogenetic trees T on overlapping but nonidentical taxon sets using the “Z-closure” algorithm (Huson et al. 2004; Whitfield et al. 2008).
Assume that we are given a multiple sequence alignment M on X.
A first approach to obtaining a split network for M is to compute a set of splits that represents M using the “parsimony-splits” method (Bandelt and Dress 1993). This method takes a multiple alignment M on X as input and produces a set of weakly compatible splits S on X using a simple modification of the split decomposition algorithm. The parsimony-splits method has not been used much in the literature, probably because the resulting set of splits is usually very similar to the one obtained by the more widely known split decomposition.
Another way of computing a split network from M is first to restrict M to obtain a matrix containing only the columns in M that contain exactly two different character states and then focus on the “condensed version” of , say . Then, any column of defines a different split of the taxon set, and the set of splits obtainable in this way can be represented by a split network N. If we label each edge of the network N by the columns in the alignment that correspond to the split represented by the edge, then the resulting split network is called a median network (Bandelt et al. 1995). This construction is suitable for data sets that have very few differences in them. Hence, median networks are mainly used in phylogeography and population studies. Because parallel mutations can lead to complicated structures in such a network, the concept of a “reduced” median network was also introduced (Bandelt et al. 1995), in which one attempts to simplify the network by postulating appropriate parallel mutation events.
An example of a median network is shown in figure 5. In Casado et al. (2007), the distribution of Callicebus lugens (Platyrrhini, Primates) at the Rio Negr, in Brazil, is reported. The study focuses on eight specimens, one group of four taken from the left bank of the river and another group of four taken from the right bank. It is based on a multiple alignment M of cytochrome b DNA sequences of length 1,140. A median-network analysis shows a clear separation of the two groups. Note that only 35 columns are retained in the condensed version of M (the ones labeling the edges).
Mathematicians are interested in developing methods that infer a phylogenetic tree or network from basic building blocks. In the computation of an unrooted tree or split network, these are phylogenetic trees on sets of four taxa, sometimes called “quartet trees.” One such method is the “quartet-net” method, or “QNet,” for short (Grünewald et al. 2007). This algorithm takes a set Q of weighted quartet topologies on X as input and, using a modification of Neighbor-Net, produces a set of weighted splits S on X that is circular, and thus can be represented by an outer-planar split network. Because compatible splits are always circular, it follows that the QNet method (combined with the circular network algorithm) always computes the correct phylogenetic tree when given an input set that corresponds to a tree.
As we mentioned in the previous section, a median network can be used to visualize a set of binary characters on a set of taxa X. The concept of a quasi-median network is a generalization of the concept of a median network that was introduced to represent multistate characters. Note that, unlike median networks, quasi-median networks are not split networks. A quasi-median network is defined as a phylogenetic network, the node set of which is given by the quasi-median closure of the condensed version of M and in which any two nodes are joined by an edge if and only if the sequences associated with the nodes differ in exactly one position. The quasi-median closure is defined as the set of all sequences that can be obtained by repeatedly taking the “quasi-median of any three sequences” in the set and then adding the result to the set (see fig. 6).
In general, the quasi-median closure consists of a huge set of sequences and hence the quasi-median network for a multiple sequence alignment M of DNA sequences on X is usually too large and too complicated to be of practical interest. At the other extreme, a “minimum spanning network” (Excoffier and Smouse 1994; Bandelt et al. 1999) can be used to represent the differences between the sequences in M. This type of network is also often of limited interest because it contains one node per taxon and no additional nodes.
The “median-joining” algorithm (Bandelt et al. 1999) constructs an informative subnetwork of the full quasi-median network, repeatedly using the concept of a (relaxed) minimum spanning network and repeatedly employing the quasi-median calculation. Although the former construction, on its own, will produce too few nodes to be useful, the latter construction alone will produce too many nodes. By using both together, the median-joining method attempts to provide a useful network of intermediate size. The median-joining method is best suited for closely related sequences that have evolved without recombinations and is widely used in phylogeography and population studies, usually based on mtDNA or the Y chromosome. An application is shown in figure 7, where the cluster containing all non-African sequences attaches to only one of the clusters of African lineages. This network is thus consistent with the out-of-Africa model of human origins, suggesting that all non-African populations are derived from one African lineage.
Implementations of the median network and median-joining algorithms are provided by the programs Network (http://www.fluxus-engineering.com) and SplitsTree4.
A number of other types of unrooted phylogenetic networks are in use. We briefly describe two of them.
A haplotype network is an unrooted phylogenetic network in which the nodes represent different haplotypes within a group of (usually very closely related) taxa and the edges join those sequences or haplotypes that are very similar. The edges are usually labeled by the positions at which the joined haplotypes differ.
Both the median network computation and the median-joining algorithm can be used to compute a haplotype network. Another popular approach is the “TCS approach” (Templeton et al. 1992). It is based on the concept of statistical parsimony and aims at producing a haplotype network in which two haplotypes are joined by an edge if and only if a quantity called the “probability of parsimony” (defined in Templeton et al. , eqs. 6–8) exceeds 95% for the edge. The TCS method is similar to the (quasi-)median network method in that it attempts to place sequences onto the nodes of a network, infer additional nodes and label edges by the number of differences between different sequences. An implementation is available from: http://darwin.uvigo.es/software/tcs.html.
A reticulogram is an unrooted phylogenetic tree to which a set of auxiliary edges has been added. A reticulogram is obtained from a distance matrix D on X using the T-Rex software, which first computes a phylogenetic tree on X (using a method such as neighbor-joining) and then repeatedly adds shortcut edges to the graph until the distances between the taxa in the graph show a good fit to the distances in the original input matrix D (Makarenkov 2001). An implementation is available from: http://www.trex.uqam.ca.
Unfortunately, it is easy to construct a reticulogram R on X such that the T-Rex algorithm will fail to reconstruct R from the distance matrix DR (see fig. 8).
Let X be a set of taxa and N a rooted phylogenetic network on X. Any node of indegree ≥2 is called a “reticulate” node and all others are called “tree” nodes. Any edge leading to a reticulate node is called a reticulate edge and all others are called tree edges. Definition 3 is very general and additional requirements can be made. For example, the network can be described as “bicombining,” that is, that all reticulate nodes have indegree 2.
How do we interpret such a rooted phylogenetic network mathematically? Perhaps the most important feature of a rooted phylogenetic tree or network is the set of “clusters” that the network represents, as clusters suggest putative monophyletic groups and thus provide hypotheses about the evolutionary relatedness of the taxa under consideration. Hence, in this paper, we treat the calculation of rooted phylogenetic networks in a “cluster-centric” manner and usually interpret rooted phylogenetic networks as representations of sets of clusters.
Exactly which clusters does a rooted phylogenetic network N on X represent? This question has two different answers.
Let N be a rooted phylogenetic network on X. We use the term “hardwired clusters” to refer to the set of all clusters Chard(N) that are obtainable from a rooted phylogenetic network N in the following way: each tree edge e in N represents precisely one cluster γ(e), which is given by the set of all taxa that appear as labels of nodes below e, that is, all labels of nodes that are descendants of the target node of e.
An alternative way to define the set of clusters represented by N is to use the set of all clusters obtainable from the set of trees T(N) represented by N. We refer to this as the set Csoft(N) of clusters represented by N in the “softwired” sense. To obtain these clusters directly from the network N, one must treat the in-edges leading to each reticulation r as a set of alternatives, one of which is “on” if and only if all others are “off.” A softwired cluster C is then obtained directly from the network by first deciding, for each reticulation, which reticulation edge is on and which is off, and then collecting all taxa that are reachable below some fixed tree edge e without using any reticulation edge that is off.
To understand the relationship between Chard(N) and Csoft(N), consider figure 9.
Note that given a rooted phylogenetic network N, the set of hardwired clusters of N contains one cluster per tree edge of N, whereas the number of softwired clusters represented by N is exponential in the number of reticulations contained in N, in the worst case. Note also that given a phylogenetic tree T, it holds that hard(T) = soft(T).
Assume that we are given a set of clusters C on X. A “cluster network” N for C is a rooted phylogenetic network that represents the set of clusters on X in the hardwired sense and it can be computed efficiently using the “cluster-popping” algorithm (Huson and Rupp 2008). The number of edges that it contains is, at most, quadratic in the number of given clusters. A cluster network is an abstract phylogenetic network that can be used, for example, to provide a combined visualization of a whole set of rooted phylogenetic trees. Indeed, it has recently been shown (Huson et al. 2011) that a cluster network N that represents all clusters of a given set of trees also contains all the trees themselves, if they are bifurcating; otherwise it contains resolutions of them (for an example, see fig. 10d).
However, in practice, the resulting network may sometimes be too large and messy to be of real use. As discussed above for the consensus (super) split networks, one way to address this problem is to represent only those clusters that occur in at least p percent of the input trees, where p is a user-defined parameter. The resulting network will then no longer represent all trees in their full resolution, as some of them will occur only in a contracted form.
A number of new methods aim at constructing a rooted phylogenetic network N that represents a set of clusters in the softwired sense, motivated by the assumption that the set of clusters that a network represents is its most important feature, as argued above.
Unfortunately, in general, rooted phylogenetic networks interpreted in the softwired sense are computationally hard to work with. Indeed, even just determining whether a given rooted phylogenetic network N contains some given cluster C on X (in the softwired sense) is NP-complete (Kanj et al. 2008). To avoid these computational problems, we restrict our attention to topologically constrained classes of networks. The concepts of a galled tree (Wang et al. 2000; Gusfield et al. 2003), a galled network (Huson et al. 2009), and a level-k network (Choy et al. 2005) all put constraints on how tangled the undirected cycles in a rooted phylogenetic network may be. The algorithm presented in van Iersel et al. (2010), which aims at computing level-k networks, shows particular promise of becoming a general tool for computing rooted phylogenetic networks from different types of data.
Note that a rooted phylogenetic network that is interpreted in the softwired sense usually requires fewer edges to represent a set of clusters than a hardwired one because individual tree edges can represent more than one cluster. An example of this behavior is shown in figure 10. This implies that a rooted phylogenetic network representing all the clusters of some trees may fail to represent the trees themselves.
All cluster-based methods mentioned in this paper are implemented in the program Dendroscope2 (Huson and Scornavacca 2011).
Assume that we are given a set of taxa X that have evolved under a model of evolution that includes both speciation events and descent with modification, as usual, and, in addition, hybridization events. The evolutionary history of the taxa in X can then be represented by a rooted phylogenetic network N on X where the tree nodes correspond to speciation events and the reticulate nodes correspond to putative hybridization events (Maddison 1997; Linder and Rieseberg 2004). A rooted phylogenetic network that is interpreted in this way is called a hybridization network.
We may attempt to determine such a hybridization network computationally when given two or more gene trees on X, the topologies of which differ significantly and we suspect that these differences are created by hybridization events. The corresponding computational problem can be formulated as follows. Given a set T of two or more rooted phylogenetic trees on X, determine a rooted phylogenetic network N that contains all trees in T and has a “minimum” number of reticulate nodes. This is known to be a computationally hard problem (Bordewich and Semple 2007).
In practice, these algorithms appear to run reasonably fast in many cases. No comparable algorithm exists at present for solving the problem on more than two input trees.
An application is shown in figure 11, where we display two trees, T1 and T2, computed for 14 different species of grass (Poaceae), based on the phyB and waxy genes, respectively; see Grass Phylogeny Working Group (2001). Both the two networks shown in figure 11c and d contain both trees and have the minimum number of reticulate nodes with this property, namely three. If we assume that differences in the topology of the two trees T1 and T2 are a result of hybridization events, then, for example, the network in (c) suggests that P. glyceria is a hybrid of the lineages leading to P. melica and P. triticium. In the case of the two other putative hybrid species, P. lygeum and P. chusquea, their evolution requires the postulation of additional lineages to resolve the fact that they appear to be hybrids of recent and less recent lineages. We emphasize that neither network “proves” that hybridization is the cause of the incongruence between trees T1 and T2, and additional biological evidence is required to support suspected cases of speciation by hybridization.
Assume that we are given a set of taxa X that have evolved under a model of evolution that includes, as usual, both speciation events and descent with modification and also recombination events. The evolutionary history of the taxa in X can then be represented by a rooted phylogenetic network N on X where the tree nodes correspond to speciation events and the reticulate nodes correspond to recombination events. In addition, we require that the following two labelings are given (Griffiths and Marjoram 1996; Gusfield et al. 2003; Huson and Klöpper 2005; Song and Hein 2005):
These labels must be compatible in the sense that the sequences assigned to tree nodes of the network differ exactly by the indicated mutations, whereas the sequences assigned to reticulate nodes must be obtainable from the sequences assigned to the parent nodes by a crossover. A rooted phylogenetic network that is augmented and interpreted in this way is called a recombination network.
An early approach to the problem of computing a recombination network (Hein 1993) is based on the idea of assigning a rooted phylogenetic tree to each position of a given multiple sequence alignment M on X in a most parsimonious way and then combining all the trees into a suitable rooted phylogenetic network N. When doing this, a trade-off must be made between the number of incompatibilities between a character and its associated local tree on the one hand, and, on the other, the “recombination cost” of switching from one tree topology to a different one when going from position i to i + 1.
Although the approach has to solve two NP-hard problems and is not practical, it is conceptually appealing because it explicitly addresses the “mosaic” nature of aligned sequences: A long multiple sequence alignment consists of stretches of sequence that have evolved along a common rooted phylogenetic tree, and these stretches are separated by crossover positions at which recombinations have occurred.
More recently, Gusfield et al. (2003) have established a different approach. To obtain a problem that is computationally tractable, they restrict their attention to recombination networks that have the galled tree property. A rooted phylogenetic network N is called a galled tree if every reticulation edge contained in a nontrivial “biconnected component” of N leads to the same reticulation node r. This approach finds a recombination network that is a galled tree, if any exists, in polynomial time. An application is shown in figure 12.
Gusfield’s initial papers on galled trees (Gusfield et al. 2003; Gusfield 2005) generated a lot of interest in this topic. Other papers in this area include Gusfield and Bansal (2005); Huson et al. (2005); Huson and Klöpper (2005); Song (2006); Gusfield et al. (2007). By developing and improving the lower and upper bounds for the number of recombinations required by a data set, Song et al. (2005) have developed a new approach that can be used (in theory) to compute a minimal recombination network. A new approach aimed at a computing a putative recombination history in practice is presented in Parida et al. (2008).
Work on recombination in the context of population genetics by Hein and his colleagues goes far beyond the one approach that we described briefly above. For example, under the “coalescent-with-recombination” model of population genetics, a description of the history of n-sampled sequences going backward in time gives rise to a graph that is called an “ancestral recombination graph” (Griffiths and Marjoram 1997; Hein et al. 2005; Song and Hein 2005). This graph is used to perform statistical analyses of the inheritance and prevalence of genes in populations, and the specific topology is often treated as a nuisance variable and integrated out.
Assume that we are given a set of taxa X that have evolved under a model of evolution that includes, as usual, both speciation events and descent with modification as well as gene duplication, loss, and horizontal gene transfer events (Delwiche and Palmer 1996; Planet et al. 2003).
The associated computational problem can be formulated as follows: Given a gene tree T and a corresponding species tree Tsp, reconcile all differences between the two trees by postulating an appropriate DLT scenario. Such a scenario provides a mapping of the gene tree onto the species tree that implies certain duplication, loss, and transfer events. Because the presence of horizontal gene transfer events, this DTL scenario can be seen as a network. An example is shown in figure 13.
Recently, two fast algorithms for a inferring most parsimonious DLT scenario have been proposed. The one described in Tofigh et al. (2010) may propose scenarios that are not time consistent (because of lateral gene transfer events) and considers losses only a posteriori, whereas the other (Doyon et al. 2010) needs the species tree to be dated so as to avoid time-inconsistent scenarios.
A number of other types of rooted phylogenetic networks have been developed. We now briefly discuss three of them.
Many viruses are organized into segments of sequence and evolve both by descent with modification and also by reassortment, a process by which viruses that have coinfected a host exchange segments of their genomes. Reassortment is an important mechanism. For example, a possible route to infection of humans by avian strains of influenza A is for swine to be coinfected by avian and human viruses, which reassort to produce a new virus carrying both avian- and human-adapted genes (Castrucci et al. 1993).
A reassortment network is a directed graph in which the nodes represent viral isolates and the edges represent the evolutionary history of the viruses, including reassortment events. Edge weights reflect the edit costs of reassortment and mutation events. Such a graph is organized in layers that correspond to evolutionary stages, such as the seasons in which the viruses were isolated (Bokhari and Janies 2008).
Gene duplication is a common event in evolution and so many genes are present in multiple copies in a genome. When some taxa are represented by multiple copies of a gene in a phylogenetic tree, then the tree is called “multilabeled.” To analyze the duplication history of a gene, it may be helpful to map such a multilabeled phylogenetic tree onto a single-labeled rooted phylogenetic network so as to see which parts of the tree are similar and which are different (Huber et al. 2006). Algorithms for constructing such a network are discussed in Huson et al. (2011) and are implemented in the program Dendroscope2 (Huson and Scornavacca 2011).
As already mentioned, mathematicians are interested in developing methods that infer a phylogenetic tree or network from basic building blocks. In the computation of a rooted tree or network, these are rooted phylogenetic trees on three taxa, which are sometimes called “rooted triples.” In this context, the input is a set R of rooted triples on X, and the goal is to compute a rooted phylogenetic network N that contains all the rooted triples in R and is optimal in some sense. One possible optimality criterion is to minimize the “level” of the network N, which is defined as the maximum number of reticulation nodes contained in any biconnected component of the network (Jansson et al. 2006). In To and Habib (2009), the authors describe an algorithm that can compute the level-k network with minimum number of reticulations (if such a network exists), for every fixed k in polynomial time.
For unrooted phylogenetic networks, most of the methods mentioned here are routinely used in phylogenetic analysis or phylogeography, particularly Neighbor-Net, consensus split (super) networks and median-joining, given distances, trees, or sequences, respectively.
This is not the case for rooted phylogenetic networks. Although a number of algorithms have been described for computing rooted phylogenetic networks, some problems must be overcome. First, many of the algorithms have only proof-of-concept implementations that are not designed to be used as tools in real studies. Second, the computational problems are often hard, and the algorithms have impractical running times. Third, the calculation of rooted phylogenetic networks must be more closely linked to detailed biological models of reticulate evolution so as to produce more plausible results.
At present, none of the existing methods for computing a rooted phylogenetic method is widely or routinely used as a tool to help understand the evolutionary history of a given set of taxa in terms of mutations, speciations, and specific types of reticulate events. Although rooted phylogenetic networks are conceptually very appealing, the development of suitable methods for their computation remains a formidable challenge.