Sequence based typing methods are rapidly becoming the gold standard for epidemiological surveillance with the data generated being also used to study microbial population genetics.
Of the several methods developed, Multilocus Sequence Typing (MLST) [1
] has been widely applied to clinically relevant bacterial species, and online databases have been implemented [2
], facilitating the sharing and analysis of MLST data. The method itself is based on the sequencing of specific regions of (typically) seven housekeeping genes of the microorganism of choice. The selection of housekeeping genes stems from the assumption that they are under moderate to strong purifying selection, and the resulting sequence variation is mostly neutral. Therefore, the accumulation of variation by mutation, in the absence of recombination, occurs approximately linearly with time, and the resulting genetic distance tends to be proportional to the time of divergence between alleles. Those sequences are compared to an allele database for each gene, each unique sequence is assigned a numerical identifier and the combination of alleles at each locus creates an allelic profile. Each unique allelic profile is then converted to a Sequence Type (ST) that unambiguously identifies a clone. The main aim of MLST is to provide a portable, accurate, and highly discriminative typing system and this approach has been used successfully with many different bacteria and other microorganisms.
The first approach to the MLST data analysis for the illustration of the relationships between isolates, was the creation of Unweighted Pair Group Method with Arithmetic mean (UPGMA) dendrograms from distance matrices containing the pairwise differences of allelic profiles [6
]. This type of cluster analysis presents several advantages, such as the ease of interpretation and the creation of an hierarchical grouping of the isolates, that can provide a global overview of the relatedness of the isolates under study and how the defined clusters are connected to each other. However, the topology of such dendrograms frequently does not reflect the patterns of descent [8
]. As discussed by Hall and Barlow [9
] bifurcating tree-like representations of the relationships between isolates are frequently misinterpreted as depicting the precise evolutionary history of the isolates and are particularly susceptible to the confounding effects of recombination. A bifurcating tree representation, rather than a graph-based one, may more easily lead to false conclusions about the relationships of the isolates, and the latter should be preferred when recombination is high [10
Other analyses methodologies focus directly on the sequence data instead of on an allelic profile. Although supported by a large body of work [11
], these methods face an even greater challenge of disentangling the role of recombination over mutation. Distance metrics based on the comparison of a simple alignment of concatenated sequences can be greatly affected by a single recombination event that results on multiple nucleotide changes, thereby disrupting the phylogenetic signal. In spite of these caveats, Multilocus Sequence Analysis (MLSA) has been successfully used for the phylogenetic analysis of species relationships, partly because the use of multiple loci helps to circumvent this problem [12
]. Recently, a new Bayesian model based method was proposed as a solution for this problem by detecting if the differences in the sequence came from recombination or mutation events [14
]. The model, implemented in the ClonalFrame approach, relies on the analysis of the sequence of each locus and assumes that recombination occurs only with alleles not represented in the population, an unlikely assumption for many bacterial species that have been extensively sampled. The resulting distance matrix can then be represented as an UPGMA tree with bootstrap support or as a graph that should describe the contributions of both recombination and mutation.
Nevertheless, the most popular analysis of the allelic profiles in order to infer an hypothetical phylogenetic relationship between STs is that performed by the eBURST algorithm [8
]. This algorithm allows for an unrooted tree-based representation of the relationship of the isolates analyzed, based on the number of differences in the allelic profile, assigning isolates to clonal complexes (CCs). The main advantage of eBURST is that it implements the simplest model for the emergence of clonal complexes [15
]: a given genotype increases in frequency in the population, as a consequence of a fitness advantage or of random genetic drift, becoming a founder clone in the population, and this increase is accompanied by a gradual diversification of that genotype, by mutation and recombination, forming a cluster of phylogenetically closely related strains. This diversification of the "founding" genotype is reflected in the appearance of STs differing only in one housekeeping gene sequence from the founder genotype – single locus variants (SLVs). Further diversification of those SLVs will result in the appearance of variations of the original genotype with more than one difference in the allelic profile: double locus variants (DLVs), triple locus variants (TLVs), and so on. Upon application of the eBURST algorithm to an entire data set, the result is a forest, a disjoint set of trees (acyclic graphs), where each tree corresponds to a clonal complex. Those trees are the result of the application of a set of rules based on the model just described to the graph representing all SLV links. Thus, by considering only SLV links, eBURST does not aim at linking the entire population but identifies different clonal complexes.
The final eBURST forest provides us with an hypothetical pattern of descent for the strains analyzed, illustrating the possible phylogenetic relationships between STs [8
]. The reliance on the comparison of allelic profiles buffers eBURST against the possibility of the introduction of multiple sequence changes by a single recombination event.
Another current data analysis methodology for sequence based typing techniques relying on the difference between two allelic profiles such as MLST and Multilocus Variable Number of Tandem Repeats Analysis (MLVA), consists of computing a Minimum Spanning Tree (MST) [18
]. An MST is a tree that connects all entries in such a way that the summed distance of all links of the tree is the shortest (minimum) [22
]. In a biological context, the MST principle and a maximum parsimony principle share the idea that evolution should be explained with as few events as possible. The main difference between the two is that parsimony methods allow the introduction of hypothetical samples, that are created to construct the internal nodes of the tree, whereas the real samples from the data set are represented as the leaves of the tree. Those hypothetical samples, are assumed to be common ancestors of the current population that can no longer be sampled. In the Bionumerics(tm) software MLST data can be analyzed has a minimum spanning tree using a simple categorical coefficient, and the algorithm was adapted to include hypothetical STs whenever such an assumption decreases significantly the total span of the tree. These hypothetical STs usually correspond to missing types for which a number of SLVs are present in the data set [25
] and that are assumed to exist but not to have been included due to incomplete sampling. Although this is a reasonable assumption with the model of clonal expansion and diversification described above, it is a clear departure from the MST problem, since one postulates the existence of nodes that were not represented in the sampling and that clearly affect the MST topology. A tool available at the Pubmlst website [3
] also allows users to construct MSTs based on MLST data, but on the same page a link to a tool allowing a BURST analysis of the data is also offered and the relationship between the two methods is unclear. Up to now, MSTs and eBURST for the analysis of MLST data have been considered distinct methodologies.
In spite of both eBURST and the MST approach implemented in Bionumerics(tm) being clearly related to the MST problem, they present some particularities that deserve clarification. The main difference between the two methodologies is that MST will always connect all the STs without forming defined clonal complexes as eBURST does. Using MSTs, those clonal complexes have to be infered by using empirical rules based on other data available for the isolates, or by a pre-defined allelic distance. If the later is chosen, the most common definition of CC is a group of STs that have at least six alleles in common with another member of the defined group. In the MST problem one deals with a weighted graph for which one searches for a spanning tree whose sum of link weights is minimum. However, when analyzing MLST data the methods do not clearly define link weights or have identical weights for most of the links. The simple categorical coefficient of Bionumerics(tm) uses the number of allelic differences between STs, similarly to the eBURST approach. In a situation where multiple links connecting different ST nodes have identical weights a number of solutions to the MST problem exist. To overcome the degeneracy of the MST problem and present a unique solution, these methods usually consider a quality ordering of the links relying on additional measures of profile similarity. Bionumerics(tm) also has a feature to include sets of rules based on BURST rules (using only the number of SLVs and DLVs) or to add other user defined rules. In order to present the unique solution, the combinatorial optimization problem becomes finding a tree with the highest possible quality. The MST problem and the Maximum Weight Forest problem are particular cases of graphic matroids [26
]. Thus, finding a solution to represent MLST data by the use of MSTs consists of solving instances of graphic matroids [26
] which can be optimally solved with a greedy approach [29
]. We should also note that many optimal solutions for a given instance may exist that have the same quality. Formally, this happens when considering the quality sorted lists of links for two different optimal trees, the link lists have the same size and, for each list position, links in both lists have the same quality [24
The eBURST algorithmic implementation can be enunciated as follows: the STs are clustered with respect to their number of SLVs, DLVs, TLVs and occurrence frequency. Given a graph where each ST is a node and where a link between two STs exists whenever they are SLVs, we want to construct a forest, i.e. a collection of disjoint trees, such that each ST connects to the neighbor with highest number of SLVs. In case of a tie, we should consider the number of DLVs, followed by TLVs and lastly by the neighbors occurrence frequency.
In the current eBURST implementation [8
] a two step approach is used for each disjoint graph. In the first step, the algorithm builds a tree by doing a breadth-first search (BFS) starting from the group founders, i.e. the STs with highest rank in the disjoint graphs, considering the tiebreak rules of number of SLVs, DLVs, TLVs and occurrence frequency. The BFS is done iteratively. First, the group founder is connected to all its SLVs, then, each one of those, following the ST ordering, is connected to their SLVs not yet present in the tree. This process is repeated until every element of the disjoint graph is present in the tree. The second step consists in a local optimization of the tree. For each ST on the tree, the algorithm checks if a SLV exists with higher rank than the current parent. If such SLV exists, and if it is not a grandparent of the current ST, the link is deleted and the higher rank SLV becomes the new parent of the ST, thus locally improving the tree. The grandparent check is needed to verify that the tree is not broken nor does it form loops. As we have mentioned, the initial BFS tree is locally optimized, therefore, the obtained solution may not be the optimal solution. Nevertheless, and although eBURST v3 readme file [17
] suggests that the problem cannot be solved without using ad-hoc rules, the described problem can be stated as a graphic matroid [26
]. These problems can be optimally solved with the algorithm of Kruskal [23
]. Thus, in this article we provide the formulation of this problem and we solve it with the mentioned algorithm (goeBURST – global optimal eBURST). We also provide an exhaustive analysis of the existing biological data sets and comparison of the results with both implementations. To allow readers to use the new implementation, a prototype software is provided at http://goeBURST.phyloviz.net
where users can analyze their data sets with the proposed algorithm.