We first describe the principles behind our method. Next, we present the FluReF algorithm, including a description of its tunable parameters. We then present the results on various flu datasets from the state of New York. We conducted a visual analysis of a collection of 2005–2008 New York flu genomes, identifying two reassortments, and ran FluReF on this dataset. We then expanded the temporal and geographic scope of the data to test the robustness of FluReF by augmenting our dataset with (i) a large number of sequences from the same area (New York) from a prior year (2004) and (ii) sequence data from all over the United States. We then ran FluReF on another, unrelated flu dataset from Holmes et al. [8]. Finally, we experimented with a larger set of simulated sequences.
FluReF exploits certain characteristics of phylogenetic trees of the flu genome. The trees produced from samples taken over a number of years in the same geographical location follow a well-established pattern: sequences from the same year tend to cluster together, sometimes forming a clade with sequences from the year before or the year after [3]. Another common feature of localized phylogenetic trees is that sequences collected in earlier years tend to be closer to the root than those collected in later years, as they had less time to evolve away from the common ancestor at the root. In visual inspection methods [12], the exploration starts by examining the full genome tree, looking for individual sequences, or small groups of sequences, that do not fit these characteristics.
For example, we may find some sequences that are not grouped with the others from the same year, but with sequences sampled in an earlier year: Figure shows a toy example where clade E from year 3 is grouped together with the sample from year 2. Similarly, we may find some sequences that, while grouped together with the rest of the season, are separated from them by a significantly large distance: Figure shows clade E correctly grouped with the other samples from year 3, but at a significant evolutionary distance from them. In either case, sequences phylogenetically separated from their seasonal grouping are candidates for reassortment. We postulate that this genetic disparity is possible if a strain from the sampling year, the survivor of the previous bottleneck event, has reassorted some of its segments with a strain that re-emerged from the source population. We assume that the lower selective pressure in the source population results in slower evolutionary change, so that a re-emerging sample from the source population would be more genetically similar to the sink population from prior sampling years.
To test for a reassortment, we examine the eight segment trees, searching for an isolated candidate clade. If the candidate clade remains isolated in all individual segment trees, the reason is unrelated to reassortment. One of the possible explanations is that such candidate strains infected the human host in a geographic area far away from the sampling area and thus have a somewhat different evolutionary history. If, however, the candidate clade is grouped together with the other samples from its season in some of the segments, but is isolated in others, we have identified a probable reassortment. Figure shows a toy example with three sampled years. Segments in the isolated candidate clade E3 (3, 5, and 6) have come from the seasonal migration of the source strain, while the rest of the segments for E3 (1, 2, 4, 7, and 8) came from the local seasonal population.
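The segment-level test described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `classify_candidate` and its input format (a mapping from segment number to a flag saying whether the candidate clade is isolated in that segment tree) are hypothetical and are not taken from the FluReF implementation.

```python
def classify_candidate(isolated_by_segment):
    """Classify a candidate clade from its per-segment isolation flags.

    `isolated_by_segment` maps a segment number (1-8) to True when the
    candidate clade is isolated from its season in that segment tree.
    (Hypothetical input format, assumed for illustration only.)
    """
    isolated = sorted(s for s, flag in isolated_by_segment.items() if flag)
    grouped = sorted(s for s, flag in isolated_by_segment.items() if not flag)
    if isolated and grouped:
        # Mixed evolutionary histories across segments.
        return "probable reassortment", isolated
    if isolated:
        # Isolated everywhere: the cause is unrelated to reassortment.
        return "isolated in every segment", isolated
    return "not isolated", isolated


# Toy example from the text: clade E3 is isolated in segments 3, 5, and 6.
toy = {segment: segment in (3, 5, 6) for segment in range(1, 9)}
print(classify_candidate(toy))
```

The three-way return value mirrors the three cases in the paragraph: isolated in some segments only, isolated in all segments, and not isolated at all.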
FluReF carries out an exhaustive bottom-up search of the phylogenetic tree reconstructed from the full genome sequences. As the search proceeds, various measures are checked to ensure that candidate reassortments satisfy parameter thresholds motivated by the visual inspection.
In the main loop of the algorithm, each leaf node (a single sequence) is considered if it was not already identified as part of a candidate reassortment. A candidate group is grown upward from the leaf; expansion terminates when the noise threshold is reached, that is, when further growth would exceed a tunable parameter that dictates when the candidate group encompasses an unacceptably heterogeneous sample from different years.
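A minimal sketch of this upward growth is given below. It assumes, purely for illustration, that "noise" is the fraction of leaves under a node that were not sampled in the majority year; the text does not give the exact definition, and the `Node` class, `noise`, `grow_candidate`, and the `max_noise` default are all hypothetical names and values.

```python
from collections import Counter


class Node:
    """Minimal rooted-tree node; leaves carry a sampling year."""

    def __init__(self, year=None, children=()):
        self.year = year
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self


def leaf_years(node):
    """Sampling years of all leaves below `node` (or of `node` itself)."""
    if not node.children:
        return [node.year]
    years = []
    for child in node.children:
        years.extend(leaf_years(child))
    return years


def noise(node):
    """Assumed noise measure: share of leaves outside the majority year."""
    counts = Counter(leaf_years(node))
    return 1.0 - max(counts.values()) / sum(counts.values())


def grow_candidate(leaf, max_noise=0.2):
    """Climb from `leaf` while the enclosing subtree stays within max_noise."""
    group = leaf
    while group.parent is not None and noise(group.parent) <= max_noise:
        group = group.parent
    return group
```

In this sketch, a leaf from clade E in the toy example would grow to cover clade E and then stop, because the next ancestor also contains samples from a different year.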
Once a candidate group is identified, the Least Common Ancestor (LCA) is found for all leaves sampled in the year that contains the majority of sequences in this candidate group, as shown in Figure .
Candidate group (Blue/Black), LCA (Red/Dark Gray), LCA Without (Green/Light Gray)
Next, the Least Common Ancestor excluding the candidate group (LCA_Without) is found. Various metrics for the path from the candidate group to the LCA_Without, via the LCA, are checked to ensure that the separation distance is nontrivial and that the tree path has strong support.
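The two ancestors can be computed with a simple parent-pointer walk, sketched below. The reading of LCA_Without as the least common ancestor of the same-year leaves that are not in the candidate group is an assumption based on the description above, and the `Node` class and the function names `lca` and `lca_pair` are hypothetical.

```python
class Node:
    """Minimal rooted-tree node; leaves carry a sampling year."""

    def __init__(self, year=None, children=()):
        self.year = year
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self


def ancestors(node):
    """Path from `node` up to the root, starting with `node` itself."""
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return path


def lca(leaves):
    """Deepest node that is an ancestor of every node in `leaves`."""
    leaves = list(leaves)
    common = ancestors(leaves[0])  # ordered from deepest to root
    for leaf in leaves[1:]:
        seen = set(ancestors(leaf))
        common = [node for node in common if node in seen]
    return common[0]


def lca_pair(all_leaves, candidate_leaves, majority_year):
    """Return (LCA, LCA_Without) for the candidate's majority year.

    LCA_Without is None when the candidate holds every leaf of that year.
    """
    same_year = [leaf for leaf in all_leaves if leaf.year == majority_year]
    candidate = set(candidate_leaves)
    rest = [leaf for leaf in same_year if leaf not in candidate]
    return lca(same_year), (lca(rest) if rest else None)
```

When the candidate group sits inside its own season, the two ancestors coincide or lie close together; when it has been pulled toward another year, the LCA moves toward the root while LCA_Without stays with the rest of the season.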
In the visual reassortment search method, the path is examined to ensure that it contains several edges with very high confidence values as provided by the phylogenetic reconstruction software. In general, it is desirable to have a majority of edges on the path with reasonably high confidence values, generating trust in the existence of the candidate group separation. FluReF translates this intuition into several tunable parameters which minimize the rate of false positives by ensuring that only paths with high confidence values from the phylogenetic reconstruction are considered in a reassortment search. During the visual reassortment search, the candidate group is assessed for its distance away from the rest of the season, compared to the rest of the tree. FluReF captures this observation with a couple of separation parameters, tuned to ignore candidate groups with a trivial genetic separation from the rest of the season. For each candidate group which satisfies all parameters, the algorithm then attempts to find the analog of this candidate group in each of the individual segment trees. If a group is found in a segment tree, it is again checked against various parameter thresholds, typically lower than those used with the tree based on the full genome sequences, because the confidence values from the phylogenetic reconstruction software tend to be lower for individual segment trees. The candidate group is output as a reassortment if it is found to be isolated from the rest of the year sample in some segment trees, but is grouped with the rest of the year sample in other segment trees, pointing to different evolutionary histories. (Preference may be given to certain segments, as there is evidence that some segments are more commonly involved in reassortments than others [11].)
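The support and separation checks could take roughly the following form. This is a sketch under stated assumptions, not the FluReF parameter set: the function names, the parameter names, and every default threshold below are hypothetical, chosen only to illustrate "several edges with very high confidence", "a majority of edges with reasonably high confidence", and a nontrivial separation relative to the rest of the tree.

```python
def path_is_supported(confidences, strong=0.95, min_strong_edges=2,
                      decent=0.70, min_decent_fraction=0.5):
    """Check edge confidences on the candidate-to-LCA_Without path.

    Requires at least `min_strong_edges` edges at or above `strong`, and
    more than `min_decent_fraction` of edges at or above `decent`.
    All thresholds are illustrative assumptions.
    """
    if not confidences:
        return False
    n_strong = sum(c >= strong for c in confidences)
    n_decent = sum(c >= decent for c in confidences)
    return (n_strong >= min_strong_edges
            and n_decent / len(confidences) > min_decent_fraction)


def is_separated(path_length, typical_length, min_ratio=2.0):
    """Check that the path is long relative to a typical tree distance.

    `typical_length` stands in for whatever tree-wide reference distance
    is used (for example, a mean branch length); this is an assumption.
    """
    return path_length >= min_ratio * typical_length


# Two strongly supported edges and a majority of decent ones: accepted.
print(path_is_supported([0.99, 0.97, 0.60, 0.80]))
```

For segment trees, the same functions could be called with lower `strong` and `decent` values, reflecting the lower confidence values the text reports for individual segments.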
FluReF runs in at most quadratic time. Each iteration of the main loop traverses a tree, taking time proportional to the size of the tree, i.e., proportional to n, the number of leaves; if each leaf (strain) is considered as a separate candidate group, the main loop will iterate n times, for a worst-case total of O(n^2).
Experiment 1: confirming visual inspection
We examined a dataset of 75 Human H3N2 strains collected between 2005 and 2008 in New York. The visual inspection of full-sequence and individual segment phylogenetic trees revealed two reassortments. Clade A from 2006, shown in Figure , was grouped separately from the rest of its season in the full genome tree, as well as in individual trees for segments 1, 2, 3, 5, and 6. Clade B from 2007, also shown in Figure , was grouped separately from the rest of its season in the full genome tree, and in individual trees for segments 3 and 4. We applied FluReF to this data set; it produced no false positives and output both Clades A and B as reassortment groups, with the same segments identified as in the visual analysis. This result confirms that FluReF properly applies the principles of the visual analysis of phylogenetic trees.
Full-genome phylogenetic tree for 75 Human H3N2 strains from New York, 2005–2008
Experiment 2: increasing the temporal scope
To test the robustness of FluReF, we augmented the dataset from Experiment 1 with human H3N2 strains sampled in 2004 from New York. The new data set thus contains 118 sequences—at the limit of what visual inspection can handle. FluReF run on this dataset returned the same output as on the unaugmented dataset used in Experiment 1, once again matching visual inspection results.
Experiment 3: increasing the geographic scope
The inclusion of geographically separated strains can lead to the isolation of subgroups from their seasonal cohort and thus potentially cause false positive identifications. We augmented the dataset from Experiment 2 with the rest of the 2005–2008 human H3N2 strain sequences collected all over the United States. The resulting data set contains 180 sequences, beyond our ability to inspect visually. FluReF once again returned the same output as on the unaugmented dataset from Experiment 1, a reassuring result in that it was not misled by geographically isolated strains.
Experiment 4: validating prior work
In 2005, Holmes et al. performed a phylogenetic analysis of 156 complete genomes of human H3N2 influenza A viruses collected between 1999 and 2004 from New York State and found several reassortment events between the various clades [8]. Aside from between-clade reassortments, which are currently not targeted by FluReF, Holmes et al. identified three reassortment groups. Run on the same data, FluReF confirmed one of these candidate reassortment groups: a small clade containing two strains from 1999: [GenBank:CY001120-27, GenBank:CY000989-96]; another candidate group was considered by the algorithm, but rejected due to low confidence scores. We have tuned the parameters of FluReF to be very conservative, so the absence of false positives and the occurrence of some false negatives are to be expected; a more sensitive tuning is possible, especially one that favors certain segments over others, a bias adopted by Holmes et al. in their analysis.
Experiment 5: scaling
While the quadratic bound makes FluReF scalable in terms of runtime, care must be taken to ensure that the accuracy of the algorithm does not suffer as the datasets grow. We performed a first scaling experiment, with a set of 420 simulated sequences containing a single reassortment event. FluReF found this reassortment, and reported no false positives.