Our common neighbourhood approach readily detected reassortants whose parent sequences were sufficiently distant and in which the reassortment had not become fixed in the population. The common neighbourhood size matrix of A/Swine/Italy/1521/98(H1N2) is a good example. The segment 4 (HA) neighbourhood shares only two elements with most of the other neighbourhoods, which indicates a clean reassortment event. Segment 6 (NA) relates to the other segments in a very similar way, its largest common neighbourhood having eleven elements. The remaining segments share a fairly high number of common ‘neighbours’ amongst themselves. Altogether, this virus has three parents: two of them contributing one segment each and the third contributing the remaining segments.
Common neighbourhood size matrix of A/Swine/Italy/1521/98(H1N2).
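As an illustration of how such a matrix can be assembled, the following minimal sketch assumes that, for each segment, the distances from the query strain to every other strain are already available, and that a neighbourhood is simply the k nearest strains. The function names, the toy distances and the value of k are assumptions made for this example and are not taken from our implementation.

```python
from typing import Dict, List, Set


def neighbourhood(distances: Dict[str, float], k: int) -> Set[str]:
    """Return the k strains nearest to the query strain for one segment."""
    # Ties are broken by strain name so that the result is deterministic.
    ranked = sorted(distances, key=lambda name: (distances[name], name))
    return set(ranked[:k])


def common_neighbourhood_matrix(
    per_segment: List[Dict[str, float]], k: int
) -> List[List[int]]:
    """Entry (i, j) is the number of strains shared by the neighbourhoods
    of segments i and j of the query strain."""
    hoods = [neighbourhood(d, k) for d in per_segment]
    n = len(hoods)
    return [[len(hoods[i] & hoods[j]) for j in range(n)] for i in range(n)]


# Toy data: three segments, six reference strains, k = 3 (all hypothetical).
toy = [
    {"A": 0.1, "B": 0.2, "C": 0.3, "D": 0.9, "E": 0.8, "F": 0.7},
    {"A": 0.2, "B": 0.1, "C": 0.4, "D": 0.9, "E": 0.8, "F": 0.7},
    {"A": 0.9, "B": 0.8, "C": 0.7, "D": 0.1, "E": 0.2, "F": 0.3},
]
for row in common_neighbourhood_matrix(toy, k=3):
    print(row)
```

In this toy case the third segment shares no neighbours with the first two, which is the signature of a segment acquired from a different parent.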
A comprehensive search of all available full genome sequence data (9284 strains represented by 1670 genome sequences) resulted in 280 hits (see supplementary material Table S2 for a complete list). These hits represented a total of 3086 influenza A strains in the original set, with the pandemic H1N1/2009 virus accounting for 2636 of them. Owing to space constraints, we show only the 52 hits with the highest confidence, namely those with 12 or more very small elements (size ≤2) in their respective common neighbourhood size matrices (Table 2). Some of the reassortants that we found have already been reported, in which case we have given the reference or commented on the annotation. In some other cases, the sequences have been published in a journal, but the reassortment has not been explicitly declared. Altogether, 35 of the 52 reassortants are reported here for the first time, to the best of our knowledge.
Table 2. List of predicted reassortant strains with strong confidence. All strains with a reference have been reported previously unless otherwise noted. Strains that do not have a reference or a specific remark have not been reported to date, to the best of our knowledge.
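The confidence filter described above can be sketched as follows. This is a hedged illustration only: it takes the criterion literally and counts every element of the matrix, so each pair of segments is counted twice in a symmetric matrix. Whether pairs should instead be counted once is not specified here, and the function names and the example matrix are hypothetical.

```python
from typing import List


def count_small_entries(matrix: List[List[int]], max_size: int = 2) -> int:
    """Count the matrix elements whose common neighbourhood size is <= max_size."""
    return sum(1 for row in matrix for value in row if value <= max_size)


def is_high_confidence(
    matrix: List[List[int]], min_small: int = 12, max_size: int = 2
) -> bool:
    """True if the matrix has at least min_small very small elements."""
    return count_small_entries(matrix, max_size) >= min_small


# Hypothetical 8 x 8 matrix: neighbourhood size 80 on the diagonal, segment 4
# (index 3) sharing only two neighbours with every other segment, and all
# other pairs sharing 60 neighbours.
example = [
    [80 if i == j else (2 if 3 in (i, j) else 60) for j in range(8)]
    for i in range(8)
]
print(count_small_entries(example), is_high_confidence(example))
```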
A/California/04/09, the reference strain for the pandemic H1N1/2009 virus, was easily picked up by the algorithm notwithstanding the huge sampling bias, while its reassortment pattern (Smith et al., 2009a) was subsequently determined correctly.
Reassortment patterns of selected swine and human strains.
Of particular interest was the propensity of an individual segment to reassort and acquire genetic information from a parent unique to itself or shared with at most one other segment. The 7-1 (e.g. aaabaaaa) and 6-2 (e.g. aaababaa) reassortment patterns are the most typical of this kind. Segments 4 and 6, which carry the HA and NA genes, reassort in this way very often (13/52 instances).
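The pattern notation can be made concrete with a short sketch. It assumes that a pattern is written as one letter per segment, in segment order, with identical letters denoting a shared parent. The helper names are hypothetical.

```python
from collections import Counter
from typing import List


def pattern_class(pattern: str) -> str:
    """Summarise a pattern by its group sizes, largest first (e.g. '7-1')."""
    sizes = sorted(Counter(pattern).values(), reverse=True)
    return "-".join(str(size) for size in sizes)


def minority_segments(pattern: str) -> List[int]:
    """Return the 1-based numbers of the segments outside the largest group."""
    majority = Counter(pattern).most_common(1)[0][0]
    return [i + 1 for i, letter in enumerate(pattern) if letter != majority]


for p in ("aaabaaaa", "aaababaa"):
    print(p, pattern_class(p), minority_segments(p))
```

For the two examples in the text, the segments outside the largest group are segment 4 alone and segments 4 and 6 together, respectively.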
Using our algorithm, we were able to identify further breakdowns in the ancestry of known reassortants. In A/Swine/Ontario/53518/03 (Karasin et al., 2006), for example, we found that PB2, like the previously reported PB1, was derived from a unique parent of its own. In the 2005 triple reassortant H3N2 viruses from Canada (Olsen et al., 2006), we found that the PB1 gene was of a lineage distinct from that of the other polymerase genes and close to that of HA. Moreover, as a by-product of our analysis, we found that these swine viral sequences were very similar to A/turkey/Ontario/31232/2005(H3N2), a contemporary avian virus from the same region, which strongly suggests cross-species transmission. This finding was possible only because our comprehensive data set spans all hosts.
Similarly, it is straightforward to find influenza sequences “frozen” in time. A/USSR/90/1977(H1N1), one of the first H1N1 isolates after the subtype re-emerged in 1977 following a 20-year lapse (Zimmer and Burke, 2009), possesses a genomic sequence very similar to that of A/Roma/1949(H1N1).
Furthermore, the algorithm proved powerful enough to analyse complex reassortment patterns among closely related sequences when an appropriate data set is used. For example, the predicted reassortment patterns for Clade B (A/New York/32/2003, A/New York/198/2003 and A/New York/199/2003), Clade C (A/New York/52/2004 and A/New York/59/2004) and A/New York/11/2003, taken from a comprehensive phylogenetic study of 156 complete genomes of H3N2 influenza A collected between 1999 and 2004 in New York (Karasin et al., 2000), match exactly the patterns that were previously inferred by examining their phylogenies (data not shown).
Sample bias is a major confounding factor in molecular evolutionary analyses, and particularly so in our reassortment search. Isolates from the first half of the 20th century are very scarce, making it difficult to determine evolutionary lineages. This is exacerbated by our fixed neighbourhood size of 80, which is too large for sparsely sampled lineages. Indeed, we did not obtain any hits from that era.
In a preliminary analysis with a smaller data set, the oldest influenza A strain in the database, A/Brevig Mission/1/1918, was picked up by our algorithm, even though, by definition, its ancestry and reassortment history cannot be determined directly from the available data. This is because our neighbourhoods consist of both ancestors and descendants, whereas only ancestors define the reassortment history of a given strain. This could pose problems in lineages that are heavily driven by reassortment. For example, A/Goose/Guangdong/1/96 (Gs/Gd/1/96), the precursor of the recent HPAI H5N1 lineage in Asia (Li et al., 2004), has passed various combinations of its gene segments to a few generations of multiple reassortants, which adversely affected our grouping of its own segments by ancestry. Direct descendants could also negatively affect the output when the reassorted genes become fixed in the population. Conversely, such direct descendants of reassortant strains may be wrongly selected as reassortants themselves.
The ideal solution for this problem is to have only ancestors in the neighbourhoods. However, it is not possible to distinguish ancestors from descendants from our distance matrices alone. It would require the construction of all the phylogenies with additional assumptions about the relative rate of evolution on each branch.
Minor topological and distance inconsistencies may occur across segments in phylogenies even without a reassortment event, owing to stochastic errors and limitations in distance estimation methods. We need to allow for such minor inconsistencies so that our algorithm does not wrongly pick up strains that are in fact not reassortants. To this end, we must avoid too small a neighbourhood size, thereby allowing movements up to a certain degree to occur without being considered the result of reassortment. Too large a neighbourhood size, on the other hand, would fail to detect small movements that are actually driven by reassortment and may give distorted results when the immediate surroundings are sparsely sampled. After much deliberation, we decided to use a neighbourhood size of approximately 5% of the data set, which seemed to work reasonably well. A neighbourhood size that varies across lineages according to sampling density might be a potential improvement to our algorithm.
The property of common ancestry should be transitive over all segments in order to group the segments by ancestry without confusion (i.e. if i and j have common ancestry, and j and k have common ancestry, then i and k should also have common ancestry). Nevertheless, many of our results do not satisfy this criterion, which is not surprising given that we use a common cutoff value for all segment combinations and all lineages. Hence, we have had to assume transitivity in some cases before assigning the ancestry of each segment (i.e. we assume that i and k have common ancestry even if their common neighbourhood size falls below the cutoff, provided there exists a segment j that has common ancestry with both of them).
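Assuming transitivity in this way amounts to taking connected components of the graph in which two segments are linked whenever their common neighbourhood size reaches the cutoff. The sketch below illustrates this with a small union-find structure. It is not our implementation, and the function name, the cutoff value and the example matrix are hypothetical.

```python
from typing import Dict, List


def ancestry_pattern(matrix: List[List[int]], cutoff: int) -> str:
    """Group segments by assumed common ancestry and return a pattern string
    with one letter per segment (e.g. 'aaab')."""
    n = len(matrix)
    parent = list(range(n))

    def find(x: int) -> int:
        # Follow parent links to the group representative, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Link every pair of segments whose common neighbourhood size reaches
    # the cutoff; merging groups enforces transitivity.
    for i in range(n):
        for j in range(i + 1, n):
            if matrix[i][j] >= cutoff:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i

    # Assign letters to groups in order of first appearance.
    labels: Dict[int, str] = {}
    letters = []
    for i in range(n):
        root = find(i)
        if root not in labels:
            labels[root] = chr(ord("a") + len(labels))
        letters.append(labels[root])
    return "".join(letters)


# Hypothetical four-segment example: segments 1 and 3 fall below the cutoff
# with each other, but both share ancestry with segment 2.
example = [
    [80, 50, 5, 1],
    [50, 80, 50, 1],
    [5, 50, 80, 1],
    [1, 1, 1, 80],
]
print(ancestry_pattern(example, cutoff=10))
```

In this example the first three segments are placed in one group despite the small value between segments 1 and 3, while the fourth segment forms a group of its own.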
We tried to reconstruct the phylogenies for our data using MrBayes (Huelsenbeck and Ronquist, 2001) as described in the GiRaF paper (Nagarajan and Kingsford, 2011) and found the computation time required to reach convergence with sufficient mixing to be at least on the order of months per segment on a single-processor machine (data not shown). At current processor speeds, it seems inevitable that comprehensive analyses of influenza genomic data will have to rely on phylogeny-independent methods. One such earlier method (Rabadan et al., 2008) seemed to perform well in detecting reassortants within lineages, but no comprehensive study has been undertaken to date using this method.
In this paper, we demonstrate our algorithm using a comprehensive complete genome data set, and strive to find the reassortants within that data set while using the same data for reference. The same algorithm may be used to check any given new influenza A strain with a complete genome sequence for reassortment. If this algorithm is to be used for that purpose, it is imperative that the reference data set is always maintained up to date. We believe this method could be efficiently utilized for rapidly testing high throughput sequence data if the need arises.