|Home | About | Journals | Submit | Contact Us | Français|
Finding the amino acid mutations that affect the severity of influenza infections remains an open and challenging problem. Of special interest is better understanding how current circulating influenza strains could evolve into a new pandemic strain. Influenza proteomes from distinct viral phenotype classes were searched for class specific amino acid mutations conserved in past pandemics, using reverse engineered linear classifiers.
Thirty-four amino acid markers associated with host specificity and high mortality rate were found. Some markers had little impact on distinguishing the functional classes by themselves, however in combination with other mutations they improved class prediction. Pairwise combinations of influenza genomes were checked for reassortment and mutation events needed to acquire the pandemic conserved markers. Evolutionary pathways involving H1N1 human and swine strains mixed with avian strains show the potential to acquire the pandemic markers with a double reassortment and one or two amino acid mutations.
The small mutation combinations found at multiple protein positions associated with viral phenotype indicate that surveillance tools could monitor genetic variation beyond single point mutations to track influenza strains. Finding that certain strain combinations have the potential to acquire pandemic conserved markers through a limited number of reassortment and mutation events illustrates the potential for reassortment and mutation events to lead to new circulating influenza strains.
Influenza A has evolved toward host specific mechanisms of infection leading to genetic divergence between human and avian strains. Sequence divergence is so striking that single nucleotide counts are sufficient for classifying the host type for most influenza strains when analyzing whole segment or whole genome data . A notable exception is the H5N1 avian strain that crosses the species barrier and can lead to deadly human infection. The H5 surface protein, hemagglutinin (HA), in some cases is able to recognize human cell receptors [2,3] along with mutations that allow the virus to better survive in the upper respiratory tract . To date, however, there are relatively low numbers of human H5N1 infections compared to the more human persistent subtypes, which may be in part due to inefficient human to human transmission [5,6]. In this study the influenza viruses from the pandemics of 1918, 1957 and 1968 with elements of avian (or avian-like) strains mixed with genetic elements persistent in humans [7-9] are used to provide a historic map of enduring genetic features from past pandemics and their circulation in current human, avian and swine strains .
Whole influenza genomes were searched for genetic markers conserved in pandemic strains that are associated with two features of infection: host specificity and high mortality rate. For host specificity a search was designed to find amino acid mutations in human influenza strains that were not observed in avian strains. The approach for defining host specificity markers closely followed the work of  which predicted positions in the genome associated with human host specificity. Other recent work  looked more broadly for human markers beyond the pandemic conserved regions. Both of these studies examined amino acid point mutations using differing measures for functional significance. In this study a new approach was developed to look for combinations of mutations in the genome that identify host specific evolutionary pressures beyond single point mutations. New mutations were identified that exhibit a co-variation mutation pattern. Evaluating mutation combinations allowed for the analysis of genetic markers where single point mutations failed to distinguish high and low mortality rate strains. In total 34 host specific and high mortality rate pandemic conserved markers were found. The ultimate goal of our study was to examine how the 34 pandemic conserved markers might re-emerge in a future single strain. While marker re-emergence in a single strain does not predict pandemic potential, their presence could highlight unexpected evolutionary events in circulating strains that warrant closer scrutiny.
Influenza genomes not used in the marker estimation process were searched for the presence or absence of the markers. The human host specific markers were sought in the recent avian strains infecting human (H5N1, H9N2, H7N3 and H7N7), the high mortality rate associated markers were sought in the avian strains and both marker sets were sought in non-avian non-human strains (e.g. swine, cat and others). The high mortality rate markers appeared in a wide variety of avian strains but the recent avian to human strain crossovers lacked most of the human strain specific markers. Human persistent strains retained human specific markers (by definition) but lacked most of the high mortality rate markers. Swine strains fell in the middle, carrying both high mortality rate and host specificity markers but with no single strain containing all 34 markers. Using a maximum parsimony principle, likely evolutionary pathways for the re-emergence of the 34 markers in a single strain were considered with a computational experiment. The fewest evolutionary events through reassortment and mutation needed for a single influenza strain to acquire all 34 markers in the presence of a second strain were counted. Starting with a small number of sequenced H1N1 human and swine strains, a mix with avian strains were found to acquire the 34 pandemic markers through a combination of 4 or fewer segment reassortment and amino acid mutation events.
The genetic marker identification procedure uses a discriminative classifier (a linear support vector machine ) with cross validation to build two models, one for host specificity and one for high mortality rate strains. The discriminative classifier is a computational tool that is designed to classify an unknown sample as belonging to one of two classes. Here one classifier model is designed to classify the influenza host type, the second model is designed to classify the influenza mortality rate type. Each model takes as input the 11 influenza proteins aligned and concatenated and classifies the strain in the case of host specificity as being human or avian. For mortality rate, input strains are divided into high and low mortality rate strain classes. The purpose for building the classifier is to find the positions in the influenza genome that are important in the model for accurately classifying input strains, a problem commonly referred to as the feature selection problem . Candidate markers are found by building new classifiers that take as input a small subset of the influenza proteome. The input sets that lead to classifiers that match the accuracy of the original classifier (which uses the entire proteome as input) highlight the amino acid markers that are important for class discrimination. An iterative procedure is used. For the initial step all single amino acid positions are found that separate the two classes (human/avian or high/low mortality rate). The iterative step n identifies the n sized (potentially non-contiguous sequence) combinations that separate the data such that each combination does not contain a smaller sized combination that separates the two classes equally well. This procedure yields a set of non-redundant mutation patterns that separate the two classes. The iterative procedure is important so that a candidate marker is only included as part of a distinguishing pattern when it adds to the classification accuracy. So for example if position 21 in the PB2 protein distinguishes avian and human strains, then position 21 would not be included as part of another set of features (say position 22 in the PB2 protein). Only markers that contribute significantly to classification accuracy are included in the final result. Details on selecting candidate functional markers are given in the Methods section.
Sixteen positions in the influenza genome were found to be associated with human host specificity. The markers were found on the non-structural protein 1 (NS1), non-structural protein 2 (NS2), matrix protein 1 (MP1), nucleoprotein (NP), acidic protein (PA) and the basic polymerase 2 (PB2) protein. Each strain was assigned a genotype, which showed whether the human consensus amino acid variant was present at each of the 16 positions. Strains excluded from the marker estimation process, human infections of avian origin  and non-human non-avian strains, were checked for evidence of an enrichment of human specificity markers relative to the remaining avian strains. With one exception the human infections of avian origin showed a genotype that was distinct from the most common avian genotype background but the number of accumulated human markers was small.
Figure Figure11 shows the relative frequency of different host specificity genotypes among the sequenced samples with minimum 1% frequency for the three host categories: avian, human infections of avian origin and all other non-human non-avian host types. Redundant sequences that occur within the same region and year are collapsed to prevent over weighting heavily sequenced outbreaks. Columns in the table show each genotype configuration with the last row (Rank) reporting the rank of the genotype's relative frequency in avian strains. For example, the most common avian genotype is rank 0 and shown by the red bar to cover nearly 40% of all avian strains as well as a small percentage of other host infections shown by the blue and orange bars. Underneath the three frequency bars is the corresponding genotype: NHHHHNNNNNNNNNNN, which means that these strains have the human consensus marked 'H' at 4 protein positions: 87 NS1, 103 NS1, 207 NS1 and 63 NS2. The remaining 12 positions carry a non-human amino acid variant marked 'N'. Many of the human markers could be a consequence of persistent founder mutations from the ancestral 1918 pandemic strain, which gave rise to current circulating human strains. It is interesting to observe, however, that avian strains maintain each of the human consensus variants in the NS segment with species specific variation patterns. Twenty-four percent of the avian strains share the human consensus amino acid in position 87 NS1 spanning 35 distinct serotypes. Seventy-seven percent of the avian strains share at least one human consensus at one of the other positions in the NS segment, spanning 65 distinct serotypes. If the two sites evolved independently, 19% of the observed avian genotypes would be expected to share a human consensus at 87 NS1 and at least one of the other NS segment positions, however, only 2% of avian strains show this pattern. Half of these cases involve a collection of H3N2 avian strains that recently acquired the NS segment from a swine virus (Rank 12 in Figure Figure1).1). For position 70 and 87 in NS1, Lysine and Serine are the respective consensus amino acids in human. In avian strains, the combinations for the respective positions are Glutamic acid and Serine (58%), Lysine and Proline (26%), Glutamic acid and Proline (9%) and only rarely Lysine and Serine (2%).
The columns are grouped so that avian to human crossover genotypes are clustered into three groups labeled at the top of Figure Figure11 as: H7 (avian frequency rank 0 and 14), H5N1 beginning in 2003 (rank 2, 8, 3, 16 and 9) [7,16-19] and the H5N1/H9N2 Hong Kong outbreaks from 1997–1999 (rank 13, 15, 6, and 17) [20,21]. Additional similar genotype patterns are placed in adjacent columns. A pattern emerges from the two most common avian genotypes ranked 0 and 1 in Figure Figure1.1. These two genotypes cover 60% of the sequenced strains and span nearly all of the subtypes including H5N1, H9N2, H7N7 and H7N3. Among the lethal avian to human crossovers there are two genotypes that arise in humans that are not found in sequenced avian strains (rank 16 and 17). These cases could be examples of post infection mutations, or alternatively show the limits in the coverage of sequenced avian strains.
In a second experiment human influenza strains were separated into two groups: a high mortality rate group containing pandemic genomes selected from the 1918, 1957 and 1968 outbreaks, human H5N1 and the H1N1 1976 deadly New Jersey infection and a low mortality rate group containing all other whole genome human infection samples. As with the pandemic conserved host type markers, the high mortality rate markers were required to be positively identified in each of the sequenced strains associated with the three pandemic outbreaks (e.g. perfect conservation and no ambiguous sequence codes). Eighteen of 2,112 sequenced human influenza genomes (9 of 286 when samples were grouped by year, subtype and location) not in the high mortality rate class contained all 18 of the identified high mortality rate markers. These cases occurred in H2N2 and H3N2 strains from the 1960s and 1970s in years following their respective pandemics.
Figure Figure22 shows the high mortality rate genotypes among the sequenced samples with minimum 1% frequency for the three host categories. The figure shows that the human high mortality rate genotype is the most common avian genotype and that each avian strain has at least 13 of the 18 high mortality rate markers. Analogous to the co-variation pattern found in the NS segment for the human host type markers, the non-lethal human strains show that where the hemagglutinin (HA), neuraminidase (NA) subtype lacks the high mortality rate makers (rank 27, 29 and 31 in Figure Figure2)2) high mortality rate markers are found in other segments. The opposite also occurs (rank 26, 28 and 30 in Figure Figure22).
The most common non-human non-avian genotype (rank 43 in Figure Figure2)2) is a swine H1N1, which shares many of the high mortality rate variants but misses the mutations found on the NS and PB1 segments. The second most common subtype shares all but one of the high mortality rate variants and is circulating in horse (rank 15) but Figure Figure11 shows that H3N8 lacks most of the human host type markers (rank 19 and 21 in Figure Figure1).1). The complete high mortality rate variant (rank 0) are H5N1 cases that infect a broad host range including swine, tiger, domestic cat, civet, and stone marten. Figure Figure11 shows that these strains (most with the genotype with rank 9 in Figure Figure1)1) contain only a small number of human specific markers similar to the H5N1 human infections. The differences in genotypes show that swine host both strains found with human transmission markers or strains enriched with the high mortality rate markers. This could present an opportunity for two strains to mix and evolve into a swine strain with all 34 of the predicted pandemic conserved markers.
Recent work mixing avian H5N1 with human H3N2 in ferret models has shown that combining the H5N1 cell surface proteins with the internal human proteins need not lead directly to efficient ferret to ferret transmission, which serves as a model for human to human transmission . In this approach only reassortment events were considered, highlighting the complexity that may be involved in acquiring the precise mix of genetic elements required for an H5N1 virus to acquire pandemic potential.
To explore the steps needed to acquire the 34 genetic markers, hypothetical strain mixes were examined where pairs of genotypes sampled within one year difference were tested to simulate concurrent circulating strains. Two evolutionary events were considered: reassortment between segments counted as a single evolutionary event and an amino acid point mutation, also counted as a single evolutionary event. Each genotype was checked for the minimal number of events needed to acquire all 34 markers when mixed with a second strain. For completeness, all 9 pairwise combinations for the three host types were considered: human, avian and non-human non-avian. There were 269 distinct genotypes with 24,889 pairwise combinations and 187 distinct combinations of events that led to the 34 markers in a new strain. It is important to note that strain mixes that include a human strain already have the 16 human conserved markers and only lack the complement of high mortality rate conserved markers. Thus, human strains should require fewer mutation and reassortment events to acquire the 34 markers, compared to strain combinations between non-human influenza strains. Figure Figure33 shows the frequency distribution (in blue) for the fewest events needed for each of the 269 genotypes to acquire the 34 markers. The percentage of the blue bar covered by red is the relative contribution of reassortment events to the total. For example, in the case of 4 events, on average roughly half the events are attributed to reassortment. The histogram shows that on average the fewest events to acquire the 34 markers is almost always through a combination of reassortment and mutation. The figure points to two cases that are one mutation away from the 34 markers, a human H2N2 strain from 1968 and a H3N2 strain from 1971. These are examples of strains conserved from their respective recent pandemic ancestors and their presence (along with the other strains with the 34 markers previously referenced that persisted in the 1960s and 1970s) indicate that the 34 markers are not sufficient for causing pandemic potential.
Potentially novel strains with avian subtypes found to infect humans, which could circumvent existing human immunity (H7N7, H7N3, H7N2, H9N2, and H5N1), were examined more closely. Sixty-six distinct event combinations were found, but only a few cases required 4 events or less, which are summarized in Table Table1.1. These potential paths involve 8 distinct genotypes from human and swine H1N1 strains, which acquire the two avian surface proteins plus one or two additional amino acid mutations on the NS1, PB1 or PB2 gene. Three of the 8 genotypes were observed in 2006 or later. The first sequenced strain from each location is given in Table Table2.2. Although all of the human strains maintain all 16 human markers, they differ in the number of 18 high mortality rate markers present. Thus, different human strains require different numbers of mutations to acquire the 34 markers. For example, when starting with human H3N2 strains, 6 or more high mortality rate mutations are required in addition to the double reassortment with the HA and NA genes.
A distinct genotype subset emerges from the avian background from which the human crossovers are derived with some strains adopting a limited number of human persistent markers. Overall, the human infections of avian origin have acquired no more than a few human specific markers, which suggests that avian strains are not rapidly acquiring human persistent markers through genetic drift. The high mortality rate markers are ubiquitous in the avian background and are distinct from the vast majority of human infections. While the host type markers clearly separate avian and human strains, there are a number of cases where descendants of the 1957 and 1968 pandemics continued to retain all of the predicted high mortality rate markers. Finding that classification accuracy for high mortality rate strains is lower than the host type classification weakens support for the notion of a single essential common set of high mortality rate markers. The reduced classification accuracy comes primarily from the fact that the H2N2 sequences continue to maintain the 18 markers into the 1960s, well past the associated pandemic. Thus, these 18 markers do not clearly distinguish between pandemic and non-pandemic associated H2N2 strains. Instead the results support the hypothesis that additional factors play an important role in determining the mortality rates of a specific strain. This highlights the potential importance to pandemic potential of host immunity and antigenic novelty. Even in the case of host type markers where classification accuracy is very high, markers could be missed. For example, the HA and NA genes play a critical role in host specific infection, but this study focused specifically on the persistent markers, and host specificity markers were found only on the more heavily conserved internal proteins. Additional potentially important host type markers that are not persistent should still exist.
It is worth noting that 5 of the 18 high mortality rate markers lie on the NA or PB1 segments implying that they were independently introduced into the three respective pandemic outbreaks . Aside from the 18 high mortality rate markers persisting in H2N2 strains past the 1957 pandemic time frame, the markers give an overall high degree of classification accuracy and, therefore, a potentially useful common, although not sufficient, set of associated genetic factors. Among the high mortality rate strains not associated with a pandemic, only the 1976 H1N1 isolate lacks all 18 markers (4 are not present). Because the 1976 sample is a small contributor to the total number of high mortality rate features, it does not significantly contribute to the classification model. Substituting a single alternate 1976 swine strain for example, would have limited impact on the markers chosen unless more strains were added or a single strain was given the same weight as the pandemic strains in which perfect conservation is required. In this case mixing low mortality rate strains into the high mortality rate class would substantially alter the reported set of persistent markers. Requiring perfect conservation with the 1976 H1N1 strain would reduce the number of candidate markers to 14 (or less if an alternate 1976 swine strain were used). Similarly, swapping in nearly any other H3N2 sequence from the low mortality rate class, including those from the 1970s would alter the candidate marker set due to a lack of conservation.
Evolutionary pathways through reassortment and mutation show that strain combinations starting with H1N1 human and swine need the fewest events to acquire the pandemic conserved markers. Several of these pathways would lead to novel strains with H5N1 subtypes that could challenge human immunity. The potential need for an extended time or number of exposures for strains to acquire the human persistent mutations combined with the high mortality rate markers associated with avian strains suggests how swine could act as a mixing vessel where both human specific and high mortality rate markers are found to persist. Additional work may reveal restrictions that limit the strain combinations that lead to viable new strains. Measuring the rate of co-infection in swine and human, particularly in cases where an avian like strain is suspected to be present, could provide additional data for more precisely modeling the likelihood of the reassortment events that combine with mutations to facilitate mutation combinations important to infection.
A pattern classification approach  is used with heuristic feature selection [14,24] to predict the candidate markers. Taken as input is a multiple sequence alignment (using MUSCLE ) for a collection of influenza genomes, where the 11 proteins are concatenated together. Each position in the alignment is converted to a bit vector of length 21, where an entry of 1 in the vector indicates the presence of one of the 20 amino acids or an insertion symbol. For an input alignment of length x (and 21 × x length bit vector), to find all n sized mutation subsets, x choose n combinations are checked, which is time prohibitive even for small n when x is large. A heuristic is used to exploit the information obtained from the linear support vector machine (LSVM) to reduce the size of x to 60 and limit n to 10. Note that even this size (~7 × 1010) in theory could be too large to efficiently process. Since smaller combination sizes were found, the search space size was sufficiently reduced to compute a solution. The LSVM computes weights for each position in the alignment reflecting the relative influence on the classifier. These weights are used to select the x most heavily weighted mutations from which to consider combinations. A similar approach was used in document classification  and a related approach was taken to classify 70 antibody light chain proteins . LSVM code was developed by modifying the software package LIBSVM .
The expected classification accuracy is defined by the accuracy of the LSVM using the aligned proteome as input and 5-fold cross validation. Similar to the approach taken by  for human specific markers, sequences in the multiple sequence alignment used for training the classifier were labeled either human or avian depending on the host, excluding the avian to human crossover samples (H5N1, H9N2, H7N7 and H7N3) from training and testing. The 2,026 human persistent strains and 1,018 avian strains were grouped by time, location and subtype, with representative samples chosen at random to yield 281 distinct human strains and 560 distinct avian strains. Classifier accuracy was estimated by randomly dividing the data set into 5 non-overlapping partitions. The classifier was trained on 4 of the partitions and accuracy was measured by the percentage of correct classifications on the fifth partition, with the percentage of correct classifications calculated separately for each class to account for the difference in class size. The average of all 5 tested non-overlapping partitions was calculated giving two accuracy values (one for each class) and the final accuracy measure was the average of these two values. The 34 pandemic conserved markers given in this report were required to be positively identified in every sequenced strain in each of the three pandemic outbreaks without deviation from the majority consensus. This led to three markers reported in  that were excluded from this report for lack of conservation or positive identification (when an ambiguous sequence code was present) in one of the sequenced strains associated with the pandemic outbreaks.
The host specificity classifier misclassified 2 human and 2 avian strains for a classification accuracy of 99.5%. The classification errors appeared to be due to recent reassortment events that suggest the presence of influenza genomes that are a mix of both human and avian strains .
The high mortality rate data set was constructed using the same procedure as the host type dataset and the same 5-fold cross validation procedure was used to estimate accuracy. A total of 111 influenza genomes were classified as high-mortality rate strains and 2,001 were classified as low-mortality rate strains, with a non-redundant subset taken for training (35 high mortality rate, and 255 low mortality rate). The percentage of high and low mortality rate strains that were correctly classified was 96.2% and 96.9% respectively (an average of 96.6%). The lower accuracy for the high mortality rate classifier compared to the host type classifier likely highlights the genetic complexity associated with high mortality rate and the influence of other important factors such as host interaction.
Newly generated classifiers using only a small subset of the aligned proteomes as input were required to match the original classifier accuracy (99.5% for host type and 96.6% for high mortality rate type) within a margin of error defined by a confidence threshold. The confidence thresholds were defined by confidence intervals assuming 1 sided t-test comparisons using the standard deviation in the cross validation tests. Lowering the classification accuracy threshold allowed for the possibility of undetected reassortment events and other potential strain labeling errors (such as host interaction factors) that preclude perfect separation of class types.
Figure Figure44 shows the frequency distribution for the size of amino acid combinations (combinations up to size 10 were checked) that distinguish avian and human strains at the different accuracy thresholds. The highest accuracy threshold of 99.5% (red bar in Figure Figure4)4) requires using more mutations per combination to accurately discriminate host type. For example, a minimum of 3 amino acid positions are required, with most combinations using 4 or more amino acid positions. By contrast, at the lowest accuracy thresholds, only single or pairs of amino acids are needed.
In Chen et al. (2006) functional significance was calibrated to detect the 627 PB2 mutation. A feature of the 627 PB2 mutation is that the human variant (Lysine) was found in 1% of the background avian flu and 23% of the H5N1 avian flu (~5% total) suggesting less human specific selective pressure. Thus distinguishing at the minimal accuracy threshold (set at 98.3%) using 627 PB2 required at least one additional marker. From the combinations of amino acid positions used for discrimination, an individual marker's functional significance was determined by two criteria. The marker must be part of a combination of mutations that separates the two phenotype classes with the same degree of accuracy (at one of the four confidence thresholds) that was achieved using the complete proteome alignment as input. Second the marker's individual contribution to the combination's classification accuracy must be above a minimal threshold defined by the distribution of observed contribution values. A mutation's contribution value was measured by the maximal increase in classification accuracy gained by adding the marker as a feature to one of the classifiers that met the minimal accuracy requirements. For example, mutation 627 PB2 could be combined with several additional mutations to make an accurate classifier. The classification accuracy of each of the additional mutations was measured without including 627 PB2 and compared to the accuracy when including 627 PB2, with the maximal difference being 627 PB2's contribution value. Figure Figure55 plots the contribution values for each candidate marker's maximal contribution to classification accuracy for the 4 different accuracy thresholds. At one end of the spectrum are markers like position 199 PB2 which is shown in Figure 5 to accurately classify close to 99% of the samples, without looking at any other positions in the proteome. Most positions add little to the host type discrimination, with accuracy contributions well below 1% (for clarity these positions were excluded from Figure Figure5).5). The figure shows the 16 mutations that stand out by their contribution of at least a 10% increase in accuracy at one of the four accuracy thresholds.
Ten of the 13 pandemic conserved host specificity positions reported in  were found. The 3 remaining markers (702 PB2, 28 PA and 552 PA) were not predicted due to lack of conservation among the pandemic strains. The host specific mutations reported here but not in  are attributed to the use of mutation combinations to guide the search for new genetic markers. Two mutations of note not reported by  that gave at least a 5% increase in accuracy at the highest classification accuracy threshold (99.5%) were 400 PA and 70 NS1. The 400 PA human consensus amino acid was Leucine and 3% of the avian strains had Leucine, with the remainder split between Serine and Proline. In the case of 70 NS1, 99.6% of human samples had Lysine along with 23% of the avian strains. (The avian consensus amino acid was Glutamic acid.)
Figure Figure66 shows the analysis for finding the high mortality rate type mutations. No single mutation contributed more than 50% to the classification accuracy, which illustrates the complexity of high mortality rate classification. Multiple mutations were required, but even considering combinations of size less than 10 precluded classification accuracy levels that matched the initial classifier accuracy using the whole genome as input. The marker combinations were found to reach the accuracy levels only at the 3 lower thresholds of 94.8%, 93.5% and 92.8% but not at the highest threshold of 96.6%
JEA, SNG and TRS conceived and designed experiments. JEA implemented experiments and drafted the manuscript. JEA, SNG, EAV and TRS analyzed results and edited the manuscript.
JEA was supported in part by an IC Postdoctoral fellowship. We thank Stephen P. Velsko for valuable discussions. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.