A pattern classification approach [23] is used with heuristic feature selection [14, 24] to predict the candidate markers. The input is a multiple sequence alignment (built with MUSCLE [25]) of a collection of influenza genomes, in which the 11 proteins are concatenated. Each position in the alignment is converted to a bit vector of length 21, where an entry of 1 indicates the presence of one of the 20 amino acids or an insertion symbol. For an input alignment of length x (and a bit vector of length 21 × x), finding all mutation subsets of size n requires checking x choose n combinations, which is time prohibitive even for small n when x is large. A heuristic exploits the information obtained from the linear support vector machine (LSVM) to reduce x to 60 and limit n to 10. Note that even this search space (60 choose 10, ~7 × 10^10 combinations) could in theory be too large to process efficiently. Because smaller combination sizes were found, the search space was reduced enough to compute a solution. The LSVM computes a weight for each position in the alignment reflecting its relative influence on the classifier. These weights are used to select the x most heavily weighted mutations from which to consider combinations. A similar approach was used in document classification [26], and a related approach was taken to classify 70 antibody light chain proteins [27]. The LSVM code was developed by modifying the software package LIBSVM [28].
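The encoding and search-space size described above can be illustrated with a minimal sketch. It is not the authors' code: the symbol order in `ALPHABET`, the use of `-` as the twenty-first symbol, the ranking by absolute weight, and the toy inputs are all assumptions made for illustration, and the per-position weights are taken as given rather than computed by an LSVM.

```python
from math import comb

# 20 standard amino acids plus one extra symbol, giving 21 entries per
# alignment column. The ordering and the "-" symbol are assumptions.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"

def encode_column(residue):
    """Return a length-21 bit vector with a single 1 marking the residue."""
    bits = [0] * len(ALPHABET)
    bits[ALPHABET.index(residue)] = 1  # raises ValueError on ambiguity codes
    return bits

def encode_sequence(aligned):
    """Concatenate the per-column bit vectors of one aligned sequence."""
    vec = []
    for residue in aligned:
        vec.extend(encode_column(residue))
    return vec

def top_positions(weights, x):
    """Indices of the x positions with the largest absolute weight.

    Ranking by absolute value is an assumption; the weights themselves
    would come from a trained linear classifier.
    """
    ranked = sorted(range(len(weights)), key=lambda i: abs(weights[i]),
                    reverse=True)
    return sorted(ranked[:x])

toy = "MK-L"                      # hypothetical 4-column aligned fragment
vector = encode_sequence(toy)
print(len(vector))                # 21 * 4 = 84
print(comb(60, 10))               # 75394027566, roughly 7.5e10
print(top_positions([0.1, -2.0, 0.5, 1.5], 2))  # [1, 3]
```

The exact count for x = 60 and n = 10 is 75,394,027,566, which agrees with the ~7 × 10^10 figure quoted in the text.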
The expected classification accuracy is defined as the accuracy of the LSVM using the aligned proteome as input with 5-fold cross validation. Similar to the approach taken by [11] for human specific markers, sequences in the multiple sequence alignment used for training the classifier were labeled either human or avian depending on the host, and the avian to human crossover samples (H5N1, H9N2, H7N7 and H7N3) were excluded from training and testing. The 2,026 human persistent strains and 1,018 avian strains were grouped by time, location and subtype, with representative samples chosen at random to yield 281 distinct human strains and 560 distinct avian strains. Classifier accuracy was estimated by randomly dividing the data set into 5 non-overlapping partitions. The classifier was trained on 4 of the partitions and accuracy was measured by the percentage of correct classifications on the fifth partition, with the percentage calculated separately for each class to account for the difference in class size. Averaging over all 5 held-out partitions gave two accuracy values (one for each class), and the final accuracy measure was the average of these two values. The 34 pandemic conserved markers given in this report were required to be positively identified in every sequenced strain in each of the three pandemic outbreaks without deviation from the majority consensus. As a result, three markers reported in [11] were excluded from this report for lack of conservation or positive identification (when an ambiguous sequence code was present) in one of the sequenced strains associated with the pandemic outbreaks.
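The cross validation procedure with class-averaged accuracy can be sketched as follows. This is an illustration under stated assumptions, not the procedure's actual implementation: the function names, the fixed random seed, and the toy threshold classifier that stands in for the LSVM are all hypothetical.

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Randomly split range(n) into k non-overlapping partitions."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def per_class_accuracy(y_true, y_pred):
    """Fraction of correct predictions, computed separately per class."""
    acc = {}
    for label in set(y_true):
        members = [i for i, y in enumerate(y_true) if y == label]
        acc[label] = sum(y_pred[i] == label for i in members) / len(members)
    return acc

def cross_validate(samples, labels, train, predict, k=5, seed=0):
    """Average per-class accuracy over the folds, then over the classes."""
    fold_scores = {c: [] for c in set(labels)}
    for held_out in k_fold_indices(len(samples), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(samples)) if i not in held]
        model = train([samples[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        y_true = [labels[i] for i in held_out]
        y_pred = [predict(model, samples[i]) for i in held_out]
        for c, a in per_class_accuracy(y_true, y_pred).items():
            fold_scores[c].append(a)
    class_means = {c: sum(v) / len(v) for c, v in fold_scores.items()}
    return class_means, sum(class_means.values()) / len(class_means)

# Toy stand-in for the LSVM: a threshold between two well separated groups.
def train(xs, ys):
    avian = [x for x, y in zip(xs, ys) if y == "avian"]
    human = [x for x, y in zip(xs, ys) if y == "human"]
    return (max(avian) + min(human)) / 2

def predict(threshold, x):
    return "human" if x > threshold else "avian"

samples = list(range(10)) + list(range(100, 110))   # hypothetical features
labels = ["avian"] * 10 + ["human"] * 10
class_means, overall = cross_validate(samples, labels, train, predict)
print(class_means, overall)
```

Averaging the two per-class values, rather than pooling all predictions, keeps the larger class from dominating the final accuracy measure.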
The host specificity classifier misclassified 2 human and 2 avian strains for a classification accuracy of 99.5%. The classification errors appeared to be due to recent reassortment events that suggest the presence of influenza genomes that are a mix of both human and avian strains [29].
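As a quick arithmetic check, the reported 99.5% can be reproduced from the class sizes given earlier (281 human, 560 avian). The sketch assumes the 2 + 2 misclassifications are totals over all held-out partitions, which the text does not state explicitly.

```python
human_total, avian_total = 281, 560   # non-redundant strains per class
human_wrong, avian_wrong = 2, 2       # misclassified strains per class

human_acc = (human_total - human_wrong) / human_total
avian_acc = (avian_total - avian_wrong) / avian_total

# Average of the two per-class accuracies (the measure defined in the text).
class_averaged = (human_acc + avian_acc) / 2

# Pooled accuracy over all strains, shown only for comparison.
total = human_total + avian_total
pooled = (total - human_wrong - avian_wrong) / total

print(round(100 * class_averaged, 1), round(100 * pooled, 1))  # 99.5 99.5
```

Both the class-averaged and the pooled calculation round to 99.5%.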
The high mortality rate data set was constructed using the same procedure as the host type dataset and the same 5-fold cross validation procedure was used to estimate accuracy. A total of 111 influenza genomes were classified as high-mortality rate strains and 2,001 were classified as low-mortality rate strains, with a non-redundant subset taken for training (35 high mortality rate, and 255 low mortality rate). The percentage of high and low mortality rate strains that were correctly classified was 96.2% and 96.9% respectively (an average of 96.6%). The lower accuracy for the high mortality rate classifier compared to the host type classifier likely highlights the genetic complexity associated with high mortality rate and the influence of other important factors such as host interaction.
Newly generated classifiers using only a small subset of the aligned proteomes as input were required to match the original classifier accuracy (99.5% for host type and 96.6% for high mortality rate type) within a margin of error defined by a confidence threshold. The confidence thresholds were defined by confidence intervals assuming one-sided t-test comparisons using the standard deviation in the cross validation tests. Lowering the classification accuracy threshold allowed for the possibility of undetected reassortment events and other potential strain labeling errors (such as host interaction factors) that preclude perfect separation of class types.
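One plausible reading of these confidence thresholds is a one-sided lower confidence bound on the mean cross validation accuracy. The sketch below is an assumption, not the formula used in the study: it uses the standard error of the fold accuracies, critical values from a standard t table for 4 degrees of freedom, and made-up fold accuracies. The confidence levels shown are illustrative and are not claimed to be the four levels used here.

```python
from math import sqrt
from statistics import mean, stdev

# One-sided critical values of Student's t for 4 degrees of freedom
# (5 folds minus 1), taken from a standard t table.
T_CRIT_DF4 = {0.10: 1.533, 0.05: 2.132, 0.025: 2.776, 0.01: 3.747}

def lower_threshold(fold_accuracies, alpha):
    """One-sided lower confidence bound on the mean fold accuracy."""
    k = len(fold_accuracies)
    if k != 5:
        raise ValueError("the tabulated t values assume exactly 5 folds")
    standard_error = stdev(fold_accuracies) / sqrt(k)
    return mean(fold_accuracies) - T_CRIT_DF4[alpha] * standard_error

# Hypothetical per-fold accuracies, for illustration only.
folds = [0.990, 1.000, 0.995, 0.990, 1.000]
thresholds = {a: lower_threshold(folds, a) for a in sorted(T_CRIT_DF4)}
for alpha, value in thresholds.items():
    print(alpha, round(value, 4))
```

A smaller alpha gives a lower threshold, so a reduced classifier has more room to fall short of the full-proteome accuracy and still be accepted.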
The genotype analysis shown in Figures and includes 193 non-human non-avian influenza strains. All data was downloaded from the NCBI influenza whole genome database [30].
Finding markers tied to function
Figure shows the frequency distribution for the size of amino acid combinations (combinations up to size 10 were checked) that distinguish avian and human strains at the different accuracy thresholds. The highest accuracy threshold of 99.5% (red bar in Figure ) requires more mutations per combination to accurately discriminate host type: a minimum of 3 amino acid positions is required, and most combinations use 4 or more. By contrast, at the lowest accuracy thresholds, single amino acids or pairs of amino acids are sufficient.
In Chen et al. (2006), functional significance was calibrated to detect the 627 PB2 mutation. A feature of the 627 PB2 mutation is that the human variant (Lysine) was found in 1% of the background avian flu and 23% of the H5N1 avian flu (~5% total), suggesting less human specific selective pressure. Thus, distinguishing at the minimal accuracy threshold (set at 98.3%) using 627 PB2 required at least one additional marker. From the combinations of amino acid positions used for discrimination, an individual marker's functional significance was determined by two criteria. First, the marker must be part of a combination of mutations that separates the two phenotype classes with the same degree of accuracy (at one of the four confidence thresholds) that was achieved using the complete proteome alignment as input. Second, the marker's individual contribution to the combination's classification accuracy must be above a minimal threshold defined by the distribution of observed contribution values. A mutation's contribution value was measured by the maximal increase in classification accuracy gained by adding the marker as a feature to one of the classifiers that met the minimal accuracy requirements. For example, mutation 627 PB2 could be combined with several additional mutations to make an accurate classifier. The classification accuracy of each of the additional mutations was measured without 627 PB2 and compared to the accuracy with 627 PB2 included, with the maximal difference being 627 PB2's contribution value. Figure plots each candidate marker's maximal contribution to classification accuracy at the 4 different accuracy thresholds. At one end of the spectrum are markers like position 199 PB2, which is shown in Figure 5 to accurately classify close to 99% of the samples without looking at any other positions in the proteome.
Most positions add little to the host type discrimination, with accuracy contributions well below 1% (for clarity these positions were excluded from Figure ). The figure shows the 16 mutations that stand out by their contribution of at least a 10% increase in accuracy at one of the four accuracy thresholds.
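The contribution value defined above can be written down in a few lines. This is a minimal sketch under assumptions: `accuracy` is a hypothetical callable that returns the classification accuracy for a set of positions, the marker names and accuracy numbers are invented placeholders, and single-marker combinations are skipped because the text does not say what baseline applies when no other marker remains.

```python
def contribution_value(marker, combos, accuracy):
    """Maximal accuracy gain from adding `marker` to its partner markers.

    `combos` holds the marker combinations that met the accuracy
    requirement; `accuracy` maps a frozenset of positions to an accuracy.
    """
    gains = [
        accuracy(combo) - accuracy(combo - {marker})
        for combo in combos
        if marker in combo and len(combo) > 1
    ]
    return max(gains) if gains else 0.0

# Hypothetical accuracies for placeholder positions, illustration only.
table = {
    frozenset({"pos_a", "pos_b"}): 0.990,
    frozenset({"pos_a", "pos_c"}): 0.985,
    frozenset({"pos_a"}): 0.900,
    frozenset({"pos_b"}): 0.950,
    frozenset({"pos_c"}): 0.800,
}
combos = [frozenset({"pos_a", "pos_b"}), frozenset({"pos_a", "pos_c"})]

for marker in ("pos_a", "pos_b", "pos_c"):
    print(marker, round(contribution_value(marker, combos, table.__getitem__), 3))
```

In this toy table `pos_a` gains most when paired with the weak marker `pos_c`, so its contribution value is taken from that pairing, mirroring the "maximal difference" rule in the text.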
Ten of the 13 pandemic conserved host specificity positions reported in [11] were found. The 3 remaining markers (702 PB2, 28 PA and 552 PA) were not predicted due to lack of conservation among the pandemic strains. The host specific mutations reported here but not in [11] are attributed to the use of mutation combinations to guide the search for new genetic markers. Two mutations of note not reported by [11] that gave at least a 5% increase in accuracy at the highest classification accuracy threshold (99.5%) were 400 PA and 70 NS1. The 400 PA human consensus amino acid was Leucine and 3% of the avian strains had Leucine, with the remainder split between Serine and Proline. In the case of 70 NS1, 99.6% of human samples had Lysine along with 23% of the avian strains. (The avian consensus amino acid was Glutamic acid.)
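Per-class residue frequencies of the kind quoted for 400 PA and 70 NS1 can be computed from an alignment column as sketched below. The function names and the tiny toy alignments are hypothetical and do not reproduce the study's data.

```python
from collections import Counter

def residue_frequencies(sequences, column):
    """Fraction of sequences carrying each residue at one alignment column."""
    counts = Counter(seq[column] for seq in sequences)
    total = sum(counts.values())
    return {residue: n / total for residue, n in counts.items()}

def consensus(sequences, column):
    """Most common residue at one alignment column."""
    return Counter(seq[column] for seq in sequences).most_common(1)[0][0]

# Hypothetical 3-column alignments, one string per strain.
human = ["AKL", "AKL", "AKL", "AEL"]
avian = ["AEL", "AEL", "AKL", "AEL"]

column = 1
human_consensus = consensus(human, column)
avian_consensus = consensus(avian, column)
# How often the human consensus residue occurs among the avian strains.
overlap = residue_frequencies(avian, column).get(human_consensus, 0.0)
print(human_consensus, avian_consensus, overlap)
```

A marker is more host specific when the human consensus residue is rare among avian strains, which is why the avian frequency of the human residue is reported alongside each marker.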
Figure shows the analysis for finding the high mortality rate type mutations. No single mutation contributed more than 50% to the classification accuracy, which illustrates the complexity of high mortality rate classification. Multiple mutations were required, but restricting combinations to a size of less than 10 precluded accuracy levels matching those of the initial classifier, which used the whole genome as input. The marker combinations reached the required accuracy only at the 3 lower thresholds of 94.8%, 93.5% and 92.8%, but not at the highest threshold of 96.6%.