|Home | About | Journals | Submit | Contact Us | Français|
Highly pathogenic avian influenza (HPAI) virus H5N1 infects water and land fowl and can infect and cause mortality in mammals, including humans. However, HPAI H5N1 strains are not equally virulent in mammals, and some strains have been shown to cause only mild symptoms in experimental infections. Since most experimental studies of the basis of virulence in mammals have been small in scale, we undertook a meta-analysis of available experimental studies and used Bayesian graphical models (BGM) to increase the power of inference. We applied text-mining techniques to identify 27 individual studies that experimentally determined pathogenicity in HPAI H5N1 strains comprising 69 complete genome sequences. Amino acid sequence data in all 11 genes were coded as binary data for the presence or absence of mutations related to virulence in mammals or nonconsensus residues. Sites previously implicated as virulence determinants were examined for association with virulence in mammals in this data set, and the sites with the most significant association were selected for further BGM analysis. The analyses show that virulence in mammals is a complex genetic trait directly influenced by mutations in polymerase basic 1 (PB1) and PB2, nonstructural 1 (NS1), and hemagglutinin (HA) genes. Several intra- and intersegment correlations were also found, and we postulate that there may be two separate virulence mechanisms involving particular combinations of polymerase and NS1 mutations or of NS1 and HA mutations.
H5N1 avian influenza initially garnered widespread attention following the first human cases of infection recorded in Hong Kong in 1997, after an outbreak in chickens (15, 72, 74), which itself followed an earlier infection in water fowl (21, 84). Since then the virus has spread across Asia to Russia, the Middle East, Europe, and Africa, causing deaths in wild aquatic birds (13, 58, 71, 81), domestic ducks and chickens (29, 39, 43), dogs and cats (2, 35, 36, 55, 76), and 254 of 405 confirmed human cases as of 5 February 2009 according to the World Health Organization (http://www.who.int/csr/disease/avian_influenza/).
Avian influenza virus strains are classified as highly pathogenic or of low pathogenicity on the basis of their virulence in chickens—highly pathogenic avian influenza (HPAI) virus can cause significant mortality (75 to 100%) in unvaccinated flocks (see reference 1), whereas low-pathogenicity strains cause only mild or no symptoms. HPAI H5N1 strains possess multiple basic amino acids in the hemagglutinin (HA) surface glycoprotein (27) cleavage site, the characteristic of all HPAI (62, 63). A variety of substrains of HPAI H5N1 have appeared during the last 10 years (82) and have probably become endemic in both wild birds and poultry (12, 43, 58). All strains of HPAI H5N1 are highly pathogenic in chickens, but their virulence varies in ducks (29) and mammals (see for example, references 11, 20, and 34).
The influenza A virus has an eight-segment RNA genome encoding 11 proteins, and the effects of point mutations or allelic combinations on the virulence of HPAI H5N1 isolates in mammals have been measured in mammals by reverse genetics and reassortment studies. Virulence in mammals is thought to be polygenic (10, 20), and several experimental studies have shown that mutations in the genes encoding internal proteins are important virulence determinants. For example, E627K in PB2 was shown to be important for replication in mammalian cells (73) and in the difference in virulence of two H5N1 Hong Kong 1997 outbreak isolates (24), whereas D92E in NS1 was found to increase resistance to tumor necrosis factor alpha and gamma interferon host responses for another H5N1 Hong Kong 1997 strain in vitro and in vivo in one study in swine (67).
Given the diversity of H5N1 viruses, their high rates of mutation and reassortment (14, 21, 22, 68, 77), and the polygenic nature of their virulence in mammals, general conclusions about virulence factors cannot be drawn with confidence from individual studies of small numbers of closely related sequences. However, systematically amalgamating the results from several studies may allow a more comprehensive picture to emerge. Such integration of results from many experiments and assays to provide a more complete understanding of gene-protein interactions has been carried out in Saccharomyces cerevisiae (32, 40, 65). Using multiple data sources reduces the number of false-positive results and can reveal new patterns of association (32, 40), provided data quality and content issues are overcome (see for example, reference 56). Furthermore, while assembly of a few large data sets into a form suitable for meta-analysis can be done manually, experimental results from many small individual studies can also be captured by text mining published abstracts from online databases such as PubMed at the National Center for Biotechnology Information (NCBI) (51, 69).
The aim of the present study was to find a statistical model to describe the amino acid sites important for virulence of HPAI H5N1 in mammals and to indicate possible interactions between the sites. We have text mined PubMed abstracts for H5N1 isolates with experimentally determined virulence in mammals and downloaded the corresponding full genome sequences from the NCBI Influenza Virus Resource (4). We then fitted a Bayesian graphical model (BGM) to capture the dependencies among mutations and between individual mutations and the mammalian virulence phenotype. BGMs represent the probabilistic dependencies between random variables as a network without self-references, i.e., a directed acyclic graph (37, 59). They provide a convenient description of multivariate data because each link encodes a direct probabilistic dependency between sampled variables (e.g., A − C), rather than just a correlation (e.g., A − C, because A − B and B − C). BGMs have been used in several biological contexts: to infer gene expression networks from microarray data (18), to find the genes involved in a complex trait using single nucleotide polymorphism data (66), and to find coevolving sites in a rapidly variable region of a human immunodeficiency virus (HIV) gene (61). In the present study the nodes (random variables) in the network represent variable amino acid sites or the phenotype: links between nodes represent direct dependencies, i.e., links between amino acid sites represent putative functional interactions, and links between amino acid sites and the phenotype of “virulence in mammals” represent direct association of specific mutations and virulence.
A list of H5N1 HPAI isolates with experimentally determined virulence in mammals was obtained by using a text-mining approach, followed by manual curation. Initially, XML (Extensible Markup Language) records of all abstracts containing “H5N1” and “virulence” keywords were obtained from PubMed using Eutils queries (http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html) generated with R scripts (http://www.r-project.org). Each of the 542 abstracts obtained on 5 March 08 were searched for isolate names (such as A/Vietnam/1203/2004) using a “/(words)/(words)/” regular expression (i.e., at least three “/”s). This yielded 185 isolate names in 97 abstracts. Next, a list was compiled (using R scripts and manual editing) of unique isolate names, taking into account synonyms (e.g., dk = duck) and recording the PubMed identifiers (PMIDs) of papers that had mentioned the strain. Of the 94 unique isolate names found, 64 were for H5N1 isolates with measured virulence in mammals and full genome sequences deposited in the NCBI Flu database (4) (http://www.ncbi.nlm.nih.gov/genomes/FLU/FLU.html).
The abstracts corresponding to the 64 H5N1 isolates were examined, and detailed results from mice or ferret studies were manually extracted from the corresponding studies. In some studies additional strains not mentioned in the abstract had also been studied, and these strains were also added to the original list, bringing the total to 69 strains from 27 individual studies. Table SA1 in the supplemental material contains the references to the studies and the numbers of strains from each study in the final data set. Each strain was manually classified as either virulent or nonvirulent in mammals based upon the amalgamated experimental evidence. Table SA2 in the supplemental material contains the assigned classification and experimental evidence per strain.
Full genome nucleotide sequences of the 69 isolates with experimental evidence for (or against) virulence in mammals were downloaded from the NCBI Flu Database. The individual segments were initially aligned by using CLUSTAL W in BioEdit, and then a final manual codon alignment was performed in MEGA 4.0. The PB1-F2, M2, and NS2 coding sequences were manually extracted from the aligned nucleotide files (not all strains were annotated for these alternative coding sequences), and all 11 nucleotide coding sequences were translated to protein sequences in MEGA 4.0.
To analyze the association of particular residues at amino acid sites and virulence using BGMs, the protein sequences were converted into a binary matrix where each row of the matrix represented an isolate genome (11 proteins concatenated), and columns represented amino acid sites. An additional column for the virulence phenotype was added. Zeros represented consensus amino acid residues, ones represented any mutant (or the virulent phenotype). In particular, the following procedure was used to determine the binary coding at each variable site (via a custom R script). (i) If there are only two types of amino acids present at a site (A or B), a 2×2 table was calculated, counting the number of A residues in the virulent sequences, V(A); the number of A's in nonvirulent sequences, N(A); the number of B's in virulent sequences, V(B); and the number of B's in nonvirulent sequences, N(B). (ii) To determine which to code as 1, A or B, the determinant of the 2×2 table was calculated as V(A)N(B) - V(B)N(A). If the determinant is positive, then amino acids of type A are coded as 1's and the B's are coded as 0's (and vice versa for negative determinants). Alternatively, if there is evidence in the literature to indicate that a particular residue is related to virulence (or nonvirulence), then this residue is coded as 1 (or 0). (iii) If there are more than two types of residue at a site, then step 2 is repeated for all combinations of amino acids partitioned into two groups, and the combination with the smallest P value from the Fisher exact test on the 2×2 table is chosen.
The coding scheme used, corresponding 2×2 table entries and P value from the Fisher exact test for sites of interest are reported in Table Table11 (see Table SA3 in the supplemental material for information on all of the sites examined).
BGMs were inferred by using HyPhy (60, 61) from sequence and phenotype data coded to a binary matrix. HyPhy implements the methods of Cooper and Herskovits (16) and Friedman and Koller (19) for BGM inference as follows. The probability for a particular graph structure given the data can be calculated by using the K2 scoring metric (16). The K2 metric score for one node is a function of the number of discrete states of the node and its parents (e.g., how many B = 1 when A = 1, etc.), and the total score for the graph is the product of the individual node scores. When the number of nodes is nontrivial (typically greater than 5), evaluating all possible graph structures becomes prohibitive so the Monte Carlo Markov chain (MCMC) method is used over families of ordered nodes to find a set of probable graph structures to describe the data (19). The output of the BGM inference is a consensus graph, where each link is assigned a probability of existing according to how often it appears in the sampled MCMC results (i.e., the marginal posterior probability − the expectation of posterior probabilities weighted by the likelihoods of the node orderings in which the given link appears). Links with probability greater than or equal to 0.5 in the consensus graph are used to create the final network.
For analysis of the 69 sequence data set (with 10 nodes), the HyPhy BGM MCMC was run with 106 steps (after a burn-in of 105 steps) and sampled every 1,000 steps. These parameters were chosen so that the sampled likelihood values were converged to a stationary distribution, and the individual samples were independent (the autocorrelation of the sampled likelihood values was calculated in every run, and the values of the first lags were not significantly different from zero). The maximum number of parents allowed per node was increased until there was no significant change in network structure (i.e., no new links where P ≥ 0.5 occurred), and a maximum of five parents per node was found to be sufficient for the present data set.
The significance of the links in the BGM inferred from the 69 × 10 binary data matrix was tested in two ways.
(a) The inferred BGM was compared to a set of null model BGMs. Null models were generated by inferring BGMs from nonparametrically bootstrapped data in which one site (or phenotype) was randomly permuted (i.e., a fixed column in a matrix of observation was resampled without replacement) at a time (leaving the other sites unchanged). Ten randomizations were performed per site (i.e., one-hundred randomizations for the ten sites). Randomizing each site (or the phenotype) independently of the others and inferring the BGM gives the background probability of the links between that site and the others due to random noise alone, conditional upon the joint probabilistic structure of the remaining data. The unpermuted BGM link probabilities from the site of interest to or from the other sites were normalized by the mean and standard deviation of the 10 permuted results to give a standard score per link.
(b) Cross-validation of structure was applied. To test the robustness of the inferred BGM and mitigate any possible effects from isolate misclassification or sequencing errors, 10 BGMs were calculated and averaged over different subsets of the data. Each subset of data consisted of 65 sequences (i.e., training data), and the 4 remaining sequences were kept as test sets. Since a different group of strains (particularly the Z constellation) began to replace the early strains from 2002 onward (11, 22, 43), different patterns of mutations may occur after 2002; hence, four sequences were excluded in each run, one from each of the pre-2002 (early) virulent, 2002+ (late) virulent, pre-2002 nonvirulent, and 2002+ nonvirulent sequences.
Conditional probability tables describing the probability that a strain is virulent (nonvirulent) given the presence or absence of mutations at individual sites (identified in the BGM as directly influencing virulence), or a combination of those sites, were calculated separately for each of the 10 cross-validation training data sets using the observed frequency of mutations. Certain combinations of mutations in the influencing sites were not observed because of the correlation between the sites (e.g., there were no sequences in the 69 sequence set with PB1-317 = 1 and NS1-92 = 0 or vice versa). For (hypothetical) strains with these unobserved combinations, we estimated the probability of virulence from the probability of virulence of the individual sites, assuming that the sites were independent as in equation 1 below.
where P(Vir = 1|a1a2a3) is the probability of virulence for a sequence with the particular combination a1a2a3 of amino acids, and pi(ai) is the probability of virulence when node i has amino acid i.
The performance of the model was examined by predicting the virulence of the sequences in each of the 10 training and test data set pairs using the conditional probability table derived from each training set, respectively. The numbers of true positives (TP; virulent sequences predicted to be virulent by the model), false negatives (FN; virulent sequences predicted as nonvirulent), false positives (FP; nonvirulent sequences predicted as virulent), and true negatives (TN; nonvirulent sequences predicted as nonvirulent) were calculated along with the misclassification (FP + FN), sensitivity [TP/(TP + FN)], and specificity [TN/(FP + TN)] values.
A total of 69 H5N1 full-genome isolates from 27 studies of virulence in mammals identified using a text-mining approach, followed by manual curation, were used in this analysis (see Materials and Methods and Table SA1 in the supplemental material). The most frequently measured strain—A/Vietnam/1203/04—was the subject of seven virulence studies. Five studies (11, 20, 34, 47, 50) contained measurements for at least five sequences. Strains were classified as virulent (or nonvirulent) if infection studies showed a high mortality with a low dose; sequences with 50% lethal dose of ≤103 50% egg infectious doses were classified as virulent in mammals. Even though the virulence results were obtained in different animals and using different protocols, the results were generally concordant: in 75% of cases (52 sequences), strains could be classified clearly either as highly pathogenic in mice or ferrets or only caused mild or no symptoms. Furthermore, three studies measured virulence in both mice and ferrets (28, 50, 64) in a total of five strains. In these cases, the pathogenicity in ferrets correlated with that in mice (highly pathogenic strains in ferrets were highly pathogenic in mice, low-pathogenicity strains in ferrets showed low pathogenicity in mice). Although differences in the virulence of two H5N1 strains in mice and ferrets at high dose have been reported (86), we did not find significant differences in the overall virulence of the strains in the present data set when we used our binary classification scheme. Details of the strain classification and associated experimental evidence can be found in Table SA2 in the supplemental material.
The final full data set consisted of 69 full genome sequences: 35 from water fowl, 17 from land fowl, and 17 from humans. Totals of 20% (7) of the water fowl isolates, 24% (4) of the land fowl isolates, and 65% (11) of the human sequences were classified as virulent in mammals, making a total of 22 virulent sequences. The distribution of the strains across isolation year and place is shown in Fig. Fig.1:1: this data set covers several outbreaks, including the 1997 Hong Kong, 2000 to 2002 China, and 2004 Viet Nam epidemics.
Across the whole genome, 1,235 of 4,582 (~27%) amino acid sites were variable (≥1 alternative residue in the 69 sequences) with 969 of 1,235 (~78%) of these variable sites containing only 2 amino acid residues. Furthermore, in the remaining 266 (22%) of variable sites that coded for three or more alternative amino acids, the average frequency of the next most abundant residue was only 4%. To facilitate analysis of the relationship between mutations and virulence in mammals, the sequence data were coded to a binary matrix, which therefore retained almost all of the information contained in the sequence alignment.
All 1,235 polymorphic amino acid sites in the protein sequences of the 69 strains were examined, and the distribution of the alternative residues among strains classified as virulent or nonvirulent was recorded. A total of 227 amino acid sites (19% of the variable sites) across the entire genome were found to have an association with virulence at P ≤ 0.05 (Fisher exact test, uncorrected). However, there was a high degree of association among sites, with mutations at 129 sites having an identical distribution across the strains to at least one other site. Of the 98 sites with a distribution that was unique in the data set, only 53 differed by more than one mutation from any other site. Of the 98 sites with a unique distribution, 12 have been reported in the literature to have functional significance for virulence (see Table SA3 in the supplemental material).
We next examined all amino acid sites or genomic features (e.g., neuraminidase [NA] stalk deletion; PDZ binding domain in NS1) previously reported in the literature as having functional significance for virulence in H5N1 and found evidence for 70 sites/features. Their distribution among virulent and nonvirulent strains was tested as described above: the 23 that were apparently associated with virulence in this data set are shown in Table Table11 (Fisher exact test P ≤ 0.05, uncorrected; see Table SA3 in the supplemental material for details of all 70 sites). All of the strains in this data set, including nonvirulent ones, contained the HA-226(Q) and HA-228(G) (H3 numbering) residues for the avian sialic acid receptor binding site (23, 30) and the NS1-149(A) virulence determinant in chickens (45); on the other hand, only one strain contained the pathogenicity-associated proline at amino acid site 42 in NS1 (33).
Of the 23 sites, three in PB2 (sites 318, 355, and 627), one in PB1 (site 317) and one in NS1 (site 92) which were apparently associated with virulence (P < 0.01; Table Table1)1) had previously been claimed to have a role in virulence determination (see the Discussion). In addition, the presence of PB2-701N has been suggested as an alternative virulence or adaptation marker in mammals in place of PB2-627K (17, 38, 70), and combining the PB2-627 and PB2-701 sites resulted in a more significant P value (P = 0.0003) for association with virulence than PB2-627 (P = 0.003) or PB2-701 (P = 0.08) alone. A further five sites in polymerase acidic (PA), and nonstructural genes (NS1 and NS2) were also nominally significant; however, these sites had very similar distributions across strains to PB2-318. Two HA antigenic sites—HA-102(mature H5 site 86, part of canonical antigenic site E) and HA-279(263E) and the glycosylation site HA-172(156)—had a possible association with virulence (P < 0.01), and deletions in the NA stalk region appeared to be quite strongly associated: 21 of the 22 virulent sequences showed stalk deletion compared to about half of the nonvirulent sequences (P < 0.001).
The uncorrected associations are not themselves evidence of causal dependencies but assist with the selection of sites to be analyzed in the multivariate model. Sites with the same (or only one mutation different) pattern across the sequences were combined since the individual contribution of these sites to virulence would not be distinguishable from this data set. The sites with common patterns are shown in Table Table2.2. Of the 23 sites and/or features with Fisher exact test P values of ≤0.05, only 13 had distinct patterns. In order to reduce the possibility of inferring false links, the total number of variables included in the BGM analysis was restricted as indicated by simulation studies based on sample sizes of 60 to 70 strains (see the discussion of simulation analysis in the supplemental material). Consequently, the nine uniquely patterned sites with the most significant P values for association with virulence in mammals (ranging from P < 10−4 to P = 0.019), together with the virulence phenotype, were included in the multivariate analysis (Table (Table3).3). The selected sites were: PB2-318, PB2-627, PB1-317 ( PB2-355), HA-102 ( NS1-195), HA-172, HA-228, HA-279, NA stalk deletion, and NS1-92 ( NS1-228) (the final binary data used can be found in Table SA4 in the supplemental material). Since there is evidence in the literature to suggest that PB2-701N may compensate for the lack of PB2-627K in the adaptation of the virus to mammals (17, 38, 70), we also investigated the effect of using the combined PB2-627 and PB2-701 site data in place of PB2-627 alone.
BGMs were inferred from the binary data of the selected sites using: (i) all 69 sequences; (ii) 100 cases of permuted binary data (null models) and; (iii) 10 cases of leaving 4 sequences out each time (10× cross validation, see the discussion of model validation in Materials and Methods). Figure Figure22 shows the inferred network structure for significant links, together with the average link probability from the 10× cross validation and standard score from the comparison to the null model networks. The model confirms that virulence in mammals is a complex genetic trait affected directly by mutations in the polymerase, NS1, and HA genes.
Very strong correlations among certain sites leave some ambiguity over the precise nature of the molecular changes since these sites could not be analyzed separately (Table (Table2).2). Specifically, three distinct mutation patterns, representing two sites each—PB1-317/PB2-355, NS1-92/NS1-228, and HA-102/NS1-195—were directly associated with virulence in mammals in the model. We did not detect a significant direct association of PB2-627 alone with virulence in mammals in the model; however, a strong direct association, with a probability of ≥0.8 (standard score = 3 to 4) was detected for the combined sites PB2-627 and PB2-701 (data not shown).
Sites within the polymerase genes were strongly associated with each other in the model: the pattern 1 sites (including PB2-318, PA-127, and PA-336) were linked to PB1-317/PB2-355 pattern 2 sites with a probability of ≥0.9 (standard score = 9 to 10). Strong intrasegment associations were also found in both HA and NS, particularly between antigenic sites HA-102 and HA-279, HA-102 and the variable glycosylation site HA-172, and NS1-92 and NS1-228 (pattern 5) and between NS1-189, NS2-31, and NS2-56 (pattern 1). The most significant intersegment association was between PB1-317/PB2-355 (pattern 2) and NS1-92/228 (pattern 5). NS1 sites 92, 195, and 228 were also strongly associated with HA-279 (antigenic site), and NS1-189 had the same pattern as the PB2-318, PA, and NS2 sites (pattern 1). Additionally, the previously reported interaction between the HA-172(156) glycosylation site and the NA stalk deletion (3) was detected in the BGM with a probability of ≥0.8 (standard score = 9 to 10).
The model was tested by using cross-validation (see Materials and Methods). Table Table44 and Table Table55 show the conditional probabilities for virulence in mammals averaged over each training data case from the cross-validation data sets, calculated per single influencing node (Table (Table4)4) and per combination of influencing sites (Table (Table5).5). Where a particular combination of amino acids has not occurred in this data set (due to the small data set size and interactions between the influencing sites), the probability of virulence in Table Table55 was estimated by combining the individual probabilities in Table Table4,4, assuming independence among sites (see the discussion of conditional probability tables in Materials and Methods).
Using the conditional probability tables from the 10 training data examples, virulence predictions were made for each test data example (these were composed of the four excluded sequences from the training data: one of each early-virulent, late-virulent, early-nonvirulent, and late-nonvirulent sequences, where early refers to a pre-2002 occurrence; see the discussion of model validation in Materials and Methods for details). The average prediction performance over the cross validation training and test data sets was calculated (Table (Table6);6); for the test data, the sensitivity was 70%, and the specificity was 95%.
This study combines results from previous experimental studies based on strains isolated during 3 major HPAI H5N1 outbreaks in order to examine the genetic basis of virulence in mammals. Extensive literature mining, followed by rigorous quality control, identified 69 complete genome sequences with experimentally determined virulence in mammals for use in this analysis, a threefold increase from the number of sequences analyzed in the largest experimental study to date. We have fitted for the first time a rigorous statistical model relating mutations in multiple influenza viral segments to virulence in mammals. Many authors have previously described virulence in terms of a complex trait (10, 11, 20): the model presented here confirms that virulence in mammals is affected by mutations in at least four segments: PB1, PB2, HA, and NS1.
Three sites in PB2 (sites 318, 355, and 627), PB1-317 and NS1-92 that were apparently associated with virulence in the univariate studies had been proposed in earlier experimental studies (10, 11, 24, 34, 41, 50, 67). Recent experimental evidence indicates that a PB2-701N mutation may arise in mammals instead of PB2-627K (70); we found no isolates containing both mutations in the 69-sequence data set, and combining these two sites increases the association with virulence. A further five sites in PA, NS1, and NS2 identified in a study of the 1997 Hong Kong H5N1 human outbreak (75) were also nominally significant; however, these sites had very similar mutation patterns to PB2-318 and the “virulent” residues only appeared in the 1997 sequences. We did not detect any significant association of NS1 deletions alone with virulence in mammals in this data set (48). However, since NS1-97E is a conserved residue (in the aligned sequences), the 5-amino-acid NS1-deletion also leads to an E at position 92 in the protein (48), and NS1-92E (no deletion) or NS1-97E (with deletion) with virulence in mammals was marginally significant (P ~ 0.05). The association of NS1-92E alone (aligned sequences) with virulence in mammals was more significant in this data set (P < 0.001). In addition, no significant associations between virulence in mammals and the Src homology 3 domain binding motifs or PDZ binding domains containing the amino acid motif ESEV or EPEV in NS1 were found (26, 31, 57). However, proline at NS1-228 (part of the PDZ binding domain) was associated with both virulence in mammals (P < 0.003) and strongly with E at NS1-92 (P 10−6).
The nine sites with the strongest individual associations with virulence were included in the BGM analysis, and three of them (PB1-317/PB2-355, NS1-92/NS1-228 and HA-102/NS1-195) were found to have a direct association with virulence.
The mutation patterns at PB1-317 and PB2-355 were very similar to each other in the 69-sequence set and directly associated with virulence in mammals. These mutations were first identified by Katz et al. (34) as correlated with virulence in mice in the 1997 Hong Kong outbreak. In this 69-sequence data set, the PB1 (M,V)317I and PB2 (R,Q)355K “virulent” mutations only appear in sequences from the 1997 Hong Kong outbreak. In an alignment of 579 full-genome sequences (representing all H5N1 unique sequences deposited in the NCBI Influenza Virus Resource as of September 2008), 44 sequences contained one and 12 contained both of the mutations. Apart from the 1997 Hong Kong outbreak, PB1-317I and/or PB2-355K mutations were also present in sequences from birds in China and Viet Nam in 2005 and 2006; a civet (Viet Nam 2005); and four human sequences (Indonesia 2006, China 2006 and 2007).
Two mutations in NS1, D92E and S228P, which were strongly associated in the data set were also directly associated with virulence in mammals in the BGM. The NS1 D92E change is thought to influence pathogenicity by making the virus more resistant to host interferon and tumor necrosis factor alpha responses (67), thus inducing a more severe cytokine response in the host (46). NS1-228 is one of the C-terminal residues and part of a region that could bind proteins containing PDZ domains. NS1 proteins with avian ESEV or EPEV PDZ domain ligand motifs were postulated to disrupt human cell pathways by binding to proteins with PDZ domains (57), and NS1 proteins with avian PDZ domain ligands have been shown to cause increased virulence in mice (31). A total of 51 of the 69 sequences examined here contained ESEV or EPEV at positions 227 to 230: no significant association of these motifs with mammalian virulence was found. However, the point mutation S228P was significantly associated with virulence in mammals. The NS1-92 and NS1-228 mutations remain correlated in the 579 H5N1 full genome sequence set. There are 17 sequences with both mutations, the majority were from the 1997 Hong Kong outbreak, but 3 were from isolated from water fowl in Viet Nam 2005. Five sequences (one from swine in China 2003 and four from birds in Viet Nam and Russia 2005) contained the S228P mutation only, but no sequences contained only the D92E mutation. The BGM inferred from the 69-sequence data set showed a strong link between PB1-317/PB2-355 and NS1-92/NS1-228, and this association still holds in the 579 full-genome data set (P 10−6 for the Fisher Exact test with the Bonferroni correction for multiple testing between any of the 2,312 variable sites, i.e., 2.67 × 106 pairwise interactions). All 12 sequences with both PB1-317 and PB2-355 mutations also contained both NS1-92 and NS1-228 mutations. Hence, the model and wider results indicate that the particular polymerase and NS1 gene combination significantly affects HPAI H5N1 virulence in mammals and that a mutation pattern associated with virulence in mammals in the 1997 Hong Kong outbreak has reemerged among human H5N1 cases in Indonesia and China in 2006 and 2007.
The other mutations directly associated with virulence in the BGM are NS1-195T and HA-102V. These sites are not strongly linked in the model to the PB1, PB2, and NS1 sites discussed above and may represent a separate mechanism for enhanced virulence in mammals. NS1 sites S195, T197, and D92 are hydrogen bonded in the NS1 crystal structure (6) and make a cleft. It has been proposed that mutation D92E could alter the phosphorylation of this NS1 cleft and thus affect virulence (6, 44). The BGM results indicate that NS1-195 mutations also affect virulence, and it is postulated that this could be due to NS1-195 mutations changing the phosphorylation properties of NS1, although clearly more experimental work is required to investigate this as a possible mechanism. Interestingly, the pattern of HA-102(86) V residues were completely correlated with the NS1-195 T in the 69 sequence data set. HA-102(86) is part of antigenic site E (83), so it is conceivable that a mutation here could reduce any cross-immunity that mammals may have against the virus. The correlation between NS1-195T and HA-102V is highly significant in the 579 full-genome sequence set (P 10−6 for the Fisher Exact test with the Bonferroni correction), and there are 83 sequences with both mutations, recorded from 2004 onward, mostly in Thailand, Viet Nam, and Russia. HA-102 is a highly polymorphic site, with six different residues present in the 69-sequence set. Of these, V (present in 10 isolates) and A (present in 55 isolates) are associated with virulence and nonvirulence in mammals, respectively. The remaining four residues (I, P, S, and T) are present in one isolate each, making a virulence assignment on the basis of the 69 sequence data set alone difficult. In the 579-sequence data set, residues I, P, and S were rare and only present in 13 avian isolates in total. HA-102T, on the other hand, was present in 48 isolates: 44 of the 48 human isolates from Indonesia; a cat isolate from Indonesia, and 3 avian isolates from China. In the BGM, HA-102V was directly associated with virulence with a link probability of 0.6 to 0.7 (standard score = 3 to 4), and updating the binary coding scheme for HA-102 to include T as a “virulent” residue increases the link probability in the BGM to 0.8 to 0.9 (the standard score remains 3 to 4). Consequently, HA-102 V or T should be considered as a virulence marker.
At first sight, it is surprising that PB2-627 alone was not found to be directly associated to virulence in mammals because this mutation is thought to be involved in adaptation to mammalian hosts (73), has been previously reported as contributing to virulence in the 1997 Hong Kong outbreak (10, 24, 25), and was shown to be important for efficient respiratory tract growth in mammals (25). The sequence set used here spanned more than the well-studied 1997 Hong Kong outbreak and contained 6 PB2-627K virulent sequences (of 22 virulent sequences), and 1 PB2-627K nonvirulent sequence (of 47). The six virulent PB2-627K sequences were sampled in Hong Kong in 1997 (2), Thailand and Viet Nam in 2004 (3), and Germany in 2006 (1), whereas the single PB2-627K sequence classified as nonvirulent was from A/Turkey/65596/2006 (a human case in Turkey), which caused only mild symptoms in ferrets at a low dose (85). In one of the cross-validation data sets, the PB2-627K nonvirulent sequence was (randomly) excluded, and the resulting BGM had a PB2-627-virulence link probability of 0.6. However, using only sequences that were either very highly pathogenic or nonpathogenic (75% of cases, 52 sequences), no significant link was found between PB2-627K and virulence in mammals. On the other hand, since it has been suggested that a PB2-701D-to-N mutation may arise to compensate for a the lack of the “virulent” PB2-627K residue in mammals (17, 38, 70), we combined the PB2-627 and PB2-701 sites, repeated the analysis, and found a significant direct association between the combined sites and virulence in mammals using BGMs (link probability of ≥0.8, standard score = 3 to 4). Although the relationship between PB2-627K alone and virulence in mammals was not found to be significant in the 69 sequence data set, these results suggest that PB2-627K and PB2-701N might be directly associated with virulence, but more isolates with these mutations and experimentally measured virulence in mammals would be required to detect the associations unambiguously. In addition, more isolates would allow the putative negative association between PB2-627K and PB2-701N to be evaluated.
Even though the uncorrected P value from the Fisher exact test for the association of NA stalk deletion and virulence in mammals was ~0.0003, the BGM results did not show a direct association. However, strong direct associations of NA stalk deletion with HA-172(156, glycosylation site) and HA-279 (263, antigenic site) were inferred. HA and NA have complementary functions in viral replication (see Wagner et al.  for a review), and various studies have indicated that the effects of mutations in one can be compensated by mutations in the other (52, 54, 79); consequently, some HA-NA associations are to be expected. In particular, the NA stalk deletion-HA-172 glycosylation intersegment association (3) was confirmed by the BGM in the present study. Both the association of HA-172 and NA stalk deletion and the HA-279 and NA stalk deletion were significant in the 579-genome sequence set (Bonferroni-corrected Fisher exact test result of P < 10−6 for each). Neither the BGM nor the large sequence set provided any evidence that HA-172 and HA-279 were associated with each other (the corrected P value for the Fisher exact test on the large sequence set was 1), so we deduce that the NA stalk deletion is independently associated with both HA mutations, indicating a functional interaction between HA and NA.
An underlying issue with analysis of genetic determinants in influenza is the general lack of recombination within segments and consequent strong associations of amino acid variants over evolutionary time. Among the 228 sites with P ≤ 0.05 for association with virulence in the 69-sequence data set, the high degree of correlation in patterns of mutation is largely due to identity by descent of the viral genome segments. Thus, 34 sites have most or all of their “virulent” residues in the 11 sequences isolated in the 1997 Hong Kong outbreak, 15 sites on segment 8 have mostly “virulent” residues except in the Goose/Guangdong/96 and 1999-2001 NS allele B sequences (9, 21, 49), and 7 sites have most or all of their “virulent” residues in the 10 sequences from the 2004 Vietnam/Thailand outbreak.
To mitigate the potential bias in the results introduced by nonindependence, an alternative approach would be to code each sequence for the presence or absence of mutations compared to its immediate inferred ancestral sequence (using a phylogenetic tree and model of sequence evolution). The change in the viral phenotype from the ancestral sequences would also have to be inferred. However, although this approach was successfully used to find mutations in the HIV envelope protein (60, 61), it cannot be applied to a data set where each segment has its own evolutionary history (22, 77, 80) since reassortment of the internal genes with other subtypes means that no single tree is applicable to all eight segments. Recognizing the issue, we reduced potential biases incurred due to the relationships between the sequences by using a relatively wide range of isolates (spanning more than 10 years) and selection of sites with dissimilar mutation patterns across the strains.
We used several techniques to reduce other potential sources of error in the model. First, backed by simulation studies investigating the number of false-positive associations as a function of data set size and number of sites, we limited the number of sites in our analysis to those with prior evidence from the literature and the strongest univariate associations to virulence in mammals. Second, we used cross-validation to reduce the effects of sequence misclassification (and sequence errors): the model presented here is an average over 10 subsamples of the data. Third, we used permutation tests to validate the significance of the associations against the null model of association by chance.
Using the model, we estimated the probability of virulence in mammals for strains containing different combinations of influencing sites. To evaluate the prediction performance of the model, we predicted each sequence in the cross-validation test data sets as virulent or nonvirulent on the basis of the combination of mutations it contained and compared the results to the original values. The high specificity (95%) indicates the model yields few false positives, but the lower sensitivity (70%) indicates that not all virulence determinants may have been captured as a result of the small size of the data set. Noise and experimental errors in the data are also expected to reduce the classification performance. The performance of the model compares well with other models relating viral mutations to phenotype, e.g., sensitivity in the range 40 to 100% (70% was typical) and a specificity of ca. 70% have been reported for HIV drug resistance models (42).
In conclusion, we have collated experimental data about the virulence of HPAI H5N1 strains in mammals and used it to identify its genetic basis by inferring a BGM of whole-genome mutations affecting the phenotype. We used data from strains isolated over a period of at least 10 years and analyzed the results obtained in 27 studies so our conclusions are general rather than specific to a single experimental system. The resulting statistical model reveals the polygenic nature of virulence in mammals: mutations at PB1-317, PB2-355, NS1-92, NS1-228, NS1-195 and HA-102(86) are directly associated with the trait. These sites split into two semi-independent groups, possibly indicating separate virulence mechanisms: PB1-317, PB2-355, NS1-92, and NS1-228 mutations were strongly correlated with each other, as were NS1-195 and HA-102(86). We also found strong correlation patterns in the PB1, PB2, and PA sites; the antigenic sites and a variable glycosylation site in HA; and the HA and NS sites. The results highlight the importance of the polymerase and nonstructural internal genes in addition to HA in determining the virulence of HPAI H5N1 in mammals and identify potential targets for intervention in mammalian H5N1 infections.
This study was funded by the Biotechnology and Biological Sciences Research Council (Bb/B009670/1). In addition, this research was supported in part by the National Institutes of Health (AI43638, AI47745, and AI57167), the University of California Universitywide AIDS Research Program (grant number IS02-SD-701), and a University of California, San Diego, Center for AIDS Research/NIAID Developmental Award (AI36214).
Published ahead of print on 22 July 2009.
†Supplemental material for this article may be found at http://jvi.asm.org/.