|Home | About | Journals | Submit | Contact Us | Français|
Conceived and designed the experiments: GL ROD. Performed the experiments: JX. Analyzed the data: JX CTD HZ PR. Wrote the paper: JX GL ROD MCC CTD HZ PR.
Influenza neuraminidase (NA) is an important surface glycoprotein and plays a vital role in viral replication and drug development. The NA is found in influenza A and B viruses, with nine subtypes classified in influenza A. The complete knowledge of influenza NA evolutionary history and phylodynamics, although critical for the prevention and control of influenza epidemics and pandemics, remains lacking.
Evolutionary and phylogenetic analyses of influenza NA sequences using Maximum Likelihood and Bayesian MCMC methods demonstrated that the divergence of influenza viruses into types A and B occurred earlier than the divergence of influenza A NA subtypes. Twenty-three lineages were identified within influenza A, two lineages were classified within influenza B, and most lineages were specific to host, subtype or geographical location. Interestingly, evolutionary rates vary not only among lineages but also among branches within lineages. The estimated tMRCAs of influenza lineages suggest that the viruses of different lineages emerge several months or even years before their initial detection. The dN/dS ratios ranged from 0.062 to 0.313 for influenza A lineages, and 0.257 to 0.259 for influenza B lineages. Structural analyses revealed that all positively selected sites are at the surface of the NA protein, with a number of sites found to be important for host antibody and drug binding.
The divergence into influenza type A and B from a putative ancestral NA was followed by the divergence of type A into nine NA subtypes, of which 23 lineages subsequently diverged. This study provides a better understanding of influenza NA lineages and their evolutionary dynamics, which may facilitate early detection of newly emerging influenza viruses and thus improve influenza surveillance.
Influenza virus belongs to the viral family Orthomyxoviridae and has a segmented negative-sense RNA genome in an enveloped virion . According to the antigenic properties of nucleoproteins (NP) and matrix proteins (MP), influenza viruses are classified into three types - A, B and C. The microscopic structural features and genome organization of influenza A, B and C viruses suggest that they descended from a common ancestor . The influenza A virus infects a wide variety of bird and mammalian species and can cause moderate to severe epidemics annually and catastrophic pandemics sporadically , . The influenza B and C viruses are considered less pathogenic compared with influenza A and are found mainly in humans, although there is increasing evidence that B and C viruses can also infect other species .
Genetic mutation is considered one of the most important molecular mechanisms in the evolution of influenza virus . Like most RNA viruses, the influenza virus has low fidelity RNA synthesis, which results in a high mutation rate - around one mutation per genome per replication , several orders of magnitude higher than those in most DNA-based organisms . Evolutionary forces such as natural selection acting upon rapidly mutating viral populations could shape the genetic structure of influenza viruses in different hosts, geographic regions and periods of time , . Importantly, rapid evolution could partially facilitate the ability of influenza viruses to cross host species barriers and successfully emerge in new hosts with often important public health and/or veterinary health implications. One such example is the Eurasian avian-like H1N1 swine virus, which was first detected in pigs in Belgium in 1979, with all of the eight segments found to be derived from a Eurasian avian H1N1 virus, presumably following adaptive mutation .
Influenza virus has also shown the propensity to escape immunity because of continuous antigenic drift, i.e., mutation at the epitope positions of HA and NA segments , . Antigenic drift may often result in structural changes in antigenic sites, which must be recognized by the host immune system in order to suppress viral infection . This antigenic drift often requires the update of annual influenza vaccines to assure a match between the vaccine and currently circulating viral strains . Additionally, influenza viruses undergo more dramatic antigenic changes, known as antigenic shifts, which occur following reassortment between different subtypes of influenza viruses within a single host .
Each of the influenza viral genes is thought to be important in viral replication and interaction with host cells; therefore, understanding the evolutionary tempo and mode of each viral gene can provide new insight into the epidemiology of influenza viruses , . Among the eight segments, neuraminidase (NA) is of particular significance. NA is a major surface glycoprotein of influenza A and B, but does not occur in influenza C . It plays a key role in virus replication by removing sialic acids from the host cell surface and thus releasing newly formed virions . Drugs that inhibit this NA activity, known as neuraminidase inhibitors, are often used for the treatment of influenza . However, drug resistance mutations (e.g., H275Y) have been broadly observed in epidemic viruses .
Influenza A viral neuraminidases are classified into nine subtypes (N1–N9) according to their antigenic properties, whereas influenza B neuraminidases are classified into two lineages . Previous phylogenetic analyses of influenza viral NA sequences have provided important insight into understanding the evolution of influenza viruses; however, these studies mainly focused on either specific types or subtypes , , , . A global perspective of the evolutionary history of influenza NA genes and their spatial, temporal, and host associations remain lacking. In addition, evolutionary rates of influenza viral genes were estimated (~10−3 substitution/site/year) in previous studies , , ; however, only the average values across all branches were presented. It is unlikely the evolutionary rates are the same in all branches within a phylogenetic tree. The investigation of rate variations among different branches is thus of significant importance in understanding the interior evolutionary behavior of the influenza virus. Moreover, selection pressure and positive/negative selection sites were described in previous studies , , but only a small number of representative sequences were selected for the estimations. Finally, the structural analysis of positively selected amino acid sites, although essential for the development of antiviral drugs and vaccines, has largely been neglected in previous studies. In this study, we employed all influenza NA sequences available in public repositories and conducted large-scale evolutionary, phylodynamic and structural analyses to address the above issues.
The Maximum Likelihood (ML) and MCMC Bayesian analyses demonstrate that the influenza NA gene diverged first into A and B (Group I and Group II), followed by the division of influenza A subtypes (Figure 1, File S1). The monophylic origin of influenza A and influenza B was strongly supported by the bootstrap values (100%). Within influenza A, two subgroups were found, one consisting of subtype N2, N3, N6, N7 and N9 (Subgroup I) and the other consisting of the remaining four subtypes, N1, N4, N5 and N8 (Subgroup II) (Figure 1). Each subgroup consists of viruses independently adapted to the avian, human, equine and swine hosts, indicating that parallel evolution occurred in these two subgroups (Figure 1). In addition, each of the nine influenza A NA subtypes was found to form a distinct cluster with a high bootstrap support value (>90%), indicating a monophyletic origin for each subtype.
A total of 23 lineages, two to three lineages for each subtype, were identified within influenza A viruses, while two lineages were classified within influenza B (Table 1). Lineages 1A and 2A were further divided into five and three sublineages, respectively. Human lineages were found in influenza A N1 and N2 subtypes and influenza B, swine lineages in N1 and N2, equine lineages in N7 and N8, and avian lineages in all influenza A subtypes. In addition, avian lineages were found to have more combinations of HA and NA compared with mammalian lineages.
Three lineages, 1A, 1B and 1C, were identified based upon strong bootstrap support values (100%) of the phylogenetic tree, which was generated from 4,146 sequences (Figure 2-A, Table 1). The genetic distances between lineages ranged from 0.191 to 0.238. Lineage 1A is a major avian lineage, which is further divided into five sublineages: 1A.1 (H5N1), 1A.2 (Eurasian avian), 1A.3 (Pandemic H1N1 2009), 1A.4 (Eurasian avian-like swine) and 1A.5 (North American avian).
Sublineage 1A.1 originated from the recent highly pathogenic H5N1 avian influenza epizootic that started in Asia around 1996 and has spread throughout the Eastern Hemisphere. The viruses in 1A.1 are mostly from birds (n=1,031), but some are from humans (n=164), swine (n=8), tigers (n=2) and mink (n=1). Sublineage 1A.2 is composed of mostly Eurasian avian influenza viruses (n=230), whereas some human highly pathogenic H5N1 influenza viruses (n=24) sampled in 1997 in Hong Kong were also found in 1A.2. Sublineage 1A.4 consists of Eurasian swine influenza viruses which were originally derived from Eurasian avian viruses and first detected in Belgium in 1979. Not surprisingly, 1A.3 (Pandemic H1N1 2009) is grouped together with Eurasian swine, which confirms previous findings that the NA segment of pandemic H1N1 2009 viruses originated from the Eurasian swine influenza viruses. Sublineage 1A.5 is composed of viruses mainly from North American avian species (n=162), with a few exceptions: 1 viral sequence from human and 3 from environmental samples.
Lineage 1B consists of mainly North American swine influenza viruses, while 1C is a human lineage, consisting mainly of H1N1 human influenza viruses. The viruses in 1B correspond mostly to the classical H1N1 isolates from swine (n=126), but include 9 isolates from humans and 9 from birds, indicating sporadic interspecies transmissions of influenza viruses from swine to humans or birds. Lineage 1C consists predominantly of human viruses (n=1204), with a few exceptions, namely, swine (4 isolates) and birds (2 isolates). Within the influenza A N1 subtype, avian influenza viruses include sequences from multiple HA subtypes (e.g., H1N1, H3N1, H5N1, H6N1, H7N1, H9N1, and H11N1), whereas human and swine viruses have limited HA subtypes (human: H1N1; swine: H1N1, H3N1).
The N2 sequences (3,754 in total) were classified into two major lineages, 2A and 2B (Figure 2-B, Table 1). The genetic distance between lineages 2A and 2B was estimated to be 0.204. Lineage 2A is a major avian lineage whereas 2B consists of mainly mammalian (i.e., human and swine) influenza viruses. Three sublineages were further classified in 2A, 2A.1 for H9N2, 2A.2 for Eurasian avian, and 2A.3 for North American avian.
The 2A.1 is a subtype-specific sublineage consisting of mainly H9N2 avian influenza viruses, with the majority from birds (n=412), but with 24 sequences from swine and 4 from humans, which indicates the occurrence of interspecies transmissions. The 2A.2 and 2A.3 correspond to Eurasian and North American avian viruses, respectively. The viruses of 2A.2 are mainly from birds (n=342), but a few are from swine (n=7) and humans (n=2). A similar result was also found in 2A.3, which includes 291 avian viruses, 1 H7N2 human virus, and 29 viruses isolated from environmental samples.
Within 2B, most of the influenza viruses are from human H2N2 and H3N2 influenza viruses (n=2,340) and swine H3N2 and H1N2 viruses (n=214). However, avian influenza H3N2 viruses (n=11) were also found in this lineage. Interestingly, there were five major clades of swine influenza viruses scattered within lineage 2B, suggesting these viruses originate from human viruses through either genome reassortment or direct transmission events. It is also noted that the branch lengths of the swine clusters are much longer as compared to those of the closely related human viruses, indicating extensive evolution of the N2 gene in swine viruses after transmission from humans to swine.
Three lineages, 3A, 3B, and 3C, were found in N3, with genetic distances between lineages ranging from 0.173 to 0.349 (Table 1, Figure S1). Lineage 3A consists mainly of North American avian viruses (n=173), but includes several avian strains from South America (n=8). In addition, within lineage 3A, 166 sequences were isolated from avian, 4 from swine, 1 from human, and 9 from environmental samples. Lineage 3B is a Eurasian/Oceanian avian lineage, while 3C is also an avian lineage, but does not show any geographical pattern. Lineage 3B and 3C were all composed of avian influenza viruses.
The N4, N5 and N6 subtypes were each classified into two lineages, one corresponding to North American avian (4A, 5A and 6A) and the other Eurasian/Oceanian avian (4B, 5B and 6B) (Table 1, Figure S2, Figure 2-C, Figure S3). The genetic distance between lineages was estimated to be 0.198 for N4, 0.254 for N5, and 0.250 for N6 viruses, respectively. All N4 and N5 viruses are from avian species. Lineage 6A is composed mainly of North American avian viruses (n=336), with a few exceptions (n=2) from Asia avian viruses. Lineage 6B consists mainly of Eurasian/Oceanian avian viruses (n=121), but contains 6 avian viruses from North America.
Three lineages were identified in N7 and N8, which correspond to North American avian (7A, 8A), equine (7C, 8B) and Eurasian/Oceanian avian (7B, 8C), respectively (Table 1, Figure S4 and Figure 2-D). For N9, 3 lineages were identified: 9A, 9B and 9C, which correspond to North American avian, Eurasian/Oceanian avian I and Eurasian/Oceanian avian II, respectively (Table 1, Figure S5). The genetic distances between lineages were found in the range from 0.297 to 0.320 for N7, from 0.269 to 0.298 for N8, and from 0.117 to 0.224 for N9, respectively.
The NA genes of influenza B viruses were divided into two distinct lineages, B/Victoria/2/87-like (Vic87) and B/Yamagata/16/88-like (Yam88) (Figure 3). All influenza B viruses were found from humans, with no obvious geographical separation in either lineage. The genetic distance between Vic87 and Yam88 lineages was estimated to be 0.06.
Outliers were identified and removed before the estimation of substitution rate and tMRCA for each lineage (Table S1). The mean substitution rate and 95% HPD range for each lineage are summarized in Table 2. Our results demonstrated that the mean substitution rates estimated under random local clock (RLC) model were generally lower than the corresponding rates estimated under uncorrelated exponential relaxed clock (UCED) model (Table 2). In the following, we present the results based upon the RLC model, a new model that can reveal the rate heterogeneity among branches.
The Bayesian consensus tree for each lineage, along with posterior mean branch lengths scaled in real time, is depicted in Figure 4. To reflect the rate variation, we colored branches by their posterior mean relative rate of nucleotide substitution. Blue branches reflect a slow substitution rate, whereas red branches indicate rapid change. For H5N1, the mean substitution rate was estimated to be 3.06×10−3 subs/site/year (Table 2), with a low rate (1.5×10−3) found in earlier branches (blue) and a high rate (4.20×10−3) in later branches (red) (Figure 4-A). In contrast, N1 genes of North American swine viruses have a mean rate of 2.55×10−3, with a decrease in rates during evolution: a high rate (3.2×10−3) in earlier branches (red) and a low rate (0.9×10−3) in later branches (blue) (Figure 4-B). It is noted that human H1N1 viruses were found to evolve at two different rates in two circulation periods, with a low rate (1.3×10−3) during 1918–1957 (blue) and a high rate (2.9×10−3) after 1977 (red) (Figure 4-C).
The H9N2 lineage was found to have a mean substitution rate of 4.45×10−3 (Table 2), with a constant rate of 4.9×10−3 in the majority of branches (red) and a low rate (2.6×10−3) in a small number of branches (blue) (Figure 4-D). The substitution rates with the equine N7 lineage decreased from earlier branches (red) (3.4×10−3) to late branches (blue) (1.6×10−3) and averaged at 2.65×10−3 (Figure 4-E). The influenza B Yama88 viruses has a mean substitution rate of 2.3×10−3 (Table 2), with a consistent rate of 2.4×10−3 in the majority of branches (red) and a rate of 1.5×10−3 in a small number of branches (blue) (Figure 4-F). Different rate heterogeneity patterns were also found in other lineages (Data available from authors on request).
The time of most recent common ancestor (tMRCA) varies from lineage to lineage (Table 2). The tMRCA for human H1N1 (1C), which includes viruses causing the 1918 Spanish Flu, was dated to 1898 and the 95% HPD interval was between 1882 and 1909. The tMRCA of H5N1viruses (1A.1) was estimated to be at 1988 (95% HPD: 1984–1992), eight years before the outbreak of H5N1 avian virus in 1996 in Asia. For 1A.2 (Eurasian avian in N1), the tMRCA was estimated to be at 1927 (95% HPD: 1922–1931), with the earliest sampling time being 1934. For the pandemic H1N1 2009 (1A.3), it can be dated back to Nov 19, 2008 (95% HPD: June 7, 2008– Mar 16, 2009). The most recent common ancestor of the Eurasian (avian-like) swine (1A.4) can be dated back to 1978 (95% HPD: 1977–1979), one year earlier than the first detection of this lineage in 1979. For lineage 2B, the tMRCA was dated to 1956 (95% HPD: 1955–1957), one year before the occurrence of human H2N2 in 1957. The tMRCAs for other lineages are shown in Table 2 and the MCC trees are available from the authors upon request. The above results suggest that pandemic or epidemic viruses emerged several months or several years before their initial detection, indicating the crucial role for enhanced surveillance of newly emerging viruses.
Different selection pressures were revealed in different lineages as indicated by the ratio of non-synonymous (dN) to synonymous (dS) substitutions per site (dN/dS) (Table 3). Within influenza A, the highest dN/dS ratio was observed in 2B - human N2 lineage (0.313), which was slightly higher than that of 8B - equine N8 lineage (0.281), 1C - human N1 lineage (0.261), 1A.1 - H5N1 (0.274) and 2A.1- H9N2 (0.252), most likely reflecting host immune selection pressure, as a result of continuous circulation within the respective hosts and/or vaccination. The lineages under the most purifying selection were lineage 9C (0.068), 4B (0.062) and 5B (0.078). In comparison, the dN/dS ratios for influenza B lineages were comparable: 0.259 for Yam88 and 0.257 for Vic87.
Human lineages were found to have the largest numbers of positively selected sites, with 16 sites for the human N2 lineage (2B), 9 sites for human H1N1 lineage (1C), and 8 sites for Yam88 lineage (Table 3). In addition, H5N1 (1A.1) and H9N2 (2A.1), have 10 and 7 positively selected sites, respectively. No positive selection sites were detected in lineages 3C, 6B, 7A, 7C, 8B, and 9A–9C. Other lineages were found to have one to six sites under positive selection.
Protein structure analyses revealed all the positively selected sites were located at the surface of the NA protein and pertained to antibody binding and/or interactions with the sugar molecules of host cells (Figure 5, Figures S6, S7, S8, S9, S10, S11, S12, S13, S14, S15, S16). In addition, a number of positively selected sites reside in regions of the NA protein where neuraminidase inhibitors have been known to bind, indicating strong selection in influenza viruses with molecular markers predictive of antiviral resistance.
In the human H1N1 lineage (1C), amino acid positions 151, 222 and 344 were found to be under a strong positive selection, and the amino acids in these appear to interact with the NA inhibitor – zanamivir, a drug molecule according to the NA structure (Figure 5-A). In addition, positively selected sites 344 and 365 are located in the B-cell antigenic regions. The amino acid position 319 in human H1N1 lineage, identified to be under positive selection, forms a hydrogen bond with position 379, whose backbone carbonyl is involved in interactions with calcium ions (Figure 5-A). This Ca2+ ion interacts with positions 379, 389, 387, 382, and 381, forming H-bonds with position 385 and position 383. These interactions are crucial in protein folding to create the appropriate tertiary structure for sialic acid binding (which allows the NA to cleave the sialic acid) or for NA inhibitor binding.
With regard to another human lineage (2B), positions 126 and 127 were found to be within the binding pocket of influenza A virus (Figure 5-B). These two residues, along with residues 120 and 151 were found to be under positive selection. All these sites fold in close proximity to each other, providing a hydrogen-bond network that is essential for NA inhibitor binding. Specifically, position 151 forms a hydrogen bond to position 75, which itself is predicted to bind to zanamivir.
For human influenza B, positions 42, 65, 248, 345, 373, 389, 395, and 436 were found to be under positive selection (Table 3). The crystal structure of the B/Perth/211/2011 virus NA region with zanamivir, oseltamivir, or peramivir showed that residues 373 and 374 participated in drug binding, while residue 345 is involved in calcium binding and dimerization of two NA monomers (Figure 5-C, D).
The ML and Bayesian MCMC analyses revealed that the divergence of influenza A and B NA genes occurred earlier than the divergence of influenza A NA subtypes. Similar findings were reported for the hemagglutinin (HA) genes , in which influenza A and B HA genes were found to diverge first, followed by the division of influenza A HA subtypes. Interestingly, within influenza A, both subgroups (I and II) consist mainly of human, swine, avian, and equine viruses and show similar patterns of host-specific lineage composition (Figure 6). This strongly supports the hypothesis that subgroup I and II viruses experienced parallel evolution due to similar rates of genetic mutation and adaption to host environments , .
In this study, 23 NA lineages were determined within influenza A based upon both theoretical (e.g., phylogenetic tree topology) and empirical criteria (e.g., pandemic events). The majority of lineages were found to be specific in hosts, or geographical locations, with a genetic distance around 0.2, ranging from 0.117 to 0.349. These results are generally consistent with previous findings , , , but our study relies on a much larger dataset (focusing on the NA segment) and illustrates more detailed evolutionary dynamics of the influenza A NA lineages.
Classification and designation of the lineages and sublineages within the influenza A virus are essential for studies of viral evolution, ecology and epidemiology. However, how to accurately identify an evolutionary lineage of influenza A viruses is challenging. Whether the naming system will be accepted and used by influenza researchers is even more challenging. To trace the evolutionary change of highly pathogenic avian influenza (HPAI) viruses, a hierarchical nomenclature system for HPAI hemagglutinin clades and sub-clades has been implemented by the WHO/OIE/FAO H5N1 Evolution Working Group and widely adapted by the research community . The work presented here is one of the first steps toward the development of a nomenclature system for influenza A virus lineages (at the segment level) and genotypes (at the genome level). We will incorporate the findings of lineages and genotypes into our FluGenome database for the detection of newly emerging viral lineages and genome reassortment, which will improve influenza surveillance .
It is notable that substitution rates are not the same across all branches within a phylogenetic tree. The relaxed clock model was developed to cope with this issue. An average rate across all branches in the tree is estimated under relaxed clock model in BEAST with 95% HPDs summarized from average rates, which are estimated from the sampled trees . The 95% HPDs thus reflect the topological uncertainty among the sampled trees, but do not show the rate variation across different branches within a tree. In previous studies, the relaxed clock model was used to estimate the substitution rate and 95% HPDs and the resulting values were used for comparison , , , . In fact, such comparison is less accurate. For example, if a phylogenetic tree is mixed with branches of very high and low rates, it might result in an average rate that is similar to that from another tree with branches of a constant rate. We cannot simply conclude the two virus lineages evolve at the same rate. Using the random local clock (RLC), we not only computed the mean substitution rate and 95% HPDs but also estimated rate heterogeneity among different branches of a phylogenetic tree, which reflects evolutionary dynamics of influenza viruses with a given lineage.
This study demonstrated that human influenza viruses were shown to have little geographical restriction, indicating that human viruses were transmitted globally and probably rapidly as well . In addition, this study provides new insights into the emergence time of the human pandemic influenza virus because of the employment of the new random local clock model. This model takes into account the rate variation among different branches within lineage and has been considered superior to other models. For the H1N1 human lineage (1C), which includes the sequences from the 1918 Spanish Flu, the tMRCA under the random local clock model was estimated to be 1898 (95% HPD: 1882–1909), which is earlier than the years previously reported. Using the uncorrelated exponential relaxed clock model, the tMRCA of pandemic 1918 H1N1 viruses was estimated to be 1905–1918  and 1910–1915 .
Lineage 2B includes human influenza viruses isolated from two different subtypes, H2N2 between 1957 and 1968 and H3N2 after 1968, which share the same N2 gene maintained in human influenza virus after the antigenic shift from H2 to H3 occurred in 1968 . The tMRCA of lineage 2B was estimated to be 1956, which is six years later than the tMRCA of 1950 estimated by Smith et al. (2009a), but further supporting the hypothesis that emerging lineages circulate years prior to their initial detection in humans. Furthermore, the human lineages have relatively high non-synonymous to synonymous (dN/dS) ratios (1C: 0.263, 2B: 0.313, Yam88: 0.259, and Vic87: 0.257), suggesting strong immune selection in viruses persistently circulating in humans .
In addition to the above discussed human lineages, pandemic H1N1 2009 influenza viruses are believed to have arisen from a reassortment between North American and Eurasian swine lineages, and as expected, the pandemic H1N1 2009 viruses grouped with the Eurasian swine lineage , , . The substitution rate for NA genes of pandemic H1N1 2009, estimated using sequences from the entire pandemic period (as of March 2011), was found to be very close to the substitution rate estimated in the early outbreak period . In addition, the dN/dS ratio of the NA genes of pandemic H1N1 2009 was 0.226, which was higher than the ratio of the closely related swine NA genes (0.100) . This increase could be attributed either to the adaptations of pandemic H1N1 2009 viruses to humans or to intensive sampling (more frequent mutations detected) . Immune pressure from a previously infected and/or vaccinated population may also account for the differences observed.
Influenza viruses circulating in non-human species have evolved in association with their various hosts on different continents for extended periods of time . Avian influenza viruses were usually classified into Eurasian and North American lineages in the past, which was attributed to confinement of birds to distinct flyways in each hemisphere , .This phylo-geographic pattern is evident in the lineages designated for N4–N9 subtypes (Figure 6). We also found that the viruses isolated from Oceania were grouped together with Eurasian viruses. Therefore, we expanded our geographic designation and used Eurasian/Oceanian to define theses lineages. Avian lineages were found in all nine subtypes, whereas the mammalian lineages occurred in only four subtypes, providing support for the hypothesis that wild aquatic avian viruses are considered the natural reservoir of all influenza viruses (Figure 6) , . In addition, the avian lineages from N3 to N9 appeared to be under strong purifying selection pressures as suggested by the low dN/dS ratios. Similar observations were also described in a previous study .
The two subtype-specific avian sublineages, 1A.1 for H5N1 and 2A.1 for H9N2, are considered to have pandemic potential and were found to evolve relatively faster compared with other avian lineages from multiple subtypes (Table 2). In addition, these two sublineages were found to be relatively young (23 years for H9N2 and 17 years for H5N1), indicating a more recent emergence likely indicative of their adaptation from a wild bird reservoir to domestic poultry and their ensuing establishment in poultry populations. Together, the higher substitution rates and contemporary divergence of sublineages, 1A.1 (H5N1) and 2A.1 (H9N2) may be attributable to the rapid geographical dissemination of these viruses in wild birds and poultry, followed by the establishment of endemicity in poultry-dense regions and consistent transmission and outbreaks , . It will be interesting in future studies to determine if sampling biases (over-sampling in the case of H5N1 and H9N2) may also play a role in the higher substitution rates and recent divergence times observed compared to other avian influenza virus lineages. Nonetheless, it is important to continue systematic surveillance of these high-risk viruses in both birds and humans to better understand these processes.
Two lineages, H7N7 (7C) and H3N8 (8B), were revealed in equine influenza viruses. The H7N7 equine influenza viruses have not been detected since the late 1970s , whereas H3N8 was first isolated in 1963  and is still circulating in equine populations throughout most of the world. Lineage 8B (H3N8) is composed predominantly of equine viruses, but canine influenza viruses are also found in this lineage. This observation is consistent with the fact that equine influenza virus has crossed the species barrier and become established as a respiratory pathogen of dogs . The equine influenza viruses share ancestors with avian viruses in the same subtype, indicating their possible avian origin.
Two major swine virus groups, Eurasian (avian-like) swine (1A.4) and North American swine (1B), were found within N1 (Figure 6). Our observation of geographic separation of swine lineages agrees with previous findings . Our tMRCA analysis of the Eurasian (avian-like) swine lineage revealed that an avian virus crossed the species boundary from birds to pigs in 1978, which is seven years later than previously described  but just one year earlier than the first detection of these viruses in 1979. Similar dN/dS ratios were observed in both swine lineages, suggesting comparable selection pressures occurred in both lineages . Interestingly, amino acid positions 46, 53, and 453 in North American swine (1B), which were found to be under positive selection, are located in the T-cell antigenic regions, while position 339 lies in the B-cell antigenic region , indicating a strong immune selection occurred on these positions.
Complicated evolutionary dynamics were observed in lineage 2B. Within this major human lineage, five separate sub-clusters of swine viruses occurred in North America and Eurasia, suggesting that human-origin N2 genes were transmitted to swine in at least five separate instances (Figures 2-B and 6). Between the 1970s and 1980s, human-origin H3N2 influenza viruses circulated in Eurasia , . Reassortment events between human-origin H3N2 and avian-like H1N1 swine influenza virus resulted in the emergence of H3N2 viruses, with HA and NA from human viruses and all six internal genes originally from birds . These viruses eventually superseded the former human-origin H3N2 viruses in swine around 1984. In 1994, a further swine reassortant H1N2 virus was identified in the United Kingdom . Phylogenetic analyses revealed that the HA gene of this virus was derived from a human-like H1N1 virus, whereas the NA and internal genes were derived from the European swine reassortant H3N2 .
In addition to the complexity found in Eurasian swine N2 viruses, similarly, in North America in 1998 there were outbreaks of influenza observed in swine herds in Minnesota, Iowa, and Texas. The outbreaks were caused by a triple-reassortant H3N2 virus which contained genes from human (HA, NA, and PB1), swine (NS, NP, and M), and avian (PB2 and PA) influenza viruses . An additional important reassortment event in North American swine resulted in an H1N2 reassortant between classical H1N1 swine (contributing only HA) and H3N2 swine (contributing the other seven segments) . These reassortment events, coupled with interspecies transmission between swine and humans, have led to the complexity seen within lineage 2A.
In summary, we analyzed 14,328 influenza A and B NA sequences and studied the evolutionary history and phylodynamics of the influenza NA gene. The divergence of influenza NA into influenza A and B NA occurred first, and nine NA subtypes further diverged within influenza A, with two to three lineages identified within each NA subtype. The analyses of substitution rates, dN/dS ratio, selection sites and protein structures revealed important associations between mutations and antiviral drug resistance/vaccine escape. Further analyses of other influenza segments are needed in order to obtain a comprehensive understanding of influenza virus evolution, which will facilitate influenza surveillance and control.
A total of 14,328 neuraminidase (NA) nucleotide sequences longer than 1330 nts, excluding laboratory recombinant sequences, were downloaded from the Influenza Virus Resource at NCBI . Their host distributions are detailed in Table 4. A Perl script (http://sysbio.harvard.edu) was used to remove identical sequences, which resulted in 10,679 NA sequences, including 10,001 influenza A and 678 influenza B sequences. The influenza A NA sequences were divided into nine datasets (one for each subtype) that consist of 4146, 3754, 351, 85, 128, 488, 189, 684, and 176 sequences respectively for N1–N9.
Homologous gene recombination was identified using the 3SEQ algorithm under RDP3 . Ideally, all influenza sequences are analyzed in a single run. However, because of computational limitations to the program when a large data set is used, we examined our dataset for gene recombination within each influenza A subtype and within influenza B. Sequences with mosaic recombination signals were identified using a cutoff p-value 0.05 .
The SeqMat program was used to collapse similar sequences from the same location and the same year, which results in ~1500 representative sequences, respectively, for N1 and N2 . For other subtypes, we used all available sequences to detect recombination. Fourteen influenza A N1, 14 N2, two N3, one N4, five N6, three N8, and one influenza B NA sequences were identified to have mosaic recombination signals and thus were excluded from the analyses. No mosaic recombination signals were found in N5, N7 or N9.
Influenza A and B NA sequences are remotely related with around 40% nucleotide sequence similarity. We thus conducted both protein and nucleotide sequence alignments using Expresso - a program based upon protein structural information for alignment and TranslatorX - a program referring to the corresponding protein sequence alignment to align nucleotide sequences, respectively –. The resulting alignment between influenza A and B sequences was considered to be of good quality, which assured the reliability of the downstream analysis (File S2). MAFFT and MUSCLE were used to align sequences from each of the nine influenza A NA subtypes and the influenza B NA sequences, respectively , . The alignment results from MAFFT and MUSCLE were compared and adjusted accordingly.
Phylogenetic analysis was conducted using the Maximum-likelihood (ML) method in RAxML . RAxML uses rapid algorithms for bootstrap and maximum likelihood searches and is considered one of the fastest and most accurate phylogeny programs. Two hundred independent inferences starting from random MP trees were performed, and the tree with the highest likelihood score was selected as the representative. The GTRGAMMA model was employed to correct the biases of multiple substitution and rate heterogeneity in sequences. All the analyses were conducted on the supercomputing clusters at the Holland Computing Center (http://hcc.unl.edu/main/index.php). The trees were visualized and color-coded using FigTree (version 1.3.1) (http://tree.bio.ed.ac.uk/software/figtree/) to demonstrate tree topologies and corresponding hosts, subtypes and geographic locations.
Lineages were determined based upon the topology of phylogenetic trees and strong bootstrap support values (100 for influenza A and approximately 90 for influenza B). The genetics distances between lineages were calculated using the Kimura-2-Parameter (K2P) distance matric under MEGA 5.0 . Additional information such as the distribution of viruses in hosts and geographic regions were also considered in the classification. The aim was to identify the lineages of clearly related sequences, which might interest the virology-epidemiology community and could be used for further evolutionary dynamics analyses. The lineage and sublineage were named according to the following notations: a single digit is used to represent one of the nine influenza A NA subtypes and a letter is used to represent a lineage. A sublineage is also represented using a digit. A dot is used to separate a lineage and a sublineage. For example, 1A.2 means N1 subtype, lineage A, and sublineage 2. For influenza B, two lineages were assigned and named following the conventions well-accepted by the influenza research community. To make our lineage assignment scheme justifiable and extensible, we use alphabetic letters to represent lineages in the order of avian, swine, human, and equine for hosts and in the order of the North America followed by Eurasian/Oceanian in geography.
The substitution rate and the time of most recent common ancestor (tMRCA) were estimated for each lineage/sublineage using the Bayesian Markov Chain Monte Carlo (MCMC) method available in the BEAST package . Prior to the MCMC analysis, the linear regression and residual analyses for each lineage were performed using Path-O-Gen ; significant outliers identified were then removed. To reduce excessive computational load, we followed the common strategy that achieves computer tractability while preserving the accuracy of the estimates . We wrote a Java program to select around 120 sequences from each lineage or sublineage, which represent viruses sampled from different locations and at different time points. In all cases, the data were analyzed under the GTR (General Time Reversible) + nucleotide substitution model, as this model was consistently found to be the best by Modeltest .
Three clock models were compared statistically for each dataset using a Bayes factor test in the Tracer program : a strict clock, an uncorrelated lognormal relaxed clock (UCLD) and an uncorrelated exponential relaxed clock (UCED) . The UCED model was found to provide the best fit for all lineages. In addition, we used the newly developed random local clock model (RLC) that takes into account the rate variation among different branches within lineage by applying a series of local molecular clocks, each extending over a subregion of the overall phylogeny. All estimates also incorporated a different substitution rate for each codon position and a Bayesian skyline coalescent prior . For each dataset, two independent Bayesian MCMC runs were conducted for 30 million generations to achieve convergence, with uncertainty in parameter estimates reflected in the 95% highest probability density (HPD). The Maximum Clade Credibility (MCC) tree across all plausible trees was then computed from the BEAST trees using the TreeAnnotator program, with the first 10% of trees removed as burn-in.
The ratio of non-synonymous (dN) to synonymous (dS) substitutions per site (ratio dN/dS) were estimated using the single likelihood ancestor counting (SLAC) method available in the HYPHY package . Positively selected codons were detected using the single likelihood ancestor counting (SLAC), fixed effects likelihood (FEL) and internal fixed effects likelihood (IFEL) methods with a significance level of 0.05. In the SLAC method, the nucleotide and codon model parameter estimates are used to reconstruct the ancestral codon sequences at the internal nodes of the tree. The single most likely ancestral sequences are then fixed as known variables, and applied to infer the expected number of non-synonymous or synonymous substitutions that have occurred along each branch, for each codon position. The FEL method is based on maximum-likelihood estimates. The FEL method estimates the ratio of non-synonymous to synonymous substitutions on a site-by-site basis for the entire tree or only the interior branches (IFEL). In all cases, dN/dS estimates were based on Maximum-likelihood trees under the GTR + Г substitution model. Protein structures of template NAs used in structural analyses were downloaded from the Protein Data Bank (www.pdb.org). Positively selected sites were mapped on the structure of the protein using Molecular Operating Environment (MOE) .
Maximum-likelihood (ML) tree of influenza A N3 genes. Three lineages, denoted 3A, 3B and 3C, were classified. The bootstrap values supporting the corresponding lineages are shown on the major nodes. The scale bars indicate the numbers of nucleotide substitutions per site.
Maximum-likelihood (ML) tree of influenza A N4 genes. Two lineages, denoted 4A and 4B, were classified. The bootstrap values supporting the corresponding lineages are shown on the major nodes. The scale bars indicate the numbers of nucleotide substitutions per site.
Maximum-likelihood (ML) tree of influenza A N6 genes. Two lineages, denoted 6A and 6B, were classified. The bootstrap values supporting the corresponding lineages are shown on the major nodes. The scale bars indicate the number of nucleotide substitutions per site.
Maximum-likelihood (ML) tree of influenza A N7 genes. Three lineages, denoted 7A, 7B and 7C, were classified. The bootstrap values supporting the corresponding lineages are shown on the major nodes. The scale bars indicate the number of nucleotide substitutions per site.
Maximum-likelihood (ML) tree of influenza A N9 genes. Three lineages, denoted 9A, 9B and 9C, were classified. The bootstrap values supporting the corresponding lineages are shown on the major nodes. The scale bars indicate the number of nucleotide substitutions per site.
The structure of 1A.1 influenza neuraminidase, with positive selection sites denoted as green balls.
The structure of 1A.2 influenza neuraminidase, with positive selection sites denoted as green balls.
The structure of 1A.3 influenza neuraminidase, with positive selection sites denoted as green balls.
The structure of 1A.4 influenza neuraminidase, with positive selection sites denoted as green balls.
The structure of 1A.5 influenza neuraminidase, with positive selection sites denoted as green balls.
The structure of 1B influenza neuraminidase, with positive selection sites denoted as green balls.
The structure of 2A.1 influenza neuraminidase, with positive selection sites denoted as green balls.
The structure of 2A.3 influenza neuraminidase, with positive selection sites denoted as green balls.
The structure of 5A influenza neuraminidase, with positive selection sites denoted as green balls.
The structure of 6A influenza neuraminidase, with positive selection sites denoted as green balls.
The structure of 8A influenza neuraminidase, with positive selection sites denoted as green balls.
The number of sequences of each lineage and the number of outliers identified by residual analysis.
The phylogenetic tree of influenza A and B neuraminidase sequences.
The alignment of influenza A and B neuraminidase sequences. The quality of the alignment is indicated by different colors.
We are grateful to the Holland Computing Center (HCC) at the University of Nebraska-Lincoln (UNL) for the computing support. We specially thank our UNL colleagues: David Swanson, Ashu Guru, and Jun Wang and our UNO students and colleagues: Pavan Attaluri, Santosh Servisetti, Thaine Rowley and Mohammad Shafiullah for their help. We particularly thank the Academic Editor: Dr Dong-Yan Jin and four anonymous reviewers for their helpful comments and constructive suggestions.
Competing Interests: The authors have declared that no competing interests exist.
Funding: This publication was made possible by National Institutes of Health grant numbers R01 LM009985-01A1. The authors also acknowledge continuous funding support from The University Committee on Research and Creative Activity (UCRCA), University of Nebraska at Omaha, for this research program. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.