We have described an approach wherein the effects of sequence alterations in influenza virus proteins on virus phenotypic characteristics can be analyzed at a fine level of granularity by defining combinations of specific amino acid residues that function as structural or functional units, called sequence features (SFs). Recurrent sequence variations, termed variant types (VTs), occurring within the defined SF region are computed by aligning each SF from a chosen reference strain to all other related sequences in the IRD resource (www.fludb.org). The SFVT module is freely available online to the influenza virus research community through IRD, which provides access to the complete list of SFs, SFVT alignments, and metadata associated with the VT sequences, including host, country and year of isolation, and virus subtype.
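The VT computation described above can be illustrated with a minimal sketch. It assumes the sequences are already aligned to the reference strain, that an SF is given as a list of 1-based alignment columns, and that VT-1 denotes the reference variant with the remaining VTs numbered by decreasing frequency. The function names, the numbering convention, and the toy sequences are hypothetical and are not taken from the IRD implementation.

```python
from collections import Counter


def extract_sf(aligned_seq, positions):
    """Return the residues found at the given 1-based alignment columns."""
    return "".join(aligned_seq[p - 1] for p in positions)


def assign_variant_types(reference, aligned_seqs, positions):
    """Group aligned sequences by their SF residues and label each group as a VT.

    Assumed convention for this sketch: VT-1 is the reference variant, and the
    other variants are numbered by decreasing frequency (ties broken
    alphabetically).
    """
    ref_variant = extract_sf(reference, positions)
    counts = Counter(extract_sf(seq, positions) for seq in aligned_seqs)
    others = sorted(
        (v for v in counts if v != ref_variant),
        key=lambda v: (-counts[v], v),
    )
    ordered = [ref_variant] + others
    return {
        variant: (f"VT-{i}", counts.get(variant, 0))
        for i, variant in enumerate(ordered, start=1)
    }


# Toy data (not real NS1 sequences): an SF made of alignment columns 2, 3, and 5.
reference = "MDSNTVSSFQ"
sf_positions = [2, 3, 5]
sequences = [
    "MDSNTVSSFQ",
    "MDSNTVSSFQ",
    "MESNTVSSFQ",
    "MDSNAVSSFQ",
    "MESNTVSSFQ",
]
vts = assign_variant_types(reference, sequences, sf_positions)
for variant, (label, count) in vts.items():
    print(label, variant, count)
```

In practice each VT record would also carry the strain metadata (host, country, year, subtype), so that VT distributions can be tabulated against those attributes.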
By compiling the set of all currently characterized SFs for influenza virus proteins, we have created a valuable reference resource covering all characterized influenza virus protein regions. Each defined SF is associated with links to relevant publications, protein annotations, and protein structure records through PubMed, UniProt, PDB, and IEDB. The SFVT system is designed to support the addition of new SFs as they become available without altering the existing list. To make the SFVT component as comprehensive and up-to-date as possible, we are now creating a community-based annotation web interface to allow external researchers and IRD curators to submit new SFs to the system. The user interface for data capture will contain some required fields (e.g., virus name, SF positions, category and definition, and submitter's name and affiliation) that the submitter will need to provide, while other fields (e.g., SF identity [ID] and length) will be populated automatically by the IRD system based on the primary data entered into the required fields. SFs submitted using this method will be internally validated for completeness, accuracy, and nonredundancy and subsequently reviewed by influenza virus experts before public release. In addition, IRD will automatically add new SFs from the UniProt and IEDB resources using custom parsing scripts.
To demonstrate the utility of this approach in studying genetic determinants of virus phenotypes, we performed computational and statistical analyses on the VTs of Influenza virus A_NS1_SF18 of the NS1 protein for their potential correlation with host range restriction. Even after controlling for data collection biases, highly significant P values and dramatically skewed distributions of VTs were observed across different host groups, suggesting with high statistical confidence that sequence variations in Influenza virus A_NS1_SF18 impact host range restriction.
For a virus to spread, it should have both the opportunity and capability to infect a given host. A virus infection within an isolated community, for example, might result in all individuals carrying a particular substitution carried by the founder virus; however, those living outside of that specific community may lack the substitution because of a lack of opportunity for the founder virus or its progeny to infect them. We checked to see if this kind of founder effect could explain the associations observed in our data set but found just the opposite to be true in many cases. For example, viruses carrying VT-8, which were found to infect predominantly horses and dogs, had ample opportunity to spread based on their worldwide occurrence over more than 45 years, and yet they continued to remain within their preferred host species. In fact, the only evidence of cross-species virus spreading from horses to another host group was at a racetrack in Miami, Florida, where dogs and horses raced at the same facility (2). Although most viruses in the VT-8 group also belong to the H3N8 subtype, it is clear from the sequence records that H3N8 viruses circulate effectively in birds and other hosts. Thus, the restriction of certain influenza viruses to equine and canine species appears to be dictated, at least in part, by sequence variations in the NS1 protein. Similar arguments can be made for the associations between the VT-4 lineage and avian host restriction and the VT-9 and VT-16 lineages and human host restriction.
It should be noted that for statistical inference, the data set used here cannot be considered as a random sample from the entire population of influenza virus-infected hosts, since the data records are from diverse sources and are based on free response data collection schemes and therefore may not be independent or random. One consequence of this observation is that the P values from chi-square analysis may be biased. Despite the fact that we cannot prove the independence and randomness of the data, we know that the records are at least from different regions of the world and were collected at different time points throughout several decades. Indeed, while our approach to control for geographic and temporal biases resulted in changes in the chi-square statistical values, the extreme skewing in VT-host distributions remained significant.
The geographic bias in the data could be attributed at least partially to differences in the population density of host species. To control for this contribution to geographic bias, we repeated the chi-square analysis, additionally adjusting for virus prevalence per capita in human populations across different regions of the world (see Table S1 in the supplemental material). Once again, while the absolute values of the chi-square statistics and P values changed, the extreme skewing in VT-host distributions remained significant. However, it should be noted that this adjusts only for the effects of population density of the human host; population density information for the other influenza virus host species is not readily available, so a similar adjustment could not be performed for them.
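For readers unfamiliar with the test, the chi-square statistic for a VT-by-host contingency table can be computed as in the following minimal sketch. The function name and the toy counts are hypothetical and do not reproduce the data or the bias adjustments used in this study; the sketch also assumes that every row and column has a nonzero total.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c count table.

    Rows are taken to be VTs and columns host groups (an assumption of this
    sketch). Every row and column total must be nonzero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of VT and host group.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


# Toy 2 x 2 table: two VTs (rows) against two host groups (columns).
toy_table = [
    [90, 10],
    [10, 90],
]
stat, dof = chi_square_statistic(toy_table)
print(stat, dof)
```

The P value is then obtained from the upper tail of the chi-square distribution with the returned degrees of freedom, for example from a statistical package or a printed table.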
As an alternative to the chi-square analysis, we also applied an association rule data-mining method to investigate the relationship between VTs and host groups. One advantage of this method is that it does not require independent data or random sampling to infer results. For this data-mining process, we separately considered two rule directions, namely, VT-to-host-type and host-type-to-VT relationships, and therefore for each direction we had 96 (16 × 6 = 96) possible rules. These rules were then assessed using two common evaluation criteria: support and confidence. Using 0.5 as a confidence cutoff, we ended up with two significant rules from host type to variant type (avian to VT-1 and equine to VT-8) and 11 significant rules from variant type to host type (VT-1, VT-4, VT-12, VT-13, and VT-14 to avian; VT-8 to equine; and VT-2, VT-3, VT-5, VT-10, and VT-16 to human), providing further support for the role of NS1 sequence variations in host range restriction (see Table S2 in the supplemental material).
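The two evaluation criteria can be made concrete with a minimal sketch. Here the support of a rule is the fraction of all records that contain both the VT and the host type, and the confidence is the fraction of records matching the left-hand side that also match the right-hand side. The function name, the toy records, and the choice to retain rules whose confidence is greater than or equal to the cutoff are assumptions of this sketch, not details taken from the analysis reported here.

```python
from collections import Counter


def mine_rules(records, min_confidence=0.5):
    """Score VT -> host and host -> VT rules by support and confidence.

    records: list of (vt, host) pairs, one per sequence record.
    Returns {(lhs, rhs): (support, confidence)} for rules at or above the cutoff.
    """
    n = len(records)
    pair_counts = Counter(records)
    vt_counts = Counter(vt for vt, _ in records)
    host_counts = Counter(host for _, host in records)

    rules = {}
    for (vt, host), count in pair_counts.items():
        support = count / n
        # Evaluate both rule directions for this VT/host pair.
        for lhs, rhs, lhs_count in ((vt, host, vt_counts[vt]),
                                    (host, vt, host_counts[host])):
            confidence = count / lhs_count
            if confidence >= min_confidence:
                rules[(lhs, rhs)] = (support, confidence)
    return rules


# Toy records (hypothetical counts, not the study data).
toy_records = (
    [("VT-8", "equine")] * 9
    + [("VT-8", "canine")] * 1
    + [("VT-1", "avian")] * 6
    + [("VT-1", "human")] * 4
)
toy_rules = mine_rules(toy_records)
for (lhs, rhs), (support, confidence) in sorted(toy_rules.items()):
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

Note that a rule can have high confidence but very low support (for example, a host group represented by a single record), which is why both criteria are usually inspected together.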
To determine if sequence variations in Influenza virus A_NS1_SF18 are independent predictors of virus host range, we performed additional analysis of variance (ANOVA) tests to examine the main effects of host type, VT, HA subtype, and NA subtype and the two-way interactions of host type by VT, host type by HA, host type by NA, VT by HA, and VT by NA, using the number of records as the response variable (see Table S3 in the supplemental material). In the ANOVA output, the overall model is significant, with a P value for the F statistic of <0.0001, indicating that the model is valid. Since the interaction of host type and NA is significant (P value of 0.0037), we know that the mean response differs for at least one pair of host type-NA combinations. Hence, there is no need to look at their main effects, because any apparent significance of the main effects could be driven by the interaction. This finding agrees with the prior knowledge that certain host types are associated with certain virus NA subtypes. We also find that the P value for the interaction of VT and host type is less than 0.0001, thus supporting our conclusion that VT is highly associated with host type. However, neither the interaction of VT and HA nor that of VT and NA is significant, implying that there is insufficient evidence of an association between VT and either NA or HA subtype. Therefore, virus subtype is not a confounding variable in our NS1 VT analysis, even though the NA subtype is also associated with host type. Thus, the association between host type and VT is verified, and NS1 sequence variation can be considered an independent factor dictating host range restriction.
Several previous studies have provided evidence of a role for NS1 sequence variation in differential replication and pathogenicity in different host species, but most of the studies were focused on a whole-segment analysis strategy. Reassortant viruses carrying aberrant NS segments from the cold-adapted CR43 clone 3 virus were defective for replication in Madin-Darby canine kidney cells and ferrets (12). Reassortant A/Udorn/72 viruses carrying the B allele of NS1 from avian viruses were found to be attenuated for replication in the respiratory tract of squirrel monkeys in comparison with the same virus carrying the A allele of NS1 (20). The NS1 protein from the A/Goose/Guangdong/1/96 (H5N1) strain increased the replication efficiency of the A/FPV/Rostock/34 (H7N1) strain in human and mouse cell lines (11). These studies and others provide clear evidence of a role for NS1 in host range restriction at the segment level. More recently, the C-terminal ESEV/RSKV motif, conserved in avian and human viruses, respectively, was shown to affect replication in a species-specific manner in cell culture and animals (17). While all the statistical tests performed in this study provide strong evidence for the role of Influenza virus A_NS1_SF18 in contributing to host specificity, we cannot definitively conclude that the cellular nuclear export machinery is solely responsible for host restriction, as we have not yet completed an in-depth analysis of all other NS1 SFs for correlation with this phenotype. However, the SFVT strategy should allow us to rapidly identify the most likely candidates. Ultimately, experimental validations will be required to elucidate the connections between VT-mediated host range specificity and the relevant cellular/biochemical processes.