According to our analysis, the major migratory routes of influenza virus pass through the United States, eastern Asia, and Australia/New Zealand. Europe, despite its population density and consistency of wintertime influenza epidemics, was slightly less connected to other parts of the world when compared with the United States. These results are consistent with those of previous studies that showed eastern Asia (2) and tropical Asia (1) as key influenza source populations and the United States as a major contributing region (3). The new sequence data in this analysis support strong migratory connections between Vietnam and neighboring countries, the United States, and Europe. Our regional phylogenetic analysis supports a strong connection between Vietnam and Australia/New Zealand, but the global analysis reveals that Australia/New Zealand sequences are more closely related to sequences from Asian countries other than Vietnam. In addition, the inferred phylogenies provide evidence of virus persistence in Vietnam for >1 year. This is a major finding because strong migratory links and persistence are the 2 key features for a proposed source region for influenza transmission; long-term persistence in tropical regions may be associated with more antigenic evolution and immune escape if it can be shown that longer persistence gives the virus population more time to accumulate and fix antigenic changes (2).
In general, persistence analyses are difficult even with regular sequence sampling and weekly virologic confirmations. When attempting to assess the likelihood of influenza persistence in a focal region (e.g., Vietnam), we must sample outside the focal region to determine whether local viruses have been reintroduced from elsewhere. However, the more sampling in the nonfocal region, the more likely it becomes that we sample nonfocal viruses similar to focal viruses and that more diversity is detected in the nonfocal region, making it seem basal (closer to the root) to the focal region. There are no clear criteria for whether we have undersampled or oversampled the focal or the nonfocal region; thus, it is extremely difficult to state with certainty that an apparently local lineage has persisted in the same location. For the 2007–2008 Vietnam influenza sequences, viruses were sampled for most of this period and coalescence times were generally short, indicating that most of these viruses have a relatively recent ancestor in Vietnam. These data are consistent with and provide evidence for lineage persistence in Vietnam during this time. However, we know of no unbiased test that can reject the possibility of virus introduction. The perfect dataset for demonstrating lineage persistence would seem to be 52 viruses sampled in 52 weeks, with consecutive viruses differing at 0 or 1 nt positions.
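The idealized dataset described above (one virus per week, with consecutive viruses differing at 0 or 1 nt positions) can be expressed as a simple check. The following is a minimal sketch, not the method used in this study; the function names, the `(week, sequence)` input format, and the assumption that sequences are already aligned to equal length are all hypothetical choices made for illustration.

```python
from typing import List, Tuple


def hamming(a: str, b: str) -> int:
    """Count nucleotide positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def matches_ideal_persistence(
    samples: List[Tuple[int, str]], max_diff: int = 1
) -> bool:
    """Return True if samples cover consecutive weeks without gaps and each
    pair of consecutive viruses differs at no more than max_diff positions.

    samples: hypothetical (week_number, aligned_sequence) pairs, one per week.
    """
    ordered = sorted(samples)  # sort by week number
    for (w1, s1), (w2, s2) in zip(ordered, ordered[1:]):
        if w2 - w1 != 1:  # a sampling gap breaks the weekly chain
            return False
        if hamming(s1, s2) > max_diff:  # too divergent for direct descent
            return False
    return True
```

Passing this check would only show that a dataset has the shape of the idealized case; as noted above, it would not by itself reject the possibility of virus introduction from an unsampled location.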
A major limitation of all migration analyses performed with sequence data is geographic sampling bias: undersampling and oversampling. The more sequences that are available for a given location, the more likely it is that 1 of these sequences will be a recent immigrant, identifiable by the presence of similar sequences from other locations. To overcome this bias, subsampling is typically conducted (3) to ensure that the same number of sequences is used from each region. When too few sequences are available from a particular location, fewer migratory links can be inferred for that location. This second bias cannot be corrected with a subsampling strategy.
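The subsampling step can be sketched as follows. This is an illustrative sketch under stated assumptions, not the procedure used in the cited work: the function name `subsample_by_region`, the `(sequence_id, region)` input format, and the fixed random seed are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def subsample_by_region(
    records: List[Tuple[str, str]], per_region: int, seed: int = 0
) -> Dict[str, List[str]]:
    """Randomly draw up to per_region sequence IDs from each region.

    records: hypothetical (sequence_id, region) pairs.
    Regions with fewer than per_region sequences are kept whole, so they
    remain undersampled relative to the others.
    """
    by_region: Dict[str, List[str]] = defaultdict(list)
    for seq_id, region in records:
        by_region[region].append(seq_id)

    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return {
        region: rng.sample(ids, min(per_region, len(ids)))
        for region, ids in by_region.items()
    }
```

The `min(per_region, len(ids))` cap makes the second bias visible: downsampling can equalize heavily sampled regions, but it cannot add sequences to a region that has too few.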
Our analysis of the global subsampled dataset showed that sample counts and strength of migratory connections were highly correlated. It has so far been impossible to determine the causal direction in this correlation. A migration signal can be weak because of a dearth of samples. Conversely, the small number of samples can be the result of low influenza activity and a corresponding weak migratory connection with other regions. The directionality of causation cannot be determined from sequence data alone. A sequence sampling strategy must be devised in the context of an influenza surveillance system, and the epidemiologic data and sequence data must be analyzed jointly. Disease prevalence and sequence data should be directly linked to provide a denominator to help determine whether undersampling or oversampling is truly occurring, which would allow for correction of sampling numbers across regions.
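One way to picture the proposed denominator is as the number of sequences per estimated case in each region. The sketch below is only an illustration of that idea; the function names, the use of the cross-region median as a reference point, and the input dictionaries are assumptions introduced here and are not part of this study.

```python
from statistics import median
from typing import Dict


def sampling_intensity(
    seq_counts: Dict[str, int], case_counts: Dict[str, float]
) -> Dict[str, float]:
    """Sequences per estimated case, for regions with a positive case count."""
    return {
        region: seq_counts[region] / case_counts[region]
        for region in seq_counts
        if case_counts.get(region, 0) > 0
    }


def relative_intensity(intensity: Dict[str, float]) -> Dict[str, float]:
    """Each region's intensity divided by the median across regions.

    Values well above 1 suggest relative oversampling and values well
    below 1 suggest relative undersampling (a hypothetical convention).
    """
    if not intensity:
        raise ValueError("no regions with usable case counts")
    reference = median(intensity.values())
    if reference == 0:
        raise ValueError("median sampling intensity is zero")
    return {region: value / reference for region, value in intensity.items()}
```

Ratios of this kind depend entirely on the quality of the case estimates, which is why the surveillance data and the sequence data would need to be collected and analyzed together.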
A counterpoint to this seemingly obvious point about oversampling is that oversampling in influenza sequence data occurs with a high degree of pseudoreplication. Influenza sequence sampling in most scientific studies and public health contexts is conducted in such a way that each additional sequence sample is not an independent observation but, rather, is an observation with a high degree of correlation to recently collected samples (33). These pseudoreplicated samples should not, in principle, introduce additional artificial migration events into the analysis because the dependency structure of the samples is entirely accounted for in the phylogeny. Nevertheless, a correlation between sample number and migration strength persists in the data, at least partly because a larger number of samples increases the probability that a distant, recently introduced lineage is sampled.
New approaches are needed to fully account for all spatial, evolutionary, and epidemiologic dependencies in phylogeographic analyses. For recent phylogeographic studies, Bayesian approaches have been the method of choice (1), primarily because of their ability to account for uncertainty in evolutionary, demographic, and migratory parameters, but especially because of their ability to incorporate topological uncertainty into phylogenetic analyses. If these methods can be further developed to incorporate representativeness uncertainty (essentially, a prior distribution on the size of the sampling pool to account for the fact that some parts of the phylogeny will be oversampled while others will be undersampled), then this type of Bayesian analysis could serve as a powerful auxiliary tool in phylogeography, enabling us to determine whether sampling bias has a larger effect in some regions than in others. Another role for Bayesian analysis of influenza sequences will be the application of Bayesian phylogeographic methods to whole-genome sequence data (1). For highly reassortant datasets, the presence of independent migration signals in 8 phylogenies (1 for each of the 8 RNA segments of the influenza virus) should act to reduce uncertainty for the inferred migration parameters.
We intended to elucidate the migratory pathways of influenza into and out of Vietnam and the likelihood of virus persistence in Vietnam. For each of these objectives, we recommend that future studies link phylogenetic analysis with prevalence data, allowing for correction of known biases and providing crucial complementary epidemiologic evidence for migration and persistence. If the source–sink framework is an oversimplification of global influenza circulation (3), Vietnam probably plays both roles on different occasions, given its close connections to other countries in Asia, Europe, and the United States.