PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of ploscompComputational BiologyView this ArticleSubmit to PLoSGet E-mail AlertsContact UsPublic Library of Science (PLoS)
 
PLoS Comput Biol. 2010 October; 6(10): e1000949.
Published online 2010 October 7. doi:  10.1371/journal.pcbi.1000949
PMCID: PMC2951339

A Computational Framework for Influenza Antigenic Cartography

Christophe Fraser, Editor

Abstract

Influenza viruses have been responsible for large losses of lives around the world and continue to present a great public health challenge. Antigenic characterization based on hemagglutination inhibition (HI) assay is one of the routine procedures for influenza vaccine strain selection. However, HI assay is only a crude experiment reflecting the antigenic correlations among testing antigens (viruses) and reference antisera (antibodies). Moreover, antigenic characterization is usually based on more than one HI dataset. The combination of multiple datasets results in an incomplete HI matrix with many unobserved entries. This paper proposes a new computational framework for constructing an influenza antigenic cartography from this incomplete matrix, which we refer to as Matrix Completion-Multidimensional Scaling (MC-MDS). In this approach, we first reconstruct the HI matrices with viruses and antibodies using low-rank matrix completion, and then generate the two-dimensional antigenic cartography using multidimensional scaling. Moreover, for influenza HI tables with herd immunity effect (such as those from Human influenza viruses), we propose a temporal model to reduce the inherent temporal bias of HI tables caused by herd immunity. By applying our method in HI datasets containing H3N2 influenza A viruses isolated from 1968 to 2003, we identified eleven clusters of antigenic variants, representing all major antigenic drift events in these 36 years. Our results showed that both the completed HI matrix and the antigenic cartography obtained via MC-MDS are useful in identifying influenza antigenic variants and thus can be used to facilitate influenza vaccine strain selection. The webserver is available at http://sysbio.cvm.msstate.edu/AntigenMap.

Author Summary

Influenza antigenic cartography is an analogy of geographic cartography, and it projects influenza antigens into a two- or three-dimensional map through which we can visualize and measure the antigenic distances between influenza antigens as we visualize and measure geographic distances between the cities in a geographic cartography. Thus, influenza antigenic cartography can be utilized to identify influenza antigenic variants, and it is useful for influenza vaccine strain selection. Here we develop a new computational framework for constructing influenza antigenic cartography based on hemagglutination inhibition assay, a routine antigenic characterization method in influenza surveillance and vaccine strain selection. This method can be used for antigenic characterization in vaccine strain selection for both seasonal influenza and pandemic influenza.

Introduction

An influenza virus is a negative-stranded RNA virus that belongs to the Orthomyxoviridae family. There are three serotypes, A, B, and C, of which B and C are reported to infect mammals only. The influenza A viruses have An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e001.jpg genomic segments (segment An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e002.jpg) with varying lengths from about An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e003.jpg to An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e004.jpg nucleotides which encode at least An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e005.jpg proteins: PB2 by segment An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e006.jpg, PB1 and PB1-F2 by An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e007.jpg, PA by An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e008.jpg, haemagglutinin (HA) by An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e009.jpg, nucleoprotein (NP) by An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e010.jpg, neuraminidase (NA) by An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e011.jpg, matrix protein M1 and M2 by An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e012.jpg, and nonstructural protein NS1 and NS2 by An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e013.jpg. Among these proteins, the surface proteins HA and NA are involved in virus attachment and cell fusion. Both HA and NA are the primary targets for host immune systems. The serotypes of influenza A viruses are based on HA and NA subtypes. To date, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e014.jpg HA and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e015.jpg NA subtypes have been reported in influenza A viruses [1]. For instance, H1N1 influenza A virus is named since it has HA and NA recognized by HA subtype An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e016.jpg and NA subtype An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e017.jpg antibodies, respectively. Influenza B viruses have An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e018.jpg segments while Influenza C has An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e019.jpg segments. There is not yet an HA-NA nomenclature system in Influenza B and C viruses.

The peak influenza season in the northern hemisphere is from January to April every year. More than An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e020.jpg hospitalizations and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e021.jpg deaths are caused by influenza in the United States each year [2], [3]. The influenza A virus may cause a pandemic disaster that will impact multiple continents. In the 20th century, three influenza A pandemics occurred in 1918, 1957, and 1968, respectively [4], [5]. More than An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e022.jpg million people were killed in the 1918 influenza pandemic, which was caused by the H1N1 influenza A virus. This influenza pandemic shortened global life expectancy by more than An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e023.jpg years. During March and early April An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e024.jpg, a new H1N1 influenza A virus epidemic was detected in Mexico and the United States [6], and the virus spread rapidly through human-to-human transmission, resulting in WHO declaring a pandemic, which was the first influenza pandemic in the past An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e025.jpg years. This virus was estimated to cause about An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e026.jpg million infections and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e027.jpg deaths solely in United States through Jan 14, 2010 (www.cdc.gov). If we consider all cases in five continents, the numbers will become significantly larger.

In the United States, vaccination is the primary option for reducing the effects of influenza. The seasonal influenza vaccines used in the past decades include three viral components: H1N1 influenza A virus, H3N2 influenza A virus, and influenza B virus. In an effective vaccination program, vaccine strain selection will be the most important step since the highest protection could be achieved only if there is an identical antigenic match of the vaccine and epidemic virus HA and NA antigens, especially HA, which is the primary target of human immune system. However, as an RNA virus, influenza A virus has rapid mutations in these two proteins, and such mutations can cause a change of antigenicity, thus making vaccines ineffective. Mutations in HA and NA are also referred as antigenic drift.

Immunological tests, such as hemagglutination inhibition (HI) assay, enzyme-linked immunosorbent assay (ELISA), and microneutralization assay, have been utilized to identify antigenic variants among the circulating influenza strains. Among these assays, HI, has been one of the routine procedures in influenza vaccine strain selection. HI assay is an experiment to measure how a testing influenza antigen (virus) and a reference antiserum (antibody) react. The antibody is usually diluted in An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e028.jpg fold first and then diluted in powers of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e029.jpg. Thus, the titre from HI assay will be An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e030.jpg, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e031.jpg. The larger the An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e032.jpg is, the more closely the testing antigens match the reference antigens, for which the reference antisera are generated. Usually a number smaller than An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e033.jpg is considered as a low reaction between antigen and antibody. In many cases, HI experiments are used to measure the antigenic distance between two testing antigens through their immunological reactions to the same reference antiserum. For instance, if one testing antigen is a high reactor for the reference antiserum (e.g. with a titre of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e034.jpg) while another testing antigen is a low reactor (e.g. with a titre of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e035.jpg). The antigenic distance could be approximately An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e036.jpg units, which is An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e037.jpg. In reality, the antigenic distances are usually measured by a set of reference antisera, thus the calculation is much more complicated. Such measurements from HI data are generally used to determine the antigenic distances between testing antigens.

In a typical influenza HI assay, generally less than An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e038.jpg reference antisera are used but the number of test antigens can be more than An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e039.jpg. However, interpretations of HI results are not straightforward due to the following two challenges: (1) HI assay only shows the indirect relationship between antigens and antisera since each value reflects a reaction from antigen, red blood cell (RBC), and antibody. Many variables from RBC and antibody will interfere the HI titres; (2) it is not be possible to perform HI for all pairs of antigen and antisera reactions. Thus, the resulting HI table is generally incomplete, and the percentage of missing data could be up to An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e040.jpg. By applying the metric multidimensional scaling method (MDS) to reduce the shape space into less than three dimensions, Lapedes and Farber [7] showed a linear correlation between logarithm values of HI titers and the space distances between influenza antigens. Based on this method, Smith et al. [8] constructed influenza cartography to visualize the distances among influenza antigens from HI tables by further developing the metric MDS method. Their method assumes that antigens and antibodies are mapped into the same low-dimensional space, and their interactions are the distances between the embedded points. However, in our implementation of their algorithm, the resulting influenza cartography depends on the initial values selected, and thus may not be stable. Moreover, this method results in cartographies in which global distances may contain relatively large errors. This is because the algorithm does not incorporate temporal modeling to reduce the inherent temporal bias in HI tables. The temporal bias is caused by the fact that HI table entries are not missing uniformly at random, and off diagonal entries are more likely to be missing or become low reactors (Figure 1). The underlying biological reason for this bias can be explained by the herd immunity effect, where influenza antigens evolve rapidly under the accumulating immune pressures of human population [9]. A more detailed illustration of this phenomenon will be given later.

Figure 1
The hemagglutination inhibition (HI) data in temporal order.

The goal of this paper is to present a computational framework for influenza cartography construction which we call Matrix Completion-Multidimensional Scaling (MC-MDS). An important aspect of this framework is that temporal modeling can be easily incorporated, which as we shall show, is useful for dealing with HI tables with herd immunity induced temporal bias. Our framework includes two integrated steps: (1) a low rank matrix completion algorithm is first employed to fill in the entries of the HI matrix; (2) a MDS algorithm is utilized to map the antigens (or similarly, antibodies) into a two dimensional space for visualization. Our approach explicitly separates the visualization (cartography) step from the matrix completion step, making it easier to incorporate temporal models. Our experience shows that while temporal modeling is beneficial in both steps, it is less important in the first step, for which we may simply employ a sliding window approach; however it is more essential in the second step, for which we propose a more complex herd-immunity temporal regularization model as described in the Materials and Methods section. The reason for the difference is that the inherent temporal bias tends to give rise to incorrect global distances if not handled explicitly, and thus affect the 2D cartography process more significantly. The two step procedure in our approach is thus flexible in the first step, where we can simply use a standard low rank matrix completion algorithm. On the other hand, we have to pay special attention to temporal modeling in the second step, which is essential for accurate cartography construction. Both simulation and a practical application in H3N2 influenza A viruses demonstrate that this method is able to overcome some limitations in the original metric MDS method of [8] and it results in better influenza antigenic cartographies from HI data. Therefore the proposed framework can potentially facilitate more accurate interpretation of HI data in influenza surveillance as well as more accurate identification of influenza antigenic variants. Both are essential for influenza vaccine strain selection.

Results

While greater details are given in the Materials and Methods section, we shall summarize the most important observations and intuitions in our computational framework before presenting the actual experimental results.

Characteristics of HI data

In this work we are specifically interested in HI datasets existing accumulating original, such as the immunological datasets of human origin. In a typical HI dataset, three types of data entries are present: Type I, a regular HI titre; Type II (low reactors), the value is defined as “less than a threshold”, e.g. An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e041.jpg and this threshold is caused by the lower bound experimental limit in HI assays indicating a weak (or low) immunological reaction between a testing antigen (virus) and an antiserum (antibody); Type III, missing values. A major characteristic of HI dataset is that the distributions of type I, type II, and type III data are not random. Specifically, if we arrange both antigens and antibodies in a HI matrix according to time, then there is a banded structure, where most Type I data appear very close to the diagonal of the matrix; Type II data tend to be slightly off diagonal, while Type III data are more likely to occur in matrix entries that are significantly off diagonal (Figure 1). This data characteristic introduces a “temporal bias” concerning the data distribution (in comparison to uniformly random distribution) that needs to be corrected. As we will show, if the problem is not handled appropriately, then inaccurate result will be produced. This is because classical methods assume uniformly random data distribution, which does not take the temporal bias effect into consideration. Our paper shows that temporal modeling, which reduces the data distribution bias in HI tables, is important in HI based influenza cartography.

The specific benchmark dataset used in our study includes An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e042.jpg entries, representing An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e043.jpg of all table entries (Figure 2). Among these entries, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e044.jpg (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e045.jpg) are Type II values (that is, they are recorded as An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e046.jpg) with An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e047.jpg. For algorithmic comparison purposes, we also include results on a simulation dataset with ground truth, which is generated according to characteristics of real HI datasets.

Figure 2
Data distribution in the H3N2 HI dataset.

Sliding-window procedure for matrix completion

As pointed out above, most Type I data are located across the diagonal line of the HI matrix, which significantly deviates from the “missing uniformly at random” assumption in classical matrix completion. In order to reduce this bias, we adopt a sliding window approach where each low rank matrix completion will be performed in a HI sub-matrix, which has fewer amount of Type II and Type III data that more closely satisfy the “missing uniformly at random” assumption. The remaining entries that are not covered by the (sliding window) sub-matrices can be filled with a global matrix completion algorithm – those entries will be predicted with less accuracy due to the banded-structure of the HI data that violates the “missing uniformly at random” assumption.

The windows are based on the temporal spans of influenza A viruses. In order to complete the entire matrix, the algorithm will slide yearly along with both the dimensions of antigens and antisera to ensure the time difference between all antigens and antisera are within a certain window size. In order to obtain an optimal window size and best rank in matrix competition, we tested six different sizes, including An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e054.jpg, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e055.jpg, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e056.jpg, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e057.jpg, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e058.jpg, and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e059.jpg, and ranks An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e060.jpg to An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e061.jpg. A An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e062.jpg-fold cross validation suggested that the time frames of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e063.jpg-year and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e064.jpg-year with rank An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e065.jpg are two best ones towards achieving the lowest RMSE (root mean squared error) value in matrix completion of H3N2 dataset (Table 1). The average RMSE from An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e066.jpg-year experiment is slightly better than that from An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e067.jpg-year experiment. Both the average RMSE for An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e068.jpg-year and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e069.jpg-year experiment are better than that from the entire HI matrix. Thus, during matrix completion, a window of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e070.jpg and a rank of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e071.jpg will used. Similarly, our optimization method demonstrated that the window size of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e072.jpg and the rank of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e073.jpg are the best parameters for our simulation data.

Table 1
The local RMSE values from An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e074.jpg-fold cross validations using H3N2 HI dataset (1968–2003) with different window sizes (W) in sliding window based Alternating Gradient Descent.

Herd-immunity MDS model for antigenic cartography construction

After the matrix completion step, we need to project the influenza antigens onto a two-dimensional (2D) map. In order to obtain accurate global distances, we incorporate a temporal model in MDS based on the fact that the influenza antigens continue to evolve under the accumulating immune pressures of human population [9]. In order to evade the herd immunity, an influenza virus will most likely evolve into a strain with different antigenicity from recently circulating strains in human population. This intuition is mathematically incorporated in our temporal MDS model, where we assume that on the 2D cartography, influenza viruses tend to evolve along (approximate) straight-line segments during short time spans; that is, they tend to evolve in directions as far away from recently appeared viruses as possible. The detailed mathematical formula is presented in the Materials and Methods section.

In HI tables, a Type II value is resulted from experimental limitation of HI assay and reflects a weak (or low) immunological reaction between a testing antigen/antiserum pair. Although this value is not as informative as a Type I value, it is more useful than a Type III value (missing value). In particular, if a particular virus has type I values with a certain set of antibodies that show strong reactions, while another virus reacts weakly with the same set of antibodies (resulting in type II values), then the global distance between their 2D cartography embeddings should be relatively large. A set of constraints on global distances can be derived from this observation. The details can be found in the Materials and Methods section.

There are four parameters An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e138.jpg to be optimized in our temporal MDS model. We use An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e139.jpg-fold cross validations to select the optimal parameters that achieve the lowest RMSE while satisfying global distance constraints derived from Type II data. Our cross-validation results led to An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e140.jpg for the real data and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e141.jpg for the simulation data.

Influenza cartography construction using simulation data

To demonstrate the potential impacts of Type II data (low reactors) and Type III data (missing values) on the influenza cartography, we performed experiments using simulated HI matrices containing An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e142.jpg antigens versus An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e143.jpg antibodies in which we know the ground-truth. Three simulated HI matrices were generated, where one was based on the distributions of H3N2 1968–2003 HI data: (1) HI matrix (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e144.jpg data absence) with neither Type II nor Type III data; (2) HI matrix (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e145.jpg data absence, data structure: randomly distributed) with Type III data but without Type II data; (3) HI matrix (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e146.jpg data absence, data structure: with a temporal data missing bias similar to H3N2 data as shown in Figure 2) with both Type II data and Type III data. The first HI matrix serves as the benchmark data (ground truth). The second HI matrix is used to test the efficiency of standard matrix completion algorithms under the missing uniformly at random assumption. The third matrix is used to examine the efficacy of the temporal model in MDS. A more effective computational method would be expected to produce a cartography more similar to that of the benchmark matrix. Using these simulated HI matrices, we are able to compare the MC-MDS method proposed in this work to the original metric MDS method of [8] in terms of HI matrix completion and cartography construction accuracies.

To assess whether MC-MDS and metric MDS can accurately recover the HI values in the HI data, we calculated the local RMSEs for the Type I data using An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e147.jpg-fold cross validation (Table 2). The experimental data were partitioned into An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e148.jpg parts, and each time we use An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e149.jpg parts for training and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e150.jpg part for testing. The RMSE values were calculated using the Type I values in the testing part. Here we only use Type I values for RMSE calculation in order to be consistent with our real-data experiment, where we do not know the ground-truth corresponding to Type II and Type III data. The local RMSE values were An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e151.jpg for MC-MDS and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e152.jpg for metric MDS, where the notation of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e153.jpg is used. Since a typical matrix value is about An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e154.jpg, these local RMSE values indicate that both methods were able to recover HI values effectively. The small difference between the two means of MC-MDS and metric MDS is significantly smaller than the standard deviations. Hence they are statistically insignificant. However, we note that metric MDS has a larger standard deviation, which is consistent with our observation that it is less stable.

Table 2
Comparison between MC-MDS and metric MDS.

The effectiveness of a cartography construction algorithm can be assessed using figures of merit that measure its robustness and correctness. The robustness of a method is determined by the correlation coefficient (CC value) that is calculated from the pairwise distances among antigens for every two independent runs. The correctness of cartography is measured by two values: the difference between the maximum distances (MD value) between any antigens in the benchmark cartography and that from the method being evaluated (either MC-MDS or metric MDS); the pairwise distance RMSEs (PD value), calculated by measuring the difference between the pairwise distances among all antigens in the benchmark cartography and those from the method being evaluated. We performed An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e173.jpg independent runs, and the mean and standard deviation for each figure of merit can be found in Table 2.

As specified in the Materials and Methods section, the matrix completion method employed in this paper was Alternating Gradient Descent (AGD). In Figure 3, the ground-truth cartography is given in Figure 3a. Figure 3b shows a typical result when matrix entries are missing uniformly at random (the second matrix generated in our simulation study), where the standard AGD method accurately reconstructed cartography since the resulting cartography is similar to that from the benchmark matrix. Figure 3c shows a typical result with temporally biased HI table (the third matrix generated in our simulation study), where the cartography was constructed from a combination of AGD for matrix completion and the conventional MDS (without temporal modeling) for cartography generation. It shows that this combination is unable to accurately recover the cartography of the benchmark data since the global distances are incorrect. In comparison, the combination of AGD with temporal MDS, shown in Figure 3d, does achieve significantly more accurate global cartography. This experiment demonstrates the need to explicitly incorporate temporal modeling into the MDS step. Moreover, our experiment shows that cartographies generated by AGD and temporal MDS are stable. The CC value and PD value for the An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e174.jpg independent runs are An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e175.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e176.jpg, respectively. The MD value for the An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e177.jpg independent runs is An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e178.jpg, which is close to the ground-truth value of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e179.jpg in the benchmark cartography (Figure 3a).

Figure 3
Computational simulation demonstrates that temporal model can reduce the biases generated by the Type II data (low reactors) in hemagglutination inhibition (HI) dataset.

For comparison, we implemented the metric MDS method of [8] and applied to the third HI matrix which was generated with temporally biased data type distributions. Our results indicate that the cartographies from metric MDS are less stable, with two typical runs given in Figure 3e and 3f. In the An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e185.jpg independent runs of metric MDS, the CC value and PD value for the An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e186.jpg independent runs are An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e187.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e188.jpg. The MD value is An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e189.jpg. These numbers are significantly worse than the corresponding numbers from the MC-MDS method proposed in this work. We shall especially note that the metric MDS method tends to over-estimate the global distances in this stimulation study. Moreover, the large standard deviations in the results also indicate that metric MDS is not very stable.

While these two methods achieve similar matrix completion accuracies, the reconstructed cartographies reveal a more significant difference. As we pointed out earlier, this is because the temporal bias (of data type distribution) in HI tables has stronger impact in the MDS step, especially when we compare global distances. Without temporal modeling, the accuracy of global distances between two points (representing two viruses) in the 2D cartography decays more rapidly when the two points become further apart in time. While this reduction of accuracy is an unavoidable limitation of the banded structure in HI tables (Figure 1) that makes it harder to reliably compare points far away in time, a good temporal model can alleviate its impact, and thus increase the accuracy of the resulting cartography.

Finally we summarize the main observations from this simulation study as follows. Both MC-MDS and metric MDS methods achieved similar accuracy in recovering HI values. This means that they achieve comparable performance in the matrix completion sub-task, which is less sensitive to the temporal bias problem in HI tables. However, without temporal modeling, the global distances among far away points in the reconstructed cartography become inaccurate. Therefore it is helpful to incorporate temporal modeling into the MDS step in order to reduce the temporal bias effect. The proposed MC-MDS framework (with herd-immunity temporal model) is effective in reducing the bias problem, and it leads to more accurate cartography. The metric MDS appears to be less stable and it generates less accurate cartographies because the method does not address the temporal bias problem.

Influenza antigenic cartography for H3N2 influenza A virus

In the second experiment, we use MC-MDS to construct influenza cartography for H3N2 influenza A viruses from 1968 to 2003 using the HI datasets from Smith et al. [8]. The antigenic map is shown in Figure 4. The scale of antigenic cartography is based on the antigenic distances from HI tables, e.g. each unit (grid) in the antigenic cartography represents of a 2-fold change in HI titres. These viruses are specifically labeled as eleven clusters (HK68, EN72, VI75, TX77, BK79, SI87, BE89, and BE92, WU95, SY97, and FU02). Our results indicate that the antigenic distance between HK68 and FU02 is approximately An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e190.jpg units.

Figure 4
The influenza antigenic cartography constructed by MC-MDS for H3N2 viruses from 1968 to 2003.

The resulting cartography can be compared to the published antigenic map in Smith et al. [8]. The overall trend in our results is similar to the cartography from Smith et al. [8]. However, there are two major differences: (1) The global distances in our cartography are smaller than those of Smith et al. [8]. For example Smith et al. [8] shows a distance of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e192.jpg units between HK68 and FU02. Although we have no ground truth for this data, we note that this discrepancy is consistent with our simulation study, where the metric MDS method also produces larger global distances. In that case, the metric MDS method over-estimated the global antigenic distance between A and J by An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e193.jpg units more than the true distance. (2) The local cartographies between some clusters are different. For instance, the distance between WU95 and BE89 from our method is larger than those shown in Smith et al. [8]. In order to examine which antigenic cartography is likely to be more accurate, we performed a small cartography for H3N2 HI data from 1987 to 1995. Since the number of Type II data on the HI data from 1987 to 1995 is quite small, the effects of Type II on the antigenic cartography is minimal. Therefore, the cartography for the viruses between 1987 to 1995 using data from the limited span will not suffer much from the temporal bias problem discussed in the paper, and thus should be close to the true cartography. Our result shows that the distance between WU95 and BE89 should indeed be larger than that between BE95 and BE92 (Figure 5), and this is consistent with the local cartographies from MC-MDS.

Figure 5
The antigenic cartography by MC-MDS for H3N2 HI data from 1987 to 1995.

Similar to the simulated HI data experiments, we can assess the robustness of MC-MDS and metric MDS on the H3N2 data (Table 2). The best local RMSE was An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e194.jpg for MC-MDS and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e195.jpg for metric MDS. Therefore there is no statistically significant difference in matrix completion quality. The CC values from the An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e196.jpg independent runs are An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e197.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e198.jpg for MC-MDS and metric MDS, respectively. The MD value was An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e199.jpg for MC-MDS and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e200.jpg for metric MDS. These numbers are consistent with the simulation study, showing again that MC-MDS is more stable for antigenic cartography construction.

From the An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e201.jpg runs of metric MDS, we were not able to generate the exact cartography in Smith et al. [8]. One reason might be that the initial values we randomly chose were not exactly the same as those from [8], which were not specified clearly from [8]. The source code of our implementation of the metric MDS method in [8] is available upon request. We shall point out that our implementation is strictly based on what was described in [8]. While we have spent great effort to ensure the correctness of our implementation, it is possible that there are undocumented improvements in the optimization algorithm used to solve the metric-MDS problem. In such case, their actual implementation might not suffer from the issues observed in our study. Nevertheless it is still useful for us to examine problems of the algorithm presented in their original paper, the underlying causes of these problems and their potential mathematical remedies. This is what this study tries to achieve.

Discussions

Each year, about An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e202.jpg World Health Organization (WHO) collaborating laboratories and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e203.jpg National Respiratory and Enteric Virus Surveillance System (NREVSS) that are located throughout the United States participate in virologic surveillance for influenza. By collaborating with over An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e204.jpg other National Influenza Centers in the WHO Global Influenza Surveillance Network, the vaccine strains for next influenza season are determined in the middle of February each year for northern hemisphere (these strains are used as vaccine strains in the United States) and September for southern hemisphere. The pandemic vaccine strains are also selected through collaborative efforts among different laboratories across the WHO Global Influenza Surveillance Network. Influenza vaccine strain selection is a very labor intensive procedure that depends on both antigenic characterization and genetic characterization. In general, whether an isolate will be sequenced or not is based on the result from antigenic characterization, and only highly potential antigenic variants are sequenced. Therefore, antigenic characterization is critical for vaccine strain selection. In order to identify a potential influenza vaccine strain, we have to integrate the HI tables from different experiments in the same laboratories or even from different laboratories. Each experiment only includes up to 15 reference antisera, which are updated at each influenza season or even each month within the same influenza season. In addition, it is common for individual laboratories to use different antisera. Therefore, the integrated HI table is typically an incomplete matrix. This incompleteness and the limitation of HI experiments (see the introduction section) present a challenge in interpreting HI results and thus antigenic variant identification. Another important challenge of HI data is the temporal bias effect, which means that entries in an HI matrix are not missing uniformly at random (Figure 1). These are the problems this paper addresses.

As an analog of geographic cartography, influenza cartography can be used to visualize and measure antigenic distances between influenza viruses. An essential criterion for a new influenza vaccine strain is significant antigenic divergence (e.g. An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e205.jpg fold change in HI test) from the current vaccine strain. Influenza antigenic cartography can help us identify whether a testing antigen (virus) is antigenically far away from a specific vaccine strain or a specific cluster of antigens (e.g. circulating strains at a specific time period).

In this study, we proposed a new computational framework for constructing an influenza antigenic cartography, and demonstrated its usefulness in antigenic characterization. This computational framework has two integrated steps: (1) through a matrix completion algorithm, influenza antigenic distance matrices are constructed; (2) through MDS (with herd-immunity temporal model), influenza antigens (viruses) are projected onto a two-dimensional cartography. We specifically pay attention to the major challenge that is caused by the temporal bias in HI datasets. That is, the banded structure of HI entries indicates that the matrix entries are not missing uniformly at random (Figure 1), which violates the standard assumption in conventional methods. Our experiment showed that standard approach will not handle this problem very well, and will produce cartographies with incorrect global distances. This paper addresses the problem through a biologically motivated temporal evolution model that is mathematically incorporated into the MDS algorithm. It is shown that more accurate antigenic distances can be obtained from this approach.

Although MC-MDS is presented as a 2D cartography construction method in this paper, it can be extended easily for 3D (or even higher dimensional) cartography by modifying the resulting cartography dimension in the MDS step of our computational framework.

The temporal regularization in MC-MDS is based on the fact that the influenza antigens continue to evolve under the accumulating immune pressures of human population [9]. Within a short time period, the antigenic distances among viruses tend to become larger in temporal order. Such a regularization is important since it can effectively minimize the biases of Type II data. However, such regularization does not necessarily imply that the antigen would always evolve forward. Theoretically, it is possible that the antigenicity (not genetic sequence) of influenza viruses could become similar to earlier circulating strains when the selective pressure from herd immunity disappears. This is supported indirectly by the report that 2009 pandemic H1N1 virus cross-reacted with the serum from the ages over An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e206.jpg, who were likely to infect the seasonal H1N1 virus circulating in human population before 1957 [10].

Besides the immunological datasets for the influenza viruses (such as those of human origin) with the accumulating immunity from their hosts, there are other immunological datasets for the influenza viruses from mutations (not necessarily accumulating immunity), such as those of swine or avian origin. For the latter case (e.g. the data of swine or avian origin), our limited experiments in H5 and H7 studies suggested that the users can use MC-MDS directly without temporal model (data not shown). However, there might be additional structures to explore in such data. This requires more extensive investigations in the future.

Conclusion

We introduced a new computational framework for influenza antigenic cartography construction from HI datasets. This approach, which we refer to as MC-MDS, integrates two mathematical procedures: matrix completion and MDS projection (with temporal modeling). Using the AGD matrix completion algorithm on HI datasets from 1968 to 2003, we successfully identified the eleven reported clusters of antigenic variants that represent major antigenic drift events during these An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e207.jpg years. Thus, this method is useful in both influenza antigenic variant identification and influenza vaccine strain selection. Our results also demonstrated that MC-MDS is more robust and effective than our implementation of the metric MDS method [8] in influenza antigenic cartography construction.

Materials and Methods

Dataset and data transformation

H3N2 HI benchmark dataset and data transformation

The benchmark HI dataset is adopted from [8], and it includes An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e208.jpg observed HI values from the reactions from An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e209.jpg H3N2 influenza A viruses against An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e210.jpg ferret antisera. These viruses were isolated periodically from locations around the world between 1968 and 2003, and the antisera were generated against An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e211.jpg prototype influenza strains, most of which were selected from these An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e212.jpg influenza isolates. Both influenza antigen (virus) and antiserum (antibody) can be roughly clustered into eleven groups, HK68, EN72, VI75, TX77, BK79, SI87, BE89, BE92, WU95, SY97 and FU02, which represent the eleven major events of antigenic drifts resulting in a pandemic or an epidemic from 1968 to 2003. For instance, FU02 represents a group of influenza viruses isolated around the year of 2002 with similar antigenic characteristics.

Within this dataset, three types of data points are present: Type I, a regular HI titre; Type II, the value is defined as ‘less than a threshold’, e.g. An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e213.jpg, where An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e214.jpg, and this value represents the testing antigen and antiserum do not strongly react with each other; Type III, missing values. Following [8], we preprocess the HI matrix by normalizing the data entries as follows: each Type I entry with the observed value An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e215.jpg is transformed to An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e216.jpg, where An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e217.jpg is the largest HI value among all observed entries and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e218.jpg is the maximum HI value for antiserum An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e219.jpg; each Type II entry with value An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e220.jpg is transformed into An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e221.jpg; Type III data are replaced with An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e222.jpg s, representing the missing values.

Simulated HI data

To study the effect of temporal bias on influenza cartography, we simulate HI matrices with An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e223.jpg viruses and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e224.jpg antibodies. Both the viruses and antibodies can be partitioned into An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e225.jpg blocks by temporal information. In each block, we will have 20 viruses and 10 antibodies. Let An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e226.jpg denote the HI titre for An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e227.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e228.jpg, where An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e229.jpg, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e230.jpg, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e231.jpg, and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e232.jpg. Each An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e233.jpg is a random value generated uniformly from An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e234.jpg.

In the HI matrix with Type II values, all the HI values no more than a titre of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e235.jpg will be replaced with Type II values of the form An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e236.jpg. The matrix is preprocessed according to the same method used in the H3N2 HI benchmark dataset. To generate incomplete HI matrices, we randomly select Type I and Type II HI values (about An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e237.jpg) from the entire HI matrix by mimicking the data distribution in the H3N2 HI dataset.

Matrix completion algorithms

The goal of matrix completion is to fill the missing entries in an incomplete matrix based on appropriate mathematical models of the matrix. It is a traditional mathematical problem that has been studied for many decades. Early contributions on this problem include Schur [11], Farahat and Ledermann [12], Friedland [13], Hershkowitz [14], London and Minc [15], Mirsky [16] and Oliveira [17][19]. In the past decade, interest in the problem has grown substantially, especially after the launch of Netflix competition [20] in 2007. The Netflix problem is to predict each user's movie preference (in order for Netflix to make appropriate movie recommendations to each user) from approximately An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e238.jpg observed user ratings. This can be regarded as a matrix completion problem, where we predict missing user/movie ratings from incomplete observations. This is exactly like the problem of predicting antigen/antibody interactions which we consider in this paper. In general, matrix completion is ill-posed and computationally intractable [21], [22]. However, recently, Candes and Rect [22] and Recht et. al [23] proved that under appropriate conditions, the minimum rank matrix solution can be recovered from incomplete entries by solving a convex optimization problem. These theoretical developments generated further interest, and afterwards, a number of new methods have been proposed [24][30].

If we do not consider the temporal bias effect, then the antigenic cartography task can be formulated as a matrix completion problem. Simply, in an HI matrix, there are An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e239.jpg antigens corresponding to the rows, and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e240.jpg antisera corresponding to the columns. Let An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e241.jpg denotes the HI value from the reaction between testing antigen An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e242.jpg and antiserum An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e243.jpg. The HI matrix can be represented as

equation image

Let An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e245.jpg denote the subset of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e246.jpg's entries corresponding to Type I and Type II data. In practice An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e247.jpg. The goal of matrix completion is to estimate the HI values in the Type II and Type III entries as accurately as possible. In addition, matrix completion can re-estimate Type I entries and remove embedded noises, which were from the uncertainties in experimental measurements. This is a standard matrix completion problem. The standard approach to this problem is to assume that the matrix is low rank, with rank An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e248.jpg. In our application, this means that each antigen An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e249.jpg can be embedded into the An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e250.jpg-dimensional space as An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e251.jpg, and each antiserum An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e252.jpg can be embedded into the An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e253.jpg-dimensional space An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e254.jpg. In the low rank model, the interaction An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e255.jpg between antigen An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e256.jpg and antiserum An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e257.jpg is given by An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e258.jpg for some matrices An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e259.jpg. We can aggregate vectors An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e260.jpg into a matrix An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e261.jpg and aggregate vectors An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e262.jpg into a matrix An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e263.jpg. Mathematically, the low-rank model is to find matrices An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e264.jpg with dimensions An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e265.jpg, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e266.jpg with dimensions An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e267.jpg and a diagonal matrix An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e268.jpg with dimensions An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e269.jpg such that

equation image
(1)

Here we describe AGD matrix completion method, which is developed based on gradient decent method. AGD method assumes the low rank matrix completion model (1).

If type II data are not present, one can employ the following optimization formulation to estimate the missing values

equation image
(2)

where An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e272.jpg when An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e273.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e274.jpg otherwise.

The function An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e275.jpg is a regularization condition for the matrix An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e276.jpg, which is introduced to stabilize the solution. The solution An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e277.jpg of the optimization problem (2), which does not contain any missing value, will replace An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e278.jpg (which has missing values) as the true (and denoised) HI table, which we can then use for other purposes, such as cartography construction.

In the AGD method, we take An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e279.jpg in (2), where An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e280.jpg when An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e281.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e282.jpg otherwise. An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e283.jpg (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e284.jpg) denotes the An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e285.jpgth row of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e286.jpg(An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e287.jpg) and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e288.jpg.

First, the algorithm uses SVD to obtain the factorization An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e289.jpg. Here An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e290.jpg is the trimmed matrix of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e291.jpg where we randomly set some observed values to An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e292.jpg from the rows (columns) when a row (column) contains more than An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e293.jpg(An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e294.jpg) observed values. The purpose of this trimming step is to guarantee that each row (column) has less than An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e295.jpg(An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e296.jpg) non zero values. This is based on the observations of Keshavan et. al. [27] that when An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e297.jpg, the corresponding singular vectors are highly concentrated on high-weight column (or row) indices. It means that those vectors do not provide useful information. After SVD, we set the initial value An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e298.jpg to An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e299.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e300.jpg to An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e301.jpg where An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e302.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e303.jpg are the first An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e304.jpg columns of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e305.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e306.jpg respectively.

We then apply the following alternating optimization procedure until convergence or when certain number of iterations are reached.

  • Fix An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e307.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e308.jpg and calculate the matrix An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e309.jpg to minimize the squared error An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e310.jpg. This is a least squares regression problem with respect to An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e311.jpg.
  • Update An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e312.jpg (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e313.jpg) using gradient descent: we take steps proportional to its negative of gradient with respect to the objective function. That is, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e314.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e315.jpg where An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e316.jpg is the step size parameter which can be optimized by line search algorithm.
  • The first two steps are repeated until convergence or reaching a pre-defined number of iterations.

The gradient of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e317.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e318.jpg are:

equation image
(3)

equation image
(4)

where

equation image

equation image

and

equation image

A general method to handle Type II values

Although Type II data is not as informative as Type I data, they still provide useful information. Therefore we have to modify (2) to include type II data. First we introduce threshold values An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e324.jpg for each entries An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e325.jpg, and let An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e326.jpg if the corresponding entry is not type II data. If an entry is type II data, we set An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e327.jpg to be the corresponding threshold. We change the standard matrix completion formulation (2) into the following form that incorporates type II information:

equation image
(5)

where An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e329.jpg is the indicator function: An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e330.jpg if An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e331.jpg; and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e332.jpg if An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e333.jpg. The intuition behind this formulation is that for an type II entry An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e334.jpg, if An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e335.jpg, then we do not have to penalize the error An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e336.jpg because the constraint is satisfied.

One advantage of this formulation is that we can employ any optimization algorithm that solves (2) to solve (5). We start with an initial estimate of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e337.jpg by ignoring the type II data. We then iterate as follows until a certain number of iterations are reached:

  • Let An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e338.jpg
  • Update An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e339.jpg by solving
    equation image
    using any optimization algorithm (such as AGD) for (2).

The procedure is a principled approach to handle Type II data, and it can be used with any algorithm that optimizes (2).

The two parameters, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e341.jpg in the penalty function and the rank to project data, are trained through 10-fold cross validation. The rank (from An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e342.jpg to An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e343.jpg) and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e344.jpg (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e345.jpg) with the smallest RMSE value will be selected as the best rank in matrix completion.

Performance evaluation

The performance of matrix completion is evaluated using the following three criteria in this study: root mean squared error (RMSE), correlation coefficient, and biological interpretation. Given An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e346.jpg values An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e347.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e348.jpg, we define the RMSE as:

equation image

where An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e350.jpg stands for an observed value and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e351.jpg stands for the corresponding predicted value. If a prediction scheme has a small RMSE value, then the predicted values are close to the true values. For matrix completion, we utilize 10-fold cross validation to calculate the RMSE values. The observed matrix entries are partitioned into An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e352.jpg equal parts. Each time, one part is used for testing and the other nine parts for training. That is, each time we use An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e353.jpg parts as observed values in matrix completion; after the completed matrix is generated from these An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e354.jpg parts, we calculate the RMSE between the completed matrix and the observed matrix entries in the remaining part. The process is repeated for every part in the dataset. The RMSE value is the average RMSE value over several runs. The RMSE values were estimated only using Type I values, and thus they are also called local RMSE. Note that we also report the standard deviation numbers calculated from cross validation. It is known that standard deviation calculation based on cross validation is often smaller than the true standard deviation; however, the numbers still provide meaningful indications and hence are included.

For the temporal based MDS, we can define the distance between two viruses as the Euclidean norm between the rows of the completed HI table corresponding to the two viruses. In evaluation, we use the local pairwise distances among temporally close by viruses because these distances are more reliable. In particular, the local pairwise distances are partitioned into An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e355.jpg equal parts. Again, each time we left one part as testing samples and applied temporal based MDS by using An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e356.jpg parts. The RMSE between the estimated distance and testing data are calculated, and they are called the pairwise distance RMSEs (PD value).

The correlation coefficient (CC) between two vectors measures the strength and direction of their linear relationship. Let An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e357.jpg denote the vector A and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e358.jpg denote the vector B. The correlation coefficient (CC) between vector A and B is defined as follows:

equation image

Clearly, a larger CC value indicates the two vectors are closely related. For every two runs, we will have two distance vectors and one CC value. In An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e360.jpg experiments, we will have An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e361.jpg CC values, then the mean and standard deviation can be calculated. This test is to assess whether every two runs of the test method (temporal MC-MDS or Metric MDS) are different. For instance, if metric MDS has lower CC value and larger standard deviation, we will conclude it is not stable.

The biological interpretation is based on separation and quantification of the reported antigenic variant groups in the influenza antigenic cartography.

Window size determination

In order to reduce the temporal bias in HI matrices, we adopt a sliding window approach in the matrix completion step. The rational for sliding window matrix completion is that the temporal bias effect becomes much smaller in temporally grouped sub-matrices than in the entire HI matrix. This means that the effect of temporal bias will be reduced when we complete each sub-matrix separately. Therefore in our approach low rank matrix completion will be performed separately in each HI sub-matrix. In order to complete the entire matrix, the algorithm will slide yearly along with both the dimensions of antigen and antisera to ensure the time difference between all antigen and antisera are within a certain window size An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e362.jpg. The missing values or low reactors in HI matrix are estimated as the mean value of the recovering values from the associated sub-matrices. If the missing values or low reactors are not covered by any of the sub-matrices, they will be estimated through matrix completion using the entire HI matrix. The selection of window size An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e363.jpg is based on minimizing the RMSE values from 10-fold cross validation. For H3N2 HI dataset, we have tested different values of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e364.jpg.

Temporal based MDS for Type II data

Multidimensional scaling (MDS) is a statistical technique widely used in information visualization. It embeds a set of data into low dimension vectors while preserving their pair-wise distances. The projection of viruses into two or three dimensional space can be viewed as an analog of a geographic cartography; thus this is referred to as influenza antigenic cartography. Due to the temporal bias effect in HI tables, we have to incorporate a temporal model into the MDS algorithm to reconstruct global distances more accurately. In this work, we consider a biologically motivated temporal regularization criterion. The regulation in our temporal model is based on the fact that the influenza antigens continue to evolve under the accumulating immune pressures of human population [9]. In order to evade the herd immunity, an influenza virus will most likely evolve into a strain with different antigenicity from recently circulating strains in human population. Thus, within a certain time period (e.g. An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e365.jpg years, which is within one human generation), the antigenic distances among viruses tend to become larger in temporal order.

This intuition is mathematically incorporated into our temporal regularization condition. Specifically we assume that on the 2D cartography, influenza viruses tend to evolve along (approximate) straight-line segments during short time spans; that is, they tend to evolve in directions as far away from recently appeared viruses as possible. The concrete mathematical formulation is described below.

First we denote by An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e366.jpg the average distance between virus An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e367.jpg and virus An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e368.jpg using the completed matrix from step 1, and let An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e369.jpg denote the isolation year of virus An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e370.jpg. We partition all influenza viruses into An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e371.jpg groups based on their temporal ordering. If a dataset covers year An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e372.jpg to year An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e373.jpg, then the An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e374.jpg groups are viruses in year An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e375.jpg. This grouping choice is due to the fact that each influenza season in the northern hemisphere spans two years. Therefore without additional information, it is appropriate to assign every virus to the neighboring years as well. For instance, it is natural to assume that the A/Beijing/32/92(H3N2) virus belongs both to group An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e376.jpg, and to group An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e377.jpg. Denote by An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e378.jpg the viruses in the An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e379.jpg-th group. In 2D cartography, we represent each virus by a two dimensional vector. Let An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e380.jpg be the distance between virus An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e381.jpg and virus An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e382.jpg in the cartography; let An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e383.jpg be the center coordinate of group An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e384.jpg and hence An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e385.jpg is the distance between virus An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e386.jpg to the center An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e387.jpg, and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e388.jpg is the distance between two centers An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e389.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e390.jpg. The temporal MDS method in our experiments attempts to minimize the following error function:

equation image

The first term is the standard MDS. The second term means that viruses within each group should be close to each other. The third term is the mathematical formulation that formalizes the biological intuition that viruses tend to evolve along straight-line segments during short time spans.

Besides the above error function, we impose constraints on global distances that can be derived from the original dataset. We know that each reference antiserum is associated to an antigen. Let the reference antiserum be An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e392.jpg and its corresponding antigen be An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e393.jpg. If antigen An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e394.jpg has low reactor (Type II) with this reference antiserum An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e395.jpg and antigen An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e396.jpg has relatively high reactor with An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e397.jpg, we can naturally assume that in the 2D cartography, the distance between An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e398.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e399.jpg should be smaller than the distance between An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e400.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e401.jpg. In our experiments, if a HI value is equal to or more than An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e402.jpg (4-fold higher than the Type II threshold An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e403.jpg), we call it a relatively high reactor. The four parameters An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e404.jpg are optimized using An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e405.jpg-fold cross validations. The configuration that achieves the best RMSE for reconstructing local pairwise distances while satisfying all constraints is selected. The source codes for this implementation are available upon request.

Metric MDS for influenza antigenic cartography construction

The metric MDS method is developed by Smith et. al. [8] to construct antigenic cartography. This method attempts to minimize an error function of the form An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e406.jpg. Here An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e407.jpg is set to An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e408.jpg where An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e409.jpg is the observed value and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e410.jpg is the An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e411.jpg of the maximum reaction for antiserum An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e412.jpg. The error function is defined as

equation image

where An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e414.jpg.

Similar to AGD , this algorithm also requires a pre-defined dimension (rank) An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e415.jpg as input; that is, each virus or antiserum is represented by an An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e416.jpg-dimensional vector. Let An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e417.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e418.jpg represent virus An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e419.jpg and antiserum An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e420.jpg, respectively. The An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e421.jpg is defined as the Euclidean distance between the vector An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e422.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e423.jpg: An external file that holds a picture, illustration, etc.
Object name is pcbi.1000949.e424.jpg. The algorithm generates the random initial vector for each virus or antiserum and the solution is found through conjugate gradient optimization with multiple random restarts.

Acknowledgments

We are grateful for the critical comments from three anonymous reviewers, which helped improve this study very much. Approved for publication as Journal Article No. J11876 of the Mississippi Agricultural and Forestry Experiment Station, Mississippi State University.

Footnotes

The authors have declared that no competing interests exist.

The project described was supported by award number RC1AI086830 from the National Institute Of Allergy And Infectious Diseases. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute Of Allergy And Infectious Diseases or the National Institutes of Health. This study is also supported by the start-up package from Mississippi State University and NSF-EPS-0903787 subaward 012156-007 to XFW; NSF DMS-1007527 and AFOSR-10097389 to TZ.

References

1. Fouchier R, Munster V, Wallensten A, Bestebroer T, Herfst S, et al. Characterization of a novel influenza a virus hemagglutinin subtype (H16) obtained from black-headed gulls. J of Virol. 2005;79:2814–2822. [PMC free article] [PubMed]
2. Simonsen L, Fukuda K, Schonberger L, Cox N. The impact of influenza epidemics on hospitalizations. J Infect Dis. 2000;181:831–837. [PubMed]
3. Thompson W, Shay D, Weintraub E, Brammer L, Cox N, et al. Mortality associated with influenza and respiratory syncytial virus in the united states. J Am Med Assoc. 2003;289:179–186. [PubMed]
4. Palese P. Influenza: old and new threats. Nature Med. 2004;10:s82–87. [PubMed]
5. Parrish C, Kawaoka Y. The origins of new pandemic viruses: the acquisition of new host ranges by canine parvovirus and influenza a viruses. Annu Rev Micro. 2005;59:553–586. [PubMed]
6. Dawood F, Jain S, Finelli L, Shaw M, Lindstrom S. Emergence of a novel swine-origin influenza a (H1N1) virus in humans. New Engl J Med. 2009;360:2605–2615. [PubMed]
7. Lapedes A, Farber R. The geometry of shape space:application to influenza. J Theor Biol. 2001;212:51–69. [PubMed]
8. Smith D, Lapedes A, Jong J, Bestebroer T, Rimmelzwaan G, et al. Mapping the antigenic and genetic evolution of influenza virus. Science. 2004;35:371–376. [PubMed]
9. Bush R, Bender C, Subbarao K, Cox N, Fitch W. Emergence of a novel swine-origin influenza a (H1N1) virus in humans. Science. 1999;286:1921–1925. [PubMed]
10. Hancock K, Veguilla V, Lu X, Zhong W, Butler E, et al. Cross-reactive antibody responses to the 2009 pandemic H1N1 influenza virus. New Engl J Med. 2009;361:1945–1952. [PubMed]
11. Schur I. über eine klasse von mittelbildungen mit anwendungen auf die determinantentheorie. Sitz ber Berlin Math Ges. 1923;22:9–20.
12. Farahat H, Ledermann W. Matrices with prescribed characteristic polynomials. Proc Edinburgh Math Soc. 1958;11:143–146.
13. Friedland S. Matrices with prescribed off-diagonal elements. Israel J of Math. 1972;11:184–189.
14. Hershkowitz D. Existence of matrices with prescribed eigenvalues and entries. Linear and Multilinear Algebra. 1983;14:315–342.
15. London D, Minc H. Eigenvalues of matrices with prescribed entries. Amer Math Soc. 1972;34:8–14.
16. Mirsky L. Matrices with prescribed characteristic roots and diagonal elements. London Math Soc. 1958;33:14–21.
17. Oliveira G. Matrices with prescribed entries and eigenvalues. Proc Amer Math Soc. 1973;37:380–386.
18. Oliveira G. Matrices with prescribed entries and eigenvalues II. SIAM J Appl Math. 1973;24:414–417.
19. Oliveira G. Matrices with prescribed entries and eigenvalues III. Archiv der Mathematik. 1975;25:57–59.
20. ACM SIGKDD and Netflix. 2007. Available: Proc of KDD Cup and Workshop.
21. Chistov A, Grigoriev D. Complexity of quantifier elimination in the theory of algebraically closed fields. Proc Math Found Comput Sci. 1984:17–31.
22. Candes E, Rect B. Exact matrix completion via convex optimization. 2008. Technical Report.
23. Recht B, Fazel M, Parrilo P. Guaranteed minimum rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review In press
24. Cai J, Candes E, Shen Z. A singular value thresholding algorithm for matrix completion. 2008. Technical Report.
25. Candes E, Tao T. The power of convex relaxation: Near-optimal matrix completion. 2009. Technical Report.
26. Dai W, Milenkovic O. Set: an algorithm for consistent matrix completion. 2009. Technical Report.
27. Keshavan R, Oh S, Montanari A. Matrix completion from a few entries. 2009. Technical Report.
28. Mazumder R, Hastie T, Tibshirani R. Regularization methods for learning incomplete matrices. 2009. Technical Report.
29. Meka R, Jain P, Caramanis C, Dhillon I. Rank minimization via online learning. ICML. 2008. pp. 656–663.
30. Raghu M, Jain P, Dhillon I. Matrix completion from power-law distributed samples. 2009. Available: NIPS.

Articles from PLoS Computational Biology are provided here courtesy of Public Library of Science