Troyanskaya et al. have shown that correlation between gene expression profiles is useful in the imputation of missing values. This will only be the case for gene profiles where the proportion of values that are missing is relatively low. We therefore analyzed the three data sets referred to above to find out how common it is for genes to have only a few values missing. As Table shows, for most genes with missing values, only a few percent are missing. The time series data set has 6850 out of 16 838 genes without missing values in 39 arrays, which is the set used for testing the imputation methods. It is interesting to observe that for 6597 of the genes having missing values, <15% of the values are missing per gene. In the lymphoma data set, only 854 out of 4026 genes are without missing values in all 96 arrays. However, no genes have >20% of the values missing. The NCI60 data set shows a similar pattern, where 2069 out of 6830 genes are without missing values, but 4489 of the remaining 4761 genes have <15% missing values.
Number of genes with different percentages of missing values in three example data sets
These results indicate first of all that missing values are a common problem that has to be addressed. At the same time, they show that the structure of the missing values in these three data sets is likely to allow imputation methods to make reliable estimates of the missing values for most of the genes. If missing values dominate the expression profile of a gene, for instance if a gene has 70% missing values, few measurements remain to determine how the gene is correlated with other genes in the data set. However, since the example data sets show that most values are present for each gene, the basis for determining the correlation structure between genes is relatively good. Determining how the arrays are correlated appears to be a smaller problem, since measurements are typically present for thousands of genes. However, if we are to impute values for a gene with many missing values, fewer arrays can be included in the array-based multiple regression model used for estimation in LSimpute_array, most probably leading to less accurate estimates.
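The per-gene tally of missing values discussed above can be sketched as follows. This is only an illustration: the function names, the bin boundaries and the convention that missing values are encoded as `None` or NaN are assumptions of the sketch, not details taken from the data sets.

```python
import math


def missing_fraction_per_gene(matrix):
    """Return, for each gene (row), the fraction of arrays (columns) that are missing.

    Assumption of this sketch: a missing value is encoded as None or NaN.
    """
    fractions = []
    for row in matrix:
        n_missing = sum(
            1 for v in row
            if v is None or (isinstance(v, float) and math.isnan(v))
        )
        fractions.append(n_missing / len(row))
    return fractions


def tally_by_bins(fractions, upper_bounds=(0.05, 0.10, 0.15, 0.20)):
    """Count complete genes, genes per (lower, upper] bin, and genes above the last bin."""
    counts = {"complete": 0, "above": 0}
    for upper in upper_bounds:
        counts[upper] = 0
    for f in fractions:
        if f == 0:
            counts["complete"] += 1
            continue
        for upper in upper_bounds:
            if f <= upper:
                counts[upper] += 1
                break
        else:
            # No bin matched: more missing values than the largest bound.
            counts["above"] += 1
    return counts


# Toy expression matrix: 3 genes measured on 10 arrays (values are made up).
toy = [
    [0.1] * 10,                               # no missing values
    [0.2] * 9 + [None],                       # 10% missing
    [0.3] * 7 + [float("nan"), None, None],   # 30% missing
]
toy_counts = tally_by_bins(missing_fraction_per_gene(toy))
```

Applied to a real expression matrix, `toy_counts` would correspond to one column of a table of genes per percentage of missing values.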
We compare the LSimpute and EMimpute methods with KNNimpute (12), to see whether these methods represent an improvement over previously proposed methods. At present, KNNimpute is a widely used method for missing value imputation. The estimates from the KNNimpute method are sensitive to the choice of the parameter K, the number of gene neighbors used to estimate the missing values. Because of this, we have tested KNNimpute over a range of values for K, and only report the best results obtained here. Comparing the results obtained using the methods LSimpute_gene, LSimpute_array, EMimpute_gene and EMimpute_array with the results obtained using KNNimpute, we found that all these methods give a smaller RMSD than KNNimpute when 5% of the data are missing. The results are summarized in Figure and Table . Due to the time it takes to run one iteration of EMimpute_gene (see below), this method is only tested with 5 and 10% missing values. Table also lists the ratio of the RMSD between KNNimpute and the other methods for easier assessment of relative improvement compared with KNNimpute. LSimpute_gene gives a 4.4–9.7% smaller RMSD than KNNimpute with 5% missing values, while LSimpute_array gives a 6.8–19.8% smaller RMSD. EMimpute_gene gives a 2.6–8.5% smaller RMSD than KNNimpute with 5% missing values, while EMimpute_array gives a 5.0–21.7% smaller RMSD. The results clearly favor the array-based estimation methods, as in two of three data sets they give markedly more accurate estimates than gene-based estimation methods. For the lymphoma data set, the RMSD obtained using array-based estimation is marginally worse. The difference is 1.3% comparing LSimpute_gene with LSimpute_array, and 3.5% comparing EMimpute_gene with EMimpute_array. In this case, LSimpute_gene and EMimpute_gene have approximately the same RMSD value, with only a 0.3% difference in favor of EMimpute. Overall, these results indicate that LSimpute_gene performs better than EMimpute_gene, while EMimpute_array may be a bit better overall than LSimpute_array.
Somewhat surprisingly, we found that for two of the data sets, KNNimpute gave the best performance using K = 5 (NCI60 and lymphoma), while K = 10 gave the best performance for the last data set (time series). Troyanskaya et al. reported that KNNimpute produced the most accurate estimates when K had a value in the range 10–20.
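The comparison above rests on three small computations: the RMSD between true and estimated values, the relative reduction in RMSD against a reference method, and the choice of the best K. A minimal sketch follows, assuming the plain root mean squared deviation over the artificially removed entries. The exact RMSD definition is not restated in this section, and the function names and toy numbers are hypothetical.

```python
import math


def rmsd(true_values, estimates):
    """Root mean squared deviation between true and imputed values."""
    if len(true_values) != len(estimates) or not true_values:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((t - e) ** 2 for t, e in zip(true_values, estimates))
    return math.sqrt(squared / len(true_values))


def percent_smaller(rmsd_reference, rmsd_method):
    """Percentage by which rmsd_method is smaller than rmsd_reference."""
    return 100.0 * (1.0 - rmsd_method / rmsd_reference)


def best_setting(rmsd_by_k):
    """Return the parameter value (e.g. K) with the lowest RMSD."""
    return min(rmsd_by_k, key=rmsd_by_k.get)


# Toy numbers, made up for illustration only.
example_rmsd = rmsd([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
example_gain = percent_smaller(1.0, 0.9)
example_k = best_setting({5: 0.50, 10: 0.40, 15: 0.45})
```

Reporting only the best K, as done for KNNimpute here, gives the reference method the most favorable setting, so the relative improvements are conservative with respect to the choice of K.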
Comparison of estimation error (RMSD) for the methods on three data sets.
Comparison of basic LSimpute methods and EMimpute methods against KNNimpute with 5% missing data
We performed a set of tests to evaluate how the methods for combining array- and gene-based estimates perform relative to the other methods. The results are summarized in Figure and Table and show that LSimpute_combined gives a smaller RMSD than each of its two component methods (LSimpute_gene and LSimpute_array), although only marginally for two of the data sets (NCI60 and time series). The marginal improvement over the best of the two, the array-based method, is an effect of the relatively large difference in RMSD between LSimpute_gene and LSimpute_array. For the lymphoma data, the accuracy (RMSD) of the two component methods is more even, and therefore a relatively large improvement compared with the better of the two is obtained by combining them. Our empirical results indicate that by combining the gene- and array-based methods, we obtain estimates that are at least as good as when using the best of the two. The lack of significant improvement for LSimpute_combined over its component methods in the time series data set is a result of the nature of this data set. The data set contains several similar infection time series with different mutants of L.monocytogenes. Baldwin et al. report that a difference in host response was undetectable using different mutants. From this, we can practically view the time series as replicated experiments, and therefore we are not surprised by the superiority of LSimpute_array over LSimpute_gene in estimating the missing values. Still, we expect that LSimpute_combined will give an estimate at least as accurate as each of the component methods, as it takes into consideration the relative strengths of the two underlying methods.
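To illustrate why a weighted combination can do at least as well as the better component on the values used to fit it, consider the sketch below. The weighting rule of LSimpute_combined is not given in this section, so the sketch assumes a single global weight p, fitted by least squares on values whose true measurements are known, and forms the estimate p·gene + (1 − p)·array. All names are hypothetical.

```python
def fit_global_weight(true_values, gene_estimates, array_estimates):
    """Least-squares weight p in [0, 1] for p*gene + (1 - p)*array.

    Assumption of this sketch: one global weight, fitted on entries whose
    true value is known (e.g. known values that were re-estimated).
    """
    numerator = 0.0
    denominator = 0.0
    for t, g, a in zip(true_values, gene_estimates, array_estimates):
        numerator += (t - a) * (g - a)
        denominator += (g - a) ** 2
    if denominator == 0.0:
        # Both methods agree everywhere, so any weight gives the same estimate.
        return 0.5
    return min(1.0, max(0.0, numerator / denominator))


def combine(p, gene_estimates, array_estimates):
    """Weighted average of the gene-based and array-based estimates."""
    return [p * g + (1.0 - p) * a for g, a in zip(gene_estimates, array_estimates)]


# Toy case: gene-based estimates are 1 too high, array-based 1 too low.
truth = [1.0, 2.0, 3.0]
gene_est = [2.0, 3.0, 4.0]
array_est = [0.0, 1.0, 2.0]
p_hat = fit_global_weight(truth, gene_est, array_est)
combined_est = combine(p_hat, gene_est, array_est)
```

Because p = 0 and p = 1 reproduce the two component methods, the fitted weight cannot give a larger squared error than either component on the fitting values. When one component is much more accurate, as for the time series data, the weight moves towards that component and the gain from combining is small.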
Comparison of basic and combined LSimpute methods and EMimpute methods against KNNimpute with 10% missing data
Using the adaptive estimation model implemented in LSimpute_adaptive, we get an additional improvement for the NCI60 and lymphoma data sets compared with LSimpute_combined. Thus by performing an adaptive weighting of the estimates from LSimpute_gene and LSimpute_array based on the structure of the data, we obtain the most accurate estimates (lowest RMSD) of all methods tested in this study. Performance for the time series data set is equal to that of LSimpute_combined, and approximately equal to LSimpute_array. Thus the nature of the time series data set, containing several closely related cell samples on different arrays, causes array correlations to be the best basis for missing value prediction. Overall, using LSimpute_adaptive, we see an improvement in RMSD of 18–20% compared with KNNimpute for all three data sets.
We want to test whether the prediction errors obtained using our most successful method, LSimpute_adaptive, are significantly smaller on average than the prediction errors obtained using KNNimpute. For this purpose, we use a paired t-test, where the observations are the differences in the size of the errors made by the two methods. By taking the difference $d_i = |e_{i,\mathrm{KNN}}| - |e_{i,\mathrm{adaptive}}|$ for each missing value ($i = 1, 2, \ldots, n$, where $n$ is the number of missing values), we can test whether the average difference in size of prediction error, $\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i$, is significantly larger than 0. Here $e_{i,\mathrm{KNN}}$ and $e_{i,\mathrm{adaptive}}$ are the errors made by KNNimpute and LSimpute_adaptive, respectively, when estimating missing value number $i$. The statistic

$$t = \frac{\bar{d}}{s_d / \sqrt{n}},$$

where $s_d$ is the empirical standard deviation of the $d_i$s, is t-distributed with $n - 1$ degrees of freedom (df) under the null hypothesis. Our null hypothesis states that $\bar{d} = 0$, while the alternative hypothesis states that $\bar{d} > 0$, i.e. that KNNimpute on average makes larger prediction errors.
We test whether the average difference in size of prediction error is significant for our three data sets by marking 5% of the values in each of our three data sets as missing, and taking the corresponding $d_i$s as our observations. Given these observations, the t-statistic for the lymphoma data set scored 22.89, corresponding to a P-value of $1.86 \times 10^{-112}$ (df = 7529). For the NCI60 data set, the t-statistic scored 25.68, corresponding to a P-value of $5.31 \times 10^{-139}$ (df = 6620). For the time series data set, the t-statistic scored 34.22, corresponding to a P-value of $2.16 \times 10^{-246}$ (df = 13 356). Thus we can conclude that the average estimation error is significantly larger using KNNimpute than that obtained using LSimpute_adaptive.
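The paired t-test described above can be sketched with the Python standard library alone. The function name and the toy error values are hypothetical. The sketch returns only the t-statistic and the degrees of freedom, not a P-value.

```python
import math
import statistics


def paired_t_statistic(errors_knn, errors_adaptive):
    """t-statistic and df for the differences d_i = |e_i,KNN| - |e_i,adaptive|."""
    if len(errors_knn) != len(errors_adaptive) or len(errors_knn) < 2:
        raise ValueError("need two equally long error lists with at least 2 entries")
    d = [abs(e_knn) - abs(e_ada) for e_knn, e_ada in zip(errors_knn, errors_adaptive)]
    n = len(d)
    d_bar = statistics.mean(d)
    s_d = statistics.stdev(d)  # empirical standard deviation (n - 1 denominator)
    t = d_bar / (s_d / math.sqrt(n))
    return t, n - 1


# Toy prediction errors for four missing values (made up for illustration).
t_value, df = paired_t_statistic([2.0, -3.0, 4.0, -5.0], [1.0, 1.0, -2.0, 2.0])
```

A one-sided P-value would then follow from the upper tail of the t-distribution with `df` degrees of freedom, for example via `scipy.stats.t.sf(t_value, df)` if SciPy is available.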
Finally, we comment on the running time required by the methods we study. All methods have been tested on a computer with a 2.8 GHz Pentium 4 CPU running under Linux. For KNNimpute, we used an optimized compilation of the original C++ code made by Troyanskaya et al. All other methods have been implemented in Java, and Java is started with the option -server, which optimizes Java applications. The time required to run one round of missing value imputation is recorded for all methods and summarized in Table . Note that the running time for LSimpute_array includes the time it takes to run LSimpute_gene, which is done in order to initialize the missing values before array-based estimation. Our most CPU-intensive LSimpute method, LSimpute_adaptive, runs equally fast or faster than KNNimpute in all cases except one. We also note that EMimpute_array is relatively fast, considering that it is an iterative method, while EMimpute_gene is by far the slowest method.
Summary of time usage of all methods with different percentages of missing values; all results are in seconds