An ideal prediction model should be highly accurate, robust and simple for clinical utility. To pursue these standards, we developed the MBP method, which takes advantage of information from genes sharing similar expression patterns. The results of the current study show that the prediction accuracies of the MBP method are slightly better than those of the GBP method in both within-study and inter-study predictions. Furthermore, the MBP method is superior to the GBP method in being robust to gene missingness and to experimental noise. The results show great potential for MBP to improve inter-study prediction in microarray studies and enhance the application of this technology to clinical practice.
The literature shows that completely different prediction models can achieve equally high prediction accuracy. For example, after the well-known 70-gene signature for predicting breast cancer patient survival was proposed (van't Veer et al.), other investigators derived six additional classifiers that performed as well as the 70-gene signature on the same dataset (Ein-Dor et al.). Disparities among the gene signatures used to predict similar outcomes in different studies have also been reported (Ramaswamy et al.; Sorlie et al.; van't Veer et al.). It is therefore important that a prediction method allow reasonable inter-study validation across relevant published studies. The stability of the MBP method observed in the present study results from grouping genes that share a similar expression pattern and selecting one gene to represent each group. It has been postulated that using a cluster average would yield higher prediction accuracy under certain conditions (Park et al.). Although the MBP method only slightly outperforms the GBP method in prediction accuracy, its major advantage is the robustness of its predictions.
The clinical utility of a genomic prediction model relies heavily on the model's simplicity and reproducibility. Recent cross-platform analyses used the genes common to all datasets (Bhanot et al.; Bloom et al.; Bosotti et al.; Cheadle et al.; Nilsson et al.), an approach that requires information from every dataset involved in the analysis. This approach is appropriate for meta-analysis in biomarker detection but is inadequate for cross-platform prediction. A prediction requires two elements: (i) a selected gene signature and (ii) a prediction model. When constructing the prediction model requires the genes common to the training and test studies, the selected signature must be readjusted whenever the test study uses a different platform, which makes the model inconvenient to validate and to use clinically. Furthermore, building the prediction model from intersection genes alone discards information in the training data, which makes this approach less desirable. MBP is a natural solution to these hurdles.
A lack of reproducibility hinders the application of genomic prediction models. Many factors may affect model reproducibility; the MBP method addresses two of them: gene missingness and experimental noise. Robustness to missing genes is provided by the grouped decision within modules, and the already small probability of model failure is further controlled by merging small modules into their nearest modules in our algorithm. Robustness to noise in expression measurements was assessed by testing on the Luo dataset. Although the MBP method was robust to added noise, the pattern of the added noise may not adequately represent experimental variation in real data. Further study will focus on evaluating real data or on introducing variation other than Gaussian noise.
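The merging step mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `merge_small_modules`, the `min_size` threshold, and the use of Euclidean distance between module centroids to define the "nearest" module are all assumptions made for illustration.

```python
import numpy as np

def merge_small_modules(expr, labels, min_size):
    """Reassign genes in modules smaller than min_size to the nearest large module.

    expr:     (n_genes, n_samples) expression matrix
    labels:   module label per gene, e.g. from K-means
    min_size: hypothetical size threshold below which a module is merged
    "Nearest" is taken here as Euclidean distance between centroids (an assumption).
    """
    expr = np.asarray(expr, dtype=float)
    labels = np.asarray(labels).copy()
    ids, counts = np.unique(labels, return_counts=True)
    large = ids[counts >= min_size]
    small = ids[counts < min_size]
    if large.size == 0:
        raise ValueError("no module meets min_size")
    # Centroids of the modules that are kept
    centroids = np.stack([expr[labels == k].mean(axis=0) for k in large])
    for k in small:
        small_centroid = expr[labels == k].mean(axis=0)
        dists = np.linalg.norm(centroids - small_centroid, axis=1)
        labels[labels == k] = large[np.argmin(dists)]
    return labels

# Toy data: two modules of three genes each and one singleton module (label 2)
expr = np.array([
    [0.0, 0.0], [0.2, 0.0], [0.1, 0.1],
    [10.0, 10.0], [10.2, 10.0], [9.9, 10.1],
    [9.0, 9.0],
])
labels = np.array([0, 0, 0, 1, 1, 1, 2])
merged = merge_small_modules(expr, labels, min_size=2)
```

In this toy case the singleton module is absorbed by module 1, so every remaining module holds several genes and is less likely to be left empty when some genes are missing from a test platform.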
In addition to demonstrating the clinical applicability of MBP, this study introduced several novel elements in the algorithm. First, to our knowledge this is the first demonstration that cluster sizes generated by K-means consistently follow a multinomial distribution, and we propose a cluster-merging procedure to avoid prediction failure due to gene missingness. Second, we summarized each module by a representative gene, namely the gene with the smallest summed distance to all other genes in the module (similar to the ‘sample median’ concept in estimating a mean parameter). The representative is an actual gene, with better annotation and interpretation than a pseudo-gene such as the eigen-gene or averaged gene vector used in many methods. Although we do not have enough evidence to establish the superiority of median representative genes, this procedure is conceptually more robust to incidental noise and is easier to interpret. Third, MBP reduces redundant gene features by summarizing similar gene expression profiles within each module, which diminishes gene collinearity and provides a novel technique for data reduction.
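The representative-gene selection described above amounts to picking the medoid of each module. The following is a minimal sketch under stated assumptions: the function name `representative_gene` is hypothetical, and Euclidean distance between expression profiles is assumed because the distance measure is not specified here.

```python
import numpy as np

def representative_gene(module_expr):
    """Return the row index of the gene with the smallest summed distance
    to all other genes in the module (the module medoid).

    module_expr: (n_genes, n_samples) expression matrix for one module.
    Euclidean distance is an assumption made for this sketch.
    """
    module_expr = np.asarray(module_expr, dtype=float)
    # Pairwise differences between every pair of gene profiles
    diffs = module_expr[:, None, :] - module_expr[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    return int(np.argmin(dist.sum(axis=1)))

# Toy module: three genes measured in two samples
module = np.array([
    [0.0, 0.0],
    [1.0, 0.0],
    [5.0, 0.0],
])
rep = representative_gene(module)  # summed distances are 6, 5 and 9
```

Here the averaged gene vector would be (2, 0), which is pulled toward the outlying third gene and corresponds to no measured gene, whereas the medoid is the second gene, an actual annotated feature.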
One limitation of the MBP method is that its modules are not yet linked to, or interpreted in terms of, known biological pathways. Further investigation will integrate pathways from biological databases as supervised modules to improve performance. Proper normalization across studies is another key to successful inter-study prediction. Our recent publication (Cheng et al.) discussed gene-wise normalization in addition to the commonly practiced sample-wise normalization. The MBP method proposed in this article approaches robust inter-study prediction from another angle and can potentially be combined with these advanced normalization methods to enhance prediction accuracy.
Recently, deep sequencing technology has emerged as an attractive alternative to microarrays for genotyping, analysis of methylation patterns, identification of transcription factor binding sites and quantification of gene expression. Its digital quantification is far more precise than that of microarrays, although its widespread application is still limited by high cost. As the price falls in the near future, we expect this technology to become more popular. Our proposed MBP method can be extended to deep sequencing data, where the feature dimensionality is even higher than in microarray data. The speed of K-means clustering and the rapid dimensionality reduction provided by gene modules make MBP well suited to this type of extremely high-throughput technology.