For complex diseases such as cancer, diabetes and obesity, extensive biomedical studies have shown that clinical and environmental risk factors may not have sufficient predictive power for prognosis. In the past decade, we have witnessed unparalleled development in high-throughput technologies. Such development makes it possible to survey the whole genome and search for genomic markers that may have predictive power for disease prognosis. Gene signatures have been constructed for prognosis of breast cancer, ovarian cancer, lymphoma, obesity and many other diseases [1]. To avoid ambiguity, we limit ourselves to gene expressions measured using microarrays but note that the results may also be applicable to other profiling studies.
Denote $T$ as the survival time, which can be progression-free, overall, or another type of survival. Denote $C$ as the censoring time. Denote $Z$ as the length-$d$ vector of gene expression measurements. Under right censoring, one observes $(Y = \min(T, C),\ \delta = I(T \le C),\ Z)$. We assume that $T \sim \phi(\beta' Z)$, where $\beta$ is the regression coefficient. The most common choice for $\phi$ is the Cox proportional hazards model [3], while other models have also been adopted. Of note, there are a few model-free approaches. However, they have not been extensively used and will not be discussed.
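As a small illustration of the right-censored data structure, the sketch below builds the observed time Y = min(T, C) and the event indicator δ = I(T ≤ C) from latent survival and censoring times. The function name `observe_right_censored` and the toy values are hypothetical and chosen only for illustration; in a real study T and C are never both observed.

```python
def observe_right_censored(T, C):
    """Return (Y, delta) for paired survival times T and censoring times C.

    Y[i]     = min(T[i], C[i])            (observed follow-up time)
    delta[i] = 1 if T[i] <= C[i] else 0   (1 = event observed, 0 = censored)
    """
    Y = [min(t, c) for t, c in zip(T, C)]
    delta = [1 if t <= c else 0 for t, c in zip(T, C)]
    return Y, delta


# Hypothetical toy values (not from any study).
T = [5.0, 2.0, 8.0]   # latent survival times
C = [6.0, 1.5, 8.0]   # latent censoring times
Y, delta = observe_right_censored(T, C)
```

Here the second subject is censored at time 1.5, so only a lower bound on that subject's survival time is available to the prognosis model.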
Data generated in microarray studies have the ‘large d, small n’ characteristic: a typical study measures expressions of $10^{3}$ to $10^{4}$ genes on $10^{1}$ to $10^{3}$ subjects. When fitting regression models with the number of genes larger than the sample size, proper regularization is needed. In addition, among the thousands of genes profiled, only a subset is associated with disease prognosis. Identification of disease susceptibility genes can not only improve model fitting by removing noise but also lead to a better understanding of the mechanisms underlying disease development. Many statistical approaches have been developed for regularized estimation and gene selection. Examples include dimension reduction approaches such as partial least squares, singular value decomposition and principal component analysis; feature selection approaches including the filter, wrapper and embedded approaches; and hybrid approaches such as sparse principal component analysis. Comprehensive reviews of dimension reduction and variable selection methods have been provided by various authors [5–7].
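To see why regularization is needed when the number of genes exceeds the sample size, consider a deliberately tiny linear example with n = 1 subject and d = 2 genes. Several coefficient vectors reproduce the response exactly, so the unpenalized fit is not unique; an L1 (Lasso-type) penalty then discriminates among them. This is only a sketch with hypothetical toy numbers and an uncensored linear response, not the estimation procedure used in this article.

```python
def fitted_value(beta, z):
    """Linear predictor beta'z for a single subject."""
    return sum(b * x for b, x in zip(beta, z))


def l1_norm(beta):
    """Lasso-type penalty: sum of absolute coefficient values."""
    return sum(abs(b) for b in beta)


# Hypothetical toy data: n = 1 subject, d = 2 genes.
z = [1.0, 2.0]
y = 2.0

# Three different coefficient vectors that all reproduce y exactly.
candidates = [[2.0, 0.0], [0.0, 1.0], [1.0, 0.5]]
residuals = [y - fitted_value(b, z) for b in candidates]
penalties = [l1_norm(b) for b in candidates]

# Among these exact fits, pick the one with the smallest L1 norm.
best = min(candidates, key=l1_norm)
```

All three candidates have zero residual, so the data alone cannot choose among them. Among these candidates the L1 penalty is smallest for the vector that uses only the second gene, which illustrates how a sparsity-inducing penalty performs estimation and gene selection at the same time.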
Despite great efforts on gene selection techniques, insufficient attention has been paid to the underlying prognosis models. Usually, it is assumed that a certain model (for example, the Cox model) holds, while insufficient justification is provided to support such an assumption. As shown by Fleming and Harrington [8] and Klein and Moeschberger [4], the Cox model may fail, and alternative models, such as the accelerated failure time (AFT) model and the additive risk model, may fit the data better. Although various model diagnosis methods have been proposed, their validity is established under the assumption that the number of covariates is much smaller than the sample size. To the best of our knowledge, none of these methods has been proved valid under the ‘large d, small n’ setup.
In this article, we first review three semiparametric prognosis models. We then describe the Lasso approach, which will be used for gene identification in this study. We also describe cross-validation-based approaches for evaluating the overall prediction performance (of all identified genes combined) and the reproducibility of each identified gene. We conduct a simulation study and show that gene identification can be unsatisfactory under model misspecification. We analyze data from three cancer prognosis studies under the three different models. We show that the gene identification results, prediction performance and reproducibility of identified genes are model-dependent. The article concludes with a discussion.
It is not our intention to show the superiority of certain models or to develop a generically applicable model comparison/selection criterion. Rather, our intention is to raise awareness of the multiple available prognosis models and to demonstrate that different datasets may demand different prognosis models.