The development of high-throughput technologies makes it possible to survey the whole genome. Genomic studies have been extensively conducted, searching for markers with predictive power for prognosis of complex diseases such as cancer, diabetes and obesity. Most existing statistical analyses focus on developing marker selection techniques, while little attention is paid to the underlying prognosis models. In this article, we review three commonly used prognosis models, namely the Cox, additive risk and accelerated failure time models. We conduct simulations and show that gene identification can be unsatisfactory under model misspecification. We analyze three cancer prognosis studies under the three models, and show that the gene identification results, prediction performance of all identified genes combined, and reproducibility of each identified gene are model-dependent. We suggest that in practical data analysis, more attention should be paid to the model assumption, and multiple models may need to be considered.
genomic studies; semiparametric prognosis models; model comparison
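For reference, the three prognosis models named above are commonly written as follows. The notation is standard textbook notation assumed here, not taken from the article: T is the survival time, X the vector of gene expressions, β the regression coefficients, λ0 an unspecified baseline hazard, and ε an error term with unspecified distribution.

```latex
\begin{align*}
\text{Cox:} \quad & \lambda(t \mid X) = \lambda_0(t)\,\exp(\beta^\top X) \\
\text{Additive risk:} \quad & \lambda(t \mid X) = \lambda_0(t) + \beta^\top X \\
\text{AFT:} \quad & \log T = \alpha + \beta^\top X + \varepsilon
\end{align*}
```

The same gene can therefore act multiplicatively on the hazard, additively on the hazard, or directly on the log survival time, which is one way to see why selection results can depend on the model.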
In breast cancer research, it is important to identify genomic markers associated with prognosis. Multiple microarray gene expression profiling studies have been conducted, searching for prognosis markers. Genomic markers identified from the analysis of single datasets often suffer a lack of reproducibility because of small sample sizes. Integrative analysis of data from multiple independent studies has a larger sample size and may provide a cost-effective solution.
We collect four breast cancer prognosis studies with gene expression measurements. An accelerated failure time (AFT) model with an unknown error distribution is adopted to describe survival. An integrative sparse boosting approach is employed for marker selection. The proposed model and boosting approach can effectively accommodate heterogeneity across multiple studies and identify genes with consistent effects.
Simulation study shows that the proposed approach outperforms alternatives including meta-analysis and intensity approaches by identifying the majority or all of the true positives, while having a low false positive rate. In the analysis of breast cancer data, 44 genes are identified as associated with prognosis. Many of the identified genes have been previously suggested as associated with tumorigenesis and cancer prognosis. The identified genes and corresponding predicted risk scores differ from those using alternative approaches. Monte Carlo-based prediction evaluation suggests that the proposed approach has the best prediction performance.
Integrative analysis may provide an effective way of identifying breast cancer prognosis markers. Markers identified using the integrative sparse boosting analysis have sound biological implications and satisfactory prediction performance.
Breast cancer prognosis; Gene Expression; Integrative analysis; Sparse boosting
Transcriptional coactivator p300 has been shown to play a variety of roles in the transcription process and mutation of p300 has been found in certain types of human cancers. However, the expression dynamics of p300 in breast cancer (BC) and its effect on BC patients’ prognosis are poorly understood.
In the present study, the methods of tissue microarray and immunohistochemistry (IHC) were used to investigate the protein expression of p300 in BCs. Receiver operating characteristic (ROC) curve analysis, Spearman’s rank correlation, Kaplan-Meier plots and Cox proportional hazards regression model were utilized to analyze the data.
Based on the ROC curve analysis, high expression of p300 was defined as an H score above 105. High expression of p300 was observed in 105/193 (54.4%) of BCs and in 6/25 (24.0%) of non-malignant breast tissues (P=0.004). Further correlation analysis showed that high expression of p300 was positively correlated with higher histological grade, advanced clinical stage and tumor recurrence (P<0.05). In univariate survival analysis, high expression of p300 was significantly associated with shortened patient survival and poor progression-free survival (P<0.05). Importantly, p300 expression was identified as an independent prognostic factor in multivariate analysis (P<0.05).
Our findings provide a basis for the concept that high expression of p300 in BC may be important in the acquisition of a recurrence phenotype, suggesting that p300 high expression, as examined by IHC, is an independent biomarker for poor prognosis of patients with BC.
Breast cancer; p300; Tumor recurrence; Prognosis
The objective of this study was to investigate the effects of the number of metastatic lymph nodes (pN) and the metastatic lymph node ratio (MLR) on the post-surgical prognosis of Chinese patients with esophageal cancer (EC) and lymph node metastasis.
We enrolled 353 patients who received primary curative resection for EC from 1990 to 2003. The association of pN and MLR with 5-year overall survival (OS) was examined by receiver operating characteristic (ROC) and area under the curve (AUC) analysis. The Kaplan-Meier method was used to calculate survival rates, and survival curves were compared with the log-rank test. The Cox model was employed for univariate and multivariate analyses of factors associated with 5-year OS.
The median follow-up time was 41 months, and the 1-, 3- and 5-year OS rates were 71.2%, 30.4%, and 19.5%, respectively. Univariate analysis showed that age, pN stage, and the MLR were prognostic factors for OS. Patients with MLRs less than 0.15, MLRs of 0.15-0.30, and MLRs greater than 0.30 had 5-year OS rates of 30.1%, 17.8%, and 9.5%, respectively (p < 0.001). Patients classified as pN1, pN2, and pN3 had 5-year OS rates of 23.7%, 11.4%, and 9.9%, respectively (p < 0.001). Multivariate analysis indicated that a high MLR and advanced age were significant and independent risk factors for poor OS. Patients classified as pN2 had significantly worse OS than those classified as pN1 (p = 0.022), but those classified as pN3 had OS similar to that of those classified as pN1 (p = 0.166). ROC analysis indicated that MLR (AUC = 0.585, p = 0.016) had better predictive value than pN (AUC = 0.565, p = 0.068).
The integrated use of MLR and pN may be suitable for evaluation of OS in Chinese patients with EC and positive nodal metastasis after curative resection.
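The survival rates in this abstract are Kaplan-Meier (product-limit) estimates. A minimal Python sketch of that estimator is given below; the function names and the toy follow-up data are hypothetical illustrations and are not taken from the study.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times  : follow-up time of each patient (e.g. months)
    events : 1 if death was observed at that time, 0 if censored
    Returns a list of (t, S(t)) pairs, one per distinct death time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients who leave the risk set at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= removed
    return curve


def survival_at(curve, t):
    """Read S(t) off the step function returned by kaplan_meier."""
    s = 1.0
    for time, surv in curve:
        if time <= t:
            s = surv
        else:
            break
    return s


# Hypothetical toy data: eight patients, follow-up in months.
toy_times = [6, 12, 12, 24, 36, 41, 60, 60]
toy_events = [1, 1, 0, 1, 0, 1, 0, 0]
toy_curve = kaplan_meier(toy_times, toy_events)
```

Calling `survival_at(toy_curve, 60)` then gives the estimated 5-year survival for the toy cohort; comparing such curves between groups (for example, MLR strata) is what the log-rank test formalizes.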
In breast cancer research, it is of great interest to identify genomic markers associated with prognosis. Multiple gene profiling studies have been conducted for such a purpose. Genomic markers identified from the analysis of single datasets often do not have satisfactory reproducibility. Among the multiple possible reasons, the most important one is the small sample sizes of individual studies. A cost-effective solution is to pool data from multiple comparable studies and conduct integrative analysis. In this study, we collect four breast cancer prognosis studies with gene expression measurements. We describe the relationship between prognosis and gene expressions using the accelerated failure time (AFT) models. We adopt a 2-norm group bridge penalization approach for marker identification. This integrative analysis approach can effectively identify markers with consistent effects across multiple datasets and naturally accommodate the heterogeneity among studies. Statistical and simulation studies demonstrate satisfactory performance of this approach. Breast cancer prognosis markers identified using this approach have sound biological implications and satisfactory prediction performance.
Breast cancer prognosis; Gene expression; Marker identification; Integrative analysis; 2-norm group bridge
An important application of microarrays is to discover genomic biomarkers, among tens of thousands of genes assayed, for disease diagnosis and prognosis. Thus it is of interest to develop efficient statistical methods that can simultaneously identify important biomarkers from such high-throughput genomic data and construct appropriate classification rules. It is also of interest to develop methods for evaluation of classification performance and ranking of identified biomarkers.
The ROC (receiver operating characteristic) technique has been widely used in disease classification with low dimensional biomarkers. Compared with the empirical ROC approach, the binormal ROC is computationally more affordable and robust in small sample size cases. We propose using the binormal AUC (area under the ROC curve) as the objective function for two-sample classification, and the scaled threshold gradient directed regularization method for regularized estimation and biomarker selection. Tuning parameter selection is based on V-fold cross validation. We develop Monte Carlo based methods for evaluating the stability of individual biomarkers and overall prediction performance. Extensive simulation studies show that the proposed approach can generate parsimonious models with excellent classification and prediction performance, under most simulated scenarios including model mis-specification. Application of the method to two cancer studies shows that the identified genes are reasonably stable with satisfactory prediction performance and biologically sound implications. The overall classification performance is satisfactory, with small classification errors and large AUCs.
In comparison to existing methods, the proposed approach is computationally more affordable without losing the optimality possessed by the standard ROC method.
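To make the contrast between the binormal and empirical AUC concrete, the sketch below computes both for a single scalar score, assuming the scores are normally distributed within each class. It illustrates only the two AUC definitions; the regularized estimation and gene selection steps of the proposed method are not shown, and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def binormal_auc(controls, cases):
    """AUC under the binormal model: Phi((mu1 - mu0) / sqrt(s0^2 + s1^2))."""
    mu0, mu1 = mean(controls), mean(cases)
    var0, var1 = variance(controls), variance(cases)
    return NormalDist().cdf((mu1 - mu0) / sqrt(var0 + var1))


def empirical_auc(controls, cases):
    """Empirical AUC: fraction of (case, control) pairs ranked correctly,
    counting ties as one half (the Mann-Whitney statistic)."""
    wins = 0.0
    for y in cases:
        for x in controls:
            if y > x:
                wins += 1.0
            elif y == x:
                wins += 0.5
    return wins / (len(cases) * len(controls))
```

The binormal version needs only two means and two variances, which is why it is cheap to evaluate and smooth in the score, whereas the empirical version is a step function of the pairwise comparisons.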
This study aimed to validate whether FAM70B, which was found in our micro-array profiling as a prognostic marker for cancer survival, could accurately predict prognosis in patients with muscle-invasive bladder cancer (MIBC).
A total of 124 patients with MIBC were enrolled in this study. The FAM70B expression level was analyzed by real-time polymerase chain reaction by using RNA from tumor tissues. The prognostic effect of FAM70B was evaluated by Kaplan-Meier analysis and a multivariate Cox regression model.
Kaplan-Meier estimates showed a significant difference in progression-free survival (log-rank test, p=0.011) and cancer-specific survival (log-rank test, p=0.017) according to FAM70B gene expression level. By multivariate Cox regression analysis, high FAM70B expression was predictive of cancer progression (hazard ratio [HR], 2.115; p=0.013) and cancer-specific death (HR, 1.925; p=0.033). In the subgroup analysis, high expression of FAM70B was associated with poor cancer-specific survival, progression-free survival, and overall survival in the patients who underwent cystectomy (log-rank test, p=0.013, p=0.036, p=0.005, respectively). In the chemotherapy group, FAM70B expression was associated with cancer-specific survival and progression-free survival (log-rank test, p=0.013, p=0.042, respectively). Moreover, high FAM70B expression was associated with shorter cancer-specific survival in localized or locally advanced tumor stages (log-rank test, p=0.016).
We confirmed the significance of FAM70B as a prognostic marker in a validation cohort. Therefore, we propose that the FAM70B gene could be used to more precisely predict cancer progression and cancer-specific death in patients with MIBC.
Bladder cancer; Gene expression profiling; Micro-array; Prognosis
ZEB2 has been suggested to mediate EMT and disease aggressiveness in several types of human cancers. However, the expression patterns of ZEB2 in hepatocellular carcinoma (HCC) and its effect on prognosis of HCC patients treated with hepatectomy are unclear.
In this study, the methods of tissue microarray and immunohistochemistry (IHC) were utilized to investigate ZEB2 expression in HCC and peritumoral liver tissue (PLT). Receiver operating characteristic (ROC) curve analysis, Spearman's rank correlation, Kaplan-Meier plots and the Cox proportional hazards regression model were used to analyze the data. Up-regulated expression of cytoplasmic/nuclear ZEB2 protein was observed in the majority of PLTs, when compared to HCCs. Further analysis showed that overexpression of cytoplasmic ZEB2 in HCCs was inversely correlated with AFP level, tumor size and differentiation (P<0.05). Also, overexpression of cytoplasmic ZEB2 in PLTs correlated with lower AFP level (P<0.05). In univariate survival analysis, a significant association between overexpression of cytoplasmic ZEB2 by HCCs/PLTs and longer patient survival was found (P<0.05). Importantly, cytoplasmic ZEB2 expression in PLTs was identified as an independent prognostic factor in multivariate analysis (P<0.05). Consequently, a new clinicopathologic prognostic model with cytoplasmic ZEB2 expression (including HCCs and PLTs) was constructed. The model could significantly stratify risk (low, intermediate and high) for overall survival (P = 0.002).
Our findings provide a basis for the concept that cytoplasmic ZEB2 expressed by PLTs can predict the postoperative survival of patients with HCC. The combined cytoplasmic ZEB2 prognostic model may become a useful tool for identifying patients with different clinical outcomes.
High-throughput studies have been extensively conducted in the research of complex human diseases. As a representative example, consider gene-expression studies where thousands of genes are profiled at the same time. An important objective of such studies is to rank the diagnostic accuracy of biomarkers (e.g. gene expressions) for predicting outcome variables while properly adjusting for confounding effects from low-dimensional clinical risk factors and environmental exposures. Existing approaches are often fully based on parametric or semi-parametric models and target evaluating estimation significance as opposed to diagnostic accuracy. Receiver operating characteristic (ROC) approaches can be employed to tackle this problem. However, existing ROC ranking methods focus on biomarkers only and ignore effects of confounders. In this article, we propose a model-based approach which ranks the diagnostic accuracy of biomarkers using ROC measures with a proper adjustment of confounding effects. To this end, three different methods for constructing the underlying regression models are investigated. Simulation study shows that the proposed methods can accurately identify biomarkers with additional diagnostic power beyond confounders. Analysis of two cancer gene-expression studies demonstrates that adjusting for confounders can lead to substantially different rankings of genes.
ranking biomarkers; ROC; confounders; high-throughput data
Researchers in the field of bioinformatics often face a challenge of combining several ordered lists in a proper and efficient manner. Rank aggregation techniques offer a general and flexible framework that allows one to objectively perform the necessary aggregation. With the rapid growth of high-throughput genomic and proteomic studies, the potential utility of rank aggregation in the context of meta-analysis becomes even more apparent. One of the major strengths of rank-based aggregation is the ability to combine lists coming from different sources and platforms, for example different microarray chips, which may or may not be directly comparable otherwise.
The RankAggreg package provides two methods for combining the ordered lists: the Cross-Entropy method and the Genetic Algorithm. Two examples of rank aggregation using the package are given in the manuscript: one in the context of clustering based on gene expression, and the other one in the context of meta-analysis of prostate cancer microarray experiments.
The two examples described in the manuscript clearly show the utility of the RankAggreg package in the current bioinformatics context where ordered lists are routinely produced as a result of modern high-throughput technologies.
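As a rough illustration of what rank aggregation does, the sketch below combines several ordered lists with a simple Borda count. This is a deliberately simpler technique than the Cross-Entropy and Genetic Algorithm methods implemented in RankAggreg, it is written in Python rather than R, and the function name and gene labels are hypothetical.

```python
def borda_aggregate(ranked_lists):
    """Aggregate ordered lists by summing ranks (Borda count).

    Each list is ordered best-first. An item missing from a list is
    assigned rank len(list) + 1 for that list, which handles partial
    (top-k) lists from different sources. Ties in the total are broken
    alphabetically so the output is deterministic.
    """
    items = set()
    for lst in ranked_lists:
        items.update(lst)
    total = {item: 0 for item in items}
    for lst in ranked_lists:
        position = {item: rank for rank, item in enumerate(lst, start=1)}
        for item in items:
            total[item] += position.get(item, len(lst) + 1)
    return sorted(items, key=lambda item: (total[item], item))


# Hypothetical top-3 lists from three platforms.
lists = [
    ["geneA", "geneB", "geneC"],
    ["geneB", "geneA", "geneD"],
    ["geneA", "geneC", "geneB"],
]
consensus = borda_aggregate(lists)
```

Because only ranks enter the calculation, the lists do not need to come from platforms with comparable measurement scales, which is the property the abstract highlights.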
Extensive biomedical studies have shown that clinical and environmental risk factors may not have sufficient predictive power for cancer prognosis. The development of high-throughput profiling technologies makes it possible to survey the whole genome and search for genomic markers with predictive power. Many existing studies assume the interchangeability of gene effects and ignore the coordination among them.
We adopt the weighted co-expression network to describe the interplay among genes. Although there are several different ways of defining gene networks, the weighted co-expression network may be preferred because of its computational simplicity, satisfactory empirical performance, and because it does not demand additional biological experiments. For cancer prognosis studies with gene expression measurements, we propose a new marker selection method that can properly incorporate the network connectivity of genes. We analyze six prognosis studies on breast cancer and lymphoma. We find that the proposed approach can identify genes that are significantly different from those identified using alternatives. We search published literature and find that genes identified using the proposed approach are biologically meaningful. In addition, they have better prediction performance and reproducibility than genes identified using alternatives.
The network contains important information on the functionality of genes. Incorporating the network structure can improve cancer marker identification.
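A weighted co-expression network is commonly built by raising absolute pairwise correlations to a power, and a gene's connectivity is then the sum of its edge weights. The sketch below follows that common construction; the soft-threshold power `beta=6` and the function names are assumptions made for illustration, since the abstract does not state the exact settings used.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences.
    Assumes neither sequence is constant (non-zero variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def weighted_adjacency(expr, beta=6):
    """expr[g] holds the expression profile of gene g across samples.
    Returns the matrix a[i][j] = |cor(gene i, gene j)| ** beta,
    with zeros on the diagonal."""
    p = len(expr)
    adj = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1, p):
            weight = abs(pearson(expr[i], expr[j])) ** beta
            adj[i][j] = adj[j][i] = weight
    return adj


def connectivity(adj):
    """Connectivity of each gene: sum of its edge weights."""
    return [sum(row) for row in adj]
```

Raising the correlation to a power shrinks weak correlations toward zero while keeping strong ones, so no hard cutoff is needed and no additional experiments beyond the expression data are required.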
We applied stochastic search variable selection (SSVS), a Bayesian model selection method, to the simulated data of Genetic Analysis Workshop 13. We used SSVS with the revisited Haseman-Elston method to find the markers linked to the loci determining change in cholesterol over time. To study gene-gene interaction (epistasis) and gene-environment interaction, we adopted prior structures, which incorporate the relationship among the predictors. This allows SSVS to search in the model space more efficiently and avoid the less likely models.
In applying SSVS, instead of looking at the posterior distribution of each of the candidate models, which is sensitive to the setting of the prior, we ranked the candidate variables (markers) according to their marginal posterior probability, which was shown to be more robust to the prior. Compared with traditional methods that consider one marker at a time, our method considers all markers simultaneously and obtains more favorable results.
We showed that SSVS is a powerful method for identifying linked markers using the Haseman-Elston method, even for weak effects. SSVS is very effective because it does a smart search over the entire model space.
The predictiveness curve is a graphical tool that characterizes the population distribution of Risk(Y) = P(D = 1|Y), where D denotes a binary outcome such as occurrence of an event within a specified time period and Y denotes predictors. A wider distribution of Risk(Y) indicates better performance of a risk model in the sense that making treatment recommendations is easier for more subjects. Decisions are more straightforward when a subject's risk is deemed to be high or low. Methods have been developed to estimate predictiveness curves from cohort studies. However, early phase studies to evaluate novel risk prediction markers typically employ case-control designs. Here we present semiparametric and nonparametric methods for evaluating a continuous risk prediction marker that accommodate case-control data. Small sample properties are investigated through simulation studies. The semiparametric methods are substantially more efficient than their nonparametric counterparts under a correctly specified model. We generalize them to settings where multiple prediction markers are involved. Applications to prostate cancer risk prediction markers illustrate methods for comparing the risk prediction capacities of markers and for evaluating the increment in performance gained by adding a marker to a baseline risk model. We propose a modified Hosmer-Lemeshow test for case-control study data to assess calibration of the risk model that is a natural complement to this graphical tool.
biomarker; case-control study; classification; Hosmer-Lemeshow test; predictiveness curve; risk; ROC curve
It is of particular interest to identify cancer-specific molecular signatures for early diagnosis, monitoring effects of treatment and predicting patient survival time. Molecular information about patients is usually generated from high throughput technologies such as microarray and mass spectrometry. Statistically, we are challenged by the large number of candidates but only a small number of patients in the study, and the right-censored clinical data further complicate the analysis.
We present a two-stage procedure to profile molecular signatures for survival outcomes. Firstly, we group closely-related molecular features into linkage clusters, each portraying either similar or opposite functions and playing similar roles in prognosis; secondly, a Bayesian approach is developed to rank the centroids of these linkage clusters and provide a list of the main molecular features closely related to the outcome of interest. A simulation study showed the superior performance of our approach. When it was applied to data on diffuse large B-cell lymphoma (DLBCL), we were able to identify some new candidate signatures for disease prognosis.
This multivariate approach provides researchers with a more reliable list of molecular features profiled in terms of their prognostic relationship to the event times, and generates dependable information for subsequent identification of prognostic molecular signatures through either biological procedures or further data analysis.
Despite recent progress in the identification of genetic and molecular alterations in prostate cancer, markers associated with tumor progression are scarce. Therefore precise diagnosis of patients and prognosis of the disease remain difficult. This study investigated novel molecular markers discriminating between low and highly aggressive types of prostate cancer.
Using 52 microdissected cell populations of low- and high-risk prostate tumors, we identified via global cDNA microarray analysis almost 1200 genes that were differentially expressed between these groups. These genes were analyzed by statistical, pathway and gene enrichment methods. Twenty selected candidate genes were verified by quantitative real-time PCR and immunohistochemistry. In concordance with the mRNA levels, two genes, MAP3K5 and PDIA3, showed differential protein expression. Functional characterization of PDIA3 revealed a pro-apoptotic role of this gene in PC3 prostate cancer cells.
Our analyses provide deeper insights into the molecular changes occurring during prostate cancer progression. The genes MAP3K5 and PDIA3 are associated with malignant stages of prostate cancer and therefore provide novel potential biomarkers.
Human heparanase plays an important role in cancer development and single nucleotide polymorphisms (SNPs) in the heparanase gene (HPSE) have been shown to be correlated with gastric cancer. The present study examined the associations between individual SNPs or haplotypes in HPSE and susceptibility, clinicopathological parameters and prognosis of gastric cancer in a large sample of the Han population in northern China.
Genomic DNA was extracted from formalin-fixed, paraffin-embedded normal gastric tissue samples from 404 patients and from blood from 404 healthy controls. Six SNPs were genotyped by matrix-assisted laser desorption/ionization time-of-flight mass spectrometry. A chi-square (χ2) test and unconditional logistic regression were used to analyze the risk of gastric cancer; a log-rank test and the Cox proportional hazards model were used for survival analysis, and the Kaplan-Meier method was used to plot survival curves. The mean genotyping success rates were more than 99% in both groups. Haplotype CA in the block composed of rs11099592 and rs4693608 was more frequent in the group of Borrmann types 3 and 4 (P = 0.037) and in the group with a greater number of lymph node metastases (N3 vs N0 group, P = 0.046), and moreover was correlated with poor survival (CG vs CA: HR = 0.645, 95%CI: 0.421–0.989, P = 0.044). In addition, genotypes rs4693608 AA and rs4364254 TT were associated with poor survival (P = 0.030, HR = 1.527, 95%CI: 1.042–2.238 for rs4693608 AA; P = 0.013, HR = 1.546, 95%CI: 1.096–2.181 for rs4364254 TT). There were no correlations between individual SNPs or haplotypes and gastric cancer risk.
A functional haplotype in HPSE was found, which included the important SNP rs4693608. SNPs in HPSE play an important role in gastric cancer progression and survival, and may serve as molecular markers for prognosis and treatment.
Microarray studies provide a way of linking variations of phenotypes with their genetic causes. Constructing predictive models using high-dimensional microarray measurements usually consists of three steps: (1) unsupervised gene screening; (2) supervised gene screening; and (3) statistical model building. Supervised gene screening based on marginal gene ranking is commonly used to reduce the number of genes in the model building. Various simple statistics, such as the t-statistic or signal-to-noise ratio, have been used to rank genes in the supervised screening. Despite its extensive use, statistical study of supervised gene screening remains scarce. Our study is partly motivated by the differences in gene discovery results caused by using different supervised gene screening methods.
We investigate concordance and reproducibility of supervised gene screening based on eight commonly used marginal statistics. Concordance is assessed by the relative fractions of overlaps between top-ranked genes screened using different marginal statistics. We propose a Bootstrap Reproducibility Index, which measures reproducibility of individual genes under the supervised screening. Empirical studies are based on four public microarray datasets. We consider the cases where the top 20%, 40% and 60% of genes are screened.
From a gene discovery point of view, the effect of supervised gene screening based on different marginal statistics cannot be ignored. Empirical studies show that (1) genes that pass different supervised screenings may be considerably different; (2) concordance may vary, depending on the underlying data structure and percentage of selected genes; (3) evaluated with the Bootstrap Reproducibility Index, genes that pass supervised screenings are only moderately reproducible; and (4) concordance cannot be improved by supervised screening based on reproducibility.
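The abstract does not give the formula for the Bootstrap Reproducibility Index, so the sketch below is only a stand-in that conveys the general idea: resample the data, repeat the marginal screening, and record how often each gene passes. The Welch-type t-statistic, the within-class resampling, and all function and parameter names are assumptions made for illustration.

```python
import random
from math import sqrt
from statistics import mean, variance


def abs_t_stat(a, b):
    """Absolute Welch-type t-statistic for two groups of values."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    diff = abs(mean(a) - mean(b))
    if se == 0.0:
        # Degenerate resample: no within-group variation.
        return 0.0 if diff == 0.0 else float("inf")
    return diff / se


def bootstrap_selection_frequency(group0, group1, top_frac=0.2,
                                  n_boot=200, seed=0):
    """group0, group1: lists of samples, each sample a list of gene values.
    Returns, for each gene, the fraction of bootstrap resamples in which
    it ranks within the top `top_frac` of genes by |t|."""
    rng = random.Random(seed)
    p = len(group0[0])
    k = max(1, round(top_frac * p))
    counts = [0] * p
    for _ in range(n_boot):
        # Resample subjects with replacement within each class.
        b0 = [rng.choice(group0) for _ in group0]
        b1 = [rng.choice(group1) for _ in group1]
        scores = [abs_t_stat([s[g] for s in b0], [s[g] for s in b1])
                  for g in range(p)]
        top = sorted(range(p), key=lambda g: scores[g], reverse=True)[:k]
        for g in top:
            counts[g] += 1
    return [c / n_boot for c in counts]
```

A gene with a frequency near 1 passes the screening in almost every resample, while a frequency near `top_frac` is what would be expected from chance alone; swapping `abs_t_stat` for another marginal statistic allows the kind of comparison the abstract describes.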
With the growing number of public repositories for high-throughput genomic data, it is of great interest to combine the results produced by independent research groups. Such a combination allows the identification of common genomic factors across multiple cancer types and provides new insights into the disease process. In the framework of the proportional hazards model, classical procedures, which consist of ranking genes according to the estimated hazard ratio or the p-value obtained from a test statistic of no association between survival and gene expression level, are not suitable for gene selection across multiple genomic datasets with different sample sizes. We propose a novel index for identifying genes with a common effect across heterogeneous genomic studies designed to remain stable whatever the sample size and which has a straightforward interpretation in terms of the percentage of separability between patients according to their survival times and gene expression measurements.
The simulation results show that the proposed index is not substantially affected by the sample size of the study or by censoring. They also show that its separability performance is higher than that of indices of predictive accuracy relying on the likelihood function. A simulated example illustrates the good operating characteristics of our index. In addition, we demonstrate that it is linked to the score statistic and possesses a biologically relevant interpretation.
The practical use of the index is illustrated for identifying genes with common effects across eight independent genomic cancer studies of different sample sizes. The meta-selection allows the identification of four genes (ESPL1, KIF4A, HJURP, LRIG1) that are biologically relevant to the carcinogenesis process and have a prognostic impact on survival outcome across various solid tumors.
The proposed index is a promising tool for identifying factors having a prognostic impact across a collection of heterogeneous genomic datasets of various sizes.
A rank-based variable selection procedure is developed for the semiparametric accelerated failure time model with censored observations where the penalized likelihood (partial likelihood) method is not directly applicable.
The new method penalizes the rank-based Gehan-type loss function with the ℓ1 penalty. To correctly choose the tuning parameters, a novel likelihood-based χ2-type criterion is proposed. Desirable properties of the estimator such as the oracle properties are established through the local quadratic expansion of the Gehan loss function.
In particular, our method can be easily implemented by the standard linear programming packages and is hence numerically convenient. Extensions to marginal models for multivariate failure time are also considered. The performance of the new procedure is assessed through extensive simulation studies and illustrated with two real examples.
Accelerated failure time model; Adaptive Lasso; BIC; Gehan-type loss function; Lasso; Variable selection
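To make the objective concrete, the sketch below evaluates an ℓ1-penalized Gehan-type loss for the AFT model log T = βᵀX + ε, using the usual form of the Gehan loss with residuals e_i = log T_i − βᵀX_i. The 1/n² normalization, the function names, and the tuning parameter `lam` are assumptions made for illustration. The sketch only evaluates the objective; it does not perform the linear-programming minimization mentioned above.

```python
from math import log


def gehan_loss(beta, times, events, X):
    """Gehan-type loss: n^-2 * sum_i sum_j delta_i * max(e_j - e_i, 0),
    where e_i = log(T_i) - beta' X_i and delta_i = 1 for observed events."""
    n = len(times)
    resid = [log(t) - sum(b * x for b, x in zip(beta, row))
             for t, row in zip(times, X)]
    total = 0.0
    for i in range(n):
        if events[i] == 0:
            continue  # censored subjects contribute only as comparators
        for j in range(n):
            total += max(resid[j] - resid[i], 0.0)
    return total / n ** 2


def penalized_gehan_loss(beta, times, events, X, lam):
    """Gehan loss plus an l1 (Lasso) penalty on the coefficients."""
    return gehan_loss(beta, times, events, X) + lam * sum(abs(b) for b in beta)
```

The loss is piecewise linear and convex in β, and so is the ℓ1 penalty, which is why the minimization can be recast as a linear program.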
For censored survival outcomes, it can be of great interest to evaluate the predictive power of individual markers or their functions. Compared with alternative evaluation approaches, the time-dependent ROC (receiver operating characteristic) based approaches rely on much weaker assumptions, can be more robust, and hence are preferred. In this article, we examine evaluation of markers’ predictive power using the time-dependent ROC curve and a concordance measure which can be viewed as a weighted area under the time-dependent AUC (area under the ROC curve) profile. This study advances beyond existing time-dependent ROC studies by developing nonparametric estimators of the summary indexes and, more importantly, rigorously establishing their asymptotic properties. It reinforces the statistical foundation of the time-dependent ROC based evaluation approaches for censored survival outcomes. Numerical studies, including simulations and application to an HIV clinical trial, demonstrate the satisfactory finite-sample performance of the proposed approaches.
time-dependent ROC; concordance measure; inverse-probability-of-censoring weighting; marker evaluation; survival outcomes
For patients with identical clinical-pathological characteristics or the same stage of lung cancer, great uncertainties remain regarding why some patients will be cured while other patients will have cancer recurrence, metastasis, or death after surgical resection. Identification of patients at high risk of recurrence, and of those who are unlikely to respond to specific chemotherapeutic agents, is the rationale for measuring specific biochemical markers. Thus, the main investigational studies nowadays are focused on identifying molecular markers of recurrence after surgical treatment, beyond pathologic stage, and factors that can predict a benefit from adjuvant chemotherapy in poor prognosis subgroups, in order to individualize treatments. Advances in genomics and proteomics have generated many candidate markers with potential clinical value. Gene expression profiling (GEP) by microarray or real-time quantitative reverse-transcriptase polymerase chain reaction (qRT-PCR) can be useful in the classification or prognosis of various types of cancer, including lung cancer. A number of prognostic gene expression signatures have been reported to predict survival in non-small cell lung cancer (NSCLC). In this review, we focus on the role of GEP in early-stage NSCLC as a predictive and prognostic biomarker and its potential use for ‘personalized’ medicine in the years to come.
non-small cell lung cancer; prognostic biomarker; gene expression profiling
High-throughput gene profiling studies have been extensively conducted, searching for markers associated with cancer development and progression. In this study, we analyse cancer prognosis studies with right censored survival responses. With gene expression data, we adopt the weighted gene co-expression network analysis (WGCNA) to describe the interplay among genes. In network analysis, nodes represent genes. There are subsets of nodes, called modules, which are tightly connected to each other. Genes within the same modules tend to have co-regulated biological functions. For cancer prognosis data with gene expression measurements, our goal is to identify cancer markers, while properly accounting for the network module structure. A two-step sparse boosting approach, called Network Sparse Boosting (NSBoost), is proposed for marker selection. In the first step, for each module separately, we use a sparse boosting approach for within-module marker selection and construct module-level ‘super markers’. In the second step, we use the super markers to represent the effects of all genes within the same modules and conduct module-level selection using a sparse boosting approach. Simulation study shows that NSBoost can more accurately identify cancer-associated genes and modules than alternatives. In the analysis of breast cancer and lymphoma prognosis studies, NSBoost identifies genes with important biological implications. It outperforms alternatives including the boosting and penalization approaches by identifying a smaller number of genes/modules and/or having better prediction performance.
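The two-step structure described above (within-module selection, then module-level selection on super markers) can be sketched as follows. This is a simplified illustration and not the NSBoost algorithm itself. It assumes an uncensored continuous response, modules supplied as index lists rather than derived by WGCNA, and plain componentwise L2 boosting with a fixed number of steps in place of sparse boosting with its own stopping criterion. All function names are hypothetical.

```python
import numpy as np

def componentwise_l2_boost(X, y, n_steps=50, nu=0.1):
    """Componentwise L2 boosting on centered columns with a fixed step count.

    Each step updates only the coefficient whose univariate least-squares fit
    to the current residual gives the largest drop in residual sum of squares.
    """
    X = X - X.mean(axis=0)
    beta = np.zeros(X.shape[1])
    resid = y - y.mean()
    norms = (X ** 2).sum(axis=0)
    norms[norms == 0] = np.inf            # all-constant columns are never selected
    for _ in range(n_steps):
        xr = X.T @ resid
        j = int(np.argmax(xr ** 2 / norms))   # column with the largest RSS reduction
        step = nu * xr[j] / norms[j]
        beta[j] += step
        resid = resid - step * X[:, j]
    return beta

def two_step_module_boost(X, y, modules, n_steps=50, nu=0.1):
    """Step 1: boost within each module and form a 'super marker' per module.
    Step 2: boost on the super markers to select and weight modules."""
    Xc = X - X.mean(axis=0)
    within = np.zeros(X.shape[1])
    supers = []
    for idx in modules:
        b = componentwise_l2_boost(Xc[:, idx], y, n_steps, nu)
        within[idx] = b
        supers.append(Xc[:, idx] @ b)     # module-level super marker
    gamma = componentwise_l2_boost(np.column_stack(supers), y, n_steps, nu)
    beta = np.zeros(X.shape[1])
    for g, idx in zip(gamma, modules):
        beta[idx] = g * within[idx]       # gene effect = module weight x within-module weight
    return beta, gamma

# Simulated toy data: 15 genes in 3 modules; only gene 0 (module 0) affects y.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 15))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=300)
modules = [list(range(0, 5)), list(range(5, 10)), list(range(10, 15))]
beta, gamma = two_step_module_boost(X, y, modules)
```

In this toy setting the module containing the signal gene should receive the dominant module-level weight. Handling right-censored responses, as in the study, would require replacing the squared-error loss with a loss suited to censored data.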
Genomic selection refers to the use of genomewide dense markers for breeding value estimation and subsequently for selection. The main challenge of genomic breeding value estimation is the estimation of many effects from a limited number of observations. Bayesian methods have been proposed to cope with this challenge. As an alternative class of models, non- and semiparametric models were recently introduced. The present study investigated the ability of nonparametric additive regression models to predict genomic breeding values. The genotypes were modelled for each marker or pair of flanking markers (i.e. the predictors) separately. The nonparametric functions for the predictors were estimated simultaneously using additive model theory, applying a binomial kernel. The optimal degree of smoothing was determined by bootstrapping. A mutation-drift-balance simulation was carried out. The breeding values of the last generation (genotyped) were predicted using data from the next-to-last generation (genotyped and phenotyped). The results show moderate to high accuracies of the predicted breeding values. Determining a predictor-specific degree of smoothing increased the accuracy.
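A minimal sketch of fitting an additive model over genotype codes by backfitting is given below. The abstract does not give the form of the binomial kernel, so a simple geometric discrete kernel with weight `lam ** |x_i - g|` stands in for it here. That substitution, the fixed smoothing parameter `lam` (the study chose the degree of smoothing by bootstrapping), and the function names are all assumptions made for illustration.

```python
import numpy as np

def discrete_kernel_smooth(x, r, lam):
    """Nadaraya-Watson smoother on genotype codes (e.g. 0, 1, 2).

    The fit at code g is a weighted mean of r with weights lam ** |x_i - g|.
    This geometric kernel is a stand-in, not the binomial kernel of the study.
    lam = 0 gives per-genotype means; lam = 1 gives the overall mean.
    """
    fitted = np.empty(len(x), dtype=float)
    for g in np.unique(x):
        w = np.power(lam, np.abs(x - g))
        fitted[x == g] = np.sum(w * r) / np.sum(w)
    return fitted

def backfit_additive(X, y, lam=0.3, n_iter=20):
    """Backfitting for y = alpha + sum_j f_j(x_j) + error, one smooth per marker."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    alpha = y.mean()
    F = np.zeros((n, p))                  # F[:, j] holds f_j evaluated at the data
    for _ in range(n_iter):
        for j in range(p):
            # partial residual: remove the intercept and every component except f_j
            partial = y - alpha - F.sum(axis=1) + F[:, j]
            fj = discrete_kernel_smooth(X[:, j], partial, lam)
            F[:, j] = fj - fj.mean()      # center each component for identifiability
    return alpha, F

# Toy check with one marker and no smoothing across genotypes (lam = 0).
X_toy = np.array([[0], [0], [1], [1], [2], [2]])
y_toy = np.array([1.0, 3.0, 5.0, 7.0, 9.0, 11.0])
alpha, F = backfit_additive(X_toy, y_toy, lam=0.0)
fitted = alpha + F[:, 0]                  # per-genotype means: 2, 2, 6, 6, 10, 10
```

With many markers and few observations, the choice of `lam` per predictor controls how strongly each marker's effect is shrunk toward zero, which is the role the predictor-specific degree of smoothing plays in the abstract.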
The era of personalized medicine for cancer therapeutics has taken an important step forward in making accurate prognoses for individual patients with the adoption of high-throughput microarray technology. However, microarray technology in cancer diagnosis or prognosis has been primarily used for the statistical evaluation of patient populations, and thus excludes inter-individual variability and patient-specific predictions. Here we propose a metric called clinical confidence that serves as a measure of prognostic reliability to facilitate the shift from population-wide to personalized cancer prognosis using microarray-based predictive models. The performance of sample-based models predicted with different clinical confidences was evaluated and compared systematically using three large clinical datasets studying the following cancers: breast cancer, multiple myeloma, and neuroblastoma. Survival curves for patients with different confidences were also delineated. The results show that the clinical confidence metric separates patients with different prediction accuracies and survival times. Samples with high clinical confidence were likely to have accurate prognoses from predictive models. Moreover, patients with high clinical confidence would be expected to live for a notably longer or shorter time if their model-based prognosis was good or grim, respectively. We conclude that clinical confidence could serve as a beneficial metric for personalized cancer prognosis prediction utilizing microarrays. Ascribing a confidence level to prognosis with the clinical confidence metric provides the clinician with an objective, personalized basis for decisions, such as choosing the severity of the treatment.
High-level expression of Rad51, a key factor in homologous recombination, has been observed in a variety of human malignancies. This study aimed to evaluate Rad51 expression as a prognostic marker in non-small-cell lung cancer (NSCLC). A total of 383 non-small-cell lung tumours were analysed immunohistochemically on NSCLC tissue microarrays. High-level Rad51 expression was observed in 29.4% (100 out of 340) of cases. Patients whose tumours displayed high-level Rad51 expression showed a significantly shorter median survival time of 19 vs 68 months (P<0.0001, log-rank test). Similarly, T status, N status, M status, clinical stage and histological tumour grade were significant prognostic markers in univariate Cox survival analysis. Importantly, Rad51 expression (P<0.0001) together with tumour differentiation (P<0.009), clinical stage (P=0.004) and N status (P=0.0001) proved to be independent prognostic parameters in multivariate analysis. Rad51 expression predicted the outcome of squamous cell cancer as well as adenocarcinoma of the lung. Our results suggest that Rad51 expression provides additional prognostic information for surgically treated NSCLC patients. We hypothesise that the decreased survival of NSCLC patients with high-level expression of Rad51 is related to an enhanced propensity of tumour cells for survival, antiapoptosis and chemo-/radioresistance.
non-small-cell lung carcinoma; prognosis; tissue microarray; Rad51