Development of high-throughput technologies makes it possible to survey the whole genome. Genomic studies have been extensively conducted, searching for markers with predictive power for prognosis of complex diseases such as cancer, diabetes and obesity. Most existing statistical analyses are focused on developing marker selection techniques, while little attention is paid to the underlying prognosis models. In this article, we review three commonly used prognosis models, namely the Cox, additive risk and accelerated failure time models. We conduct simulation and show that gene identification can be unsatisfactory under model misspecification. We analyze three cancer prognosis studies under the three models, and show that the gene identification results, prediction performance of all identified genes combined, and reproducibility of each identified gene are model-dependent. We suggest that in practical data analysis, more attention should be paid to the model assumption, and multiple models may need to be considered.
genomic studies; semiparametric prognosis models; model comparison
In high-throughput cancer genomic studies, markers identified from the analysis of single datasets may have unsatisfactory properties because of low sample sizes. Integrative analysis pools and analyzes raw data from multiple studies, and can effectively increase sample size and lead to improved marker identification results. In this study, we consider the integrative analysis of multiple high-throughput cancer prognosis studies. In the existing integrative analysis studies, the interplay among genes, which can be described using the network structure, has not been effectively accounted for. In network analysis, tightly-connected nodes (genes) are more likely to have related biological functions and similar regression coefficients. The goal of this study is to develop an analysis approach that can incorporate the gene network structure in integrative analysis. To this end, we adopt an AFT (accelerated failure time) model to describe survival. A weighted least squares approach, which has low computational cost, is adopted for estimation. For marker selection, we propose a new penalization approach. The proposed penalty is composed of two parts. The first part is a group MCP penalty, and conducts gene selection. The second part is a Laplacian penalty, and smoothes the differences of coefficients for tightly-connected genes. A group coordinate descent approach is developed to compute the proposed estimate. Simulation study shows satisfactory performance of the proposed approach when there exist moderate to strong correlations among genes. We analyze three lung cancer prognosis datasets, and demonstrate that incorporating the network structure can lead to the identification of important genes and improved prediction performance.
Integrative analysis; Cancer prognosis; Gene network; Penalized selection; Laplacian shrinkage
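The two-part penalty described in the abstract above can be illustrated with a minimal sketch. It assumes the standard MCP form with a hypothetical regularization parameter `gamma`, a group defined as one gene's coefficients across datasets, and an unnormalized Laplacian built from a hypothetical adjacency matrix; the normalization and weighting used in the paper may differ.

```python
import math

def mcp(t, lam, gamma):
    """Standard minimax concave penalty (MCP) evaluated at |t|."""
    t = abs(t)
    if t <= gamma * lam:
        return lam * t - t * t / (2.0 * gamma)
    return 0.5 * gamma * lam * lam  # flat beyond gamma * lam

def group_mcp(beta, lam, gamma):
    """Group MCP: apply MCP to the 2-norm of each gene's coefficients.

    beta[j][m] is the coefficient of gene j in dataset m (assumed layout).
    """
    return sum(mcp(math.sqrt(sum(b * b for b in row)), lam, gamma)
               for row in beta)

def laplacian_penalty(beta, adjacency):
    """Sum over gene pairs j < k of a_jk * ||beta_j - beta_k||^2.

    This is the quadratic form of the unnormalized graph Laplacian,
    applied to every dataset and summed (an assumption of this sketch).
    """
    p = len(beta)
    total = 0.0
    for j in range(p):
        for k in range(j + 1, p):
            a = adjacency[j][k]
            if a:
                total += a * sum((bj - bk) ** 2
                                 for bj, bk in zip(beta[j], beta[k]))
    return total
```

The first term sets whole genes to zero, while the second is zero whenever connected genes share identical coefficients, which is the smoothing behavior the abstract describes.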
In breast cancer research, it is important to identify genomic markers associated with prognosis. Multiple microarray gene expression profiling studies have been conducted, searching for prognosis markers. Genomic markers identified from the analysis of single datasets often suffer a lack of reproducibility because of small sample sizes. Integrative analysis of data from multiple independent studies has a larger sample size and may provide a cost-effective solution.
We collect four breast cancer prognosis studies with gene expression measurements. An accelerated failure time (AFT) model with an unknown error distribution is adopted to describe survival. An integrative sparse boosting approach is employed for marker selection. The proposed model and boosting approach can effectively accommodate heterogeneity across multiple studies and identify genes with consistent effects.
Simulation study shows that the proposed approach outperforms alternatives including meta-analysis and intensity approaches by identifying the majority or all of the true positives, while having a low false positive rate. In the analysis of breast cancer data, 44 genes are identified as associated with prognosis. Many of the identified genes have been previously suggested as associated with tumorigenesis and cancer prognosis. The identified genes and corresponding predicted risk scores differ from those using alternative approaches. Monte Carlo-based prediction evaluation suggests that the proposed approach has the best prediction performance.
Integrative analysis may provide an effective way of identifying breast cancer prognosis markers. Markers identified using the integrative sparse boosting analysis have sound biological implications and satisfactory prediction performance.
Breast cancer prognosis; Gene Expression; Integrative analysis; Sparse boosting
In breast cancer research, it is of great interest to identify genomic markers associated with prognosis. Multiple gene profiling studies have been conducted for such a purpose. Genomic markers identified from the analysis of single datasets often do not have satisfactory reproducibility. Among the multiple possible reasons, the most important one is the small sample sizes of individual studies. A cost-effective solution is to pool data from multiple comparable studies and conduct integrative analysis. In this study, we collect four breast cancer prognosis studies with gene expression measurements. We describe the relationship between prognosis and gene expressions using the accelerated failure time (AFT) models. We adopt a 2-norm group bridge penalization approach for marker identification. This integrative analysis approach can effectively identify markers with consistent effects across multiple datasets and naturally accommodate the heterogeneity among studies. Statistical and simulation studies demonstrate satisfactory performance of this approach. Breast cancer prognosis markers identified using this approach have sound biological implications and satisfactory prediction performance.
Breast cancer prognosis; Gene expression; Marker identification; Integrative analysis; 2-norm group bridge
To validate whether FAM70B, which was found in our micro-array profiling as a prognostic marker for cancer survival, could accurately predict prognosis in patients with muscle-invasive bladder cancer (MIBC).
Materials and Methods
A total of 124 patients with MIBC were enrolled in this study. The FAM70B expression level was analyzed by real-time polymerase chain reaction by using RNA from tumor tissues. The prognostic effect of FAM70B was evaluated by Kaplan-Meier analysis and a multivariate Cox regression model.
Kaplan-Meier estimates showed a significant difference in progression-free survival (log-rank test, p=0.011) and cancer-specific survival (log-rank test, p=0.017) according to FAM70B gene expression level. By multivariate Cox regression analysis, high FAM70B expression was predictive of cancer progression (hazard ratio [HR], 2.115, p=0.013) and cancer-specific death (HR, 1.925; p=0.033). In the subgroup analysis, high expression of FAM70B was associated with poor cancer-specific survival, progression-free survival, and overall survival in the patients who underwent cystectomy (log-rank test, p=0.013, p=0.036, p=0.005, respectively). In the chemotherapy group, FAM70B expression was associated with cancer-specific survival and progression-free survival (log-rank test, p=0.013, p=0.042, respectively). Moreover, high FAM70B expression was associated with shorter cancer-specific survival in localized or locally advanced tumor stages (log-rank test, p=0.016).
We confirmed the significance of FAM70B as a prognostic marker in a validation cohort. Therefore, we propose that the FAM70B gene could be used to more precisely predict cancer progression and cancer-specific death in patients with MIBC.
Bladder cancer; Gene expression profiling; Micro-array; Prognosis
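The Kaplan-Meier analysis mentioned in this abstract rests on the product-limit estimator. The following is a minimal sketch of that estimator, not the study's analysis code; the function name and the input layout (one follow-up time and one event indicator per patient) are assumptions made for illustration.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times  : follow-up time for each subject
    events : 1 if the event was observed, 0 if the subject was censored
    Returns a list of (event_time, survival_probability) pairs.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Collect all subjects sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        # Events and censored subjects both leave the risk set.
        n_at_risk -= removed
    return curve
```

Comparing curves between high and low expression groups, as done with the log-rank test in the study, would require an additional test statistic that this sketch does not implement.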
In cancer research, high-throughput profiling studies have been extensively conducted, searching for genes/SNPs associated with prognosis. Despite seemingly significant differences, different subtypes of the same cancer (or different types of cancers) may share common susceptibility genes. In this study, we analyze prognosis data on multiple subtypes of the same cancer, but note that the proposed approach is directly applicable to the analysis of data on multiple types of cancers. We describe the genetic basis of multiple subtypes using the heterogeneity model, which allows overlapping but different sets of susceptibility genes/SNPs for different subtypes. An accelerated failure time (AFT) model is adopted to describe prognosis. We develop a regularized gradient descent approach, which conducts gene-level analysis and identifies genes that contain important SNPs associated with prognosis. The proposed approach belongs to the family of gradient descent approaches, is intuitively reasonable, and has affordable computational cost. Simulation study shows that when prognosis-associated SNPs are clustered in a small number of genes, the proposed approach outperforms alternatives with significantly more true positives and fewer false positives. We analyze an NHL (non-Hodgkin lymphoma) prognosis study with SNP measurements, and identify genes associated with the three major subtypes of NHL, namely DLBCL, FL and CLL/SLL. The proposed approach identifies genes different from using alternative approaches and has the best prediction performance.
Integrative analysis; Cancer Prognosis; Gradient descent; NHL; SNP
The objective of this study was to investigate the effect of the number of metastatic lymph nodes (pN) and the metastatic lymph node ratio (MLR) on the post-surgical prognosis of Chinese patients with esophageal cancer (EC) and lymph node metastasis.
We enrolled 353 patients who received primary curative resection for EC from 1990 to 2003. The association of pN and MLR with 5-year overall survival (OS) was examined by receiver operating characteristic (ROC) and area under the curve (AUC) analysis. The Kaplan-Meier method was used to calculate survival rates, and survival curves were compared with the log-rank test. The Cox model was employed for univariate and multivariate analyses of factors associated with 5-year OS.
The median follow-up time was 41 months, and the 1-, 3- and 5-year OS rates were 71.2%, 30.4%, and 19.5%, respectively. Univariate analysis showed that age, pN stage, and the MLR were prognostic factors for OS. Patients with MLRs less than 0.15, MLRs of 0.15-0.30, and MLRs greater than 0.30 had 5-year OS rates of 30.1%, 17.8%, and 9.5%, respectively (p < 0.001). Patients classified as pN1, pN2, and pN3 had 5-year OS rates of 23.7%, 11.4%, and 9.9%, respectively (p < 0.001). Multivariate analysis indicated that a high MLR and advanced age were significant and independent risk factors for poor OS. Patients classified as pN2 had significantly worse OS than those classified as pN1 (p = 0.022), but those classified as pN3 had OS similar to that of patients classified as pN1 (p = 0.166). ROC analysis indicated that MLR (AUC = 0.585, p = 0.016) had better predictive value than pN (AUC = 0.565, p = 0.068).
The integrated use of MLR and pN may be suitable for evaluation of OS in Chinese patients with EC and positive nodal metastasis after curative resection.
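The AUC values used above to compare MLR and pN can be read as the probability that a randomly chosen patient with the event has a higher marker value than a randomly chosen patient without it. A minimal sketch of this empirical AUC follows; the function name and inputs are hypothetical, and it is not the software used in the study.

```python
def empirical_auc(case_scores, control_scores):
    """Empirical AUC as the Mann-Whitney probability.

    Counts the fraction of (case, control) pairs in which the case has
    the higher score; ties contribute one half.
    """
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))
```

An AUC of 0.5 corresponds to no discrimination, which helps put values such as 0.585 and 0.565 in context.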
Transcriptional coactivator p300 has been shown to play a variety of roles in the transcription process and mutation of p300 has been found in certain types of human cancers. However, the expression dynamics of p300 in breast cancer (BC) and its effect on BC patients’ prognosis are poorly understood.
In the present study, the methods of tissue microarray and immunohistochemistry (IHC) were used to investigate the protein expression of p300 in BCs. Receiver operating characteristic (ROC) curve analysis, Spearman’s rank correlation, Kaplan-Meier plots and Cox proportional hazards regression model were utilized to analyze the data.
Based on the ROC curve analysis, high expression of p300 was defined as an H score for p300 of more than 105. High expression of p300 was observed in 105/193 (54.4%) of BCs and in 6/25 (24.0%) of non-malignant breast tissues (P=0.004). Further correlation analysis showed that high expression of p300 was positively correlated with higher histological grade, advanced clinical stage and tumor recurrence (P<0.05). In univariate survival analysis, a significant association was found between high expression of p300 and both shortened patient survival and poor progression-free survival (P<0.05). Importantly, p300 expression was identified as an independent prognostic factor in multivariate analysis (P<0.05).
Our findings provide a basis for the concept that high expression of p300 in BC may be important in the acquisition of a recurrence phenotype, suggesting that p300 high expression, as examined by IHC, is an independent biomarker for poor prognosis of patients with BC.
Breast cancer; p300; Tumor recurrence; Prognosis
An important application of microarrays is to discover genomic biomarkers, among tens of thousands of genes assayed, for disease diagnosis and prognosis. Thus it is of interest to develop efficient statistical methods that can simultaneously identify important biomarkers from such high-throughput genomic data and construct appropriate classification rules. It is also of interest to develop methods for evaluation of classification performance and ranking of identified biomarkers.
The ROC (receiver operating characteristic) technique has been widely used in disease classification with low dimensional biomarkers. Compared with the empirical ROC approach, the binormal ROC is computationally more affordable and robust in small sample size cases. We propose using the binormal AUC (area under the ROC curve) as the objective function for two-sample classification, and the scaled threshold gradient directed regularization method for regularized estimation and biomarker selection. Tuning parameter selection is based on V-fold cross validation. We develop Monte Carlo based methods for evaluating the stability of individual biomarkers and overall prediction performance. Extensive simulation studies show that the proposed approach can generate parsimonious models with excellent classification and prediction performance, under most simulated scenarios including model mis-specification. Application of the method to two cancer studies shows that the identified genes are reasonably stable with satisfactory prediction performance and biologically sound implications. The overall classification performance is satisfactory, with small classification errors and large AUCs.
In comparison to existing methods, the proposed approach is computationally more affordable without losing the optimality possessed by the standard ROC method.
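For a single marker, the binormal AUC has a closed form: if the marker is normal in both groups, AUC = Φ((μ1 − μ0) / sqrt(σ1² + σ0²)). The sketch below evaluates this plug-in formula with sample means and variances; it illustrates the objective only and does not implement the threshold gradient directed regularization described above. The function name is hypothetical.

```python
import math
from statistics import NormalDist, mean, variance

def binormal_auc(case_scores, control_scores):
    """Plug-in binormal AUC for one marker.

    Assumes the marker is normally distributed in each group and uses
    sample means and sample variances as estimates.
    """
    diff = mean(case_scores) - mean(control_scores)
    scale = math.sqrt(variance(case_scores) + variance(control_scores))
    return NormalDist().cdf(diff / scale)
```

Because the formula depends on the data only through means and variances, it is smooth in the marker coefficients, which is one reason it is cheaper to optimize than the empirical AUC.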
High-throughput studies have been extensively conducted in the research of complex human diseases. As a representative example, consider gene-expression studies where thousands of genes are profiled at the same time. An important objective of such studies is to rank the diagnostic accuracy of biomarkers (e.g. gene expressions) for predicting outcome variables while properly adjusting for confounding effects from low-dimensional clinical risk factors and environmental exposures. Existing approaches are often fully based on parametric or semi-parametric models and target evaluating estimation significance as opposed to diagnostic accuracy. Receiver operating characteristic (ROC) approaches can be employed to tackle this problem. However, existing ROC ranking methods focus on biomarkers only and ignore effects of confounders. In this article, we propose a model-based approach which ranks the diagnostic accuracy of biomarkers using ROC measures with a proper adjustment of confounding effects. To this end, three different methods for constructing the underlying regression models are investigated. Simulation study shows that the proposed methods can accurately identify biomarkers with additional diagnostic power beyond confounders. Analysis of two cancer gene-expression studies demonstrates that adjusting for confounders can lead to substantially different rankings of genes.
ranking biomarkers; ROC; confounders; high-throughput data
Developing individualized prediction rules for disease risk and prognosis has played a key role in modern medicine. When new genomic or biological markers become available to assist in risk prediction, it is essential to assess the improvement in clinical usefulness of the new markers over existing routine variables. Net reclassification improvement (NRI) has been proposed to assess improvement in risk reclassification in the context of comparing two risk models and the concept has been quickly adopted in medical journals. We propose both nonparametric and semiparametric procedures for calculating NRI as a function of a future prediction time t with a censored failure time outcome. The proposed methods accommodate covariate-dependent censoring, therefore providing more robust and sometimes more efficient procedures compared with the existing nonparametric-based estimators. Simulation results indicate that the proposed procedures perform well in finite samples. We illustrate these procedures by evaluating a new risk model for predicting the onset of cardiovascular disease.
Inverse probability weighted (IPW) estimator; Net reclassification improvement (NRI); Risk prediction; Survival analysis
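As background for the NRI concept, the following sketch computes the category-free NRI for a binary outcome with no censoring. This is a deliberately simplified setting: it is not the nonparametric or semiparametric estimator proposed above, which handles censored failure times and a prediction time t. The function name and inputs are hypothetical.

```python
def category_free_nri(old_risk, new_risk, outcome):
    """Category-free NRI for a fully observed binary outcome.

    NRI = [P(up | event) - P(down | event)]
        + [P(down | non-event) - P(up | non-event)],
    where "up" means the new model assigns a higher risk than the old one.
    """
    up_e = down_e = n_e = 0
    up_n = down_n = n_n = 0
    for old, new, d in zip(old_risk, new_risk, outcome):
        if d == 1:
            n_e += 1
            up_e += new > old
            down_e += new < old
        else:
            n_n += 1
            up_n += new > old
            down_n += new < old
    return (up_e - down_e) / n_e + (down_n - up_n) / n_n
```

The value ranges from -2 to 2, with positive values indicating that the new model moves events up and non-events down more often than the reverse.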
Extensive biomedical studies have shown that clinical and environmental risk factors may not have sufficient predictive power for cancer prognosis. The development of high-throughput profiling technologies makes it possible to survey the whole genome and search for genomic markers with predictive power. Many existing studies assume the interchangeability of gene effects and ignore the coordination among them.
We adopt the weighted co-expression network to describe the interplay among genes. Although there are several different ways of defining gene networks, the weighted co-expression network may be preferred because of its computational simplicity, satisfactory empirical performance, and because it does not demand additional biological experiments. For cancer prognosis studies with gene expression measurements, we propose a new marker selection method that can properly incorporate the network connectivity of genes. We analyze six prognosis studies on breast cancer and lymphoma. We find that the proposed approach can identify genes that are significantly different from those using alternatives. We search published literature and find that genes identified using the proposed approach are biologically meaningful. In addition, they have better prediction performance and reproducibility than genes identified using alternatives.
The network contains important information on the functionality of genes. Incorporating the network structure can improve cancer marker identification.
We applied stochastic search variable selection (SSVS), a Bayesian model selection method, to the simulated data of Genetic Analysis Workshop 13. We used SSVS with the revisited Haseman-Elston method to find the markers linked to the loci determining change in cholesterol over time. To study gene-gene interaction (epistasis) and gene-environment interaction, we adopted prior structures, which incorporate the relationship among the predictors. This allows SSVS to search in the model space more efficiently and avoid the less likely models.
In applying SSVS, instead of looking at the posterior distribution of each of the candidate models, which is sensitive to the setting of the prior, we ranked the candidate variables (markers) according to their marginal posterior probability, which was shown to be more robust to the prior. Compared with traditional methods that consider one marker at a time, our method considers all markers simultaneously and obtains more favorable results.
We showed that SSVS is a powerful method for identifying linked markers using the Haseman-Elston method, even for weak effects. SSVS is very effective because it does a smart search over the entire model space.
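The ranking step described above, ordering markers by marginal posterior probability rather than inspecting whole-model posteriors, can be sketched as follows. The sketch assumes that posterior draws of the inclusion indicators are already available from a sampler; the sampler itself, the function name, and the data layout are not part of the original work.

```python
def marginal_inclusion(draws, names):
    """Rank markers by marginal posterior inclusion probability.

    draws : list of indicator vectors (0/1), one per posterior draw
    names : marker names, in the same order as the indicator entries
    Returns (probabilities by marker, markers ranked from most to least likely).
    """
    n = len(draws)
    probs = {name: sum(d[i] for d in draws) / n
             for i, name in enumerate(names)}
    # Break ties alphabetically so the ranking is deterministic.
    ranked = sorted(names, key=lambda m: (-probs[m], m))
    return probs, ranked
```

Averaging over draws pools evidence across all visited models, which is why this summary tends to be less sensitive to the prior than the posterior probability of any single model.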
With the growing number of public repositories for high-throughput genomic data, it is of great interest to combine the results produced by independent research groups. Such a combination allows the identification of common genomic factors across multiple cancer types and provides new insights into the disease process. In the framework of the proportional hazards model, classical procedures, which consist of ranking genes according to the estimated hazard ratio or the p-value obtained from a test statistic of no association between survival and gene expression level, are not suitable for gene selection across multiple genomic datasets with different sample sizes. We propose a novel index for identifying genes with a common effect across heterogeneous genomic studies designed to remain stable whatever the sample size and which has a straightforward interpretation in terms of the percentage of separability between patients according to their survival times and gene expression measurements.
The simulation results show that the proposed index is not substantially affected by the sample size of the study and the censoring. They also show that its separability performance is higher than indices of predictive accuracy relying on the likelihood function. A simulated example illustrates the good operating characteristics of our index. In addition, we demonstrate that it is linked to the score statistic and possesses a biologically relevant interpretation.
The practical use of the index is illustrated for identifying genes with common effects across eight independent genomic cancer studies of different sample sizes. The meta-selection allows the identification of four genes (ESPL1, KIF4A, HJURP, LRIG1) that are biologically relevant to the carcinogenesis process and have a prognostic impact on survival outcome across various solid tumors.
The proposed index is a promising tool for identifying factors having a prognostic impact across a collection of heterogeneous genomic datasets of various sizes.
High-throughput gene profiling studies have been extensively conducted, searching for markers associated with cancer development and progression. In this study, we analyse cancer prognosis studies with right censored survival responses. With gene expression data, we adopt the weighted gene co-expression network analysis (WGCNA) to describe the interplay among genes. In network analysis, nodes represent genes. There are subsets of nodes, called modules, which are tightly connected to each other. Genes within the same modules tend to have co-regulated biological functions. For cancer prognosis data with gene expression measurements, our goal is to identify cancer markers, while properly accounting for the network module structure. A two-step sparse boosting approach, called Network Sparse Boosting (NSBoost), is proposed for marker selection. In the first step, for each module separately, we use a sparse boosting approach for within-module marker selection and construct module-level ‘super markers’. In the second step, we use the super markers to represent the effects of all genes within the same modules and conduct module-level selection using a sparse boosting approach. Simulation study shows that NSBoost can more accurately identify cancer-associated genes and modules than alternatives. In the analysis of breast cancer and lymphoma prognosis studies, NSBoost identifies genes with important biological implications. It outperforms alternatives including the boosting and penalization approaches by identifying a smaller number of genes/modules and/or having better prediction performance.
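The weighted co-expression network underlying the module structure is commonly defined through a soft-thresholded correlation, a_jk = |cor(x_j, x_k)|^β. The sketch below builds such an adjacency matrix; the default power of 6 and the zero diagonal are assumptions of this illustration, and module detection and the NSBoost steps are not shown.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def coexpression_adjacency(expr, power=6):
    """Soft-thresholded adjacency a_jk = |cor(x_j, x_k)| ** power.

    expr[j] holds the expression values of gene j across samples.
    The diagonal is left at zero so that a gene is not counted as its
    own neighbour (a choice made for this sketch).
    """
    p = len(expr)
    adj = [[0.0] * p for _ in range(p)]
    for j in range(p):
        for k in range(j + 1, p):
            a = abs(pearson(expr[j], expr[k])) ** power
            adj[j][k] = adj[k][j] = a
    return adj
```

Raising the absolute correlation to a power keeps strong correlations close to one while pushing weak ones toward zero, so tightly connected genes stand out without a hard cutoff.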
Human heparanase plays an important role in cancer development and single nucleotide polymorphisms (SNPs) in the heparanase gene (HPSE) have been shown to be correlated with gastric cancer. The present study examined the associations between individual SNPs or haplotypes in HPSE and susceptibility, clinicopathological parameters and prognosis of gastric cancer in a large sample of the Han population in northern China.
Genomic DNA was extracted from formalin-fixed, paraffin-embedded normal gastric tissue samples from 404 patients and from blood from 404 healthy controls. Six SNPs were genotyped by matrix-assisted laser desorption/ionization time-of-flight mass spectrometry. A chi-square (χ2) test and unconditional logistic regression were used to analyze the risk of gastric cancer; a log-rank test and Cox proportional hazards model were used for survival analysis, and the Kaplan-Meier method was used to plot survival curves. The mean genotyping success rates were more than 99% in both groups. Haplotype CA in the block composed of rs11099592 and rs4693608 had a greater distribution in the group of Borrmann types 3 and 4 (P = 0.037) and in the group with a greater number of lymph node metastases (N3 vs N0 group, P = 0.046), and was also correlated with poor survival (CG vs CA: HR = 0.645, 95%CI: 0.421–0.989, P = 0.044). In addition, genotypes rs4693608 AA and rs4364254 TT were associated with poor survival (P = 0.030, HR = 1.527, 95%CI: 1.042–2.238 for rs4693608 AA; P = 0.013, HR = 1.546, 95%CI: 1.096–2.181 for rs4364254 TT). There were no correlations between individual SNPs or haplotypes and gastric cancer risk.
A functional haplotype in HPSE was found, which included the important SNP rs4693608. SNPs in HPSE play an important role in gastric cancer progression and survival, and may serve as molecular markers for prognosis and treatment.
Among the gynaecological malignancies, ovarian cancer is one of the neoplastic forms with the poorest prognosis, with worse overall and disease-free survival rates than other gynaecological cancers. Ovarian tumors can be classified on the basis of the cells of origin into epithelial, stromal and germ cell tumors. Epithelial ovarian tumors display great histological heterogeneity and can be further subdivided into benign, intermediate or borderline, and invasive tumors. Several studies on ovarian tumors have focused on the identification of both diagnostic and prognostic markers for applications in clinical practice. High-throughput technologies have accelerated the process of biomolecular study and genomic discovery; unfortunately, the validity of these findings still needs to be demonstrated by extensive research on the sensitivity and specificity of novel ovarian cancer biomarkers, determining whether gene profiling and proteomics could help differentiate between patients with metastatic ovarian cancer and primary ovarian carcinomas, and their potential impact on management. Therefore, considerable interest lies in identifying molecular and protein biomarkers and indicators to guide treatment decisions and clinical follow-up. In this review, the current state of knowledge about the genoproteomic and potential clinical value of gene expression profiling in ovarian cancer and ovarian borderline tumors is discussed, focusing on three main areas: distinguishing normal ovarian tissue from ovarian cancers and borderline tumors, identifying different genotypes of ovarian tissue and identifying proteins linked to cancer or tumor development. With these aims, the authors focus on the use of novel molecules, developed through proteomics and genomics research, as potential protein biomarkers in the management of ovarian cancer or borderline tumors, giving an overview of the current state of the art and of future research perspectives.
Ovarian cancer; borderline ovarian tumors; markers; genomics; proteomics; oncogenes.
Researchers in the field of bioinformatics often face a challenge of combining several ordered lists in a proper and efficient manner. Rank aggregation techniques offer a general and flexible framework that allows one to objectively perform the necessary aggregation. With the rapid growth of high-throughput genomic and proteomic studies, the potential utility of rank aggregation in the context of meta-analysis becomes even more apparent. One of the major strengths of rank-based aggregation is the ability to combine lists coming from different sources and platforms, for example different microarray chips, which may or may not be directly comparable otherwise.
The RankAggreg package provides two methods for combining the ordered lists: the Cross-Entropy method and the Genetic Algorithm. Two examples of rank aggregation using the package are given in the manuscript: one in the context of clustering based on gene expression, and the other one in the context of meta-analysis of prostate cancer microarray experiments.
The two examples described in the manuscript clearly show the utility of the RankAggreg package in the current bioinformatics context where ordered lists are routinely produced as a result of modern high-throughput technologies.
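To make the rank aggregation problem concrete, the sketch below combines ordered lists with a simple Borda-style mean rank. This is a much simpler technique than the Cross-Entropy and Genetic Algorithm methods provided by the RankAggreg package, and it does not reproduce the package's interface; the function name and the rule for items missing from a list (rank one past the end of that list) are assumptions of this illustration.

```python
def borda_aggregate(lists):
    """Aggregate ordered lists by mean rank (Borda-style).

    lists : ordered lists of items, best first; lists may be partial.
    An item missing from a list is given rank len(list) + 1 in that list.
    Returns all items ordered from best to worst aggregate rank.
    """
    items = sorted({x for lst in lists for x in lst})

    def mean_rank(item):
        total = 0
        for lst in lists:
            total += (lst.index(item) + 1) if item in lst else (len(lst) + 1)
        return total / len(lists)

    # Ties in mean rank are broken alphabetically for reproducibility.
    return sorted(items, key=lambda x: (mean_rank(x), x))
```

Because only ranks are used, lists produced on different platforms can be combined even when their underlying scores are not comparable.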
ZEB2 has been suggested to mediate EMT and disease aggressiveness in several types of human cancers. However, the expression patterns of ZEB2 in hepatocellular carcinoma (HCC) and its effect on prognosis of HCC patients treated with hepatectomy are unclear.
In this study, the methods of tissue microarray and immunohistochemistry (IHC) were utilized to investigate ZEB2 expression in HCC and peritumoral liver tissue (PLT). Receiver operating characteristic (ROC), Spearman's rank correlation, Kaplan-Meier plots and Cox proportional hazards regression model were used to analyze the data. Up-regulated expression of cytoplasmic/nuclear ZEB2 protein was observed in the majority of PLTs, when compared to HCCs. Further analysis showed that overexpression of cytoplasmic ZEB2 in HCCs was inversely correlated with AFP level, tumor size and differentiation (P<0.05). Also, overexpression of cytoplasmic ZEB2 in PLTs correlated with lower AFP level (P<0.05). In univariate survival analysis, a significant association between overexpression of cytoplasmic ZEB2 by HCCs/PLTs and longer patient survival was found (P<0.05). Importantly, cytoplasmic ZEB2 expression in PLTs was identified as an independent prognostic factor in multivariate analysis (P<0.05). Consequently, a new clinicopathologic prognostic model with cytoplasmic ZEB2 expression (including HCCs and PLTs) was constructed. The model could significantly stratify risk (low, intermediate and high) for overall survival (P = 0.002).
Our findings provide a basis for the concept that cytoplasmic ZEB2 expressed by PLTs can predict the postoperative survival of patients with HCC. The combined cytoplasmic ZEB2 prognostic model may become a useful tool for identifying patients with different clinical outcomes.
The predictiveness curve is a graphical tool that characterizes the population distribution of Risk(Y) = P(D = 1|Y), where D denotes a binary outcome such as occurrence of an event within a specified time period and Y denotes predictors. A wider distribution of Risk(Y) indicates better performance of a risk model in the sense that making treatment recommendations is easier for more subjects. Decisions are more straightforward when a subject's risk is deemed to be high or low. Methods have been developed to estimate predictiveness curves from cohort studies. However early phase studies to evaluate novel risk prediction markers typically employ case-control designs. Here we present semiparametric and nonparametric methods for evaluating a continuous risk prediction marker that accommodate case-control data. Small sample properties are investigated through simulation studies. The semiparametric methods are substantially more efficient than their nonparametric counterparts under a correctly specified model. We generalize them to settings where multiple prediction markers are involved. Applications to prostate cancer risk prediction markers illustrate methods for comparing the risk prediction capacities of markers and for evaluating the increment in performance gained by adding a marker to a baseline risk model. We propose a modified Hosmer-Lemeshow test for case-control study data to assess calibration of the risk model that is a natural complement to this graphical tool.
biomarker; case-control study; classification; Hosmer-Lemeshow test; predictiveness curve; risk; ROC curve
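As a rough illustration of the graphical tool described above, the predictiveness curve can be read as the quantile function of Risk(Y): it plots the v-th quantile of the risk distribution against v. The sketch below assumes the simplest cohort setting, in which a vector of fitted risks is already available; it does not implement the case-control estimators proposed in the abstract. The function names `predictiveness_curve` and `fraction_below` are hypothetical and chosen only for this example.

```python
import numpy as np


def predictiveness_curve(risks, grid=None):
    """Return (v, R(v)), where R(v) is the empirical v-th quantile of the risks.

    Assumes `risks` are fitted values of P(D = 1 | Y) from a cohort sample.
    """
    risks = np.asarray(risks, dtype=float)
    if grid is None:
        grid = np.linspace(0.0, 1.0, 101)  # quantile levels v in [0, 1]
    grid = np.asarray(grid, dtype=float)
    return grid, np.quantile(risks, grid)


def fraction_below(risks, threshold):
    """Proportion of subjects whose risk is at or below a decision threshold."""
    return float(np.mean(np.asarray(risks, dtype=float) <= threshold))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    marker = rng.normal(size=500)
    # Hypothetical logistic risk model, used only to generate example risks.
    risks = 1.0 / (1.0 + np.exp(-(-1.0 + 1.5 * marker)))
    v, r = predictiveness_curve(risks)
    print(r[[10, 50, 90]])  # risks at the 10th, 50th and 90th percentiles
    print(fraction_below(risks, 0.2))
```

A steeper curve corresponds to a wider risk distribution, so more subjects fall clearly below a low-risk threshold or above a high-risk one.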
Despite recent progress in the identification of genetic and molecular alterations in prostate cancer, markers associated with tumor progression are scarce. Therefore, precise diagnosis of patients and prognosis of the disease remain difficult. This study investigated novel molecular markers discriminating between low and highly aggressive types of prostate cancer.
Using 52 microdissected cell populations of low- and high-risk prostate tumors, we identified via global cDNA microarray analysis almost 1200 genes that were differentially expressed between these groups. These genes were analyzed by statistical, pathway and gene enrichment methods. Twenty selected candidate genes were verified by quantitative real-time PCR and immunohistochemistry. In concordance with the mRNA levels, two genes, MAP3K5 and PDIA3, showed differential protein expression. Functional characterization of PDIA3 revealed a pro-apoptotic role of this gene in PC3 prostate cancer cells.
Our analyses provide deeper insights into the molecular changes occurring during prostate cancer progression. The genes MAP3K5 and PDIA3 are associated with malignant stages of prostate cancer and therefore provide novel potential biomarkers.
Microarray studies provide a way of linking variations of phenotypes with their genetic causes. Constructing predictive models using high-dimensional microarray measurements usually consists of three steps: (1) unsupervised gene screening; (2) supervised gene screening; and (3) statistical model building. Supervised gene screening based on marginal gene ranking is commonly used to reduce the number of genes in the model building. Various simple statistics, such as the t-statistic or signal-to-noise ratio, have been used to rank genes in the supervised screening. Despite its extensive use, statistical studies of supervised gene screening remain scarce. Our study is partly motivated by the differences in gene discovery results caused by using different supervised gene screening methods.
We investigate concordance and reproducibility of supervised gene screening based on eight commonly used marginal statistics. Concordance is assessed by the relative fractions of overlaps between top-ranked genes screened using different marginal statistics. We propose a Bootstrap Reproducibility Index, which measures reproducibility of individual genes under the supervised screening. Empirical studies are based on four public microarray datasets. We consider the cases where the top 20%, 40% and 60% of genes are screened.
From a gene discovery point of view, the effect of supervised gene screening based on different marginal statistics cannot be ignored. Empirical studies show that (1) genes that pass different supervised screenings may be considerably different; (2) concordance may vary, depending on the underlying data structure and percentage of selected genes; (3) evaluated with the Bootstrap Reproducibility Index, genes that pass supervised screenings are only moderately reproducible; and (4) concordance cannot be improved by supervised screening based on reproducibility.
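The concordance measure described above, the relative fraction of overlap between the top-ranked genes under two marginal statistics, can be sketched as follows. This is a minimal illustration under stated assumptions: genes are ranked by the absolute value of each statistic, and the overlap is divided by the number of screened genes. The exact normalization used in the study is not given in the abstract, and the names `top_fraction` and `concordance` are hypothetical. The Bootstrap Reproducibility Index is not implemented here because its definition does not appear in the text.

```python
import numpy as np


def top_fraction(scores, fraction):
    """Indices of the top `fraction` of genes, ranked by |score| (assumed ranking rule)."""
    scores = np.asarray(scores, dtype=float)
    k = max(1, int(round(fraction * scores.size)))
    order = np.argsort(-np.abs(scores), kind="stable")  # largest |score| first
    return set(order[:k].tolist())


def concordance(scores_a, scores_b, fraction):
    """Overlap of the two top-ranked gene sets, relative to the screened set size."""
    top_a = top_fraction(scores_a, fraction)
    top_b = top_fraction(scores_b, fraction)
    return len(top_a & top_b) / len(top_a)


if __name__ == "__main__":
    # Two hypothetical marginal statistics computed for the same five genes.
    stat_a = [5.0, 4.0, 3.0, 2.0, 1.0]
    stat_b = [5.0, 1.0, 4.0, 2.0, 3.0]
    print(concordance(stat_a, stat_b, 0.4))
```

With the toy values above, the top 40% of genes are {0, 1} under the first statistic and {0, 2} under the second, so half of the screened genes agree.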
A rank-based variable selection procedure is developed for the semiparametric accelerated failure time model with censored observations where the penalized likelihood (partial likelihood) method is not directly applicable.
The new method penalizes the rank-based Gehan-type loss function with the ℓ1 penalty. To correctly choose the tuning parameters, a novel likelihood-based χ2-type criterion is proposed. Desirable properties of the estimator such as the oracle properties are established through the local quadratic expansion of the Gehan loss function.
In particular, our method can be easily implemented with standard linear programming packages and is hence numerically convenient. Extensions to marginal models for multivariate failure time are also considered. The performance of the new procedure is assessed through extensive simulation studies and illustrated with two real examples.
Accelerated failure time model; Adaptive Lasso; BIC; Gehan-type loss function; Lasso; Variable selection
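To make the penalized objective described above concrete, the sketch below evaluates a Gehan-type loss plus an ℓ1 penalty for a given coefficient vector. It assumes the common form of the Gehan loss, n^{-2} Σ_i Σ_j δ_i (e_i − e_j)^−, with residuals e_i = log T_i − β'X_i and a^− = max(−a, 0). The n^{-2} scaling and the function name `gehan_l1_objective` are assumptions for illustration, and the code only evaluates the objective; it does not perform the linear programming fit or the tuning-parameter selection proposed in the abstract.

```python
import numpy as np


def gehan_l1_objective(beta, X, log_time, delta, lam):
    """Gehan-type loss plus an L1 penalty (illustrative form, assumed scaling).

    beta     : (p,) coefficient vector
    X        : (n, p) covariate matrix
    log_time : (n,) log of observed (possibly censored) times
    delta    : (n,) event indicators, 1 = event observed, 0 = censored
    lam      : non-negative tuning parameter
    """
    beta = np.asarray(beta, dtype=float)
    X = np.asarray(X, dtype=float)
    delta = np.asarray(delta, dtype=float)
    resid = np.asarray(log_time, dtype=float) - X @ beta  # e_i
    diff = resid[:, None] - resid[None, :]                # e_i - e_j
    neg_part = np.maximum(-diff, 0.0)                     # (e_i - e_j)^-
    n = resid.size
    loss = np.sum(delta[:, None] * neg_part) / n**2       # only uncensored i contribute
    return loss + lam * np.sum(np.abs(beta))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(50, 3))
    log_t = X @ np.array([1.0, 0.0, -0.5]) + rng.normal(size=50)
    delta = rng.integers(0, 2, size=50)
    print(gehan_l1_objective(np.zeros(3), X, log_t, delta, lam=0.1))
```

Because the loss is piecewise linear in β and the penalty is a sum of absolute values, the minimization can be posed as a linear program, which is consistent with the implementation remark in the abstract.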
For censored survival outcomes, it can be of great interest to evaluate the predictive power of individual markers or their functions. Compared with alternative evaluation approaches, the time-dependent ROC (receiver operating characteristic) based approaches rely on much weaker assumptions, can be more robust, and hence are preferred. In this article, we examine evaluation of markers’ predictive power using the time-dependent ROC curve and a concordance measure which can be viewed as a weighted area under the time-dependent AUC (area under the ROC curve) profile. This study advances significantly beyond existing time-dependent ROC studies by developing nonparametric estimators of the summary indexes and, more importantly, rigorously establishing their asymptotic properties. It reinforces the statistical foundation of the time-dependent ROC based evaluation approaches for censored survival outcomes. Numerical studies, including simulations and an application to an HIV clinical trial, demonstrate the satisfactory finite-sample performance of the proposed approaches.
time-dependent ROC; concordance measure; inverse-probability-of-censoring weighting; marker evaluation; survival outcomes
Gene copy number changes are common characteristics of many genetic disorders. A new technology, array comparative genomic hybridization (a-CGH), is widely used today to screen for gains and losses in cancers and other genetic diseases with high resolution at the genome level or for specific chromosomal regions. Statistical methods for analyzing such a-CGH data have been developed. However, most of the existing methods are for data on unrelated individuals, and their results explain horizontal variations in copy number changes. It is potentially meaningful to develop a statistical method that allows for the analysis of family data to investigate the vertical kinship effects as well. Here we consider a semiparametric model based on a clustering method in which the marginal distributions are estimated nonparametrically, and the familial dependence structure is modeled by a copula. The model is illustrated and evaluated using simulated data. Our results show that the proposed method is more robust than the commonly used multivariate normal model. Finally, we demonstrate the utility of our method using a real dataset.
cluster; copula; family data; gene copy number; semiparametric model
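The semiparametric idea described above, nonparametric margins combined with a copula for dependence, typically begins by mapping each variable to rank-based pseudo-observations on the unit interval, so that the dependence structure can be studied separately from the marginal distributions. The sketch below shows only that first step, under the assumptions that columns correspond to family members, rows to families, and ties are absent. The function name `pseudo_observations` is hypothetical, and the clustering step and copula fit of the proposed model are not implemented.

```python
import numpy as np


def pseudo_observations(x):
    """Column-wise ranks scaled by 1 / (n + 1), a nonparametric margin estimate.

    Assumes no ties within a column; rows are observations (e.g. families),
    columns are variables (e.g. family members).
    """
    x = np.asarray(x, dtype=float)
    # argsort of argsort gives 0-based ranks within each column
    ranks = np.argsort(np.argsort(x, axis=0), axis=0) + 1
    return ranks / (x.shape[0] + 1)


if __name__ == "__main__":
    rng = np.random.default_rng(2)
    shared = rng.normal(size=(200, 1))                    # hypothetical family effect
    data = shared + rng.normal(scale=0.5, size=(200, 2))  # two related members
    u = pseudo_observations(data)
    print(np.corrcoef(u, rowvar=False)[0, 1])             # rank-based dependence
```

Dividing by n + 1 rather than n keeps the pseudo-observations strictly inside (0, 1), which avoids infinite values if they are later passed through an inverse distribution function.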