Related Articles
The development of high-throughput technologies makes it possible to survey the whole genome. Genomic studies have been extensively conducted, searching for markers with predictive power for the prognosis of complex diseases such as cancer, diabetes and obesity. Most existing statistical analyses focus on developing marker selection techniques, while little attention is paid to the underlying prognosis models. In this article, we review three commonly used prognosis models, namely the Cox, additive risk and accelerated failure time models. We conduct simulations and show that gene identification can be unsatisfactory under model misspecification. We analyze three cancer prognosis studies under the three models, and show that the gene identification results, the prediction performance of all identified genes combined, and the reproducibility of each identified gene are model-dependent. We suggest that in practical data analysis, more attention should be paid to the model assumption, and multiple models may need to be considered.
doi:10.1093/bib/bbp070
PMCID: PMC2905523
PMID: 20123942
genomic studies; semiparametric prognosis models; model comparison
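For reference, the three prognosis models named in the abstract above are usually written as follows. The notation is a standard textbook convention assumed here, not taken from the article: T is the survival time, X the covariate vector (for example gene expressions), β the regression coefficients, λ0 an unspecified baseline hazard, α an intercept and ε a random error with unspecified distribution.

```latex
\begin{align*}
\text{Cox:} \quad & \lambda(t \mid X) = \lambda_0(t)\,\exp(\beta^\top X) \\
\text{Additive risk:} \quad & \lambda(t \mid X) = \lambda_0(t) + \beta^\top X \\
\text{AFT:} \quad & \log T = \alpha + \beta^\top X + \varepsilon
\end{align*}
```

Covariates act multiplicatively on the hazard under the Cox model, additively on the hazard under the additive risk model, and directly on the log survival time under the AFT model. A gene whose effect fits one of these forms need not be detected under another, which is consistent with the model dependence the abstract reports.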
Summary
Objectives
In breast cancer research, it is important to identify genomic markers associated with prognosis. Multiple microarray gene expression profiling studies have been conducted, searching for prognosis markers. Genomic markers identified from the analysis of single datasets often suffer a lack of reproducibility because of small sample sizes. Integrative analysis of data from multiple independent studies has a larger sample size and may provide a cost-effective solution.
Methods
We collect four breast cancer prognosis studies with gene expression measurements. An accelerated failure time (AFT) model with an unknown error distribution is adopted to describe survival. An integrative sparse boosting approach is employed for marker selection. The proposed model and boosting approach can effectively accommodate heterogeneity across multiple studies and identify genes with consistent effects.
Results
Simulation study shows that the proposed approach outperforms alternatives including meta-analysis and intensity approaches by identifying the majority or all of the true positives, while having a low false positive rate. In the analysis of breast cancer data, 44 genes are identified as associated with prognosis. Many of the identified genes have been previously suggested as associated with tumorigenesis and cancer prognosis. The identified genes and corresponding predicted risk scores differ from those using alternative approaches. Monte Carlo-based prediction evaluation suggests that the proposed approach has the best prediction performance.
Conclusions
Integrative analysis may provide an effective way of identifying breast cancer prognosis markers. Markers identified using the integrative sparse boosting analysis have sound biological implications and satisfactory prediction performance.
doi:10.3414/ME11-02-0019
PMCID: PMC3598607
PMID: 22344268
Breast cancer prognosis; Gene Expression; Integrative analysis; Sparse boosting
Objective
Transcriptional coactivator p300 has been shown to play a variety of roles in the transcription process and mutation of p300 has been found in certain types of human cancers. However, the expression dynamics of p300 in breast cancer (BC) and its effect on BC patients’ prognosis are poorly understood.
Methods
In the present study, the methods of tissue microarray and immunohistochemistry (IHC) were used to investigate the protein expression of p300 in BCs. Receiver operating characteristic (ROC) curve analysis, Spearman’s rank correlation, Kaplan-Meier plots and Cox proportional hazards regression model were utilized to analyze the data.
Results
Based on the ROC curve analysis, high expression of p300 was defined as an H score above 105. High expression of p300 was observed in 105/193 (54.4%) of BCs and in 6/25 (24.0%) of non-malignant breast tissues (P=0.004). Further correlation analysis showed that high expression of p300 was positively correlated with higher histological grade, advanced clinical stage and tumor recurrence (P<0.05). In univariate survival analysis, a significant association between high expression of p300 and shortened patients’ survival and poor progression-free survival was found (P<0.05). Importantly, p300 expression was evaluated as an independent prognostic factor in multivariate analysis (P<0.05).
Conclusion
Our findings provide a basis for the concept that high expression of p300 in BC may be important in the acquisition of a recurrence phenotype, suggesting that p300 high expression, as examined by IHC, is an independent biomarker for poor prognosis of patients with BC.
doi:10.1007/s11670-011-0201-5
PMCID: PMC3587557
PMID: 23467396
Breast cancer; p300; Tumor recurrence; Prognosis
Purpose
To validate whether FAM70B, which was found in our micro-array profiling as a prognostic marker for cancer survival, could accurately predict prognosis in patients with muscle-invasive bladder cancer (MIBC).
Materials and Methods
A total of 124 patients with MIBC were enrolled in this study. The FAM70B expression level was analyzed by real-time polymerase chain reaction by using RNA from tumor tissues. The prognostic effect of FAM70B was evaluated by Kaplan-Meier analysis and a multivariate Cox regression model.
Results
Kaplan-Meier estimates showed a significant difference in progression-free survival (log-rank test, p=0.011) and cancer-specific survival (log-rank test, p=0.017) according to FAM70B gene expression level. By multivariate Cox regression analysis, high FAM70B expression was predictive of cancer progression (hazard ratio [HR], 2.115, p=0.013) and cancer-specific death (HR, 1.925; p=0.033). In the subgroup analysis, high expression of FAM70B was associated with poor cancer-specific survival, progression-free survival, and overall survival in the patients who underwent cystectomy (log-rank test, p=0.013, p=0.036, p=0.005, respectively). In the chemotherapy group, FAM70B expression was associated with cancer-specific survival and progression-free survival (log-rank test, p=0.013, p=0.042, respectively). Moreover, high FAM70B expression was associated with shorter cancer-specific survival in localized or locally advanced tumor stages (log-rank test, p=0.016).
Conclusions
We confirmed the significance of FAM70B as a prognostic marker in a validation cohort. Therefore, we propose that the FAM70B gene could be used to more precisely predict cancer progression and cancer-specific death in patients with MIBC.
doi:10.4111/kju.2012.53.9.598
PMCID: PMC3460001
PMID: 23060996
Bladder cancer; Gene expression profiling; Micro-array; Prognosis
Background
An important application of microarrays is to discover genomic biomarkers, among tens of thousands of genes assayed, for disease diagnosis and prognosis. Thus it is of interest to develop efficient statistical methods that can simultaneously identify important biomarkers from such high-throughput genomic data and construct appropriate classification rules. It is also of interest to develop methods for evaluation of classification performance and ranking of identified biomarkers.
Results
The ROC (receiver operating characteristic) technique has been widely used in disease classification with low dimensional biomarkers. Compared with the empirical ROC approach, the binormal ROC is computationally more affordable and robust in small sample size cases. We propose using the binormal AUC (area under the ROC curve) as the objective function for two-sample classification, and the scaled threshold gradient directed regularization method for regularized estimation and biomarker selection. Tuning parameter selection is based on V-fold cross validation. We develop Monte Carlo based methods for evaluating the stability of individual biomarkers and overall prediction performance. Extensive simulation studies show that the proposed approach can generate parsimonious models with excellent classification and prediction performance, under most simulated scenarios including model mis-specification. Application of the method to two cancer studies shows that the identified genes are reasonably stable with satisfactory prediction performance and biologically sound implications. The overall classification performance is satisfactory, with small classification errors and large AUCs.
Conclusion
In comparison to existing methods, the proposed approach is computationally more affordable without losing the optimality possessed by the standard ROC method.
doi:10.1186/1471-2105-7-253
PMCID: PMC1513612
PMID: 16684357
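The binormal AUC referred to in the abstract above has a closed form when marker scores in the two classes are modeled as normal: AUC = Φ((μ1 − μ0) / sqrt(σ0² + σ1²)). The following is a minimal sketch that compares it with the empirical (Mann-Whitney) AUC on simulated scores. The function names and the simulated data are illustrative assumptions; the sketch does not implement the article's regularized estimation or biomarker selection.

```python
import math
import random


def binormal_auc(controls, cases):
    """AUC under the binormal model: Phi((mu1 - mu0) / sqrt(s0^2 + s1^2))."""
    def mean(xs):
        return sum(xs) / len(xs)

    def variance(xs):
        m = mean(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    z = (mean(cases) - mean(controls)) / math.sqrt(variance(controls) + variance(cases))
    # Standard normal CDF expressed through the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def empirical_auc(controls, cases):
    """Mann-Whitney estimate: share of (case, control) pairs ranked correctly, ties count 1/2."""
    wins = sum((c > d) + 0.5 * (c == d) for c in cases for d in controls)
    return wins / (len(cases) * len(controls))


# Simulated scores (assumption): controls ~ N(0, 1), cases ~ N(1, 1).
rng = random.Random(0)
controls = [rng.gauss(0.0, 1.0) for _ in range(200)]
cases = [rng.gauss(1.0, 1.0) for _ in range(200)]

auc_binormal = binormal_auc(controls, cases)
auc_empirical = empirical_auc(controls, cases)
print(f"binormal AUC:  {auc_binormal:.3f}")
print(f"empirical AUC: {auc_empirical:.3f}")
```

The binormal version needs only two means and two variances, whereas the empirical version compares every case-control pair, which illustrates why the former is cheaper to evaluate repeatedly inside an optimization loop.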
Background
Extensive biomedical studies have shown that clinical and environmental risk factors may not have sufficient predictive power for cancer prognosis. The development of high-throughput profiling technologies makes it possible to survey the whole genome and search for genomic markers with predictive power. Many existing studies assume the interchangeability of gene effects and ignore the coordination among them.
Results
We adopt the weighted co-expression network to describe the interplay among genes. Although there are several different ways of defining gene networks, the weighted co-expression network may be preferred because of its computational simplicity, satisfactory empirical performance, and because it does not demand additional biological experiments. For cancer prognosis studies with gene expression measurements, we propose a new marker selection method that can properly incorporate the network connectivity of genes. We analyze six prognosis studies on breast cancer and lymphoma. We find that the proposed approach can identify genes that are significantly different from those using alternatives. We search published literature and find that genes identified using the proposed approach are biologically meaningful. In addition, they have better prediction performance and reproducibility than genes identified using alternatives.
Conclusions
The network contains important information on the functionality of genes. Incorporating the network structure can improve cancer marker identification.
doi:10.1186/1471-2105-11-271
PMCID: PMC2881088
PMID: 20487548
Background
ZEB2 has been suggested to mediate EMT and disease aggressiveness in several types of human cancers. However, the expression patterns of ZEB2 in hepatocellular carcinoma (HCC) and its effect on prognosis of HCC patients treated with hepatectomy are unclear.
Methodology/Principal Findings
In this study, the methods of tissue microarray and immunohistochemistry (IHC) were utilized to investigate ZEB2 expression in HCC and peritumoral liver tissue (PLT). Receiver operating characteristic (ROC) analysis, Spearman's rank correlation, Kaplan-Meier plots and the Cox proportional hazards regression model were used to analyze the data. Up-regulated expression of cytoplasmic/nuclear ZEB2 protein was observed in the majority of PLTs, when compared to HCCs. Further analysis showed that overexpression of cytoplasmic ZEB2 in HCCs was inversely correlated with AFP level, tumor size and differentiation (P<0.05). Also, overexpression of cytoplasmic ZEB2 in PLTs correlated with lower AFP level (P<0.05). In univariate survival analysis, a significant association between overexpression of cytoplasmic ZEB2 by HCCs/PLTs and longer patients' survival was found (P<0.05). Importantly, cytoplasmic ZEB2 expression in PLTs was evaluated as an independent prognostic factor in multivariate analysis (P<0.05). Consequently, a new clinicopathologic prognostic model with cytoplasmic ZEB2 expression (including HCCs and PLTs) was constructed. The model could significantly stratify risk (low, intermediate and high) for overall survival (P = 0.002).
Conclusions/Significance
Our findings provide a basis for the concept that cytoplasmic ZEB2 expressed by PLTs can predict the postoperative survival of patients with HCC. The combined cytoplasmic ZEB2 prognostic model may become a useful tool for identifying patients with different clinical outcomes.
doi:10.1371/journal.pone.0032838
PMCID: PMC3290607
PMID: 22393452
Background
We applied stochastic search variable selection (SSVS), a Bayesian model selection method, to the simulated data of Genetic Analysis Workshop 13. We used SSVS with the revisited Haseman-Elston method to find the markers linked to the loci determining change in cholesterol over time. To study gene-gene interaction (epistasis) and gene-environment interaction, we adopted prior structures, which incorporate the relationship among the predictors. This allows SSVS to search in the model space more efficiently and avoid the less likely models.
Results
In applying SSVS, instead of looking at the posterior distribution of each of the candidate models, which is sensitive to the setting of the prior, we ranked the candidate variables (markers) according to their marginal posterior probability, which was shown to be more robust to the prior. Compared with traditional methods that consider one marker at a time, our method considers all markers simultaneously and obtains more favorable results.
Conclusions
We showed that SSVS is a powerful method for identifying linked markers using the Haseman-Elston method, even for weak effects. SSVS is very effective because it does a smart search over the entire model space.
doi:10.1186/1471-2156-4-S1-S69
PMCID: PMC1866507
PMID: 14975137
Summary
The predictiveness curve is a graphical tool that characterizes the population distribution of Risk(Y) = P(D = 1|Y), where D denotes a binary outcome such as occurrence of an event within a specified time period and Y denotes predictors. A wider distribution of Risk(Y) indicates better performance of a risk model in the sense that making treatment recommendations is easier for more subjects. Decisions are more straightforward when a subject's risk is deemed to be high or low. Methods have been developed to estimate predictiveness curves from cohort studies. However early phase studies to evaluate novel risk prediction markers typically employ case-control designs. Here we present semiparametric and nonparametric methods for evaluating a continuous risk prediction marker that accommodate case-control data. Small sample properties are investigated through simulation studies. The semiparametric methods are substantially more efficient than their nonparametric counterparts under a correctly specified model. We generalize them to settings where multiple prediction markers are involved. Applications to prostate cancer risk prediction markers illustrate methods for comparing the risk prediction capacities of markers and for evaluating the increment in performance gained by adding a marker to a baseline risk model. We propose a modified Hosmer-Lemeshow test for case-control study data to assess calibration of the risk model that is a natural complement to this graphical tool.
doi:10.1002/sim.3876
PMCID: PMC3045657
PMID: 20527013
biomarker; case-control study; classification; Hosmer-Lemeshow test; predictiveness curve; risk; ROC curve
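The predictiveness curve described in the abstract above plots R(v), the v-th quantile of Risk(Y), against v. The following is a minimal sketch of that definition, not the article's semiparametric or nonparametric estimators. It assumes a known logistic risk model and a known disease prevalence, both hypothetical, and it reweights a case-control sample so that cases and controls represent their population shares.

```python
import math


def logistic_risk(y, intercept=-2.0, slope=1.0):
    """Hypothetical risk model: P(D = 1 | Y = y) under a logistic form."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * y)))


def predictiveness_curve(risks, weights=None):
    """Return [(v, R(v))]: the weighted share v of subjects with risk at most R(v)."""
    if weights is None:
        weights = [1.0] * len(risks)  # cohort sample: every subject counts equally
    total = sum(weights)
    curve, cumulative = [], 0.0
    for risk, weight in sorted(zip(risks, weights)):
        cumulative += weight
        curve.append((cumulative / total, risk))
    return curve


# Hypothetical case-control sample of marker values and an assumed prevalence.
case_markers = [1.5, 2.0, 3.0]
control_markers = [-1.0, 0.0, 0.5, 1.0]
prevalence = 0.1

risks = [logistic_risk(y) for y in case_markers + control_markers]
# Cases jointly carry weight `prevalence`, controls jointly carry `1 - prevalence`.
weights = ([prevalence / len(case_markers)] * len(case_markers)
           + [(1.0 - prevalence) / len(control_markers)] * len(control_markers))

curve = predictiveness_curve(risks, weights)
for v, risk in curve:
    print(f"v = {v:.3f}  R(v) = {risk:.3f}")
```

Without the reweighting, the over-sampled cases would shift the curve toward high risks, which is the reason a case-control design needs estimators different from those used for cohort data.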
Background
Human heparanase plays an important role in cancer development and single nucleotide polymorphisms (SNPs) in the heparanase gene (HPSE) have been shown to be correlated with gastric cancer. The present study examined the associations between individual SNPs or haplotypes in HPSE and susceptibility, clinicopathological parameters and prognosis of gastric cancer in a large sample of the Han population in northern China.
Methodology/Principal Findings
Genomic DNA was extracted from formalin-fixed, paraffin-embedded normal gastric tissue samples from 404 patients and from blood from 404 healthy controls. Six SNPs were genotyped by matrix-assisted laser desorption/ionization time-of-flight mass spectrometry. A chi-square (χ2) test and unconditional logistic regression were used to analyze the risk of gastric cancer; a Log-rank test and Cox proportional hazards model were used to produce survival analysis and a Kaplan-Meier method was used to map survival curves. The mean genotyping success rates were more than 99% in both groups. Haplotype CA in the block composed of rs11099592 and rs4693608 had a greater distribution in the group of Borrmann types 3 and 4 (P = 0.037), the group of a greater number of lymph node metastases (N3 vs N0 group, P = 0.046), and moreover was correlated to poor survival (CG vs CA: HR = 0.645, 95%CI: 0.421–0.989, P = 0.044). In addition, genotypes rs4693608 AA and rs4364254 TT were associated with poor survival (P = 0.030, HR = 1.527, 95%CI: 1.042–2.238 for rs4693608 AA; P = 0.013, HR = 1.546, 95%CI: 1.096–2.181 for rs4364254 TT). There were no correlations between individual SNPs or haplotypes and gastric cancer risk.
Conclusions/Significance
A functional haplotype in HPSE was found, which included the important SNP rs4693608. SNPs in HPSE play an important role in gastric cancer progression and survival, and may serve as molecular markers for prognosis and treatment.
doi:10.1371/journal.pone.0030277
PMCID: PMC3262795
PMID: 22276173
Background
With the growing number of public repositories for high-throughput genomic data, it is of great interest to combine the results produced by independent research groups. Such a combination allows the identification of common genomic factors across multiple cancer types and provides new insights into the disease process. In the framework of the proportional hazards model, classical procedures, which consist of ranking genes according to the estimated hazard ratio or the p-value obtained from a test statistic of no association between survival and gene expression level, are not suitable for gene selection across multiple genomic datasets with different sample sizes. We propose a novel index for identifying genes with a common effect across heterogeneous genomic studies designed to remain stable whatever the sample size and which has a straightforward interpretation in terms of the percentage of separability between patients according to their survival times and gene expression measurements.
Results
The simulation results show that the proposed index is not substantially affected by the sample size of the study or by censoring. They also show that its separability performance is higher than that of indices of predictive accuracy relying on the likelihood function. A simulated example illustrates the good operating characteristics of our index. In addition, we demonstrate that it is linked to the score statistic and possesses a biologically relevant interpretation.
The practical use of the index is illustrated for identifying genes with common effects across eight independent genomic cancer studies of different sample sizes. The meta-selection allows the identification of four genes (ESPL1, KIF4A, HJURP, LRIG1) that are biologically relevant to the carcinogenesis process and have a prognostic impact on survival outcome across various solid tumors.
Conclusion
The proposed index is a promising tool for identifying factors having a prognostic impact across a collection of heterogeneous genomic datasets of various sizes.
doi:10.1186/1471-2105-11-150
PMCID: PMC2863163
PMID: 20334636
Summary
High-throughput gene profiling studies have been extensively conducted, searching for markers associated with cancer development and progression. In this study, we analyse cancer prognosis studies with right censored survival responses. With gene expression data, we adopt the weighted gene co-expression network analysis (WGCNA) to describe the interplay among genes. In network analysis, nodes represent genes. There are subsets of nodes, called modules, which are tightly connected to each other. Genes within the same modules tend to have co-regulated biological functions. For cancer prognosis data with gene expression measurements, our goal is to identify cancer markers, while properly accounting for the network module structure. A two-step sparse boosting approach, called Network Sparse Boosting (NSBoost), is proposed for marker selection. In the first step, for each module separately, we use a sparse boosting approach for within-module marker selection and construct module-level ‘super markers’. In the second step, we use the super markers to represent the effects of all genes within the same modules and conduct module-level selection using a sparse boosting approach. A simulation study shows that NSBoost can more accurately identify cancer-associated genes and modules than alternatives. In the analysis of breast cancer and lymphoma prognosis studies, NSBoost identifies genes with important biological implications. It outperforms alternatives including the boosting and penalization approaches by identifying a smaller number of genes/modules and/or having better prediction performance.
doi:10.1017/S0016672312000419
PMCID: PMC3573352
PMID: 22950901
Genomic selection refers to the use of genomewide dense markers for breeding value estimation and subsequently for selection. The main challenge of genomic breeding value estimation is the estimation of many effects from a limited number of observations. Bayesian methods have been proposed to cope with this challenge. As an alternative class of models, non- and semiparametric models were recently introduced. The present study investigated the ability of nonparametric additive regression models to predict genomic breeding values. The genotypes were modelled for each marker or pair of flanking markers (i.e. the predictors) separately. The nonparametric functions for the predictors were estimated simultaneously using additive model theory, applying a binomial kernel. The optimal degree of smoothing was determined by bootstrapping. A mutation-drift-balance simulation was carried out. The breeding values of the last generation (genotyped) were predicted using data from the next-to-last generation (genotyped and phenotyped). The results show moderate to high accuracies of the predicted breeding values. Determining a predictor-specific degree of smoothing increased the accuracy.
doi:10.1186/1297-9686-41-20
PMCID: PMC2657215
PMID: 19284696
Breast cancer is the most common non-skin cancer in women and the second most common cause of cancer-related death in U.S. women. It is well known that breast cancer survival varies by age at diagnosis. For most cancers, relative survival decreases with age, but breast cancer may have an unusual age pattern. To reveal the pattern of stage risk and age effects, we propose a semiparametric accelerated failure time partial linear model and develop an estimation method for it based on P-splines and the rank estimation approach. The simulation studies demonstrate that the proposed method is comparable to the parametric approach when the data are not contaminated, and more stable than the parametric methods when the data are contaminated. By applying the proposed model and method to the breast cancer data set of Atlantic County, New Jersey, from the SEER program, we reveal significant effects of stage and show that women diagnosed around age 38 have consistently higher survival rates than either younger or older women.
doi:10.1016/j.csda.2010.10.012
PMCID: PMC3076955
PMID: 21499529
Accelerated failure time model; Partial linear model; Penalized spline; Rank estimation; Robustness
Among the gynaecological malignancies, ovarian cancer is one of the neoplastic forms with the poorest prognosis, with worse overall and disease-free survival rates than other gynaecological cancers. Ovarian tumors can be classified by cell of origin into epithelial, stromal and germ cell tumors. Epithelial ovarian tumors display great histological heterogeneity and can be further subdivided into benign, intermediate or borderline, and invasive tumors. Several studies on ovarian tumors have focused on the identification of both diagnostic and prognostic markers for applications in clinical practice. High-throughput technologies have accelerated biomolecular study and genomic discovery; unfortunately, the validity of these findings still needs to be demonstrated by extensive research on the sensibility and sensitivity of novel ovarian cancer biomarkers, determining whether gene profiling and proteomics could help differentiate between patients with metastatic ovarian cancer and primary ovarian carcinomas, and their potential impact on management. Therefore, considerable interest lies in identifying molecular and protein biomarkers and indicators to guide treatment decisions and clinical follow-up. In this review, the current state of knowledge about the genoproteomic and potential clinical value of gene expression profiling in ovarian cancer and ovarian borderline tumors is discussed, focusing on three main areas: distinguishing normal ovarian tissue from ovarian cancers and borderline tumors, identifying different genotypes of ovarian tissue, and identifying proteins linked to cancer or tumor development. With these aims, the authors focus on the use of novel molecules, developed through proteomics and genomics research, as potential protein biomarkers in the management of ovarian cancer or borderline tumors, giving an overview of the current state of the art and of future research perspectives.
doi:10.2174/138920209788488553
PMCID: PMC2709935
PMID: 19949545
Ovarian cancer; borderline ovarian tumors; markers; genomics; proteomics; oncogenes.
In a prospective cohort study, information on clinical parameters, tests and molecular markers is often collected. Such information is useful to predict patient prognosis and to select patients for targeted therapy. We propose a new graphical approach, the positive predictive value (PPV) curve, to quantify the predictive accuracy of prognostic markers measured on a continuous scale with censored failure time outcome. The proposed method highlights the need to consider both predictive values and the marker distribution in the population when evaluating a marker, and it provides a common scale for comparing different markers. We consider both semiparametric and nonparametric based estimating procedures. In addition, we provide asymptotic distribution theory and resampling based procedures for making statistical inference. We illustrate our approach with numerical studies and datasets from the Seattle Heart Failure Study.
doi:10.1198/016214507000001481
PMCID: PMC2719907
PMID: 19655041
Prognostic accuracy; Positive predictive value; Survival analysis
Single nucleotide polymorphisms (SNPs) are valuable tools for ecological and evolutionary studies. In non-model species, the use of SNPs has been limited by the number of markers available. However, new technologies and decreasing technology costs have facilitated the discovery of a constantly increasing number of SNPs. With hundreds or thousands of SNPs potentially available, there is interest in comparing and developing methods for evaluating SNPs to create panels of high-throughput assays that are customized for performance, research questions, and resources. Here we use five different methods to rank 43 new SNPs and 71 previously published SNPs for sockeye salmon: FST, informativeness (In), average contribution to principal components (LC), and the locus-ranking programs BELS and WHICHLOCI. We then tested the performance of these different ranking methods by creating 48- and 96-SNP panels of the top-ranked loci for each method and used empirical and simulated data to obtain the probability of assigning individuals to the correct population using each panel. All 96-SNP panels performed similarly and better than the 48-SNP panels except for the 96-SNP BELS panel. Among the 48-SNP panels, panels created from FST, In, and LC ranks performed better than panels formed using the top-ranked loci from the programs BELS and WHICHLOCI. The application of ranking methods to optimize panel performance will become more important as more high-throughput assays become available.
doi:10.1371/journal.pone.0049018
PMCID: PMC3502385
PMID: 23185290
Background
Researchers in the field of bioinformatics often face a challenge of combining several ordered lists in a proper and efficient manner. Rank aggregation techniques offer a general and flexible framework that allows one to objectively perform the necessary aggregation. With the rapid growth of high-throughput genomic and proteomic studies, the potential utility of rank aggregation in the context of meta-analysis becomes even more apparent. One of the major strengths of rank-based aggregation is the ability to combine lists coming from different sources and platforms, for example different microarray chips, which may or may not be directly comparable otherwise.
Results
The RankAggreg package provides two methods for combining the ordered lists: the Cross-Entropy method and the Genetic Algorithm. Two examples of rank aggregation using the package are given in the manuscript: one in the context of clustering based on gene expression, and the other one in the context of meta-analysis of prostate cancer microarray experiments.
Conclusion
The two examples described in the manuscript clearly show the utility of the RankAggreg package in the current bioinformatics context where ordered lists are routinely produced as a result of modern high-throughput technologies.
doi:10.1186/1471-2105-10-62
PMCID: PMC2669484
PMID: 19228411
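Rank aggregation of the kind described in the abstract above can be framed as finding the ordering that minimizes the total distance to the input lists. The following is a minimal sketch of that objective using the Spearman footrule distance and an exhaustive search over permutations. The exhaustive search is a stand-in chosen here for clarity on tiny lists; it is not the package's Cross-Entropy method or Genetic Algorithm, and the function names and example lists are hypothetical.

```python
from itertools import permutations


def footrule(candidate, ranking):
    """Spearman footrule: sum of absolute position differences between two orderings."""
    position = {item: i for i, item in enumerate(ranking)}
    return sum(abs(i - position[item]) for i, item in enumerate(candidate))


def brute_force_aggregate(rankings):
    """Return the ordering with the smallest total footrule distance to all input lists.

    Assumes every list ranks the same items. Feasible only for a handful of items,
    since the number of candidate orderings grows factorially.
    """
    items = sorted(rankings[0])
    best = min(permutations(items),
               key=lambda cand: sum(footrule(cand, r) for r in rankings))
    return list(best)


# Hypothetical ordered lists, e.g. top genes reported by three studies.
lists = [
    ["A", "B", "C", "D"],
    ["A", "C", "B", "D"],
    ["B", "A", "C", "D"],
]
consensus = brute_force_aggregate(lists)
print(consensus)
```

With more than about ten items, exhaustive enumeration becomes impractical, which is why stochastic search methods such as the two named in the abstract are used for realistic list lengths.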
Background:
The goal of this study is to extend the applications of parametric survival models so that they include cases in which the accelerated failure time (AFT) assumption is not satisfied, and to examine parametric and semiparametric models under different proportional hazards (PH) and AFT assumptions.
Methods:
The data for 12,531 women diagnosed with breast cancer in British Columbia, Canada, during 1990–1999 were divided into eight groups according to patients’ ages and stage of disease, and each group was assumed to have different AFT and PH assumptions. For parametric models, we fitted the saturated generalized gamma (GG) distribution, and compared this with the conventional AFT model. Using a likelihood ratio statistic, both models were compared to the simpler forms including the Weibull and lognormal. For semiparametric models, either Cox's PH model or a stratified Cox model was fitted according to the PH assumption and tested using Schoenfeld residuals. The GG family was compared to the log-logistic model using the Akaike information criterion (AIC) and the Bayesian information criterion (BIC).
Results:
When the PH and AFT assumptions were satisfied, semiparametric and parametric models both provided valid descriptions of breast cancer patient survival. When the PH assumption was not satisfied but the AFT condition held, the parametric models performed better than the stratified Cox model. When neither the PH nor the AFT assumption was met, the lognormal distribution provided a reasonable fit.
Conclusions:
When both the PH and AFT assumptions are satisfied, the parametric and semiparametric models provide complementary information. When the PH assumption is not satisfied, the parametric models should be considered, whether the AFT assumption is met or not.
PMCID: PMC3445281
PMID: 23024854
Breast cancer; generalized gamma distribution; parametric regression; stratified Cox model; survival analysis
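The abstract above compares the generalized gamma family with the log-logistic model by AIC and BIC, a natural choice because the log-logistic is not a special case of the generalized gamma and so cannot be tested against it with a nested likelihood ratio test. The following is a minimal sketch of that comparison. The log-likelihoods, parameter counts and sample size are made-up numbers for illustration, not results from the study.

```python
import math


def aic(loglik, n_params):
    """Akaike information criterion: 2k - 2 log L."""
    return 2 * n_params - 2 * loglik


def bic(loglik, n_params, n_obs):
    """Bayesian information criterion: k log(n) - 2 log L."""
    return n_params * math.log(n_obs) - 2 * loglik


# Hypothetical fits: (maximized log-likelihood, number of parameters).
fits = {
    "generalized gamma": (-1502.3, 4),
    "log-logistic": (-1506.9, 3),
}
n_obs = 1500  # hypothetical group size

scores = {name: (aic(ll, k), bic(ll, k, n_obs)) for name, (ll, k) in fits.items()}
best_by_aic = min(scores, key=lambda name: scores[name][0])
best_by_bic = min(scores, key=lambda name: scores[name][1])

for name, (a, b) in scores.items():
    print(f"{name}: AIC = {a:.1f}, BIC = {b:.1f}")
print("preferred by AIC:", best_by_aic)
print("preferred by BIC:", best_by_bic)
```

Lower values indicate the preferred model under either criterion. BIC penalizes each extra parameter by log(n) rather than 2, so it favors the simpler model more strongly in large samples; some survival analyses use the number of events rather than the number of subjects for n, a choice this sketch does not address.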
SUMMARY
This work focuses on the estimation of distribution functions with incomplete data, where the variable of interest Y has ignorable missingness but the covariate X is always observed. When X is high dimensional, parametric approaches to incorporating X-information are encumbered by the risk of model misspecification, and nonparametric approaches by the curse of dimensionality. We propose a semiparametric approach, which is developed under a nonparametric kernel regression framework, but with a parametric working index to condense the high dimensional X-information for reduced dimension. This kernel dimension reduction estimator has double robustness to model misspecification and is most efficient if the working index adequately conveys the X-information about the distribution of Y. Numerical studies indicate better performance of the semiparametric estimator over its parametric and nonparametric counterparts. We apply the kernel dimension reduction estimation to an HIV study for the effect of antiretroviral therapy on HIV virologic suppression.
doi:10.1016/j.jspi.2011.03.030
PMCID: PMC3127551
PMID: 21731174
curse of dimensionality; dimension reduction; distribution function; ignorable missingness; kernel regression; quantile
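The idea in the abstract above, condensing a high dimensional X into a one-dimensional working index and then smoothing over that index, can be sketched as below. This is one simple imputation-style variant written for illustration and is not claimed to be the authors' estimator: the linear index X·beta, the Gaussian kernel, the fixed bandwidth, and all function names are assumptions.

```python
import math

def gaussian_kernel(u):
    """Unnormalized Gaussian kernel; the constant cancels in the ratio below."""
    return math.exp(-0.5 * u * u)

def kernel_cdf_estimate(y0, X, Y, beta, bandwidth):
    """Estimate F(y0) = P(Y <= y0) when some Y values are missing.

    X: list of covariate rows (always observed).
    Y: list of responses, with None marking a missing value.
    beta: coefficients of the (assumed linear) working index S = X . beta.
    For every subject, P(Y <= y0 | S) is estimated by a Nadaraya-Watson
    average over subjects with observed Y, and the estimates are averaged.
    """
    index = [sum(b * x for b, x in zip(beta, row)) for row in X]
    observed = [i for i, y in enumerate(Y) if y is not None]
    total = 0.0
    for s in index:
        weights = [gaussian_kernel((s - index[j]) / bandwidth) for j in observed]
        denom = sum(weights)
        if denom == 0.0:
            raise ValueError("bandwidth too small: no observed neighbours")
        numer = sum(w * (Y[j] <= y0) for w, j in zip(weights, observed))
        total += numer / denom
    return total / len(X)

# Toy data (hypothetical): two covariates, one missing response.
X = [[0.0, 1.0], [1.0, 0.5], [2.0, 0.0], [3.0, 1.5]]
Y = [1.0, 2.0, None, 4.0]
estimate = kernel_cdf_estimate(2.0, X, Y, beta=[1.0, 0.5], bandwidth=1.0)
print(round(estimate, 3))
```

With a very large bandwidth the weights become equal and the estimate collapses to the empirical distribution function of the observed responses; a smaller bandwidth lets subjects with similar index values inform each other.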
The construction of the components of Partial Least Squares (PLS) is based on the maximization of the covariance/correlation between linear combinations of the predictors and the response. However, the usual Pearson correlation is influenced by outliers in the response or in the predictors. To cope with outliers, we replace the Pearson correlation with the Spearman rank correlation in the optimization criteria of PLS. The rank-based method of PLS is insensitive to outlying values in both the predictors and response, and incorporates the censoring information by using an approach of Nguyen and Rocke (2004) and two approaches of reweighting and mean imputation of Datta et al. (2007). The performance of the rank-based approaches of PLS, denoted by Rank-based Modified Partial Least Squares (RMPLS), Rank-based Reweighted Partial Least Squares (RRWPLS), and Rank-based Mean-Imputation Partial Least Squares (RMIPLS), is investigated in a simulation study and on four real datasets, under an Accelerated Failure Time (AFT) model, against their un-ranked counterparts, and several other dimension reduction techniques. The results indicate that RMPLS is a better dimension reduction method than other variants of PLS as well as other considered methods in terms of the minimized cross-validation error of fit and the mean squared error of fit in the presence of outliers in the response, and is comparable to other variants of PLS in the absence of outliers.
PMCID: PMC2796584
PMID: 20014472
rank-based PLS; dimension reduction; censored response; outliers
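The motivation for replacing the Pearson correlation with the Spearman rank correlation in the abstract above can be seen in a small example. The sketch below covers only the correlation swap, not the PLS algorithm or the censoring adjustments; the function names and the toy data are assumptions made for illustration.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(a), average_ranks(b))

# Toy data: a perfectly increasing relationship with one gross outlier.
x = list(range(1, 11))
y = [1, 2, 3, 4, 5, 6, 7, 8, 9, -100]

print(round(pearson(x, y), 4))   # -0.4545: the outlier flips the sign
print(round(spearman(x, y), 4))  # 0.4545: the ranks limit its influence
```

Because the outlier can move its own rank by at most n - 1 positions, its effect on the rank correlation is bounded, whereas its effect on the Pearson correlation grows with its magnitude.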
Pressinotti, Nicole Chui | Klocker, Helmut | Schäfer, Georg | Luu, Van-Duc | Ruschhaupt, Markus | Kuner, Ruprecht | Steiner, Eberhard | Poustka, Annemarie | Bartsch, Georg | Sültmann, Holger
Background
Despite recent progress in the identification of genetic and molecular alterations in prostate cancer, markers associated with tumor progression are scarce. Therefore precise diagnosis of patients and prognosis of the disease remain difficult. This study investigated novel molecular markers discriminating between low and highly aggressive types of prostate cancer.
Results
Using 52 microdissected cell populations of low- and high-risk prostate tumors, we identified, via global cDNA microarray analysis, almost 1200 genes that were differentially expressed between these groups. These genes were analyzed by statistical, pathway and gene enrichment methods. Twenty selected candidate genes were verified by quantitative real time PCR and immunohistochemistry. In concordance with the mRNA levels, two genes, MAP3K5 and PDIA3, showed differential protein expression. Functional characterization of PDIA3 revealed a pro-apoptotic role of this gene in PC3 prostate cancer cells.
Conclusions
Our analyses provide deeper insights into the molecular changes occurring during prostate cancer progression. The genes MAP3K5 and PDIA3 are associated with malignant stages of prostate cancer and therefore provide novel potential biomarkers.
doi:10.1186/1476-4598-8-130
PMCID: PMC2807430
PMID: 20035634
Background
It is of particular interest to identify cancer-specific molecular signatures for early diagnosis, monitoring effects of treatment and predicting patient survival time. Molecular information about patients is usually generated from high throughput technologies such as microarray and mass spectrometry. Statistically, we are challenged by the large number of candidates but only a small number of patients in the study, and the right-censored clinical data further complicate the analysis.
Results
We present a two-stage procedure to profile molecular signatures for survival outcomes. Firstly, we group closely-related molecular features into linkage clusters, each portraying either similar or opposite functions and playing similar roles in prognosis; secondly, a Bayesian approach is developed to rank the centroids of these linkage clusters and provide a list of the main molecular features closely related to the outcome of interest. A simulation study showed the superior performance of our approach. When it was applied to data on diffuse large B-cell lymphoma (DLBCL), we were able to identify some new candidate signatures for disease prognosis.
Conclusion
This multivariate approach provides researchers with a more reliable list of molecular features profiled in terms of their prognostic relationship to the event times, and generates dependable information for subsequent identification of prognostic molecular signatures through either biological procedures or further data analysis.
doi:10.1186/1742-4682-4-3
PMCID: PMC1796541
PMID: 17239251
Gene copy number changes are common characteristics of many genetic disorders. A new technology, array comparative genomic hybridization (a-CGH), is widely used today to screen for gains and losses in cancers and other genetic diseases with high resolution at the genome level or for specific chromosomal regions. Statistical methods for analyzing such a-CGH data have been developed. However, most of the existing methods are for data on unrelated individuals, and their results explain only horizontal variations in copy number changes. It is potentially meaningful to develop a statistical method that will allow for the analysis of family data to investigate the vertical kinship effects as well. Here we consider a semiparametric model based on a clustering method in which the marginal distributions are estimated nonparametrically, and the familial dependence structure is modeled by a copula. The model is illustrated and evaluated using simulated data. Our results show that the proposed method is more robust than the commonly used multivariate normal model. Finally, we demonstrate the utility of our method using a real dataset.
PMCID: PMC2735963
PMID: 19812787
cluster; copula; family data; gene copy number; semiparametric model
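The semiparametric idea in the abstract above, nonparametric marginals combined with a copula for the dependence, can be illustrated with a short sketch. The abstract does not say which copula family is used, so the Gaussian copula below is an assumption, as are the function names, the toy data, and the restriction to two family members without tied values.

```python
import math
from statistics import NormalDist

def normal_scores(values):
    """Map values to normal scores through their empirical ranks.

    Each value is replaced by Phi^{-1}(rank / (n + 1)), which removes the
    marginal distribution and keeps only the ordering. Ties are not
    handled (continuous data assumed).
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    standard_normal = NormalDist()
    for rank, i in enumerate(order, start=1):
        scores[i] = standard_normal.inv_cdf(rank / (n + 1))
    return scores

def gaussian_copula_correlation(u, v):
    """Correlation of the normal scores: a simple estimate of the
    dependence parameter of an (assumed) Gaussian copula."""
    a, b = normal_scores(u), normal_scores(v)
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Toy paired measurements (hypothetical), e.g. parent and child.
parent = [0.1, 0.5, 0.9, 1.3, 2.0, 2.4]
child = [0.3, 0.2, 1.1, 1.0, 2.5, 1.9]
rho = gaussian_copula_correlation(parent, child)
print(round(rho, 3))
```

Because only ranks enter the estimate, any strictly increasing transformation of either margin leaves it unchanged, which is what separates the dependence structure from the marginal distributions.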
SUMMARY
Length-biased sampling has been well recognized in economics, industrial reliability, etiology applications, epidemiological, genetic and cancer screening studies. Length-biased right-censored data have a unique data structure different from traditional survival data. The nonparametric and semiparametric estimations and inference methods for traditional survival data are not directly applicable for length-biased right-censored data. We propose new expectation-maximization algorithms for estimations based on full likelihoods involving infinite dimensional parameters under three settings for length-biased data: estimating nonparametric distribution function, estimating nonparametric hazard function under an increasing failure rate constraint, and jointly estimating baseline hazards function and the covariate coefficients under the Cox proportional hazards model. Extensive empirical simulation studies show that the maximum likelihood estimators perform well with moderate sample sizes and lead to more efficient estimators compared to the estimating equation approaches. The proposed estimates are also more robust to various right-censoring mechanisms. We prove the strong consistency properties of the estimators, and establish the asymptotic normality of the semi-parametric maximum likelihood estimators under the Cox model using modern empirical processes theory. We apply the proposed methods to a prevalent cohort medical study. Supplemental materials are available online.
doi:10.1198/jasa.2011.tm10156
PMCID: PMC3273908
PMID: 22323840
Cox regression model; EM algorithm; Increasing failure rate; Non-parametric likelihood; Profile likelihood; Right-censored data
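To see why length-biased data need their own estimators, consider the simplest case of the first setting in the abstract above. The sketch below covers only uncensored length-biased observations, for which the nonparametric maximum likelihood estimate of the distribution function weights each observation by the inverse of its length; it does not implement the EM algorithms the abstract proposes for right-censored data, and the function names and toy data are assumptions.

```python
def length_biased_cdf(t, lengths):
    """Estimate F(t) from UNCENSORED length-biased observations.

    Under length-biased sampling a subject with duration x is observed
    with probability proportional to x, so each observation is
    down-weighted by 1/x to recover the underlying distribution.
    """
    inverse = [1.0 / x for x in lengths]
    return sum(w for w, x in zip(inverse, lengths) if x <= t) / sum(inverse)

def naive_cdf(t, lengths):
    """Ordinary empirical distribution function, ignoring the bias."""
    return sum(1 for x in lengths if x <= t) / len(lengths)

def length_biased_mean(lengths):
    """Mean of the underlying distribution: the harmonic mean of the sample."""
    return len(lengths) / sum(1.0 / x for x in lengths)

# Toy durations (hypothetical), assumed sampled with length bias.
sample = [1.0, 2.0, 4.0]
print(naive_cdf(2.0, sample))          # ignores that long durations are over-sampled
print(length_biased_cdf(2.0, sample))  # puts more mass on short durations
print(length_biased_mean(sample))      # smaller than the arithmetic mean
```

The naive estimate understates the probability of short durations because long durations are over-represented in the sample; right censoring complicates this further, which is the case the abstract addresses.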