In cancer research, high-throughput profiling studies have been extensively conducted, searching for markers associated with prognosis. Because of the “large d, small n” characteristic, results generated from the analysis of a single dataset can be unsatisfactory. Recent studies have shown that integrative analysis, which simultaneously analyzes multiple datasets, can be more effective than single-dataset analysis and classic meta-analysis. In most of existing integrative analysis, the homogeneity model has been assumed, which postulates that different datasets share the same set of markers. Several approaches have been designed to reinforce this assumption. In practice, different datasets may differ in terms of patient selection criteria, profiling techniques, and many other aspects. Such differences may make the homogeneity model too restricted. In this study, we assume the heterogeneity model, under which different datasets are allowed to have different sets of markers. With multiple cancer prognosis datasets, we adopt the AFT (accelerated failure time) model to describe survival. This model may have the lowest computational cost among popular semiparametric survival models. For marker selection, we adopt a sparse group MCP (minimax concave penalty) approach. This approach has an intuitive formulation and can be computed using an effective group coordinate descent algorithm. Simulation study shows that it outperforms the existing approaches under both the homogeneity and heterogeneity models. Data analysis further demonstrates the merit of heterogeneity model and proposed approach.
Integrative analysis; Cancer prognosis; Heterogeneity model; Penalization
Breast cancer is the most common malignancy in women worldwide. With the increasing awareness of heterogeneity in breast cancers, better prediction of breast cancer prognosis is much needed for more personalized treatment and disease management. Towards this goal, we have developed a novel computational model for breast cancer prognosis by combining the Pathway Deregulation Score (PDS) based pathifier algorithm, Cox regression and L1-LASSO penalization method. We trained the model on a set of 236 patients with gene expression data and clinical information, and validated the performance on three diversified testing data sets of 606 patients. To evaluate the performance of the model, we conducted survival analysis of the dichotomized groups, and compared the areas under the curve based on the binary classification. The resulting prognosis genomic model is composed of fifteen pathways (e.g. P53 pathway) that had previously reported cancer relevance, and it successfully differentiated relapse in the training set (log rank p-value = 6.25e-12) and three testing data sets (log rank p-value<0.0005). Moreover, the pathway-based genomic models consistently performed better than gene-based models on all four data sets. We also find strong evidence that combining genomic information with clinical information improved the p-values of prognosis prediction by at least three orders of magnitude in comparison to using either genomic or clinical information alone. In summary, we propose a novel prognosis model that harnesses the pathway-based dysregulation as well as valuable clinical information. The selected pathways in our prognosis model are promising targets for therapeutic intervention.
With the increasing awareness of heterogeneity in breast cancers, better prediction of breast cancer prognosis is much needed early on for more personalized treatment and management. Towards this goal we propose in this study a novel pathway-based prognosis prediction model, which emphasizes on individualized pathway-based risk measurement using the pathway dysregulation score (PDS). In combination with the L1-LASSO penalized feature selection and the COX-Proportional Hazards regression model, we have identified fifteen cancer relevant pathways using the pathway-based genomic model that successfully differentiated the relapse in the training set as well as three diversified test sets. Moreover, given the debate whether higher-order representative features, such as GO sets, pathways and network modules are superior to the gene-level features in the genomic models, we demonstrate that pathway-based genomic models consistently performed better than gene-based models in all four data sets. Last but not least, we show strong evidence that models that combine genomic information with clinical information improves the prognosis prediction significantly, in comparison to models that use either genomic or clinical information alone.
With the availability of high-throughput microarray technologies, investigators can simultaneously measure the expression levels of many thousands of genes in a short period. Although there are rich statistical methods for analyzing microarray data in the literature, limited work has been done in mapping expression quantitative trait loci (eQTL) that influence the variation in levels of gene expression. Most existing eQTL mapping methods assume that the expression phenotypes follow a normal distribution and violation of the normality assumption may lead to inflated type I error and reduced power. QTL analysis of expression data involves the mapping of many expression phenotypes at thousands or hundreds of thousands of marker loci across the whole genome. An appropriate procedure to adjust for multiple testing is essential for guarding against an abundance of false positive results. In this study, we applied a semiparametric quantitative trait loci (SQTL) mapping method to human gene expression data. The SQTL mapping method is rank-based and therefore robust to non-normality and outliers. Furthermore, we apply an efficient Monte Carlo procedure to account for multiple testing and assess the genome-wide significance level. Particularly, we apply the SQTL mapping method and the Monte-Carlo approach to the gene expression data provided by Genetic Analysis Workshop 15.
The predictiveness curve is a graphical tool that characterizes the population distribution of Risk(Y) = P(D = 1|Y), where D denotes a binary outcome such as occurrence of an event within a specified time period and Y denotes predictors. A wider distribution of Risk(Y) indicates better performance of a risk model in the sense that making treatment recommendations is easier for more subjects. Decisions are more straightforward when a subject's risk is deemed to be high or low. Methods have been developed to estimate predictiveness curves from cohort studies. However early phase studies to evaluate novel risk prediction markers typically employ case-control designs. Here we present semiparametric and nonparametric methods for evaluating a continuous risk prediction marker that accommodate case-control data. Small sample properties are investigated through simulation studies. The semiparametric methods are substantially more efficient than their nonparametric counterparts under a correctly specified model. We generalize them to settings where multiple prediction markers are involved. Applications to prostate cancer risk prediction markers illustrate methods for comparing the risk prediction capacities of markers and for evaluating the increment in performance gained by adding a marker to a baseline risk model. We propose a modified Hosmer-Lemeshow test for case-control study data to assess calibration of the risk model that is a natural complement to this graphical tool.
biomarker; case-control study; classification; Hosmer-Lemeshow test; predictiveness curve; risk; ROC curve
The goal of this study is to extend the applications of parametric survival models so that they include cases in which accelerated failure time (AFT) assumption is not satisfied, and examine parametric and semiparametric models under different proportional hazards (PH) and AFT assumptions.
The data for 12,531 women diagnosed with breast cancer in British Columbia, Canada, during 1990–1999 were divided into eight groups according to patients’ ages and stage of disease, and each group was assumed to have different AFT and PH assumptions. For parametric models, we fitted the saturated generalized gamma (GG) distribution, and compared this with the conventional AFT model. Using a likelihood ratio statistic, both models were compared to the simpler forms including the Weibull and lognormal. For semiparametric models, either Cox's PH model or stratified Cox model was fitted according to the PH assumption and tested using Schoenfeld residuals. The GG family was compared to the log-logistic model using Akaike information criterion (AIC) and Baysian information criterion (BIC).
When PH and AFT assumptions were satisfied, semiparametric and parametric models both provided valid descriptions of breast cancer patient survival. When PH assumption was not satisfied but AFT condition held, the parametric models performed better than the stratified Cox model. When neither the PH nor the AFT assumptions were met, the log normal distribution provided a reasonable fit.
When both the PH and AFT assumptions are satisfied, the parametric and semiparametric models provide complementary information. When PH assumption is not satisfied, the parametric models should be considered, whether the AFT assumption is met or not.
Breast cancer; generalized gamma distribution; parametric regression; stratified Cox model; survival analysis
Motivation: The recent advance in high-throughput sequencing technologies is generating a huge amount of data that are becoming an important resource for deciphering the genotype underlying a given phenotype. Genome sequencing has been extensively applied to the study of the cancer genomes. Although a few methods have been already proposed for the detection of cancer-related genes, their automatic identification is still a challenging task. Using the genomic data made available by The Cancer Genome Atlas Consortium (TCGA), we propose a new prioritization approach based on the analysis of the distribution of putative deleterious variants in a large cohort of cancer samples.
Results: In this paper, we present ContastRank, a new method for the prioritization of putative impaired genes in cancer. The method is based on the comparison of the putative defective rate of each gene in tumor versus normal and 1000 genome samples. We show that the method is able to provide a ranked list of putative impaired genes for colon, lung and prostate adenocarcinomas. The list significantly overlaps with the list of known cancer driver genes previously published. More importantly, by using our scoring approach, we can successfully discriminate between TCGA normal and tumor samples. A binary classifier based on ContrastRank score reaches an overall accuracy >90% and the area under the curve (AUC) of receiver operating characteristics (ROC) >0.95 for all the three types of adenocarcinoma analyzed in this paper. In addition, using ContrastRank score, we are able to discriminate the three tumor types with a minimum overall accuracy of 77% and AUC of 0.83.
Conclusions: We describe ContrastRank, a method for prioritizing putative impaired genes in cancer. The method is based on the comparison of exome sequencing data from different cohorts and can detect putative cancer driver genes.
ContrastRank can also be used to estimate a global score for an individual genome about the risk of adenocarcinoma based on the genetic variants information from a whole-exome VCF (Variant Calling Format) file. We believe that the application of ContrastRank can be an important step in genomic medicine to enable genome-based diagnosis.
Availability and implementation: The lists of ContrastRank scores of all genes in each tumor type are available as supplementary materials. A webserver for evaluating the risk of the three studied adenocarcinomas starting from whole-exome VCF file is under development.
Supplementary data are available at Bioinformatics online.
To develop more targeted intervention strategies, an important research goal is to identify markers predictive of clinical events. A crucial step towards this goal is to characterize the clinical performance of a marker for predicting different types of events. In this manuscript, we present statistical methods for evaluating the performance of a prognostic marker in predicting multiple competing events. To capture the potential time-varying predictive performance of the marker and incorporate competing risks, we define time- and cause-specific accuracy summaries by stratifying cases based on causes of failure. Such definition would allow one to evaluate the predictive accuracy of a marker for each type of event and compare its predictiveness across event types. Extending the nonparametric crude cause-specific ROC curve estimators by Saha and Heagerty (2010), we develop inference procedures for a range of cause-specific accuracy summaries. To estimate the accuracy measures and assess how covariates may affect the accuracy of a marker under the competing risk setting, we consider two forms of semiparametric models through the cause-specific hazard framework. These approaches enable a flexible modeling of the relationships between the marker and failure times for each cause, while efficiently accommodating additional covariates. We investigate the asymptotic property of the proposed accuracy estimators and demonstrate the finite sample performance of these estimators through simulation studies. The proposed procedures are illustrated with data from a prostate cancer prognostic study.
Biomarker evaluation; Cause-specific Hazard; Competing risk; Negative predictive value; Positive predictive value; Receiver Operating Characteristics Curve (ROC curve); Survival analysis
A semiparametric approach was used to identify groups of cDNAs and genes with distinct expression profiles across time and overcome the limitations of clustering to identify groups. The semiparametric approach allows the generalization of mixtures of distributions while making no specific parametric assumptions about the distribution of the hidden heterogeneity of the cDNAs. The semiparametric approach was applied to study gene expression in the brains of Apis mellifera ligustica honey bees raised in two colonies (A. m. mellifera and ligustica) with consistent patterns across five maturation ages.
The semiparametric approach provided unambiguous criteria to detect groups of genes, trajectories and probability of gene membership to groups. The semiparametric results were cross-validated in both colony data sets. Gene Ontology analysis enhanced by genome annotation helped to confirm the semiparametric results and revealed that most genes with similar or related neurobiological function were assigned to the same group or groups with similar trajectories. Ten groups of genes were identified and nine groups had highly similar trajectories in both data sets. Differences in the trajectory of the reminder group were consistent with reports of accelerated maturation in ligustica colonies compared to mellifera colonies.
The combination of microarray technology, genomic information and semiparametric analysis provided insights into the genomic plasticity and gene networks linked to behavioral maturation in the honey bee.
To validate whether FAM70B, which was found in our micro-array profiling as a prognostic marker for cancer survival, could accurately predict prognosis in patients with muscle-invasive bladder cancer (MIBC).
Materials and Methods
A total of 124 patients with MIBC were enrolled in this study. The FAM70B expression level was analyzed by real-time polymerase chain reaction by using RNA from tumor tissues. The prognostic effect of FAM70B was evaluated by Kaplan-Meier analysis and a multivariate Cox regression model.
Kaplan-Meier estimates showed a significant difference in progression-free survival (log-rank test, p=0.011) and cancer-specific survival (log-rank test, p=0.017) according to FAM70B gene expression level. By multivariate Cox regression analysis, high FAM70B expression was predictive of cancer progression (hazard ratio [HR], 2.115, p=0.013) and cancer-specific death (HR, 1.925; p=0.033). In the subgroup analysis, high expression of FAM70B was associated with poor cancer-specific survival, progression-free survival, and overall survival in the patients who underwent cystectomy (log-rank test, p=0.013, p=0.036, p=0.005, respectively). In the chemotherapy group, FAM70B expression was associated with cancer-specific survival and progression-free survival (log-rank test, p=0.013, p=0.042, respectively). Moreover, high FAM70B expression was associated with shorter cancer-specific survival in localized or locally advanced tumor stages (log-rank test, p=0.016).
We confirmed the significance of FAM70B as a prognostic marker in a validation cohort. Therefore, we propose that the FAM70B gene could be used to more precisely predict cancer progression and cancer-specific death in patients with MIBC.
Bladder cancer; Gene expression profiling; Micro-array; Prognosis
Development of high-throughput technologies makes it possible to survey the whole genome. Genomic studies have been extensively conducted, searching for markers with predictive power for prognosis of complex diseases such as cancer, diabetes and obesity. Most existing statistical analyses are focused on developing marker selection techniques, while little attention is paid to the underlying prognosis models. In this article, we review three commonly used prognosis models, namely the Cox, additive risk and accelerated failure time models. We conduct simulation and show that gene identification can be unsatisfactory under model misspecification. We analyze three cancer prognosis studies under the three models, and show that the gene identification results, prediction performance of all identified genes combined, and reproducibility of each identified gene are model-dependent. We suggest that in practical data analysis, more attention should be paid to the model assumption, and multiple models may need to be considered.
genomic studies; semiparametric prognosis models; model comparison
High-throughput studies have been extensively conducted in the research of complex human diseases. As a representative example, consider gene-expression studies where thousands of genes are profiled at the same time. An important objective of such studies is to rank the diagnostic accuracy of biomarkers (e.g. gene expressions) for predicting outcome variables while properly adjusting for confounding effects from low-dimensional clinical risk factors and environmental exposures. Existing approaches are often fully based on parametric or semi-parametric models and target evaluating estimation significance as opposed to diagnostic accuracy. Receiver operating characteristic (ROC) approaches can be employed to tackle this problem. However, existing ROC ranking methods focus on biomarkers only and ignore effects of confounders. In this article, we propose a model-based approach which ranks the diagnostic accuracy of biomarkers using ROC measures with a proper adjustment of confounding effects. To this end, three different methods for constructing the underlying regression models are investigated. Simulation study shows that the proposed methods can accurately identify biomarkers with additional diagnostic power beyond confounders. Analysis of two cancer gene-expression studies demonstrates that adjusting for confounders can lead to substantially different rankings of genes.
ranking biomarkers; ROC; confounders; high-throughput data
Microarray studies provide a way of linking variations of phenotypes with their genetic causations. Constructing predictive models using high dimensional microarray measurements usually consists of three steps: (1) unsupervised gene screening; (2) supervised gene screening; and (3) statistical model building. Supervised gene screening based on marginal gene ranking is commonly used to reduce the number of genes in the model building. Various simple statistics, such as t-statistic or signal to noise ratio, have been used to rank genes in the supervised screening. Despite of its extensive usage, statistical study of supervised gene screening remains scarce. Our study is partly motivated by the differences in gene discovery results caused by using different supervised gene screening methods.
We investigate concordance and reproducibility of supervised gene screening based on eight commonly used marginal statistics. Concordance is assessed by the relative fractions of overlaps between top ranked genes screened using different marginal statistics. We propose a Bootstrap Reproducibility Index, which measures reproducibility of individual genes under the supervised screening. Empirical studies are based on four public microarray data. We consider the cases where the top 20%, 40% and 60% genes are screened.
From a gene discovery point of view, the effect of supervised gene screening based on different marginal statistics cannot be ignored. Empirical studies show that (1) genes passed different supervised screenings may be considerably different; (2) concordance may vary, depending on the underlying data structure and percentage of selected genes; (3) evaluated with the Bootstrap Reproducibility Index, genes passed supervised screenings are only moderately reproducible; and (4) concordance cannot be improved by supervised screening based on reproducibility.
Genetic epidemiologic studies often involve investigation of the association of a disease with a genomic region in terms of the underlying haplotypes, that is the combination of alleles at multiple loci along homologous chromosomes. In this article, we consider the problem of estimating haplotype–environment interactions from case–control studies when some of the environmental exposures themselves may be influenced by genetic susceptibility. We specify the distribution of the diplotypes (haplotype pair) given environmental exposures for the underlying population based on a novel semiparametric model that allows haplotypes to be potentially related with environmental exposures, while allowing the marginal distribution of the diplotypes to maintain certain population genetics constraints such as Hardy–Weinberg equilibrium. The marginal distribution of the environmental exposures is allowed to remain completely nonparametric. We develop a semiparametric estimating equation methodology and related asymptotic theory for estimation of the disease odds ratios associated with the haplotypes, environmental exposures, and their interactions, parameters that characterize haplotype–environment associations and the marginal haplotype frequencies. The problem of phase ambiguity of genotype data is handled using a suitable expectation–maximization algorithm. We study the finite-sample performance of the proposed methodology using simulated data. An application of the methodology is illustrated using a case–control study of colorectal adenoma, designed to investigate how the smoking-related risk of colorectal adenoma can be modified by “NAT2,” a smoking-metabolism gene that may potentially influence susceptibility to smoking itself.
Case-control studies; EM algorithm; Gene-environment interactions; Haplotype; Semiparametric methods
In recent years, genome-wide association studies (GWAS) and gene-expression profiling have generated a large number of valuable datasets for assessing how genetic variations are related to disease outcomes. With such datasets, it is often of interest to assess the overall effect of a set of genetic markers, assembled based on biological knowledge. Genetic marker-set analyses have been advocated as more reliable and powerful approaches compared with the traditional marginal approaches (Curtis and others, 2005. Pathways to the analysis of microarray data. TRENDS in Biotechnology
23, 429–435; Efroni and others, 2007. Identification of key processes underlying cancer phenotypes using biologic pathway analysis. PLoS One
2, 425). Procedures for testing the overall effect of a marker-set have been actively studied in recent years. For example, score tests derived under an Empirical Bayes (EB) framework (Liu and others, 2007. Semiparametric regression of multidimensional genetic pathway data: least-squares kernel machines and linear mixed models. Biometrics
63, 1079–1088; Liu and others, 2008. Estimation and testing for the effect of a genetic pathway on a disease outcome using logistic kernel machine regression via logistic mixed models. BMC bioinformatics
9, 292–2; Wu and others, 2010. Powerful SNP-set analysis for case-control genome-wide association studies. American Journal of Human Genetics
86, 929) have been proposed as powerful alternatives to the standard Rao score test (Rao, 1948. Large sample tests of statistical hypotheses concerning several parameters with applications to problems of estimation. Mathematical Proceedings of the Cambridge Philosophical Society, 44, 50–57). The advantages of these EB-based tests are most apparent when the markers are correlated, due to the reduction in the degrees of freedom. In this paper, we propose an adaptive score test which up- or down-weights the contributions from each member of the marker-set based on the Z-scores of their effects. Such an adaptive procedure gains power over the existing procedures when the signal is sparse and the correlation among the markers is weak. By combining evidence from both the EB-based score test and the adaptive test, we further construct an omnibus test that attains good power in most settings. The null distributions of the proposed test statistics can be approximated well either via simple perturbation procedures or via distributional approximations. Through extensive simulation studies, we demonstrate that the proposed procedures perform well in finite samples. We apply the tests to a breast cancer genetic study to assess the overall effect of the FGFR2 gene on breast cancer risk.
Adaptive procedures; Empirical Bayes; GWAS; Pathway analysis; Score test; SNP sets
We propose a semiparametric method for conducting scale-invariant sparse principal component analysis (PCA) on high dimensional non-Gaussian data. Compared with sparse PCA, our method has weaker modeling assumption and is more robust to possible data contamination. Theoretically, the proposed method achieves a parametric rate of convergence in estimating the parameter of interests under a flexible semiparametric distribution family; Computationally, the proposed method exploits a rank-based procedure and is as efficient as sparse PCA; Empirically, our method outperforms most competing methods on both synthetic and real-world datasets.
High dimensional statistics; Principal component analysis; Elliptical distribution; Robust statistics
In cancer prognosis research, diverse machine learning models have applied to the problems of cancer susceptibility (risk assessment), cancer recurrence (redevelopment of cancer after resolution), and cancer survivability, regarding an accuracy (or an AUC--the area under the ROC curve) as a primary measurement for the performance evaluation of the models. However, in order to help medical specialists to establish a treatment plan by using the predicted output of a model, it is more pragmatic to elucidate which variables (markers) have most significantly influenced to the resulting outcome of cancer or which patients show similar patterns.
In this study, a coupling approach of two sub-modules--a predictor and a descriptor--is proposed. The predictor module generates the predicted output for the cancer outcome. Semi-supervised learning co-training algorithm is employed as a predictor. On the other hand, the descriptor module post-processes the results of the predictor module, mainly focusing on which variables are more highly or less significantly ranked when describing the results of the prediction, and how patients are segmented into several groups according to the trait of common patterns among them. Decision trees are used as a descriptor.
The proposed approach, 'predictor-descriptor,' was tested on the breast cancer survivability problem based on the surveillance, epidemiology, and end results database for breast cancer (SEER). The results present the performance comparison among the established machine leaning algorithms, the ranks of the prognosis elements for breast cancer, and patient segmentation. In the performance comparison among the predictor candidates, Semi-supervised learning co-training algorithm showed best performance, producing an average AUC of 0.81. Later, the descriptor module found the top-tier prognosis markers which significantly affect to the classification results on survived/dead patients: 'lymph node involvement', 'stage', 'site-specific surgery', 'number of positive node examined', and 'tumor size', etc. Also, a typical example of patient-segmentation was provided: the patients classified as dead were grouped into two segments depending on difference in prognostic profiles, ones with serious results with respect to the pathologic exams and the others with the feebleness of age.
There are several statistical methods for time-to-event analysis, among which is the Cox proportional hazards model that is most commonly used. However, when the absolute change in risk, instead of the risk ratio, is of primary interest or when the proportional hazard assumption for the Cox proportional hazards model is violated, an additive hazard regression model may be more appropriate. In this paper, we give an overview of this approach and then apply a semiparametric as well as a nonparametric additive model to a data set from a study of the natural history of human papillomavirus (HPV) in HIV-positive and HIV-negative women. The results from the semiparametric model indicated on average an additional 14 oncogenic HPV infections per 100 woman-years related to CD4 count < 200 relative to HIV-negative women, and those from the nonparametric additive model showed an additional 40 oncogenic HPV infections per 100 women over 5 years of followup, while the estimated hazard ratio in the Cox model was 3.82. Although the Cox model can provide a better understanding of the exposure disease association, the additive model is often more useful for public health planning and intervention.
The widespread use of high-throughput methods of single nucleotide polymorphism (SNP) genotyping has created a number of computational and statistical challenges. The problem of identifying SNP–SNP interactions in case–control studies has been studied extensively, and a number of new techniques have been developed. Little progress has been made, however, in the analysis of SNP–SNP interactions in relation to time-to-event data, such as patient survival time or time to cancer relapse. We present an extension of the two class multifactor dimensionality reduction (MDR) algorithm that enables detection and characterization of epistatic SNP–SNP interactions in the context of survival analysis. The proposed Survival MDR (Surv-MDR) method handles survival data by modifying MDR’s constructive induction algorithm to use the log-rank test. Surv-MDR replaces balanced accuracy with log-rank test statistics as the score to determine the best models. We simulated datasets with a survival outcome related to two loci in the absence of any marginal effects. We compared Surv-MDR with Cox-regression for their ability to identify the true predictive loci in these simulated data. We also used this simulation to construct the empirical distribution of Surv-MDR’s testing score. We then applied Surv-MDR to genetic data from a population-based epidemiologic study to find prognostic markers of survival time following a bladder cancer diagnosis. We identified several two-loci SNP combinations that have strong associations with patients’ survival outcome. Surv-MDR is capable of detecting interaction models with weak main effects. These epistatic models tend to be dropped by traditional Cox regression approaches to evaluating interactions. With improved efficiency to handle genome wide datasets, Surv-MDR will play an important role in a research strategy that embraces the complexity of the genotype–phenotype mapping relationship since epistatic interactions are an important component of the genetic basis of disease.
Gene expression profiling has played an important role in cancer risk classification and has shown promising results. Since gene expression profiling often involves determination of a set of top rank genes for analysis, it is important to evaluate how modeling performance varies with the number of selected top ranked genes incorporated in the model. We used a colon data set collected at Moffitt Cancer Center, as an example of the study, and ranked genes based on the univariate Cox proportional hazards model. A set of top ranked genes was selected for evaluation. The selection was done by choosing the top k ranked genes for k=1 to 12,500. An analysis indicated a considerable variation of classification outcomes when the number of top ranked genes was changed. We developed a predictive risk probability approach to accommodate this variation by identifying a range number of top ranked genes. For each number of top ranked genes, the procedure classifies each patient as having high risk (score = 1) or low risk (score = 0). The categorizations are then averaged, giving a risk score between 0 and 1, thus providing a ranking for the patient’s need for further treatment. This approach was applied to the colon data set and demonstrated the strength of this approach by three criteria: First, a univariate Cox proportional hazards model showed a highly statistically significant level (log-rank χ2 statistics=110 with p value < 10−16) for the predictive risk probability classification. Second, the survival tree model used the risk probability to partition patients into five risk groups showing a good separation of survival curves (log-rank χ2 statistics=215). In addition, utilization of the risk group status identified a small set of risk genes which may be practical for biological validation. Third, analysis of re-sampling the risk probability suggested the variation pattern of the log-rank χ2 in the colon cancer dataset was unlikely caused by chance.
Cancer risk classification; dimension reduction; log-rank test; survival tree model; top ranked genes
AIM: To elucidate high mobility group-box 3 (HMGB3) protein expression in gastric adenocarcinoma, its potential prognostic relevance, and possible mechanism of action.
METHODS: Ninety-two patients with gastric adenocarcinomas surgically removed entered the study. HMGB3 expression was determined by immunohistochemistry through a tissue microarray procedure. The clinicopathologic characteristics of all patients were recorded, and regular follow-up was made for all patients. The inter-relationship of HMGB3 expression with histological and clinical factors was analyzed using nonparametric tests. Survival analysis was carried out by Kaplan-Meier (log-rank) and multivariate Cox (Forward LR) analyses between the group with overexpression of HMGB3 and the group with low or no HMGB3 expression to determine the prognosis value of HMGB3 expression on overall survival. Further, HMGB3 expression was knocked down by small hairpin RNAs (shRNAs) in the human gastric cancer cell line BGC823 to observe its influence on cell biological characteristics. The MTT method was utilized to detect gastric cancer cell proliferation changes, and cell cycle distribution was analyzed by flow cytometry.
RESULTS: Among 92 patients with gastric adenocarcinomas surgically removed in this study, high HMGB3 protein expression was detected in the gastric adenocarcinoma tissues vs peritumoral tissues (P < 0.001). Further correlation analysis with patients’ clinical and histology variables revealed that HMGB3 overexpression was obviously associated with extensive wall penetration (P = 0.005), a positive nodal status (P = 0.004), and advanced tumor-node-metastasis (TNM) stage (P = 0.001). But there was no correlation between HMGB3 overexpression and the age and gender of the patient, tumor localization or histologic grade. Statistical Kaplan-Meier survival analysis disclosed significant differences in overall survival between the HMGB3 overexpression group and the HMGB3 no or low expression group (P = 0.006). The expected overall survival time was 31.00 ± 3.773 mo (95%CI = 23.605-38.395) for patients with HMGB3 overexpression and 49.074 ± 3.648 mo (95%CI = 41.925-57.311) for patients with HMGB3 no and low-level expression. Additionally, older age (P = 0.040), extensive wall penetration (P = 0.008), positive lymph node metastasis (P = 0.005), and advanced TNM tumor stage (P = 0.007) showed negative correlation with overall survival. Multivariate Cox regression analysis indicated that HMGB3 overexpression was an independent variable with respect to age, gender, histologic grade, extent of wall penetration, lymph nodal metastasis, and TNM stage for patients with resectable gastric adenocarcinomas with poor prognosis (hazard ratio = 2.791, 95%CI = 1.233-6.319, P = 0.019). In the gene function study, after HMGB3 was knocked down in the gastric cell line BGC823 by shRNA, the cell proliferation rate was reduced at 24 h, 48 h and 72 h. Compared to BGC823 shRNA-negative control (NC) cells, the cell proliferation rate in cells that had HMGB3 shRNA transfected was significantly decreased (P < 0.01). Finally, cell cycle analysis by FACS showed that BGC823 cells that had HMGB3 knocked down were blocked in G1/G0 phase. The percentage of cells in G1/G0 phase in BGC823 cells with shRNA-NC and with shRNA-HMGB3 was 46.84% ± 1.7%, and 73.03% ± 3.51% respectively (P = 0.001), whereas G2/M cells percentage decreased from 26.51% ± 0.83% to 17.8% ± 2.26%.
CONCLUSION: HMGB3 is likely to be a useful prognostic marker involved in gastric cancer disease onset and progression by regulating the cell cycle.
High mobility group-box 3; Gastric adenocarcinoma; Prognosis; Cell proliferation; Cell cycle
The analysis of longitudinal repeated measures data is frequently complicated by missing data due to informative dropout. We describe a mixture model for joint distribution for longitudinal repeated measures, where the dropout distribution may be continuous and the dependence between response and dropout is semiparametric. Specifically, we assume that responses follow a varying coefficient random effects model conditional on dropout time, where the regression coefficients depend on dropout time through unspecified nonparametric functions that are estimated using step functions when dropout time is discrete (e.g., for panel data) and using smoothing splines when dropout time is continuous. Inference under the proposed semiparametric model is hence more robust than the parametric conditional linear model. The unconditional distribution of the repeated measures is a mixture over the dropout distribution. We show that estimation in the semiparametric varying coefficient mixture model can proceed by fitting a parametric mixed effects model and can be carried out on standard software platforms such as SAS. The model is used to analyze data from a recent AIDS clinical trial and its performance is evaluated using simulations.
Clinical trials; Equivalence trial; Linear mixed model; Missing data; Nonignorable dropout; Pattern-mixture model; Pediatric AIDS; Selection bias; Smoothing splines
We discuss the analysis of data from single nucleotide polymorphism (SNP) arrays comparing tumor and normal tissues. The data consist of sequences of indicators for loss of heterozygosity (LOH) and involve three nested levels of repetition: chromosomes for a given patient, regions within chromosomes, and SNPs nested within regions. We propose to analyze these data using a semiparametric model for multi-level repeated binary data. At the top level of the hierarchy we assume a sampling model for the observed binary LOH sequences that arises from a partial exchangeability argument. This implies a mixture of Markov chains model. The mixture is defined with respect to the Markov transition probabilities. We assume a nonparametric prior for the random mixing measure. The resulting model takes the form of a semiparametric random effects model with the matrix of transition probabilities being the random effects. The model includes appropriate dependence assumptions for the two remaining levels of the hierarchy, i.e., for regions within chromosomes and for chromosomes within patient. We use the model to identify regions of increased LOH in a dataset coming from a study of treatment-related leukemia in children with an initial cancer diagnostic. The model successfully identifies the desired regions and performs well compared to other available alternatives.
Platelets have been implicated in cancer metastasis and prognosis. No population-based study has been reported as to whether preoperative platelet count directly predicts metastatic recurrence of colorectal cancer (CRC) patients.
Using a well-characterized cohort of 1,513 surgically resected CRC patients, we assessed the predictive roles of preoperative platelet count in overall survival, overall recurrence, as well as locoregional and distant metastatic recurrences.
Patients with clinically high platelet count (≥400× 109/L) measured within 1 month before surgery had a significantly unfavorable survival (hazard ratio [HR]=1.66, 95 % confidence interval [CI] 1.34–2.05, P=2.6×10−6, Plog rank= 1.1×10−11) and recurrence (HR=1.90, 1.24–2.93, P=0.003, Plog rank=0.003). The association of platelet count with recurrence was evident only in patients with metastatic (HR=2.81, 1.67–4.74, P=1.1×10−4, Plog rank =2.6×10−6) but not locoregional recurrence (HR=0.59, 95 % CI 0.21–1.68, P= 0.325, Plog rank=0.152). The findings were internally validated through bootstrap resampling (P<0.01 at 98.6 % of resampling). Consistently, platelet count was significantly higher in deceased than living patients (P<0.0001) and in patients with metastatic recurrence than locoregional (P= 0.004) or nonrecurrent patients (P<0.0001). Time-dependent modeling indicated that the increased risks for death and metastasis associated with elevated preoperative platelet counts persisted up to 5 years after surgery.
Our data demonstrated that clinically high level of preoperative platelets was an independent predictor of CRC survival and metastasis. As an important component of the routinely tested complete blood count panel, platelet count may be a cost-effective and noninvasive marker for CRC prognosis and a potential intervention target to prevent metastatic recurrence.
Platelet; Thrombocytosis; CRC; Survival; Recurrence; Metastasis
In family-based longitudinal genetic studies, investigators collect repeated measurements on a trait that changes with time along with genetic markers. Since repeated measurements are nested within subjects and subjects are nested within families, both the subject-level and measurement-level correlations must be taken into account in the statistical analysis to achieve more accurate estimation. In such studies, the primary interests include to test for quantitative trait locus (QTL) effect, and to estimate age-specific QTL effect and residual polygenic heritability function. We propose flexible semiparametric models along with their statistical estimation and hypothesis testing procedures for longitudinal genetic designs. We employ penalized splines to estimate nonparametric functions in the models. We find that misspecifying the baseline function or the genetic effect function in a parametric analysis may lead to substantially inflated or highly conservative type I error rate on testing and large mean squared error on estimation. We apply the proposed approaches to examine age-specific effects of genetic variants reported in a recent genome-wide association study of blood pressure collected in the Framingham Heart Study.
Genome-wide association study; Penalized splines; Quantitative trait locus
The proportional hazards assumption in the commonly used Cox model for censored failure time data is often violated in scientific studies. Yang and Prentice (2005) proposed a novel semiparametric two-sample model that includes the proportional hazards model and the proportional odds model as sub-models, and accommodates crossing survival curves. The model leaves the baseline hazard unspecified and the two model parameters can be interpreted as the short-term and long-term hazard ratios. Inference procedures were developed based on a pseudo score approach. Although extension to accommodate covariates was mentioned, no formal procedures have been provided or proved. Furthermore, the pseudo score approach may not be asymptotically efficient. We study the extension of the short-term and long-term hazard ratio model of Yang and Prentice (2005) to accommodate potentially time-dependent covariates. We develop efficient likelihood-based estimation and inference procedures. The nonparametric maximum likelihood estimators are shown to be consistent, asymptotically normal, and asymptotically efficient. Extensive simulation studies demonstrate that the proposed methods perform well in practical settings. The proposed method successfully captured the phenomenon of crossing hazards in a cancer clinical trial and identified a genetic marker with significant long-term effect missed by using the proportional hazards model on age-at-onset of alcoholism in a genetic study.
Semiparametric hazards rate model; Non-parametric likelihood; Proportional hazards model; Proportional odds model; Semiparametric efficiency