Related Articles

1.  A Penalized Robust Method for Identifying Gene-Environment Interactions 
Genetic epidemiology  2014;38(3):220-230.
In high-throughput studies, an important objective is to identify gene-environment interactions associated with disease outcomes and phenotypes. Many commonly adopted methods assume specific parametric or semiparametric models, which may be subject to model mis-specification. In addition, they usually use significance level as the criterion for selecting important interactions. In this study, we adopt rank-based estimation, which is much less sensitive to model specification than some of the existing methods and includes several commonly encountered data and models as special cases. Penalization is adopted for the identification of gene-environment interactions. It achieves simultaneous estimation and identification and does not rely on significance level. For computational feasibility, a smoothed rank estimation is further proposed. Simulation shows that under certain scenarios, for example with contaminated or heavy-tailed data, the proposed method can significantly outperform the existing alternatives with more accurate identification. We analyze a lung cancer prognosis study with gene expression measurements under the AFT (accelerated failure time) model. The proposed method identifies interactions different from those using the alternatives. Some of the identified genes have important implications.
doi:10.1002/gepi.21795
PMCID: PMC4356211  PMID: 24616063
Gene-environment interaction; robust rank estimation; penalization; marker identification
2.  Breast Cancer Survival Analysis: Applying the Generalized Gamma Distribution under Different Conditions of the Proportional Hazards and Accelerated Failure Time Assumptions 
Background:
The goal of this study is to extend the applications of parametric survival models so that they include cases in which the accelerated failure time (AFT) assumption is not satisfied, and to examine parametric and semiparametric models under different proportional hazards (PH) and AFT assumptions.
Methods:
The data for 12,531 women diagnosed with breast cancer in British Columbia, Canada, during 1990–1999 were divided into eight groups according to patients’ ages and stage of disease, and each group was assumed to have different AFT and PH assumptions. For parametric models, we fitted the saturated generalized gamma (GG) distribution, and compared this with the conventional AFT model. Using a likelihood ratio statistic, both models were compared to the simpler forms including the Weibull and lognormal. For semiparametric models, either Cox's PH model or stratified Cox model was fitted according to the PH assumption and tested using Schoenfeld residuals. The GG family was compared to the log-logistic model using Akaike information criterion (AIC) and Bayesian information criterion (BIC).
Results:
When the PH and AFT assumptions were satisfied, semiparametric and parametric models both provided valid descriptions of breast cancer patient survival. When the PH assumption was not satisfied but the AFT condition held, the parametric models performed better than the stratified Cox model. When neither the PH nor the AFT assumption was met, the lognormal distribution provided a reasonable fit.
Conclusions:
When both the PH and AFT assumptions are satisfied, the parametric and semiparametric models provide complementary information. When the PH assumption is not satisfied, the parametric models should be considered, whether the AFT assumption is met or not.
PMCID: PMC3445281  PMID: 23024854
Breast cancer; generalized gamma distribution; parametric regression; stratified Cox model; survival analysis
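As a rough illustration of the AIC/BIC comparison mentioned in the Methods of entry 2, the sketch below fits two simple parametric survival distributions by closed-form maximum likelihood and compares them. It assumes fully observed (uncensored) toy survival times, which is not the setting of the breast cancer study; the exponential and lognormal fits merely stand in for the generalized gamma family, and all function names and data are hypothetical.

```python
import math

def aic(loglik, k):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    """Bayesian information criterion: k ln n - 2 ln L."""
    return k * math.log(n) - 2 * loglik

def exponential_loglik(times):
    """Maximized log-likelihood of an exponential model (uncensored data)."""
    n = len(times)
    rate = n / sum(times)  # closed-form MLE of the rate
    return n * math.log(rate) - rate * sum(times)

def lognormal_loglik(times):
    """Maximized log-likelihood of a lognormal model (uncensored data)."""
    n = len(times)
    logs = [math.log(t) for t in times]
    mu = sum(logs) / n
    var = sum((x - mu) ** 2 for x in logs) / n  # MLE of the variance
    return -sum(logs) - 0.5 * n * math.log(2 * math.pi * var) - 0.5 * n

# Hypothetical survival times in years (assumption, not study data).
toy_times = [0.8, 1.5, 2.1, 2.4, 3.3, 4.0, 5.2, 7.9]
n = len(toy_times)
for name, loglik, k in [
    ("exponential", exponential_loglik(toy_times), 1),
    ("lognormal", lognormal_loglik(toy_times), 2),
]:
    print(name, "AIC =", round(aic(loglik, k), 2), "BIC =", round(bic(loglik, k, n), 2))
```

Lower AIC or BIC indicates the preferred model; BIC penalizes extra parameters more heavily once n exceeds about 7.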
3.  Sparse Group Penalized Integrative Analysis of Multiple Cancer Prognosis Datasets 
Genetics research  2013;95(0):68-77.
SUMMARY
In cancer research, high-throughput profiling studies have been extensively conducted, searching for markers associated with prognosis. Because of the “large d, small n” characteristic, results generated from the analysis of a single dataset can be unsatisfactory. Recent studies have shown that integrative analysis, which simultaneously analyzes multiple datasets, can be more effective than single-dataset analysis and classic meta-analysis. In most existing integrative analyses, the homogeneity model has been assumed, which postulates that different datasets share the same set of markers. Several approaches have been designed to reinforce this assumption. In practice, different datasets may differ in terms of patient selection criteria, profiling techniques, and many other aspects. Such differences may make the homogeneity model too restricted. In this study, we assume the heterogeneity model, under which different datasets are allowed to have different sets of markers. With multiple cancer prognosis datasets, we adopt the AFT (accelerated failure time) model to describe survival. This model may have the lowest computational cost among popular semiparametric survival models. For marker selection, we adopt a sparse group MCP (minimax concave penalty) approach. This approach has an intuitive formulation and can be computed using an effective group coordinate descent algorithm. A simulation study shows that it outperforms the existing approaches under both the homogeneity and heterogeneity models. Data analysis further demonstrates the merit of the heterogeneity model and the proposed approach.
doi:10.1017/S0016672313000086
PMCID: PMC4090387  PMID: 23938111
Integrative analysis; Cancer prognosis; Heterogeneity model; Penalization
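For readers unfamiliar with the minimax concave penalty named in entry 3, the following minimal sketch evaluates the standard scalar MCP. It is not the sparse group composite penalty of the paper, which is not spelled out in the abstract; the function name and the parameter names `lam` and `gamma` are assumptions chosen for illustration.

```python
def mcp_penalty(t, lam, gamma):
    """Scalar minimax concave penalty.

    Equals lam*|t| - t^2 / (2*gamma) while |t| <= gamma*lam,
    and the constant gamma*lam^2 / 2 beyond that point.
    Assumes lam >= 0 and gamma > 1.
    """
    a = abs(t)
    if a <= gamma * lam:
        return lam * a - a * a / (2 * gamma)
    return 0.5 * gamma * lam * lam

# Small coefficients are penalized roughly like the lasso (slope near lam);
# large coefficients receive a flat penalty, which reduces shrinkage bias.
for coef in [0.0, 0.5, 1.0, 3.0, 10.0]:
    print(coef, mcp_penalty(coef, lam=1.0, gamma=3.0))
```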
4.  A Novel Model to Combine Clinical and Pathway-Based Transcriptomic Information for the Prognosis Prediction of Breast Cancer 
PLoS Computational Biology  2014;10(9):e1003851.
Breast cancer is the most common malignancy in women worldwide. With the increasing awareness of heterogeneity in breast cancers, better prediction of breast cancer prognosis is much needed for more personalized treatment and disease management. Towards this goal, we have developed a novel computational model for breast cancer prognosis by combining the Pathway Deregulation Score (PDS) based pathifier algorithm, Cox regression and L1-LASSO penalization method. We trained the model on a set of 236 patients with gene expression data and clinical information, and validated the performance on three diversified testing data sets of 606 patients. To evaluate the performance of the model, we conducted survival analysis of the dichotomized groups, and compared the areas under the curve based on the binary classification. The resulting prognosis genomic model is composed of fifteen pathways (e.g. P53 pathway) that had previously reported cancer relevance, and it successfully differentiated relapse in the training set (log rank p-value = 6.25e-12) and three testing data sets (log rank p-value<0.0005). Moreover, the pathway-based genomic models consistently performed better than gene-based models on all four data sets. We also find strong evidence that combining genomic information with clinical information improved the p-values of prognosis prediction by at least three orders of magnitude in comparison to using either genomic or clinical information alone. In summary, we propose a novel prognosis model that harnesses the pathway-based dysregulation as well as valuable clinical information. The selected pathways in our prognosis model are promising targets for therapeutic intervention.
Author Summary
With the increasing awareness of heterogeneity in breast cancers, better prediction of breast cancer prognosis is much needed early on for more personalized treatment and management. Towards this goal we propose in this study a novel pathway-based prognosis prediction model, which emphasizes individualized pathway-based risk measurement using the pathway dysregulation score (PDS). In combination with the L1-LASSO penalized feature selection and the Cox proportional hazards regression model, we have identified fifteen cancer relevant pathways using the pathway-based genomic model that successfully differentiated the relapse in the training set as well as three diversified test sets. Moreover, given the debate whether higher-order representative features, such as GO sets, pathways and network modules, are superior to the gene-level features in the genomic models, we demonstrate that pathway-based genomic models consistently performed better than gene-based models in all four data sets. Last but not least, we show strong evidence that models that combine genomic information with clinical information improve the prognosis prediction significantly, in comparison to models that use either genomic or clinical information alone.
doi:10.1371/journal.pcbi.1003851
PMCID: PMC4168973  PMID: 25233347
5.  Semiparametric methods for genome-wide linkage analysis of human gene expression data 
BMC Proceedings  2007;1(Suppl 1):S83.
With the availability of high-throughput microarray technologies, investigators can simultaneously measure the expression levels of many thousands of genes in a short period. Although there are rich statistical methods for analyzing microarray data in the literature, limited work has been done in mapping expression quantitative trait loci (eQTL) that influence the variation in levels of gene expression. Most existing eQTL mapping methods assume that the expression phenotypes follow a normal distribution and violation of the normality assumption may lead to inflated type I error and reduced power. QTL analysis of expression data involves the mapping of many expression phenotypes at thousands or hundreds of thousands of marker loci across the whole genome. An appropriate procedure to adjust for multiple testing is essential for guarding against an abundance of false positive results. In this study, we applied a semiparametric quantitative trait loci (SQTL) mapping method to human gene expression data. The SQTL mapping method is rank-based and therefore robust to non-normality and outliers. Furthermore, we apply an efficient Monte Carlo procedure to account for multiple testing and assess the genome-wide significance level. Particularly, we apply the SQTL mapping method and the Monte-Carlo approach to the gene expression data provided by Genetic Analysis Workshop 15.
PMCID: PMC2367566  PMID: 18466586
6.  Evaluating Prognostic Accuracy of Biomarkers under Competing Risk 
Biometrics  2011;68(2):388-396.
Summary
To develop more targeted intervention strategies, an important research goal is to identify markers predictive of clinical events. A crucial step towards this goal is to characterize the clinical performance of a marker for predicting different types of events. In this manuscript, we present statistical methods for evaluating the performance of a prognostic marker in predicting multiple competing events. To capture the potential time-varying predictive performance of the marker and incorporate competing risks, we define time- and cause-specific accuracy summaries by stratifying cases based on causes of failure. Such a definition would allow one to evaluate the predictive accuracy of a marker for each type of event and compare its predictiveness across event types. Extending the nonparametric crude cause-specific ROC curve estimators by Saha and Heagerty (2010), we develop inference procedures for a range of cause-specific accuracy summaries. To estimate the accuracy measures and assess how covariates may affect the accuracy of a marker under the competing risk setting, we consider two forms of semiparametric models through the cause-specific hazard framework. These approaches enable a flexible modeling of the relationships between the marker and failure times for each cause, while efficiently accommodating additional covariates. We investigate the asymptotic property of the proposed accuracy estimators and demonstrate the finite sample performance of these estimators through simulation studies. The proposed procedures are illustrated with data from a prostate cancer prognostic study.
doi:10.1111/j.1541-0420.2011.01671.x
PMCID: PMC3694786  PMID: 22150576
Biomarker evaluation; Cause-specific Hazard; Competing risk; Negative predictive value; Positive predictive value; Receiver Operating Characteristics Curve (ROC curve); Survival analysis
7.  Retrospective analysis of haplotype-based case–control studies under a flexible model for gene–environment association 
Summary
Genetic epidemiologic studies often involve investigation of the association of a disease with a genomic region in terms of the underlying haplotypes, that is the combination of alleles at multiple loci along homologous chromosomes. In this article, we consider the problem of estimating haplotype–environment interactions from case–control studies when some of the environmental exposures themselves may be influenced by genetic susceptibility. We specify the distribution of the diplotypes (haplotype pair) given environmental exposures for the underlying population based on a novel semiparametric model that allows haplotypes to be potentially related with environmental exposures, while allowing the marginal distribution of the diplotypes to maintain certain population genetics constraints such as Hardy–Weinberg equilibrium. The marginal distribution of the environmental exposures is allowed to remain completely nonparametric. We develop a semiparametric estimating equation methodology and related asymptotic theory for estimation of the disease odds ratios associated with the haplotypes, environmental exposures, and their interactions, parameters that characterize haplotype–environment associations and the marginal haplotype frequencies. The problem of phase ambiguity of genotype data is handled using a suitable expectation–maximization algorithm. We study the finite-sample performance of the proposed methodology using simulated data. An application of the methodology is illustrated using a case–control study of colorectal adenoma, designed to investigate how the smoking-related risk of colorectal adenoma can be modified by “NAT2,” a smoking-metabolism gene that may potentially influence susceptibility to smoking itself.
doi:10.1093/biostatistics/kxm011
PMCID: PMC2683243  PMID: 17490987
Case-control studies; EM algorithm; Gene-environment interactions; Haplotype; Semiparametric methods
8.  Semiparametric approach to characterize unique gene expression trajectories across time 
BMC Genomics  2006;7:233.
Background:
A semiparametric approach was used to identify groups of cDNAs and genes with distinct expression profiles across time and overcome the limitations of clustering to identify groups. The semiparametric approach allows the generalization of mixtures of distributions while making no specific parametric assumptions about the distribution of the hidden heterogeneity of the cDNAs. The semiparametric approach was applied to study gene expression in the brains of Apis mellifera ligustica honey bees raised in two colonies (A. m. mellifera and ligustica) with consistent patterns across five maturation ages.
Results:
The semiparametric approach provided unambiguous criteria to detect groups of genes, trajectories and probability of gene membership to groups. The semiparametric results were cross-validated in both colony data sets. Gene Ontology analysis enhanced by genome annotation helped to confirm the semiparametric results and revealed that most genes with similar or related neurobiological function were assigned to the same group or groups with similar trajectories. Ten groups of genes were identified and nine groups had highly similar trajectories in both data sets. Differences in the trajectory of the remaining group were consistent with reports of accelerated maturation in ligustica colonies compared to mellifera colonies.
Conclusion:
The combination of microarray technology, genomic information and semiparametric analysis provided insights into the genomic plasticity and gene networks linked to behavioral maturation in the honey bee.
doi:10.1186/1471-2164-7-233
PMCID: PMC1592090  PMID: 16970825
9.  Assessing risk prediction models in case-control studies using semiparametric and nonparametric methods 
Statistics in medicine  2010;29(13):1391-1410.
Summary
The predictiveness curve is a graphical tool that characterizes the population distribution of Risk(Y) = P(D = 1|Y), where D denotes a binary outcome such as occurrence of an event within a specified time period and Y denotes predictors. A wider distribution of Risk(Y) indicates better performance of a risk model in the sense that making treatment recommendations is easier for more subjects. Decisions are more straightforward when a subject's risk is deemed to be high or low. Methods have been developed to estimate predictiveness curves from cohort studies. However, early phase studies to evaluate novel risk prediction markers typically employ case-control designs. Here we present semiparametric and nonparametric methods for evaluating a continuous risk prediction marker that accommodate case-control data. Small sample properties are investigated through simulation studies. The semiparametric methods are substantially more efficient than their nonparametric counterparts under a correctly specified model. We generalize them to settings where multiple prediction markers are involved. Applications to prostate cancer risk prediction markers illustrate methods for comparing the risk prediction capacities of markers and for evaluating the increment in performance gained by adding a marker to a baseline risk model. We propose a modified Hosmer-Lemeshow test for case-control study data to assess calibration of the risk model that is a natural complement to this graphical tool.
doi:10.1002/sim.3876
PMCID: PMC3045657  PMID: 20527013
biomarker; case-control study; classification; Hosmer-Lemeshow test; predictiveness curve; risk; ROC curve
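The predictiveness curve described in entry 9 can be pictured with a small sketch. The version below is the simple empirical curve for a cohort-style sample in which each subject's Risk(Y) is already known; it does not implement the case-control estimators the paper proposes, and the function names and risk values are hypothetical.

```python
def predictiveness_curve(risks):
    """Return (v, R(v)) pairs, where R(v) is the empirical v-th quantile of risk.

    Plotting R(v) against v gives the predictiveness curve: a steeper curve
    means the model spreads subjects more widely across low and high risk.
    """
    ordered = sorted(risks)
    n = len(ordered)
    return [((i + 1) / n, r) for i, r in enumerate(ordered)]

def fraction_below(risks, threshold):
    """Proportion of subjects whose risk falls below a decision threshold."""
    return sum(r < threshold for r in risks) / len(risks)

# Hypothetical fitted risks P(D = 1 | Y) for eight subjects (assumption).
toy_risks = [0.05, 0.40, 0.10, 0.70, 0.02, 0.15, 0.90, 0.25]
for v, r in predictiveness_curve(toy_risks):
    print(f"v = {v:.3f}  R(v) = {r:.2f}")
print("fraction with risk below 0.20:", fraction_below(toy_risks, 0.20))
```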
10.  Outcome after BCG treatment for urinary bladder cancer may be influenced by polymorphisms in the NOS2 and NOS3 genes
Redox Biology  2015;6:272-277.
Purpose
Bacillus Calmette-Guérin (BCG)-treatment is an established treatment for bladder cancer, but its mechanisms of action are not fully understood. High-risk non-muscle invasive bladder-cancer (NMIBC)-patients failing to respond to BCG-treatment have a worse prognosis than those undergoing immediate radical cystectomy, and identification of patients at risk for BCG-failure is of high priority. Several studies indicate a role for nitric oxide (NO) in the cytotoxic effect that BCG exerts on bladder cancer cells. In this study we investigated whether NO-synthase (NOS)-gene polymorphisms, NOS2-promoter microsatellite (CCTTT)n, and the NOS3-polymorphisms -786T>C (rs2070744) and Glu298Asp (rs1799983), can serve as possible molecular markers for outcome after BCG-treatment for NMIBC.
Materials and methods
All NMIBC-patients from a well-characterized population-based cohort were analyzed (n=88). Polymorphism data were combined with information from 15 years of clinical follow-up. The effect of BCG-treatment on cancer-specific death (CSD), recurrence and progression in patients with varying NOS-genotypes was studied using Cox proportional hazard-models and log rank tests.
Results
BCG-treatment resulted in significantly better survival in patients without (Log rank: p=0.006; HR: 0.12, p=0.048), but not in patients with a long version ((CCTTT)n ≧13 repeats) of the NOS2-promoter microsatellite. The NOS3-rs2070744(TT) and rs1799983(GG)-genotypes showed decreased risk for CSD (Log rank(TT): p=0.001; Log rank(GG): p=0.010, HR(GG): 0.16, p=0.030) and progression (Log rank(TT): p<0.001, HR(TT): 0.05, p=0.005; Log rank(GG): p<0.001, HR(GG): 0.10, p=0.003) after BCG-therapy compared to the other genotypes. There was also a reduction in recurrence in BCG-treated patients that was mostly genotype independent. Analysis of combined genotypes identified a subgroup of 30% of the BCG-treated patients that did not benefit from BCG-treatment.
Conclusions
Our results suggest that the investigated polymorphisms influence patient response to BCG-treatment and thus may serve as possible markers for identification of BCG-failures.
Highlights
• 30% of BCG treated bladder cancer (NMIBC)-patients do not respond to BCG-treatment.
• We need to identify BCG failures before the BCG-treatment is given.
• Altered NOS2 and NOS3 gene activity may be associated with BCG treatment outcome.
• NOS-polymorphisms are possible BCG-failure biomarkers in bladder cancer patients.
doi:10.1016/j.redox.2015.08.008
PMCID: PMC4556773  PMID: 26298202
BCG, Bacillus Calmette-Guérin vaccine; NMIBC, non-muscle invasive bladder cancer; NOS, nitric oxide synthase; CSD, cancer specific death; CI, confidence interval; BCG vaccine; Urinary bladder neoplasms; Nitric oxide synthase; Genetic polymorphism; Cancer survival
11.  Outcome Prediction in Patients with Glioblastoma by Using Imaging, Clinical, and Genomic Biomarkers: Focus on the Nonenhancing Component of the Tumor 
Radiology  2014;272(2):484-493.
In the current study, we focused on the role of the nonenhancing region (NER) of glioblastomas and showed that there are imaging phenotypic features related specifically to the NER—most notably the NER crossing the midline and relative cerebral blood volume of NER, which provide important prognostic information; these are complementary to clinical and genomic features and can improve models of patient prognosis.
Purpose
To correlate patient survival with morphologic imaging features and hemodynamic parameters obtained from the nonenhancing region (NER) of glioblastoma (GBM), along with clinical and genomic markers.
Materials and Methods
An institutional review board waiver was obtained for this HIPAA-compliant retrospective study. Forty-five patients with GBM underwent baseline imaging with contrast material–enhanced magnetic resonance (MR) imaging and dynamic susceptibility contrast-enhanced T2*-weighted perfusion MR imaging. Molecular and clinical predictors of survival were obtained. Single and multivariable models of overall survival (OS) and progression-free survival (PFS) were explored with Kaplan-Meier estimates, Cox regression, and random survival forests.
Results
Worsening OS (log-rank test, P = .0103) and PFS (log-rank test, P = .0223) were associated with increasing relative cerebral blood volume of NER (rCBVNER), which was higher with deep white matter involvement (t test, P = .0482) and poor NER margin definition (t test, P = .0147). NER crossing the midline was the only morphologic feature of NER associated with poor survival (log-rank test, P = .0125). Preoperative Karnofsky performance score (KPS) and resection extent (n = 30) were clinically significant OS predictors (log-rank test, P = .0176 and P = .0038, respectively). No genomic alterations were associated with survival, except patients with high rCBVNER and wild-type epidermal growth factor receptor (EGFR) mutation had significantly poor survival (log-rank test, P = .0306; area under the receiver operating characteristic curve = 0.62). Combining resection extent with rCBVNER marginally improved prognostic ability (permutation, P = .084). Random forest models of presurgical predictors indicated rCBVNER as the top predictor; also important were KPS, age at diagnosis, and NER crossing the midline. A multivariable model containing rCBVNER, age at diagnosis, and KPS can be used to group patients with more than 1 year of difference in observed median survival (0.49–1.79 years).
Conclusion
Patients with high rCBVNER and NER crossing the midline and those with high rCBVNER and wild-type EGFR mutation showed poor survival. In multivariable survival models, however, rCBVNER provided unique prognostic information that went above and beyond the assessment of all NER imaging features, as well as clinical and genomic features.
© RSNA, 2014
Online supplemental material is available for this article.
doi:10.1148/radiol.14131691
PMCID: PMC4263660  PMID: 24646147
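Entry 11 relies on Kaplan-Meier estimates of overall and progression-free survival. As a minimal sketch of how such an estimate is formed, the code below computes the product-limit estimator on hypothetical right-censored data; it is not the study's analysis, and the function name, follow-up times, and event indicators are assumptions.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times  : follow-up time for each subject
    events : 1 if the event (e.g. death) was observed, 0 if censored
    Returns a list of (event_time, estimated_survival) steps.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Gather every subject whose follow-up ends exactly at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving
    return curve

# Hypothetical follow-up in years; 0 marks a censored subject (assumption).
toy_times = [0.3, 0.5, 0.5, 0.9, 1.2, 1.4, 1.8, 2.5]
toy_events = [1, 1, 0, 1, 0, 1, 1, 0]
for t, s in kaplan_meier(toy_times, toy_events):
    print(f"t = {t:.1f}  S(t) = {s:.3f}")
```

Censored subjects contribute to the number at risk until their follow-up ends but never trigger a drop in the curve.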
12.  ContrastRank: a new method for ranking putative cancer driver genes and classification of tumor samples 
Bioinformatics  2014;30(17):i572-i578.
Motivation: The recent advance in high-throughput sequencing technologies is generating a huge amount of data that are becoming an important resource for deciphering the genotype underlying a given phenotype. Genome sequencing has been extensively applied to the study of the cancer genomes. Although a few methods have been already proposed for the detection of cancer-related genes, their automatic identification is still a challenging task. Using the genomic data made available by The Cancer Genome Atlas Consortium (TCGA), we propose a new prioritization approach based on the analysis of the distribution of putative deleterious variants in a large cohort of cancer samples.
Results: In this paper, we present ContrastRank, a new method for the prioritization of putative impaired genes in cancer. The method is based on the comparison of the putative defective rate of each gene in tumor versus normal and 1000 genome samples. We show that the method is able to provide a ranked list of putative impaired genes for colon, lung and prostate adenocarcinomas. The list significantly overlaps with the list of known cancer driver genes previously published. More importantly, by using our scoring approach, we can successfully discriminate between TCGA normal and tumor samples. A binary classifier based on ContrastRank score reaches an overall accuracy >90% and the area under the curve (AUC) of receiver operating characteristics (ROC) >0.95 for all three types of adenocarcinoma analyzed in this paper. In addition, using ContrastRank score, we are able to discriminate the three tumor types with a minimum overall accuracy of 77% and AUC of 0.83.
Conclusions: We describe ContrastRank, a method for prioritizing putative impaired genes in cancer. The method is based on the comparison of exome sequencing data from different cohorts and can detect putative cancer driver genes.
ContrastRank can also be used to estimate a global score for an individual genome about the risk of adenocarcinoma based on the genetic variants information from a whole-exome VCF (Variant Calling Format) file. We believe that the application of ContrastRank can be an important step in genomic medicine to enable genome-based diagnosis.
Availability and implementation: The lists of ContrastRank scores of all genes in each tumor type are available as supplementary materials. A webserver for evaluating the risk of the three studied adenocarcinomas starting from whole-exome VCF file is under development.
Contact: emidio@uab.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btu466
PMCID: PMC4147919  PMID: 25161249
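The accuracy and AUC figures quoted in entry 12 are standard summaries of a score-based binary classifier. The sketch below shows how both can be computed from per-sample scores; it does not reproduce the ContrastRank score itself, and the function names, scores, and threshold are hypothetical.

```python
def auc(scores_pos, scores_neg):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive sample scores
    higher than a randomly chosen negative one (ties count one half).
    """
    wins = 0.0
    for p in scores_pos:
        for q in scores_neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

def accuracy(scores_pos, scores_neg, threshold):
    """Fraction of samples classified correctly when score >= threshold means positive."""
    correct = sum(s >= threshold for s in scores_pos)
    correct += sum(s < threshold for s in scores_neg)
    return correct / (len(scores_pos) + len(scores_neg))

# Hypothetical scores for tumor (positive) and normal (negative) samples.
tumor_scores = [0.9, 0.8, 0.75, 0.4]
normal_scores = [0.5, 0.3, 0.2, 0.1]
print("AUC:", auc(tumor_scores, normal_scores))
print("accuracy at 0.45:", accuracy(tumor_scores, normal_scores, 0.45))
```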
13.  A coupling approach of a predictor and a descriptor for breast cancer prognosis 
BMC Medical Genomics  2014;7(Suppl 1):S4.
Background
In cancer prognosis research, diverse machine learning models have been applied to the problems of cancer susceptibility (risk assessment), cancer recurrence (redevelopment of cancer after resolution), and cancer survivability, regarding accuracy (or AUC--the area under the ROC curve) as the primary measurement for the performance evaluation of the models. However, in order to help medical specialists establish a treatment plan by using the predicted output of a model, it is more pragmatic to elucidate which variables (markers) have most significantly influenced the resulting outcome of cancer or which patients show similar patterns.
Methods
In this study, a coupling approach of two sub-modules--a predictor and a descriptor--is proposed. The predictor module generates the predicted output for the cancer outcome. A semi-supervised learning co-training algorithm is employed as the predictor. The descriptor module, on the other hand, post-processes the results of the predictor module, mainly focusing on which variables are ranked as more or less significant when describing the results of the prediction, and how patients are segmented into several groups according to common patterns among them. Decision trees are used as the descriptor.
Results
The proposed approach, 'predictor-descriptor,' was tested on the breast cancer survivability problem based on the surveillance, epidemiology, and end results database for breast cancer (SEER). The results present the performance comparison among the established machine learning algorithms, the ranks of the prognosis elements for breast cancer, and patient segmentation. In the performance comparison among the predictor candidates, the semi-supervised learning co-training algorithm showed the best performance, producing an average AUC of 0.81. The descriptor module then found the top-tier prognosis markers that significantly affect the classification results on survived/dead patients, including 'lymph node involvement', 'stage', 'site-specific surgery', 'number of positive node examined', and 'tumor size'. Also, a typical example of patient segmentation was provided: the patients classified as dead were grouped into two segments depending on differences in prognostic profiles, one with serious results with respect to the pathologic exams and the other with the feebleness of age.
doi:10.1186/1755-8794-7-S1-S4
PMCID: PMC4101306  PMID: 25080202
14.  Identifying genetic marker sets associated with phenotypes via an efficient adaptive score test 
Biostatistics (Oxford, England)  2012;13(4):776-790.
In recent years, genome-wide association studies (GWAS) and gene-expression profiling have generated a large number of valuable datasets for assessing how genetic variations are related to disease outcomes. With such datasets, it is often of interest to assess the overall effect of a set of genetic markers, assembled based on biological knowledge. Genetic marker-set analyses have been advocated as more reliable and powerful approaches compared with the traditional marginal approaches (Curtis and others, 2005. Pathways to the analysis of microarray data. TRENDS in Biotechnology 23, 429–435; Efroni and others, 2007. Identification of key processes underlying cancer phenotypes using biologic pathway analysis. PLoS One 2, 425). Procedures for testing the overall effect of a marker-set have been actively studied in recent years. For example, score tests derived under an Empirical Bayes (EB) framework (Liu and others, 2007. Semiparametric regression of multidimensional genetic pathway data: least-squares kernel machines and linear mixed models. Biometrics 63, 1079–1088; Liu and others, 2008. Estimation and testing for the effect of a genetic pathway on a disease outcome using logistic kernel machine regression via logistic mixed models. BMC bioinformatics 9, 292–2; Wu and others, 2010. Powerful SNP-set analysis for case-control genome-wide association studies. American Journal of Human Genetics 86, 929) have been proposed as powerful alternatives to the standard Rao score test (Rao, 1948. Large sample tests of statistical hypotheses concerning several parameters with applications to problems of estimation. Mathematical Proceedings of the Cambridge Philosophical Society, 44, 50–57). The advantages of these EB-based tests are most apparent when the markers are correlated, due to the reduction in the degrees of freedom. 
In this paper, we propose an adaptive score test which up- or down-weights the contributions from each member of the marker-set based on the Z-scores of their effects. Such an adaptive procedure gains power over the existing procedures when the signal is sparse and the correlation among the markers is weak. By combining evidence from both the EB-based score test and the adaptive test, we further construct an omnibus test that attains good power in most settings. The null distributions of the proposed test statistics can be approximated well either via simple perturbation procedures or via distributional approximations. Through extensive simulation studies, we demonstrate that the proposed procedures perform well in finite samples. We apply the tests to a breast cancer genetic study to assess the overall effect of the FGFR2 gene on breast cancer risk.
doi:10.1093/biostatistics/kxs015
PMCID: PMC3440238  PMID: 22734045
Adaptive procedures; Empirical Bayes; GWAS; Pathway analysis; Score test; SNP sets
15.  Empirical study of supervised gene screening 
BMC Bioinformatics  2006;7:537.
Background
Microarray studies provide a way of linking variations of phenotypes with their genetic causes. Constructing predictive models using high-dimensional microarray measurements usually consists of three steps: (1) unsupervised gene screening; (2) supervised gene screening; and (3) statistical model building. Supervised gene screening based on marginal gene ranking is commonly used to reduce the number of genes in the model building. Various simple statistics, such as the t-statistic or signal-to-noise ratio, have been used to rank genes in the supervised screening. Despite its extensive usage, statistical study of supervised gene screening remains scarce. Our study is partly motivated by the differences in gene discovery results caused by using different supervised gene screening methods.
Results
We investigate concordance and reproducibility of supervised gene screening based on eight commonly used marginal statistics. Concordance is assessed by the relative fractions of overlaps between top ranked genes screened using different marginal statistics. We propose a Bootstrap Reproducibility Index, which measures reproducibility of individual genes under the supervised screening. Empirical studies are based on four public microarray datasets. We consider the cases where the top 20%, 40% and 60% of genes are screened.
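As a rough illustration of the concordance measure described above, the following Python sketch ranks genes by two marginal statistics and reports the fraction of top-ranked genes they share. The function names, the Welch-type t-statistic, the signal-to-noise formula, and ranking by absolute value are all assumptions made for illustration; the abstract does not specify the eight statistics or the exact overlap definition.

```python
import numpy as np

def t_statistic(x, y):
    """Welch-type t-statistic per gene; x is (samples, genes), y is a 0/1 label array."""
    a, b = x[y == 0], x[y == 1]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return (a.mean(axis=0) - b.mean(axis=0)) / se

def signal_to_noise(x, y):
    """Difference in class means divided by the sum of class standard deviations."""
    a, b = x[y == 0], x[y == 1]
    return (a.mean(axis=0) - b.mean(axis=0)) / (a.std(axis=0, ddof=1) + b.std(axis=0, ddof=1))

def top_fraction(scores, fraction):
    """Indices of the genes with the largest absolute scores (at least one gene)."""
    k = max(1, int(round(fraction * scores.size)))
    return set(np.argsort(-np.abs(scores))[:k].tolist())

def concordance(scores_a, scores_b, fraction):
    """Relative overlap between the top-ranked genes under two statistics."""
    top_a = top_fraction(scores_a, fraction)
    top_b = top_fraction(scores_b, fraction)
    return len(top_a & top_b) / len(top_a)
```

Calling `concordance(t_statistic(x, y), signal_to_noise(x, y), 0.2)` would then give the overlap for the top 20% of genes under these two hypothetical choices of statistic.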
Conclusion
From a gene discovery point of view, the effect of supervised gene screening based on different marginal statistics cannot be ignored. Empirical studies show that (1) genes that pass different supervised screenings may be considerably different; (2) concordance may vary, depending on the underlying data structure and percentage of selected genes; (3) evaluated with the Bootstrap Reproducibility Index, genes that pass supervised screenings are only moderately reproducible; and (4) concordance cannot be improved by supervised screening based on reproducibility.
doi:10.1186/1471-2105-7-537
PMCID: PMC1764766  PMID: 17176468
16.  Improving Clinical Risk Stratification at Diagnosis in Primary Prostate Cancer: A Prognostic Modelling Study 
PLoS Medicine  2016;13(8):e1002063.
Introduction
Over 80% of the nearly 1 million men diagnosed with prostate cancer annually worldwide present with localised or locally advanced non-metastatic disease. Risk stratification is the cornerstone for clinical decision making and treatment selection for these men. The most widely applied stratification systems use presenting prostate-specific antigen (PSA) concentration, biopsy Gleason grade, and clinical stage to classify patients as low, intermediate, or high risk. There is, however, significant heterogeneity in outcomes within these standard groupings. The International Society of Urological Pathology (ISUP) has recently adopted a prognosis-based pathological classification that has yet to be included within a risk stratification system. Here we developed and tested a new stratification system based on the number of individual risk factors and incorporating the new ISUP prognostic score.
Methods and Findings
Diagnostic clinicopathological data from 10,139 men with non-metastatic prostate cancer were available for this study from the Public Health England National Cancer Registration Service Eastern Office. This cohort was divided into a training set (n = 6,026; 1,557 total deaths, with 462 from prostate cancer) and a testing set (n = 4,113; 1,053 total deaths, with 327 from prostate cancer). The median follow-up was 6.9 y, and the primary outcome measure was prostate-cancer-specific mortality (PCSM). An external validation cohort (n = 1,706) was also used. Patients were first categorised as low, intermediate, or high risk using the current three-stratum stratification system endorsed by the National Institute for Health and Care Excellence (NICE) guidelines. The variables used to define the groups (PSA concentration, Gleason grading, and clinical stage) were then used to sub-stratify within each risk category by testing the individual and then combined number of risk factors. In addition, we incorporated the new ISUP prognostic score as a discriminator. Using this approach, a new five-stratum risk stratification system was produced, and its prognostic power was compared against the current system, with PCSM as the outcome. The results were analysed using a Cox hazards model, the log-rank test, Kaplan-Meier curves, competing-risks regression, and concordance indices. In the training set, the new risk stratification system identified distinct subgroups with different risks of PCSM in pair-wise comparison (p < 0.0001). Specifically, the new classification identified a very low-risk group (Group 1), a subgroup of intermediate-risk cancers with a low PCSM risk (Group 2, hazard ratio [HR] 1.62 [95% CI 0.96–2.75]), and a subgroup of intermediate-risk cancers with an increased PCSM risk (Group 3, HR 3.35 [95% CI 2.04–5.49]) (p < 0.0001). 
High-risk cancers were also sub-classified by the new system into subgroups with lower and higher PCSM risk: Group 4 (HR 5.03 [95% CI 3.25–7.80]) and Group 5 (HR 17.28 [95% CI 11.2–26.67]) (p < 0.0001), respectively. These results were recapitulated in the testing set and remained robust after inclusion of competing risks. In comparison to the current risk stratification system, the new system demonstrated improved prognostic performance, with a concordance index of 0.75 (95% CI 0.72–0.77) versus 0.69 (95% CI 0.66–0.71) (p < 0.0001). In an external cohort, the new system achieved a concordance index of 0.79 (95% CI 0.75–0.84) for predicting PCSM versus 0.66 (95% CI 0.63–0.69) (p < 0.0001) for the current NICE risk stratification system. The main limitations of the study were that it was registry based and that follow-up was relatively short.
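The concordance indices reported above summarise how often a model ranks pairs of patients in the same order as their observed outcomes. A minimal Python sketch of one common estimator, Harrell's C for right-censored data, is given below. The abstract does not state which estimator the authors used, so the choice of Harrell's C, the function name, and the handling of tied times (skipped) are assumptions.

```python
def harrell_c(times, events, risk_scores):
    """Harrell's concordance index for right-censored survival data.

    times       -- follow-up time for each patient
    events      -- 1 if the event was observed, 0 if censored
    risk_scores -- model output, higher meaning higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is comparable only if the shorter time is an event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    return concordant / comparable
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is the scale on which the reported values (for example 0.75 versus 0.69) are read.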
Conclusions
A novel and simple five-stratum risk stratification system outperforms the standard three-stratum risk stratification system in predicting the risk of PCSM at diagnosis in men with primary non-metastatic prostate cancer, even when accounting for competing risks. This model also allows delineation of new clinically relevant subgroups of men who might potentially receive more appropriate therapy for their disease. Future research will seek to validate our results in external datasets and will explore the value of including additional variables in the system in order to improve prognostic performance.
Vincent Gnanapragasam and colleagues test the performance of a new 5-category risk stratification scheme for primary prostate cancer, using data from large cohorts of men with localised or locally-advanced disease.
Author Summary
Why Was This Study Done?
Prostate cancer incidence is rising worldwide, and, with improved detection, increasing proportions of men are presenting with non-metastatic disease (over 80%). Amongst these men, the disease is heterogeneous, and different management options are possible.
Risk stratification is the primary method of deciding which treatment is appropriate for an individual. However, the current method of risk stratification is based on historical data and was not originally validated against prostate cancer mortality as an outcome. Moreover, no current risk stratification system has been developed first in an unscreened population, which represents the vast majority of men presenting with prostate cancer worldwide.
Current risk models therefore require improvement to be more relevant for the management of prostate cancer in patients. In this study, we sought to improve clinical risk stratification by refining the attributes that make up the current risk stratification system and incorporating the latest pathological grading system for prostate cancer from the International Society of Urological Pathology.
What Did the Researchers Do and Find?
We studied a large dataset from a cohort of UK patients. Data from 10,139 men were available, and the cohort was split into a training group and a testing group for analysis.
Clinico-pathological characteristics at diagnosis (including clinical stage, biopsy grade, and prostate-specific antigen [PSA] concentration) were used first to categorise patients according to the standard three-stratum risk stratification system (from the UK NICE guidelines). These same three individual characteristics were then used to sub-stratify within each risk group. In addition, we incorporated the new pathological prognostic grading system (score 1–5) recently adopted by the International Society of Urological Pathology.
We found that the new risk model (with five subgroups) was significantly better at identifying patient populations with very different outcomes in terms of prostate-cancer-specific mortality. The model performance held true even when other competing risks of death were included. Most importantly, the model demonstrated improved prognostic power in comparison to the NICE stratification system, both in our primary cohort and in a separate external validation cohort.
What Do These Findings Mean?
To our knowledge, this study is the first to test the standard three-stratum risk stratification system in an unscreened first diagnosis population and to measure this system’s ability to predict prostate-cancer-specific mortality. We show that this model has a poor concordance for predicting mortality outcome at the point of diagnosis and is probably of little value in this context.
Our new model performs much better and not only improves prediction of mortality but also provides better distinction of patient subgroups to inform clinical decision making. Moreover, the cohorts used for our study are more representative of real-world practice, where screening for prostate cancer is uncommon.
These findings do need further validation in independent external cohorts, and our study is limited by its reliance on cancer registry records and relatively short follow-up.
Nevertheless, the large sample size and the consistency of our findings in external validation suggest that these findings are robust and ready for clinical use. The new model does not require any additional variables other than those routinely collected at diagnosis in any clinic setting worldwide and will therefore be simple to adopt internationally.
doi:10.1371/journal.pmed.1002063
PMCID: PMC4970710  PMID: 27483464
17.  Gene-Environment Interactions in Genome-Wide Association Studies: A Comparative Study of Tests Applied to Empirical Studies of Type 2 Diabetes 
American Journal of Epidemiology  2011;175(3):191-202.
The question of which statistical approach is the most effective for investigating gene-environment (G-E) interactions in the context of genome-wide association studies (GWAS) remains unresolved. By using 2 case-control GWAS (the Nurses’ Health Study, 1976–2006, and the Health Professionals Follow-up Study, 1986–2006) of type 2 diabetes, the authors compared 5 tests for interactions: standard logistic regression-based case-control; case-only; semiparametric maximum-likelihood estimation of an empirical-Bayes shrinkage estimator; and 2-stage tests. The authors also compared 2 joint tests of genetic main effects and G-E interaction. Elevated body mass index was the exposure of interest and was modeled as a binary trait to avoid an inflated type I error rate that the authors observed when the main effect of continuous body mass index was misspecified. Although both the case-only and the semiparametric maximum-likelihood estimation approaches assume that the tested markers are independent of exposure in the general population, the authors did not observe any evidence of inflated type I error for these tests in their studies with 2,199 cases and 3,044 controls. Both joint tests detected markers with known marginal effects. Loci with the most significant G-E interactions using the standard, empirical-Bayes, and 2-stage tests were strongly correlated with the exposure among controls. Study findings suggest that methods exploiting G-E independence can be efficient and valid options for investigating G-E interactions in GWAS.
doi:10.1093/aje/kwr368
PMCID: PMC3261439  PMID: 22199026
case-control studies; case study; diabetes mellitus, type 2; epidemiologic methods; genome-wide association study; genotype-environment interaction
18.  Development and Validation of a New Prognostic System for Patients with Hepatocellular Carcinoma 
PLoS Medicine  2016;13(4):e1002006.
Background
Prognostic assessment in patients with hepatocellular carcinoma (HCC) remains controversial. Using the Italian Liver Cancer (ITA.LI.CA) database as a training set, we sought to develop and validate a new prognostic system for patients with HCC.
Methods and Findings
Prospective collected databases from Italy (training cohort, n = 3,628; internal validation cohort, n = 1,555) and Taiwan (external validation cohort, n = 2,651) were used to develop the ITA.LI.CA prognostic system. We first defined ITA.LI.CA stages (0, A, B1, B2, B3, C) using only tumor characteristics (largest tumor diameter, number of nodules, intra- and extrahepatic macroscopic vascular invasion, extrahepatic metastases). A parametric multivariable survival model was then used to calculate the relative prognostic value of ITA.LI.CA tumor stage, Eastern Cooperative Oncology Group (ECOG) performance status, Child–Pugh score (CPS), and alpha-fetoprotein (AFP) in predicting individual survival. Based on the model results, an ITA.LI.CA integrated prognostic score (from 0 to 13 points) was constructed, and its prognostic power compared with that of other integrated systems (BCLC, HKLC, MESIAH, CLIP, JIS). Median follow-up was 58 mo for Italian patients (interquartile range, 26–106 mo) and 39 mo for Taiwanese patients (interquartile range, 12–61 mo).
The ITA.LI.CA integrated prognostic score showed optimal discrimination and calibration abilities in Italian patients. Observed median survival in the training and internal validation sets was 57 and 61 mo, respectively, in quartile 1 (ITA.LI.CA score ≤ 1), 43 and 38 mo in quartile 2 (ITA.LI.CA score 2–3), 23 and 23 mo in quartile 3 (ITA.LI.CA score 4–5), and 9 and 8 mo in quartile 4 (ITA.LI.CA score > 5). Observed and predicted median survival in the training and internal validation sets largely coincided. Although observed and predicted survival estimations were significantly lower (log-rank test, p < 0.001) in Italian than in Taiwanese patients, the ITA.LI.CA score maintained very high discrimination and calibration features also in the external validation cohort.
The concordance index (C index) of the ITA.LI.CA score in the internal and external validation cohorts was 0.71 and 0.78, respectively. The ITA.LI.CA score’s prognostic ability was significantly better (p < 0.001) than that of BCLC stage (respective C indexes of 0.64 and 0.73), CLIP score (0.68 and 0.75), JIS stage (0.67 and 0.70), MESIAH score (0.69 and 0.77), and HKLC stage (0.68 and 0.75). The main limitations of this study are its retrospective nature and the intrinsically significant differences between the Taiwanese and Italian groups.
Conclusions
The ITA.LI.CA prognostic system includes both a tumor staging—stratifying patients with HCC into six main stages (0, A, B1, B2, B3, and C)—and a prognostic score—integrating ITA.LI.CA tumor staging, CPS, ECOG performance status, and AFP. The ITA.LI.CA prognostic system shows a strong ability to predict individual survival in European and Asian populations.
Using Italian and Taiwanese cohorts, Alessandro Vitale and colleagues develop and validate a staging system and prognostic model for hepatocellular carcinoma.
Editors' Summary
Background
Primary liver cancer—a tumor that starts when a liver cell acquires genetic changes that allow it and its descendants to divide uncontrollably and move around the body (metastasize)—is the sixth most common cancer and the second leading cause of cancer-related deaths worldwide. Liver cancer kills more than three-quarters of a million people every year, mostly in resource-limited countries. The risk of developing hepatocellular carcinoma (HCC; the most common type of liver cancer) is highest in eastern and southeastern Asia; among wealthier nations, the risk of HCC is particularly high in Italy. HCC can be treated by surgical removal of part of the liver, liver transplantation, ablation (which uses an electric current to destroy the cancer cells), intra-arterial therapies (which deliver drugs directly into the liver), or systemic (whole body) drug therapies. However, the symptoms of HCC, which include weight loss, tiredness, and jaundice, are vague. HCC is therefore rarely diagnosed before the cancer is advanced and has a poor prognosis (likely outcome)—fewer than 5% of patients survive for five or more years after diagnosis.
Why Was This Study Done?
Cancer staging describes the severity of a cancer based on the size and extent of the original tumor and whether the tumor has metastasized. Staging helps doctors estimate the patient’s prognosis and can help them devise a treatment plan that will, hopefully, improve patients’ quality of life and may extend their life expectancy. Several staging systems have been devised for HCC, but prognostic assessment of patients with HCC is controversial. No single prognostic model (a model that allows clinicians to obtain predictions about the likely outcomes of individual patients) has been universally adopted. An ideal model is difficult to achieve as it would need to consider tumor-related, liver-function-related, and patient-related variables, all of which have different impacts on patient prognosis. Here, the researchers use a database created by the Italian Liver Cancer (ITA.LI.CA) group that includes information on more than 5,000 Italians with HCC to develop a new prognostic model to predict individual patient outcomes based on tumor-related, liver-function-related, and patient-related variables.
What Did the Researchers Do and Find?
The researchers first defined ITA.LI.CA stages for HCC using tumor characteristics only. They then used information on 3,628 patients in the ITA.LI.CA database (the “training” set) and statistical modeling to calculate the relative prognostic value of tumor staging, Eastern Cooperative Oncology Group (ECOG) performance status (an indicator of whether patients are able to look after themselves and undertake normal daily activities), liver function (measured using the Child—Pugh score), and alpha-fetoprotein level (a liver tumor marker) in the prediction of the survival of individual patients. Based on these modeling results, they constructed an ITA.LI.CA integrated prognostic score. The researchers report that the observed and predicted median (average) survival times in the training set and in an internal validation cohort of 1,555 additional patients in the ITA.LI.CA database were similar. Moreover, although the observed and predicted survival times were lower in the Italian patients than in 2,651 patients with HCC from Taiwan, the ITA.LI.CA score had high discrimination and calibration features in this external validation cohort as well (the discrimination of a prognostic model indicates its ability to separate patients into groups with different outcomes, the calibration of a prognostic model is the degree of correspondence between predicted and observed outcomes). Finally, the prognostic ability of the new ITA.LI.CA prognostic model was significantly better than that of several other prognostic scoring systems.
What Do These Findings Mean?
These findings introduce a revised staging system for HCC and an integrated prognostic score—the ITA.LI.CA prognostic score—based on this staging system, Child—Pugh score, ECOG performance status, and alpha-fetoprotein level that has a greater ability to predict survival among Italian and Taiwanese patients than previous prognostic models. Because this study was retrospective—previously recorded data, including outcomes, were used to develop the prognostic model—a prospective trial is needed to validate the ITA.LI.CA prognostic score. That is, researchers need to enroll a group of patients, determine their ITA.LI.CA prognostic scores, and then follow the patients to determine their actual outcomes. If validated in this way and in other populations, use of the ITA.LI.CA prognostic score should allow clinicians to provide more accurate prognoses for individual patients, and may be a starting point for evaluating which treatment option is best suited to each patient presenting with HCC.
Additional Information
This list of resources contains links that can be accessed when viewing the PDF on a device or via the online version of the article at http://dx.doi.org/10.1371/journal.pmed.1002006.
This study is further discussed in a PLOS Medicine Perspective by Neehar Parikh and Amit Singal
The US National Cancer Institute provides information about all aspects of cancer, including detailed information for patients and professionals about primary liver cancer and about cancer staging (in English and Spanish)
The American Cancer Society also provides information about liver cancer (including information on support programs and services; available in several languages)
The UK National Health Service Choices website provides information about primary liver cancer (including a video about coping with cancer) and about cancer staging
Cancer Research UK (a not-for-profit organization) provides detailed information about primary liver cancer
The British Liver Trust (a not-for-profit organization) also provides information about liver cancer, including a personal story
MedlinePlus provides links to further resources about liver cancer (in English and Spanish)
doi:10.1371/journal.pmed.1002006
PMCID: PMC4846017  PMID: 27116206
19.  Semiparametric prognosis models in genomic studies 
Briefings in Bioinformatics  2010;11(4):385-393.
Development of high-throughput technologies makes it possible to survey the whole genome. Genomic studies have been extensively conducted, searching for markers with predictive power for prognosis of complex diseases such as cancer, diabetes and obesity. Most existing statistical analyses are focused on developing marker selection techniques, while little attention is paid to the underlying prognosis models. In this article, we review three commonly used prognosis models, namely the Cox, additive risk and accelerated failure time models. We conduct simulation and show that gene identification can be unsatisfactory under model misspecification. We analyze three cancer prognosis studies under the three models, and show that the gene identification results, prediction performance of all identified genes combined, and reproducibility of each identified gene are model-dependent. We suggest that in practical data analysis, more attention should be paid to the model assumption, and multiple models may need to be considered.
doi:10.1093/bib/bbp070
PMCID: PMC2905523  PMID: 20123942
genomic studies; semiparametric prognosis models; model comparison
20.  On Model Specification and Selection of the Cox Proportional Hazards Model* 
Statistics in medicine  2013;32(26):4609-4623.
Prognosis plays a pivotal role in patient management and trial design. A useful prognostic model should correctly identify important risk factors and estimate their effects. In this article, we discuss several challenges in selecting prognostic factors and estimating their effects using the Cox proportional hazards model. Although it has a flexible semiparametric form, Cox’s model is not entirely exempt from model misspecification. To minimize possible misspecification, instead of imposing the traditional linearity assumption, flexible modeling techniques have been proposed to accommodate nonlinear effects. We first review several existing nonparametric estimation and selection procedures and then present a numerical study to compare the performance of parametric and nonparametric procedures. We demonstrate the impact of model misspecification on variable selection and model prediction using a simulation study and an example from a phase III trial in prostate cancer.
doi:10.1002/sim.5876
PMCID: PMC3795916  PMID: 23784939
Cox’s Model; Model Selection; LASSO; Smoothing Splines; COSSO
21.  Guideline-concordant Timely Lung Cancer Care and Prognosis among Elderly Patients in the United States: A Population-based Study 
Cancer epidemiology  2015;39(6):1136-1144.
Objectives
The elderly carry a disproportionate burden of lung cancer in the US. Therefore, it is important to ensure that these patients receive quality cancer care. Timeliness of care is an important dimension of cancer care quality, but its impact on prognosis remains to be explored. This study evaluates the variations in guideline-concordant timely lung cancer care and prognosis among the elderly in the US.
Materials and Methods
Using the Surveillance, Epidemiology, and End Results (SEER)-Medicare database (2002-2007), we identified elderly patients with lung cancer (n = 48,850) and determined time to diagnosis and treatment. We categorized patients by receipt of timely care using guidelines from the British Thoracic Society and the RAND Corporation. A hierarchical generalized logistic model was constructed to identify variables associated with receipt of timely care. Kaplan-Meier analysis and the log-rank test were used for estimation and comparison of three-year survival. A multivariable Cox proportional hazards model was constructed to estimate lung cancer mortality risk associated with receipt of delayed care.
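For readers unfamiliar with the Kaplan-Meier analysis mentioned above, the product-limit estimator can be sketched in a few lines of Python. This is a generic textbook version under assumed inputs (a list of follow-up times and 0/1 event indicators); it is not the authors' code, and the function name is hypothetical.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate at each distinct event time.

    times  -- follow-up time for each patient
    events -- 1 if death was observed at that time, 0 if censored
    Returns a list of (time, estimated survival probability) pairs.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # deaths and censored patients both leave the risk set
    return curve
```

Reading the curve at 36 months for each care group would give the three-year survival estimates that the log-rank test then compares.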
Results
Time to diagnosis and treatment varied significantly among the elderly. However, the majority of them (77.5%) received guideline-concordant timely lung cancer care. The likelihood of receiving timely care significantly decreased with NSCLC disease, early stage diagnosis, increasing age, non-white race, higher comorbidity score, and lower income. Paradoxically, survival outcomes were significantly worse among patients receiving timely care. Adjusted lung cancer mortality risk was also significantly lower among patients receiving delayed care, relative to those receiving timely care (hazard ratio (HR) = 0.68, 95% confidence interval (CI) = (0.66 - 0.71); p ≤ 0.05).
Conclusion
This study highlights the critical need to address disparities in receipt of guideline-concordant timely lung cancer care among the elderly. Although timely care was not associated with better prognosis in this study, any delays in diagnosis and treatment should be avoided, as they may increase the risk of disease progression and psychological stress in patients. Furthermore, given that lung cancer diagnostic and management services are covered under the Medicare program, observed delays in care among Medicare beneficiaries are also a cause for concern.
doi:10.1016/j.canep.2015.06.005
PMCID: PMC4679644  PMID: 26138902
Lung; Cancer; Elderly; Medicare; Disparities; Guidelines; Treatment
22.  Semiparametric latent covariate mixed-effects models with application to a colon carcinogenesis study 
We study a mixed-effects model in which the response and the main covariate are linked by position. While the covariate corresponding to the observed response is not directly observable, there exists a latent covariate process that represents the underlying positional features of the covariate. When the positional features and the underlying distributions are parametric, the expectation-maximization (EM) algorithm is the most commonly used procedure. Without the parametric assumptions, however, the practical feasibility of a semiparametric EM algorithm and the corresponding inference procedures remains to be investigated. In this paper, we propose a semiparametric approach, and identify the conditions under which the semiparametric estimators share the same asymptotic properties as the unachievable estimators using the true values of the latent covariate; that is, the oracle property is achieved. We propose a Monte Carlo graphical evaluation tool to assess the adequacy of the sample size for achieving the oracle property. The semiparametric approach is later applied to data from a colon carcinogenesis study on the effects of cell DNA damage on the expression level of oncogene bcl-2. The graphical evaluation shows that, with a moderate number of subunits, the numerical performance of the semiparametric estimator is very close to the asymptotic limit. It indicates that a complex EM-based implementation may at most achieve minimal improvement and is thus unnecessary.
PMCID: PMC2818699  PMID: 20148130
Carcinogenesis; Consistency; Generalized estimating equation; Local linear smoothing; Mixed-effects model
23.  META-GSA: Combining Findings from Gene-Set Analyses across Several Genome-Wide Association Studies 
PLoS ONE  2015;10(10):e0140179.
Introduction
Gene-set analysis (GSA) methods are used as complementary approaches to genome-wide association studies (GWASs). The single marker association estimates of a predefined set of genes are either contrasted with those of all remaining genes or with a null non-associated background. To pool the p-values from several GSAs, it is important to take into account the concordance of the observed patterns resulting from single marker association point estimates across any given gene set. Here we propose META-GSA, an enhanced version of Fisher’s inverse χ2-method that weights each study to account for imperfect correlation between association patterns.
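For orientation, the unweighted Fisher inverse χ2-method that META-GSA builds on can be sketched as follows. This Python example shows only the classical combination of independent p-values; the study-specific weighting that distinguishes META-GSA is not described in the abstract and is not reproduced here. The function name is hypothetical.

```python
import math

def fisher_combined_p(p_values):
    """Classical Fisher combination of k independent p-values.

    The statistic X = -2 * sum(log p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null. For even degrees of freedom
    the upper tail has the closed form
        P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    so no statistics library is needed.
    """
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # equals X / 2
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total
```

With a single study the function returns the input p-value unchanged, and pooling two studies with p = 0.05 each gives roughly 0.017.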
Simulation and Power
We investigated the performance of META-GSA by simulating GWASs with 500 cases and 500 controls at 100 diallelic markers in 20 different scenarios, simulating different relative risks between 1 and 1.5 in gene sets of 10 genes. Wilcoxon’s rank sum test was applied as GSA for each study. We found that META-GSA has greater power to discover truly associated gene sets than simple pooling of the p-values, e.g. 59% versus 37% when the true relative risk for 5 of 10 genes was assumed to be 1.5. Under the null hypothesis of no difference in the true association pattern between the gene set of interest and the set of remaining genes, the results of both approaches are almost uncorrelated. We recommend not relying on p-values alone when combining the results of independent GSAs.
Application
We applied META-GSA to pool the results of four case-control GWASs of lung cancer risk (Central European Study and Toronto/Lunenfeld-Tanenbaum Research Institute Study; German Lung Cancer Study and MD Anderson Cancer Center Study), which had already been analyzed separately with four different GSA methods (EASE; SLAT, mSUMSTAT and GenGen). This application revealed the pathway GO0015291 “transmembrane transporter activity” as significantly enriched with associated genes (GSA-method: EASE, p = 0.0315 corrected for multiple testing). Similar results were found for GO0015464 “acetylcholine receptor activity” but only when not corrected for multiple testing (all GSA-methods applied; p≈0.02).
doi:10.1371/journal.pone.0140179
PMCID: PMC4621033  PMID: 26501144
24.  A Semiparametric Bayesian Model for Repeatedly Repeated Binary Outcomes 
Summary
We discuss the analysis of data from single nucleotide polymorphism (SNP) arrays comparing tumor and normal tissues. The data consist of sequences of indicators for loss of heterozygosity (LOH) and involve three nested levels of repetition: chromosomes for a given patient, regions within chromosomes, and SNPs nested within regions. We propose to analyze these data using a semiparametric model for multi-level repeated binary data. At the top level of the hierarchy we assume a sampling model for the observed binary LOH sequences that arises from a partial exchangeability argument. This implies a mixture of Markov chains model. The mixture is defined with respect to the Markov transition probabilities. We assume a nonparametric prior for the random mixing measure. The resulting model takes the form of a semiparametric random effects model with the matrix of transition probabilities being the random effects. The model includes appropriate dependence assumptions for the two remaining levels of the hierarchy, i.e., for regions within chromosomes and for chromosomes within patient. We use the model to identify regions of increased LOH in a dataset coming from a study of treatment-related leukemia in children with an initial cancer diagnosis. The model successfully identifies the desired regions and performs well compared to other available alternatives.
doi:10.1111/j.1467-9876.2008.00619.x
PMCID: PMC2739390  PMID: 19746193
25.  Clinical prognostic significance and pro-metastatic activity of RANK/RANKL via the AKT pathway in endometrial cancer 
Oncotarget  2015;7(5):5564-5575.
RANK/RANKL plays a key role in metastasis of certain malignant tumors, which makes it a promising target for developing novel therapeutic strategies for cancer. However, the prognostic value and pro-metastatic activity of RANK in endometrial cancer (EC) remain to be determined. Thus, the present study investigated the effect of RANK on the prognosis of EC patients, as well as the pro-metastatic activity of EC cells. The results indicated that those with high expression of RANK showed decreased overall survival and progression-free survival. Statistical analysis revealed the positive correlations between RANK/RANKL expression and metastasis-related factors. Additionally, RANK/RANKL significantly promoted cell migration/invasion via activating AKT/β-catenin/Snail pathway in vitro. However, RANK/RANKL-induced AKT activation could be suppressed after osteoprotegerin (OPG) treatment. Furthermore, the combination of medroxyprogesterone acetate (MPA) and RANKL could in turn attenuate the effect of RANKL alone. Similarly, MPA could partially inhibit the RANK-induced metastasis in an orthotopic mouse model via suppressing AKT/β-catenin/Snail pathway. Therefore, therapeutic inhibition of MPA in RANK/RANKL-induced metastasis was mediated by AKT/β-catenin/Snail pathway both in vitro and in vivo, suggesting a potential target of RANK for gene-based therapy for EC.
doi:10.18632/oncotarget.6795
PMCID: PMC4868706  PMID: 26734994
RANK; RANKL; endometrial cancer; prognosis; metastasis
