Interferon regulatory factor (IRF)-5 is a transcription factor involved in type I interferon signaling whose germ line variants have been associated with autoimmune pathogenesis. Since relationships have been observed between development of autoimmunity and responsiveness of melanoma to several types of immunotherapy, we tested whether polymorphisms of IRF5 are associated with responsiveness of melanoma to adoptive therapy with tumor infiltrating lymphocytes (TILs).
140 TILs were genotyped for four single nucleotide polymorphisms (rs10954213, rs11770589, rs6953165, rs2004640) and one insertion-deletion in the IRF5 gene by sequencing. Gene expression profiles of the TILs, 112 parental melanoma metastases (MM), and 9 cell lines derived from some of the metastases were assessed by Affymetrix Human Gene ST 1.0 array.
Lack of the A allele in rs10954213 (G > A) was associated with non-response (p < 0.005). Other polymorphisms in strong linkage disequilibrium with rs10954213 demonstrated similar trends. Genes differentially expressed in vitro between cell lines with and without the A allele could be applied to the transcriptional profile of 112 melanoma metastases to predict their responsiveness to therapy, suggesting that IRF5 genotype may influence immune responsiveness by affecting the intrinsic biology of melanoma.
This study is the first to analyze associations between melanoma immune responsiveness and IRF5 polymorphism. The results support a common genetic basis which may underlie the development of autoimmunity and melanoma immune responsiveness.
We demonstrate that clinical trials using response adaptive randomized treatment assignment rules are subject to substantial bias if there are time trends in unknown prognostic factors and standard methods of analysis are used. We develop a general class of randomization tests based on generating the null distribution of a general test statistic by repeating the adaptive randomized treatment assignment rule holding fixed the sequence of outcome values and covariate vectors actually observed in the trial. We develop broad conditions on the adaptive randomization method and the stochastic mechanism by which outcomes and covariate vectors are sampled that ensure that the type I error is controlled at the level of the randomization test. These conditions ensure that the use of the randomization test protects the type I error against time trends that are independent of the treatment assignments. Under some conditions in which the prognosis of future patients is determined by knowledge of the current randomization weights, the type I error is not strictly protected. We show that response-adaptive randomization can result in substantial reduction in statistical power when the type I error is preserved. Our results also ensure that type I error is controlled at the level of the randomization test for adaptive stratification designs used for balancing covariates.
Response adaptive randomization; adaptive stratification; clinical trials
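The re-randomization idea described above can be illustrated with a minimal Python sketch. The assignment rule `adaptive_assign`, the rate-difference statistic, and all parameter names are assumptions made for illustration; the abstract does not specify a particular rule or statistic. The observed outcome sequence is held fixed, the adaptive rule is re-run many times, and the observed statistic is compared with the resulting null distribution.

```python
import random


def adaptive_assign(outcomes, rng):
    """Hypothetical response-adaptive rule: favor the arm with the better
    smoothed success rate so far. Returns a list of arm labels (0 or 1)."""
    successes = [1, 1]  # pseudo-counts so early rates are defined
    failures = [1, 1]
    arms = []
    for y in outcomes:
        rate = [successes[a] / (successes[a] + failures[a]) for a in (0, 1)]
        p_arm1 = rate[1] / (rate[0] + rate[1])
        arm = 1 if rng.random() < p_arm1 else 0
        arms.append(arm)
        if y:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return arms


def rate_difference(outcomes, arms):
    """Test statistic: success rate on arm 1 minus success rate on arm 0."""
    by_arm = {0: [], 1: []}
    for y, a in zip(outcomes, arms):
        by_arm[a].append(y)
    if not by_arm[0] or not by_arm[1]:
        return 0.0  # one arm empty: statistic taken as 0 (assumption)
    return sum(by_arm[1]) / len(by_arm[1]) - sum(by_arm[0]) / len(by_arm[0])


def randomization_p_value(outcomes, observed_arms, n_rerandomizations=2000, seed=0):
    """One-sided p-value: outcomes stay fixed in their observed order and
    only the adaptive treatment assignment is regenerated."""
    rng = random.Random(seed)
    observed = rate_difference(outcomes, observed_arms)
    at_least = 0
    for _ in range(n_rerandomizations):
        arms = adaptive_assign(outcomes, rng)
        if rate_difference(outcomes, arms) >= observed:
            at_least += 1
    return (1 + at_least) / (1 + n_rerandomizations)
```

Because the outcomes are never reshuffled, a time trend in prognosis is carried unchanged into every re-randomized data set, which is the intuition behind the protection of the type I error described above.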
For medical classification problems, it is often desirable to have a probability associated with each class. Probabilistic classifiers have received relatively little attention for small n large p classification problems despite their importance in medical decision making. In this paper, we introduce 2 criteria for assessment of probabilistic classifiers: well-calibratedness and refinement and develop corresponding evaluation measures. We evaluated several published high-dimensional probabilistic classifiers and developed 2 extensions of the Bayesian compound covariate classifier. Based on simulation studies and analysis of gene expression microarray data, we found that proper probabilistic classification is more difficult than deterministic classification. It is important to ensure that a probabilistic classifier is well calibrated or at least not “anticonservative” using the methods developed here. We provide this evaluation for several probabilistic classifiers and also evaluate their refinement as a function of sample size under weak and strong signal conditions. We also present a cross-validation method for evaluating the calibration and refinement of any probabilistic classifier on any data set.
Gene expression analysis; High-dimensional data; Microarray; Probabilistic classification
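As a rough illustration of what checking calibration involves, the Python sketch below bins predicted probabilities and compares each bin's mean prediction with the observed class frequency. This is a generic reliability table plus the standard Brier score, not the evaluation measures developed in the paper; the function names and the equal-width binning are assumptions.

```python
def calibration_table(probs, labels, n_bins=10):
    """Group predictions into equal-width probability bins and compare the
    mean predicted probability with the observed class-1 frequency.
    Returns (mean_predicted, observed_frequency, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, y))
    table = []
    for members in bins:
        if not members:
            continue
        mean_pred = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        table.append((mean_pred, observed, len(members)))
    return table


def brier_score(probs, labels):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)
```

For an honest assessment the probabilities passed in should be cross-validated predictions, so that no sample is scored by a model that was trained on it. Bins whose predicted probabilities are more extreme than the observed frequencies would suggest an overconfident classifier.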
In 2009, an outbreak of raccoon rabies in Central Park in New York City, New York, USA, infected 133 raccoons. Five persons and 2 dogs were exposed but did not become infected. A trap-vaccinate-release program vaccinated ≈500 raccoons and contributed to the end of the epizootic.
rabies; raccoon; vaccination; epizootic; urban; New York; TVR; trap-vaccinate-release; viruses
Developments in whole genome biotechnology have stimulated statistical focus on prediction methods. We review here methodology for classifying patients into survival risk groups and for using cross-validation to evaluate such classifications. Measures of discrimination for survival risk models include separation of survival curves, time-dependent ROC curves and Harrell’s concordance index. For high-dimensional data applications, however, computing these measures as re-substitution statistics on the same data used for model development results in highly biased estimates. Most developments in methodology for survival risk modeling with high-dimensional data have utilized separate test data sets for model evaluation. Cross-validation has sometimes been used for optimization of tuning parameters. In many applications, however, the data available are too limited for effective division into training and test sets and consequently authors have often either reported re-substitution statistics or analyzed their data using binary classification methods in order to utilize familiar cross-validation. In this article we have tried to indicate how to utilize cross-validation for the evaluation of survival risk models; specifically how to compute cross-validated estimates of survival distributions for predicted risk groups and how to compute cross-validated time-dependent ROC curves. We have also discussed evaluation of the statistical significance of a survival risk model and evaluation of whether high-dimensional genomic data adds predictive accuracy to a model based on standard covariates alone.
predictive medicine; survival risk classification; cross-validation; gene expression
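Of the discrimination measures listed above, Harrell's concordance index is the simplest to write down. The following Python sketch is a minimal version for right-censored data; the function name is an assumption, and pairs with tied survival times are simply skipped, which is a simplification.

```python
def concordance_index(times, events, risks):
    """Harrell's C for right-censored data.

    times  : follow-up time per subject
    events : 1 if the event was observed, 0 if censored
    risks  : predicted risk score (higher means worse prognosis)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is usable when subject i had an observed event
            # strictly before subject j's follow-up time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

To avoid the re-substitution bias discussed above, each subject's risk score should come from a model fitted without that subject, for example from the training portion of a cross-validation fold in which gene selection was also repeated.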
Cell type heterogeneity may have a substantial effect on gene expression profiling of human tissue. Several in silico methods for deconvoluting a gene expression profile into cell-type-specific subprofiles have been published but not widely used. Here, we consider recent methods and the experimental validations available for them. Shen-Orr et al. recently developed an approach called cell-type-specific significance analysis of microarray for deconvoluting gene expression. This method requires the measurement of the proportion of each cell type in each sample and the expression profiles of the heterogeneous samples. It determines how gene expression varies among pre-defined phenotypes for each cell type. Gene expression can vary substantially among cell types and sample heterogeneity can mask the identification of biologically important phenotypic correlations. Consequently, the deconvolution approach can be useful in the analysis of mixtures of cell populations in clinical samples.
Although numerous methods of using microarray data analysis for cancer classification have been proposed, most utilize many genes to achieve accurate classification. This can hamper interpretability of the models and ease of translation to other assay platforms. We explored the use of single genes to construct classification models. We first identified the genes with the most powerful univariate class discrimination ability and then constructed simple classification rules for class prediction using the single genes.
We applied our model development algorithm to eleven cancer gene expression datasets and compared classification accuracy to that for standard methods including Diagonal Linear Discriminant Analysis, k-Nearest Neighbor, Support Vector Machine and Random Forest. The single gene classifiers provided classification accuracy comparable to or better than those obtained by existing methods in most cases. We analyzed the factors that determined when simple single gene classification is effective and when more complex modeling is warranted.
For most of the datasets examined, the single-gene classification methods appear to work as well as more standard methods, suggesting that simple models could perform well in microarray-based cancer prediction.
biomarkers; early detection; genomics; personalized medicine; translational research
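A single-gene classification rule of the kind described above can be sketched in Python as follows. The abstract does not give the selection criterion or the form of the rule, so this sketch assumes the simplest option: for each gene, try every midpoint between adjacent expression values as a cut-point and keep the gene and cut-point with the highest training accuracy. The names `fit_single_gene` and `predict_single_gene` are hypothetical.

```python
def fit_single_gene(X, y):
    """X: list of samples, each a list of gene expression values.
    y: list of 0/1 class labels.
    Returns (gene_index, cut_point, class_when_above_cut)."""
    best = None
    for g in range(len(X[0])):
        values = sorted(set(row[g] for row in X))
        for low, high in zip(values, values[1:]):
            cut = (low + high) / 2
            for high_class in (0, 1):
                correct = sum(
                    (high_class if row[g] > cut else 1 - high_class) == label
                    for row, label in zip(X, y)
                )
                if best is None or correct > best[0]:
                    best = (correct, g, cut, high_class)
    if best is None:
        raise ValueError("no gene varies across samples")
    return best[1:]


def predict_single_gene(model, row):
    """Apply the fitted one-gene threshold rule to a new sample."""
    g, cut, high_class = model
    return high_class if row[g] > cut else 1 - high_class
```

A usage example with made-up values: `fit_single_gene([[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]], [0, 0, 1, 1])` selects gene 0 with a cut-point of 2.5. Any accuracy estimate for such a rule should repeat the gene search inside each cross-validation fold.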
DAPfinder and DAPview are novel BRB-ArrayTools plug-ins to construct gene coexpression networks and identify significant differences in pairwise gene-gene coexpression between two phenotypes.
Each significant difference in gene-gene association represents a Differentially Associated Pair (DAP). Our tools include several choices of filtering methods, gene-gene association metrics, statistical testing methods and multiple comparison adjustments. Network results are easily displayed in Cytoscape. Analyses of glioma experiments and microarray simulations demonstrate the utility of these tools.
DAPfinder is a new user-friendly tool for reconstruction and comparison of biological networks.
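The notion of a differentially associated pair can be illustrated with a short Python sketch. It assumes one particular combination of the options mentioned above: Pearson correlation as the association metric and a label-permutation test for the difference between the two phenotypes. It is not the plug-in's code, all names are hypothetical, and no multiple-comparison adjustment is applied.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # undefined correlation treated as 0 (assumption)
    return sxy / math.sqrt(sxx * syy)


def correlation_gap(pairs, labels):
    """Absolute difference in gene-gene correlation between two phenotypes.
    pairs holds one (gene_a, gene_b) expression tuple per sample."""
    groups = {}
    for pair, label in zip(pairs, labels):
        groups.setdefault(label, []).append(pair)
    if len(groups) != 2:
        raise ValueError("exactly two phenotype labels are required")
    r = [
        pearson([a for a, _ in members], [b for _, b in members])
        for members in groups.values()
    ]
    return abs(r[0] - r[1])


def dap_p_value(pairs, labels, n_permutations=999, seed=0):
    """Permutation p-value for the correlation gap of one gene pair."""
    rng = random.Random(seed)
    observed = correlation_gap(pairs, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # reassign phenotype labels at random
        if correlation_gap(pairs, shuffled) >= observed:
            hits += 1
    return (1 + hits) / (1 + n_permutations)
```

In a genome-wide setting this test would be repeated for every gene pair that passes filtering, which is why a multiple-comparison adjustment is needed before declaring any pair significant.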
We consider the problem of designing a study to develop a predictive classifier from high dimensional data. A common study design is to split the sample into a training set and an independent test set, where the former is used to develop the classifier and the latter to evaluate its performance. In this paper we address the question of what proportion of the samples should be devoted to the training set. How does this proportion impact the mean squared error (MSE) of the prediction accuracy estimate?
We develop a non-parametric algorithm for determining an optimal splitting proportion that can be applied with a specific dataset and classifier algorithm. We also perform a broad simulation study for the purpose of better understanding the factors that determine the best split proportions and to evaluate commonly used splitting strategies (1/2 training or 2/3 training) under a wide variety of conditions. These methods are based on a decomposition of the MSE into three intuitive component parts.
By applying these approaches to a number of synthetic and real microarray datasets we show that for linear classifiers the optimal proportion depends on the overall number of samples available and the degree of differential expression between the classes. The optimal proportion was found to depend on the full dataset size (n) and classification accuracy, with higher accuracy and smaller n resulting in more samples assigned to the training set. The commonly used strategy of allocating two-thirds of cases for training was close to optimal for reasonably sized datasets (n ≥ 100) with strong signals (i.e., 85% or greater full dataset accuracy). In general, we recommend use of our nonparametric resampling approach for determining the optimal split. This approach can be applied to any dataset, using any predictor development method, to determine the best split.
A substantial number of studies have reported the development of gene expression–based prognostic signatures for lung cancer. The ultimate aim of such studies should be the development of well-validated clinically useful prognostic signatures that improve therapeutic decision making beyond current practice standards. We critically reviewed published studies reporting the development of gene expression–based prognostic signatures for non–small cell lung cancer to assess the progress made toward this objective. Studies published between January 1, 2002, and February 28, 2009, were identified through a PubMed search. Following hand-screening of abstracts of the identified articles, 16 were selected as relevant. Those publications were evaluated in detail for appropriateness of the study design, statistical validation of the prognostic signature on independent datasets, presentation of results in an unbiased manner, and demonstration of medical utility for the new signature beyond that obtained using existing treatment guidelines. Based on this review, we found little evidence that any of the reported gene expression signatures are ready for clinical application. We also found serious problems in the design and analysis of many of the studies. We suggest a set of guidelines to aid the design, analysis, and evaluation of prognostic signature studies. These guidelines emphasize the importance of focused study planning to address specific medically important questions and the use of unbiased analysis methods to evaluate whether the resulting signatures provide evidence of medical utility beyond standard of care–based prognostic factors.
Rationale: Ineffective repair of a damaged alveolar epithelium has been postulated to cause pulmonary fibrosis. In support of this theory, epithelial cell abnormalities, including hyperplasia, apoptosis, and persistent denudation of the alveolar basement membrane, are found in the lungs of humans with idiopathic pulmonary fibrosis and in animal models of fibrotic lung disease. Furthermore, mutations in genes that affect regenerative capacity or that cause injury/apoptosis of type II alveolar epithelial cells have been identified in familial forms of pulmonary fibrosis. Although these findings are compelling, there are no studies that demonstrate a direct role for the alveolar epithelium or, more specifically, type II cells in the scarring process.
Objectives: To determine if a targeted injury to type II cells would result in pulmonary fibrosis.
Methods: A transgenic mouse was generated to express the human diphtheria toxin receptor on type II alveolar epithelial cells. Diphtheria toxin was administered to these animals to specifically target the type II epithelium for injury. Lung fibrosis was assessed by histology and hydroxyproline measurement.
Measurements and Main Results: Transgenic mice treated with diphtheria toxin developed an approximately twofold increase in their lung hydroxyproline content on Days 21 and 28 after diphtheria toxin treatment. The fibrosis developed in conjunction with type II cell injury. Histological evaluation revealed diffuse collagen deposition with patchy areas of more confluent scarring and associated alveolar contraction.
Conclusions: The development of lung fibrosis in the setting of type II cell injury in our model provides evidence for a causal link between the epithelial defects seen in idiopathic pulmonary fibrosis and the corresponding areas of scarring.
diphtheria toxin; lung; collagen; scarring
The development of tumor biomarkers ready for clinical use is complex. We propose a refined system for biomarker study design, conduct, analysis, and evaluation that incorporates a hierarchical level of evidence scale for tumor marker studies, including those using archived specimens. Although fully prospective randomized clinical trials to evaluate the medical utility of a prognostic or predictive biomarker are the gold standard, such trials are costly, so we discuss more efficient indirect “prospective–retrospective” designs using archived specimens. In particular, we propose new guidelines that stipulate that 1) adequate amounts of archived tissue must be available from enough patients from a prospective trial (which for predictive factors should generally be a randomized design) for analyses to have adequate statistical power and for the patients included in the evaluation to be clearly representative of the patients in the trial; 2) the test should be analytically and preanalytically validated for use with archived tissue; 3) the plan for biomarker evaluation should be completely specified in writing before the performance of biomarker assays on archived tissue and should be focused on evaluation of a single completely defined classifier; and 4) the results from archived specimens should be validated using specimens from one or more similar, but separate, studies.
Physicians need improved tools for selecting treatments for individual patients. Many diagnostic entities that were traditionally viewed as individual diseases are heterogeneous in their molecular pathogenesis and treatment responsiveness. This results in the treatment of many patients with ineffective drugs, the incurring of substantial medical costs for the treatment of patients who do not benefit and the conducting of large clinical trials to identify small, average treatment benefits for heterogeneous groups of patients. In oncology, new genomic technologies provide powerful tools for the selection of patients who require systemic treatment and are most (or least) likely to benefit from a molecularly targeted therapeutic. In the large amount of literature on biomarkers, there is considerable uncertainty and confusion regarding the specifics involved in the development and evaluation of prognostic and predictive biomarker diagnostics. There is a lack of appreciation that the development of drugs with companion diagnostics increases the complexity of clinical development. Adapting to the fundamental importance of tumor heterogeneity and achieving the benefits of personalized oncology for patients and healthcare costs will require paradigm changes for clinical and statistical investigators in academia, industry and regulatory agencies. In this review, I attempt to address some of these issues and provide guidance on the design of clinical trials for evaluating the clinical utility and robustness of prognostic and predictive biomarkers.
adaptive design; biomarker; clinical trial design; predictive; prognostic; validation
The traditional oncology drug development paradigm of single arm phase II studies followed by a randomized phase III study has limitations for modern oncology drug development. Interpretation of single arm phase II study results is difficult when a new drug is used in combination with other agents or when progression free survival is used as the endpoint rather than tumor shrinkage. Randomized phase II studies are more informative for these objectives but increase both the number of patients and time required to determine the value of a new experimental agent. In this paper, we compare different phase II study strategies to determine the most efficient drug development path in terms of number of patients and length of time to conclusion of drug efficacy on overall survival.
DNA microarrays are powerful tools for studying biological mechanisms and for developing prognostic and predictive classifiers for identifying the patients who require treatment and are the best candidates for specific treatments. Because microarrays produce so much data from each specimen, they offer great opportunities for discovery and great dangers of producing misleading claims. Microarray based studies require clear objectives for selecting cases and appropriate analysis methods. Effective analysis of microarray data, where the number of measured variables is orders of magnitude greater than the number of cases, requires specialized statistical methods which have recently been developed. Recent literature reviews indicate that serious problems of analysis exist in a substantial proportion of publications. This manuscript attempts to provide a nontechnical summary of the key principles of statistical design and analysis for studies that utilize microarray expression profiling.
Bioinformatics; biomarkers; gene expression signatures; microarray data analysis
Plasminogen activation to plasmin protects from lung fibrosis, but the mechanism underlying this antifibrotic effect remains unclear. We found that mice lacking plasminogen activator inhibitor–1 (PAI-1), which are protected from bleomycin-induced pulmonary fibrosis, exhibit lung overproduction of the antifibrotic lipid mediator prostaglandin E2 (PGE2). Plasminogen activation upregulated PGE2 synthesis in alveolar epithelial cells, lung fibroblasts, and lung fibrocytes from saline- and bleomycin-treated mice, as well as in normal fetal and adult primary human lung fibroblasts. This response was exaggerated in cells from Pai1–/– mice. Although enhanced PGE2 formation required the generation of plasmin, it was independent of proteinase-activated receptor 1 (PAR-1) and instead reflected proteolytic activation and release of HGF with subsequent induction of COX-2. That the HGF/COX-2/PGE2 axis mediates in vivo protection from fibrosis in Pai1–/– mice was demonstrated by experiments showing that a selective inhibitor of the HGF receptor c-Met increased lung collagen to WT levels while reducing COX-2 protein and PGE2 levels. Of clinical interest, fibroblasts from patients with idiopathic pulmonary fibrosis were found to be defective in their ability to induce COX-2 and, therefore, unable to upregulate PGE2 synthesis in response to plasmin or HGF. These studies demonstrate crosstalk between plasminogen activation and PGE2 generation in the lung and provide a mechanism for the well-known antifibrotic actions of the fibrinolytic pathway.
There have been relatively few publications using linear regression models to predict a continuous response based on microarray expression profiles. Standard linear regression methods are problematic when the number of predictor variables exceeds the number of cases. We have evaluated three linear regression algorithms that can be used for the prediction of a continuous response based on high dimensional gene expression data. The three algorithms are the least angle regression (LAR), the least absolute shrinkage and selection operator (LASSO), and the averaged linear regression method (ALM). All methods are tested using simulations based on a real gene expression dataset and analyses of two sets of real gene expression data and using an unbiased complete cross validation approach. Our results show that the LASSO algorithm often provides a model with somewhat lower prediction error than the LAR method, but both of them perform more efficiently than the ALM predictor. We have developed a plug-in for BRB-ArrayTools that implements the LAR and the LASSO algorithms with complete cross-validation.
regression model; gene expression; continuous outcome
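To make the LASSO with complete cross-validation concrete, here is a minimal pure-Python sketch. It uses cyclic coordinate descent with soft-thresholding, which is a standard way to fit the LASSO but not necessarily what the BRB-ArrayTools plug-in does; the function names and the fixed penalty `lam` are assumptions. "Complete" here means that every model-fitting step, including the implicit gene selection performed by the penalty, is redone inside each training fold.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_fit(X, y, lam, n_iter=200):
    """Minimize (1/2n)*sum((y - b0 - X.b)^2) + lam*sum(|b|) by cyclic
    coordinate descent on centered predictors. Returns (b0, b)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    resid = [yi - y_mean for yi in y]
    col_sq = [sum(row[j] ** 2 for row in Xc) / n for j in range(p)]
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] < 1e-12:
                continue  # constant predictor carries no information
            # Covariance of predictor j with the partial residual.
            rho = sum(Xc[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * Xc[i][j]
                beta[j] = new
    intercept = y_mean - sum(b * m for b, m in zip(beta, x_mean))
    return intercept, beta


def cv_mse(X, y, lam, k=5):
    """k-fold cross-validated mean squared prediction error. The model is
    refitted from scratch on each training fold."""
    n = len(X)
    squared_error = 0.0
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        b0, beta = lasso_fit([X[i] for i in train], [y[i] for i in train], lam)
        for i in range(fold, n, k):
            prediction = b0 + sum(b * x for b, x in zip(beta, X[i]))
            squared_error += (y[i] - prediction) ** 2
    return squared_error / n
```

If `lam` is chosen by looking at prediction error, that choice also has to be nested inside each training fold; otherwise the cross-validated error is no longer unbiased.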
Motivation: Major tumor sequencing projects have been conducted in the past few years to identify genes that contain ‘driver’ somatic mutations in tumor samples. These genes have been defined as those for which the non-silent mutation rate is significantly greater than a background mutation rate estimated from silent mutations. Several methods have been used for estimating the background mutation rate.
Results: We propose a new method for identifying cancer driver genes, which we believe provides improved accuracy. The new method accounts for the functional impact of mutations on proteins, variation in background mutation rate among tumors and the redundancy of the genetic code. We reanalyzed sequence data for 623 candidate genes in 188 non-small cell lung tumors using the new method. We found several important genes like PTEN, which were not deemed significant by the previous method. At the same time, we determined that some genes previously reported as drivers were not significant by the new analysis because mutations in these genes occurred mainly in tumors with large background mutation rates.
Availability: The software is available at: http://linus.nci.nih.gov/Data/YounA/software.zip
Supplementary information: Supplementary data are available at Bioinformatics online.
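The baseline definition of a driver gene given in the Motivation (a non-silent mutation rate significantly above the background rate) corresponds to a one-sided binomial test. The Python sketch below shows only that baseline idea. It does not implement the new method, which additionally accounts for functional impact, tumor-specific background rates and code redundancy. The function name and the example numbers are hypothetical.

```python
import math


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p) with 0 < p < 1, summed in log
    space so that large n (sites times tumors) does not overflow."""
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k, n + 1):
        log_term = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * log_p + (n - i) * log_q
        )
        term = math.exp(log_term)
        total += term
        # Past the mean the terms only shrink, so stop once negligible.
        if i > n * p and term <= total * 1e-16:
            break
    return min(total, 1.0)
```

As a usage illustration with invented numbers, `binom_upper_tail(6, 200000, 1e-6)` gives the chance of seeing at least 6 non-silent mutations across 200,000 site-by-tumor opportunities when the background rate is one per million. A small value would flag the gene, subject to correction for the number of genes tested.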
Developments in whole genome biotechnology have dramatically increased the opportunities for developing more effective therapeutics and for targeting them to patients who require them and who can benefit from them. This can have profound benefits for patients and for the economics of health care. There are, however, many obstacles to overcome in achieving this revolution. The effectiveness of translational research in oncology is seriously limited by many factors, both structural and scientific. Some of the obstacles involve the failure of biomedical organizations to develop and fund new models of inter-disciplinary collaboration needed to attract and support the best and brightest quantitative scientists to predictive medicine. Many of the challenges are scientific, requiring paradigm changes in the way drugs are developed and the way clinical trials are designed and analyzed. Some of these issues are addressed here, specifically in the context of developing molecular diagnostics in a manner that moves retrospective correlative science to prospective predictive medicine.
The three HLA class II alleles of the DR2 haplotype, DRB1*1501, DRB5*0101, and DQB1*0602, are in strong linkage disequilibrium and confer most of the genetic risk to multiple sclerosis. Functional redundancy in Ag presentation by these class II molecules would allow recognition by a single TCR of identical peptides with the different restriction elements, facilitating T cell activation and providing one explanation how a disease-associated HLA haplotype could be linked to a CD4+ T cell-mediated autoimmune disease. Using combinatorial peptide libraries and B cell lines expressing single HLA-DR/DQ molecules, we show that two of five in vivo-expanded and likely disease-relevant, cross-reactive cerebrospinal fluid-infiltrating T cell clones use multiple disease-associated HLA class II molecules as restriction elements. One of these T cell clones recognizes >30 identical foreign and human peptides using all DR and DQ molecules of the multiple sclerosis-associated DR2 haplotype. A T cell signaling machinery tuned for efficient responses to weak ligands together with structural features of the TCR-HLA/peptide complex result in this promiscuous HLA class II restriction.
Using a question and answer format we describe important aspects of using genomic technologies in cancer research. The main challenges are not managing the mass of data, but rather the design, analysis and accurate reporting of studies that result in increased biological knowledge and medical utility. Many analysis issues address the use of expression microarrays but are also applicable to other whole genome assays. Microarray based clinical investigations have generated both unrealistic hyperbole and excessive skepticism. Genomic technologies are tremendously powerful and will play instrumental roles in elucidating the mechanisms of oncogenesis and in developing an era of predictive medicine in which treatments are tailored to individual tumors. Achieving these goals involves challenges in re-thinking many paradigms for the conduct of basic and clinical cancer research and for the organization of interdisciplinary collaboration.
Many syndromes traditionally viewed as individual diseases are heterogeneous in molecular pathogenesis and treatment responsiveness. This often leads to the conduct of large clinical trials to identify small average treatment benefits for heterogeneous groups of patients. Drugs that demonstrate effectiveness in such trials may subsequently be used broadly, resulting in ineffective treatment of many patients. New genomic and proteomic technologies provide powerful tools for the selection of patients likely to benefit from a therapeutic without unacceptable adverse events. In spite of the large literature on developing predictive biomarkers, there is considerable confusion about the development and validation of biomarker based diagnostic classifiers for treatment selection. In this paper we attempt to clarify some of these issues and to provide guidance on the design of clinical trials for evaluating the clinical utility and robustness of pharmacogenomic classifiers.
Pharmacogenomics; biomarker; genomics; DNA microarray; clinical trial design; validation
Microarray based expression profiling is a powerful technology for studying biological mechanisms and for developing clinically valuable predictive classifiers. The high dimensional readout for each sample assayed makes it possible to do new kinds of studies but also increases the risks of misleading conclusions. We review here the current state-of-the-art for design and analysis of microarray based investigations.
Apoptosis of fibroblasts/myofibroblasts is a critical event in the resolution of tissue repair responses; however, mechanisms for the regulation of (myo)fibroblast apoptosis/survival remain unclear. In this study, we demonstrate counter-regulatory interactions between the plasminogen activation system and transforming growth factor-β1 (TGF-β1) in the control of fibroblast apoptosis. Plasmin treatment induced fibroblast apoptosis in a time- and dose-dependent manner in association with proteolytic degradation of extracellular matrix proteins, as detected by the release of soluble fibronectin peptides. Plasminogen, which was activated to plasmin by fibroblasts, also induced fibronectin proteolysis and fibroblast apoptosis, both of which were blocked by α2-antiplasmin but not by inhibition of matrix metalloproteinase activity. TGF-β1 protected fibroblasts from apoptosis induced by plasminogen but not from apoptosis induced by exogenous plasmin. The protection from plasminogen-induced apoptosis conferred by TGF-β1 is associated with the up-regulation of plasminogen activator inhibitor-1 (PAI-1) expression and inhibition of plasminogen activation. Moreover, lung fibroblasts from mice genetically deficient in PAI-1 lose the protective effect of TGF-β1 against plasminogen-induced apoptosis. These findings support a novel role for the plasminogen activation system in the regulation of fibroblast apoptosis and a potential role of TGF-β1/PAI-1 in promoting (myo)fibroblast survival in chronic fibrotic disorders.
myofibroblast; fibrosis; transforming growth factor-β; anoikis; plasminogen activator inhibitor 1