Epithelial-mesenchymal transition (EMT) is a physiological program that is activated during cancer cell invasion and metastasis. We show here that EMT-related processes are linked to a broad and conserved program of transcriptional alterations that are influenced by cell contact and adhesion. Using cultured human breast cancer and mouse mammary epithelial cells, we find that reduced cell density, conditions under which cell contact is reduced, leads to reduced expression of genes associated with mammary epithelial cell differentiation and increased expression of genes associated with breast cancer. We further find that treatment of cells with matrix metalloproteinase-3 (MMP-3), an inducer of EMT, interrupts a defined subset of cell contact-regulated genes, including genes encoding a variety of RNA splicing proteins known to regulate the expression of Rac1b, an activated splice isoform of Rac1 known to be a key mediator of MMP-3-induced EMT in breast, lung, and pancreas. These results provide new insights into how MMPs act in cancer progression and how loss of cell–cell interactions is a key step in the earliest stages of cancer development.
breast cancer; mammary epithelial cells; cell contact; epithelial-mesenchymal transition; matrix metalloproteinases; extracellular matrix
A major challenge in the diagnosis and treatment of brain tumors is tissue heterogeneity leading to mixed treatment response. Additionally, these tumors are often difficult or very risky to biopsy, further hindering clinical management. To overcome this, novel advanced imaging methods are increasingly being adapted clinically to identify useful noninvasive biomarkers capable of disease stage characterization and treatment response prediction. One promising technique is called functional diffusion mapping (fDM), which uses diffusion-weighted imaging (DWI) to generate parametric maps between two imaging time points in order to identify significant voxel-wise changes in water diffusion within the tumor tissue. Here we introduce serial functional diffusion mapping (sfDM), an extension of existing fDM methods, to analyze the entire tumor diffusion profile along the temporal course of the disease. sfDM provides the tools necessary to analyze a tumor data set in the context of spatiotemporal parametric mapping: the image registration pipeline, biomarker extraction, and visualization tools. We present the general workflow of the pipeline, along with a typical use case for the software. sfDM is written in Python and is freely available as an open-source package under the Berkeley Software Distribution (BSD) license to promote transparency and reproducibility.
diffusion MR; parametric mapping; brain tumors; open source
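The voxel-wise comparison that fDM performs between two time points, and its serial extension across a whole time course, can be sketched as follows. This is a minimal illustration and not sfDM's own code: the function names `classify_voxels` and `serial_maps`, the flat-list representation of co-registered diffusion maps, and the `threshold` cutoff are all assumptions made for the example.

```python
def classify_voxels(adc_before, adc_after, threshold):
    """Label each voxel by its change in diffusion between two scans.

    Hypothetical helper: inputs are flat lists of diffusion values from two
    co-registered scans; `threshold` is an assumed cutoff for a
    "significant" change, in the same units as the inputs.
    """
    if len(adc_before) != len(adc_after):
        raise ValueError("maps must be co-registered and equally sized")
    labels = []
    for before, after in zip(adc_before, adc_after):
        delta = after - before
        if delta > threshold:
            labels.append("increased")
        elif delta < -threshold:
            labels.append("decreased")
        else:
            labels.append("unchanged")
    return labels


def serial_maps(scans, threshold):
    """Apply the pairwise comparison to each consecutive pair of time points."""
    return [classify_voxels(a, b, threshold) for a, b in zip(scans, scans[1:])]
```

Comparing consecutive scans is only one possible design; a serial analysis could equally compare every time point against the baseline scan.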
Chromatin immunoprecipitation sequencing (ChIP-seq) is a powerful method for analyzing protein interactions with DNA. It can be applied to identify the binding sites of transcription factors (TFs) and the genomic landscape of histone modification marks (HMs). Previous research has largely focused on developing peak-calling procedures to detect the binding sites for TFs. However, these procedures may fail when applied to ChIP-seq data of HMs, which have diffuse signals and multiple local peaks. In addition, it is important to identify genes with differential histone enrichment regions between two experimental conditions, such as different cellular states or different time points. Parametric methods based on Poisson/negative binomial distribution have been proposed to address this differential enrichment problem and most of these methods require biological replicates. However, many ChIP-seq data sets have few or even no replicates. We propose a nonparametric method to identify the genes with differential histone enrichment regions even without replicates. Our method is based on nonparametric hypothesis testing and kernel smoothing in order to capture the spatial differences in histone-enriched profiles. We demonstrate the method using ChIP-seq data on a comparative epigenomic profiling of adipogenesis of murine adipose stromal cells and the Encyclopedia of DNA Elements (ENCODE) ChIP-seq data. Our method identifies many genes with differential H3K27ac histone enrichment profiles at gene promoter regions between proliferating preadipocytes and mature adipocytes in murine 3T3-L1 cells. The test statistics also correlate well with the gene expression changes and are predictive of them, indicating that the identified differentially enriched regions are indeed biologically meaningful.
kernel smoothing; normalization; nonparametric testing; spatial histone profiles
Mutations in the mtDNA genome have long been suspected to play an important role in cancer. Although most cancer cells harbor mtDNA mutations, the question of whether such mutations are associated with clinical prognosis of lung cancer remains unclear. We resequenced the entire mitochondrial genomes of tumor tissue from a population of 250 Korean patients with non-small cell lung cancer (NSCLC). Our analysis revealed that the haplogroup (D/D4) was associated with worse overall survival (OS) of early-stage NSCLC [adjusted hazard ratio (AHR), 1.95; 95% CI, 1.14–3.33; Ptrend = 0.03]. By comparing the mtDNA variations between NSCLC tissues and matched blood samples, we found that haplogroups M/N and/or D/D4 were hotspots for somatic mutations, suggesting a more complicated mechanism of mtDNA somatic mutations other than the commonly accepted mechanism of sequential accumulation of mtDNA mutations.
mitochondria genome; mitochondria mutations; lung cancer survival; haplogroup; mitochondrial genome resequencing
Somatic alterations in DNA copy number have been well studied in numerous malignancies, yet the role of germline DNA copy number variation in cancer is still emerging. Genotyping microarrays generate allele-specific signal intensities to determine genotype, but may also be used to infer DNA copy number using additional computational approaches. Numerous tools have been developed to analyze Illumina genotype microarray data for copy number variant (CNV) discovery, although commonly utilized algorithms freely available to the public employ approaches based upon the use of hidden Markov models (HMMs). QuantiSNP, PennCNV, and GenoCN utilize HMMs with six copy number states but vary in how transition and emission probabilities are calculated. Performance of these CNV detection algorithms has been shown to be variable between both genotyping platforms and data sets, although HMM approaches generally outperform other current methods. Low sensitivity is prevalent with HMM-based algorithms, suggesting the need for continued improvement in CNV detection methodologies.
copy number variation; genotyping microarray; hidden Markov model
Modeling signal transduction in cancer cells has implications for targeting new therapies and inferring the mechanisms that improve or threaten a patient’s treatment response. For transcriptome-wide studies, it has been proposed that simple correlation between a ligand and receptor pair implies a relationship to the disease process. Statistically, a differential correlation (DC) analysis across groups stratified by prognosis can link the pair to clinical outcomes. While the prognostic effect and the apparent change in correlation are both biological consequences of activation of the signaling mechanism, a correlation-driven analysis does not clearly capture this assumption and makes inefficient use of continuous survival phenotypes. To augment the correlation hypothesis, we propose a regression framework which assumes that a patient-specific, latent level of signaling activation exists and generates both prognosis and correlation. Such systems can be inferred from data via interaction terms in survival regression models, which allows signal transduction models beyond one pair at a time and adjustment for other factors. We illustrate the use of this model on ovarian cancer data from The Cancer Genome Atlas (TCGA) and discuss how the finding may be used to develop markers to guide targeted molecular therapies.
differential correlation; gene expression; ovarian cancer; signal transduction; survival analysis
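One way to write the interaction-term idea from the abstract above is as a proportional hazards model for a single ligand and receptor pair. This is an illustrative sketch, not the authors' stated model: the choice of a Cox-type hazard and every symbol below are assumptions introduced for the example.

```latex
% Assumed notation: x_{iL}, x_{iR} are ligand and receptor expression for
% patient i, z_i holds adjustment covariates, h_0(t) is a baseline hazard.
h_i(t) = h_0(t)\,\exp\!\left(
    \beta_L x_{iL} + \beta_R x_{iR}
    + \beta_{LR}\, x_{iL} x_{iR}
    + \gamma^{\top} z_i
\right)
```

Under this reading, a nonzero interaction coefficient $\beta_{LR}$ would indicate that the joint level of the pair, rather than either gene alone, is associated with survival.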
Variable selection methods play an important role in high-dimensional statistical modeling and analysis. Computational cost and estimation accuracy are the two main concerns for statistical inference from ultrahigh-dimensional data. In particular, genome-wide association studies (GWAS), which focus on identifying single nucleotide polymorphisms (SNPs) associated with a disease of interest, have produced ultrahigh-dimensional data. Numerous methods have been proposed to handle GWAS data. Most statistical methods have adopted a two-stage approach: pre-screening for dimensional reduction and variable selection to identify causal SNPs. The pre-screening step selects SNPs in terms of their P-values or the absolute values of the regression coefficients in single SNP analysis. Penalized regressions, such as the ridge, lasso, adaptive lasso, and elastic-net regressions, are commonly used for the variable selection step. In this paper, we investigate which combination of pre-screening method and penalized regression performs best on a quantitative phenotype using two real GWAS datasets.
genome-wide association study; the Korea Association Resource (KARE); the Age-Related Eye Disease Study (AREDS); penalized regression; variable selection
The classic multinomial logit model, commonly used in multiclass regression problems, is restricted to few predictors and does not take into account the relationships among variables. It has limited use for genomic data, where the number of genomic features far exceeds the sample size. Genomic features such as gene expressions are usually related by an underlying biological network. Efficient use of the network information is important to improve classification performance as well as the biological interpretability. We proposed a multinomial logit model that is capable of addressing both the high dimensionality of predictors and the underlying network information. Group lasso was used to induce model sparsity, and a network-constraint was imposed to induce the smoothness of the coefficients with respect to the underlying network structure. To deal with the non-smoothness of the objective function in optimization, we developed a proximal gradient algorithm for efficient computation. The proposed model was compared to models with no prior structure information in both simulations and a problem of cancer subtype prediction with real gene expression data from The Cancer Genome Atlas (TCGA). The network-constrained model outperformed the traditional ones in both cases.
cancer subtype prediction; multinomial logit model; group lasso; network-constraint; proximal gradient algorithm
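The non-smooth group lasso term in the abstract above is typically handled in a proximal gradient scheme by block soft-thresholding. The sketch below shows that standard operator and a single update step. It is not the authors' implementation: the function names, the list-of-groups data layout, and the assumption that the gradient of the smooth terms (log-likelihood plus network penalty) is computed elsewhere are all illustrative.

```python
import math


def group_soft_threshold(group, step_penalty):
    """Proximal operator of step_penalty * ||group||_2 (block soft-thresholding)."""
    norm = math.sqrt(sum(v * v for v in group))
    if norm <= step_penalty:
        return [0.0] * len(group)  # the whole group is set to zero
    scale = 1.0 - step_penalty / norm
    return [scale * v for v in group]


def proximal_gradient_step(groups, smooth_grads, step, lam):
    """One iteration: gradient step on the smooth part, then the group prox.

    `groups` and `smooth_grads` are lists of equally shaped coefficient
    groups; `smooth_grads` is assumed to hold the gradient of the smooth
    terms, supplied by the caller.
    """
    updated = []
    for coefs, grads in zip(groups, smooth_grads):
        moved = [c - step * g for c, g in zip(coefs, grads)]
        updated.append(group_soft_threshold(moved, step * lam))
    return updated
```

In a multinomial setting, a natural grouping (assumed here) is the coefficients of one feature across all classes, so that a feature is either kept or dropped for every class at once.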
Pancreatic cancer is the fourth leading cause of cancer-related deaths. Therefore, in order to improve survival rates, the development of biomarkers for early diagnosis is crucial. Recently, diabetes has been associated with an increased risk of pancreatic cancer. The aims of this study were to search for novel serum biomarkers that could be used for early diagnosis of pancreatic cancer and to identify whether diabetes was a risk factor for this disease.
Blood samples were collected from 25 patients with diabetes (control) and 93 patients with pancreatic cancer (including 53 patients with diabetes), and analyzed using matrix-assisted laser desorption/ionization-time of flight mass spectrometry (MALDI-TOF/MS). We performed preprocessing, and various classification methods with imputation were used to replace the missing values. To validate the selection of biomarkers identified in pancreatic cancer patients, we measured biomarker intensity in pancreatic cancer patients with diabetes following surgical resection and compared our results with those from control (diabetes-only) patients.
By using various classification methods, we identified the commonly splitting protein peaks as m/z 1,465, 1,206, and 1,020. In the follow-up study, in which we assessed biomarkers in pancreatic cancer patients with diabetes after surgical resection, we found that the intensities of m/z at 1,465, 1,206, and 1,020 became comparable with those of diabetes-only patients.
biomarker; classification; mass spectrometry; pancreatic cancer
Numerous statistical methods have been published for designing and analyzing microarray projects. Traditional genome-wide microarray platforms (such as Affymetrix, Illumina, and DASL) measure the expression level of tens of thousands of genes. Since the sets of genes included in these array chips are selected by the manufacturers, the number of genes associated with a specific disease outcome is limited and a large portion of the genes are not associated. nCounter is a new technology by NanoString to measure the expression of a selected number (up to 800) of genes. The list of genes for nCounter chips can be selected by customers. Because the number of genes is limited and the price increases with the number of selected genes, the genes for nCounter chips are carefully selected among those discovered from previous studies, usually using traditional high-throughput platforms, and only a small number of definitely unassociated genes, called control genes, are included to standardize the overall expression level across different chips. Furthermore, nCounter chips measure the expression level of each gene as a count, while the traditional high-throughput platforms produce continuous observations. Due to these differences, some statistical methods developed for the design and analysis of high-throughput projects may need modification or may be inappropriate for nCounter projects. In this paper, we discuss statistical methods that can be used for designing and analyzing nCounter projects.
censoring; false discovery rate; gradient lasso; permutation; proportional hazards model
High-throughput genomic assays are performed using tissue samples with the goal of classifying the samples as normal < pre-malignant < malignant or by stage of cancer using a small set of molecular features. In such cases, molecular features monotonically associated with the ordinal response may be important to disease development; that is, an increase in the phenotypic level (stage of cancer) may be mechanistically linked through a monotonic association with gene expression or methylation levels. Though traditional ordinal response modeling methods exist, they assume independence among the predictor variables and require the number of samples (n) to exceed the number of covariates (P) included in the model. In this paper, we describe our ordinalgmifs R package, available from the Comprehensive R Archive Network, which can fit a variety of ordinal response models when the number of predictors (P) exceeds the sample size (n). R code illustrating usage is also provided.
ordinal response; high-dimensional features; penalized models; R
We propose a class of hierarchical models to investigate the protein functional network of cellular markers. We consider a novel data set from single-cell proteomics. The data are generated from single-cell mass cytometry experiments, in which protein expression is measured within an individual cell for multiple markers. Tens of thousands of cells are measured, serving as biological replicates. Applying the Bayesian models, we report protein functional networks under different experimental conditions and the differences between the networks, i.e., differential networks. We also present the differential network in a novel fashion that allows direct observation of the links between the experimental agent and its putative targeted proteins based on posterior inference. Our method serves as a powerful tool for studying molecular interactions at the cellular level.
Bayes; cytometry; graphical model; Markov chain Monte Carlo; network; proteomics; single cell
In genome-wide association studies (GWAS), regression analysis has been most commonly used to establish an association between a phenotype and genetic variants, such as single nucleotide polymorphisms (SNPs). However, most applications of regression analysis have been restricted to the investigation of a single marker because of the large computational burden. Thus, there have been limited applications of regression analysis to multiple SNPs, including gene–gene interaction (GGI), in large-scale GWAS data. In order to overcome this limitation, we propose CARAT-GxG, a GPU computing system-oriented toolkit for performing regression analysis with GGI using CUDA (compute unified device architecture). Compared to other methods, CARAT-GxG achieved an almost 700-fold increase in execution speed and delivered highly reliable results through our GPU-specific optimization techniques. In addition, it was possible to achieve almost-linear speed acceleration with the application of a GPU computing system, which is implemented by the TORQUE Resource Manager. We expect that CARAT-GxG will enable large-scale regression analysis with GGI for GWAS data.
GWAS; gene–gene interaction; logistic regression; GPU; graphics processing unit
Cancer biomarker discovery can facilitate drug development, improve staging of patients, and predict patient prognosis. Because cancer is the result of many interacting genes, analysis based on a set of genes with related biological functions or pathways may be more informative than single gene-based analysis for cancer biomarker discovery. The relevant pathways thus identified may help characterize different aspects of molecular phenotypes related to the tumor. Although it is well known that cancer patients may respond to the same treatment differently because of clinical variables and variation of molecular phenotypes, this patient heterogeneity has not been explicitly considered in pathway analysis in the literature. We hypothesize that combining pathway and patient clinical information can more effectively identify relevant pathways pertinent to specific patient subgroups, leading to better diagnosis and treatment. In this article, we propose to perform stratified pathway analysis based on clinical information from patients. In contrast to analysis using all the patients, this more focused analysis has the potential to reveal subgroup-specific pathways that may lead to more biological insights into disease etiology and treatment response. As an illustration, the power of our approach is demonstrated through its application to a breast cancer dataset in which the patients are stratified according to their oral contraceptive use.
cancer; random forests; pathways; progesterone receptor
Human tumor xenograft studies are the primary means to evaluate the biological activity of anticancer agents in late-stage preclinical drug discovery. The variability in the growth rate of human tumors established in mice and the small sample sizes make rigorous statistical analysis critical. The most commonly used summary of antitumor activity for these studies is the T/C ratio. However, alternative methods based on growth rate modeling can be used. Here, we describe a summary metric called the rate-based T/C, derived by fitting each animal’s tumor growth to a simple exponential model. The rate-based T/C uses all of the data, in contrast with the traditional T/C, which only uses a single measurement. We compare the rate-based T/C with the traditional T/C and assess their performance through a bootstrap analysis of 219 tumor xenograft studies. We find that the rate-based T/C requires fewer animals to achieve the same power as the traditional T/C. We also compare 14-day studies with 21-day studies and find that 14-day studies are more cost efficient. Finally, we perform a power analysis to determine an appropriate sample size.
xenograft; design; T/C
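The per-animal exponential fit behind the rate-based T/C described in the abstract above can be sketched as a least-squares line through log tumor volume. The abstract does not give the exact formula, so the final summary here (the ratio of fitted treated to control volumes after an assumed `horizon_days`) is an illustrative assumption, as are the function names and the data layout.

```python
import math


def growth_rate(days, volumes):
    """Least-squares slope of log(volume) versus day for one animal."""
    logs = [math.log(v) for v in volumes]
    n = len(days)
    mean_day = sum(days) / n
    mean_log = sum(logs) / n
    sxx = sum((d - mean_day) ** 2 for d in days)
    sxy = sum((d - mean_day) * (y - mean_log) for d, y in zip(days, logs))
    return sxy / sxx


def rate_based_ratio(treated, control, horizon_days):
    """Assumed summary: ratio of fitted volumes after `horizon_days`.

    `treated` and `control` are lists of (days, volumes) pairs, one per animal.
    """
    mean_t = sum(growth_rate(d, v) for d, v in treated) / len(treated)
    mean_c = sum(growth_rate(d, v) for d, v in control) / len(control)
    return math.exp((mean_t - mean_c) * horizon_days)
```

Because the slope is estimated from every measurement of an animal, this kind of summary uses the full growth curve rather than a single time point, which is the contrast the abstract draws with the traditional T/C.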
Historically, breast cancer classification has relied on prognostic subtypes. Thus, unlike hematopoietic cancers, breast tumor classification lacks phylogenetic rationale. The feasibility of phylogenetic classification of breast tumors has recently been demonstrated based on estrogen receptor (ER), androgen receptor (AR), vitamin D receptor (VDR) and Keratin 5 expression. Four hormonal states (HR0–3) comprising 11 cellular subtypes of breast cells have been proposed. This classification scheme has been shown to have relevance to clinical prognosis. We examine the implications of such phylogenetic classification on DNA methylation of both breast tumors and normal breast tissues by applying recently developed deconvolution algorithms to three DNA methylation data sets archived on Gene Expression Omnibus. We propose that breast tumors arising from a particular cell-of-origin essentially magnify the epigenetic state of their original cell type. We demonstrate that DNA methylation of tumors manifests patterns consistent with cell-specific epigenetic states, that these states correspond roughly to previously posited normal breast cell types, and that estimates of proportions of the underlying cell types are predictive of tumor phenotypes. Taken together, these findings suggest that the epigenetics of breast tumors is ultimately based on the underlying phylogeny of normal breast tissue.
androgen receptor; cell composition; deconvolution; DNA methylation; estrogen receptor; EWAS
Support vector machines (SVMs) are widely employed in molecular diagnosis of disease for their efficiency and robustness. However, there is no previous research to analyze their overfitting in high-dimensional omics data based disease diagnosis, which is essential to avoid deceptive diagnostic results and enhance clinical decision making. In this work, we comprehensively investigate this problem from both theoretical and practical standpoints to unveil the special characteristics of SVM overfitting. We found that disease diagnosis under an SVM classifier would inevitably encounter overfitting under a Gaussian kernel because of the large data variations generated from high-throughput profiling technologies. Furthermore, we propose a novel sparse-coding kernel approach to overcome SVM overfitting in disease diagnosis. Unlike traditional ad-hoc parametric tuning approaches, it not only robustly conquers the overfitting problem, but also achieves good diagnostic accuracy. To our knowledge, it is the first rigorous method proposed to overcome SVM overfitting. Finally, we propose a novel biomarker discovery algorithm: Gene-Switch-Marker (GSM) to capture meaningful biomarkers by taking advantage of SVM overfitting on single genes.
SVM; overfitting; biomarker discovery
Multiscale models are commonplace in cancer modeling, where individual models acting on different biological scales are combined within a single, cohesive modeling framework. However, model composition gives rise to challenges in understanding interfaces and interactions between them. Based on specific domain expertise, typically these computational models are developed by separate research groups using different methodologies, programming languages, and parameters. This paper introduces a graph-based model for semantically linking computational cancer models via domain graphs that can help us better understand and explore combinations of models spanning multiple biological scales. We take the data model encoded by TumorML, an XML-based markup language for storing cancer models in online repositories, and transpose its model description elements into a graph-based representation. By taking such an approach, we can link domain models, such as controlled vocabularies, taxonomic schemes, and ontologies, with cancer model descriptions to better understand and explore relationships between models. The union of these graphs creates a connected property graph that links cancer models by categorizations, by computational compatibility, and by semantic interoperability, yielding a framework in which opportunities for exploration and discovery of combinations of models become possible.
tumor modeling; in silico oncology; model exploration; property graphs; neo4j
Antibody–drug conjugates (ADCs) constitute a category of anticancer targeted therapy that has gathered great interest during the last few years because of their potential to kill cancer cells while causing significantly fewer side effects than traditional chemotherapy. In this paper, a process of computational construction of ADCs is described, using the surface lysines of an antibody and a non-covalent linker molecule, as well as a cytotoxic substance, as files in Protein Data Bank format. Also, aspects related to the function, properties, and development of ADCs are discussed.
cancer; targeted therapy; antibody–drug conjugate
It is well established that the development of a disease, especially cancer, is a complex process that results from the joint effects of multiple genes involved in various molecular signaling pathways. In this article, we propose methods to discover genes and molecular pathways significantly associated with clinical outcomes in cancer samples. We exploit the natural hierarchical structure of genes related to a given pathway as a group of interacting genes to conduct selection of both pathways and genes. We posit the problem in a hierarchical structured variable selection (HSVS) framework to analyze the corresponding gene expression data. HSVS methods conduct simultaneous variable selection at the pathway (group level) and the gene (within-group) level. To adapt to the overlapping group structure present in the pathway–gene hierarchy of the data, we developed an overlap-HSVS method that introduces latent partial effect variables that partition the marginal effect of the covariates and corresponding weights for a proportional shrinkage of the partial effects. Combining gene expression data with prior pathway information from the KEGG databases, we identified several gene–pathway combinations that are significantly associated with clinical outcomes of multiple myeloma. Biological discoveries support this relationship for the pathways and the corresponding genes we identified.
Bayesian variable selection; hierarchical variable selection; multiple myeloma; overlapping group
High-throughput transcriptome sequencing allows identification of cancer-related changes that occur at the stages of transcription, pre-messenger RNA (mRNA), and splicing. In the current study, we devised a pipeline to predict novel alternative splicing (AS) variants from high-throughput transcriptome sequencing data and applied it to large sets of tumor transcriptomes from The Cancer Genome Atlas (TCGA). We identified two novel tumor-associated splice variants of matriptase, a known cancer-associated gene, in the transcriptome data from epithelial-derived tumors but not normal tissue. Most notably, these variants were found in 69% of lung squamous cell carcinoma (LUSC) samples studied. We confirmed the expression of matriptase AS transcripts using quantitative reverse transcription PCR (qRT-PCR) in an orthogonal panel of tumor tissues and cell lines. Furthermore, flow cytometric analysis confirmed surface expression of matriptase splice variants in Chinese hamster ovary (CHO) cells transiently transfected with cDNA encoding the novel transcripts. Our findings further implicate matriptase in contributing to oncogenic processes and suggest potential novel therapeutic uses for matriptase splice variants.
matriptase; alternative splicing; epithelial tumors; RNA sequencing; de novo assembly
Cancer is responsible for approximately 7.6 million deaths per year worldwide. A 2012 survey in the United Kingdom found dramatic improvement in survival rates for childhood cancer because of increased participation in clinical trials. Unfortunately, overall patient participation in cancer clinical studies is low. A key logistical barrier to patient and physician participation is the time required for identification of appropriate clinical trials for individual patients. We introduce the Trial Prospector tool, which supports end-to-end management of the cancer clinical trial recruitment workflow with (a) structured entry of trial eligibility criteria, (b) automated extraction of patient data from multiple sources, (c) a scalable matching algorithm, and (d) an interactive user interface (UI) for physicians with both matching results and a detailed explanation of causes for ineligibility of available trials. We report the results from deployment of Trial Prospector at the National Cancer Institute (NCI)-designated Case Comprehensive Cancer Center (Case CCC) with 1,367 clinical trial eligibility evaluations performed with 100% accuracy.
clinical trial; gastrointestinal cancer; clinical oncology; patient recruitment; clinical decision support system
Subnetwork detection is often used with differential expression analysis to identify modules or pathways associated with a disease or condition. Many computational methods are available for subnetwork analysis. Here, we compare the results of eight methods: simulated annealing–based jActiveModules, greedy search–based jActiveModules, DEGAS, BioNet, NetBox, ClustEx, OptDis, and NetWalker. These methods represent distinctly different computational strategies and are among the most widely used. Each of these methods was used to analyze gene expression data consisting of paired tumor and normal samples from 50 breast cancer patients. While the number of genes/proteins and protein interactions detected by the eight methods vary widely, a core set of 60 genes and 50 interactions was found to be shared by the subnetworks identified by five or more of the methods. Within the core set, 12 genes were found to be known breast cancer genes.
subnetwork detection; pathway analysis; breast cancer; TCGA; network biology
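The "shared by five or more of the methods" criterion in the abstract above amounts to a simple vote count across the subnetworks. A minimal sketch follows; the function name `core_set` and the dictionary layout (method name mapped to a set of genes or interactions) are assumptions made for the example, not part of any of the compared tools.

```python
from collections import Counter


def core_set(subnetworks, min_methods):
    """Return items reported by at least `min_methods` of the methods.

    `subnetworks` maps a method name to the set of genes (or interactions)
    in the subnetwork that method identified.
    """
    counts = Counter()
    for members in subnetworks.values():
        counts.update(set(members))  # each method votes once per item
    return {item for item, n in counts.items() if n >= min_methods}
```

With the eight methods as keys, calling `core_set(subnetworks, 5)` would return the kind of consensus set the abstract describes; the same function works for interactions if each is encoded as a hashable pair.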
We aim to develop a streamlined genome sequence compression algorithm to support alternative miniaturized sequencing devices, which have limited communication, storage, and computation power. Existing techniques that require a heavy client (encoder side) cannot be applied. To tackle this challenge, we carefully examined distributed source coding theory and developed a customized reference-based genome compression protocol to meet the low-complexity need at the client side. Based on the variation between source and reference, our protocol adaptively picks either syndrome coding or hash coding to compress subsequences of changing code length. Our experimental results showed promising performance of the proposed method when compared with the state-of-the-art algorithm (GRS).
genome compression; distributed source coding; graphical model
In order to identify somatic focal copy number aberrations (CNAs) in cancer specimens and to distinguish them from germ-line copy number variations (CNVs), we developed the software package FocalCall. FocalCall enables user-defined size cutoffs to recognize focal aberrations and builds on established array comparative genomic hybridization segmentation and calling algorithms. To distinguish CNAs from CNVs, the algorithm uses matched patient normal signals as references or, if this is not available, a list with known CNVs in a population. Furthermore, FocalCall differentiates between homozygous and heterozygous deletions as well as between gains and amplifications and is applicable to high-resolution array and sequencing data.
R-package; focal CNAs; DNA copy number; sequencing; aCGH
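The two filters described in the FocalCall abstract above, a user-defined size cutoff and a comparison against known germ-line CNVs, can be sketched in a few lines. This is not FocalCall's code or API (the package itself is written in R): the function names, the tuple layout, and the rule that any overlap with a known CNV excludes a segment are all assumptions made for illustration.

```python
def overlaps(seg, region):
    """True if two (chrom, start, end) intervals share at least one base."""
    return seg[0] == region[0] and seg[1] <= region[2] and region[1] <= seg[2]


def focal_somatic(segments, max_size, known_cnvs):
    """Keep aberrant segments that are focal and not explained by a known CNV.

    `segments` are (chrom, start, end) tuples already called as aberrant;
    `max_size` is the user-defined focal size cutoff in base pairs.
    """
    kept = []
    for seg in segments:
        size = seg[2] - seg[1] + 1
        if size > max_size:
            continue  # too large to count as focal
        if any(overlaps(seg, cnv) for cnv in known_cnvs):
            continue  # likely germ-line variation, not a somatic CNA
        kept.append(seg)
    return kept
```

When a matched normal sample is available, the list of known CNVs would be replaced by the aberrations called in that normal sample, in line with the abstract's description.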