We describe a supervised prediction method for diagnosis of acute myeloid leukemia (AML) from patient samples based on flow cytometry measurements. We use a data-driven approach with machine learning methods to train a computational model that takes in flow cytometry measurements from a single patient and gives a confidence score of the patient being AML-positive. Our solution is based on a regularized logistic regression model that aggregates AML test statistics calculated from individual test tubes with different cell populations and fluorescent markers. The model construction is entirely data-driven and no prior biological knowledge is used. The described solution scored a 100% classification accuracy in the DREAM6/FlowCAP2 Molecular Classification of Acute Myeloid Leukaemia Challenge against a gold standard consisting of 20 AML-positive and 160 healthy patients. Here we perform a more extensive validation of the prediction model performance and further improve and simplify our original method, showing that statistically equal results can be obtained by using simple average marker intensities as features in the logistic regression model. In addition to the logistic regression based model, we also present other classification models and compare their performance quantitatively. The key benefit of our prediction method compared to other solutions with similar performance is that our model uses only a small fraction of the flow cytometry measurements, making our solution highly economical.
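The simplified variant described above (average marker intensities fed into a regularized logistic regression) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the data are synthetic, the two "markers" are hypothetical, and the L2 penalty, learning rate, and plain gradient descent are generic choices rather than the settings used in the challenge submission.

```python
import math
import random

def mean_intensities(events):
    """Average each marker (column) over all cells (rows) of one sample."""
    n = len(events)
    return [sum(col) / n for col in zip(*events)]

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic_l2(X, y, lam=0.1, lr=0.1, epochs=2000):
    """Batch gradient descent on the L2-penalised logistic loss."""
    d, n = len(X[0]), len(X)
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                gw[j] += err * xi[j]
            gb += err
        # The penalty lam * w shrinks the weights; the bias is left unpenalised.
        w = [wj - lr * (gw[j] / n + lam * wj) for j, wj in enumerate(w)]
        b -= lr * gb / n
    return w, b

def aml_confidence(w, b, features):
    """Confidence score in (0, 1) that a sample is AML-positive."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, features)) + b)

# Synthetic data (NOT challenge data): marker 0 is shifted in "AML" samples.
rng = random.Random(0)

def fake_sample(shift, n_cells=50):
    return [[rng.gauss(shift, 1.0), rng.gauss(0.0, 1.0)] for _ in range(n_cells)]

X = [mean_intensities(fake_sample(2.0)) for _ in range(10)] + \
    [mean_intensities(fake_sample(0.0)) for _ in range(10)]
y = [1] * 10 + [0] * 10
w, b = fit_logistic_l2(X, y)

aml_score = aml_confidence(w, b, mean_intensities(fake_sample(2.0)))
healthy_score = aml_confidence(w, b, mean_intensities(fake_sample(0.0)))
```

Reducing each sample to one mean per marker keeps the feature vector short, which is in the spirit of the economy argument made in the abstract.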
Cancer is a broad group of genetic diseases which account for millions of deaths worldwide each year. Cancers are classified by various clinical, pathological and molecular methods, but even within a well-characterized disease, there is a significant inter-patient variability in survival, response to treatment, and other parameters. Especially at the molecular level, tumours of the same category can appear significantly dissimilar due to complex combinations of genetic aberrations leading to a similar malignancy. We extended the current classification methods by studying tumour heterogeneity at the pathway level.
We computed the rate of alterations in 1994 pathways and 2210 tumours representing eight different cancers. Using gene set enrichment analysis, a pathway aberration profile reflecting the molecular state was computed for each sample. The profiles were analysed together to infer the characteristic aberration rates for each pathway within each cancer. Subgroups of tumours defined by similar pathway aberrations were identified using clustering analyses. The pathway aberration and gene expression profiles of the subgroups were subsequently compared across all eight cancer types to search for similar tumours crossing the standard classification.
We identified pathways and processes that were common to all cancers as well as traits that are unique to a cancer type or closely related cancers. Studying the gene expression patterns within the pathway context suggested potential alteration mechanisms. Clustering analysis revealed five clinically relevant subgroups of tumours in four cancers that exhibited significant differences in survival compared to others. The cross-cancer analysis of the subgroups resulted in the identification of tumours that shared potentially significant alterations.
This study represents the first effort to extend the molecular characterizations towards pathway level descriptions across the family of cancers. In addition to providing a proof-of-concept for single sample pathway aberration analysis in this context, we present a comprehensive pathway aberration dataset that can be used to study pathway aberration patterns within or across cancers. Significant similarities between subgroups of different cancers on pathway and gene expression levels provide interesting hypotheses for understanding variable drug response, or transferring treatments across diseases by identifying common druggable pathways or genes, for example.
Aging and gender have a strong influence on the functional capacity of the immune system. In general, the immune response in females is stronger than that in males, but there is scant information about the effect of aging on the gender difference in the immune response. To address this question, we performed a transcriptomic analysis of peripheral blood mononuclear cells derived from elderly individuals (nonagenarians, n = 146) and young controls (aged 19–30 years, n = 30). When compared to young controls, we found 339 and 248 genes that were differentially expressed (p<0.05, fold change >1.5 or <−1.5) in nonagenarian females and males, respectively; 180 of these genes were changed in both genders. An analysis of the affected signaling pathways revealed a clear gender bias: 48 pathways were significantly changed in females, while only 29 were changed in males, and 24 pathways were shared between both genders. Our results indicate that female nonagenarians have weaker T cell defenses and a more prominent pro-inflammatory response as compared to males. In males significantly fewer pathways were affected, two of which are known to be regulated by estrogen. These data show that the effects of aging on the human immune system are significantly different in males and females.
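The differential expression criterion quoted above (p<0.05, fold change >1.5 or <−1.5) can be written out as a small filter. This is a sketch under an assumption the abstract does not state: that the fold change is signed in the common way, where a ratio below 1 is reported as the negative reciprocal. It also assumes positive, linear-scale expression means. The function names are hypothetical.

```python
def signed_fold_change(case_mean, control_mean):
    """Ratio if up-regulated, negative reciprocal ratio if down-regulated.

    Assumes positive, linear-scale expression values.
    """
    ratio = case_mean / control_mean
    return ratio if ratio >= 1.0 else -(control_mean / case_mean)

def is_differentially_expressed(case_mean, control_mean, p_value,
                                fc_cutoff=1.5, alpha=0.05):
    """Apply the p-value and signed fold-change thresholds together."""
    fc = signed_fold_change(case_mean, control_mean)
    return p_value < alpha and (fc > fc_cutoff or fc < -fc_cutoff)
```

With this convention a gene expressed at half the control level has a fold change of −2 and passes the cutoff, which is why a negative threshold appears alongside the positive one.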
To facilitate analysis and understanding of biological systems, large-scale data are often integrated into models using a variety of mathematical and computational approaches. Such models describe the dynamics of the biological system and can be used to study the changes in the state of the system over time. For many model classes, such as discrete or continuous dynamical systems, there exist appropriate frameworks and tools for analyzing system dynamics. However, the heterogeneous information that encodes and bridges molecular and cellular dynamics, inherent to fine-grained molecular simulation models, presents significant challenges to the study of system dynamics. In this paper, we present an algorithmic information theory based approach for the analysis and interpretation of the dynamics of such executable models of biological systems. We apply a normalized compression distance (NCD) analysis to the state representations of a model that simulates the immune decision making and immune cell behavior. We show that this analysis successfully captures the essential information in the dynamics of the system, which results from a variety of events including proliferation, differentiation, or perturbations such as gene knock-outs. We demonstrate that this approach can be used for the analysis of executable models, regardless of the modeling framework, and for making experimentally quantifiable predictions.
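The normalized compression distance mentioned above has a standard definition, NCD(x, y) = (C(xy) − min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. The sketch below uses zlib as the compressor and byte strings as state representations; both are assumptions for illustration, since the abstract does not say which compressor or state encoding was used.

```python
import random
import zlib

def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed representation (approximates C(x))."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

# Two hypothetical serialized model states: one highly regular, one random.
regular_state = b"ABCD" * 200
_rng = random.Random(0)
random_state = bytes(_rng.randrange(256) for _ in range(800))

d_same = ncd(regular_state, regular_state)
d_diff = ncd(regular_state, random_state)
```

Values near 0 indicate that one state is almost fully described by the other, and values near 1 indicate little shared information. Real compressors only approximate the ideal, so NCD(x, x) is small but usually not exactly 0.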
Boolean networks have been used as a discrete model for several biological systems, including metabolic and genetic regulatory networks. Due to their simplicity they offer a firm foundation for generic studies of physical systems. In this work we show, using a measure of context-dependent information, set complexity, that prior to reaching an attractor, random Boolean networks pass through a transient state characterized by high complexity. We justify this finding with the use of another measure of complexity, namely, the statistical complexity. We show that the networks can be tuned to the regime of maximal complexity by adding a suitable amount of noise to the deterministic Boolean dynamics. In fact, we show that for networks with Poisson degree distributions, all networks ranging from subcritical to slightly supercritical can be tuned with noise to reach maximal set complexity in their dynamics. For networks with a fixed number of inputs this is true for near-to-critical networks. This increase in complexity is obtained at the expense of disruption in information flow. For a large ensemble of networks showing maximal complexity, there exists a balance between noise and contracting dynamics in the state space. In networks that are close to critical the intrinsic noise required for the tuning is smaller and thus also has the smallest effect in terms of the information processing in the system. Our results suggest that the maximization of complexity near to the state transition might be a more general phenomenon in physical systems, and that noise present in a system may in fact be useful in retaining the system in a state with high information content.
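A random Boolean network with added noise, as discussed above, can be simulated in a few lines. The sketch below assumes a fixed number of inputs per node, synchronous updates, and noise applied as an independent bit flip per node per step. These are common conventions, not details confirmed by the abstract, and the sketch does not compute set complexity or statistical complexity.

```python
import random

def make_rbn(n, k, rng):
    """Random wiring (k inputs per node) and random truth tables."""
    inputs = [rng.sample(range(n), k) for _ in range(n)]
    tables = [[rng.randint(0, 1) for _ in range(2 ** k)] for _ in range(n)]
    return inputs, tables

def step(state, inputs, tables, noise, rng):
    """One synchronous update; each output bit is flipped with probability `noise`."""
    new_state = []
    for ins, table in zip(inputs, tables):
        index = 0
        for i in ins:
            index = (index << 1) | state[i]
        bit = table[index]
        if rng.random() < noise:
            bit ^= 1
        new_state.append(bit)
    return tuple(new_state)

def trajectory(n, k, noise, steps, seed=0):
    """Initial state followed by `steps` successive network states."""
    rng = random.Random(seed)
    inputs, tables = make_rbn(n, k, rng)
    state = tuple(rng.randint(0, 1) for _ in range(n))
    states = [state]
    for _ in range(steps):
        state = step(state, inputs, tables, noise, rng)
        states.append(state)
    return states

deterministic = trajectory(n=8, k=2, noise=0.0, steps=300)
noisy = trajectory(n=8, k=2, noise=0.05, steps=300)
```

With `noise=0.0` the dynamics are deterministic on a finite state space, so the trajectory must eventually revisit a state and settle onto an attractor. A nonzero noise level keeps perturbing the network away from it.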
Fusion genes are chromosomal aberrations that are found in many cancers and can be used as prognostic markers and drug targets in clinical practice. Fusions can lead to production of oncogenic fusion proteins or to enhanced expression of oncogenes. Several recent studies have reported that some fusion genes can escape microRNA regulation via 3′–untranslated region (3′-UTR) deletion. We performed whole transcriptome sequencing to identify fusion genes in glioma and discovered FGFR3-TACC3 fusions in 4 of 48 glioblastoma samples from patients both of mixed European and of Asian descent, but not in any of 43 low-grade glioma samples tested. The fusion, caused by tandem duplication on 4p16.3, led to the loss of the 3′-UTR of FGFR3, blocking gene regulation by miR-99a and enhancing expression of the fusion gene. The fusion gene was mutually exclusive with EGFR, PDGFR, or MET amplification. Using cultured glioblastoma cells and a mouse xenograft model, we found that fusion protein expression promoted cell proliferation and tumor progression, while WT FGFR3 protein was not tumorigenic, even under forced overexpression. These results demonstrated that the FGFR3-TACC3 gene fusion is expressed in human cancer and generates an oncogenic protein that promotes tumorigenesis in glioblastoma.
Malignant peripheral nerve sheath tumor (MPNST) is a rare sarcoma that lacks effective therapeutic strategies. We sought to gain insight into the most recurrent genetically altered pathways with the purpose of identifying possible therapeutic targets.
We performed microarray-based comparative genomic hybridization (aCGH) profiling of two cohorts of primary MPNST tissue samples including 25 patients treated at The University of Texas MD Anderson Cancer Center and 26 patients from Tianjin Cancer Hospital. Immunohistochemistry (IHC) and cell biology detection and validation were performed on human MPNST tissues and cell lines.
Genomic characterization of 51 MPNST tissue samples identified several frequently amplified regions harboring 2,599 genes and regions of deletion including 4,901 genes. At the pathway level, we identified a significant enrichment of copy number–altering events in the insulin-like growth factor 1 receptor (IGF1R) pathway, including frequent amplifications of the IGF1R gene itself. To validate the IGF1R pathway as a potential target in MPNSTs, we first confirmed that high IGF1R protein correlated with worse tumor-free survival in an independent set of samples using immunohistochemistry. Two MPNST cell lines (ST88-14 and STS26T) were used to determine the effect of attenuating IGF1R. Inhibition of IGF1R in ST88-14 cells using small interfering RNAs or an IGF1R inhibitor, MK-0646, led to significant decreases in cell proliferation, invasion, and migration accompanied by attenuation of the PI3K/AKT and MAPK pathways.
These integrated genomic and molecular studies provide evidence that the IGF1R pathway is a potential therapeutic target for patients with MPNST.
malignant peripheral nerve sheath tumor; insulin-like growth factor 1 receptor; genomic characterization; targeted therapy; microarray-based comparative genomic hybridization; gene amplification; MK-0646; epidermal growth factor receptor; Gefitinib
Gastrointestinal stromal tumors (GISTs) were historically grouped with leiomyosarcomas (LMSs) based on their morphological similarities, but recently they have been unequivocally established as a distinct type of sarcoma based on the molecular features and response to imatinib treatment. To gain further insight into the genomic differences between GISTs and LMSs, we mapped gene copy number aberrations (CNAs) in 42 GISTs and 30 LMSs and integrated them with gene expression profiles. Our studies revealed distinct patterns of CNAs between GISTs and LMSs. Losses in chromosomes 1p, 14q, 15q, and 22q were significantly more frequent in GISTs than in LMSs (P < 0.001), whereas losses in chromosomes 10 and 16 as well as gains in 1q, 14q, and 15q (P < 0.001) were more common in LMSs. By integrating CNAs with gene expression data and clinical information, we found several clinically relevant CNAs that were prognostic of survival in patients with GIST. Furthermore, GISTs were categorized into four groups according to an accumulating pattern of genetic alterations. Many key cellular pathways were differently expressed in the four groups and the patients had increasingly worse prognosis as the extent of genomic alterations increased. These findings lead us to propose a new tumor-progression genetic staging system termed Genomic Instability Stage (GIS) to complement the current prognostic predictive system based on tumor size, mitotic index (MI), and KIT mutation.
imatinib; CNA; GIST; leiomyosarcoma; array CGH; survival; KIT
Protein binding microarrays (PBM) are a high throughput technology used to characterize protein-DNA binding. The arrays measure a protein's affinity toward thousands of double-stranded DNA sequences at once, producing a comprehensive binding specificity catalog. We present a linear model for predicting the binding affinity of a protein toward DNA sequences based on PBM data. Our model represents the measured intensity of an individual probe as a sum of the binding affinity contributions of the probe's subsequences. These subsequences characterize a DNA binding motif and can be used to predict the intensity of protein binding against arbitrary DNA sequences. Our method was the best performer in the Dialogue for Reverse Engineering Assessments and Methods 5 (DREAM5) transcription factor/DNA motif recognition challenge. For the DREAM5 bonus challenge, we also developed an approach for the identification of transcription factors based on their PBM binding profiles. Our approach for TF identification achieved the best performance in the bonus challenge.
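The core idea above, that a probe's intensity is modelled as a sum of contributions from its subsequences, can be sketched as follows. The fixed subsequence length `k`, the `weights` dictionary, and the example values are hypothetical. The abstract does not specify which subsequences are used or how their contributions are fitted, so fitting is omitted here.

```python
def kmers(seq, k):
    """All overlapping length-k subsequences of a DNA sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def predict_intensity(probe, weights, k, baseline=0.0):
    """Predicted intensity: baseline plus the summed k-mer contributions.

    k-mers without a learned weight contribute nothing.
    """
    return baseline + sum(weights.get(s, 0.0) for s in kmers(probe, k))

# Hypothetical affinity contributions for two 3-mers.
example_weights = {"ACG": 1.0, "CGT": 0.5}
example_prediction = predict_intensity("ACGT", example_weights, k=3)
```

Because the model is linear in the subsequence weights, the weights could in principle be estimated from measured probe intensities by ordinary or regularized least squares, and the fitted table then scores arbitrary DNA sequences, not just array probes.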
Colorectal cancer (CRC) remains one of the major cancer types and causes of
cancer-related death worldwide. Sensitive, non-invasive biomarkers that can
facilitate disease detection, staging and prediction of therapeutic outcome
are highly desirable to improve survival rate and help to determine
optimized treatment for CRC. The small non-coding RNAs, microRNAs (miRNAs),
have recently been identified as critical regulators for various diseases
including cancer and may represent a novel class of cancer biomarkers. The
purpose of this study was to identify and validate circulating microRNAs in
human plasma for use as such biomarkers in colon cancer.
By using quantitative reverse transcription-polymerase chain reaction, we
found that circulating miR-141 was significantly associated with stage IV
colon cancer in a cohort of 102 plasma samples. Receiver operating
characteristic (ROC) analysis was used to evaluate the sensitivity and
specificity of candidate plasma microRNA markers. We observed that the
combination of miR-141 and carcinoembryonic antigen (CEA), a widely used
marker for CRC, further improved the accuracy of detection. These findings
were validated in an independent cohort of 156 plasma samples collected at
Tianjin, China. Furthermore, our analysis showed that high levels of plasma
miR-141 predicted poor survival in both cohorts and that miR-141 was an
independent prognostic factor for advanced colon cancer.
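The ROC analysis mentioned above summarises how well a marker separates two groups across all thresholds. The sketch below computes the area under the ROC curve through its rank interpretation (the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one), plus sensitivity and specificity at a single cutoff. The function names and the convention that higher scores indicate disease are assumptions, and no study data are used.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. Labels are 1 (positive) or 0 (negative).
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def sensitivity_specificity(scores, labels, threshold):
    """Call a sample positive when its score is at or above the threshold."""
    tp = sum(1 for s, l in zip(scores, labels) if l == 1 and s >= threshold)
    fn = sum(1 for s, l in zip(scores, labels) if l == 1 and s < threshold)
    tn = sum(1 for s, l in zip(scores, labels) if l == 0 and s < threshold)
    fp = sum(1 for s, l in zip(scores, labels) if l == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)
```

An AUC of 0.5 corresponds to a marker with no discriminative value and 1.0 to perfect separation. A combined marker can be evaluated the same way by passing a combined score per sample.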
We propose that plasma miR-141 may represent a novel biomarker that
complements CEA in detecting colon cancer with distant metastasis and that
high levels of miR-141 in plasma are associated with poor prognosis.
Neuronal networks exhibit a wide diversity of structures, which contributes to the diversity of the dynamics therein. The presented work applies an information theoretic framework to simultaneously analyze structure and dynamics in neuronal networks. Information diversity within the structure and dynamics of a neuronal network is studied using the normalized compression distance. To describe the structure, a scheme for generating distance-dependent networks with identical in-degree distribution but variable strength of dependence on distance is presented. The resulting network structure classes possess differing path length and clustering coefficient distributions. In parallel, comparable realistic neuronal networks are generated with the NETMORPH simulator and a similar analysis is performed on them. To describe the dynamics, network spike trains are simulated using different network structures and their bursting behaviors are analyzed. For the simulation of the network activity the Izhikevich model of spiking neurons is used together with the Tsodyks model of dynamical synapses. We show that the structure of the simulated neuronal networks affects the spontaneous bursting activity when measured with bursting frequency and a set of intraburst measures: the more locally connected networks produce more and longer bursts than the more random networks. The information diversity of the structure of a network is greatest in the most locally connected networks, smallest in random networks, and somewhere in between in the networks between order and disorder. As for the dynamics, the most locally connected networks and some of the in-between networks produce the most complex intraburst spike trains. The same result also holds for the sparser of the two considered network densities in the case of full spike trains.
information diversity; neuronal network; structure-dynamics relationship; complexity
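For readers unfamiliar with the Izhikevich model named above, a single-neuron sketch follows. It uses the published model equations, v' = 0.04v² + 5v + 140 − u + I and u' = a(bv − u), with a reset of v to c and u to u + d once v reaches 30 mV. The parameter values are the standard "regular spiking" set, while the forward Euler step, the constant input current, and the function name are assumptions for illustration. The network wiring and the Tsodyks synapse model used in the study are not included.

```python
def izhikevich_spike_times(current, t_max_ms=1000.0, dt=0.5,
                           a=0.02, b=0.2, c=-65.0, d=8.0):
    """Spike times (ms) of one Izhikevich neuron driven by a constant current."""
    v, u = c, b * c          # membrane potential and recovery variable
    spikes = []
    for i in range(int(t_max_ms / dt)):
        if v >= 30.0:        # spike detected: record it and reset
            spikes.append(i * dt)
            v, u = c, u + d
        dv = 0.04 * v * v + 5.0 * v + 140.0 - u + current
        du = a * (b * v - u)
        v += dt * dv         # forward Euler integration step
        u += dt * du
    return spikes

driven = izhikevich_spike_times(10.0)
silent = izhikevich_spike_times(0.0)
```

With these parameters the neuron fires repeatedly under a constant input of 10 and stays at rest with no input. Spike trains like `driven`, collected across a network, are the kind of data on which bursting measures and compression-based diversity measures would be computed.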
The innate immune system is a two-edged sword; it is absolutely required for host defense against infection but, uncontrolled, can trigger a plethora of inflammatory diseases. Here we used systems biology approaches to predict and validate a gene regulatory network involving a dynamic interplay between the transcription factors NF-κB, C/EBPδ, and ATF3 that controls inflammatory responses. We mathematically modeled transcriptional regulation of Il6 and Cebpd genes and experimentally validated the prediction that the combination of an initiator (NF-κB), an amplifier (C/EBPδ) and an attenuator (ATF3) forms a regulatory circuit that discriminates between transient and persistent Toll-like receptor 4-induced signals. Our results suggest a mechanism that enables the innate immune system to detect the duration of infection and to respond appropriately.
We present a computational framework for predicting targets of transcription factor regulation. The framework is based on the integration of a number of sources of evidence, derived from DNA sequence and gene expression data, using a weighted sum approach. Sources of evidence are prioritized based on a training set, and their relative contributions are then optimized. The performance of the proposed framework is demonstrated in the context of BCL6 target prediction. We show that this framework is able to uncover BCL6 targets reliably when biological prior information is utilized effectively, particularly in the case of sequence analysis. The framework results in a considerable gain in performance over scores in which sequence information was not incorporated. This analysis shows that with assessment of the quality and biological relevance of the data, reliable predictions can be obtained with this computational framework.
network inference; transcription factor binding site prediction; data integration
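The weighted sum approach described in the abstract above can be sketched as follows. The evidence names, gene labels, and weight values are hypothetical, and the weights are taken as given here, whereas the framework described above prioritizes and optimizes them on a training set. Normalising by the weight total is an illustrative choice.

```python
def combined_score(evidence, weights):
    """Weighted average of per-source evidence scores for one candidate gene."""
    total = sum(weights.values())
    return sum(weights[name] * evidence[name] for name in weights) / total

def rank_targets(candidates, weights):
    """Candidate genes sorted from strongest to weakest combined evidence."""
    return sorted(candidates,
                  key=lambda gene: combined_score(candidates[gene], weights),
                  reverse=True)

# Hypothetical evidence sources and candidates, for illustration only.
example_weights = {"motif": 3.0, "expression": 1.0}
example_candidates = {
    "GENE_A": {"motif": 0.2, "expression": 0.9},
    "GENE_B": {"motif": 0.9, "expression": 0.2},
}
example_ranking = rank_targets(example_candidates, example_weights)
```

Giving the sequence-derived source a larger weight, as in this toy example, mirrors the abstract's observation that sequence information contributed a considerable gain in performance.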
Two computational methods for estimating the cell cycle phase distribution of a budding yeast (Saccharomyces cerevisiae) cell population are presented. The first one is a nonparametric method that is based on the analysis of DNA content in the individual cells of the population. The DNA content is measured with a fluorescence-activated cell sorter (FACS). The second method is based on budding index analysis. An automated image analysis method is presented for the task of detecting the cells and buds. The proposed methods can be used to obtain quantitative information on the cell cycle phase distribution of a budding yeast S. cerevisiae population. They therefore provide a solid basis for obtaining the complementary information needed in deconvolution of gene expression data. As a case study, both methods are tested with data that were obtained in a time series experiment with S. cerevisiae. The details of the time series experiment as well as the image and FACS data obtained in the experiment can be found in the online additional material at http://www.cs.tut.fi/sgn/csb/yeastdistrib/.
Microarray technologies have become common tools in biological research. As a result, a need for effective computational methods for data analysis has emerged. Numerous different algorithms have been proposed for analyzing the data. However, an objective evaluation of the proposed algorithms is not possible due to the lack of biological ground truth information. To overcome this fundamental problem, the use of simulated microarray data for algorithm validation has been proposed.
We present a microarray simulation model which can be used to validate different kinds of data analysis algorithms. The proposed model is unique in the sense that it includes all the steps that affect the quality of real microarray data. These steps include the simulation of biological ground truth data, applying biological and measurement technology specific error models, and finally simulating the microarray slide manufacturing and hybridization. After all these steps are taken into account, the simulated data has realistic biological and statistical characteristics. The applicability of the proposed model is demonstrated by several examples.
The proposed microarray simulation model is modular and can be used in different kinds of applications. It includes several error models that have been proposed earlier and it can be used with different types of input data. The model can be used to simulate both spotted two-channel and oligonucleotide-based single-channel microarrays. All this makes the model a valuable tool, for example, in the validation of data analysis algorithms.