Computational cancer phylogenetics seeks to enumerate the temporal sequences of aberrations in tumor evolution, thereby delineating the evolution of possible tumor progression pathways, molecular subtypes and mechanisms of action. We previously developed a pipeline for constructing phylogenies describing evolution between major recurring cell types computationally inferred from whole-genome tumor profiles. The accuracy and detail of the phylogenies, however, depends on the identification of accurate, high-resolution molecular markers of progression, i.e., reproducible regions of aberration that robustly differentiate different subtypes and stages of progression. Here we present a novel hidden Markov model (HMM) scheme for the problem of inferring such phylogenetically significant markers through joint segmentation and calling of multi-sample tumor data. Our method classifies sets of genome-wide DNA copy number measurements into a partitioning of samples into normal (diploid) or amplified at each probe. It differs from other similar HMM methods in its design specifically for the needs of tumor phylogenetics, by seeking to identify robust markers of progression conserved across a set of copy number profiles. We show an analysis of our method in comparison to other methods on both synthetic and real tumor data, which confirms its effectiveness for tumor phylogeny inference and suggests avenues for future advances.
PMCID: PMC3830698
PMID: 24407301
Biology and genetics; Health; Trees; Segmentation
Array comparative genomic hybridization (aCGH) provides a high-resolution and high-throughput technique for screening of copy number variations (CNVs) within the entire genome. This technique, compared to the conventional CGH, significantly improves the identification of chromosomal abnormalities. However, due to the random noise inherited in the imaging and hybridization process, identifying statistically significant DNA copy number changes in aCGH data is challenging. We propose a novel approach that uses the mean and variance change point model (MVCM) to detect CNVs or breakpoints in aCGH data sets. We derive an approximate p-value for the test statistic and also give the estimate of the locus of the DNA copy number change. We carry out simulation studies to evaluate the accuracy of the estimate and the p-value formulation. These simulation results show that the approach is effective in identifying copy number changes. The approach is also tested on fibroblast cancer cell line data, breast tumor cell line data, and breast cancer cell line aCGH data sets that are publicly available. Changes that have not been identified by the circular binary segmentation (CBS) method but are biologically verified are detected by our approach on these cell lines with higher sensitivity and specificity than CBS.
doi:10.1109/TCBB.2008.129
PMCID: PMC4154476
PMID: 19875853
Statistical hypothesis testing; aCGH microarray data; gene expression; DNA copy numbers; CNVs
Next-generation sequencing technologies provide unprecedented power to explore the repertoire of genes and their alternative splice variants, collectively defining the transcriptome of a species in great detail. However, assembling the short reads into full-length gene and transcript models presents significant computational challenges. We review current algorithms for assembling transcripts and genes from next-generation sequencing reads aligned to a reference genome, and lay out areas for future improvements.
PMCID: PMC4086730
PMID: 24524156
Algorithms; biology and genetics; computer applications; medicine and science
Detecting and quantifying the timing and the genetic contributions of parental populations to a hybrid population is an important but challenging problem in reconstructing evolutionary histories from genetic variation data. With the advent of high throughput genotyping technologies, new methods suitable for large-scale data are especially needed. Furthermore, existing methods typically assume the assignment of individuals into subpopulations is known, when that itself is a difficult problem often unresolved for real data. Here we propose a novel method that combines prior work for inferring non-reticulate population structures with an MCMC scheme for sampling over admixture scenarios to both identify population assignments and learn divergence times and admixture proportions for those populations using genome-scale admixed genetic variation data. We validated our method using coalescent simulations and a collection of real bovine and human variation data. On simulated sequences, our methods show better accuracy and faster runtime than leading competitive methods in estimating admixture fractions and divergence times. Analysis on the real data further shows our methods to be effective at matching our best current knowledge about the relevant populations.
PMCID: PMC4019315
PMID: 23959633
J.3.a Biology and genetics; E.1.d Graphs and networks; H.1.1.b Information theory; F.2.2.b Computations on discrete structures
Cryo-electron microscopy is an experimental technique that is able to produce 3D gray-scale images of protein molecules. In contrast to other experimental techniques, cryo-electron microscopy is capable of visualizing large molecular complexes such as viruses and ribosomes. At medium resolution, the positions of the atoms are not visible and the process cannot proceed. The medium-resolution images produced by cryo-electron microscopy are used to derive the atomic structure of the proteins in de novo modeling. The skeletons of the 3D gray-scale images are used to interpret important information that is helpful in de novo modeling. Unfortunately, not all features of the image can be captured using a single segmentation. In this paper, we present a segmentation-free approach to extract the gray-scale curve-like skeletons. The approach relies on a novel representation of the 3D image, where the image is modeled as a graph and a set of volume trees. A test containing 36 synthesized maps and one authentic map shows that our approach can improve the performance of the two tested tools used in de novo modeling. The improvements were 62 and 13 percent for Gorgon and DP-TOSS, respectively.
doi:10.1109/TCBB.2013.121
PMCID: PMC4104753
PMID: 24384713
Image processing; graphs; modeling techniques; volumetric image representation
True steady states are a rare occurrence in living organisms, yet their knowledge is essential for quasi-steady state approximations, multistability analysis, and other important tools in the investigation of chemical reaction networks (CRN) used to describe molecular processes on the cellular level. Here we present an approach that can provide closed form steady-state solutions to complex systems, resulting from CRN with binary reactions and mass-action rate laws. We map the nonlinear algebraic problem of finding steady states onto a linear problem in a higher dimensional space. We show that the linearized version of the steady state equations obeys the linear conservation laws of the original CRN. We identify two classes of problems for which complete, minimally parameterized solutions may be obtained using only the machinery of linear systems and a judicious choice of the variables used as free parameters. We exemplify our method, providing explicit formulae, on CRN describing signal initiation of two important types of RTK receptor-ligand systems, VEGF and EGF-ErbB1.
doi:10.1109/TCBB.2013.41
PMCID: PMC4090023
PMID: 24334389
Chemical reaction networks; cell signaling; VEGF; EGF; linear conservation laws; analytical solution; bilinear systems; minimally parameterized solutions
Given a single index, the receiver operational characteristic (ROC) curve analysis is routinely utilized for characterizing performances in distinguishing two conditions/groups in terms of sensitivity and specificity. Given the availability of multiple data sources (referred to as multi-indices), such as multimodal neuroimaging data sets, cognitive tests, and clinical ratings and genomic data in Alzheimer’s disease (AD) studies, the single-index-based ROC underutilizes all available information. For a long time, a number of algorithmic/analytic approaches combining multiple indices have been widely used to simultaneously incorporate multiple sources. In this study, we propose an alternative for combining multiple indices using logical operations, such as “AND,” “OR,” and “at least n” (where n is an integer), to construct multivariate ROC (multiV-ROC) and characterize the sensitivity and specificity statistically associated with the use of multiple indices. With and without the “leave-one-out” cross-validation, we used two data sets from AD studies to showcase the potentially increased sensitivity/specificity of the multiV-ROC in comparison to the single-index ROC and linear discriminant analysis (an analytic way of combining multi-indices). We conclude that, for the data sets we investigated, the proposed multiV-ROC approach is capable of providing a natural and practical alternative with improved classification accuracy as compared to univariate ROC and linear discriminant analysis.
doi:10.1109/TCBB.2012.141
PMCID: PMC4085147
PMID: 23702553
Alzheimer’s dementia (AD); multiple indices; multiV-ROC; receiver operational characteristic (ROC)
High-throughput expression technologies, including gene expression array and liquid chromatography – mass spectrometry (LC-MS) etc., measure thousands of features, i.e. genes or metabolites, on a continuous scale. In such data, both linear and nonlinear relations exist between features. Nonlinear relations can reflect critical regulation patterns in the biological system. However they are not identified and utilized by traditional clustering methods based on linear associations. Clustering based on general dependencies, i.e. both linear and nonlinear relations, is hampered by the high dimensionality and high noise level of the data. We developed a sensitive nonparametric measure of general dependency between (groups of) random variables in high-dimensions. Based on this dependency measure, we developed a hierarchical clustering method. In simulation studies, the method outperformed correlation- and mutual information (MI) – based hierarchical clustering methods in clustering features with nonlinear dependencies. We applied the method to a microarray dataset measuring the gene expression in cell-cycle time series to show it generates biologically relevant results. The R code is available at http://userwww.service.emory.edu/~tyu8/GDHC.
doi:10.1109/TCBB.2013.99
PMCID: PMC3905248
PMID: 24334400
clustering; association rule; non-linear association; high-throughput data
Analysis of molecular interaction networks is pervasive in systems biology. This research relies almost entirely on graphs for modeling interactions. However, edges in graphs cannot represent multiway interactions among molecules, which occur very often within cells. Hypergraphs may be better representations for networks having such interactions, since hyperedges can naturally represent relationships among multiple molecules. Here, we propose using hypergraphs to capture the uncertainty inherent in reverse engineering gene-gene networks. Some subsets of nodes may induce highly varying subgraphs across an ensemble of networks inferred by a reverse engineering algorithm. We provide a novel formulation of hyperedges to capture this uncertainty in network topology. We propose a clustering-based approach to discover hyperedges. We show that our approach can recover hyperedges planted in synthetic data sets with high precision and recall, even for moderate amount of noise. We apply our techniques to a data set of pathways inferred from genetic interaction data in S. cerevisiae related to the unfolded protein response. Our approach discovers several hyperedges that capture the uncertain connectivity of genes in relevant protein complexes, suggesting that further experiments may be required to precisely discern their interaction patterns. We also show that these complexes are not discovered by an algorithm that computes frequent and dense subgraphs.
doi:10.1109/TCBB.2013.71
PMCID: PMC4051496
PMID: 24384702
Biology and genetics; hypergraphs; graphs and networks
Binary (0,1) matrices, commonly known as transactional databases, can
represent many application data, including gene-phenotype data where
“1” represents a confirmed gene-phenotype relation and
“0” represents an unknown relation. It is natural to ask what
information is hidden behind these “0”s and “1”s.
Unfortunately, recent matrix completion methods, though very effective in many
cases, are less likely to infer something interesting from these (0,1)-matrices.
To answer this challenge, we propose IndEvi, a very succinct and
effective algorithm to perform independent-evidence-based transactional database
transformation. Each entry of a (0,1)-matrix is evaluated by “independent
evidence” (maximal supporting patterns) extracted from the whole matrix
for this entry. The value of an entry, regardless of its value as 0 or 1, has
completely no effect for its independent evidence. The experiment on a
gene-phenotype database shows that our method is highly promising in ranking
candidate genes and predicting unknown disease genes.
doi:10.1109/TCBB.2011.58
PMCID: PMC4047992
PMID: 21422495
Transactional database; binary matrix; frequent item set mining; maximal biclique; phenotype; disease gene; prioritization; matrix completion
A Bayesian alignment model (BAM) is proposed for alignment of liquid chromatography-mass spectrometry (LC-MS) data. BAM belongs to the category of profile-based approaches, which are composed of two major components: a prototype function and a set of mapping functions. Appropriate estimation of these functions is crucial for good alignment results. BAM uses Markov chain Monte Carlo (MCMC) methods to draw inference on the model parameters and improves on existing MCMC-based alignment methods through 1) the implementation of an efficient MCMC sampler and 2) an adaptive selection of knots. A block Metropolis-Hastings algorithm that mitigates the problem of the MCMC sampler getting stuck at local modes of the posterior distribution is used for the update of the mapping function coefficients. In addition, a stochastic search variable selection (SSVS) methodology is used to determine the number and positions of knots. We applied BAM to a simulated data set, an LC-MS proteomic data set, and two LC-MS metabolomic data sets, and compared its performance with the Bayesian hierarchical curve registration (BHCR) model, the dynamic time-warping (DTW) model, and the continuous profile model (CPM). The advantage of applying appropriate profile-based retention time correction prior to performing a feature-based approach is also demonstrated through the metabolomic data sets.
doi:10.1109/TCBB.2013.25
PMCID: PMC3993096
PMID: 23929872
Alignment; Bayesian inference; block Metropolis-Hastings algorithm; liquid chromatography-mass spectrometry (LC-MS); Markov chain Monte Carlo (MCMC); stochastic search variable selection (SSVS)
With well over one thousand specialized biological databases in use today, the task of automatically identifying novel, relevant data for such databases is increasingly important. In this paper, we describe practical machine learning approaches for identifying MEDLINE documents and Swiss-Prot/TrEMBL protein records, for incorporation into a specialized biological database of transport proteins named TCDB. We show that both learning approaches outperform rules created by hand by a human expert. As one of the first case studies involving two different approaches to updating a deployed database, both the methods compared and the results will be of interest to curators of many specialized databases.
doi:10.1109/TCBB.2009.83
PMCID: PMC3980937
PMID: 21393656
Bioinformatics (genome or protein) databases; Clustering, classification, and association rules; text mining; biomedical text classification; data mining
Modeling of biological networks is a difficult endeavour, but exploration of this problem is essential for understanding the systems behaviour of biological processes. In this contribution, developed for sparse data, we present a new continuous Bayesian graphical learning algorithm to cotemporally model proteins in signaling networks and genes in transcriptional regulatory networks. In this continuous Bayesian algorithm the correlation matrix is singular because the number of time points is less than the number of biological entities (genes or proteins). A suitable restriction on the degree of the graph’s vertices is applied and a Metropolis-Hastings algorithm is guided by a BIC-based posterior probability score. Ten independent and diverse runs of the algorithm are conducted, so that the probability space is properly well-explored. Diagnostics to test the applicability of the algorithm to the specific data sets are developed; this is a major benefit of the methodology. This novel algorithm is applied to two time course experimental data sets: 1) protein modification data identifying a potential signaling network in chondrocytes; and 2) gene expression data identifying the transcriptional regulatory network underlying dendritic cell maturation. This method gives high estimated posterior probabilities to many of the proteins’ directed edges that are predicted by the literature; for the gene study, the method gives high posterior probabilities to many of the literature-predicted sibling edges. In simulations, the method gives substantially higher estimated posterior probabilities for true edges and true subnetworks than for their false counterparts.
doi:10.1109/TCBB.2010.95
PMCID: PMC3954570
PMID: 20855920
Biological system modeling; statistical computing; multivariate statistics; correlation and regression analysis; signal transduction networks; transcriptional regulatory networks; biological network modeling
Nature possesses a secret formula for the energy as a function of the structure of a protein. In protein design, approximations are made to both the structural representation of the molecule and to the form of the energy equation, such that the existence of a general energy function for proteins is by no means guaranteed. Here we present new insights towards the application of machine learning to the problem of finding a general energy function for protein design. Machine learning requires the definition of an objective function, which carries with it the implied definition of success in protein design. We explored four functions, consisting of two functional forms, each with two criteria for success. Optimization was carried out by a Monte Carlo search through the space of all variable parameters. Cross-validation of the optimized energy function against a test set gave significantly different results depending on the choice of objective function, pointing to relative correctness of the built-in assumptions. Novel energy cross-terms correct for the observed non-additivity of energy terms and an imbalance in the distribution of predicted amino acids. This paper expands on the work presented at ACM-BCB, Orlando FL , October 2012.
doi:10.1109/TCBB.2013.113
PMCID: PMC3919130
PMID: 24384706
Biology and Genetics; Physics; Chemistry; Protein design; energy function; machine learning; correlation; rotamers; dead-end elimination
Reliable inference of transcription regulatory networks is still a challenging task in the field of computational biology. Network component analysis (NCA) has become a powerful scheme to uncover the networks behind complex biological processes, especially when gene expression data is integrated with binding motif information. However, the performance of NCA is impaired by the high rate of false connections in binding motif information and the high level of noise in gene expression data. Moreover, in real applications such as cancer research, the performance of NCA in simultaneously analyzing multiple candidate transcription factors (TFs) is further limited by the small sample number of gene expression data. In this paper, we propose a novel scheme, stability-based NCA, to overcome the above-mentioned problems by addressing the inconsistency between gene expression data and motif binding information (i.e., prior network knowledge). This method introduces small perturbations on prior network knowledge and utilizes the variation of estimated TF activities to reflect the stability of TF activities. Such a scheme is less limited by the sample size and especially capable to identify condition-specific TFs and their target genes. Experiment results on both simulation data and real breast cancer data demonstrate the efficiency and robustness of the proposed method.
PMCID: PMC3652899
PMID: 24407294
transcription regulatory network; network component analysis; stability analysis; transcription factor activity; target genes identification
Models of regulatory networks become more difficult to construct and understand as they grow in size and complexity. Large models are usually built up from smaller models, representing subsets of reactions within the larger network. To assist modelers in this composition process, we present a formal approach for model composition, a wizard-style program for implementing the approach, and suggested language extensions to the Systems Biology Markup Language to support model composition. To illustrate the features of our approach and how to use the JigCell Composition Wizard, we build up a model of the eukaryotic cell cycle “engine” from smaller pieces.
doi:10.1109/TCBB.2008.64
PMCID: PMC3773227
PMID: 20431147
Modeling; composition; fusion; flattening; SBML
DNA copy number variation (CNV) accounts for a large proportion of genetic variation. One commonly used approach to detecting CNVs is array-based comparative genomic hybridization (aCGH). Although many methods have been proposed to analyze aCGH data, it is not clear how to combine information from multiple samples to improve CNV detection. In this paper, we propose to use a matrix to approximate the multisample aCGH data and minimize the total variation of each sample as well as the nuclear norm of the whole matrix. In this way, we can make use of the smoothness property of each sample and the correlation among multiple samples simultaneously in a convex optimization framework. We also developed an efficient and scalable algorithm to handle large-scale data. Experiments demonstrate that the proposed method outperforms the state-of-the-art techniques under a wide range of scenarios and it is capable of processing large data sets with millions of probes.
doi:10.1109/TCBB.2012.166
PMCID: PMC3715577
PMID: 23702561
CNV; aCGH; total variation; spectral regularization; convex optimization
High throughput technologies enable researchers to measure expression levels on a genomic scale. However, the correct and efficient biological interpretation of such voluminous data remains a challenging problem. Many tools have been developed for the analysis of GO terms that are over- or under-represented in a list of differentially expressed genes. However, a previously unexplored aspect is the identification of changes in the way various biological processes interact in a given condition with respect to a reference. Here, we present a novel approach that aims at identifying such interactions between biological processes that are significantly different in a given phenotype with respect to normal. The proposed technique uses vector-space representation, SVD-based dimensionality reduction, differential weighting, and bootstrapping to asses the significance of the interactions under the multiple and complex dependencies expected between the biological processes. We illustrate our approach on two real data sets involving breast and lung cancer. More than 88 percent of the interactions found by our approach were deemed to be correct by an extensive manual review of literature. An interesting subset of such interactions is discussed in detail and shown to have the potential to open new avenues for research in lung and breast cancer.
doi:10.1109/TCBB.2012.65
PMCID: PMC3748606
PMID: 22547431
Phenotype-specific interactions; biological processes; microarrays; gene ontology; single value decomposition
Enormous data collection efforts and improvements in technology have made large genome-wide association studies a promising approach for better understanding the genetics of common diseases. Still, the knowledge gained from these studies may be extended even further by testing the hypothesis that genetic susceptibility is due to the combined effect of multiple variants or interactions between variants. Here we explore and evaluate the use of a genetic algorithm to discover groups of SNPs (of size 2, 3, or 4) that are jointly associated with bipolar disorder. The algorithm is guided by the structure of a gene interaction network, and is able to find groups of SNPs that are strongly associated with the disease, while performing far fewer statistical tests than other methods.
doi:10.1109/TCBB.2011.145
PMCID: PMC3748153
PMID: 22025762
Biology and Genetics; Evolutionary Computing and Genetic Algorithms; Graphs and Networks
The correct interpretation of many molecular biology experiments depends in an essential way on the accuracy and consistency of the existing annotation databases. Such databases are meant to act as repositories for our biological knowledge as we acquire and refine it. Hence, by definition, they are incomplete at any given time. In this paper, we describe a technique that improves our previous method for predicting novel GO annotations by extracting implicit semantic relationships between genes and functions. In this work, we use a vector space model and a number of weighting schemes in addition to our previous latent semantic indexing approach. The technique described here is able to take into consideration the hierarchical structure of the Gene Ontology (GO) and can weight differently GO terms situated at different depths. The prediction abilities of 15 different weighting schemes are compared and evaluated. Nine such schemes were previously used in other problem domains, while six of them are introduced in this paper. The best weighting scheme was a novel scheme, n2tn. Out of the top 50 functional annotations predicted using this weighting scheme, we found support in the literature for 84 percent of them, while 6 percent of the predictions were contradicted by the existing literature. For the remaining 10 percent, we did not find any relevant publications to confirm or contradict the predictions. The n2tn weighting scheme also outperformed the simple binary scheme used in our previous approach.
doi:10.1109/TCBB.2008.29
PMCID: PMC3712327
PMID: 20150671
Gene function prediction; gene annotation; Gene Ontology; vector space model; latent semantic indexing; weighting schemes
The local conformation of RNA molecules is an important factor in determining their catalytic and binding properties. The analysis of such conformations is particularly difficult due to the large number of degrees of freedom, such as the measured torsion angles per residue and the interatomic distances among interacting residues. In this work, we use a nearest-neighbor search method based on the statistical mechanical Potts model to find clusters in the RNA conformational space. The proposed technique is mostly automatic and may be applied to problems, where there is no prior knowledge on the structure of the data space in contrast to many other clustering techniques. Results are reported for both single residue conformations, where the parameter set of the data space includes four to seven torsional angles, and base pair geometries, where the data space is reduced to two dimensions. Moreover, new results are reported for base stacking geometries. For the first two cases, i.e., single residue conformations and base pair geometries, we get a very good match between the results of the proposed clustering method and the known classifications with only few exceptions. For the case of base stacking geometries, we validate our classification with respect to geometrical constraints and describe the content, and the geometry of the new clusters.
doi:10.1109/TCBB.2010.128
PMCID: PMC3679554
PMID: 21173460
RNA conformation; clustering; potts model; statistical mechanics
PMCID: PMC3646588
PMID: 19644178
The problem of identifying the proteins in a complex mixture using tandem mass spectrometry can be framed as an inference problem on a graph that connects peptides to proteins. Several existing protein identification methods make use of statistical inference methods for graphical models, including expectation maximization, Markov chain Monte Carlo, and full marginalization coupled with approximation heuristics. We show that, for this problem, the majority of the cost of inference usually comes from a few highly connected subgraphs. Furthermore, we evaluate three different statistical inference methods using a common graphical model, and we demonstrate that junction tree inference substantially improves rates of convergence compared to existing methods. The python code used for this paper is available at http://noble.gs.washington.edu/proj/fido.
doi:10.1109/TCBB.2012.26
PMCID: PMC3389307
PMID: 22331862
Mass spectrometry; protein identification; graphical models; Bayesian inference
Molecular dynamics trajectories are very data-intensive thereby limiting sharing and archival of such data. One possible solution is compression of trajectory data. Here, trajectory compression based on conversion to the coarse-grained model PRIMO is proposed. The compressed data is about one third of the original data and fast decompression is possible with an analytical reconstruction procedure from PRIMO to all-atom representations. This protocol largely preserves structural features and to a more limited extent also energetic features of the original trajectory.
doi:10.1109/TCBB.2011.141
PMCID: PMC3505254
PMID: 22025759
proteins; all-atom reconstruction; PRIMO; molecular dynamics simulation; compression; coarse-grained model
Microarray gene expression data often contain missing values. Accurate estimation of the missing values is important for down-stream data analyses that require complete data. Nonlinear relationships between gene expression levels have not been well-utilized in missing value imputation. We propose an imputation scheme based on nonlinear dependencies between genes. By simulations based on real microarray data, we show that incorporating non-linear relationships could improve the accuracy of missing value imputation, both in terms of normalized root mean squared error and in terms of the preservation of the list of significant genes in statistical testing. In addition, we studied the impact of artificial dependencies introduced by data normalization on the simulation results. Our results suggest that methods relying on global correlation structures may yield overly optimistic simulation results when the data has been subjected to row (gene) – wise mean removal.
doi:10.1109/TCBB.2010.73
PMCID: PMC3624752
PMID: 20733236
gene expression; statistical analysis; missing value