1.  Network motif-based identification of transcription factor-target gene relationships by integrating multi-source biological data 
BMC Bioinformatics  2008;9:203.
Background
Integrating data from multiple global assays and curated databases is essential to understand the spatio-temporal interactions within cells. Different experiments measure cellular processes at various widths and depths, while databases contain biological information based on established facts or published data. Integrating these complementary datasets helps infer a mutually consistent transcriptional regulatory network (TRN) with strong similarity to the structure of the underlying genetic regulatory modules. Decomposing the TRN into a small set of recurring regulatory patterns, called network motifs (NM), facilitates the inference. Identifying NMs defined by specific transcription factors (TFs) establishes the framework structure of a TRN and allows the inference of TF-target gene relationships. This paper introduces a computational framework for utilizing data from multiple sources to infer TF-target gene relationships on the basis of NMs. The data include time course gene expression profiles, genome-wide location analysis data, binding sequence data, and gene ontology (GO) information.
Results
The proposed computational framework was tested using gene expression data associated with cell cycle progression in yeast. Among 800 cell cycle related genes, 85 were identified as candidate TFs and classified into four previously defined NMs. The NMs for a subset of TFs were obtained from the literature. Support vector machine (SVM) classifiers were used to estimate NMs for the remaining TFs. The potential downstream target genes for the TFs were clustered into 34 biologically significant groups. The relationships between TFs and potential target gene clusters were examined by training recurrent neural networks whose topologies mimic the NMs to which the TFs are classified. The identified relationships between TFs and gene clusters were evaluated using the following biological validation and statistical analyses: (1) Gene set enrichment analysis (GSEA) to evaluate the clustering results; (2) Leave-one-out cross-validation (LOOCV) to ensure that the SVM classifiers assign TFs to NM categories with high confidence; (3) Binding site enrichment analysis (BSEA) to determine enrichment of the gene clusters for the cognate binding sites of their predicted TFs; (4) Comparison with previously reported results in the literature to confirm the inferred regulations.
Conclusion
The major contribution of this study is the development of a computational framework to assist the inference of TRN by integrating heterogeneous data from multiple sources and by decomposing a TRN into NM-based modules. The inference capability of the proposed framework is verified statistically (e.g., LOOCV) and biologically (e.g., GSEA, BSEA, and literature validation). The proposed framework is useful for inferring small NM-based modules of TF-target gene relationships that can serve as a basis for generating new testable hypotheses.
doi:10.1186/1471-2105-9-203
PMCID: PMC2386822  PMID: 18426580
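To make the SVM-with-LOOCV validation step concrete, here is a minimal scikit-learn sketch; the feature vectors, their dimension, and the labels are synthetic stand-ins for the paper's expression-derived TF features, not the study's data.

```python
# Minimal sketch: assign candidate TFs to network-motif (NM) categories with an
# SVM and validate by leave-one-out cross-validation (LOOCV), as described above.
# All features and labels below are synthetic placeholders.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(85, 18))      # 85 candidate TFs x 18 hypothetical features
y = rng.integers(0, 4, size=85)    # 4 previously defined NM categories

clf = SVC(kernel="rbf", C=1.0, gamma="scale")
acc = cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()
print(f"LOOCV accuracy: {acc:.2f}")  # with real features, high accuracy ~ confident NM assignment
```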
2.  Alignment and Prediction of cis-Regulatory Modules Based on a Probabilistic Model of Evolution 
PLoS Computational Biology  2009;5(3):e1000299.
Cross-species comparison has emerged as a powerful paradigm for predicting cis-regulatory modules (CRMs) and understanding their evolution. The comparison requires reliable sequence alignment, which remains a challenging task for less conserved noncoding sequences. Furthermore, the existing models of DNA sequence evolution generally do not explicitly treat the special properties of CRM sequences. To address these limitations, we propose a model of CRM evolution that captures different modes of evolution of functional transcription factor binding sites (TFBSs) and the background sequences. A particularly novel aspect of our work is a probabilistic model of gains and losses of TFBSs, a process being recognized as an important part of regulatory sequence evolution. We present a computational framework that uses this model to solve the problems of CRM alignment and prediction. Our alignment method is similar to existing methods of statistical alignment but uses the conserved binding sites to improve alignment. Our CRM prediction method deals with the inherent uncertainties of binding site annotations and sequence alignment in a probabilistic framework. In simulated as well as real data, we demonstrate that our program is able to improve both alignment and prediction of CRM sequences over several state-of-the-art methods. Finally, we used alignments produced by our program to study binding site conservation in genome-wide binding data of key transcription factors in the Drosophila blastoderm, with two intriguing results: (i) the factor-bound sequences are under strong evolutionary constraints even if their neighboring genes are not expressed in the blastoderm and (ii) binding sites in distal bound sequences (relative to transcription start sites) tend to be more conserved than those in proximal regions. Our approach is implemented as software, EMMA (Evolutionary Model-based cis-regulatory Module Analysis), ready to be applied in a broad biological context.
Author Summary
Comparison of noncoding DNA sequences across species has the potential to significantly improve our understanding of gene regulation and our ability to annotate regulatory regions of the genome. This potential is evident from recent publications analyzing 12 Drosophila genomes for regulatory annotation. However, because noncoding sequences are much less structured than coding sequences, their interspecies comparison presents technical challenges, such as ambiguity about how to align them and how to predict transcription factor binding sites, which are the fundamental units that make up regulatory sequences. This article describes how to build an integrated probabilistic framework that performs alignment and binding site prediction simultaneously, in the process improving the accuracy of both tasks. It defines a stochastic model for the evolution of entire “cis-regulatory modules,” with its highlight being a novel theoretical treatment of the commonly observed loss and gain of binding sites during evolution. This new evolutionary model forms the backbone of newly developed software for the prediction of new cis-regulatory modules, alignment of known modules to elucidate general principles of cis-regulatory evolution, or both. The new software is demonstrated to provide benefits in performance of these two crucial genomics tasks.
doi:10.1371/journal.pcbi.1000299
PMCID: PMC2657044  PMID: 19293946
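The probabilistic treatment of binding-site gain and loss described above can be pictured as a two-state continuous-time Markov chain (site absent or present) along a branch; the sketch below computes the standard transition probabilities under assumed gain and loss rates. The rates and branch length are illustrative, not EMMA's fitted parameters.

```python
# Sketch of the core quantity in a binding-site gain/loss model: a two-state
# continuous-time Markov chain (absent <-> present) evolved for branch length t.
import math

def site_transition_probs(lam, mu, t):
    """Return P(site present at t | start state) for start in {absent, present}."""
    total = lam + mu                         # lam = gain rate, mu = loss rate
    pi_present = lam / total                 # stationary probability of a site
    decay = math.exp(-total * t)
    p_present_given_present = pi_present + (1.0 - pi_present) * decay
    p_present_given_absent = pi_present * (1.0 - decay)
    return p_present_given_absent, p_present_given_present

p_gain, p_keep = site_transition_probs(lam=0.2, mu=1.0, t=0.5)  # invented values
print(f"P(gain)={p_gain:.3f}  P(retain)={p_keep:.3f}")
```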
3.  Construction and Analysis of an Integrated Regulatory Network Derived from High-Throughput Sequencing Data 
PLoS Computational Biology  2011;7(11):e1002190.
We present a network framework for analyzing multi-level regulation in higher eukaryotes based on systematic integration of various high-throughput datasets. The network, namely the integrated regulatory network, consists of three major types of regulation: TF→gene, TF→miRNA and miRNA→gene. We identified the target genes and target miRNAs for a set of TFs based on ChIP-Seq binding profiles, and the targets of miRNAs using annotated 3′UTR sequences and conservation information. Making use of the system-wide RNA-Seq profiles, we classified transcription factors into positive and negative regulators and assigned a sign to each regulatory interaction. Other types of edges, such as protein-protein interactions and potential intra-regulations between miRNAs based on the embedding of miRNAs in their host genes, were further incorporated. We examined the topological structures of the network, including its hierarchical organization and motif enrichment. We found that transcription factors lower in the hierarchy distinguish themselves by being expressed more uniformly across tissues, having more interacting partners, and being more likely to be essential. We found an over-representation of notable network motifs, including a feed-forward loop (FFL) in which a miRNA cost-effectively shuts down a transcription factor and its target. We used C. elegans data from the modENCODE project as a primary model to illustrate our framework, but further verified the results using two other datasets. As more genome-wide ChIP-Seq and RNA-Seq data become available in the near future, our methods of data integration have various potential applications.
Author Summary
The precise control of gene expression lies at the heart of many biological processes. In eukaryotes, the regulation is performed at multiple levels, mediated by different regulators such as transcription factors and miRNAs, each distinguished by different spatial and temporal characteristics. These regulators are further integrated to form a complex regulatory network responsible for the orchestration. The construction and analysis of such networks are essential for understanding the general design principles. Recent advances in high-throughput techniques like ChIP-Seq and RNA-Seq provide an opportunity by offering a huge amount of binding and expression data. We present a general framework to combine these types of data into an integrated network and perform various topological analyses, including its hierarchical organization and motif enrichment. We find that the integrated network possesses an intrinsic hierarchical organization and is enriched in several network motifs that include both transcription factors and miRNAs. We further demonstrate that the framework can be easily applied to other species like human and mouse. As more genome-wide ChIP-Seq and RNA-Seq data are generated in the near future, our methods of data integration will have many potential applications.
doi:10.1371/journal.pcbi.1002190
PMCID: PMC3219617  PMID: 22125477
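To illustrate the kind of motif counting the paper performs, the sketch below builds a small mixed TF/miRNA directed network with networkx and enumerates the miRNA-mediated feed-forward loop described above (a TF targets both a miRNA and a gene, and the miRNA targets the same gene). All edges and node names are invented examples.

```python
# Sketch: assemble a toy integrated network (TF->gene, TF->miRNA, miRNA->gene)
# and count miRNA-mediated feed-forward loops (FFLs).
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([("TF1", "geneA"), ("TF1", "miR-1"), ("miR-1", "geneA"),
                  ("TF2", "geneB"), ("miR-1", "geneB")])
node_type = {"TF1": "TF", "TF2": "TF", "miR-1": "miRNA",
             "geneA": "gene", "geneB": "gene"}

ffls = [(tf, mir, g)
        for tf in G if node_type[tf] == "TF"
        for mir in G.successors(tf) if node_type[mir] == "miRNA"
        for g in G.successors(mir) if node_type[g] == "gene" and G.has_edge(tf, g)]
print(ffls)   # [('TF1', 'miR-1', 'geneA')]
```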
4.  A classification-based framework for predicting and analyzing gene regulatory response 
BMC Bioinformatics  2006;7(Suppl 1):S5.
Background
We have recently introduced a predictive framework for studying gene transcriptional regulation in simpler organisms using a novel supervised learning algorithm called GeneClass. GeneClass is motivated by the hypothesis that in model organisms such as Saccharomyces cerevisiae, we can learn a decision rule for predicting whether a gene is up- or down-regulated in a particular microarray experiment based on the presence of binding site subsequences ("motifs") in the gene's regulatory region and the expression levels of regulators such as transcription factors in the experiment ("parents"). GeneClass formulates the learning task as a classification problem — predicting +1 and -1 labels corresponding to up- and down-regulation beyond the levels of biological and measurement noise in microarray measurements. Using the Adaboost algorithm, GeneClass learns a prediction function in the form of an alternating decision tree, a margin-based generalization of a decision tree.
Methods
In the current work, we introduce a new, robust version of the GeneClass algorithm that increases stability and computational efficiency, yielding a more scalable and reliable predictive model. The improved stability of the prediction tree enables us to introduce a detailed post-processing framework for biological interpretation, including individual and group target gene analysis to reveal condition-specific regulation programs and to suggest signaling pathways. Robust GeneClass uses a novel stabilized variant of boosting that allows a set of correlated features, rather than single features, to be included at nodes of the tree; in this way, biologically important features that are correlated with the single best feature are retained rather than decorrelated and lost in the next round of boosting. Other computational developments include fast matrix computation of the loss function for all features, allowing scalability to large datasets, and the use of abstaining weak rules, which results in a more shallow and interpretable tree. We also show how to incorporate genome-wide protein-DNA binding data from ChIP chip experiments into the GeneClass algorithm, and we use an improved noise model for gene expression data.
Results
Using the improved scalability of Robust GeneClass, we present larger scale experiments on a yeast environmental stress dataset, training and testing on all genes and using a comprehensive set of potential regulators. We demonstrate the improved stability of the features in the learned prediction tree, and we show the utility of the post-processing framework by analyzing two groups of genes in yeast — the protein chaperones and a set of putative targets of the Nrg1 and Nrg2 transcription factors — and suggesting novel hypotheses about their transcriptional and post-transcriptional regulation. Detailed results and the Robust GeneClass source code are available for download.
doi:10.1186/1471-2105-7-S1-S5
PMCID: PMC1810316  PMID: 16723008
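The classification task GeneClass solves can be mimicked with off-the-shelf boosting: the sketch below trains AdaBoost on synthetic motif-parent indicator features, with decision stumps standing in for the paper's alternating decision trees. Labels follow an invented rule, so this shows only the shape of the learning problem, not the published method.

```python
# Sketch of the GeneClass-style task: predict up (+1) vs down (-1) regulation
# from motif-presence x parent-expression features using boosting.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(500, 30)).astype(float)  # motif-parent indicator features
y = np.where(X[:, 0] + X[:, 3] > 1, 1, -1)            # invented rule standing in for labels

# AdaBoost's default base learner is a depth-1 decision stump, loosely analogous
# to the single-feature tests in an alternating decision tree.
model = AdaBoostClassifier(n_estimators=50).fit(X, y)
print(f"training accuracy: {model.score(X, y):.2f}")
```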
5.  Integrated Assessment and Prediction of Transcription Factor Binding 
PLoS Computational Biology  2006;2(6):e70.
Systematic chromatin immunoprecipitation (chIP-chip) experiments have become a central technique for mapping transcriptional interactions in model organisms and humans. However, measurement of chromatin binding does not necessarily imply regulation, and binding may be difficult to detect if it is condition or cofactor dependent. To address these challenges, we present an approach for reliably assigning transcription factors (TFs) to target genes that integrates many lines of direct and indirect evidence into a single probabilistic model. Using this approach, we analyze publicly available chIP-chip binding profiles measured for yeast TFs in standard conditions, showing that our model interprets these data with significantly higher accuracy than previous methods. Pooling the high-confidence interactions reveals a large network containing 363 significant sets of factors (TF modules) that cooperate to regulate common target genes. In addition, the method predicts 980 novel binding interactions with high confidence that are likely to occur in so-far untested conditions. Indeed, using new chIP-chip experiments we show that predicted interactions for the factors Rpn4p and Pdr1p are observed only after treatment of cells with methyl-methanesulfonate, a DNA-damaging agent. We outline the first approach for consistently integrating all available evidence for TF–target interactions, and we comprehensively identify the resulting TF module hierarchy. Prioritizing experimental conditions for each factor will be especially important as increasing numbers of chIP-chip assays are performed in complex organisms such as humans, for which “standard conditions” are ill defined.
Synopsis
Transcription factors (TFs) bind close to their target genes for regulating transcript levels depending on cellular conditions. Each gene may be regulated differently from others through the binding of specific groups of TFs (TF modules). Recently, a wide variety of large-scale measurements about transcriptional networks has become available. Here the authors present a framework for consistently integrating all of this evidence to systematically determine the precise set of genes directly regulated by each TF (i.e., TF–target interactions). The framework is applied to the yeast Saccharomyces cerevisiae using seven distinct sources of evidence to score all possible TF–target interactions in this organism. Subsequently, the authors employ another newly developed algorithm to reveal TF modules based on the top 5,000 TF–target interactions, yielding more than 300 TF modules. The new scoring scheme for TF–target interactions allows predicting the binding of TFs under so-far untested conditions, which is demonstrated by experimentally verifying interactions for two TFs (Pdr1p, Rpn4p). Importantly, the new methods (scoring of TF–target interactions and TF module identification) are scalable to much larger datasets, making them applicable to future studies in humans, which are thought to have substantially larger numbers of TF–target interactions.
doi:10.1371/journal.pcbi.0020070
PMCID: PMC1479087  PMID: 16789814
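The core of such integration can be reduced to combining likelihood ratios from independent evidence sources into a posterior probability of interaction. The sketch below shows that naive-Bayes-style computation with an invented prior and ratios; the paper's actual model is considerably richer.

```python
# Sketch of evidence fusion for a TF-target interaction: combine independent
# likelihood ratios from several data sources into one posterior probability.
import math

def posterior(prior, likelihood_ratios):
    """prior: P(interaction); likelihood_ratios: P(evidence|true)/P(evidence|false)."""
    log_odds = math.log(prior / (1 - prior)) + sum(map(math.log, likelihood_ratios))
    return 1 / (1 + math.exp(-log_odds))

# ChIP-chip binding, motif match, expression correlation (illustrative ratios)
print(f"P(target) = {posterior(0.01, [20.0, 5.0, 3.0]):.3f}")
```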
6.  MORPH: Probabilistic Alignment Combined with Hidden Markov Models of cis-Regulatory Modules 
PLoS Computational Biology  2007;3(11):e216.
The discovery and analysis of cis-regulatory modules (CRMs) in metazoan genomes is crucial for understanding the transcriptional control of development and many other biological processes. Cross-species sequence comparison holds much promise for improving computational prediction of CRMs, for elucidating their binding site composition, and for understanding how they evolve. Current methods for analyzing orthologous CRMs from multiple species rely upon sequence alignments produced by off-the-shelf alignment algorithms, which do not exploit the presence of binding sites in the sequences. We present here a unified probabilistic framework, called MORPH, that integrates the alignment task with binding site predictions, allowing more robust CRM analysis in two species. The framework sums over all possible alignments of two sequences, thus accounting for alignment ambiguities in a natural way. We perform extensive tests on orthologous CRMs from two moderately diverged species, Drosophila melanogaster and D. mojavensis, to demonstrate the advantages of the new approach. We show that it can overcome certain computational artifacts of traditional alignment tools and provide a different, likely more accurate, picture of cis-regulatory evolution than that obtained from existing methods. The burgeoning field of cis-regulatory evolution, which is amply supported by the availability of many related genomes, is currently thwarted by the lack of accurate alignments of regulatory regions. Our work will fill this void and enable more reliable analysis of CRM evolution.
Author Summary
Interspecies comparison of regulatory sequences is a major focus in the bioinformatics community today. There is extensive ongoing effort toward measuring the extent and patterns of binding site turnover in cis-regulatory modules. A major roadblock in such an analysis has been the fact that traditional alignment methods are not very accurate for regulatory sequences. This is partly because the alignment is performed independently from the binding site predictions and turnover analysis. This article describes a new computational method to compare and align two orthologous regulatory sequences. It uses a unified probabilistic framework to perform alignment and binding site prediction simultaneously, rather than one after the other. Predictions of binding sites and their evolutionary relationships are obtained after summing over all possible alignments, making them robust to alignment ambiguities. The method can also be used to predict new cis-regulatory modules. The article presents extensive applications of the method on synthetic as well as real data. These include the analysis of over 200 cis-regulatory modules in D. melanogaster and their orthologs in D. mojavensis. This analysis reveals a significantly greater degree of conservation of binding sites between these two species than would be inferred using existing alignment tools.
doi:10.1371/journal.pcbi.0030216
PMCID: PMC2065892  PMID: 17997594
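The central trick, summing over all alignments rather than committing to one, is a forward recursion; the toy version below replaces max with sum in a Needleman-Wunsch-style dynamic program. The emission and gap parameters are made up, and a real pair HMM would work in log space with distinct match/insert/delete states.

```python
# Sketch: total probability of two sequences summed over *all* pairwise
# alignments (probabilistic analogue of Needleman-Wunsch, sum in place of max).
import numpy as np

def total_alignment_prob(x, y, p_match=0.8, p_mismatch=0.05, p_gap=0.05):
    F = np.zeros((len(x) + 1, len(y) + 1))
    F[0, 0] = 1.0
    for i in range(len(x) + 1):
        for j in range(len(y) + 1):
            if i and j:
                emit = p_match if x[i - 1] == y[j - 1] else p_mismatch
                F[i, j] += F[i - 1, j - 1] * emit   # aligned pair
            if i:
                F[i, j] += F[i - 1, j] * p_gap      # gap in y
            if j:
                F[i, j] += F[i, j - 1] * p_gap      # gap in x
    return F[len(x), len(y)]

print(total_alignment_prob("ACGT", "AGGT"))
```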
7.  Seasonal Influenza Vaccination for Children in Thailand: A Cost-Effectiveness Analysis 
PLoS Medicine  2015;12(5):e1001829.
Background
Seasonal influenza is a major cause of mortality worldwide. Routine immunization of children has the potential to reduce this mortality through both direct and indirect protection, but has not been adopted by any low- or middle-income countries. We developed a framework to evaluate the cost-effectiveness of influenza vaccination policies in developing countries and used it to consider annual vaccination of school- and preschool-aged children with either trivalent inactivated influenza vaccine (TIV) or trivalent live-attenuated influenza vaccine (LAIV) in Thailand. We also compared these approaches with a policy of expanding TIV coverage in the elderly.
Methods and Findings
We developed an age-structured model to evaluate the cost-effectiveness of eight vaccination policies parameterized using country-level data from Thailand. For policies using LAIV, we considered five different age groups of children to vaccinate. We adopted a Bayesian evidence-synthesis framework, expressing uncertainty in parameters through probability distributions derived by fitting the model to prospectively collected laboratory-confirmed influenza data from 2005-2009, by meta-analysis of clinical trial data, and by using prior probability distributions derived from literature review and elicitation of expert opinion. We performed sensitivity analyses using alternative assumptions about prior immunity, contact patterns between age groups, the proportion of infections that are symptomatic, cost per unit vaccine, and vaccine effectiveness. Vaccination of children with LAIV was found to be highly cost-effective, with incremental cost-effectiveness ratios between about 2,000 and 5,000 international dollars per disability-adjusted life year averted, and was consistently preferred to TIV-based policies. These findings were robust to extensive sensitivity analyses. The optimal age group to vaccinate with LAIV, however, was sensitive both to the willingness to pay for health benefits and to assumptions about contact patterns between age groups.
Conclusions
Vaccinating school-aged children with LAIV is likely to be cost-effective in Thailand in the short term, though the long-term consequences of such a policy cannot be reliably predicted given current knowledge of influenza epidemiology and immunology. Our work provides a coherent framework that can be used for similar analyses in other low- and middle-income countries.
Ben Cooper and colleagues use an age-structured model to estimate optimal cost-effectiveness of flu vaccination among Thai children aged 2 to 17.
Editors' Summary
Background
Every year, millions of people catch influenza, a viral disease of the airways. Most infected individuals recover quickly, but elderly people, the very young, and chronically ill individuals are at high risk of developing serious complications such as pneumonia; seasonal influenza kills about half a million people annually. Small but frequent changes in the influenza virus mean that an immune response produced one year by exposure to the virus provides only partial protection against influenza the next year. Annual immunization with a vaccine that contains killed or live-attenuated (weakened) influenza viruses of the major circulating strains can reduce a person’s chance of catching influenza. Consequently, many countries run seasonal influenza vaccination programs that target elderly people and other people at high risk of influenza complications, and people who care for these individuals.
Why Was This Study Done?
As well as reducing the vaccinated person’s risk of infection, influenza vaccination protects unvaccinated members of the population by reducing the chances of influenza spreading. Because children make a disproportionately large contribution to the transmission of influenza, vaccination of children might therefore provide greater benefits to the whole population than vaccination of elderly people, particularly when vaccination uptake among the elderly is low. Thus, many high-income countries now recommend annual influenza vaccination of children with a trivalent live-attenuated influenza vaccine (LAIV; a trivalent vaccine contains three viruses), which is sprayed into the nose. However, to date no low- or middle-income countries have evaluated this policy. Here, the researchers develop a mathematical model (framework) to evaluate the cost-effectiveness of annual vaccination of children with LAIV or trivalent inactivated influenza vaccine (TIV) in Thailand. A cost-effectiveness analysis evaluates whether a medical intervention is good value for money by comparing the health outcomes and costs associated with the introduction of the intervention with the health outcomes and costs of the existing standard of care. Thailand, a middle-income country, offers everyone over 65 years old free seasonal influenza vaccination with TIV, but vaccine coverage in this age group is low (10%).
What Did the Researchers Do and Find?
The researchers developed a modeling framework that contained six connected components including a transmission model that incorporated infectious contacts within and between different age groups, a health outcome model that calculated the disability-adjusted life years (DALYs, a measure of the overall disease burden) averted by specific vaccination policies, and a cost model that calculated the costs to the population of each policy. They used this framework and data from Thailand to calculate the cost-effectiveness of six childhood vaccination policies in Thailand (one with TIV and five with LAIV that targeted children of different ages) against a baseline policy of 10% TIV coverage in the elderly; they also investigated the cost-effectiveness of increasing vaccination in the elderly to 66%. All seven vaccination policies tested reduced influenza cases and deaths compared to the baseline policy, but the LAIV-based polices were consistently better than the TIV-based policies; the smallest reductions were seen when TIV coverage in elderly people was increased to 66%. All seven policies were highly cost-effective according to the World Health Organization’s threshold for cost-effectiveness. That is, the cost per DALY averted by each policy compared to the baseline policy (the incremental cost-effectiveness ratio) was less than Thailand’s gross domestic product per capita (the total economic output of a country divided by the number of people in the country).
What Do These Findings Mean?
These findings suggest that seasonal influenza vaccination of children with LAIV is likely to represent good value for money in Thailand and, potentially, in other middle- and low-income countries in the short term. The long-term consequences of annual influenza vaccination of children in Thailand cannot be reliably predicted, however, because of limitations in our current understanding of influenza immunity in populations. Moreover, the accuracy of these findings is limited by the assumptions built into the modeling framework, including the vaccine costs and efficacy that were used to run the model, which were estimated from limited data. Importantly, however, these findings support proposals for large-scale community-based controlled trials of policies to vaccinate children against influenza in low- and middle-income countries. Indeed, based on these findings, Thailand is planning to evaluate school-based seasonal influenza vaccination in a few provinces in 2016 before considering a nationwide program of seasonal influenza vaccination of children.
Additional Information
This list of resources contains links that can be accessed when viewing the PDF on a device or via the online version of the article at http://dx.doi.org/10.1371/journal.pmed.1001829.
The UK National Health Service Choices website provides information for patients about seasonal influenza, about influenza vaccination, and about influenza vaccination in children
The World Health Organization provides information on seasonal influenza (in several languages) and on influenza vaccines
The US Centers for Disease Control and Prevention also provides information for patients and health professionals on all aspects of seasonal influenza, including information about vaccination, about children, influenza, and vaccination, and about herd immunity; its website contains a short video about personal experiences of influenza
Flu.gov, a US government website, provides access to information on seasonal influenza and vaccination
MedlinePlus has links to further information about influenza and about vaccination (in English and Spanish)
The Thai National Influenza Center monitors influenza activity throughout Thailand
doi:10.1371/journal.pmed.1001829
PMCID: PMC4444096  PMID: 26011712
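For readers unfamiliar with the decision metric, the incremental cost-effectiveness ratio discussed above is simple arithmetic: incremental cost divided by DALYs averted, compared against a willingness-to-pay threshold. The numbers below are purely illustrative, not the study's estimates.

```python
def icer(cost_policy, cost_baseline, dalys_policy, dalys_baseline):
    """Incremental cost per DALY averted relative to the baseline policy."""
    return (cost_policy - cost_baseline) / (dalys_baseline - dalys_policy)

# Hypothetical numbers: the new policy costs more but averts 4,000 DALYs.
ratio = icer(cost_policy=12_000_000, cost_baseline=2_000_000,
             dalys_policy=46_000, dalys_baseline=50_000)
WTP = 5_800   # an invented GDP-per-capita-style threshold, per the WHO criterion
print(f"ICER = {ratio:,.0f} international dollars per DALY averted")
print("cost-effective" if ratio < WTP else "not cost-effective", "at this threshold")
```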
8.  Computational framework for the prediction of transcription factor binding sites by multiple data integration 
BMC Neuroscience  2006;7(Suppl 1):S8.
Control of gene expression is essential to the establishment and maintenance of all cell types, and its dysregulation is involved in the pathogenesis of several diseases. Accurate computational predictions of transcription factor regulation may thus help in understanding complex diseases, including mental disorders in which dysregulation of neural gene expression is thought to play a key role. However, the biological mechanisms underlying the regulation of gene expression are not completely understood, and predictions made by bioinformatics tools typically have poor specificity.
We developed a bioinformatics workflow for the prediction of transcription factor binding sites from several independent datasets. We show the advantages of integrating information based on evolutionary conservation and gene expression when tackling the problem of binding site prediction. Consistent results were obtained on a large simulated dataset consisting of 13,050 in silico promoter sequences, on a set of 161 human gene promoters for which binding sites are known, and on a smaller set of promoters of Myc target genes.
Our computational framework for binding site prediction can integrate multiple sources of data, and its performance was tested on different datasets. Our results show that integrating information from multiple data sources, such as genomic sequence of genes' promoters, conservation over multiple species, and gene expression data, indeed improves the accuracy of computational predictions.
doi:10.1186/1471-2202-7-S1-S8
PMCID: PMC1775048  PMID: 17118162
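One ingredient of such a workflow, scanning promoters with a position weight matrix and rewarding conserved hits, can be sketched as follows. The PWM, the conservation track, and the additive weighting rule are all invented for illustration.

```python
# Sketch: PWM scan of a promoter with hits up-weighted by a per-base
# conservation score (toy stand-in for multi-species conservation data).
import numpy as np

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}
pwm = np.log2(np.array([[0.8, 0.1, 0.1, 0.7],    # A
                        [0.1, 0.1, 0.1, 0.1],    # C
                        [0.05, 0.7, 0.1, 0.1],   # G
                        [0.05, 0.1, 0.7, 0.1]])  # T
              / 0.25)                            # log-odds vs uniform background

def scan(seq, conservation, w=1.0):
    width = pwm.shape[1]
    hits = []
    for i in range(len(seq) - width + 1):
        score = sum(pwm[BASES[b], j] for j, b in enumerate(seq[i:i + width]))
        hits.append((i, score + w * np.mean(conservation[i:i + width])))
    return sorted(hits, key=lambda h: -h[1])

print(scan("TAGTAAGTA", [0.2, 0.9, 0.9, 0.9, 0.8, 0.3, 0.2, 0.1, 0.2])[:3])
```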
9.  PhyloGibbs: A Gibbs Sampling Motif Finder That Incorporates Phylogeny 
PLoS Computational Biology  2005;1(7):e67.
A central problem in the bioinformatics of gene regulation is to find the binding sites for regulatory proteins. One of the most promising approaches toward identifying these short and fuzzy sequence patterns is the comparative analysis of orthologous intergenic regions of related species. This analysis is complicated by various factors. First, one needs to take the phylogenetic relationship between the species into account in order to distinguish conservation that is due to the occurrence of functional sites from spurious conservation that is due to evolutionary proximity. Second, one has to deal with the complexities of multiple alignments of orthologous intergenic regions, and one has to consider the possibility that functional sites may occur outside of conserved segments. Here we present a new motif sampling algorithm, PhyloGibbs, that runs on arbitrary collections of multiple local sequence alignments of orthologous sequences. The algorithm searches over all ways in which an arbitrary number of binding sites for an arbitrary number of transcription factors (TFs) can be assigned to the multiple sequence alignments. These binding site configurations are scored by a Bayesian probabilistic model that treats aligned sequences by a model for the evolution of binding sites and “background” intergenic DNA. This model takes the phylogenetic relationship between the species in the alignment explicitly into account. The algorithm uses simulated annealing and Markov chain Monte Carlo (MCMC) sampling to rigorously assign posterior probabilities to all the binding sites that it reports. In tests on synthetic data and real data from five Saccharomyces species, our algorithm performs significantly better than four other motif-finding algorithms, including algorithms that also take phylogeny into account. Our results also show that, in contrast to the other algorithms, PhyloGibbs can make realistic estimates of the reliability of its predictions. Our tests suggest that, running on the five-species multiple alignment of a single gene's upstream region, PhyloGibbs on average recovers over 50% of all binding sites in S. cerevisiae at a specificity of about 50%, and 33% of all binding sites at a specificity of about 85%. We also tested PhyloGibbs on collections of multiple alignments of intergenic regions that were recently annotated, based on ChIP-on-chip data, to contain binding sites for the same TF. We compared PhyloGibbs's results with the previous analysis of these data using six other motif-finding algorithms. For 16 of 21 TFs for which all other motif-finding methods failed to find a significant motif, PhyloGibbs did recover a motif that matches the literature consensus. In 11 cases where there was disagreement in the results we compiled lists of known target genes from the literature, and found that running PhyloGibbs on their regulatory regions yielded a binding motif matching the literature consensus in all but one of the cases. Interestingly, these literature gene lists had little overlap with the targets annotated based on the ChIP-on-chip data. The PhyloGibbs code can be downloaded from http://www.biozentrum.unibas.ch/~nimwegen/cgi-bin/phylogibbs.cgi or http://www.imsc.res.in/~rsidd/phylogibbs. The full set of predicted sites from our tests on yeast is available at http://www.swissregulon.unibas.ch.
Synopsis
Computational discovery of regulatory sites in intergenic DNA is one of the central problems in bioinformatics. Up until recently motif finders would typically take one of the following two general approaches. Given a known set of co-regulated genes, one searches their promoter regions for significantly overrepresented sequence motifs. Alternatively, in a “phylogenetic footprinting” approach one searches multiple alignments of orthologous intergenic regions for short segments that are significantly more conserved than expected based on the phylogeny of the species.
In this work the authors present an algorithm, PhyloGibbs, that combines these two approaches into one integrated Bayesian framework. The algorithm searches over all ways in which an arbitrary number of binding sites for an arbitrary number of transcription factors can be assigned to arbitrary collections of multiple sequence alignments while taking into account the phylogenetic relations between the sequences.
The authors perform a number of tests on synthetic data and real data from Saccharomyces genomes in which PhyloGibbs significantly outperforms other existing methods. Finally, a novel anneal-and-track strategy allows PhyloGibbs to make accurate estimates of the reliability of its predictions.
doi:10.1371/journal.pcbi.0010067
PMCID: PMC1309704  PMID: 16477324
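PhyloGibbs builds a phylogenetic model into the classic Gibbs motif-sampling move; the sketch below shows only that classic move (resampling one sequence's site position from the posterior implied by the other sequences' current sites), without the evolutionary model. The sequences and motif width are toy choices.

```python
# Sketch of the basic Gibbs motif-sampling step that PhyloGibbs extends.
import numpy as np

BASES = "ACGT"
rng = np.random.default_rng(2)

def profile(seqs, pos, w, skip):
    c = np.ones((4, w))                          # add-one pseudocounts
    for idx, (s, p) in enumerate(zip(seqs, pos)):
        if idx != skip:
            for j in range(w):
                c[BASES.index(s[p + j]), j] += 1
    return c / c.sum(axis=0)

def gibbs(seqs, w=4, iters=200):
    pos = [int(rng.integers(0, len(s) - w + 1)) for s in seqs]
    for _ in range(iters):
        for k, s in enumerate(seqs):
            theta = profile(seqs, pos, w, skip=k)  # motif model from the others
            scores = np.array([np.prod([theta[BASES.index(s[i + j]), j]
                                        for j in range(w)])
                               for i in range(len(s) - w + 1)])
            pos[k] = int(rng.choice(len(scores), p=scores / scores.sum()))
    return pos

seqs = ["TTACGTAA", "GGTACGTT", "ACGTAGGC", "CCTACGTG"]
print([s[i:i + 4] for s, i in zip(seqs, gibbs(seqs))])  # often recovers the shared ACGT core
```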
10.  Rapid Sampling of Molecular Motions with Prior Information Constraints 
PLoS Computational Biology  2009;5(2):e1000295.
Proteins are active, flexible machines that perform a range of different functions. Innovative experimental approaches may now provide limited partial information about conformational changes along motion pathways of proteins. There is therefore a need for computational approaches that can efficiently incorporate prior information into motion prediction schemes. In this paper, we present PathRover, a general setup designed for the integration of prior information into the motion planning algorithm of rapidly exploring random trees (RRT). Each suggested motion pathway comprises a sequence of low-energy clash-free conformations that satisfy an arbitrary number of prior information constraints. These constraints can be derived from experimental data or from expert intuition about the motion. The incorporation of prior information is very straightforward and significantly narrows down the vast search in the typically high-dimensional conformational space, leading to a dramatic reduction in running time. To allow the use of state-of-the-art energy functions and conformational sampling, we have integrated this framework into Rosetta, an accurate protocol for diverse types of structural modeling. The suggested framework can serve as an effective complementary tool for molecular dynamics, Normal Mode Analysis, and other prevalent techniques for predicting motion in proteins. We applied our framework to three different model systems. We show that a limited set of experimentally motivated constraints may effectively bias the simulations toward satisfying diverse predicates, from distance constraints to enforcement of loop closure. In particular, our analysis sheds light on mechanisms of protein domain swapping and on the role of different residues in the motion.
Author Summary
Incorporating external knowledge into computational frameworks is a challenge of prime importance in many fields of biological research. In this study, we show how computational power can be harnessed to make use of limited external information and to more effectively simulate the molecular motion of proteins. While experimentally solved protein structures restrict our knowledge to static molecular “snapshots”, a vast number of proteins are flexible entities that constantly change shape. Protein motion is therefore intrinsically related to protein function. State-of-the-art experimental approaches are still limited in the information that they provide about protein motion. Therefore, we suggest here a very general computational framework that can take into account diverse external constraints and include experimental information or expert intuition. We explore in detail several biological systems of prime interest, including domain swapping and substrate binding, and show how limited partial information enhances the accuracy of predictions. Suggested motion pathways form detailed lab-testable hypotheses and can be of great interest to both experimentalists and theoreticians.
doi:10.1371/journal.pcbi.1000295
PMCID: PMC2637990  PMID: 19247429
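The RRT-with-constraints idea can be pictured in a toy two-dimensional "conformation space": grow a tree toward random samples, but accept an extension only if it satisfies a prior-information predicate. The goal, step size, and constraint below are invented.

```python
# Sketch: rapidly exploring random tree (RRT) that rejects extensions
# violating a constraint, a stand-in for "clash-free + prior information".
import math, random

random.seed(3)
GOAL, STEP = (9.0, 9.0), 0.5

def satisfies_constraints(p):
    return math.dist(p, (5.0, 5.0)) > 1.5        # invented distance constraint

tree = {(0.0, 0.0): None}                        # node -> parent
for _ in range(5000):
    q = (random.uniform(0, 10), random.uniform(0, 10))
    near = min(tree, key=lambda n: math.dist(n, q))
    d = math.dist(near, q)
    if d == 0:
        continue
    new = (near[0] + STEP * (q[0] - near[0]) / d,
           near[1] + STEP * (q[1] - near[1]) / d)
    if satisfies_constraints(new):
        tree[new] = near
        if math.dist(new, GOAL) < STEP:          # reached the target "conformation"
            break

print(f"tree size: {len(tree)} nodes")
```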
11.  Usability testing of Avoiding Diabetes Thru Action Plan Targeting (ADAPT) decision support for integrating care-based counseling of pre-diabetes in an electronic health record 
Purpose
Usability testing can be used to evaluate human-computer interaction (HCI) and communication in shared decision making (SDM) for patient-provider behavioral change and behavioral contracting. Traditional usability evaluations that pair scripted or mock patient scenarios with think-aloud protocol analysis provide a means to identify HCI issues. In this paper we describe the application of these methods in the evaluation of the Avoiding Diabetes Thru Action Plan Targeting (ADAPT) tool, and test the usability of the tool to support the ADAPT framework for integrated care counseling of pre-diabetes. Think-aloud protocol analysis typically does not assess how patient-provider interactions are affected in “live” clinical workflow or whether a tool is ultimately successful. Therefore, “near-live” clinical simulations employing applied simulation methods were used to complement the think-aloud results. This complementary usability technique tested end-user HCI and tool performance by more closely mimicking the clinical workflow, capturing interaction sequences, and assessing the functionality of computer module prototypes within the clinician's workflow. We expected this method to complement the think-aloud analysis and yield different usability findings. Together, this mixed-method evaluation provided comprehensive and realistic feedback for iterative refinement of the ADAPT system prior to implementation.
Methods
The study employed two phases of testing of a new interactive ADAPT tool that embedded an evidence-based shared goal setting component into primary care workflow for dealing with pre-diabetes counseling within a commercial physician office electronic health record (EHR). Phase I applied usability testing that involved “think-aloud” protocol analysis of 8 primary care providers interacting with several scripted clinical scenarios. Phase II used “near-live” clinical simulations of 5 providers interacting with standardized trained patient actors enacting the clinical scenario of counseling for pre-diabetes, each of whom had a pedometer that recorded the number of steps taken over a week. In both phases, all sessions were audio-taped and motion screen-capture software was activated for onscreen recordings. Transcripts were coded using iterative qualitative content analysis methods.
Results
In Phase I, the impact of the components and layout of ADAPT on users' Navigation, Understandability, and Workflow was associated with the largest volume of negative comments (i.e., approximately 80% of end-user commentary), while the Usability and Content of ADAPT drew more positive than negative commentary. The heuristic category of Usability had a positive-to-negative comment ratio of 2.1, reflecting positive perception of the usability of the tool, its functionality, and overall co-productive utilization of ADAPT. However, there were mixed perceptions about Content (i.e., how the information was displayed, organized, and described in the tool).
In Phase II, the duration of patient encounters was approximately 10 minutes, with all of the Patient Instructions (prescriptions) and behavioral contracting activated at the end of each visit. Upon activation, providers accepted the pathway prescribed by the tool 100% of the time and completed all the fields in the tool in the simulation cases. Only 14% of encounter time was spent using the functionality of the ADAPT tool in terms of keystrokes and entering relevant data; the rest was spent on communication and dialogue to populate the patient instructions. In all cases, the interaction sequence of reviewing and discussing the patient's exercise and diet was linked to the functionality of the ADAPT tool in terms of monitoring, response-efficacy, self-efficacy, and negotiation in the patient-provider dialogue. There was a change from one-way dialogue to two-way dialogue and negotiation that ended in a behavioral contract. This change reflected the tool's sequence, which supports recording current exercise and diet and then setting diet and exercise goals to reduce the risk of diabetes onset.
Conclusions
This study demonstrated that “think-aloud” protocol analysis combined with “near-live” clinical simulations provided a successful usability evaluation of a new primary care pre-diabetes shared goal setting tool. Each phase of the study provided complementary observations on problems with the new onscreen tool and was used to show the influence of the ADAPT framework on usability, workflow integration, and communication between the patient and provider. The think-aloud tests with providers showed that the tool can be used according to the ADAPT framework (exercise-to-diet behavior change and tool utilization), while the clinical simulations revealed that the ADAPT framework realistically supports patient-provider communication toward a behavioral change contract. SDM interactions and mechanisms affecting protocol-based care can be captured more completely by combining “near-live” clinical simulations with traditional think-aloud analysis. More analysis is required to verify whether the rich communication actions found in Phase II complement clinical workflows.
doi:10.1016/j.ijmedinf.2014.05.002
PMCID: PMC4212327  PMID: 24981988
Usability; evidence-based medicine; protocol-based care; electronic health records; clinical simulations; behavioral change; behavioral contracting; healthcare providers; patient counseling
12.  Regression Analysis of Combined Gene Expression Regulation in Acute Myeloid Leukemia 
PLoS Computational Biology  2014;10(10):e1003908.
Gene expression is a combinatorial function of genetic/epigenetic factors such as copy number variation (CNV), DNA methylation (DM), transcription factor (TF) occupancy, and microRNA (miRNA) post-transcriptional regulation. At the maturity of microarray/sequencing technologies, large amounts of data measuring the genome-wide signals of those factors became available from the Encyclopedia of DNA Elements (ENCODE) and The Cancer Genome Atlas (TCGA). However, there is a lack of an integrative model to take full advantage of these rich yet heterogeneous data. To this end, we developed RACER (Regression Analysis of Combined Expression Regulation), which fits mRNA expression as the response, using as explanatory variables the TF data from ENCODE and the CNV, DM, and miRNA expression signals from TCGA. Briefly, RACER first infers the sample-specific regulatory activities of TFs and miRNAs, which are then used as inputs to infer specific TF/miRNA-gene interactions. Such a two-stage regression framework circumvents a common difficulty in integrating ENCODE data measured in a generic cell line with the sample-specific TCGA measurements. As a case study, we integrated Acute Myeloid Leukemia (AML) data from TCGA and the related TF binding data measured in K562 from ENCODE. As a proof-of-concept, we first verified our model formalism by 10-fold cross-validation on predicting gene expression. We next evaluated RACER on recovering known regulatory interactions, and demonstrated its superior statistical power over existing methods in detecting known miRNA/TF targets. Additionally, we developed a feature selection procedure, which identified 18 regulators whose activities clustered consistently with cytogenetic risk groups. One of the selected regulators is miR-548p, whose inferred targets were significantly enriched for leukemia-related pathways, implicating its novel role in AML pathogenesis. Moreover, survival analysis using the inferred activities identified C-Fos as a potential AML prognostic marker. Together, we provide a novel framework that successfully integrated the TCGA and ENCODE data in revealing the AML-specific regulatory program at a global level.
Author Summary
Recent studies from The Cancer Genome Atlas (TCGA) showed that most Acute Myeloid Leukemia (AML) patients lack DNA mutations that could potentially explain the tumorigenesis, motivating a systematic approach to elucidate aberrant molecular signatures at the transcriptional and epigenetic levels. Using recently available data from two large consortia, namely the Encyclopedia of DNA Elements and TCGA, we developed a novel computational model to infer the regulatory activities of the expression regulators and their target genes in AML samples. Our analysis revealed 18 regulators whose dysregulation contributed significantly to explaining the global mRNA expression changes. Encouragingly, the inferred activities of these regulatory features followed a pattern consistent with the cytogenetic phenotypes of the AML patients. Among these regulators, we identified microRNA hsa-miR-548p, whose regulatory relationships with leukemia-related genes including YY1 suggest its novel role in AML pathogenesis. Additionally, we discovered that the inferred activities of transcription factor C-Fos can be used as a prognostic marker to characterize survival rate of the AML patients. Together, we demonstrated an effective model that can integrate useful information from a large amount of heterogeneous data to dissect regulatory effects. Furthermore, the novel biological findings from this study may inform future experimental research in AML.
doi:10.1371/journal.pcbi.1003908
PMCID: PMC4207489  PMID: 25340776
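RACER's two-stage regression logic can be sketched with synthetic matrices: stage one infers per-sample regulator activities from expression and a generic binding profile, stage two regresses each gene on those inferred activities. The dimensions, noise level, and use of plain least squares are simplifications of the published model.

```python
import numpy as np
from numpy.linalg import lstsq

rng = np.random.default_rng(4)
n_genes, n_samples, n_regs = 200, 30, 10
B = rng.integers(0, 2, size=(n_genes, n_regs)).astype(float)  # ENCODE-style binding
A_true = rng.normal(size=(n_regs, n_samples))                 # hidden activities
E = B @ A_true + rng.normal(scale=0.5, size=(n_genes, n_samples))  # TCGA-style expression

# Stage 1: per-sample regulator activities (regress expression on binding features)
A_hat, *_ = lstsq(B, E, rcond=None)            # shape: (n_regs, n_samples)

# Stage 2: per-gene regulator coefficients (regress each gene on activities)
W_hat, *_ = lstsq(A_hat.T, E.T, rcond=None)    # shape: (n_regs, n_genes)
print("recovered interaction matrix:", W_hat.T.shape)
```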
13.  Integrating Functional Data to Prioritize Causal Variants in Statistical Fine-Mapping Studies 
PLoS Genetics  2014;10(10):e1004722.
Standard statistical approaches for prioritization of variants for functional testing in fine-mapping studies either use marginal association statistics or estimate posterior probabilities for variants to be causal under simplifying assumptions. Here, we present a probabilistic framework that integrates association strength with functional genomic annotation data to improve accuracy in selecting plausible causal variants for functional validation. A key feature of our approach is that it empirically estimates the contribution of each functional annotation to the trait of interest directly from summary association statistics while allowing for multiple causal variants at any risk locus. We devise efficient algorithms that estimate the parameters of our model across all risk loci to further increase performance. Using simulations starting from the 1000 Genomes data, we find that our framework consistently outperforms the current state-of-the-art fine-mapping methods, reducing the number of variants that need to be selected to capture 90% of the causal variants from an average of 13.3 to 10.4 SNPs per locus (as compared to the next-best performing strategy). Furthermore, we introduce a cost-to-benefit optimization framework for determining the number of variants to be followed up in functional assays and assess its performance using real and simulation data. We validate our findings using a large scale meta-analysis of four blood lipids traits and find that the relative probability for causality is increased for variants in exons and transcription start sites and decreased in repressed genomic regions at the risk loci of these traits. Using these highly predictive, trait-specific functional annotations, we estimate causality probabilities across all traits and variants, reducing the size of the 90% confidence set from an average of 17.5 to 13.5 variants per locus in this data.
Author Summary
Genome-wide association studies (GWAS) have successfully identified numerous regions in the genome that harbor genetic variants that increase risk for various complex traits and diseases. However, it is generally the case that GWAS risk variants are not themselves causally affecting the trait, but rather, are correlated to the true causal variant through linkage disequilibrium (LD). Plausible causal variants are identified in fine-mapping studies through targeted sequencing followed by prioritization of variants for functional validation. In this work, we propose methods that leverage two sources of independent information, the association strength and genomic functional location, to prioritize causal variants. We demonstrate in simulations and empirical data that our approach reduces the number of SNPs that need to be selected for follow-up to identify the true causal variants at GWAS risk loci.
doi:10.1371/journal.pgen.1004722
PMCID: PMC4214605  PMID: 25357204
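The prioritization arithmetic can be illustrated with Wakefield's approximate Bayes factor combined with annotation-based priors; for brevity the sketch assumes a single causal variant per locus, an assumption the paper's framework explicitly relaxes. All numbers are invented.

```python
import numpy as np

def approx_log_bf(z, se, W=0.04):
    """Wakefield's approximate log Bayes factor from a Z-score and standard error."""
    V = se ** 2
    return 0.5 * (np.log(V / (V + W)) + z ** 2 * W / (V + W))

z = np.array([5.2, 4.8, 1.1, 3.9])              # per-SNP summary association statistics
se = np.full(4, 0.05)
annot_weight = np.array([3.0, 1.0, 1.0, 3.0])   # e.g., exon/TSS enrichment vs. baseline

log_post = approx_log_bf(z, se) + np.log(annot_weight / annot_weight.sum())
post = np.exp(log_post - log_post.max())
post /= post.sum()
print(np.round(post, 3))                        # signal and annotation jointly set priorities
```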
14.  Integrated analyses to reconstruct microRNA-mediated regulatory networks in mouse liver using high-throughput profiling 
BMC Genomics  2015;16(Suppl 2):S12.
Background
MicroRNAs (miRNAs) simultaneously target many transcripts through partial complementarity binding, and have emerged as a key type of post-transcriptional regulator for gene expression. How miRNA accomplishes its pleiotropic effects largely depends on its expression and its target repertoire. Previous studies discovered thousands of miRNAs and numerous miRNA target genes mainly through computation and prediction methods which produced high rates of false positive prediction. The development of Argonaute cross-linked immunoprecipitation coupled with high-throughput sequencing (CLIP-Seq) provides a system to effectively determine miRNA target genes. Likewise, the accuracy of dissecting the transcriptional regulation of miRNA genes has been greatly improved by chromatin immunoprecipitation of the transcription factors coupled with sequencing (ChIP-Seq). Elucidation of the miRNA target repertoire will provide an in-depth understanding of the functional roles of microRNA pathways. To reliably reconstruct a miRNA-mediated regulatory network, we established a computational framework using publicly available, sequence-based transcription factor-miRNA databases, including ChIPBase and TransmiR for the TF-miRNA interactions, along with miRNA-target databases, including miRTarBase, TarBase and starBase, for the miRNA-target interactions. We applied the computational framework to elucidate the miRNA-mediated regulatory network in the Mir122a-/- mouse model, which has an altered transcriptome and progressive liver disease.
Results
We applied our computational framework to the miRNA/mRNA expression profiles of Mir122a-/- mutant mice and wild-type mice. The miRNA-mediated network involves 40 curated TFs contributing to the aberrant expression of 65 miRNAs and 723 curated miRNA target genes, of which 56% were found among the differentially expressed genes of Mir122a-/- mice. Hence, the regulatory network disclosed both previously known and many previously unidentified miRNA-mediated regulatory interactions in mutant mice. Moreover, we demonstrate that loss of imprinting at the chromosome 12qF1 region is associated with miRNA overexpression in human hepatocellular carcinoma and stem cells, suggesting initiation of precancerous changes in young mice deficient in miR-122. A group of 9 miRNAs was found to share miR-122 target genes, indicating synergy between miRNAs and target genes by way of multiplicity and cooperativity.
Conclusions
The study provides significant insight into miRNA-mediated regulatory networks. Based on experimentally verified data, this network is highly reliable and effective in revealing previously-undetermined disease-associated molecular mechanisms. This computational framework can be applied to explore the significant TF-miRNA-miRNA target interactions in any complex biological systems with high degrees of confidence.
doi:10.1186/1471-2164-16-S2-S12
PMCID: PMC4331712  PMID: 25707768
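The database-integration step amounts to joining curated interaction tables and filtering by the differential-expression results; the sketch below does this with toy stand-ins for the ChIPBase/TransmiR-style and miRTarBase-style tables.

```python
# Sketch: join curated TF->miRNA edges with curated miRNA->target edges and
# keep only paths whose members are all differentially expressed (toy data).
tf_mirna = {("HNF4A", "miR-122"), ("FOXA1", "miR-21"), ("MYC", "miR-17")}
mirna_target = {("miR-122", "Cd320"), ("miR-122", "Slc35a4"), ("miR-21", "Pten")}
diff_expressed = {"HNF4A", "miR-122", "Cd320", "Pten"}

network = [(tf, mir, gene)
           for tf, mir in tf_mirna
           for mir2, gene in mirna_target
           if mir == mir2 and {tf, mir, gene} <= diff_expressed]
print(network)   # [('HNF4A', 'miR-122', 'Cd320')]
```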
15.  Probabilistic Inference of Transcription Factor Binding from Multiple Data Sources 
PLoS ONE  2008;3(3):e1820.
An important problem in molecular biology is to build a complete understanding of transcriptional regulatory processes in the cell. We have developed a flexible, probabilistic framework to predict TF binding from multiple data sources that differs from the standard hypothesis testing (scanning) methods in several ways. Our probabilistic modeling framework estimates the probability of binding and, thus, naturally reflects our degree of belief in binding. Probabilistic modeling also allows for easy and systematic integration of our binding predictions into other probabilistic modeling methods, such as expression-based gene network inference. The method answers the question of whether the whole analyzed promoter has a binding site, but can also be extended to estimate the binding probability at each nucleotide position. Further, we introduce an extension to model combinatorial regulation by several TFs. Most importantly, the proposed methods can make principled probabilistic inference from multiple evidence sources, such as multiple statistical models (motifs) of the TFs, evolutionary conservation, regulatory potential, CpG islands, nucleosome positioning, DNase hypersensitive sites, ChIP-chip binding segments, and other (prior) sequence-based biological knowledge. We developed both a likelihood and a Bayesian method, where the latter is implemented with a Markov chain Monte Carlo algorithm. Results on a carefully constructed test set from the mouse genome demonstrate that principled data fusion can significantly improve the performance of TF binding prediction methods. We also applied the probabilistic modeling framework to all promoters in the mouse genome, and the results indicate a sparse connectivity between transcriptional regulators and their target promoters. To facilitate analysis of other sequences and additional data, we have developed an on-line web tool, ProbTF, which implements our probabilistic TF binding prediction method using multiple data sources. The test data set, web tool, source code, and supplementary data are available at: http://www.probtf.org.
doi:10.1371/journal.pone.0001820
PMCID: PMC2268002  PMID: 18364997
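The promoter-level question the method answers, whether the analyzed promoter contains a binding site at all, can be posed as a two-component mixture: sum the motif likelihood over all placements and weigh it against a background model. The PWM, background, and prior below are invented.

```python
# Sketch: posterior probability that a promoter contains one binding site,
# marginalizing over all possible site positions (toy mixture model).
import numpy as np

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}
pwm = np.array([[0.7, 0.1, 0.1, 0.1],
                [0.1, 0.7, 0.1, 0.1],
                [0.1, 0.1, 0.7, 0.1],
                [0.1, 0.1, 0.1, 0.7]])      # columns ~ an 'ACGT'-like motif
BG = 0.25                                   # uniform background

def binding_posterior(seq, prior=0.1):
    w = pwm.shape[1]
    p_bg = BG ** len(seq)
    # likelihood with one site: average over placements, background elsewhere
    site_liks = [np.prod([pwm[BASES[seq[i + j]], j] for j in range(w)])
                 * BG ** (len(seq) - w)
                 for i in range(len(seq) - w + 1)]
    p_site = np.mean(site_liks)
    return prior * p_site / (prior * p_site + (1 - prior) * p_bg)

print(f"{binding_posterior('TTACGTTT'):.3f}")   # elevated by the ACGT match
```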
16.  ExprTarget: An Integrative Approach to Predicting Human MicroRNA Targets 
PLoS ONE  2010;5(10):e13534.
Variation in gene expression has been observed in natural populations and associated with complex traits or phenotypes such as disease susceptibility and drug response. Gene expression itself is controlled by various genetic and non-genetic factors. The binding of a class of small RNA molecules, microRNAs (miRNAs), to mRNA transcript targets has recently been demonstrated to be an important mechanism of gene regulation. Because individual miRNAs may regulate the expression of multiple gene targets, a comprehensive and reliable catalogue of miRNA-regulated targets is critical to understanding gene regulatory networks. Though experimental approaches have been used to identify many miRNA targets, for reasons of cost and efficiency current miRNA target identification still relies largely on computational algorithms that aim to take advantage of different biochemical/thermodynamic properties of the sequences of miRNAs and their gene targets. A novel approach, ExprTarget, is therefore proposed here to integrate some of the most frequently invoked methods (miRanda, PicTar, TargetScan) as well as the genome-wide HapMap miRNA and mRNA expression datasets generated in our laboratory. To our knowledge, this dataset constitutes the first miRNA expression profiling in the HapMap lymphoblastoid cell lines. We conducted diagnostic tests of the existing computational solutions using the experimentally supported targets in TarBase as the gold standard. To gain insight into the biases that arise from such an analysis, we investigated the effect of the choice of gold standard on the evaluation of the various computational tools. We analyzed the performance of ExprTarget using both ROC curve analysis and cross-validation. We show that ExprTarget greatly improves miRNA target prediction relative to the individual prediction algorithms in terms of sensitivity and specificity. We also developed an online database, ExprTargetDB, of human miRNA targets predicted by our approach that integrates gene expression profiling into a broader framework involving important features of miRNA target site predictions.
doi:10.1371/journal.pone.0013534
PMCID: PMC2958831  PMID: 20975837
17.  Discovering Motifs in Ranked Lists of DNA Sequences 
PLoS Computational Biology  2007;3(3):e39.
Computational methods for discovery of sequence elements that are enriched in a target set compared with a background set are fundamental in molecular biology research. One example is the discovery of transcription factor binding motifs that are inferred from ChIP–chip (chromatin immuno-precipitation on a microarray) measurements. Several major challenges in sequence motif discovery still require consideration: (i) the need for a principled approach to partitioning the data into target and background sets; (ii) the lack of rigorous models and of an exact p-value for measuring motif enrichment; (iii) the need for an appropriate framework for accounting for motif multiplicity; (iv) the tendency, in many of the existing methods, to report presumably significant motifs even when applied to randomly generated data. In this paper we present a statistical framework for discovering enriched sequence elements in ranked lists that resolves these four issues. We demonstrate the implementation of this framework in a software application, termed DRIM (discovery of rank imbalanced motifs), which identifies sequence motifs in lists of ranked DNA sequences. We applied DRIM to ChIP–chip and CpG methylation data and obtained the following results. (i) Identification of 50 novel putative transcription factor (TF) binding sites in yeast ChIP–chip data. The biological function of some of them was further investigated to gain new insights on transcription regulation networks in yeast. For example, our discoveries enable the elucidation of the network of the TF ARO80. Another finding concerns a systematic TF binding enhancement to sequences containing CA repeats. (ii) Discovery of novel motifs in human cancer CpG methylation data. Remarkably, most of these motifs are similar to DNA sequence elements bound by the Polycomb complex that promotes histone methylation. Our findings thus support a model in which histone methylation and CpG methylation are mechanistically linked. Overall, we demonstrate that the statistical framework embodied in the DRIM software tool is highly effective for identifying regulatory sequence elements in a variety of applications ranging from expression and ChIP–chip to CpG methylation data. DRIM is publicly available at http://bioinfo.cs.technion.ac.il/drim.
Author Summary
A computational problem with many applications in molecular biology is to identify short DNA sequence patterns (motifs) that are significantly overrepresented in a target set of genomic sequences relative to a background set of genomic sequences. One example is a target set that contains DNA sequences to which a specific transcription factor protein was experimentally measured as bound while the background set contains sequences to which the same transcription factor was not bound. Overrepresented sequence motifs in the target set may represent a subsequence that is molecularly recognized by the transcription factor. An inherent limitation of the above formulation of the problem lies in the fact that in many cases data cannot be clearly partitioned into distinct target and background sets in a biologically justified manner. We describe a statistical framework for discovering motifs in a list of genomic sequences that are ranked according to a biological parameter or measurement (e.g., transcription factor to sequence binding measurements). Our approach circumvents the need to partition the data into target and background sets using arbitrarily set parameters. The framework is implemented in a software tool called DRIM. The application of DRIM led to the identification of novel putative transcription factor binding sites in yeast and to the discovery of previously unknown motifs in CpG methylation regions in human cancer cell lines.
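To make the ranked-list idea concrete, the following sketch scans every prefix of a ranked 0/1 motif-occurrence vector and keeps the best hypergeometric tail probability, in the spirit of a minimal-hypergeometric score. DRIM's exact p-value additionally corrects for the many cutoffs considered, which this toy version does not.

from scipy.stats import hypergeom

def min_hypergeom_score(occurrences):
    """Best prefix enrichment for a ranked 0/1 occurrence vector.

    occurrences[i] is 1 if the motif occurs in the i-th ranked sequence.
    Every prefix is tried as a candidate target set; the smallest
    hypergeometric tail probability is returned (uncorrected for the
    multiple cutoffs, unlike the exact p-value used by DRIM).
    """
    N, K = len(occurrences), sum(occurrences)
    best, k = 1.0, 0
    for n in range(1, N):
        k += occurrences[n - 1]
        p = hypergeom.sf(k - 1, N, K, n)   # P(X >= k), X ~ Hypergeom(N, K, n)
        best = min(best, p)
    return best

print(min_hypergeom_score([1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0]))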
doi:10.1371/journal.pcbi.0030039
PMCID: PMC1829477  PMID: 17381235
18.  Building Disease-Specific Drug-Protein Connectivity Maps from Molecular Interaction Networks and PubMed Abstracts 
PLoS Computational Biology  2009;5(7):e1000450.
The recently proposed concept of molecular connectivity maps enables researchers to integrate experimental measurements of genes, proteins, metabolites, and drug compounds under similar biological conditions. The study of these maps provides opportunities for future toxicogenomics and drug discovery applications. We developed a computational framework to build disease-specific drug-protein connectivity maps. We integrated gene/protein and drug connectivity information based on protein interaction networks and literature mining, without requiring gene expression profile information derived from drug perturbation experiments on disease samples. We describe the development and application of this computational framework using Alzheimer's Disease (AD) as a primary example in three steps. First, molecular interaction networks were incorporated to reduce bias and improve the relevance of AD seed proteins. Second, PubMed abstracts were used to retrieve enriched drug terms that are indirectly associated with AD through molecular mechanistic studies. Third, a comprehensive AD connectivity map was created by relating enriched drugs and related proteins in the literature. We showed that this molecular connectivity map development approach outperformed both curated drug target databases and conventional information retrieval systems. Our initial explorations of the AD connectivity map yielded a new hypothesis that diltiazem and quinidine may be investigated as candidate drugs for AD treatment. Molecular connectivity maps derived computationally can help study molecular signature differences between different classes of drugs in specific disease contexts. To achieve good overall data coverage and quality, a series of statistical methods have been developed to overcome high levels of data noise in biological networks and literature mining results. Further development of computational molecular connectivity maps to cover major disease areas will likely establish a new model for drug development, in which the therapeutic/toxicological profiles of candidate drugs can be checked computationally before costly clinical trials begin.
Author Summary
Molecular connectivity maps between drugs and a wide range of bio-molecular entities can help researchers to study and compare the molecular therapeutic/toxicological profiles of many candidate drugs. Recent studies in this area have focused on linking drug molecules and genes in specific disease contexts using drug-perturbed gene expression experiments, which can be costly and time-consuming to derive. In this paper, we developed a computational framework to build disease-specific drug-protein connectivity maps by mining molecular interaction networks and PubMed abstracts. Using Alzheimer's Disease (AD) as a case study, we described how drug-protein molecular connectivity maps can be constructed to overcome the data coverage and noise issues inherent in automatically extracted results. We showed that this new approach outperformed both curated drug target databases and conventional text mining systems in retrieving disease-related drugs, with an overall balanced performance of sensitivity, specificity, and positive predictive value. The AD molecular connectivity map contained novel information on AD-related genes/proteins, AD candidate drugs, and protein therapeutic/toxicological profiles of all the AD candidate drugs. Bi-clustering of the molecular connectivity map revealed interesting patterns of functionally similar proteins and drugs, thereby creating new opportunities for future drug development applications.
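One generic primitive behind "enriched drug terms" is a one-sided Fisher's exact test over abstract counts, sketched below with invented numbers. The paper's pipeline layers additional statistics on top of literature mining, so this is a stand-in, not their method.

from scipy.stats import fisher_exact

def drug_enrichment(drug_in_seed, seed_total, drug_total, all_total):
    """One-sided test: is a drug term over-represented among abstracts
    mentioning seed proteins, relative to the whole mined corpus?"""
    in_seed_without = seed_total - drug_in_seed
    out_seed_with = drug_total - drug_in_seed
    out_seed_without = all_total - seed_total - out_seed_with
    table = [[drug_in_seed, in_seed_without],
             [out_seed_with, out_seed_without]]
    return fisher_exact(table, alternative="greater")

# Hypothetical counts: 40 of 500 seed-protein abstracts mention the drug,
# versus 200 of 100,000 abstracts overall.
odds, p = drug_enrichment(40, 500, 200, 100_000)
print("odds ratio %.1f, p = %.2g" % (odds, p))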
doi:10.1371/journal.pcbi.1000450
PMCID: PMC2709445  PMID: 19649302
19.  Incorporating functional inter-relationships into protein function prediction algorithms 
BMC Bioinformatics  2009;10:142.
Background
Functional classification schemes (e.g. the Gene Ontology) that serve as the basis for annotation efforts in several organisms are often the source of gold standard information for computational efforts at supervised protein function prediction. While successful function prediction algorithms have been developed, few previous efforts have utilized more than the protein-to-functional-class label information provided by such knowledge bases. For instance, the Gene Ontology not only captures protein annotations to a set of functional classes, but it also arranges these classes in a DAG-based hierarchy that captures rich inter-relationships between different classes. These inter-relationships present both opportunities, such as the potential for additional training examples for small classes from larger related classes, and challenges, such as a harder-to-learn distinction between similar GO terms, for standard classification-based approaches.
Results
We propose a method to enhance the performance of classification-based protein function prediction algorithms by incorporating the inter-relationships between the functional classes that constitute functional classification schemes. Using a standard measure for evaluating the semantic similarity between nodes in an ontology, we quantify these inter-relationships and incorporate them into the k-nearest neighbor classifier. We present experiments on several large genomic data sets, each of which is used for the modeling and prediction of over one hundred classes from the GO Biological Process ontology. The results show that this incorporation produces more accurate predictions for a large number of the functional classes considered, and that the classes benefiting most from this approach are those containing the fewest members. In addition, we show how our proposed framework can be used to integrate information from the entire GO hierarchy to improve the accuracy of predictions made over a set of base classes. Finally, we provide qualitative and quantitative evidence that this incorporation of functional inter-relationships enables the discovery of interesting biology in the form of novel functional annotations for several yeast proteins, such as Sna4, Rtn1 and Lin1.
Conclusion
We implemented and evaluated a methodology for incorporating interrelationships between functional classes into a standard classification-based protein function prediction algorithm. Our results show that this incorporation can help improve the accuracy of such algorithms, and help uncover novel biology in the form of previously unknown functional annotations. The complete source code, a sample data set and the additional files for this paper are available free of charge for non-commercial use at .
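A hedged sketch of how such inter-relationships can be folded into a k-nearest-neighbor classifier: each neighbor votes for a target GO class in proportion to the semantic similarity between its own annotations and that class. The weighting scheme, names, and toy similarity function are illustrative assumptions, not the authors' exact formulation.

import numpy as np

def weighted_knn_score(query, examples, labels, sim, target_class, k=10):
    """Score a query protein for target_class with a kNN whose votes are
    weighted by GO semantic similarity between each neighbor's annotated
    classes and the target class (sim could be, e.g., Lin's measure)."""
    dists = np.linalg.norm(examples - query, axis=1)
    neighbors = np.argsort(dists)[:k]
    score = 0.0
    for i in neighbors:
        # A neighbor annotated only to related classes still contributes,
        # in proportion to how similar those classes are to the target.
        contrib = max((sim(c, target_class) for c in labels[i]), default=0.0)
        score += contrib / (1.0 + dists[i])
    return score

# Toy usage with a trivial similarity function.
X = np.random.default_rng(1).normal(size=(20, 5))
labs = [{"GO:A"} if i % 2 else {"GO:B"} for i in range(20)]
sim = lambda a, b: 1.0 if a == b else 0.4
print(weighted_knn_score(X[0], X[1:], labs[1:], sim, "GO:A"))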
doi:10.1186/1471-2105-10-142
PMCID: PMC2693438  PMID: 19435516
20.  An Integrated Model of Multiple-Condition ChIP-Seq Data Reveals Predeterminants of Cdx2 Binding 
PLoS Computational Biology  2014;10(3):e1003501.
Regulatory proteins can bind to different sets of genomic targets in various cell types or conditions. To reliably characterize such condition-specific regulatory binding we introduce MultiGPS, an integrated machine learning approach for the analysis of multiple related ChIP-seq experiments. MultiGPS is based on a generalized Expectation Maximization framework that shares information across multiple experiments for binding event discovery. We demonstrate that our framework enables the simultaneous modeling of sparse condition-specific binding changes, sequence dependence, and replicate-specific noise sources. MultiGPS encourages consistency in reported binding event locations across multiple-condition ChIP-seq datasets and provides accurate estimation of ChIP enrichment levels at each event. MultiGPS's multi-experiment modeling approach thus provides a reliable platform for detecting differential binding enrichment across experimental conditions. We demonstrate the advantages of MultiGPS with an analysis of Cdx2 binding in three distinct developmental contexts. By accurately characterizing condition-specific Cdx2 binding, MultiGPS enables novel insight into the mechanistic basis of Cdx2 site selectivity. Specifically, the condition-specific Cdx2 sites characterized by MultiGPS are highly associated with pre-existing genomic context, suggesting that such sites are pre-determined by cell-specific regulatory architecture. However, MultiGPS-defined condition-independent sites are not predicted by pre-existing regulatory signals, suggesting that Cdx2 can bind to a subset of locations regardless of genomic environment. A summary of this paper appears in the proceedings of the RECOMB 2014 conference, April 2–5.
Author Summary
Many proteins that regulate the activity of other genes do so by attaching to the genome at specific binding sites. The locations that a given regulatory protein will bind, and the strength or frequency of such binding at an individual location, can vary depending on the cell type. We can profile the locations that a protein binds in a particular cell type using an experimental method called ChIP-seq, followed by computational interpretation of the data. However, since the experimental data are typically noisy, it is often difficult to compare the computational analyses of ChIP-seq data across multiple experiments in order to understand any differences in binding that may occur in different cell types. In this paper, we present a new computational method named MultiGPS for simultaneously analyzing multiple related ChIP-seq experiments in an integrated manner. By analyzing all the data together in an appropriate way, we can gain a more accurate picture of where the profiled protein is binding to the genome, and we can more easily and reliably detect differences in protein binding across cell types. We demonstrate the MultiGPS software using a new analysis of the regulatory protein Cdx2 in three different developmental cell types.
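The information-sharing idea can be caricatured with a toy EM over 1D read positions: event locations (component means) are shared across conditions while mixing weights are condition-specific, so an event may be strong in one condition and absent in another. MultiGPS's actual model, with sequence dependence, replicate-specific noise, and sparse differential binding, is substantially richer; everything below is invented for illustration.

import numpy as np

def multi_condition_em(reads, n_events=2, n_iter=50, sigma=50.0):
    """Toy EM sharing event positions across conditions.

    reads: list of 1D read-position arrays, one per condition. Component
    means (event positions) are pooled over all conditions in the M-step;
    mixing weights stay condition-specific.
    """
    lo = min(r.min() for r in reads)
    hi = max(r.max() for r in reads)
    mu = np.linspace(lo, hi, n_events)               # shared event positions
    w = [np.full(n_events, 1.0 / n_events) for _ in reads]
    for _ in range(n_iter):
        num = np.zeros(n_events)
        den = np.zeros(n_events)
        for c, r in enumerate(reads):
            # E-step: responsibility of each event for each read.
            ll = -0.5 * ((r[:, None] - mu[None, :]) / sigma) ** 2
            resp = w[c] * np.exp(ll)
            resp /= resp.sum(axis=1, keepdims=True) + 1e-300
            # M-step pieces: per-condition weights, pooled means.
            w[c] = resp.mean(axis=0)
            num += resp.T @ r
            den += resp.sum(axis=0)
        mu = num / den                               # shared across conditions
    return mu, w

rng = np.random.default_rng(2)
reads = [np.concatenate([rng.normal(1000, 40, 300), rng.normal(5000, 40, 100)]),
         np.concatenate([rng.normal(1000, 40, 50), rng.normal(5000, 40, 400)])]
print(multi_condition_em(reads))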
doi:10.1371/journal.pcbi.1003501
PMCID: PMC3967921  PMID: 24675637
21.  A joint finite mixture model for clustering genes from independent Gaussian and beta distributed data 
BMC Bioinformatics  2009;10:165.
Background
Cluster analysis has become a standard computational method for gene function discovery as well as for more general exploratory data analysis. A number of different approaches have been proposed for this purpose, among which mixture models provide a principled probabilistic framework. Cluster analysis is increasingly supplemented with multiple data sources, and these heterogeneous information sources should be used as efficiently as possible.
Results
This paper presents a novel Beta-Gaussian mixture model (BGMM) for clustering genes based on Gaussian distributed and beta distributed data. The proposed BGMM can be viewed as a natural extension of the beta mixture model (BMM) and the Gaussian mixture model (GMM). BGMM differs from other mixture model based methods in that it integrates two different data types into a single, unified probabilistic modeling framework, making more efficient use of multiple data sources than methods that analyze each source separately. Moreover, BGMM provides an exceedingly flexible modeling framework, since many data sources can be modeled as Gaussian or beta distributed random variables, and it can be extended to integrate data with other parametric distributions as well. We developed three estimation algorithms for BGMM: the standard expectation maximization (EM) algorithm, an approximated EM, and a hybrid EM. We address the model selection problem with well-known criteria, testing the Akaike information criterion (AIC), a modified AIC (AIC3), the Bayesian information criterion (BIC), and the integrated classification likelihood-BIC (ICL-BIC).
Conclusion
Performance tests with simulated data show that combining two different data sources into a single mixture joint model greatly improves the clustering accuracy compared with either of its two extreme cases, GMM or BMM. Applications with real mouse gene expression data (modeled as Gaussian distribution) and protein-DNA binding probabilities (modeled as beta distribution) also demonstrate that BGMM can yield more biologically reasonable results compared with either of its two extreme cases. One of our applications has found three groups of genes that are likely to be involved in Myd88-dependent Toll-like receptor 3/4 (TLR-3/4) signaling cascades, which might be useful to better understand the TLR-3/4 signal transduction.
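A compact sketch of the joint model: per cluster, one feature is Gaussian (e.g., log expression) and one is beta distributed on (0,1) (e.g., a protein-DNA binding probability), treated as independent given the cluster. For brevity the beta parameters are updated by method of moments, a simplification relative to the standard, approximated, and hybrid EM variants the paper develops.

import numpy as np
from scipy.stats import norm, beta

def bgmm_em(x, y, k=2, n_iter=100, eps=1e-9):
    """Toy EM for a joint Beta-Gaussian mixture over paired features
    (x Gaussian, y beta), independent given the cluster label."""
    n = len(x)
    resp = np.random.default_rng(0).dirichlet(np.ones(k), size=n)
    for _ in range(n_iter):
        # M-step: weights, Gaussian params (closed form), beta params
        # via method of moments on weighted mean and variance.
        nk = resp.sum(axis=0) + eps
        pi = nk / n
        mu = resp.T @ x / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + eps
        m = resp.T @ y / nk
        v = (resp * (y[:, None] - m) ** 2).sum(axis=0) / nk + eps
        c = np.clip(m * (1 - m) / v - 1, eps, None)
        a, b = m * c, (1 - m) * c
        # E-step: responsibilities from the product of the two densities.
        dens = pi * norm.pdf(x[:, None], mu, np.sqrt(var)) \
                  * beta.pdf(y[:, None], a, b)
        resp = dens / (dens.sum(axis=1, keepdims=True) + eps)
    return pi, (mu, var), (a, b), resp

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(-1, 0.5, 200), rng.normal(2, 0.5, 200)])
y = np.concatenate([rng.beta(2, 8, 200), rng.beta(8, 2, 200)])
print(bgmm_em(x, y)[0])   # recovered mixing weights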
doi:10.1186/1471-2105-10-165
PMCID: PMC2717092  PMID: 19480678
22.  A novel computational model of the circadian clock in Arabidopsis that incorporates PRR7 and PRR9 
We developed a mathematical model of the Arabidopsis circadian clock, including PRR7 and PRR9, which is able to predict several single, double and triple mutant phenotypes.
Sensitivity analysis was used to identify the properties and time sensing mechanisms of model structures.
PRR7 and CCA1/LHY were identified as weak points of the mathematical model, indicating where more experimental data are needed for further model development.
Detailed dynamical studies showed that the timing of an evening light sensing element is essential for day length responsiveness.
In recent years, molecular genetic techniques have revealed a complex network of components in the Arabidopsis circadian clock. Mathematical models allow for a detailed study of the dynamics and architecture of such complex gene networks leading to a better understanding of the genetic interactions. It is important to maintain a constant iteration with experimentation, to include novel components as they are discovered and use the updated model to design new experiments. This study develops a framework to introduce new components into the mathematical model of the Arabidopsis circadian clock accelerating the iterative model development process and gaining insight into the system's properties.
We used the interlocked feedback loop model published in Locke et al (2005) as the base model. In Arabidopsis, the first suggested regulatory loop involves the morning expressed transcription factors CIRCADIAN CLOCK-ASSOCIATED 1 (CCA1) and LATE ELONGATED HYPOCOTYL (LHY), and the evening expressed pseudo-response regulator TIMING OF CAB EXPRESSION (TOC1). The hypothetical component X had been introduced to realize a longer delay between gene expression of CCA1/LHY and TOC1. The introduction of Y was motivated by the need for a mechanism to reproduce the dampening short period rhythms of the cca1/lhy double mutant and to include an additional light input at the end of the day.
In this study, the new components pseudo-response regulators PRR7 and PRR9 were added in negative feedback loops based on the biological hypothesis that they are activated by LHY and in turn repress LHY transcription (Farré et al, 2005; Figure 1). We present three iterations steps of model development (Figure 1A–C).
A wide range of tools was used to establish and analyze new model structures. One of the challenges facing mathematical modeling of biological processes is parameter identification; they are notoriously difficult to determine experimentally. We established an optimization procedure based on an evolutionary strategy with a cost function mainly derived from wild-type characteristics. This ensured that the model was not restricted by a specific set of parameters and enabled us to use a large set of biological mutant information to assess the predictive capability of the model structure. Models were evaluated by means of an extended phenotype catalogue, allowing for an easy and fair comparison of the structures. We also carried out detailed simulation analysis of component interactions to identify weak points in the structure and suggest further modifications. Finally, we applied sensitivity analysis in a novel manner, using it to direct the model development. Sensitivity analysis provides quantitative measures of robustness; the two measures in this study were the traces of component concentrations over time (classical state sensitivities) and phase behavior (measured by the phase response curve). Three major results emerged from the model development process.
First, the iteration process helped us to learn about general characteristics of the system. We observed that the timing of Y expression is critical for evening light entrainment, which enables the system to respond to changes in day length. This is important for our understanding of the mechanism of light input to the clock and will aid the identification of biological candidates for this function. In addition, our results suggest that a detailed description of the mechanisms of genetic interactions is important for the system's behavior. We observed that the introduction of an experimentally based, precise light regulation mechanism on PRR9 expression had a significant effect on the system's behavior.
Second, the final model structure (Figure 1C) was capable of predicting a wide range of mutant phenotypes, such as a reduction of TOC1 expression by RNAi (toc1RNAi), mutations in PRR7 and PRR9 and the novel mutant combinations prr9toc1RNAi and prr7prr9toc1RNAi. However, it was unable to predict the mutations in CCA1 and LHY.
Finally, sensitivity analysis identified the weak points of the system. The developed model structure was heavily based on the TOC1/Y feedback loop. This could explain the model's failure to predict the cca1lhy double mutant phenotype. More detailed information on the regulation of CCA1 and LHY expression will be important to achieve the right balance between the different regulatory loops in the mathematical model. This is in accordance with genetic studies that have identified several genes involved in the regulation of LHY and CCA1 expression. The identification of their mechanism of action will be necessary for the next model development.
In plants, as in animals, the core mechanism that sustains rhythmic gene expression relies on the interaction of multiple feedback loops. In recent years, molecular genetic techniques have revealed a complex network of clock components in Arabidopsis. To gain insight into the dynamics of these interactions, new components need to be integrated into the mathematical model of the plant clock. Our approach accelerates the iterative process of model identification, making it possible to incorporate new components and to systematically test different proposed structural hypotheses. Recent studies indicate that the pseudo-response regulators PRR7 and PRR9 play a key role in the core clock of Arabidopsis. We incorporate PRR7 and PRR9 into an existing model involving the transcription factors TIMING OF CAB EXPRESSION (TOC1), LATE ELONGATED HYPOCOTYL (LHY) and CIRCADIAN CLOCK ASSOCIATED 1 (CCA1). We propose candidate models based on experimental hypotheses and identify the computational models with the application of an optimization routine. Validation is accomplished through systematic analysis of various mutant phenotypes. We introduce and apply sensitivity analysis as a novel tool for analyzing and distinguishing the characteristics of proposed architectures, which also allows for further validation of the hypothesized structures.
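To make the modeling loop tangible, here is a drastically simplified two-gene negative feedback oscillator with a light input, together with a finite-difference state sensitivity of the kind used to rank weak points. The equations and parameters are invented for illustration and are not the Locke et al interlocked-loop model.

import numpy as np
from scipy.integrate import solve_ivp

def clock_rhs(t, s, p):
    """Toy clock: a morning activator m1 (LHY/CCA1-like) repressed by an
    evening component m2 (TOC1-like), with light boosting m1 production."""
    m1, m2 = s
    light = 1.0 if (t % 24) < 12 else 0.0                 # 12L:12D cycle
    dm1 = p["v1"] * (1 + p["kL"] * light) / (1 + (m2 / p["K1"]) ** 2) - p["d1"] * m1
    dm2 = p["v2"] * m1 ** 2 / (p["K2"] ** 2 + m1 ** 2) - p["d2"] * m2
    return [dm1, dm2]

p = dict(v1=1.0, v2=1.0, K1=0.5, K2=0.5, d1=0.3, d2=0.3, kL=0.5)

def trace(params):
    sol = solve_ivp(clock_rhs, (0, 240), [0.1, 0.1], args=(params,),
                    dense_output=True, max_step=0.5)
    return sol.sol(np.linspace(120, 240, 500))[0]         # m1 after transients

# Finite-difference state sensitivity to one parameter, a crude stand-in
# for the classical sensitivities used to rank weak points of the model.
base = trace(p)
pert = trace({**p, "v2": p["v2"] * 1.01})
print("relative sensitivity of m1 trace to v2: %.3f"
      % (np.abs(pert - base).mean() / (0.01 * np.abs(base).mean())))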
doi:10.1038/msb4100101
PMCID: PMC1682023  PMID: 17102803
Arabidopsis; circadian rhythms; mathematical modeling; parameter optimization; sensitivity analysis
23.  Ultra-Structure database design methodology for managing systems biology data and analyses 
BMC Bioinformatics  2009;10:254.
Background
Modern, high-throughput biological experiments generate copious, heterogeneous, interconnected data sets. Research is dynamic, with frequently changing protocols, techniques, instruments, and file formats. Because of these factors, systems designed to manage and integrate modern biological data sets often end up as large, unwieldy databases that become difficult to maintain or evolve. The novel rule-based approach of the Ultra-Structure design methodology presents a potential solution to this problem. By representing both data and processes as formal rules within a database, an Ultra-Structure system constitutes a flexible framework that enables users to explicitly store domain knowledge in both a machine- and human-readable form. End users themselves can change the system's capabilities without programmer intervention, simply by altering database contents; no computer code or schemas need be modified. This provides flexibility in adapting to change, and allows integration of disparate, heterogeneous data sets within a small core set of database tables, facilitating joint analysis and visualization without becoming unwieldy. Here, we examine the application of Ultra-Structure to our ongoing research program for the integration of large proteomic and genomic data sets (proteogenomic mapping).
Results
We transitioned our proteogenomic mapping information system from a traditional entity-relationship design to one based on Ultra-Structure. Our system integrates tandem mass spectrum data, genomic annotation sets, and spectrum/peptide mappings, all within a small, general framework implemented within a standard relational database system. General software procedures driven by user-modifiable rules can perform tasks such as logical deduction and location-based computations. The system is not tied specifically to proteogenomic research, but is rather designed to accommodate virtually any kind of biological research.
Conclusion
We find Ultra-Structure offers substantial benefits for biological information systems, the largest being the integration of diverse information sources into a common framework. This facilitates systems biology research by integrating data from disparate high-throughput techniques. It also enables us to readily incorporate new data types, sources, and domain knowledge with no change to the database structure or associated computer code. Ultra-Structure may be a significant step towards solving the hard problem of data management and integration in the systems biology era.
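A toy rules-as-data pattern makes the idea concrete: both facts and the rule that interprets them live in ordinary tables, so behavior changes by inserting rows rather than editing code. Table and column names here are invented for illustration, not taken from the paper's schema.

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE facts (subject TEXT, predicate TEXT, object TEXT);
CREATE TABLE rules (if_predicate TEXT, then_predicate TEXT);
""")
db.executemany("INSERT INTO facts VALUES (?,?,?)", [
    ("peptide_42", "maps_to", "orf_7"),
    ("orf_7", "located_in", "contig_3"),
])
# Rule stored as data: if X maps_to Y and Y located_in Z, infer X located_in Z.
db.execute("INSERT INTO rules VALUES ('maps_to', 'located_in')")

def apply_rules(db):
    """One round of forward chaining driven entirely by the rules table."""
    db.execute("""
        INSERT INTO facts
        SELECT f1.subject, r.then_predicate, f2.object
        FROM rules r
        JOIN facts f1 ON f1.predicate = r.if_predicate
        JOIN facts f2 ON f2.subject = f1.object
                     AND f2.predicate = r.then_predicate
    """)

apply_rules(db)
print(db.execute("SELECT * FROM facts").fetchall())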
doi:10.1186/1471-2105-10-254
PMCID: PMC2748085  PMID: 19691849
24.  Network-based analysis reveals distinct association patterns in a semantic MEDLINE-based drug-disease-gene network 
Background
A huge number of associations among different biological entities (e.g., diseases, drugs, and genes) are scattered across millions of biomedical articles. Systematic analysis of such heterogeneous data can infer novel associations among different biological entities in the context of personalized medicine and translational research. Recently, network-based computational approaches have gained popularity in investigating such heterogeneous data, proposing novel therapeutic targets and deciphering disease mechanisms. However, little effort has been devoted to investigating associations among drugs, diseases, and genes in an integrative manner.
Results
We propose a novel network-based computational framework to identify statistically over-represented subnetwork patterns, called network motifs, in an integrated disease-drug-gene network extracted from Semantic MEDLINE. The framework consists of two steps. The first step is to construct an association network by extracting pair-wise associations between diseases, drugs and genes in Semantic MEDLINE using a domain pattern driven strategy. A Resource Description Framework (RDF)-linked data approach is used to re-organize the data to increase the flexibility of data integration, the interoperability within domain ontologies, and the efficiency of data storage. Unique associations among drugs, diseases, and genes are extracted for downstream network-based analysis. The second step is to apply a network-based approach to mine the local network structure of this heterogeneous network. Significant network motifs are then identified as the backbone of the network. A simplified network based on those significant motifs is then constructed to facilitate discovery. We implemented our computational framework and identified five network motifs, each of which corresponds to a specific biological meaning. Three case studies demonstrate that novel associations are derived from the network topology analysis of reconstructed networks of significant network motifs, further validated by expert knowledge and functional enrichment analyses.
Conclusions
We have developed a novel network-based computational approach to investigate the heterogeneous drug-gene-disease network extracted from Semantic MEDLINE. We demonstrate the power of this approach by prioritizing candidate disease genes, inferring potential disease relationships, and proposing novel drug targets, within the context of the entire knowledge network. The results indicate that such an approach will facilitate the formulation of novel research hypotheses, which is critical for translational medicine research and personalized medicine.
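As a sketch of motif significance testing on such a typed network, the code below counts a drug-gene-disease path motif and compares it against a null model that permutes node kinds while keeping the topology fixed (one simple choice of null; degree-preserving rewiring is another). The toy network and motif are invented; the paper's motif set and statistics differ.

import random
import networkx as nx

def count_drug_gene_disease(g):
    """Count directed paths drug -> gene -> disease in a heterogeneous
    graph whose nodes carry a 'kind' attribute."""
    total = 0
    for a, b in g.edges:
        if g.nodes[a]["kind"] == "drug" and g.nodes[b]["kind"] == "gene":
            total += sum(1 for c in g.successors(b)
                         if g.nodes[c]["kind"] == "disease")
    return total

def motif_zscore(g, n_rand=200, seed=0):
    """Z-score of the observed motif count against kind-permuted nulls."""
    rng = random.Random(seed)
    observed = count_drug_gene_disease(g)
    kinds = [g.nodes[n]["kind"] for n in g.nodes]
    counts = []
    for _ in range(n_rand):
        rng.shuffle(kinds)
        h = g.copy()
        for n, k in zip(h.nodes, kinds):
            h.nodes[n]["kind"] = k
        counts.append(count_drug_gene_disease(h))
    mean = sum(counts) / n_rand
    sd = (sum((c - mean) ** 2 for c in counts) / n_rand) ** 0.5
    return observed, (observed - mean) / (sd + 1e-9)

# Toy heterogeneous network.
g = nx.DiGraph()
for n, k in [("aspirin", "drug"), ("PTGS2", "gene"), ("inflammation", "disease"),
             ("APP", "gene"), ("AD", "disease")]:
    g.add_node(n, kind=k)
g.add_edges_from([("aspirin", "PTGS2"), ("PTGS2", "inflammation"),
                  ("aspirin", "APP"), ("APP", "AD")])
print(motif_zscore(g))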
doi:10.1186/2041-1480-5-33
PMCID: PMC4137727  PMID: 25170419
25.  An integrative framework for the identification of double minute chromosomes using next generation sequencing data 
BMC Genetics  2015;16(Suppl 2):S1.
Background
Double minute chromosomes are circular fragments of DNA whose presence is associated with the onset of certain cancers. Double minutes are lethal, as they are highly amplified and typically contain oncogenes. Locating double minutes can supplement the process of cancer diagnosis, and it can help to identify therapeutic targets. However, there is currently a dearth of computational methods available to identify double minutes. We propose a computational framework for the identification of double minute chromosomes using next-generation sequencing data. Our framework integrates predictions from algorithms that detect DNA copy number variants, and it also integrates predictions from algorithms that locate genomic structural variants. This information is used by a graph-based algorithm to predict the presence of double minute chromosomes.
Results
Using a previously published copy number variant algorithm and two structural variation prediction algorithms, we implemented our framework and tested it on a dataset consisting of simulated double minute chromosomes. Our approach uncovered double minutes with high accuracy, demonstrating its feasibility.
Conclusions
Although we tested the framework with only three programs (RDXplorer, BreakDancer, Delly), it can be extended to incorporate results from any programs that (1) detect amplified copy number or (2) detect genomic structural variants such as deletions, translocations, inversions, and tandem repeats.
The software that implements the framework can be accessed here: https://github.com/mhayes20/DMFinder
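The graph step can be sketched as follows: nodes are amplified segments reported by a copy number caller, edges are breakpoint junctions reported by structural variant callers, and a cycle through amplified segments is a candidate circular double minute. Input formats below are invented; see DMFinder for the real integration.

import networkx as nx

# Hypothetical caller outputs: amplified segments and breakpoint junctions.
amplified = ["chr8:127m-128m", "chr8:129m-130m", "chr12:66m-67m"]
junctions = [("chr8:127m-128m", "chr8:129m-130m"),
             ("chr8:129m-130m", "chr8:127m-128m"),   # back-edge closes circle
             ("chr8:129m-130m", "chr12:66m-67m")]

g = nx.DiGraph()
g.add_nodes_from(amplified)
g.add_edges_from(junctions)

# A cycle of amplified segments linked by junctions suggests circular DNA.
for cycle in nx.simple_cycles(g):
    if len(cycle) >= 2:
        print("candidate double minute:", " -> ".join(cycle))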
doi:10.1186/1471-2156-16-S2-S1
PMCID: PMC4423570  PMID: 25953282
amplicon; double minute; next generation sequencing
