Currently, most terms and term-term relationships in the Gene Ontology (GO) are defined manually, which creates cost, consistency and completeness issues. Recent studies have demonstrated the feasibility of inferring GO automatically from biological networks, an important complementary approach to GO construction. These methods (NeXO and CliXO) are unsupervised, which means that 1) they cannot use the information contained in the existing GO, 2) the way they integrate biological networks may not optimize accuracy, and 3) they are not customized to infer the three different sub-ontologies of GO. Here we present a semi-supervised method called Unicorn that extends these previous methods to tackle the three problems. Unicorn uses a sub-tree of an existing GO sub-ontology as the training set for learning the parameters that govern the integration of multiple networks. Cross-validation results show that Unicorn reliably inferred the left-out parts of each specific GO sub-ontology. In addition, when trained with an old version of GO together with biological networks, Unicorn successfully re-discovered some terms and term-term relationships present only in a newer version of GO. It also inferred some novel terms absent from GO whose biological meanings are well supported by the literature.
Availability: Source code of Unicorn is available at http://yiplab.cse.cuhk.edu.hk/unicorn/.
Protein interactions play significant roles in complex diseases. We analyzed the peripheral blood mononuclear cell (PBMC) transcriptome using a multi-method strategy. We constructed a tissue-specific interactome (T2Di) and identified 420 molecular signatures associated with T2D-related comorbidities and symptoms, mainly implicated in inflammation, adipogenesis, protein phosphorylation and hormonal secretion. Apart from explaining the residual associations within the DIAbetes Genetics Replication And Meta-analysis (DIAGRAM) study, the T2Di signatures were enriched in pathogenic cell type-specific regulatory elements related to fetal development, immunity and expression quantitative trait loci (eQTL). The T2Di revealed a novel locus near the well-established GWAS locus AChE, in which SRRT interacts with JAZF1, a T2D-GWAS gene implicated in pancreatic function. The T2Di also included known anti-diabetic drug targets (e.g. PPARD, MAOB) and identified possible druggable targets (e.g. NCOR2, PDGFR). These T2Di signatures were validated by an independent computational method, and by expression data of pancreatic islet, muscle and liver, with some of the signatures (CEBPB, SREBF1, MLST8, SRF, SRRT and SLC12A9) confirmed in PBMC from an independent cohort of 66 T2D and 66 control subjects. By combining prior knowledge and transcriptome analysis, we have constructed an interactome to explain the multi-layered regulatory pathways in T2D.
Motivation: The three-dimensional structure of genomes makes it possible for genomic regions not adjacent in the primary sequence to be spatially proximal. These DNA contacts have been found to be related to various molecular activities. Previous methods for analyzing DNA contact maps obtained from Hi-C experiments have largely focused on studying individual interactions, forming spatial clusters composed of contiguous blocks of genomic locations, or classifying these clusters into general categories based on some global properties of the contact maps.
Results: Here, we describe a novel computational method that can flexibly identify small clusters of spatially proximal genomic regions based on their local contact patterns. Using simulated data that highly resemble Hi-C data obtained from real genome structures, we demonstrate that our method identifies spatial clusters that are more compact than methods previously used for clustering genomic regions based on DNA contact maps. The clusters identified by our method enable us to confirm functionally related genomic regions previously reported to be spatially proximal in different species. We further show that each genomic region can be assigned a numeric affinity value that indicates its degree of participation in each local cluster, and these affinity values correlate quantitatively with DNase I hypersensitivity, gene expression, super enhancer activities and replication timing in a cell-type-specific manner. We also show that these cluster affinity values can precisely define boundaries of reported topologically associating domains, and further define local sub-domains within each domain.
Availability and implementation: The source code of BNMF and tutorials on how to use the software to extract local clusters from contact maps are available at http://yiplab.cse.cuhk.edu.hk/bnmf/.
Supplementary data are available at Bioinformatics online.
The size of digital data is ever increasing and is expected to grow to 40,000 EB by 2020, yet the estimated global information storage capacity in 2011 was <300 EB, indicating that most data are transient. DNA, as a very stable nano-molecule, is an ideal massive storage device for long-term data archiving. The two most notable illustrations are from Church et al. and Goldman et al., whose approaches are well optimized for most sequencing platforms: short synthesized DNA fragments without homopolymers. Here, we suggest improvements to the error-handling methodology that could enable the integration of DNA-based computational processes, e.g., algorithms based on the self-assembly of DNA. As a proof of concept, a picture of 438 bytes was encoded into DNA with a low-density parity-check (LDPC) error-correction code. We salvaged a significant portion of sequencing reads carrying mutations generated during DNA synthesis and sequencing, and successfully reconstructed the entire picture. A modular programming framework, DNAcodec, with an eXtensible Markup Language (XML)-based data format is also introduced. Our experiments demonstrate the practicality of long DNA message recovery with high error tolerance, which opens the field to biocomputing and synthetic biology.
DNA-based information storage; error-tolerating module; DNA-based computational process; synthetic biology; biocomputing
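The homopolymer constraint mentioned above (no repeated bases, to suit most sequencing platforms) can be met with a rotating code in the spirit of Goldman et al. The sketch below illustrates that encoding step only, not the LDPC-based DNAcodec pipeline itself; all function names are ours.

```python
BASES = "ACGT"

def to_trits(data: bytes):
    # Represent each byte as 6 base-3 digits, least significant first
    # (3**6 = 729 > 255, so 6 trits always suffice).
    trits = []
    for b in data:
        for _ in range(6):
            trits.append(b % 3)
            b //= 3
    return trits

def trits_to_bytes(trits):
    out = []
    for i in range(0, len(trits), 6):
        b = 0
        for t in reversed(trits[i:i + 6]):
            b = b * 3 + t
        out.append(b)
    return bytes(out)

def encode(data: bytes) -> str:
    # Each trit picks one of the three bases that differ from the previous
    # base, so the sequence can never contain a homopolymer run.
    seq, prev = [], "A"
    for t in to_trits(data):
        base = [c for c in BASES if c != prev][t]
        seq.append(base)
        prev = base
    return "".join(seq)

def decode(seq: str) -> bytes:
    trits, prev = [], "A"
    for base in seq:
        trits.append([c for c in BASES if c != prev].index(base))
        prev = base
    return trits_to_bytes(trits)
```

Because each base is chosen relative to its predecessor, decoding simply replays the same rotation; the real pipeline would wrap this in error-correction coding.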
Comprehensive whole-genome structural variation detection is challenging with current approaches. With diploid cells as the DNA source and the presence of numerous repetitive elements, short-read DNA sequencing cannot be used to detect structural variation efficiently. In this report, we show that genome mapping with long, fluorescently labeled DNA molecules imaged on nanochannel arrays can be used for whole-genome structural variation detection without sequencing. While whole-genome haplotyping is not achieved, local phasing (across >150-kb regions) is routine, as molecules from the parental chromosomes are examined separately. In one experiment, we generated genome maps from a trio from the 1000 Genomes Project, compared the maps against those derived from the reference human genome, and identified structural variants that are >5 kb in size. We find that these individuals have many more structural variants than those published, including some with the potential to disrupt gene function or regulation.
biotechnology; genome mapping; structural variation detection
Recently, several experimental techniques have emerged for probing RNA structures based on high-throughput sequencing. However, most secondary structure prediction tools that incorporate probing data are designed and optimized for particular types of experiments. For example, RNAstructure-Fold is optimized for SHAPE data, while SeqFold is optimized for PARS data. Here, we report a new RNA secondary structure prediction method, restrained MaxExpect (RME), which can incorporate multiple types of experimental probing data and is based on a free energy model and an MEA (maximizing expected accuracy) algorithm. We first demonstrated that RME substantially improved secondary structure prediction with perfect restraints (base pair information of known structures). Next, we collected structure-probing data from diverse experiments (e.g. SHAPE, PARS and DMS-seq) and transformed them into a unified set of pairing probabilities with a posterior probabilistic model. By using the probability scores as restraints in RME, we compared its secondary structure prediction performance with two other well-known tools, RNAstructure-Fold (based on a free energy minimization algorithm) and SeqFold (based on a sampling algorithm). For SHAPE data, RME and RNAstructure-Fold performed better than SeqFold, because they markedly altered the energy model with the experimental restraints. For high-throughput data (e.g. PARS and DMS-seq) with lower probing efficiency, the secondary structure prediction performances of the tested tools were comparable, with performance improvements for only a portion of the tested RNAs. However, when the effects of tertiary structure and protein interactions were removed, RME showed the highest prediction accuracy in the DMS-accessible regions by incorporating in vivo DMS-seq data.
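The unified pairing-probability idea can be illustrated with a minimal Bayesian sketch: assume paired and unpaired bases produce probing signals drawn from two different distributions, and apply Bayes' rule. The Gaussian parameters below are purely illustrative, not the fitted posterior model of RME.

```python
import math

def gauss_pdf(mu, sigma):
    # Simple Gaussian density; stands in for the signal distributions of
    # paired vs. unpaired bases (parameters are assumptions for this sketch).
    return lambda x: math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (
        sigma * math.sqrt(2 * math.pi))

pdf_paired = gauss_pdf(0.2, 0.15)    # paired bases: low reactivity (assumed)
pdf_unpaired = gauss_pdf(0.8, 0.30)  # unpaired bases: high reactivity (assumed)

def pairing_posterior(reactivity, prior_paired=0.5):
    # Bayes' rule: P(paired | signal) from the two likelihoods and a prior.
    num = prior_paired * pdf_paired(reactivity)
    den = num + (1 - prior_paired) * pdf_unpaired(reactivity)
    return num / den
```

The resulting probabilities could then serve as soft restraints in a structure prediction algorithm, regardless of which probing experiment produced the raw signal.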
In the competing endogenous RNA (ceRNA) hypothesis, different transcripts communicate through a competition for their common targeting microRNAs (miRNAs). Individual examples have clearly shown the functional importance of ceRNA in gene regulation and cancer biology. It remains unclear to what extent gene expression levels are regulated by ceRNA in general. One major hurdle to studying this problem is the intertwined connections in miRNA-target networks, which makes it difficult to isolate the effects of individual miRNAs.
Here we propose computational methods for decomposing a complex miRNA-target network into largely autonomous modules called microRNA-target biclusters (MTBs). Each MTB contains a relatively small number of densely connected miRNAs and mRNAs with few connections to other miRNAs and mRNAs, and can thus be analyzed individually with minimal crosstalk with other MTBs. Our approach differs from previous methods for finding modules in miRNA-target networks by not making any pre-assumptions about expression patterns, thereby providing objective information for testing the ceRNA hypothesis. We show that the expression levels of miRNAs and mRNAs in an MTB are significantly more anti-correlated than random miRNA-mRNA pairs and other validated and predicted miRNA-target pairs, demonstrating the biological relevance of MTBs. We further show that there is widespread correlation of expression between mRNAs in the same MTBs under a wide variety of parameter settings, and that the correlation remains even when co-regulatory effects are controlled for. This suggests potential widespread expression buffering between these mRNAs, consistent with the ceRNA hypothesis. Lastly, we propose a potential use of MTBs in the functional annotation of miRNAs.
MTBs can be used to help identify autonomous miRNA-target modules for testing the generality of the ceRNA hypothesis experimentally. The identified modules can also be used to test other properties of miRNA-target networks in general.
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2164-15-1178) contains supplementary material, which is available to authorized users.
Competing endogenous RNA; MicroRNA-target bicluster; MicroRNA network
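One way to picture the MTB decomposition is greedy extraction of a dense submatrix from the binary miRNA-mRNA interaction matrix. This is only a sketch of the biclustering idea, not the paper's exact algorithm; the density threshold and tie-breaking rule are our assumptions.

```python
import numpy as np

def greedy_bicluster(A, min_density=0.8):
    # A: binary miRNA x mRNA interaction matrix. Repeatedly drop the
    # sparsest row or column until the remaining submatrix is dense
    # enough, yielding a tightly connected candidate module.
    rows = list(range(A.shape[0]))
    cols = list(range(A.shape[1]))
    while rows and cols:
        if A[np.ix_(rows, cols)].mean() >= min_density:
            return rows, cols
        # candidate row/column with the fewest remaining interactions
        r = min(rows, key=lambda i: A[i, cols].sum())
        c = min(cols, key=lambda j: A[rows, j].sum())
        if A[r, cols].mean() <= A[rows, c].mean():
            rows.remove(r)
        else:
            cols.remove(c)
    return rows, cols
```

A full decomposition would repeat this after masking out each extracted module, and would additionally require few edges leaving the module, as the MTB definition demands.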
Patient-derived tumor xenografts in mice are widely used in cancer research and have become important in developing personalized therapies. When these xenografts are subject to DNA sequencing, the samples could contain various amounts of mouse DNA. It has been unclear how the mouse reads would affect data analyses. We conducted comprehensive simulations to compare three alignment strategies at different mutation rates, read lengths, sequencing error rates, human-mouse mixing ratios and sequenced regions. We also sequenced a nasopharyngeal carcinoma xenograft and a cell line to test how the strategies work on real data.
We found that the "filtering" and "combined reference" strategies performed better than aligning reads directly to the human reference, in terms of both alignment and variant calling accuracy. The combined reference strategy was particularly good at reducing false negative variant calls without significantly increasing the false positive rate. In some scenarios the performance gain of these two strategies was too small for the special handling to be cost-effective, but special handling was found to be crucial when false non-synonymous SNVs must be minimized, especially in exome sequencing.
Our study systematically analyzes the effects of mouse contamination in the sequencing data of human-in-mouse xenografts. Our findings provide information for designing data analysis pipelines for these data.
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2164-15-1172) contains supplementary material, which is available to authorized users.
Xenografts; Nasopharyngeal carcinoma; Contamination; High-throughput sequencing
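The "combined reference" strategy can be abstracted as a per-read decision: align each read against the human and mouse references jointly, and keep it as human only when it scores clearly better there. The sketch below is a simplification with hypothetical score inputs, not the output of any particular aligner.

```python
def classify_read(human_score, mouse_score, margin=0):
    # Combined-reference idea in miniature: a read is assigned to the
    # species on which it aligns better; near-ties are flagged as
    # ambiguous. Scores and margin are illustrative placeholders.
    if human_score > mouse_score + margin:
        return "human"
    if mouse_score > human_score + margin:
        return "mouse"
    return "ambiguous"
```

In practice the decision is made by the aligner itself when reads are mapped to a concatenated human+mouse reference, so mouse-derived reads are absorbed by the mouse contigs instead of producing false variant calls on the human genome.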
To find signature features shared by various ncRNA sub-types and to characterize novel ncRNAs, we have developed a method, RNAfeature, to investigate >600 sets of genomic and epigenomic data with various evolutionary and biophysical scores. RNAfeature utilizes a fine-tuned intra-species wrapper algorithm that is followed by a novel feature selection strategy across species. It considers long-distance effects of certain features (e.g. histone modification at the promoter region). We finally narrow the set down to 10 informative features (including sequences, structures, expression profiles and epigenetic signals). These features are complementary to each other and as a whole can accurately distinguish canonical ncRNAs from CDSs and UTRs (accuracies: >92% in human, mouse, worm and fly). Moreover, the feature pattern is conserved across multiple species. For instance, the supervised 10-feature model derived from animal species can predict ncRNAs in Arabidopsis (accuracy: 82%). Subsequently, we integrate the 10 features to define a set of noncoding potential scores, which can identify, evaluate and characterize novel noncoding RNAs. The score covers all transcribed regions (including unconserved ncRNAs), without requiring assembly of the full-length transcripts. Importantly, the noncoding potential allows us to identify and characterize potential functional domains with feature patterns similar to canonical ncRNAs (e.g. tRNA, snRNA, miRNA, etc.) on ∼70% of human long ncRNAs (lncRNAs).
High-throughput experimental methods have fostered the systematic detection of millions of genetic variants from any human genome. To help explore the potential biological implications of these genetic variants, software tools have been previously developed for integrating various types of information about these genomic regions from multiple data sources. Most of these tools were designed either for studying a small number of variants at a time, or for local execution on powerful machines.
To make exploration of whole lists of genetic variants simple and accessible, we have developed a new Web-based system called VAS (Variant Annotation System), available at https://yiplab.cse.cuhk.edu.hk/vas/. It provides a large variety of information useful for studying both coding and non-coding variants, including whole-genome transcription factor binding, open chromatin and transcription data from the ENCODE consortium. By means of data compression, millions of variants can be uploaded from a client machine to the server in less than 50 megabytes of data. On the server side, our customized data integration algorithms can efficiently link millions of variants with tens of whole-genome datasets. These two enabling technologies make VAS a practical tool for annotating genetic variants from large genomic studies. We demonstrate the use of VAS in annotating genetic variants obtained from a migraine meta-analysis study and multiple data sets from the Personal Genomes Project. We also compare the running time of annotating 6.4 million SNPs of the CEU trio by VAS and another tool, showing that VAS is efficient in handling new variant lists without requiring any pre-computations.
VAS is specially designed to handle annotation tasks with long lists of genetic variants and large numbers of annotating features efficiently. It is complementary to other existing tools with more specific aims such as evaluating the potential impacts of genetic variants in terms of disease risk. We recommend using VAS for a quick first-pass identification of potentially interesting genetic variants, to minimize the time required for other more in-depth downstream analyses.
Annotation; Genetic variants; Genomic studies; Data integration
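The client-side compression that lets millions of variants fit in a small upload could, for example, be delta-plus-varint coding of sorted positions. The paper does not specify its exact scheme, so the sketch below is a hypothetical illustration of how such compression can work.

```python
def varint_encode(n):
    # Variable-length encoding of a non-negative integer, 7 bits per byte;
    # the high bit marks continuation (as in Protocol Buffers varints).
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def compress_positions(positions):
    # Sort positions and store successive differences; nearby variants
    # then cost only one or two bytes each instead of a full integer.
    out, prev = bytearray(), 0
    for p in sorted(positions):
        out += varint_encode(p - prev)
        prev = p
    return bytes(out)

def decompress_positions(data):
    positions, cur, shift, prev = [], 0, 0, 0
    for b in data:
        cur |= (b & 0x7F) << shift
        shift += 7
        if not b & 0x80:          # last byte of this varint
            prev += cur
            positions.append(prev)
            cur, shift = 0, 0
    return positions
```

A real upload would also carry chromosome labels and alleles, but the positional deltas dominate the payload, which is where a scheme like this saves the most space.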
Identification of noncoding drivers from thousands of somatic alterations in a typical tumor is a difficult and unsolved problem. We report a computational framework, FunSeq2, to annotate and prioritize these mutations. The framework combines an adjustable data context integrating large-scale genomics and cancer resources with a streamlined variant-prioritization pipeline. The pipeline has a weighted scoring system combining: inter- and intra-species conservation; loss- and gain-of-function events for transcription-factor binding; enhancer-gene linkages and network centrality; and per-element recurrence across samples. We further highlight putative drivers with information specific to a particular sample, such as differential expression. FunSeq2 is available from funseq2.gersteinlab.org.
Electronic supplementary material
The online version of this article (doi:10.1186/s13059-014-0480-5) contains supplementary material, which is available to authorized users.
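An entropy-based weighted scoring system of the kind described can be sketched as follows: each binary annotation feature is weighted by one minus its binary entropy, so features rarely observed across natural polymorphisms (more surprising, hence more informative) score higher. The exact weighting used by FunSeq2 may differ; treat this as an illustration.

```python
import math

def feature_weight(p):
    # p: fraction of background polymorphisms overlapping this feature.
    # Weight = 1 - H(p), where H is the binary entropy; common features
    # (p near 0.5) are down-weighted, rare ones approach weight 1.
    if p in (0.0, 1.0):
        return 1.0
    return 1.0 + p * math.log2(p) + (1 - p) * math.log2(1 - p)

def score_variant(features, freqs):
    # Sum the weights of the annotation features a variant overlaps.
    return sum(feature_weight(freqs[f]) for f in features)
```

Under this scheme a variant hitting a rarely perturbed feature (e.g. a conserved motif) outscores one hitting a commonly overlapped feature, which is the intuition behind prioritizing candidate noncoding drivers.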
Transcription factors (TFs) bind in a combinatorial fashion to specify the on-and-off states of genes; the ensemble of these binding events forms a regulatory network, constituting the wiring diagram for a cell. To examine the principles of the human transcriptional regulatory network, we determined the genomic binding information of 119 TFs in 458 ChIP-Seq experiments. We found the combinatorial co-association of TFs to be highly context specific: distinct combinations of factors bind at specific genomic locations. In particular, there are significant differences in the binding proximal and distal to genes. We organized all the TF binding into a hierarchy and integrated it with other genomic information (e.g. miRNA regulation), forming a dense meta-network. Factors at different levels have different properties: for instance, top-level TFs more strongly influence expression and middle-level ones co-regulate targets to mitigate information-flow bottlenecks. Moreover, these co-regulations give rise to many enriched network motifs, such as noise-buffering feed-forward loops. Finally, more connected network components are under stronger selection and exhibit a greater degree of allele-specific activity (i.e., differential binding to the two parental alleles). The regulatory information obtained in this study will be crucial for interpreting personal genome sequences and understanding basic principles of human biology and disease.
DNA methylation is an important type of epigenetic modification involved in gene regulation. Although strong DNA methylation at promoters is widely recognized to be associated with transcriptional repression, many aspects of DNA methylation remain not fully understood, including the quantitative relationships between DNA methylation and expression levels, and the individual roles of promoter and gene body methylation.
Here we present an integrated analysis of whole-genome bisulfite sequencing and RNA sequencing data from human samples and cell lines. We find that while promoter methylation inversely correlates with gene expression as generally observed, the repressive effect is clear only on genes with a very high DNA methylation level. By means of statistical modeling, we find that DNA methylation is indicative of the expression class of a gene in general, but gene body methylation is a better indicator than promoter methylation. These findings are general in that a model constructed from a sample or cell line could accurately fit the unseen data from another. We further find that promoter and gene body methylation have minimal redundancy, and either one is sufficient to signify low expression. Finally, we obtain increased modeling power by integrating histone modification data with the DNA methylation data, showing that neither type of information fully subsumes the other.
Our results suggest that DNA methylation outside promoters also plays critical roles in gene regulation. Future studies on gene regulatory mechanisms and disease-associated differential methylation should pay more attention to DNA methylation at gene bodies and other non-promoter regions.
Electronic supplementary material
The online version of this article (doi:10.1186/s13059-014-0408-0) contains supplementary material, which is available to authorized users.
By its very nature, genomics produces large, high-dimensional datasets that are well suited to analysis by machine learning approaches. Here, we explain some key aspects of machine learning that make it useful for genome annotation, with illustrative examples from ENCODE.
Eukaryotic protein kinases are generally classified as being either tyrosine or serine-threonine specific. Though not evident from inspection of their primary sequences, many serine-threonine kinases display a significant preference for serine or threonine as the phosphoacceptor residue. Here we show that a residue located in the kinase activation segment, which we term the “DFG+1” residue, acts as a major determinant for serine-threonine phosphorylation site specificity. Mutation of this residue was sufficient to switch the phosphorylation site preference for multiple kinases, including the serine-specific kinase PAK4 and the threonine-specific kinase MST4. Kinetic analysis of peptide substrate phosphorylation and crystal structures of PAK4-peptide complexes suggested that phosphoacceptor residue preference is not mediated by stronger binding of the favored substrate. Rather, favored kinase-phosphoacceptor combinations likely promote a conformation optimal for catalysis. Understanding the rules governing kinase phosphoacceptor preference allows kinases to be classified as serine or threonine specific based on their sequence.
• A single active site residue can determine kinase phosphoacceptor specificity
• Favored and disfavored substrates promote distinct kinase-bound conformations
• A simple rule predicts kinase phosphoacceptor preference from its DFG+1 residue
Diabetes and obesity are complex diseases associated with insulin resistance and fatty liver. The latter is characterized by dysregulation of the Akt, AMP-activated protein kinase (AMPK), and IGF-I pathways and expression of microRNAs (miRNAs). In China, multicomponent traditional Chinese medicine (TCM) has been used to treat diabetes for centuries. In this study, we used a three-herb, berberine-containing TCM to treat male Zucker diabetic fatty rats. TCM showed sustained glucose-lowering effects for 1 week after a single-dose treatment. Two-week treatment attenuated insulin resistance and fatty degeneration, with hepatocyte regeneration lasting for 1 month posttreatment. These beneficial effects persisted for 1 year after 1-month treatment. Two-week treatment with TCM was associated with activation of the AMPK, Akt, and insulin-like growth factor-binding protein (IGFBP)1 pathways, with downregulation of miR-29b and expression of a gene network implicated in the cell cycle, intermediary metabolism and NADPH metabolism, with normalization of CYP7a1 and IGFBP1 expression. These concerted changes in mRNA, miRNA, and proteins may explain the sustained effects of TCM in favor of cell survival, increased glucose uptake, and lipid oxidation/catabolism with improved insulin sensitivity and liver regeneration. These novel findings suggest that multicomponent TCM may be a useful tool to unravel genome regulation and expression in complex diseases.
Transcription factors function by binding different classes of regulatory elements. The Encyclopedia of DNA Elements (ENCODE) project has recently produced binding data for more than 100 transcription factors from about 500 ChIP-seq experiments in multiple cell types. While this large amount of data creates a valuable resource, it is nonetheless overwhelmingly complex and simultaneously incomplete since it covers only a small fraction of all human transcription factors.
As part of the consortium effort in providing a concise abstraction of the data for facilitating various types of downstream analyses, we constructed statistical models that capture the genomic features of three paired types of regions by machine-learning methods: firstly, regions with active or inactive binding; secondly, those with extremely high or low degrees of co-binding, termed HOT and LOT regions; and finally, regulatory modules proximal or distal to genes. From the distal regulatory modules, we developed computational pipelines to identify potential enhancers, many of which were validated experimentally. We further associated the predicted enhancers with potential target transcripts and the transcription factors involved. For HOT regions, we found a significant fraction of transcription factor binding without clear sequence motifs and showed that this observation could be related to strong DNA accessibility of these regions.
Overall, the three pairs of regions exhibit intricate differences in chromosomal locations, chromatin features, factors that bind them, and cell-type specificity. Our machine learning approach enables us to identify features potentially general to all transcription factors, including those not included in the data.
Natural small compounds comprise most cellular molecules and bind proteins as substrates, products, cofactors and ligands. However, a large-scale investigation of in vivo protein-small metabolite interactions has not been performed. We developed a mass spectrometry assay for the large-scale identification of in vivo protein-hydrophobic small metabolite interactions in yeast and analyzed compounds that bind ergosterol biosynthetic proteins and protein kinases. Many of these proteins bind small metabolites; a few interactions were previously known, but the vast majority are novel. Importantly, many key regulatory proteins such as protein kinases bind metabolites. Ergosterol was found to bind many proteins and may function as a general regulator. It is required for the activity of Ypk1, a mammalian AKT/SGK1 kinase homolog. Our study defines potential key regulatory steps in lipid biosynthetic pathways and suggests that small metabolites may play a more general role as regulators of protein activity and function than previously appreciated.
We propose a method to predict yeast transcription factor targets by integrating histone modification profiles with transcription factor binding motif information. It shows improved predictive power compared to a binding motif-only method. We find that transcription factors cluster into histone-sensitive and -insensitive classes. The target genes of histone-sensitive transcription factors have stronger histone modification signals than those of histone-insensitive ones. The two classes also differ in tendency to interact with histone modifiers, degree of connectivity in protein-protein interaction networks, position in the transcriptional regulation hierarchy, and in a number of additional features, indicating possible differences in their transcriptional regulation mechanisms.
Peptide Recognition Domains (PRDs) are commonly found in signaling proteins. They mediate protein-protein interactions by recognizing and binding short motifs in their ligands. Although a great deal is known about PRDs and their interactions, prediction of PRD specificities remains largely an unsolved problem.
We present a novel approach to identifying these Specificity Determining Residues (SDRs). Our algorithm generalizes earlier information theoretic approaches to coevolution analysis, to become applicable to this problem. It leverages the growing wealth of binding data between PRDs and large numbers of random peptides, and searches for PRD residues that exhibit strong evolutionary covariation with some positions of the statistical profiles of bound peptides. The calculations involve only information from sequences, and thus can be applied to PRDs without crystal structures. We applied the approach to PDZ, SH3 and kinase domains, and evaluated the results using both residue proximity in co-crystal structures and verified binding specificity maps from mutagenesis studies.
Our predictions were found to be strongly correlated with the physical proximity of residues, demonstrating the ability of our approach to detect physical interactions of the binding partners. Some high-scoring pairs were further confirmed to affect binding specificity using previous experimental results. Combining the covariation results also allowed us to predict binding profiles with higher reliability than two other methods that do not explicitly take residue covariation into account.
The general applicability of our approach to the three different domain families demonstrated in this paper suggests its potential in predicting binding targets and assisting the exploration of binding mechanisms.
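The core covariation quantity behind such analyses is the mutual information between a PRD residue column and a position of the bound-peptide profile. The minimal sketch below is our own simplification; it ignores pseudocounts and sequence weighting that a production coevolution method would need.

```python
import math
from collections import Counter

def mutual_information(col_a, col_b):
    # MI between two aligned character columns, e.g. one PRD residue
    # position across many domains and one position of each domain's
    # bound-peptide consensus. High MI suggests covariation, a candidate
    # specificity-determining residue.
    n = len(col_a)
    pa, pb = Counter(col_a), Counter(col_b)
    pab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), c in pab.items():
        p = c / n
        mi += p * math.log2(p / ((pa[a] / n) * (pb[b] / n)))
    return mi
```

Ranking all (domain position, peptide position) pairs by such a score is how purely sequence-based covariation can point to residues in physical contact, without requiring a crystal structure.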
We systematically generated large-scale data sets to improve genome annotation for the nematode Caenorhabditis elegans, a key model organism. These data sets include transcriptome profiling across a developmental time course, genome-wide identification of transcription factor–binding sites, and maps of chromatin organization. From this, we created more complete and accurate gene models, including alternative splice forms and candidate noncoding RNAs. We constructed hierarchical networks of transcription factor–binding and microRNA interactions and discovered chromosomal locations bound by an unusually large number of transcription factors. Different patterns of chromatin composition and histone modification were revealed between chromosome arms and centers, with similarly prominent differences between autosomes and the X chromosome. Integrating data types, we built statistical models relating chromatin, transcription factor binding, and gene expression. Overall, our analyses ascribed putative functions to most of the conserved genome.
We have implemented the aggregation and correlation toolbox (ACT), an efficient, multifaceted toolbox for analyzing continuous signal and discrete region tracks from high-throughput genomic experiments, such as RNA-seq or ChIP-chip signal profiles from the ENCODE and modENCODE projects, or lists of single nucleotide polymorphisms from the 1000 Genomes Project. It is able to generate aggregate profiles of a given track around a set of specified anchor points, such as transcription start sites. It is also able to correlate related tracks and analyze them for saturation, i.e. how much of a certain feature is covered with each succeeding experiment. The ACT site contains downloadable code in a variety of formats, interactive web servers (for use on small quantities of data), example datasets, documentation and a gallery of outputs. Here, we explain the components of the toolbox in more detail and apply them in various contexts.
Availability: ACT is available at http://act.gersteinlab.org
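The aggregation component can be sketched as averaging a signal track over fixed-size windows centered at anchor points such as transcription start sites. This is our simplified rendition of the idea, not ACT's actual implementation.

```python
import numpy as np

def aggregate_profile(signal, anchors, flank=5):
    # Average a genome-wide signal track over windows centered at anchor
    # points (e.g. transcription start sites); windows that would run off
    # either end of the track are skipped.
    windows = [signal[a - flank : a + flank + 1]
               for a in anchors
               if a - flank >= 0 and a + flank + 1 <= len(signal)]
    return np.mean(windows, axis=0)
```

The resulting profile of length 2*flank+1 reveals average behavior around the anchor, e.g. a dip of nucleosome signal or a peak of TF binding at position 0.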
We develop a statistical framework to study the relationship between chromatin features and gene expression. This can be used to predict gene expression of protein coding genes, as well as microRNAs. We demonstrate the prediction in a variety of contexts, focusing particularly on the modENCODE worm datasets. Moreover, our framework reveals the positional contribution around genes (upstream or downstream) of distinct chromatin features to the overall prediction of expression levels.
We describe the potential of current Web 2.0 technologies to achieve data mashup in the health care and life sciences (HCLS) domains, and compare that potential to the nascent trend of performing semantic mashup. After providing an overview of Web 2.0, we demonstrate two scenarios of data mashup, facilitated by the following Web 2.0 tools and sites: Yahoo! Pipes, Dapper, Google Maps and GeoCommons. In the first scenario, we exploited Dapper and Yahoo! Pipes to implement a challenging data integration task in the context of DNA microarray research. In the second scenario, we exploited Yahoo! Pipes, Google Maps, and GeoCommons to create a geographic information system (GIS) interface that allows visualization and integration of diverse categories of public health data, including cancer incidence and pollution prevalence data. Based on these two scenarios, we discuss the strengths and weaknesses of these Web 2.0 mashup technologies. We then describe Semantic Web, the mainstream Web 3.0 technology that enables more powerful data integration over the Web. We discuss the areas of intersection of Web 2.0 and Semantic Web, and describe the potential benefits that can be brought to HCLS research by combining these two sets of technologies.
Web 2.0; integration; mashup; Semantic Web; biomedical informatics; bioinformatics; life sciences; health care; public health
We performed computational reconstruction of the in silico gene regulatory networks in the DREAM3 Challenges. Our task was to learn the networks from two types of data, namely gene expression profiles in deletion strains (the ‘deletion data’) and time series trajectories of gene expression after some initial perturbation (the ‘perturbation data’). In the course of developing the prediction method, we observed that the two types of data contained different and complementary information about the underlying network. In particular, deletion data allow for the detection of direct regulatory activities with strong responses upon the deletion of the regulator while perturbation data provide richer information for the identification of weaker and more complex types of regulation. We applied different techniques to learn the regulation from the two types of data. For deletion data, we learned a noise model to distinguish real signals from random fluctuations using an iterative method. For perturbation data, we used differential equations to model the change of expression levels of a gene along the trajectories due to the regulation of other genes. We tried different models, and combined their predictions. The final predictions were obtained by merging the results from the two types of data. A comparison with the actual regulatory networks suggests that our approach is effective for networks with a range of different sizes. The success of the approach demonstrates the importance of integrating heterogeneous data in network reconstruction.
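The differential-equation model for perturbation data can be sketched as a linear ODE, dx_i/dt = sum_j W_ij x_j - lambda x_i, with W estimated by least squares on finite differences of the trajectories. This is a simplified stand-in for the models actually tried in the challenge; the fixed decay rate and fitting procedure are our assumptions.

```python
import numpy as np

def fit_regulation(X, dt=1.0, decay=0.1):
    # X: (time x genes) expression trajectory after an initial perturbation.
    # Model dx_i/dt = sum_j W[i, j] * x_j - decay * x_i and estimate W by
    # least squares on finite-difference derivative estimates.
    dX = (X[1:] - X[:-1]) / dt       # finite-difference derivatives
    target = dX + decay * X[:-1]     # move the decay term to the left side
    W_T, *_ = np.linalg.lstsq(X[:-1], target, rcond=None)
    return W_T.T
```

Large entries of the recovered W then suggest candidate regulatory edges; in the actual challenge these would be merged with the independent evidence extracted from the deletion data.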