Text mining for the life sciences aims to aid database curation, knowledge summarization and information retrieval through the automated processing of biomedical texts. To provide comprehensive coverage and enable full integration with existing biomolecular database records, it is crucial that text mining tools scale up to millions of articles and that their analyses can be unambiguously linked to information recorded in resources such as UniProt, KEGG, BioGRID and NCBI databases. In this study, we investigate how fully automated text mining of complex biomolecular events can be augmented with a normalization strategy that identifies biological concepts in text, mapping them to identifiers at varying levels of granularity, ranging from canonicalized symbols to unique genes and proteins and broad gene families. To this end, we have combined two state-of-the-art text mining components, previously evaluated on two community-wide challenges, and have extended and improved upon these methods by exploiting their complementary nature. Using these systems, we perform normalization and event extraction to create a large-scale resource that is publicly available, unique in semantic scope, and covers all 21.9 million PubMed abstracts and 460 thousand PubMed Central open access full-text articles. This dataset contains 40 million biomolecular events involving 76 million gene/protein mentions, linked to 122 thousand distinct genes from 5032 species across the full taxonomic tree. Detailed evaluations and analyses reveal promising results for application of this data in database and pathway curation efforts. The main software components used in this study are released under an open-source license. Further, the resulting dataset is freely accessible through a novel API, providing programmatic and customized access (http://www.evexdb.org/api/v001/).
Finally, to allow for large-scale bioinformatic analyses, the entire resource is available for bulk download from http://evexdb.org/download/, under the Creative Commons – Attribution – Share Alike (CC BY-SA) license.
Transcriptional regulation critically depends on proper interactions between transcription factors (TF) and their cognate DNA binding sites. The widely used model of TF-DNA binding – the Positional Weight Matrix (PWM) – presumes independence between positions within the binding site. However, there is evidence to show that the independence assumption may not always hold, and the extent of interposition dependence is not completely known. We hypothesize that the interposition dependence should partly be manifested as correlated evolution at the positions. We report a Maximum-Likelihood (ML) approach to infer correlated evolution at any two positions within a PWM, based on a multiple alignment of 5 mammalian genomes. Application to a genome-wide set of putative cis elements in human promoters reveals a prevalence of correlated evolution within cis elements. We found that the interdependence between two positions decreases with increasing distance between the positions. The interdependent positions tend to be evolutionarily more constrained and moreover, the dependence patterns are relatively similar across structurally related transcription factors. Although some of the detected mutational dependencies may be due to context-dependent genomic hyper-mutation, notably CG to TG, the majority is likely due to context-dependent preferences for specific nucleotide combinations within the cis elements. Patterns of evolution at individual nucleotide positions within mammalian TF binding sites are often significantly correlated, suggesting interposition dependence. The proposed methodology is also applicable to other classes of non-coding functional elements. A detailed investigation of mutational dependencies within specific motifs could reveal preferred nucleotide combinations that may help refine the DNA binding models.
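To make the independence assumption concrete, the following is a minimal sketch of standard PWM log-odds scoring, in which the score of a site is a plain sum of per-position weights. The toy count matrix, the pseudocount and the uniform background frequency are illustrative assumptions and are not taken from the study.

```python
import math

BASES = "ACGT"

def pwm_from_counts(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2-odds weights."""
    pwm = []
    for column in counts:
        total = sum(column[b] for b in BASES) + 4 * pseudocount
        pwm.append({
            b: math.log2(((column[b] + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm

def score_site(pwm, site):
    """Score a site as the sum of independent per-position weights."""
    if len(site) != len(pwm):
        raise ValueError("site length must equal PWM length")
    return sum(column[base] for column, base in zip(pwm, site))

# Hypothetical counts for a 3-bp motif (assumption, for illustration only).
toy_counts = [
    {"A": 8, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 4, "G": 4, "T": 0},
    {"A": 0, "C": 0, "G": 0, "T": 8},
]
toy_pwm = pwm_from_counts(toy_counts)
```

Because each term depends on one position only, such a model cannot express a preference for a specific pair of nucleotides at two positions, which is the kind of interposition dependence the abstract investigates.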
The short length and high degeneracy of sites recognized by DNA-binding transcription factors limit the amount of information they can carry, and individual sites are rarely sufficient to mediate the regulation of specific targets. Computational analysis of microbial genomes has suggested that many factors function optimally when in a particular orientation and position with respect to their target promoters. To investigate this further, we developed and trained spatial models of binding site positioning and applied them to the genome of the yeast Saccharomyces cerevisiae. We found evidence of non-random organization of sites within promoters, differences in binding site density, or both for thirty-eight transcription factors. We show that these signatures allow transcription factors with substantial differences in binding site specificity to share similar promoter specificities. We illustrate how spatial information dictating the positioning and density of binding sites can in principle increase the information available to the organism for differentiating a transcription factor’s true targets, and we indicate how this information could potentially be leveraged for the same purpose in bioinformatic analyses.
An essential component of genome function is the syntax of genomic regulatory elements that determine how diverse transcription factors interact to orchestrate a program of regulatory control. A precise characterization of in vivo spacing constraints between key transcription factors would reveal key aspects of this genomic regulatory language. To discover novel transcription factor spatial binding constraints in vivo, we developed a new integrative computational method, genome wide event finding and motif discovery (GEM). GEM resolves ChIP data into explanatory motifs and binding events at high spatial resolution by linking binding event discovery and motif discovery with positional priors in the context of a generative probabilistic model of ChIP data and genome sequence. GEM analysis of 63 transcription factors in 214 ENCODE human ChIP-Seq experiments recovers more known factor motifs than other contemporary methods, and discovers six new motifs for factors with unknown binding specificity. GEM's adaptive learning of binding-event read distributions allows it to further improve upon previous methods for processing ChIP-Seq and ChIP-exo data to yield unsurpassed spatial resolution and discovery of closely spaced binding events of the same factor. In a systematic analysis of in vivo sequence-specific transcription factor binding using GEM, we have found hundreds of spatial binding constraints between factors. GEM found 37 examples of factor binding constraints in mouse ES cells, including strong distance-specific constraints between Klf4 and other key regulatory factors. In human ENCODE data, GEM found 390 examples of spatially constrained pair-wise binding, including such novel pairs as c-Fos:c-Jun/USF1, CTCF/Egr1, and HNF4A/FOXA1. 
The discovery of new factor-factor spatial constraints in ChIP data is significant because it proposes testable models for regulatory factor interactions that will help elucidate genome function and the implementation of combinatorial control.
The letters in our genome spell words and phrases that control when each gene is activated. To understand how these words and phrases function in health and disease, we have developed a new computational method to determine what word positions in our genomic text are used by each genome regulatory protein, and how these active words are spaced relative to one another. Our method achieves exceptional spatial accuracy by integrating experimental data with the text of our genome to find the precise words that are regulated by each protein factor. Using this analysis we have discovered novel word spacings in the experimental data that suggest novel genome grammatical control constructs.
The field of regulatory genomics today is characterized by the generation of high-throughput data sets that capture genome-wide transcription factor (TF) binding, histone modifications, or DNAseI hypersensitive regions across many cell types and conditions. In this context, a critical question is how to make optimal use of these publicly available datasets when studying transcriptional regulation. Here, we address this question in Drosophila melanogaster for which a large number of high-throughput regulatory datasets are available. We developed i-cisTarget (where the ‘i’ stands for integrative), for the first time enabling the discovery of different types of enriched ‘regulatory features’ in a set of co-regulated sequences in one analysis, being either TF motifs or ‘in vivo’ chromatin features, or combinations thereof. We have validated our approach on 15 co-expressed gene sets, 21 ChIP data sets, 628 curated gene sets and multiple individual case studies, and show that meaningful regulatory features can be confidently discovered; that bona fide enhancers can be identified, both by in vivo events and by TF motifs; and that combinations of in vivo events and TF motifs further increase the performance of enhancer prediction.
With the advent of whole-genome and whole-exome sequencing, high-quality catalogs of recurrently mutated cancer genes are becoming available for many cancer types. Increasing access to sequencing technology, including bench-top sequencers, provides the opportunity to re-sequence a limited set of cancer genes across a patient cohort with limited processing time. Here, we re-sequenced a set of cancer genes in T-cell acute lymphoblastic leukemia (T-ALL) using Nimblegen sequence capture coupled with Roche/454 technology. First, we investigated through a benchmark study how maximal sensitivity and specificity of mutation detection can be achieved. We tested nine combinations of different mapping and variant-calling methods, varied the variant calling parameters, and compared the predicted mutations with a large independent validation set obtained by capillary re-sequencing. We found that the combination of two mapping algorithms, namely BWA-SW and SSAHA2, coupled with the variant calling algorithm Atlas-SNP2 yields the highest sensitivity (95%) and the highest specificity (93%). Next, we applied this analysis pipeline to identify mutations in a set of 58 cancer genes, in a panel of 18 T-ALL cell lines and 15 T-ALL patient samples. We confirmed mutations in known T-ALL drivers, including PHF6, NF1, FBXW7, NOTCH1, KRAS, NRAS, PIK3CA, and PTEN. Interestingly, we also found mutations in several cancer genes that had not been linked to T-ALL before, including JAK3. Finally, we re-sequenced a small set of 39 candidate genes and identified recurrent mutations in TET1, SPRY3 and SPRY4. In conclusion, we established an optimized analysis pipeline for Roche/454 data that can be applied to accurately detect gene mutations in cancer, which led to the identification of several new candidate T-ALL driver mutations.
During early embryogenesis the zygotic genome is transcriptionally silent and all mRNAs present are of maternal origin. The maternal-zygotic transition marks the time over which embryogenesis changes its dependence from maternal RNAs to zygotically transcribed RNAs. Here we present the first systematic investigation of early zygotic genes (EZGs) in a mosquito species and focus on genes involved in the onset of transcription during 2–4 hr. We used transcriptome sequencing to identify the “pure” (without maternal expression) EZGs by analyzing transcripts from four embryonic time ranges of 0–2, 2–4, 4–8, and 8–12 hr, which includes the time of cellular blastoderm formation and up to the start of gastrulation. Blast of 16,789 annotated transcripts vs. the transcriptome reads revealed evidence for 63 (P<0.001) and 143 (P<0.05) nonmaternally derived transcripts having a significant increase in expression at 2–4 hr. One third of the 63 EZG transcripts do not have predicted introns compared to 10% of all Ae. aegypti genes. We have confirmed by RT-PCR that zygotic transcription starts as early as 2–3 hours. A degenerate motif VBRGGTA was found to be overrepresented in the upstream sequences of the identified EZGs using a motif identification software called SCOPE. We find evidence for homology between this motif and the TAGteam motif found in Drosophila that has been implicated in EZG activation. A 38 bp sequence in the proximal upstream sequence of a kinesin light chain EZG (KLC2.1) contains two copies of the mosquito motif. This sequence was shown to support EZG transcription by luciferase reporter assays performed on injected early embryos, and confers early zygotic activity to a heterologous promoter from a divergent mosquito species. The results of these studies are consistent with the model of early zygotic genome activation via transcriptional activators, similar to what has been found recently in Drosophila.
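As an illustration of what the degenerate motif VBRGGTA denotes, the sketch below expands IUPAC nucleotide codes into a regular expression and scans the forward strand of a sequence for matches. This is not how SCOPE works; it only shows which 7-mers the motif covers. The function names and the example sequence are hypothetical.

```python
import re

# Standard IUPAC nucleotide codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def motif_to_regex(motif):
    """Build a lookahead regex so that overlapping matches are reported."""
    body = "".join(IUPAC[code] for code in motif.upper())
    return re.compile("(?=(" + body + "))")

def find_motif(sequence, motif):
    """Return (start, matched_text) for every forward-strand occurrence."""
    pattern = motif_to_regex(motif)
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence.upper())]

# Made-up sequence for illustration; 0-based start positions are returned.
hits = find_motif("TTCTAGGTAAA", "VBRGGTA")
```

A fuller scan would also search the reverse complement, which this sketch omits.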
The last two decades have witnessed a dramatic acceleration in the production of genomic sequence information and publication of biomedical articles. Despite the fact that genome sequence data and publications are two of the most heavily relied-upon sources of information for many biologists, very little effort has been made to systematically integrate data from genomic sequences directly with the biological literature. For a limited number of model organisms dedicated teams manually curate publications about genes; however for species with no such dedicated staff many thousands of articles are never mapped to genes or genomic regions.
To overcome the lack of integration between genomic data and biological literature, we have developed pubmed2ensembl (http://www.pubmed2ensembl.org), an extension to the BioMart system that links over 2,000,000 articles in PubMed to nearly 150,000 genes in Ensembl from 50 species. We use several curated (e.g., Entrez Gene) and automatically generated (e.g., gene names extracted through text-mining on MEDLINE records) sources of gene-publication links, allowing users to filter and combine different data sources to suit their individual needs for information extraction and biological discovery. In addition to extending the Ensembl BioMart database to include published information on genes, we also implemented a scripting language for automated BioMart construction and a novel BioMart interface that allows text-based queries to be performed against PubMed and PubMed Central documents in conjunction with constraints on genomic features. Finally, we illustrate the potential of pubmed2ensembl through typical use cases that involve integrated queries across the biomedical literature and genomic data.
By allowing biologists to find the relevant literature on specific genomic regions or sets of functionally related genes more easily, pubmed2ensembl offers a much-needed genome informatics inspired solution to accessing the ever-increasing biomedical literature.
The availability of sequence specificities for a substantial fraction of yeast's transcription factors and comparative genomic algorithms for binding site prediction has made it possible to comprehensively annotate transcription factor binding sites genome-wide. Here we use such a genome-wide annotation for comprehensively studying promoter architecture in yeast, focusing on the distribution of transcription factor binding sites relative to transcription start sites, and the architecture of TATA and TATA-less promoters. For most transcription factors, binding sites are positioned further upstream and vary over a wider range in TATA promoters than in TATA-less promoters. In contrast, a group of ‘proximal promoter motifs’ (GAT1/GLN3/DAL80, FKH1/2, PBF1/2, RPN4, NDT80, and ROX1) occur preferentially in TATA-less promoters and show a strong preference for binding close to the transcription start site in these promoters. We provide evidence that suggests that pre-initiation complexes are recruited at TATA sites in TATA promoters and at the sites of the other proximal promoter motifs in TATA-less promoters. TATA-less promoters can generally be classified by the proximal promoter motif they contain, with different classes of TATA-less promoters showing different patterns of transcription factor binding site positioning and nucleosome coverage. These observations suggest that different modes of regulation of transcription initiation may be operating in the different promoter classes. In addition we show that, across all promoter classes, there is a close match between nucleosome free regions and regions of highest transcription factor binding site density. This close agreement between transcription factor binding site density and nucleosome depletion suggests a direct and general competition between transcription factors and nucleosomes for binding to promoters.
Most position weight matrix (PWM) based bioinformatics methods developed to predict transcription factor binding sites (TFBSs) assume that each nucleotide in the sequence motif contributes independently to the interaction between protein and DNA sequence, which usually produces many false-positive predictions. The increasing availability of TF enrichment profiles from recent ChIP-Seq methodology facilitates the investigation of dependence structure and the accurate prediction of TFBSs. We develop a novel Tree-based PWM (TPWM) approach to accurately model the interaction between a TF and its binding site. The whole tree-structured PWM can be considered a mixture of different conditional PWMs. We propose a discriminative approach, called TPD (TPWM based Discriminative Approach), to construct the TPWM from ChIP-Seq data and a pre-existing PWM. To achieve the maximum discriminative power between the positive and negative datasets, the cutoff value is determined based on the Matthews correlation coefficient (MCC). The resulting TPWMs are evaluated with respect to accuracy on extensive synthetic datasets. We then apply our TPWM discriminative approach to several real ChIP-Seq datasets to refine the current TFBS models stored in the TRANSFAC database. Experiments on both simulated and real ChIP-Seq data show that the proposed method, starting from an existing PWM, performs consistently better than existing tools in detecting TFBSs. The improved accuracy is the result of modelling the complete dependence structure of the motifs and better prediction of the true positive rate. The findings could lead to a better understanding of the mechanisms of TF-DNA interactions.
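The cutoff selection step can be sketched as follows: for each candidate score threshold, count true and false positives and negatives on the two datasets, compute the Matthews correlation coefficient, and keep the threshold that maximizes it. This is a generic illustration under the assumption that a sequence is called bound when its score is at or above the cutoff; the function names and the exhaustive threshold search are assumptions, not the TPD implementation.

```python
import math

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; defined as 0.0 when a margin is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def best_cutoff(pos_scores, neg_scores):
    """Return (cutoff, mcc) for the threshold that maximizes MCC.

    A sequence is predicted positive when its score is >= cutoff.
    Ties are resolved in favor of the lowest such cutoff.
    """
    best = (None, -1.0)
    for cutoff in sorted(set(pos_scores) | set(neg_scores)):
        tp = sum(s >= cutoff for s in pos_scores)
        fn = len(pos_scores) - tp
        fp = sum(s >= cutoff for s in neg_scores)
        tn = len(neg_scores) - fp
        value = mcc(tp, fp, tn, fn)
        if value > best[1]:
            best = (cutoff, value)
    return best
```

MCC uses all four cells of the confusion matrix, so it remains informative when the positive and negative sets differ in size.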
Bronchoalveolar stem cells (BASCs) located in the bronchoalveolar duct junction are thought to regenerate both bronchiolar and alveolar epithelium during homeostatic turnover and in response to injury. The mechanisms directing self-renewal in BASCs are poorly understood.
BASCs (Sca-1+, CD34+, CD31−, and CD45−) were isolated from adult mouse lung using FACS, and their capacity for self-renewal and differentiation was demonstrated by immunostaining. A transcription factor network of 53 genes required for pluripotency in embryonic stem cells was assessed in BASCs, Kras-initiated lung tumor tissue, and lung organogenesis by real-time PCR. c-Myc was knocked down in BASCs by infection with c-Myc shRNA lentivirus. Comprehensive miRNA and mRNA profiling for BASCs was performed, and significant miRNAs and mRNAs potentially regulated by c-Myc were identified. We explored a c-Myc regulatory network in BASCs using a number of statistical and computational approaches through two different strategies: 1) c-Myc/Max binding sites within individual gene promoters, and 2) miRNA-regulated target genes.
c-Myc expression was upregulated in BASCs and downregulated over the time course of lung organogenesis in vivo. The depletion of c-Myc in BASCs resulted in decreased proliferation and cell death. Multiple mRNAs and miRNAs were dynamically regulated in c-Myc depleted BASCs. Among a total of 250 dynamically regulated genes in c-Myc depleted BASCs, 57 genes were identified as potential targets of miRNAs through miRBase and TargetScan-based computational mapping. A further 88 genes were identified as potential downstream targets through their c-Myc binding motif.
c-Myc plays a critical role in maintaining the self-renewal capacity of lung bronchoalveolar stem cells through a combination of miRNA and transcription factor regulatory networks.
Interactome networks represent sets of possible physical interactions between proteins. They lack spatio-temporal information by construction. However, the specialized functions of the differentiated cell types which are assembled into tissues or organs depend on the combinatorial arrangements of proteins and their physical interactions. Is tissue-specificity, therefore, encoded within the interactome? In order to address this question, we combined protein-protein interactions, expression data, functional annotations and interactome topology. We first identified a subnetwork formed exclusively of proteins whose interactions were observed in all tested tissues. These are mainly involved in housekeeping functions and are located at the topological center of the interactome. This ‘Largest Common Interactome Network’ represents a ‘functional interactome core’. Interestingly, two types of tissue-specific interactions are distinguished when considering function and network topology: tissue-specific interactions involved in regulatory and developmental functions are central whereas tissue-specific interactions involved in organ physiological functions are peripheral. Overall, the functional organization of the human interactome reflects several integrative levels of functions with housekeeping and regulatory tissue-specific functions at the center and physiological tissue-specific functions at the periphery. This gradient of functions recapitulates the organization of organs, from cells to organs. Given that several gradients have already been identified across interactomes, we propose that gradients may represent a general principle of protein-protein interaction network organization.
Analysis of biological processes is frequently performed with the help of phenotypic assays in which data are mostly acquired by single end-point analysis. Alternative phenotypic profiling techniques are desired where time-series information is essential to the biological question, for instance to differentiate early and late regulators of cell proliferation in loss-of-function studies. So far, no study has addressed this question despite strong unmet interest, mostly due to the limitations of conventional end-point assay technologies. We present the first human kinome screen with a real-time cell analysis system (RTCA) to capture dynamic RNAi phenotypes, employing time-resolved monitoring of cell proliferation via electrical impedance. RTCA allowed us to investigate the dynamics of cell proliferation phenotypes instead of using conventional end-point analysis. By introducing a data transformation with the first-order derivative, i.e. the cell-index growth rate, we demonstrate that this system is suitable for high-throughput screening (HTS). The screen validated previously identified inhibitor genes and, additionally, identified activators of cell proliferation. With time-kinetics information available, we could establish that a network of mitotic-event related genes was among the first to display inhibitory effects after RNAi knockdown. The time-resolved screen captured the kinetics of cell proliferation caused by RNAi targeting the human kinome, serving as a resource for researchers. Our work establishes RTCA technology as a novel robust tool with biological and pharmacological relevance amenable to high-throughput screening.
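The first-order derivative transformation mentioned above can be approximated from discrete impedance readings with finite differences. The sketch below assumes a list of measurement times and the corresponding cell-index values; the function name and the choice of a simple backward difference (rather than any smoothing the authors may have applied) are assumptions.

```python
def growth_rate(times, cell_index):
    """Finite-difference estimate of d(cell index)/dt between consecutive readings.

    Returns one rate per interval, so the result has len(times) - 1 entries.
    """
    if len(times) != len(cell_index):
        raise ValueError("times and cell_index must have the same length")
    rates = []
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        if dt <= 0:
            raise ValueError("times must be strictly increasing")
        rates.append((cell_index[i] - cell_index[i - 1]) / dt)
    return rates

# Hypothetical readings at 0, 2 and 4 hours.
example_rates = growth_rate([0, 2, 4], [1.0, 2.0, 2.5])
```

Working with rates rather than raw cell-index values makes it possible to compare when a knockdown starts to slow or accelerate proliferation, independent of the absolute signal level in each well.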
The identification of orthologous genes shared by multiple genomes plays an important role in evolutionary studies and gene functional analyses. Building on MSOAR 2.0, a recently developed accurate tool for ortholog assignment between a pair of closely related genomes based on genome rearrangement, we present a new system, MultiMSOAR 2.0, to identify ortholog groups among multiple genomes. In the system, we construct gene families for all the genomes using sequence similarity search and clustering, run MSOAR 2.0 for all pairs of genomes to obtain the pairwise orthology relationship, and partition each gene family into a set of disjoint sets of orthologous genes (called super ortholog groups or SOGs) such that each SOG contains at most one gene from each genome. For each such SOG, we label the leaves of the species tree using 1 or 0 to indicate if the SOG contains a gene from the corresponding species or not. The resulting tree is called a tree of ortholog groups (or TOG). We then label the internal nodes of each TOG based on the parsimony principle and some biological constraints. Ortholog groups are finally identified from each fully labeled TOG. In comparison with a popular tool, MultiParanoid, on simulated data, MultiMSOAR 2.0 shows significantly higher prediction accuracy. It also outperforms MultiParanoid, the Roundup multi-ortholog repository and the Ensembl ortholog database in real data experiments using gene symbols as a validation tool. In addition to ortholog group identification, MultiMSOAR 2.0 also provides information about gene births, duplications and losses in evolution, which may be of independent biological interest. Our experiments on simulated data demonstrate that MultiMSOAR 2.0 is able to infer these evolutionary events much more accurately than the well-known software tool Notung. The software MultiMSOAR 2.0 is available to the public for free.
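To give a feel for parsimony-based labeling of internal nodes from 0/1 leaf labels, the sketch below applies Fitch's classic small-parsimony pass to a binary species tree. This is a stand-in technique chosen for illustration: it is not necessarily the labeling procedure used by MultiMSOAR 2.0, and it ignores the biological constraints mentioned in the abstract. The nested-tuple tree encoding and the species names are hypothetical.

```python
def fitch(tree, leaf_states):
    """Bottom-up Fitch pass on a binary tree of nested 2-tuples.

    tree: a leaf name (str) or a pair (left_subtree, right_subtree).
    leaf_states: dict mapping leaf name -> 0 or 1 (gene absent or present).
    Returns (candidate_state_set_at_this_node, minimum_number_of_changes).
    """
    if isinstance(tree, str):
        return {leaf_states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch(left, leaf_states)
    right_set, right_changes = fitch(right, leaf_states)
    common = left_set & right_set
    if common:
        # Children agree on at least one state: no extra change is needed.
        return common, left_changes + right_changes
    # Children disagree: keep both options and count one gain/loss event.
    return left_set | right_set, left_changes + right_changes + 1

# Hypothetical four-species tree ((A,B),(C,D)).
species_tree = (("A", "B"), ("C", "D"))
root_set, events = fitch(species_tree, {"A": 1, "B": 1, "C": 0, "D": 0})
```

Here the gene is present in one clade and absent in the other, so a single gain or loss event explains the leaf labels.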
Developing analytical methodologies to identify biomarkers in easily accessible body fluids is highly valuable for the early diagnosis and management of cancer patients. Peripheral whole blood is a “nucleic acid-rich” and “inflammatory cell-rich” information reservoir and represents systemic processes altered by the presence of cancer cells.
We conducted transcriptome profiling of whole blood cells from melanoma patients. To overcome challenges associated with blood-based transcriptome analysis, we used a PAXgene™ tube and NuGEN Ovation™ globin reduction system. The combined use of these systems in microarray analysis resulted in the identification of 78 unique genes differentially expressed in the blood of melanoma patients. Of these, 68 genes were further analyzed by quantitative reverse transcriptase PCR using blood samples from 45 newly diagnosed melanoma patients (stage I to IV) and 50 healthy control individuals. Thirty-nine genes were verified to be differentially expressed in blood samples from melanoma patients. A stepwise logit analysis selected eighteen 2-gene signatures that distinguish melanoma from healthy controls. Of these, a 2-gene signature consisting of PLEK2 and C1QB gave the best result, correctly classifying 93.3% of melanoma patients and 90% of healthy controls. Both genes were upregulated in blood samples of melanoma patients from all stages. Further analysis using blood fractionation showed that CD45− and CD45+ populations were responsible for the altered expression levels of PLEK2 and C1QB, respectively.
The current study provides the first analysis of whole blood-based transcriptome biomarkers for malignant melanoma. The expression of PLEK2, the strongest gene to classify melanoma patients, in CD45− subsets illustrates the importance of analyzing whole blood cells for biomarker studies. The study suggests that transcriptome profiling of blood cells could be used for both early detection of melanoma and monitoring of patients for residual disease.
One difficult question facing researchers is how to prioritize SNPs detected from genetic association studies for functional studies. Often a list of the top M SNPs is determined based solely on the p-value from an association analysis, where M is determined by financial/time constraints. For many studies of complex diseases, multiple analyses have been completed, and integrating these multiple sets of results may be difficult. One may also wish to incorporate biological knowledge, such as whether the SNP is in the exon of a gene or a regulatory region, into the selection of markers to follow up. In this manuscript, we propose a Bayesian latent variable model (BLVM) for incorporating “features” about a SNP to estimate a latent “quality score”, with SNPs prioritized based on the posterior probability distribution of the rankings of these quality scores. We illustrate the method using data from an ovarian cancer genome-wide association study (GWAS). In addition to the application of the BLVM to the ovarian GWAS, we applied the BLVM to simulated data that mimic the setting involving the prioritization of markers across multiple GWAS for related diseases/traits. The top-ranked SNP by BLVM for the ovarian GWAS ranked 2nd and 7th based on p-values from analyses of all invasive and invasive serous cases, respectively. The top SNP based on the serous case analysis p-value (which ranked 197th for the invasive case analysis) was ranked 8th based on the posterior probability of being in the top 5 markers (0.13). In summary, the application of the BLVM allows for the systematic integration of multiple SNP “features” for the prioritization of loci for fine-mapping or functional studies, taking into account the uncertainty in ranking.
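The quantity used for prioritization, the posterior probability that a SNP ranks among the top k, can be estimated from posterior samples by counting how often each SNP falls in the top k. The sketch below assumes that posterior draws of the latent quality scores are already available as one dictionary per draw and that a higher score means a better SNP; it does not fit the BLVM itself, and the SNP identifiers are hypothetical.

```python
def top_k_probability(draws, k):
    """Estimate P(SNP is ranked in the top k) from posterior draws.

    draws: list of dicts, each mapping SNP id -> sampled quality score.
    Returns a dict mapping SNP id -> fraction of draws in which it is in the top k.
    """
    if not draws:
        raise ValueError("at least one posterior draw is required")
    counts = {}
    for draw in draws:
        # Rank SNPs within this draw, highest quality score first.
        ranked = sorted(draw, key=draw.get, reverse=True)
        for snp in ranked[:k]:
            counts[snp] = counts.get(snp, 0) + 1
    all_snps = {snp for draw in draws for snp in draw}
    return {snp: counts.get(snp, 0) / len(draws) for snp in all_snps}

# Two hypothetical posterior draws over three SNPs.
toy_draws = [
    {"rs1": 0.9, "rs2": 0.5, "rs3": 0.1},
    {"rs1": 0.2, "rs2": 0.8, "rs3": 0.1},
]
toy_probs = top_k_probability(toy_draws, k=1)
```

Ranking by this probability, rather than by a single point estimate, is what lets the prioritization reflect uncertainty in the rankings.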
Intuitive visualization of data and results is very important in genomics, especially when many conditions are to be analyzed and compared. Heat-maps have proven very useful for the representation of biological data. Here we present Gitools (http://www.gitools.org), an open-source tool to perform analyses and visualize data and results as interactive heat-maps. Gitools contains data import systems from several sources (i.e. IntOGen, Biomart, KEGG, Gene Ontology), which facilitate the integration of novel data with previous knowledge.
Recent advances in DNA sequencing technology and their focal role in Genome Wide Association Studies (GWAS) have rekindled a growing interest in the whole-genome sequence assembly (WGSA) problem, thereby inundating the field with a plethora of new formalizations, algorithms, heuristics and implementations. And yet, scant attention has been paid to comparative assessments of these assemblers' quality and accuracy. No commonly accepted and standardized method for comparison exists yet. Even worse, widely used metrics to compare the assembled sequences emphasize only size, poorly capturing the contig quality and accuracy. This paper addresses these concerns: it highlights common anomalies in assembly accuracy through a rigorous study of several assemblers, compared under both standard metrics (N50, coverage, contig sizes, etc.) and a more comprehensive metric (Feature-Response Curves, FRC) that is introduced here; FRC transparently captures the trade-offs between contigs' quality and their sizes. For this purpose, most of the publicly available major sequence assemblers – both for low-coverage long (Sanger) and high-coverage short (Illumina) read technologies – are compared. These assemblers are applied to microbial (Escherichia coli, Brucella, Wolbachia, Staphylococcus, Helicobacter) and partial human genome sequences (Chr. Y), using sequence reads of various read-lengths, coverages, accuracies, and with and without mate-pairs. It is hoped that, based on these evaluations, computational biologists will identify innovative sequence assembly paradigms, bioinformaticists will determine promising approaches for developing “next-generation” assemblers, and biotechnologists will formulate more meaningful design desiderata for sequencing technology platforms. A new software tool for computing the FRC metric has been developed and is available through the AMOS open-source consortium.
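To show why a size-only metric says nothing about accuracy, here is a minimal sketch of the standard N50 computation: the largest contig length L such that contigs of length at least L together cover at least half of the total assembly. The function name and the example lengths are illustrative; this is not the FRC tool distributed through AMOS.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths."""
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    half_total = sum(contig_lengths) / 2
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half_total:
            return length

# Hypothetical assembly with contigs of 2 to 10 kb (lengths in kb).
example_n50 = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])
```

The function only sees lengths, so an assembler that wrongly joins two contigs raises its N50 even though the result is less correct, which is the kind of anomaly the abstract argues standard metrics fail to capture.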
In the Cancer Genome Atlas (TCGA) project, gene expression of the same set of samples is measured multiple times on different microarray platforms. There are two main advantages to combining these measurements. First, we have the opportunity to obtain a more precise and accurate estimate of expression levels than using the individual platforms alone. Second, the combined measure simplifies downstream analysis by eliminating the need to work with three sets of expression measures and to consolidate results from the three platforms.
We propose to use factor analysis (FA) to obtain a unified gene expression measure (UE) from multiple platforms. The UE is a weighted average of the three platforms, and is shown to perform well in terms of accuracy and precision. In addition, the FA model produces parameter estimates that allow the assessment of the model fit.
The R code is provided in File S2. Gene-level FA measurements for the TCGA data sets are available from http://tcga-data.nci.nih.gov/docs/publications/unified_expression/.
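As an illustration of how a one-factor model can yield a weighted average of three platforms, here is a minimal Python sketch. It is not the authors' R code from File S2, and the estimator is an assumption: with exactly three platforms the one-factor model is just-identified, so the loadings are solved in closed form from the sample covariances and the factor score is computed with Bartlett weights. The function names `one_factor_fit` and `unified_expression` are hypothetical, and all pairwise covariances are assumed to be positive.

```python
import numpy as np

def one_factor_fit(x):
    """Fit x_j = mu_j + lambda_j * f + e_j for one gene on three platforms.

    x has shape (n_samples, 3). Returns (mu, loadings, noise_variances).
    Assumes the latent factor f has unit variance.
    """
    x = np.asarray(x, dtype=float)
    if x.ndim != 2 or x.shape[1] != 3:
        raise ValueError("expected shape (n_samples, 3)")
    mu = x.mean(axis=0)
    s = np.cov(x, rowvar=False)
    # Under the model, cov(x_i, x_j) = lambda_i * lambda_j for i != j,
    # so each squared loading follows from the three off-diagonal covariances.
    lam_sq = np.array([
        s[0, 1] * s[0, 2] / s[1, 2],
        s[0, 1] * s[1, 2] / s[0, 2],
        s[0, 2] * s[1, 2] / s[0, 1],
    ])
    if np.any(lam_sq <= 0):
        raise ValueError("covariances are inconsistent with positive loadings")
    lam = np.sqrt(lam_sq)
    psi = np.diag(s) - lam_sq  # platform-specific noise variances
    if np.any(psi <= 0):
        raise ValueError("non-positive noise variance estimate")
    return mu, lam, psi

def unified_expression(x, mu, lam, psi):
    """Bartlett factor score: a precision-weighted average across platforms."""
    w = lam / psi
    return ((np.asarray(x, dtype=float) - mu) @ w) / (lam @ w)
```

In this sketch a platform with a high loading and low noise variance receives more weight, and the estimated noise variances give a simple per-platform check on how well the model fits.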
The annotation of genomes from next-generation sequencing platforms needs to
be rapid, high-throughput, and fully integrated and automated. Although a
few Web-based annotation services have recently become available, they may
not be the best solution for researchers who need to annotate a large
number of genomes, possibly including proprietary data, and store them
locally for further analysis. To address this need, we developed a
standalone software application, the Annotation of microbial Genome
Sequences (AGeS) system, which incorporates publicly available and
in-house-developed bioinformatics tools and databases, many of which are
parallelized for high-throughput performance.
The AGeS system supports three main capabilities. The first is the storage of
input contig sequences and the resulting annotation data in a central,
customized database. The second is the annotation of microbial genomes using
an integrated software pipeline, which first analyzes contigs from
high-throughput sequencing by locating genomic regions that code for
proteins, RNA, and other genomic elements through the Do-It-Yourself
Annotation (DIYA) framework. The identified protein-coding regions are then
functionally annotated using the in-house-developed Pipeline for Protein
Annotation (PIPA). The third capability is the visualization of annotated
sequences using GBrowse. To date, we have implemented these capabilities for
bacterial genomes. AGeS was evaluated by comparing its genome annotations
with those provided by three other methods. Our results indicate that the
software tools integrated into AGeS provide annotations that are in general
agreement with those provided by the compared methods. This is demonstrated
by a >94% overlap in the number of identified genes, a significant
number of identical annotated features, and a >90% agreement in
enzyme function predictions.
Copy number alterations are important contributors to many genetic diseases, including cancer. We present the readDepth package for R, which can detect these aberrations by measuring the depth of coverage obtained by massively parallel sequencing of the genome. In addition to achieving higher accuracy than existing packages, our tool runs much faster by utilizing multi-core architectures to parallelize the processing of these large data sets. In contrast to other published methods, readDepth does not require the sequencing of a reference sample, and uses a robust statistical model that accounts for overdispersed data. It includes a method for effectively increasing the resolution obtained from low-coverage experiments by utilizing breakpoint information from paired-end sequencing for positional refinement. We also demonstrate a method for inferring copy number using reads generated by whole-genome bisulfite sequencing, thus enabling integrative study of epigenomic and copy number alterations. Finally, we apply this tool to two genomes, showing that it performs well on genomes sequenced to both low and high coverage. The readDepth package runs on Linux and Mac OS X, is released under the Apache 2.0 license, and is available at http://code.google.com/p/readdepth/.
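The general idea of inferring copy number from depth of coverage can be sketched in a few lines of Python. This is not the readDepth algorithm, which is an R package with its own statistical model. It is a simplified illustration that assumes fixed-size bins, median normalization without a reference sample, and a diploid baseline. The function names and the `ploidy` parameter are hypothetical.

```python
import numpy as np

def binned_copy_number(read_starts, chrom_length, bin_size, ploidy=2):
    """Count read start positions per bin and scale counts to copy number.

    The median bin count is taken to represent the baseline ploidy, so no
    reference sample is needed. The last bin may extend past chrom_length.
    """
    edges = np.arange(0, chrom_length + bin_size, bin_size)
    counts, _ = np.histogram(read_starts, bins=edges)
    median = np.median(counts)
    if median == 0:
        raise ValueError("coverage too low for this bin size")
    return counts, ploidy * counts / median

def dispersion_index(counts):
    """Variance-to-mean ratio: about 1 for Poisson counts, above 1 if overdispersed."""
    counts = np.asarray(counts, dtype=float)
    return np.var(counts, ddof=1) / np.mean(counts)
```

A dispersion index well above 1 in copy-neutral regions would suggest that a plain Poisson model understates the variance of the counts, which is the situation an overdispersion-aware model is meant to handle.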
Analysis of the mechanisms underlying pluripotency and reprogramming would benefit substantially from easy access to an electronic network of genes, proteins and mechanisms. Moreover, interpreting gene expression data needs to move beyond just the identification of the up-/downregulation of key genes and of overrepresented processes and pathways, towards clarifying the essential effects of the experiment in molecular terms.
We have assembled a network of 574 molecular interactions, stimulations and inhibitions, based on research data from 177 publications that appeared up to June 2010, involving 274 mouse genes/proteins, all in a standard electronic format, enabling analyses by readily available software such as Cytoscape and its plugins. The network includes the core circuit of Oct4 (Pou5f1), Sox2 and Nanog, its periphery (such as Stat3, Klf4, Esrrb, and c-Myc), connections to upstream signaling pathways (such as Activin, WNT, FGF, BMP, Insulin, Notch and LIF), and epigenetic regulators as well as some other relevant genes/proteins, such as proteins involved in nuclear import/export. We describe the general properties of the network, as well as a Gene Ontology analysis of the genes included. We use several expression data sets to condense the network to a set of network links that are affected in the course of an experiment, yielding hypotheses about the underlying mechanisms.
We have initiated an electronic data repository that will be useful for understanding pluripotency and for facilitating the interpretation of high-throughput data. To keep up with the growth of knowledge on the fundamental processes of pluripotency and reprogramming, we suggest combining Wiki and social networking software into a community curation system that is easy to use, flexible, and tailored to benefit the scientist and to improve communication and exchange of research results. A PluriNetWork tutorial is available at http://www.ibima.med.uni-rostock.de/IBIMA/PluriNetWork/.
Eukaryotic transcription is accompanied by combinatorial chromatin modifications that serve as functional epigenetic markers. The composition of chromatin modifications specifies histone codes that regulate the associated gene. Discovering novel chromatin regulatory relationships is of general interest.
Based on the hypothesis that interactions among chromatin modifications influence CpG methylation, we present a closeness measure to characterize the regulatory interactions of epigenomic features. The closeness measure is applied to genome-wide CpG methylation and histone modification datasets in human CD4+ T cells to select a subset of potential features. To uncover epigenomic and genomic patterns, CpG loci are clustered into nine modules associated with distinct chromatin and genomic signatures based on terms of biological function. We then performed Bayesian network inference to uncover inherent regulatory relationships from the feature-selected closeness measure profile and from each of the nine module-specific profiles. The global and module-specific networks exhibit topological proximity and modularity. We found that the regulatory patterns of chromatin modifications differ significantly across modules and that distinct patterns are related to specific transcriptional levels and biological function. DNA methylation and genomic features are found to have little regulatory function. The regulatory relationships were partly validated by literature reviews. We also used partial correlation analysis in other cells to verify novel regulatory relationships.
The interactions among chromatin modifications and genomic elements characterized by a closeness measure help elucidate cooperative patterns of chromatin modification in transcriptional regulation and help decipher complex histone codes.
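The partial correlation analysis mentioned above can be illustrated with a short Python sketch. It uses the standard identity that partial correlations, each conditioned on all remaining variables, are obtained by rescaling the inverse covariance (precision) matrix. How the authors configured their own analysis is not stated in the text, so the function below is a generic illustration, and the name `partial_correlations` is hypothetical.

```python
import numpy as np

def partial_correlations(data):
    """Partial correlation of each column pair, given all other columns.

    data has shape (n_samples, n_features), for example one column per
    chromatin modification. Requires an invertible sample covariance matrix.
    """
    data = np.asarray(data, dtype=float)
    precision = np.linalg.inv(np.cov(data, rowvar=False))
    d = np.sqrt(np.diag(precision))
    # rho_ij = -P_ij / sqrt(P_ii * P_jj)
    pcorr = -precision / np.outer(d, d)
    np.fill_diagonal(pcorr, 1.0)
    return pcorr
```

Two features that correlate only through a shared third feature will show a partial correlation near zero, which is why the method helps separate direct from indirect relationships.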
Considerable effort has gone into the search for genes responsible for type 2 diabetes (T2D), but only about 20 have been identified in humans, owing to the complexity and heterogeneity of the disease. The GK rat, a spontaneous T2D model, offers an excellent opportunity to search for additional diabetes genes. Utilizing array comparative genome hybridization (aCGH) technology, we identified 137 non-redundant copy number variation (CNV) regions in GK rats, using normal Wistar rats as controls. These CNV regions (CNVRs) covered approximately 36 Mb, accounting for about 1% of the whole genome. By integrating information from gene annotations and disease knowledge, we investigated the CNVRs comprehensively to mine new T2D genes. As a result, we prioritized 16 putative protein-coding genes and two microRNA genes (rno-mir-30b and rno-mir-30d) as good candidates. The catalogue of CNVRs between GK and Wistar rats identified in this work serves as a repository for mining genes that might play roles in the pathogenesis of T2D. Moreover, our use of bioinformatics methods to prioritize candidate genes provides a more specific set of putative candidates. These findings should contribute to research into the genetic basis of T2D, and thus shed light on its pathogenesis.
Transcription is affected by nucleosomal resistance against polymerase passage. In turn, nucleosomal resistance is determined by DNA sequence, histone chaperones and remodeling enzymes. The contributions of these factors are widely debated: one recent title claims “… DNA-encoded nucleosome organization…” while another title states that “histone-DNA interactions are not the major determinant of nucleosome positions.” These opposing conclusions were drawn from similar experiments analyzed by idealized methods. We attempt to resolve this controversy to reveal nucleosomal competency for transcription.
To this end, we analyzed 26 in vivo, nonlinked, and in vitro genome-wide nucleosome maps/replicates by new, rigorous methods. Individual H2A nucleosomes are reconstituted inaccurately by transcription, chaperones and remodeling enzymes. At gene centers, weakly positioned nucleosome arrays facilitate rapid histone eviction and remodeling, easing polymerase passage. Fuzzy positioning is not due to artefacts. At the regional level, transcriptional competency is strongly influenced by intrinsic histone-DNA affinities. This is confirmed by reproducing the high in vivo occupancy of translated regions and the low occupancy of intergenic regions in reconstitutions from purified DNA and histones. Regional level occupancy patterns are protected from invading histones by nucleosome excluding sequences and barrier nucleosomes at gene boundaries and within genes.
Dense arrays of weakly positioned nucleosomes appear to be necessary for transcription. Weak positioning at exons facilitates temporary remodeling, polymerase passage and hence the competency for transcription. At regional levels, the DNA sequence plays a major role in determining these features but positions of individual nucleosomes are typically modified by transcription, chaperones and enzymes. This competency is reduced at intergenic regions by sequence features, barrier nucleosomes, and proteins, preventing accessibility regulation of untargeted genes. This combination of DNA- and protein-influenced positioning regulates DNA accessibility and competence for regulatory protein binding and transcription. Interactive nucleosome displays are offered at http://chromatin.unl.edu/cgi-bin/skyline.cgi.