Results 1-23 (23)
1.  LOcating Non-Unique matched Tags (LONUT) to Improve the Detection of the Enriched Regions for ChIP-seq Data 
PLoS ONE  2013;8(6):e67788.
One big limitation of computational tools for analyzing ChIP-seq data is that most of them ignore non-unique tags (NUTs) that match the human genome even though NUTs comprise up to 60% of all raw tags in ChIP-seq data. Effectively utilizing these NUTs would increase the sequencing depth and allow a more accurate detection of enriched binding sites, which in turn could lead to more precise and significant biological interpretations. In this study, we have developed a computational tool, LOcating Non-Unique matched Tags (LONUT), to improve the detection of enriched regions from ChIP-seq data. Our LONUT algorithm applies a linear and polynomial regression model to establish an empirical score (ES) formula by considering two influential factors, the distance of NUTs to peaks identified using uniquely matched tags (UMTs) and the enrichment score for those peaks resulting in each NUT being assigned to a unique location on the reference genome. The newly located tags from the set of NUTs are combined with the original UMTs to produce a final set of combined matched tags (CMTs). LONUT was tested on many different datasets representing three different characteristics of biological data types. The detected sites were validated using de novo motif discovery and ChIP-PCR. We demonstrate the specificity and accuracy of LONUT and show that our program not only improves the detection of binding sites for ChIP-seq, but also identifies additional binding sites.
doi:10.1371/journal.pone.0067788
PMCID: PMC3692479  PMID: 23825685
2.  The Majority of Primate-Specific Regulatory Sequences Are Derived from Transposable Elements 
PLoS Genetics  2013;9(5):e1003504.
Although emerging evidence suggests that transposable elements (TEs) have contributed novel regulatory elements to the human genome, their global impact on transcriptional networks remains largely uncharacterized. Here we show that TEs have contributed to the human genome nearly half of its active elements. Using DNase I hypersensitivity data sets from ENCODE in normal, embryonic, and cancer cells, we found that 44% of open chromatin regions were in TEs and that this proportion reached 63% for primate-specific regions. We also showed that distinct subfamilies of endogenous retroviruses (ERVs) contributed significantly more accessible regions than expected by chance, with up to 80% of their instances in open chromatin. Based on these results, we further characterized 2,150 TE subfamily–transcription factor pairs that were bound in vivo or enriched for specific binding motifs, and observed that TEs contributing to open chromatin had higher levels of sequence conservation. We also showed that thousands of ERV–derived sequences were activated in a cell type–specific manner, especially in embryonic and cancer cells, and we demonstrated that this activity was associated with cell type–specific expression of neighboring genes. Taken together, these results demonstrate that TEs, and in particular ERVs, have contributed hundreds of thousands of novel regulatory elements to the primate lineage and reshaped the human transcriptional landscape.
Author Summary
Nearly half of the human genome is composed of repetitive sequences, most of which were derived from transposable elements that have replicated in the genome during the evolution of our species. There is growing evidence showing that some of these transposon-derived sequences have been a source of new binding sites for various mammalian transcription factors. Considering that previous studies were targeting only a few transcription factors and cell types, a key question that remains is to what extent the transposable elements have contributed to human transcriptional networks. To systematically survey this contribution, we used datasets generated by the international Encyclopedia of DNA Elements (ENCODE) consortium, identifying the location of active regulatory elements in more than 40 distinct human cell types. Using this resource we measured the contribution of all classes of repetitive sequences and systematically characterized the impact that transposable elements have had on the human chromatin landscape. Our results demonstrate that transposon-derived sequences have contributed hundreds of thousands of novel regulatory elements to the primate lineage and reshaped the human transcriptional landscape.
doi:10.1371/journal.pgen.1003504
PMCID: PMC3649963  PMID: 23675311
3.  Transposable Elements Are Major Contributors to the Origin, Diversification, and Regulation of Vertebrate Long Noncoding RNAs 
PLoS Genetics  2013;9(4):e1003470.
Advances in vertebrate genomics have uncovered thousands of loci encoding long noncoding RNAs (lncRNAs). While progress has been made in elucidating the regulatory functions of lncRNAs, little is known about their origins and evolution. Here we explore the contribution of transposable elements (TEs) to the makeup and regulation of lncRNAs in human, mouse, and zebrafish. Surprisingly, TEs occur in more than two thirds of mature lncRNA transcripts and account for a substantial portion of total lncRNA sequence (∼30% in human), whereas they seldom occur in protein-coding transcripts. While TEs contribute less to lncRNA exons than expected, several TE families are strongly enriched in lncRNAs. There is also substantial interspecific variation in the coverage and types of TEs embedded in lncRNAs, partially reflecting differences in the TE landscapes of the genomes surveyed. In human, TE sequences in lncRNAs evolve under greater evolutionary constraint than their non–TE sequences, than their intronic TEs, or than random DNA. Consistent with functional constraint, we found that TEs contribute signals essential for the biogenesis of many lncRNAs, including ∼30,000 unique sites for transcription initiation, splicing, or polyadenylation in human. In addition, we identified ∼35,000 TEs marked as open chromatin located within 10 kb upstream of lncRNA genes. The density of these marks in one cell type correlates with elevated expression of the downstream lncRNA in the same cell type, suggesting that these TEs contribute to cis-regulation. These global trends are recapitulated in several lncRNAs with established functions. Finally, a subset of TEs embedded in lncRNAs is subject to RNA editing and predicted to form secondary structures likely important for function. In conclusion, TEs are nearly ubiquitous in lncRNAs and have played an important role in the lineage-specific diversification of vertebrate lncRNA repertoires.
Author Summary
An unexpected layer of complexity in the genomes of humans and other vertebrates lies in the abundance of genes that do not appear to encode proteins but produce a variety of non-coding RNAs. In particular, the human genome is currently predicted to contain 5,000–10,000 independent gene units generating long (>200 nucleotides) noncoding RNAs (lncRNAs). While there is growing evidence that a large fraction of these lncRNAs have cellular functions, notably to regulate protein-coding gene expression, almost nothing is known about the processes underlying the evolutionary origins and diversification of lncRNA genes. Here we show that transposable elements, through their capacity to move and spread in genomes in a lineage-specific fashion, as well as their ability to introduce regulatory sequences upon chromosomal insertion, represent a major force shaping the lncRNA repertoire of humans, mice, and zebrafish. Not only do TEs make up a substantial fraction of mature lncRNA transcripts, they are also enriched in the vicinity of lncRNA genes, where they frequently contribute to their transcriptional regulation. Through specific examples we provide evidence that some TE sequences embedded in lncRNAs are critical for the biogenesis of lncRNAs and likely important for their function.
doi:10.1371/journal.pgen.1003470
PMCID: PMC3636048  PMID: 23637635
4.  Whole-genome reconstruction and mutational signatures in gastric cancer 
Genome Biology  2012;13(12):R115.
Background
Gastric cancer is the second highest cause of global cancer mortality. To explore the complete repertoire of somatic alterations in gastric cancer, we combined massively parallel short read and DNA paired-end tag sequencing to present the first whole-genome analysis of two gastric adenocarcinomas, one with chromosomal instability and the other with microsatellite instability.
Results
Integrative analysis and de novo assemblies revealed the architecture of a wild-type KRAS amplification, a common driver event in gastric cancer. We discovered three distinct mutational signatures in gastric cancer - against a genome-wide backdrop of oxidative and microsatellite instability-related mutational signatures, we identified the first exome-specific mutational signature. Further characterization of the impact of these signatures by combining sequencing data from 40 complete gastric cancer exomes and targeted screening of an additional 94 independent gastric tumors uncovered ACVR2A, RPL22 and LMAN1 as recurrently mutated genes in microsatellite instability-positive gastric cancer and PAPPA as a recurrently mutated gene in TP53 wild-type gastric cancer.
Conclusions
These results highlight how whole-genome cancer sequencing can uncover information relevant to tissue-specific carcinogenesis that would otherwise be missed from exome-sequencing data.
doi:10.1186/gb-2012-13-12-r115
PMCID: PMC4056366  PMID: 23237666
5.  PPARG Binding Landscapes in Macrophages Suggest a Genome-Wide Contribution of PU.1 to Divergent PPARG Binding in Human and Mouse 
PLoS ONE  2012;7(10):e48102.
Background
Genome-wide comparisons of transcription factor binding sites in different species can be used to evaluate evolutionary constraints that shape gene regulatory circuits and to understand how the interaction between transcription factors shapes their binding landscapes over evolution.
Results
We have compared the PPARG binding landscapes in macrophages to investigate the evolutionary impact on PPARG binding diversity in mouse and humans for this important nuclear receptor. Of note, only 5% of the PPARG binding sites were shared between the two species. In contrast, at the gene level, PPARG target genes conserved between both species constitute more than 30% of the target genes regulated by PPARG ligand in human macrophages. Moreover, the majority of all PPARG binding sites (55–60%) in macrophages show co-occupancy of the lineage-specification factor PU.1 in both species. Exploring the evolutionary dynamics of PPARG binding sites, we observed that PU.1 co-binding to PPARG sites appears to be important for possible PPARG ancestral functions such as lipid metabolism. Thus we speculate that PU.1 may have guided utilization of these species-specific PPARG conserved binding sites in macrophages during evolution.
Conclusions
We propose a model in which PU.1 sites may have served as “anchor” loci for the formation of new and functionally relevant PPARG binding sites throughout evolution. As PU.1 is an essential factor in macrophage biology, such an evolutionary mechanism would allow for the establishment of relevant PPARG regulatory modules in a PU.1-dependent manner and yet permit for nuanced regulatory changes in individual species.
doi:10.1371/journal.pone.0048102
PMCID: PMC3485280  PMID: 23118933
6.  Long Span DNA Paired-End-Tag (DNA-PET) Sequencing Strategy for the Interrogation of Genomic Structural Mutations and Fusion-Point-Guided Reconstruction of Amplicons 
PLoS ONE  2012;7(9):e46152.
Structural variations (SVs) contribute significantly to the variability of the human genome and extensive genomic rearrangements are a hallmark of cancer. While genomic DNA paired-end-tag (DNA-PET) sequencing is an attractive approach to identify genomic SVs, the current application of PET sequencing with short insert size DNA can be insufficient for the comprehensive mapping of SVs in low complexity and repeat-rich genomic regions. We employed a recently developed procedure to generate PET sequencing data using large DNA inserts of 10–20 kb and compared their characteristics with short insert (1 kb) libraries for their ability to identify SVs. Our results suggest that although short insert libraries bear an advantage in identifying small deletions, they do not provide significantly better breakpoint resolution. In contrast, large inserts are superior to short inserts in providing higher physical genome coverage for the same sequencing cost and achieve greater sensitivity, in practice, for the identification of several classes of SVs, such as copy number neutral and complex events. Furthermore, our results confirm that large insert libraries allow for the identification of SVs within repetitive sequences, which cannot be spanned by short inserts. This provides a key advantage in studying rearrangements in cancer, and we show how it can be used in a fusion-point-guided-concatenation algorithm to study focally amplified regions in cancer.
doi:10.1371/journal.pone.0046152
PMCID: PMC3461012  PMID: 23029419
7.  CTCF-Mediated Functional Chromatin Interactome in Pluripotent Cells 
Nature genetics  2011;43(7):630-638.
Mammalian genomes are viewed as functional organizations that orchestrate spatial and temporal gene regulation. CTCF, the most characterized insulator-binding protein, has been implicated as a key genome organizer. Yet, little is known about CTCF-associated higher order chromatin structures at a global scale. Here, we applied Chromatin Interaction Analysis by Paired-End-Tag sequencing to elucidate the CTCF-chromatin interactome in pluripotent cells. From this analysis, 1,480 cis and 336 trans interacting loci were identified with high reproducibility and precision. Associating these chromatin interaction loci with their underlying epigenetic states, promoter activities, enhancer binding and nuclear lamina occupancy, we uncovered five distinct chromatin domains that suggest potential new models of CTCF function in chromatin organization and transcriptional control. Specifically, CTCF interactions demarcate chromatin-nuclear membrane attachments and influence proper gene expression through extensive crosstalk between promoters and regulatory elements. This highly complex nuclear organization offers insights towards the unifying principles governing genome plasticity and function.
doi:10.1038/ng.857
PMCID: PMC3436933  PMID: 21685913
insulator; enhancer; chromatin organization; epigenetic regulation; nuclear lamina
8.  K27M mutation in histone H3.3 defines clinically and biologically distinct subgroups of pediatric diffuse intrinsic pontine gliomas 
Acta Neuropathologica  2012;124(3):439-447.
Pediatric glioblastomas (GBM) including diffuse intrinsic pontine gliomas (DIPG) are devastating brain tumors with no effective therapy. Here, we investigated clinical and biological impacts of histone H3.3 mutations. Forty-two DIPGs were tested for H3.3 mutations. Wild-type versus mutated (K27M-H3.3) subgroups were compared for HIST1H3B, IDH, ATRX and TP53 mutations, copy number alterations and clinical outcome. K27M-H3.3 occurred in 71 %, TP53 mutations in 77 % and ATRX mutations in 9 % of DIPGs. ATRX mutations were more frequent in older children (p < 0.0001). No G34V/R-H3.3, IDH1/2 or H3.1 mutations were identified. K27M-H3.3 DIPGs showed specific copy number changes, including all gains/amplifications of PDGFRA and MYC/PVT1 loci. Notably, all long-term survivors were H3.3 wild type and this group of patients had better overall survival. K27M-H3.3 mutation defines clinically and biologically distinct subgroups and is prevalent in DIPG, which will impact future therapeutic trial design. K27M- and G34V-H3.3 have location-based incidence (brainstem/cortex) and potentially play distinct roles in pediatric GBM pathogenesis. K27M-H3.3 is universally associated with short survival in DIPG, while patients wild-type for H3.3 show improved survival. Based on prognostic and therapeutic implications, our findings argue for H3.3-mutation testing at diagnosis, which should be rapidly integrated into the clinical decision-making algorithm, particularly in atypical DIPG.
doi:10.1007/s00401-012-0998-0
PMCID: PMC3422615  PMID: 22661320
DIPG; H3.3; ATRX; TP53; Survival; Targeted therapy
9.  Conserved and non-conserved enhancers direct tissue specific transcription in ancient germ layer specific developmental control genes 
Background
Identifying DNA sequences (enhancers) that direct the precise spatial and temporal expression of developmental control genes remains a significant challenge in the annotation of vertebrate genomes. Locating these sequences, which in many cases lie at a great distance from the transcription start site, has been a major obstacle in deciphering gene regulation. Coupling comparative genomics with functional validation has been a successful method for locating many such regulatory elements. But most of these studies looked either at a single gene only or at the whole genome without focusing on any particular process. The pressing need is to integrate the tools of comparative genomics with knowledge of developmental biology to validate enhancers for developmental transcription factors in greater detail.
Results
Our results show that near four different genes (nkx3.2, pax9, otx1b and foxa2) in zebrafish, only 20-30% of highly conserved DNA sequences can act as developmental enhancers, irrespective of the tissue in which the gene is expressed. We find that some genes also have multiple conserved enhancers expressing in the same tissue at the same or different time points in development. We also located non-conserved enhancers for two of the genes (pax9 and otx1b). Our modified bacterial artificial chromosome (BAC) studies for these four genes revealed that many of these enhancers work in a synergistic fashion, which cannot be captured by individual DNA constructs, and are not conserved at the sequence level. Our detailed biochemical and transgenic analysis revealed that Foxa1 binds to the otx1b non-conserved enhancer to direct its activity in the forebrain and otic vesicle of zebrafish at 24 hpf.
Conclusion
Our results clearly indicate that a high level of functional conservation of genes is not necessarily associated with sequence conservation of their regulatory elements. Moreover, certain non-conserved DNA elements might have a role in gene regulation. The need is to bring multiple approaches to bear upon individual genes to decipher all their regulatory elements.
doi:10.1186/1471-213X-11-63
PMCID: PMC3210094  PMID: 22011226
10.  CpG Deamination Creates Transcription Factor–Binding Sites with High Efficiency 
Genome Biology and Evolution  2011;3:1304-1311.
The formation of new transcription factor–binding sites (TFBSs) has a major impact on the evolution of gene regulatory networks. Clearly, single nucleotide mutations arising within genomic DNA can lead to the creation of TFBSs. Are molecular processes inducing single nucleotide mutations contributing equally to the creation of TFBSs? In the human genome, a spontaneous deamination of methylated cytosine in the context of CpG dinucleotides results in the creation of thymine (C → T), and this mutation has the highest rate among all base substitutions. CpG deamination has been ascribed a role in silencing of transposons and induction of variation in regional methylation. We have previously shown that CpG deamination created thousands of p53-binding sites within genomic sequences of Alu transposons. Interestingly, we have defined a ∼30 bp region in Alu sequence, which, depending on a pattern of CpG deamination, can be converted to functional p53-, PAX-6-, and Myc-binding sites. Here, we have studied single nucleotide mutational events leading to creation of TFBSs in promoters of human genes and in genomic regions bound by such key transcription factors as Oct4, NANOG, and c-Myc. We document that CpG deamination events can create TFBSs with much higher efficiency than other types of mutational events. Our findings add a new role to CpG methylation: We propose that deamination of methylated CpGs constitutes one of the evolutionary forces acting on mutational trajectories of TFBSs formation contributing to variability in gene regulation.
doi:10.1093/gbe/evr107
PMCID: PMC3228489  PMID: 22016335
CpG methylation; CpG deamination; evolution of transcription factor–binding sites; evolution of gene regulatory elements; Alu transposon
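The mutational mechanism described in this abstract can be illustrated with a short sketch. It assumes the simplest possible model: every CpG is treated as methylated, a deamination on the forward strand turns CG into TG, and a deamination on the reverse strand appears on the forward strand as CG to CA. The function names and the example sequence are hypothetical, and the E-box CACGTG serves only as an illustrative motif; none of this is taken from the paper's own pipeline.

```python
def deamination_products(seq):
    """Yield (position, strand, mutated_sequence) for each single CpG deamination."""
    seq = seq.upper()
    for i in range(len(seq) - 1):
        if seq[i:i + 2] == "CG":
            # Forward-strand C -> T: the dinucleotide CG becomes TG.
            yield i, "+", seq[:i] + "T" + seq[i + 1:]
            # Reverse-strand C -> T reads as G -> A on the forward strand: CG becomes CA.
            yield i, "-", seq[:i + 1] + "A" + seq[i + 2:]


def created_sites(seq, motif):
    """Return the deamination events that create a motif absent from the original sequence."""
    seq, motif = seq.upper(), motif.upper()
    if motif in seq:
        return []  # the motif already exists, so nothing is newly created
    return [event for event in deamination_products(seq) if motif in event[2]]


# Hypothetical input: one reverse-strand deamination at position 0 creates the motif.
print(created_sites("CGCGTG", "CACGTG"))  # [(0, '-', 'CACGTG')]
```

The sketch matches the motif as an exact string on the forward strand only, which is sufficient for a palindromic motif such as CACGTG; a realistic scan would use position weight matrices and both strands.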
11.  Molecular conservation of estrogen-response associated with cell cycle regulation, hormonal carcinogenesis and cancer in zebrafish and human cancer cell lines 
BMC Medical Genomics  2011;4:41.
Background
The zebrafish is recognized as a versatile cancer and drug screening model. However, it is not known whether the estrogen-responsive genes and signaling pathways that are involved in estrogen-dependent carcinogenesis and human cancer are operating in zebrafish. In order to determine the potential of zebrafish model for estrogen-related cancer research, we investigated the molecular conservation of estrogen responses operating in both zebrafish and human cancer cell lines.
Methods
Microarray experiment was performed on zebrafish exposed to estrogen (17β-estradiol; a classified carcinogen) and an anti-estrogen (ICI 182,780). Zebrafish estrogen-responsive genes sensitive to both estrogen and anti-estrogen were identified and validated using real-time PCR. Human homolog mapping and knowledge-based data mining were performed on zebrafish estrogen responsive genes followed by estrogen receptor binding site analysis and comparative transcriptome analysis with estrogen-responsive human cancer cell lines (MCF7, T47D and Ishikawa).
Results
Our transcriptome analysis captured multiple estrogen-responsive genes and signaling pathways that increased cell proliferation, promoted DNA damage and genome instability, and decreased tumor suppressing effects, suggesting a common mechanism for estrogen-induced carcinogenesis. Comparative analysis revealed a core set of conserved estrogen-responsive genes that demonstrate enrichment of estrogen receptor binding sites and cell cycle signaling pathways. Knowledge-based and network analysis led us to propose that the mechanism involving estrogen-activated estrogen receptor mediated down-regulation of human homolog HES1 followed by up-regulation of cell cycle-related genes (human homologs E2F4, CDK2, CCNA, CCNB, CCNE), is highly conserved, and this mechanism may involve novel crosstalk with basal AHR. We also identified mitotic roles of polo-like kinase as a conserved signaling pathway with multiple entry points for estrogen regulation.
Conclusion
The findings demonstrate the use of zebrafish for characterizing estrogen-like environmental carcinogens and anti-estrogen drug screening. From an evolutionary perspective, our findings suggest that estrogen regulation of cell cycle is perhaps one of the earliest forms of steroidal-receptor controlled cellular processes. Our study provides first evidence of molecular conservation of estrogen-responsiveness between zebrafish and human cancer cell lines, hence demonstrating the potential of zebrafish for estrogen-related cancer research.
doi:10.1186/1755-8794-4-41
PMCID: PMC3114699  PMID: 21575170
zebrafish; microarray; estrogen; anti-estrogen ICI 182,780; estrogen-responsive genes; signaling pathways; carcinogenesis; human cancer cell lines; molecular conservation; model organism
12.  In silico tandem affinity purification refines an Oct4 interaction list 
Introduction
Octamer-binding transcription factor 4 (Oct4) is a master regulator of early mammalian development. Its expression begins from the oocyte stage, becomes restricted to the inner cell mass of the blastocyst and eventually remains only in primordial germ cells. Unearthing the interactions of Oct4 would provide insight into how this transcription factor is central to cell fate and stem cell pluripotency.
Methods
In the present study, affinity-tagged endogenous Oct4 cell lines were established via homologous recombination gene targeting in embryonic stem (ES) cells to express tagged Oct4. This allows tagged Oct4 to be expressed without altering the total Oct4 levels from their physiological levels.
Results
Modified ES cells remained pluripotent. However, when modified ES cells were tested for their functionality, cells with a large tag failed to produce viable homozygous mice. Use of a smaller tag resulted in mice with normal development, viability and fertility. This indicated that the choice of tags can affect the performance of Oct4. Also, different tags produce a different repertoire of Oct4 interactors.
Conclusions
Using a total of four different tags, we found 33 potential Oct4 interactors, of which 30 are novel. In addition to transcriptional regulation, the molecular function associated with these Oct4-associated proteins includes various other catalytic activities, suggesting that, aside from chromosome remodeling and transcriptional regulation, Oct4 function extends more widely to other essential cellular mechanisms. Our findings show that multiple purification approaches are needed to uncover a comprehensive Oct4 protein interaction network.
doi:10.1186/scrt67
PMCID: PMC3218817  PMID: 21569470
13.  MER41 Repeat Sequences Contain Inducible STAT1 Binding Sites 
PLoS ONE  2010;5(7):e11425.
Chromatin immunoprecipitation combined with massively parallel sequencing methods (ChIP-seq) is becoming the standard approach to study interactions of transcription factors (TF) with genomic sequences. Using public STAT1 ChIP-seq data sets as an example, we present novel approaches for the interpretation of ChIP-seq data.
We compare recently developed approaches to determine STAT1 binding sites from ChIP-seq data. Assessing the content of the established consensus sequence for STAT1 binding sites, we find that the usage of “negative control” ChIP-seq data fails to provide substantial advantages. We derive a single refined probabilistic model of STAT1 binding sequences from these ChIP-seq data. Contrary to previous claims, we find no evidence that STAT1 binds to multiple distinct motifs upon interferon-gamma stimulation in vivo. While a large majority of genomic sites with high ChIP-seq signal is associated with a nucleotide sequence resembling a STAT1 binding site, only a very small subset of the over 5 million potential STAT1 binding sites in the human genome is covered by ChIP-seq data. Furthermore, a surprisingly large fraction of the ChIP-seq signal (5%) is absorbed by a small family of repetitive sequences (MER41).
The observation of the binding of activated STAT1 protein to a specific repetitive element bolsters similar reports concerning p53 and other TFs, and strengthens the notion of an involvement of repeats in gene regulation. Incidentally, MER41 repeats are specific to primates; consequently, regulatory mechanisms in the IFN-STAT pathway might fundamentally differ between primates and rodents.
From a methodological standpoint, the presence of large numbers of nearly identical binding sites in repetitive sequences may lead to wrong conclusions about the intrinsic binding preferences of TFs, as illustrated by the spacing analysis of STAT1 tandem motifs. Therefore, ChIP-seq data should be analyzed independently within repetitive and non-repetitive sequences.
doi:10.1371/journal.pone.0011425
PMCID: PMC2897888  PMID: 20625510
14.  An Oestrogen Receptor α-bound Human Chromatin Interactome 
Nature  2009;462(7269):58-64.
Genomes are organized into high-level 3-dimensional structures, and DNA elements separated by long genomic distances could functionally interact. Many transcription factors bind to regulatory DNA elements distant from gene promoters. While distal binding sites have been shown to regulate transcription by long-range chromatin interactions at a few loci, chromatin interactions and their impact on transcription regulation have not been investigated in a genome-wide manner. Therefore, we developed Chromatin Interaction Analysis by Paired-End Tag sequencing (ChIA-PET) for de novo detection of global chromatin interactions, and comprehensively mapped the chromatin interaction network bound by oestrogen receptor α (ERα) in the human genome. We found that most high-confidence remote ERα binding sites are anchored at gene promoters through long-range chromatin interactions, suggesting that ERα functions by extensive chromatin looping to bring genes together for coordinated transcriptional regulation. We propose that chromatin interactions constitute a primary mechanism for regulating transcription in mammalian genomes.
doi:10.1038/nature08497
PMCID: PMC2774924  PMID: 19890323
15.  Success in the DREAM3 Signaling Response Challenge Using Simple Weighted-Average Imputation: Lessons for Community-Wide Experiments in Systems Biology 
PLoS ONE  2010;5(1):e8417.
Our group produced the best predictions overall in the DREAM3 signaling response challenge, being tops by a substantial margin in the cytokine sub-challenge and nearly tied for best in the phosphoprotein sub-challenge. We achieved this success using a simple interpolation strategy. For each combination of a stimulus and inhibitor for which predictions were required, we had noted there were six other datasets using the same stimulus (but different inhibitor treatments) and six other datasets using the same inhibitor (but different stimuli). Therefore, for each treatment combination for which values were to be predicted, we calculated rank correlations for the data that were in common between the treatment combination and each of the 12 related combinations. The data from the 12 related combinations were then used to calculate missing values, weighting the contributions from each experiment based on the rank correlation coefficients. The success of this simple method suggests that the missing data were largely over-determined by similarities in the treatments. We offer some thoughts on the current state and future development of DREAM that are based on our success in this challenge, our success in the earlier DREAM2 transcription factor target challenge, and our experience as the data provider for the gene expression challenge in DREAM3.
doi:10.1371/journal.pone.0008417
PMCID: PMC2811179  PMID: 20126276
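The weighted-average imputation described in this abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the dictionary layout, the readout names (p1 to p4), and the choice to clip negative correlations to zero weight are all assumptions made here, since the abstract states only that contributions were weighted by rank correlation coefficients.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            out[order[k]] = shared_rank
        start = end + 1
    return out


def spearman(x, y):
    """Spearman rank correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den if den else 0.0


def impute(target, related, missing_key):
    """Estimate target[missing_key] as a correlation-weighted average over related datasets."""
    weights, values = [], []
    for dataset in related:
        common = [k for k in target if k in dataset]
        if missing_key not in dataset or len(common) < 2:
            continue
        rho = spearman([target[k] for k in common], [dataset[k] for k in common])
        weights.append(max(rho, 0.0))  # assumption: negative correlations get zero weight
        values.append(dataset[missing_key])
    if not values:
        raise ValueError("no related dataset covers the missing readout")
    total = sum(weights)
    if total == 0:
        return mean(values)  # fall back to a plain average when no weight is positive
    return sum(w * v for w, v in zip(weights, values)) / total


# Hypothetical data: the first related dataset tracks the target, the second is anti-correlated.
target = {"p1": 1.0, "p2": 2.0, "p3": 3.0}
related = [
    {"p1": 10, "p2": 20, "p3": 30, "p4": 40},
    {"p1": 3, "p2": 2, "p3": 1, "p4": 100},
]
print(impute(target, related, "p4"))  # 40.0
```

In the setting the abstract describes, `related` would hold the 12 datasets that share either the stimulus or the inhibitor with the treatment combination being predicted.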
16.  Inferring Condition-Specific Modulation of Transcription Factor Activity in Yeast through Regulon-Based Analysis of Genomewide Expression 
PLoS ONE  2008;3(9):e3112.
Background
A key goal of systems biology is to understand how genomewide mRNA expression levels are controlled by transcription factors (TFs) in a condition-specific fashion. TF activity is frequently modulated at the post-translational level through ligand binding, covalent modification, or changes in sub-cellular localization. In this paper, we demonstrate how prior information about regulatory network connectivity can be exploited to infer condition-specific TF activity as a hidden variable from the genomewide mRNA expression pattern in the yeast Saccharomyces cerevisiae.
Methodology/Principal Findings
We first validate experimentally that by scoring differential expression at the level of gene sets or “regulons” comprised of the putative targets of a TF, we can accurately predict modulation of TF activity at the post-translational level. Next, we create an interactive database of inferred activities for a large number of TFs across a large number of experimental conditions in S. cerevisiae. This allows us to perform TF-centric analysis of the yeast regulatory network.
Conclusions/Significance
We analyze the degree to which the mRNA expression level of each TF is predictive of its regulatory activity. We also organize TFs into “co-modulation networks” based on their inferred activity profile across conditions, and find that this reveals functional and mechanistic relationships. Finally, we present evidence that the PAC and rRPE motifs antagonize TBP-dependent regulation, and function as core promoter elements governed by the transcription regulator NC2. Regulon-based monitoring of TF activity modulation is a powerful tool for analyzing regulatory network function that should be applicable in other organisms. Tools and results are available online at http://bussemakerlab.org/RegulonProfiler/.
doi:10.1371/journal.pone.0003112
PMCID: PMC2518834  PMID: 18769540
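Regulon-level scoring, as described in the abstract above, compares the expression changes of a TF's putative targets with those of all other genes. The abstract does not name the statistic used, so the sketch below substitutes a Welch-style t-statistic as an assumption; `regulon_score` and the gene identifiers are hypothetical.

```python
from statistics import mean, variance


def regulon_score(log_ratios, regulon):
    """Score differential expression of a gene set against all other genes.

    log_ratios: dict mapping gene name -> log expression ratio for one condition.
    regulon: set of gene names taken to be putative targets of one TF.
    Returns a Welch-style t-statistic (an assumed choice of statistic).
    """
    inside = [v for g, v in log_ratios.items() if g in regulon]
    outside = [v for g, v in log_ratios.items() if g not in regulon]
    if len(inside) < 2 or len(outside) < 2:
        raise ValueError("need at least two genes inside and outside the regulon")
    se = (variance(inside) / len(inside) + variance(outside) / len(outside)) ** 0.5
    if se == 0:
        raise ValueError("zero variance in both groups; score is undefined")
    return (mean(inside) - mean(outside)) / se


if __name__ == "__main__":
    # Hypothetical log-ratios for six genes in one condition.
    condition = {"g1": 1.0, "g2": 1.2, "g3": 0.8, "g4": 0.0, "g5": 0.1, "g6": -0.1}
    print(regulon_score(condition, {"g1", "g2", "g3"}))
```

A large positive or negative score would suggest that the TF's activity is modulated in that condition, independent of the TF's own mRNA level.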
17.  iTools: A Framework for Classification, Categorization and Integration of Computational Biology Resources 
PLoS ONE  2008;3(5):e2265.
The advancement of the computational biology field hinges on progress in three fundamental directions: the development of new computational algorithms, the availability of informatics resource management infrastructures and the capability of tools to interoperate and synergize. There is an explosion in algorithms and tools for computational biology, which makes it difficult for biologists to find, compare and integrate such resources. We describe a new infrastructure, iTools, for managing the query, traversal and comparison of diverse computational biology resources. Specifically, iTools stores information about three types of resources: data, software tools and web-services. The iTools design, implementation and resource meta-data content reflect the broad research, computational, applied and scientific expertise available at the seven National Centers for Biomedical Computing. iTools provides a system for classification, categorization and integration of different computational biology resources across space-and-time scales, biomedical problems, computational infrastructures and mathematical foundations. A large number of resources are already iTools-accessible to the community and this infrastructure is rapidly growing. iTools includes human and machine interfaces to its resource meta-data repository. Investigators or computer programs may utilize these interfaces to search, compare, expand, revise and mine meta-data descriptions of existent computational biology resources. We propose two ways to browse and display the iTools dynamic collection of resources. The first one is based on an ontology of computational biology resources, and the second one is derived from hyperbolic projections of manifolds or complex structures onto planar discs. iTools is an open source project in terms of both its source code development and its meta-data content. iTools employs a decentralized, portable, scalable and lightweight framework for long-term resource management.
We demonstrate several applications of iTools as a framework for integrated bioinformatics. iTools and the complete details about its specifications, usage and interfaces are available at the iTools web page http://iTools.ccb.ucla.edu.
doi:10.1371/journal.pone.0002265
PMCID: PMC2386255  PMID: 18509477
18.  Fine-Tuning Enhancer Models to Predict Transcriptional Targets across Multiple Genomes 
PLoS ONE  2007;2(11):e1115.
Networks of regulatory relations between transcription factors (TF) and their target genes (TG), implemented through TF binding sites (TFBS), are key features of biology. An idealized approach to solving such networks consists of starting from a consensus TFBS or a position weight matrix (PWM) to generate a high accuracy list of candidate TGs for biological validation. Developing and evaluating such approaches remains a formidable challenge in regulatory bioinformatics. We perform a benchmark study on 34 Drosophila TFs to assess existing TFBS and cis-regulatory module (CRM) detection methods, with a strong focus on the use of multiple genomes. Particularly, for CRM-modelling we investigate the addition of orthologous sites to a known PWM to construct phyloPWMs and we assess the added value of phylogenetic footprinting to predict contextual motifs around known TFBSs. For CRM-prediction, we compare motif conservation with network-level conservation approaches across multiple genomes. Choosing the optimal training and scoring strategies strongly enhances the performance of TG prediction for more than half of the tested TFs. Finally, we analyse a 35th TF, namely Eyeless, and find a significant overlap between predicted TGs and candidate TGs identified by microarray expression studies. In summary, we identify several ways to optimize TF-specific TG predictions, some of which can be applied to all TFs, and others that can be applied only to particular TFs. The ability to model known TF-TG relations, together with the use of multiple genomes, results in a significant step forward in solving the architecture of gene regulatory networks.
doi:10.1371/journal.pone.0001115
PMCID: PMC2047340  PMID: 17973026
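The PWM-based approach mentioned in the abstract above can be illustrated with a minimal sketch: aligned binding sites are turned into a log-odds matrix, which is then slid along a sequence to find the best-scoring window. The pseudocount, the uniform background and the names `build_pwm` and `best_hit` are assumptions for illustration, not the benchmark's actual tools.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds position weight matrix from equal-length aligned sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for pos in range(width):
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2(counts[b] / total / background) for b in BASES})
    return pwm


def best_hit(pwm, sequence):
    """Return (start, score) of the highest-scoring window, or None if too short."""
    width = len(pwm)
    best = None
    for start in range(len(sequence) - width + 1):
        score = sum(pwm[i][sequence[start + i]] for i in range(width))
        if best is None or score > best[1]:
            best = (start, score)
    return best


if __name__ == "__main__":
    known_sites = ["TGACA", "TGACA", "TGACG"]  # hypothetical aligned sites
    pwm = build_pwm(known_sites)
    print(best_hit(pwm, "AAATGACAAA"))
```

Under this sketch, a phyloPWM in the sense of the abstract would correspond to calling `build_pwm` on the known sites plus their orthologous sites from other genomes; how those orthologs are selected and weighted is not specified here.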
19.  A Structural Split in the Human Genome 
PLoS ONE  2007;2(7):e603.
Background
Promoter-associated CpG islands (PCIs) mediate methylation-dependent gene silencing, yet tend to co-locate to transcriptionally active genes. To address this paradox, we used data mining to assess the behavior of PCI-positive (PCI+) genes in the human genome.
Results
PCI+ genes exhibit a bimodal distribution: (1) a ‘housekeeping-like’ subset characterized by higher GC content and lower intron length/number, and (2) a ‘pseudogene paralog’ subset characterized by lower GC content and higher intron length/number (p<0.001). These subsets are functionally distinguishable, with the former gene group characterized by higher expression levels and lower evolutionary rate (p<0.001). PCI-negative (PCI-) genes exhibit higher evolutionary rate and narrower expression breadth than PCI+ genes (p<0.001), consistent with more frequent tissue-specific inactivation.
Conclusions
Adaptive evolution of the human genome appears driven in part by declining transcription of a subset of PCI+ genes, predisposing to both CpG→TpA mutation and intron insertion. We propose a model of evolving biological complexity in which environmentally-selected gains or losses of PCI methylation respectively favor positive or negative selection, thus polarizing PCI+ gene structures around a genomic core of ancestral PCI- genes.
doi:10.1371/journal.pone.0000603
PMCID: PMC1904255  PMID: 17622348
20.  Whole-Genome Cartography of Estrogen Receptor α Binding Sites 
PLoS Genetics  2007;3(6):e87.
Using a chromatin immunoprecipitation-paired end diTag cloning and sequencing strategy, we mapped estrogen receptor α (ERα) binding sites in MCF-7 breast cancer cells. We identified 1,234 high confidence binding clusters of which 94% are projected to be bona fide ERα binding regions. Only 5% of the mapped estrogen receptor binding sites are located within 5 kb upstream of the transcriptional start sites of adjacent genes, regions containing the proximal promoters, whereas the vast majority of the sites are mapped to intronic or distal locations (>5 kb from 5′ and 3′ ends of adjacent transcript), suggesting transcriptional regulatory mechanisms over significant physical distances. Of all the identified sites, 71% harbored putative full estrogen response elements (EREs), 25% bore ERE half sites, and only 4% had no recognizable ERE sequences. Genes in the vicinity of ERα binding sites were enriched for regulation by estradiol in MCF-7 cells, and their expression profiles in patient samples segregate ERα-positive from ERα-negative breast tumors. The expression dynamics of the genes adjacent to ERα binding sites suggest a direct induction of gene expression through binding to ERE-like sequences, whereas transcriptional repression by ERα appears to be through indirect mechanisms. Our analysis also indicates a number of candidate transcription factor binding sites adjacent to occupied EREs at frequencies much greater than by chance, including the previously reported FOXA1 sites, and demonstrates the potential involvement of one such putative adjacent factor, Sp1, in the global regulation of ERα target genes. Unexpectedly, we found that only 22%–24% of the bona fide human ERα binding sites were overlapping conserved regions in whole genome vertebrate alignments, which suggests limited conservation of functional binding sites. Taken together, this genome-scale analysis suggests complex but definable rules governing ERα binding and gene regulation.
Author Summary
Estrogen receptors (ERs) play key roles in facilitating the transcriptional effects of hormone functions in target tissues. To obtain a genome-wide view of ERα binding sites, we applied chromatin immunoprecipitation coupled with a cloning and sequencing strategy using chromatin immunoprecipitation pair end-tagging technology to map ERα binding sites in MCF-7 human breast cancer cells. We identified 1,234 high quality ERα binding sites in the human genome and demonstrated that the binding sites are frequently adjacent to genes significantly associated with breast cancer disease status and outcome. The mapping results also revealed that ERα can influence gene expression across distances of up to 100 kilobases or more, that genes that are induced or repressed utilize sites in different regions relative to the transcript (suggesting different mechanisms of action), and that ERα binding sites are only modestly conserved in evolution. Using computational approaches, we identified potential interactions with other transcription factor binding sites adjacent to the ERα binding elements. Taken together, these findings suggest complex but definable rules governing ERα binding and gene regulation and provide a valuable dataset for mapping the precise control nodes for one of the most important nuclear hormone receptors in breast cancer biology.
doi:10.1371/journal.pgen.0030087
PMCID: PMC1885282  PMID: 17542648
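The split of binding sites into full EREs, ERE half sites and sites with no recognizable ERE, reported in the abstract above, can be sketched as a simple sequence classifier. This is an illustration only: it assumes exact matches to the canonical palindromic consensus GGTCAnnnTGACC and its half sites, whereas the study's own criteria for "putative" EREs are not given in the abstract and presumably tolerate mismatches. The function name `classify_ere` is hypothetical.

```python
import re

# Canonical full ERE: two inverted half sites separated by a 3-bp spacer.
# The full element is its own reverse complement, so one strand suffices.
FULL_ERE = re.compile(r"GGTCA[ACGT]{3}TGACC")

# A half site on either strand (TGACC is the reverse complement of GGTCA).
HALF_SITES = ("GGTCA", "TGACC")


def classify_ere(sequence):
    """Return 'full', 'half' or 'none' for a DNA sequence (exact matching only)."""
    seq = sequence.upper()
    if FULL_ERE.search(seq):
        return "full"
    if any(half in seq for half in HALF_SITES):
        return "half"
    return "none"


if __name__ == "__main__":
    for example in ("aaGGTCAcagTGACCtt", "AAAGGTCAAAA", "AAAAAAA"):
        print(example, classify_ere(example))
```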
22.  Multiplatform genome-wide identification and modeling of functional human estrogen receptor binding sites 
Genome Biology  2006;7(9):R82.
Refinement of the functional human estrogen receptor binding site model using a multi-platform genome-wide approach reveals extended binding specificity signal.
Background
Transcription factor binding sites (TFBS) impart specificity to cellular transcriptional responses and have largely been defined by consensus motifs derived from a handful of validated sites. The low specificity of the computational predictions of TFBSs has been attributed to ubiquity of the motifs and the relaxed sequence requirements for binding. We posited that the inadequacy is due to limited input of empirically verified sites, and demonstrated a multiplatform approach to constructing a robust model.
Results
Using the TFBS for the estrogen receptor (ER)α (estrogen response element [ERE]) as a model system, we extracted EREs from multiple molecular and genomic platforms whose binding to ERα has been experimentally confirmed or rejected. In silico analyses revealed significant sequence information flanking the standard binding consensus, discriminating ERE-like sequences that bind ERα from those that are nonbinders. We extended the ERE consensus by three bases, bearing a terminal G at the third position 3' and an initiator C at the third position 5', which were further validated using surface plasmon resonance spectroscopy. Our functional human ERE prediction algorithm (h-ERE) outperformed existing predictive algorithms and produced fewer than 5% false negatives upon experimental validation.
Conclusion
Building upon a larger experimentally validated ERE set, the h-ERE algorithm is able to demarcate better the universe of ERE-like sequences that are potential ER binders. Only 14% of the predicted optimal binding sites were utilized under the experimental conditions employed, pointing to other selective criteria not related to EREs. Other factors, in addition to primary nucleotide sequence, will ultimately determine binding site selection.
doi:10.1186/gb-2006-7-9-r82
PMCID: PMC1794554  PMID: 16961928
23.  Reconstructing the genomic architecture of mammalian ancestors using multispecies comparative maps 
Human Genomics  2003;1(1):30-40.
Rapidly developing comparative gene maps in selected mammal species are providing an opportunity to reconstruct the genomic architecture of mammalian ancestors and study rearrangements that transformed this ancestral genome into existing mammalian genomes. Here, the recently developed Multiple Genome Rearrangement (MGR) algorithm is applied to human, mouse, cat and cattle comparative maps (with 311-470 shared markers) to impute the ancestral mammalian genome. Reconstructed ancestors consist of 70-100 conserved segments shared across the genomes that have been exchanged by rearrangement events along the ordinal lineages leading to modern species genomes. Genomic distances between species, dominated by inversions (reversals) and translocations, are presented in a first multispecies attempt using ordered mapping data to reconstruct the evolutionary exchanges that preceded modern placental mammal genomes.
doi:10.1186/1479-7364-1-1-30
PMCID: PMC3525001  PMID: 15601531
genome evolution; synteny; mammals; ancestral genome
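The abstract above measures genomic distance in terms of rearrangement events such as inversions. The MGR algorithm itself is beyond a short example, so the sketch below swaps in a much simpler, related quantity: the number of breakpoints between a signed marker order and the identity order on a single chromosome. The function name `breakpoints` is hypothetical, and translocations between chromosomes are not modelled.

```python
def breakpoints(signed_order):
    """Count breakpoints of a signed permutation relative to the identity.

    signed_order: markers 1..n in the order observed in one genome, with a
    negative sign for markers in reversed orientation. The order is framed
    with 0 and n + 1; an adjacency (a, b) is conserved only when b - a == 1.
    """
    n = len(signed_order)
    extended = [0] + list(signed_order) + [n + 1]
    return sum(1 for a, b in zip(extended, extended[1:]) if b - a != 1)


if __name__ == "__main__":
    print(breakpoints([1, 2, 3, 4]))    # identical marker order
    print(breakpoints([1, -3, -2, 4]))  # one inverted segment
```

A single inversion creates at most two breakpoints, which is why breakpoint counts give a lower bound on the number of inversions separating two marker orders.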
