Increased autoantibody reactivity in plasma from Myelodysplastic Syndromes (MDS) patients may provide novel disease signatures and allow early detection. In a two-stage study, we investigated Immunoglobulin G reactivity in plasma from patients with MDS, patients with Acute Myeloid Leukemia post MDS, and a healthy cohort. In exploratory Stage I, we used high-throughput protein arrays to identify 35 high-interest proteins showing increased reactivity in patient subgroups compared to healthy controls. In validation Stage II, we designed new arrays focusing on 25 of the proteins identified in Stage I and expanded the initial cohort. We validated increased antibody reactivity against AKT3, FCGR3A and ARL8B in patients, which enabled sample classification into stable MDS and healthy individuals. We also detected elevated AKT3 protein levels in MDS patient plasma. The discovery of increased specific autoantibody reactivity in MDS patients provides molecular signatures for classification, supplementing existing risk categorizations, and may enhance diagnostic and prognostic capabilities for MDS.
The signal transducer and activator of transcription 3 (STAT3) is a transcription factor that, when dysregulated, becomes a powerful oncogene found in many human cancers, including diffuse large B-cell lymphoma. Diffuse large B-cell lymphoma is the most common form of non-Hodgkin’s lymphoma and has two major subtypes: germinal center B-cell-like and activated B-cell-like. Compared with the germinal center B-cell-like form, activated B-cell-like lymphomas respond much more poorly to current therapies and often exhibit overexpression or overactivation of STAT3. To investigate how STAT3 might contribute to this aggressive phenotype, we have integrated genome-wide studies of STAT3 DNA binding using chromatin immunoprecipitation-sequencing with whole-transcriptome profiling using RNA-sequencing. STAT3 binding sites are present near almost a third of all genes that differ in expression between the two subtypes, and examination of the affected genes identified previously undetected and clinically significant pathways downstream of STAT3 that drive oncogenesis. Novel treatments aimed at these pathways may improve survival in activated B-cell-like diffuse large B-cell lymphoma.
genomics; next-generation sequencing; cancer signaling; signal transduction; oncogenic pathways
Understanding of gene regulatory networks requires discovery of expression modules within gene co-expression networks and identification of promoter motifs and corresponding transcription factors that regulate their expression. A commonly used method for this purpose is a top-down approach based on clustering the network into a range of densely connected segments, treating these segments as expression modules, and extracting promoter motifs from these modules. Here, we describe a novel bottom-up approach to identify gene expression modules driven by known cis-regulatory motifs in the gene promoters. For a specific motif, genes in the co-expression network are ranked according to their probability of belonging to an expression module regulated by that motif. The ranking is conducted via motif enrichment or motif position bias analysis. Our results indicate that motif position bias analysis is an effective tool for genome-wide motif analysis. Sub-networks containing the top ranked genes are extracted and analyzed for inherent gene expression modules. This approach identified novel expression modules for the G-box, W-box, site II, and MYB motifs from an Arabidopsis thaliana gene co-expression network based on the graphical Gaussian model. The novel expression modules include those involved in house-keeping functions, primary and secondary metabolism, and abiotic and biotic stress responses. In addition to confirmation of previously described modules, we identified modules that include new signaling pathways. To associate transcription factors that regulate genes in these co-expression modules, we developed a novel reporter system. Using this approach, we evaluated MYB transcription factor-promoter interactions within MYB motif modules.
Gene co-expression networks unite genes with similar expression patterns. From these networks, gene co-expression modules can be identified. A specific family of transcription factors may regulate the genes within a co-expression module. Thus, module identification is important for deciphering the gene regulatory network. Previously, module identification relied on clustering the gene network into gene clusters that were then treated as modules. This represents a top-down approach. Here, we introduce a reverse approach that aims to identify gene co-expression modules regulated by known promoter motifs. For a given promoter motif, we calculated the probability that each gene within the network belongs to a module regulated by that motif, via motif enrichment analysis or motif position bias analysis. A sub-network containing the genes with a high probability of belonging to a motif-driven module was then extracted from the gene co-expression network. From this sub-network, the modular structure can be identified via visual inspection. Our bottom-up approach recovered many known and novel modules for the G-box, MYB, W-box and site II motifs, whose expression may be regulated by the transcription factors that bind to these motifs. Additionally, we developed a rapid transcription factor-promoter interaction screening system to validate predicted interactions.
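The ranking step described above can be illustrated with a minimal sketch. The abstract does not specify the enrichment statistic, so the hypergeometric test on each gene's network neighbors used here is an assumption, and the function names and input layout (`network`, `has_motif`) are hypothetical.

```python
from math import comb

def hypergeom_sf(k, M, n, N):
    """P(X >= k) when N genes are drawn without replacement from M genes,
    n of which carry the motif in their promoter."""
    upper = min(n, N)
    total = comb(M, N)
    return sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1)) / total

def rank_genes_by_motif_enrichment(network, has_motif):
    """Rank genes by how strongly their neighbors are enriched for a motif.

    network:   dict mapping gene -> set of co-expressed neighbor genes (assumed layout)
    has_motif: dict mapping every gene -> True if its promoter contains the motif
    Returns gene names sorted from most to least enriched (smallest p-value first).
    """
    M = len(has_motif)              # all genes in the network
    n = sum(has_motif.values())     # genes carrying the motif
    scores = {}
    for gene, neighbors in network.items():
        N = len(neighbors)
        k = sum(has_motif[g] for g in neighbors)
        scores[gene] = hypergeom_sf(k, M, n, N) if N else 1.0
    return sorted(scores, key=scores.get)

# Toy example: gene "a" sits among motif-carrying neighbors, gene "d" does not.
has_motif = {"a": True, "b": True, "c": True, "d": False, "e": False, "f": False}
network = {"a": {"b", "c"}, "d": {"e", "f"}}
ranking = rank_genes_by_motif_enrichment(network, has_motif)
```

The top-ranked genes would then define the sub-network that is inspected for modules; any cutoff on the p-value is a further choice not stated in the text.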
Summary: The functional annotation of variants obtained through sequencing projects is generally assumed to be a simple intersection of genomic coordinates with genomic features. However, complexities arise for several reasons, including the differential effects of a variant on alternatively spliced transcripts, as well as the difficulty in assessing the impact of small insertions/deletions and large structural variants. Taking these factors into consideration, we developed the Variant Annotation Tool (VAT) to functionally annotate variants from multiple personal genomes at the transcript level as well as obtain summary statistics across genes and individuals. VAT also allows visualization of the effects of different variants, integrates allele frequencies and genotype data from the underlying individuals and facilitates comparative analysis between different groups of individuals. VAT can either be run through a command-line interface or as a web application. Finally, in order to enable on-demand access and to minimize unnecessary transfers of large data files, VAT can be run as a virtual machine in a cloud-computing environment.
Availability and Implementation: VAT is implemented in C and PHP. The VAT web service, Amazon Machine Image, source code and detailed documentation are available at vat.gersteinlab.org.
Contact: email@example.com or firstname.lastname@example.org
Supplementary data are available at Bioinformatics online.
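To illustrate why transcript-level annotation is more than a single coordinate intersection, the following minimal sketch classifies one variant position against two alternatively spliced transcripts of the same gene. It is not VAT's implementation (VAT is written in C and PHP); the `Transcript` class, the `annotate` function and the three-way labels are hypothetical simplifications.

```python
from dataclasses import dataclass

@dataclass
class Transcript:
    name: str
    exons: list  # list of (start, end) tuples, 1-based inclusive coordinates

def annotate(pos, transcripts):
    """Label a variant position separately for every transcript."""
    result = {}
    for t in transcripts:
        tx_start = min(s for s, _ in t.exons)
        tx_end = max(e for _, e in t.exons)
        if any(s <= pos <= e for s, e in t.exons):
            result[t.name] = "exonic"
        elif tx_start <= pos <= tx_end:
            result[t.name] = "intronic"
        else:
            result[t.name] = "outside"
    return result

# Two isoforms of one gene; the second includes an extra cassette exon.
isoform_1 = Transcript("isoform_1", [(100, 200), (300, 400)])
isoform_2 = Transcript("isoform_2", [(100, 200), (250, 260), (300, 400)])
labels = annotate(255, [isoform_1, isoform_2])
```

The same position is intronic in one isoform and exonic in the other, which is the situation the summary describes for alternatively spliced transcripts.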
The Cleavage Factor 1A (CF1A) complex, which is required for the termination of transcription in budding yeast, occupies the 3′ end of transcriptionally active genes. We recently demonstrated that CF1A subunits also crosslink to the 5′ end of genes during transcription. The presence of CF1A complex at the promoter suggested its possible involvement in the initiation/reinitiation of transcription. To check this possibility, we performed transcription run-on assay, RNAP II-density ChIP and strand-specific RT-PCR analysis in a mutant of CF1A subunit Clp1. As expected, RNAP II read through the termination signal in the temperature-sensitive mutant of clp1 at elevated temperature. The transcription readthrough phenotype was accompanied by a decrease in the density of RNAP II in the vicinity of the promoter region. With the exception of TFIIB and TFIIF, the recruitment of the general transcription factors onto the promoter, however, remained unaffected in the clp1 mutant. These results suggest that the CF1A complex affects the recruitment of RNAP II onto the promoter for reinitiation of transcription. Simultaneously, an increase in synthesis of promoter-initiated divergent antisense transcript was observed in the clp1 mutant, thereby implicating CF1A complex in providing directionality to the promoter-bound polymerase. Chromosome Conformation Capture (3C) analysis revealed a physical interaction of the promoter and terminator regions of a gene in the presence of a functional CF1A complex. Gene looping was completely abolished in the clp1 mutant. On the basis of these results, we propose that the CF1A-dependent recruitment of RNAP II onto the promoter for reinitiation and the regulation of directionality of promoter-associated transcription are accomplished through gene looping.
The termination of transcription requires two major multisubunit complexes in budding yeast. These termination complexes are localized at the 3′ end of genes. Recent studies have found the termination factors occupying the 5′ end of genes as well. In this study, we investigate the physiological role of a termination factor at the 5′ end of a gene. Our results show that the CF1 termination complex affects the recruitment of the transcription enzyme RNAP II onto the promoter for reinitiation of transcription. The complex also affects the directionality of transcription of the promoter-bound polymerase. We also found that the looped gene conformation was disrupted in the absence of a functional termination complex. The overall conclusion of these results is that the terminator-bound factors contact the 5′ end of genes due to gene looping, and affect both the recruitment of the polymerase at the promoter for reinitiation, and directionality of the promoter-initiated transcription. Thus, the role of termination factors is not restricted to the 3′ end of the gene, but they are also involved in promoter-associated transcription.
Metabolites comprise the molar majority of chemical substances in living cells, and metabolite-protein interactions are expected to be quite common. Many interactions have already been identified and have been shown to be involved in the regulation of different types of cellular processes including signaling events, enzyme activities, protein localizations and interactions. Recent technological advances have greatly facilitated the detection of metabolite-protein interactions at high sensitivity and some of these have been applied on a large scale. In this manuscript, we review the available in vitro, in silico and in vivo technologies for mapping small-molecule-protein interactions. Although some of these were developed for drug-protein interactions they can be applied for mapping metabolite-protein interactions. Information gained from the use of these approaches can be applied to the manipulation of cellular processes and therapeutic applications.
Metabolite-protein interaction; metabolite detection; protein separation; technique
The West Nile virus (WNV) is an emerging infection of biodefense concern and there are no available treatments or vaccines. Here we used a high-throughput method based on a novel gene expression analysis, RNA-Seq, to give a global picture of differential gene expression by primary human macrophages of 10 healthy donors infected in vitro with WNV. From a total of 28 million reads per sample, we identified 1,514 transcripts that were differentially expressed after infection. Both predicted and novel gene changes were detected, as were gene isoforms, and while many of the genes were expressed by all donors, some were unique. Knock-down of genes not previously known to be associated with WNV resistance identified their critical role in control of viral infection. Our study distinguishes both common gene pathways as well as novel cellular responses. Such analyses will be valuable for translational studies of susceptible and resistant individuals—and for targeting therapeutics—in multiple biological settings.
anti-viral gene expression; immune response; macrophage; RNA-Seq; West Nile virus
Dilated cardiomyopathy (DCM) is the most common cardiomyopathy, characterized by ventricular dilatation, systolic dysfunction, and progressive heart failure. DCM is the most common diagnosis leading to heart transplantation and places a significant burden on healthcare worldwide. The advent of induced pluripotent stem cells (iPSCs) offers an exceptional opportunity for creating disease-specific models, investigating underlying mechanisms, and optimizing therapy. Here we generated cardiomyocytes (CMs) from iPSCs derived from patients of a DCM family carrying a point mutation (R173W) in the gene encoding sarcomeric protein cardiac troponin T. Compared to the control healthy individuals in the same family cohort, DCM iPSC-CMs exhibited altered Ca2+ handling, decreased contractility, and abnormal sarcomeric α-actinin distribution. When stimulated with β-adrenergic agonist, DCM iPSC-CMs showed characteristics of failure such as reduced beating rates, compromised contraction, and significantly more cells with abnormal sarcomeric α-actinin distribution. β-adrenergic blocker treatment and over-expression of sarcoplasmic reticulum Ca2+ ATPase (Serca2a) improved DCM iPSC-CMs function. Our study demonstrated that human DCM iPSC-CMs recapitulated to some extent the disease phenotypes morphologically and functionally, and thus can serve as a useful platform for exploring molecular and cellular mechanisms and optimizing treatment of this particular disease.
Personalized medicine is expected to benefit from combining genomic information with regular monitoring of physiological states by multiple high-throughput methods. Here we present an integrative Personal Omics Profile (iPOP), an analysis that combines genomic, transcriptomic, proteomic, metabolomic, and autoantibody profiles from a single individual over a 14-month period. Our iPOP analysis revealed various medical risks, including Type II diabetes. It also uncovered extensive, dynamic changes in diverse molecular components and biological pathways across healthy and diseased conditions. Extremely high coverage genomic and transcriptomic data, which provide the basis of our iPOP, discovered extensive heteroallelic changes during healthy and diseased states and an unexpected RNA editing mechanism. This study demonstrates that longitudinal iPOP can be used to interpret healthy and disease states by connecting genomic information with additional dynamic omics activity.
Michael Snyder answers Genome Biology's questions on the human and professional stories underlying his Snyderome integrative omics project.
Epigenetic regulation is dynamic and cell-type dependent. The recently available epigenomic data in multiple cell types provide an unprecedented opportunity for a comparative study of the epigenetic landscape. We developed a machine-learning method called ChroModule to annotate the epigenetic states in eight Encyclopedia of DNA Elements (ENCODE) cell types. The trained model successfully captured the characteristic histone-modification patterns associated with regulatory elements, such as promoters and enhancers, and outperformed state-of-the-art methods at identifying enhancers. In addition, because the number of epigenetic states in the model is fixed, ChroModule allows straightforward illustration of epigenetic variability across multiple cell types. Using this feature, we found that invariable and variable epigenetic states across cell types correspond to housekeeping functions and stimulus response, respectively. In particular, we observed that enhancers, but not the other regulatory elements, dictate cell specificity, as similar cell types share common enhancers, and cell-type-specific enhancers are often bound by transcription factors playing critical roles in that cell type. We also found that some genomic regions are dormant in one cell type but primed to become active in other cell types. These observations highlight the usefulness of ChroModule in comparative analysis and interpretation of multiple epigenomes.
Precise identification of RNA-coding regions and transcriptomes of eukaryotes is a significant problem in biology. Currently, eukaryote transcriptomes are analyzed using deep short-read sequencing experiments of complementary DNAs. The resulting short reads are then aligned against a genome and annotated junctions to infer biological meaning. Here we use long-read complementary DNA datasets for the analysis of a eukaryotic transcriptome and generate two large datasets in the human K562 and HeLa S3 cell lines. Both datasets comprised at least 4 million reads and had median read lengths greater than 500 bp. We show that annotation-independent alignments of these reads provide partial gene structures that agree closely with annotated gene structures, 15% of which have not been obtained in a previous de novo analysis of short reads. For long noncoding RNA (lncRNA) genes, however, we find an increased fraction of novel gene structures among our alignments. Other important aspects of transcriptome analysis, such as the description of cell type-specific splicing, can be performed in an accurate, reliable and completely annotation-free manner, making the approach well suited to the analysis of transcriptomes of newly sequenced genomes. Furthermore, we demonstrate that long-read sequences can be assembled into full-length transcripts with considerable success. Our method is applicable to all long-read sequencing technologies.
RNA; Roche sequencing; human; splicing; transcriptome
To understand the diversity of transcripts in yeast (Saccharomyces cerevisiae) we analyzed the transcriptional landscapes for cells grown under 18 different environmental conditions. Each sample was analyzed using RNA-sequencing, and a total of 670,446,084 uniquely mapped reads and 377,263 poly-adenylated end tags were produced. Consistent with previous studies, we find that the majority of yeast genes are expressed under one or more different conditions. By directly comparing the 5′ and 3′ ends of the transcribed regions, we find extensive differences in transcript ends across many conditions, especially those of stationary phase, growth in grape juice, and salt stimulation, suggesting differential choice of transcription start and stop sites is pervasive in yeast. Relative to the exponential growth condition (i.e., YPAD), transcripts differing at the 5′ ends and 3′ ends are predicted to differ in their annotated start codon in 21 genes and their annotated stop codon in 63 genes. Many (431) upstream open reading frames (uORFs) are found in alternate 5′ ends and are significantly enriched in transcripts produced during the salt response. Mutational analysis of five genes with uORFs revealed that two sets of uORFs increase the expression of a reporter construct, indicating a role in activation which had not been reported previously, whereas two other uORFs decreased expression. In addition, RNA binding protein motifs are statistically enriched for alternate ends under many conditions. Overall, these results demonstrate enormous diversity of transcript ends, and that this heterogeneity is regulated under different environmental conditions. Moreover, transcript end diversity has important biological implications for the regulation of gene expression. In addition, our data also serve as a valuable resource for the scientific community.
yeast; RNA-sequencing; environmental conditions; UTRs
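The upstream open reading frames mentioned in the abstract can be located with a simple scan, sketched below. The abstract does not give the exact uORF definition used, so this sketch assumes a uORF is an ATG in the 5′ UTR followed by an in-frame stop codon that also lies within the UTR; the function name `find_uorfs` is hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}

def find_uorfs(utr5):
    """Return (start, end) index pairs (0-based, end exclusive) for every
    ATG in the 5' UTR that is followed by an in-frame stop codon inside the UTR."""
    utr5 = utr5.upper()
    uorfs = []
    for i in range(len(utr5) - 2):
        if utr5[i:i + 3] != "ATG":
            continue
        # Walk codon by codon from the first codon after the ATG.
        for j in range(i + 3, len(utr5) - 2, 3):
            if utr5[j:j + 3] in STOPS:
                uorfs.append((i, j + 3))
                break
    return uorfs

# Toy 5' UTR containing one short uORF: ATG AAA TAG
example = find_uorfs("CCATGAAATAGCC")
```

Running the same scan on alternate 5′ ends from different growth conditions would show which transcript isoforms gain or lose a uORF; uORFs that overlap the main coding sequence are deliberately ignored in this simplified version.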
Genome sequencing technologies have advanced rapidly, dramatically decreasing cost and increasing throughput. But beyond faster and cheaper, these advances have also stimulated the development of innovative new experimental approaches, and are opening new doors in human medicine and health.
Advances in genome sequencing have progressed at a rapid pace, with increased throughput accompanied by plunging costs. But these advances go far beyond faster and cheaper. High-throughput sequencing technologies are now routinely being applied to a wide range of important topics in biology and medicine, often allowing researchers to address important biological questions that were not possible before. In this review, we discuss these innovative new approaches—including ever finer analyses of transcriptome dynamics, genome structure and genomic variation—and provide an overview of the new insights into complex biological systems catalyzed by these technologies. We also assess the impact of genotyping, genome sequencing and personal omics profiling on medical applications, including diagnosis and disease monitoring. Finally, we review recent developments in single-cell sequencing, and conclude with a discussion of possible future advances and obstacles for sequencing in biology and health.
biology; high-throughput; medicine; sequencing; technologies
Higher-order chromosomal organization for transcription regulation is poorly understood in eukaryotes. Using genome-wide Chromatin Interaction Analysis with Paired-End-Tag sequencing (ChIA-PET), we mapped long-range chromatin interactions associated with RNA polymerase II in human cells and uncovered widespread promoter-centered intra-genic, extra-genic and inter-genic interactions. These interactions further aggregated into higher-order clusters, wherein proximal and distal genes were engaged through promoter-promoter interactions. Most genes with promoter-promoter interactions were active and transcribed cooperatively, and some interacting promoters could influence each other implying combinatorial complexity of transcriptional controls. Comparative analyses of different cell lines showed that cell-specific chromatin interactions could provide structural frameworks for cell-specific transcription, and suggested significant enrichment of enhancer-promoter interactions for cell-specific functions. Furthermore, genetically-identified disease-associated non-coding elements were found to be spatially engaged with corresponding genes through long-range interactions. Overall, our study provides insights into the transcription regulation by three-dimensional chromatin interactions for both housekeeping and cell-specific genes in human cells.
Accurate chromosome segregation requires centromeres (CENs), the DNA sequences where kinetochores form, to attach chromosomes to microtubules. In contrast to most eukaryotes, which have broad centromeres, Saccharomyces cerevisiae possesses sequence-defined point CENs. Chromatin immunoprecipitation followed by sequencing (ChIP–Seq) reveals colocalization of four kinetochore proteins at novel, discrete, non-centromeric regions, especially when levels of the centromeric histone H3 variant, Cse4 (a.k.a. CENP-A or CenH3), are elevated. These regions of overlapping protein binding enhance the segregation of plasmids and chromosomes and have thus been termed Centromere-Like Regions (CLRs). CLRs form in close proximity to S. cerevisiae CENs and share characteristics typical of both point and regional CENs. CLR sequences are conserved among related budding yeasts. Many genomic features characteristic of CLRs are also associated with these conserved homologous sequences from closely related budding yeasts. These studies provide general and important insights into the origin and evolution of centromeres.
Centromeres (CENs) are chromosomal regions essential for proper chromosome segregation through their ability to establish evolutionarily conserved protein complexes called kinetochores. During mitosis, kinetochores attach to microtubules emanating from spindle poles, thus providing the mechanism for chromosome segregation. Eukaryotes have different types of CENs. Most eukaryotes have large multimeric centromeres lacking DNA sequence specificity. In contrast, the budding yeast, S. cerevisiae, has short punctate centromeres, comprised of specific DNA sequences. Combining chromatin immunoprecipitation and deep sequencing, we identified regions of the yeast genome that are bound by key kinetochore components; we refer to these regions as Centromere-Like Regions (CLRs). We found that CLRs can promote segregation on episomal plasmids and native chromosomes. Most CLRs are found in intergenic regions, close to native CENs. CLRs resemble point CENs by their short size and regional centromeres by their lack of determining DNA sequences. CLR sequences are conserved among related budding yeasts. Our findings indicate that, similar to other fungi and eukaryotes, S. cerevisiae possesses the ability to form sequence-independent centromeric structures. Establishment of centromeric elements outside regular CENs, or neocentromerization, can lead to chromosome missegregation and is a hallmark of cancer cells. CLR formation in budding yeast provides a simple model of neocentromerization.
somatic mosaicism; genomic rearrangement; CNV; aCGH; cancer genomics
Metabolites interact with proteins in vivo in various ways other than enzymatic reactions. Profiling of such interactions may help disclose unknown molecular mechanisms that regulate protein functions, and provide potential targets for disease treatment. Here we describe a procedure for systematic analyses of metabolite-protein interactions in vivo. This procedure couples protein affinity purification and mass spectrometry to identify metabolite-protein interactions. The primary effort can be completed within one day and scaled to process hundreds of samples in a batch. Originally developed in yeast, the same principle and protocol can be adapted to other organisms.
Metabolite-protein interaction; liquid chromatography; mass spectrometry; LC-MS; metabolite; protein affinity purification; yeast
DNA capture technologies combined with high-throughput sequencing now enable cost-effective, deep-coverage, targeted sequencing of complete exomes. This approach is well suited for SNP discovery and genotyping. However, little attention has been devoted to Copy Number Variation (CNV) detection from exome capture datasets, despite the potentially high impact of CNVs in exonic regions on protein function.
As members of the 1000 Genomes Project analysis effort, we investigated 697 samples in which 931 genes were targeted and sampled with 454 or Illumina paired-end sequencing. We developed a rigorous Bayesian method to detect CNVs in the genes, based on read depth within target regions. Despite substantial variability in read coverage across samples and targeted exons, we were able to identify 107 heterozygous deletions in the dataset. The experimentally determined false discovery rate (FDR) of the cleanest dataset from the Wellcome Trust Sanger Institute is 12.5%. We were able to substantially improve the FDR in a subset of gene deletion candidates that were adjacent to another gene deletion call (17 calls). The estimated sensitivity of our call-set was 45%.
This study demonstrates that exonic sequencing datasets, collected both in population based and medical sequencing projects, will be a useful substrate for detecting genic CNV events, particularly deletions. Based on the number of events we found and the sensitivity of the methods in the present dataset, we estimate on average 16 genic heterozygous deletions per individual genome. Our power analysis informs ongoing and future projects about sequencing depth and uniformity of read coverage required for efficient detection.
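The idea of calling heterozygous deletions from read depth can be illustrated with a deliberately small Bayesian model. This is not the method developed in the study, whose details are not given in the abstract; the sketch assumes Poisson-distributed read counts, an expected depth that halves under a heterozygous deletion, and a hypothetical prior probability `prior_deletion`.

```python
from math import exp, lgamma, log

def log_poisson(k, lam):
    """Log probability of observing k reads under a Poisson with mean lam."""
    return k * log(lam) - lam - lgamma(k + 1)

def posterior_het_deletion(observed_reads, expected_reads, prior_deletion=0.01):
    """Posterior probability that a gene is heterozygously deleted (one copy)
    rather than present in the normal two copies.

    expected_reads: read count expected for two copies (assumed known, e.g.
                    from other samples); one copy is modeled as half of it.
    """
    log_del = log_poisson(observed_reads, 0.5 * expected_reads) + log(prior_deletion)
    log_normal = log_poisson(observed_reads, expected_reads) + log(1 - prior_deletion)
    m = max(log_del, log_normal)          # subtract the max for numerical stability
    w_del = exp(log_del - m)
    w_normal = exp(log_normal - m)
    return w_del / (w_del + w_normal)

# A gene with half the expected depth versus one with the full expected depth.
p_half = posterior_het_deletion(50, 100)
p_full = posterior_het_deletion(100, 100)
```

With higher variability in coverage across samples and exons, as noted in the abstract, a Poisson model would be overconfident; an overdispersed count model would be a natural substitute in a more realistic setting.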
SNAPc is one of a few basal transcription factors used by both RNA polymerase (pol) II and pol III. To define the set of active SNAPc-dependent promoters in human cells, we have localized genome-wide four SNAPc subunits, GTF2B (TFIIB), BRF2, pol II, and pol III. Among some seventy loci occupied by SNAPc and other factors, including pol II snRNA genes, pol III genes with type 3 promoters, and a few un-annotated loci, most are primarily occupied by either pol II and GTF2B, or pol III and BRF2. A notable exception is the RPPH1 gene, which is occupied by significant amounts of both polymerases. We show that the large majority of SNAPc-dependent promoters recruit POU2F1 and/or ZNF143 on their enhancer region, and a subset also recruits GABP, a factor newly implicated in SNAPc-dependent transcription. These activators associate with pol II and III promoters in G1 slightly before the polymerase, and ZNF143 is required for efficient transcription initiation complex assembly. The results characterize a set of genes with unique properties and establish that polymerase specificity is not absolute in vivo.
SNAPc-dependent promoters are unique among cellular promoters in being very similar to each other, even though some of them recruit RNA polymerase II and others RNA polymerase III. We have examined all SNAPc-bound promoters present in the human genome. We find a surprisingly small number of them, some 70 promoters. Among these, the large majority is bound by either RNA polymerase II or RNA polymerase III, as expected, but one gene hitherto considered an RNA polymerase III gene is also occupied by significant levels of RNA polymerase II. Both RNA polymerase II and RNA polymerase III SNAPc-dependent promoters use a largely overlapping set of a few transcription activators, including GABP, a novel factor implicated in snRNA gene transcription.
Protein phosphorylation continues to be regarded as one of the most important post-translational modifications found in eukaryotes and has been implicated in key roles in the development of a number of human diseases. In order to elucidate roles for the 518 human kinases, phosphorylation has routinely been studied using the budding yeast Saccharomyces cerevisiae as a model system. In recent years, a number of technologies have emerged to globally map phosphorylation in yeast. In this article, we review these technologies and discuss how these phosphorylation mapping efforts have shed light on our understanding of kinase signaling pathways and eukaryotic proteomic networks in general.
dynamic networks; kinase; substrate relationships; phosphorylation mapping; proteomic networks; Saccharomyces cerevisiae; scale-free networks
Chromatin-remodeling enzymes play essential roles in many biological processes, including gene expression, DNA replication and repair, and cell division. Although one such complex, SWI/SNF, has been extensively studied, new discoveries are still being made. Here, we review SWI/SNF biochemistry; highlight recent genomic and proteomic advances; and address the role of SWI/SNF in human diseases, including cancer and viral infections. These studies have greatly increased our understanding of complex nuclear processes.
Cancer; Chromatin; Chromatin Immunoprecipitation (ChIP); Chromatin Remodeling; DNA Sequencing; HIV-1; Mass Spectrometry (MS); Transcriptional Regulation; Viral Transcription; SWI/SNF
Transcription factors function by binding different classes of regulatory elements. The Encyclopedia of DNA Elements (ENCODE) project has recently produced binding data for more than 100 transcription factors from about 500 ChIP-seq experiments in multiple cell types. While this large amount of data creates a valuable resource, it is nonetheless overwhelmingly complex and simultaneously incomplete since it covers only a small fraction of all human transcription factors.
As part of the consortium effort in providing a concise abstraction of the data for facilitating various types of downstream analyses, we constructed statistical models that capture the genomic features of three paired types of regions by machine-learning methods: firstly, regions with active or inactive binding; secondly, those with extremely high or low degrees of co-binding, termed HOT and LOT regions; and finally, regulatory modules proximal or distal to genes. From the distal regulatory modules, we developed computational pipelines to identify potential enhancers, many of which were validated experimentally. We further associated the predicted enhancers with potential target transcripts and the transcription factors involved. For HOT regions, we found a significant fraction of transcription factor binding without clear sequence motifs and showed that this observation could be related to strong DNA accessibility of these regions.
Overall, the three pairs of regions exhibit intricate differences in chromosomal locations, chromatin features, factors that bind them, and cell-type specificity. Our machine learning approach enables us to identify features potentially general to all transcription factors, including those not included in the data.
The discovery of DNA regulatory motifs in sequenced genomes using computational methods remains challenging. Here, we present MotifIndexer, a comprehensive strategy for de novo identification of DNA regulatory motifs at the genome level. Using word-counting methods, we indexed the existence of every 8-mer oligo composed of bases A, C, G, T, r, y, s, w, m, k, n or 12-mer oligo composed of A, C, G, T, n, in the promoters of all predicted genes of the Arabidopsis thaliana genome and of selected stress-induced co-expressed genes. From this analysis, we identified a number of over-represented motifs. Among these, major critical motifs were identified using a position filter. We used a model based on a uniform distribution, and the z-scores derived from this model, to describe position bias. Interestingly, many motifs showed position bias towards the transcription start site. We extended this model to show biased distribution of motifs in the genomes of both A. thaliana and rice. We also used MotifIndexer to identify conserved motifs in co-expressed gene groups from two Arabidopsis species, A. thaliana and A. lyrata. This new comparative genomics method does not depend on alignments of homologous gene promoter sequences.
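The two ingredients named above, matching degenerate oligos and scoring position bias against a uniform model, can be sketched as follows. This is not MotifIndexer itself: the function names are hypothetical, the degenerate letters are interpreted with the standard IUPAC nucleotide codes, and the z-score assumes each occurrence falls in a chosen window with probability equal to the window's share of the promoter length (a binomial reading of the uniform model, ignoring edge effects).

```python
import re
from math import sqrt

# Standard IUPAC nucleotide codes for the letters listed in the text.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
         "M": "[AC]", "K": "[GT]", "N": "[ACGT]"}

def motif_positions(motif, promoter):
    """Start positions (0-based) of all, possibly overlapping, motif matches."""
    body = "".join(IUPAC[base] for base in motif.upper())
    pattern = re.compile("(?=" + body + ")")   # lookahead allows overlapping hits
    return [m.start() for m in pattern.finditer(promoter.upper())]

def position_bias_z(positions, promoter_len, window_start, window_end):
    """z-score for the number of occurrences inside [window_start, window_end)
    compared with a uniform distribution along the promoter."""
    n = len(positions)
    p = (window_end - window_start) / promoter_len
    observed = sum(window_start <= x < window_end for x in positions)
    expected = n * p
    sd = sqrt(n * p * (1 - p))
    return (observed - expected) / sd if sd > 0 else 0.0

# G-box core (CACGTG) and a degenerate variant in a toy promoter.
hits_exact = motif_positions("CACGTG", "AACACGTGAA")
hits_degenerate = motif_positions("CAnnTG", "AACACGTGAA")

# Occurrences pooled over 1000-bp promoters; window = 200 bp nearest the TSS.
z = position_bias_z([900, 910, 950, 100], 1000, 800, 1000)
```

A large positive z-score for the window nearest the transcription start site corresponds to the position bias the abstract reports; in practice the positions would be pooled across all promoters in the gene set.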