Motivation: One of the challenges in interpreting high-throughput genomic studies such as genome-wide association, microarray or ChIP-seq experiments is their open-ended nature. Once a set of experimentally identified regions is found to be statistically significant, at least two questions arise: (i) besides P-value, do any of these significant regions stand out in terms of biological implications? (ii) Does the set of significant regions, as a whole, have anything in common genome wide? These questions are difficult to address because the number of annotated genomic features (e.g. single nucleotide polymorphisms, transcription factor binding sites, methylation peaks, etc.) keeps growing, and it is difficult to know a priori which features would be most fruitful to analyze. Our goal is to partially automate this process so that associations between experimental features and annotated genomic regions can be examined in a hypothesis-free, data-driven manner.
Results: We created GenomeRunner, a tool for automating annotation and enrichment analysis of genomic features of interest (FOIs) with annotated genomic features (GFs) in different organisms. Besides simple association of FOIs with known GFs, GenomeRunner tests whether the enriched FOIs, as a group, are statistically associated with a large and growing set of genomic features.
Availability: GenomeRunner setup files and source code are freely available at http://sourceforge.net/projects/genomerunner.
Contact: mikhail-dozmorov/at/omrf.org; Jonathan-Wren/at/omrf.org; jdwren/at/gmail.com
Supplementary information: Supplementary data are available at Bioinformatics online.
Genomes encode the instructions for life, and within genomes are many different features with different roles (e.g. genes, CpG islands, microRNAs). High-throughput (HT) technologies have enabled us to experimentally examine whether or not genomic variations associate with certain conditions (e.g. disease). These variations in features of interest (FOI) such as single nucleotide polymorphisms (SNPs) come from various technologies like genome-wide association (GWA) studies, peaks called from ChIP-on-Chip and ChIP-seq experiments and other data from deep sequencing experiments. Regardless of the type of data, each FOI has genomic coordinates that uniquely identify it within the genome.
Tools to prioritize FOIs usually analyze local genomic features (e.g. evolutionary conservation) of individual FOIs and do not consider the set of experimentally identified FOIs as a whole. Functional interpretation of FOI sets is often performed using gene- and pathway-based tools adopted from microarray data analysis (Dennis et al., 2003; Ji et al., 2008; Subramanian et al., 2005; Wang et al., 2007). These tools map FOIs to genes, prioritize lists of FOIs and calculate gene- and pathway-enrichment statistics. They rely, however, upon a gene's role being known and, for humans, approximately one-third of genes have no known function (Wren, 2009). Moreover, as projects like ENCODE are showing, there are many non-coding regions of interest in the genome (Birney et al., 2007). As such, we lack a way to automatically explore the genome and associate our FOIs with genomic features beyond well-annotated gene regions.
The idea of automating genome exploration is not new (Wren et al., 2005), but automated exploration of correlations is complicated by the heterogeneous nature of the HT data versus annotated genomic feature (GF) types. Many efforts have been devoted to developing tools for prioritizing individual SNPs (Cline and Karchin, 2011; Gauderman et al., 2007), but the question of which annotated genomic features, other than well-annotated gene regions, may be associated with FOI sets has not received as much attention, particularly with regard to data-driven exploration. The GREAT tool, for example, associates genomic regions with their putative target genes and calculates Gene Ontology enrichment statistics (McLean et al., 2010), but is gene-centric, whereas GenomeRunner is designed as a much more general-purpose association tool. Galaxy has a variety of tools for operating on genomic intervals and SNP prioritization (Goecks et al., 2010), and GenGen (Chen et al., 2010; Wang et al., 2007) provides a set of Perl scripts for the association of FOIs with genes, transcription factor binding sites, microRNAs, EvoFold regions and other user-provided data, but requires programming skills and data formatting. GenomeRunner differs from these tools in that it annotates SNPs and regions, and automatically searches for any statistically significant enrichment of FOIs with multiple GFs.
GenomeRunner has an intuitive point-and-click interface for querying FOIs against a database of GFs. It accepts lists of FOIs in a tab-delimited format where each region is represented by a chromosome name and start and end coordinates. A user has the option to annotate and calculate enrichment of a set of FOIs with >750 GFs (hg19 database), including genes/exons/introns, upstream promoter regions, DNase clusters, empirically validated and predicted conserved transcription factor binding sites, epigenetic marks, empirical and predicted microRNAs, regions conserved across organisms and more.
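To illustrate the input format described above, the following is a minimal Python sketch of reading tab-delimited regions and counting how many FOIs overlap at least one GF. It is not GenomeRunner's own code (which is written in Visual Basic); the names `Region`, `parse_regions`, `overlaps` and `count_associated` are hypothetical, and the use of inclusive start and end coordinates is an assumption made so that single-point features such as SNPs (start equal to end) are handled.

```python
import csv
import io
from typing import List, NamedTuple


class Region(NamedTuple):
    """One genomic region: chromosome name plus start and end coordinates."""
    chrom: str
    start: int
    end: int


def parse_regions(text: str) -> List[Region]:
    """Parse tab-delimited lines of the form: chromosome, start, end.

    Blank lines and lines starting with '#' are skipped (an assumption
    of this sketch, not a documented GenomeRunner rule).
    """
    regions = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        chrom, start, end = row[0], int(row[1]), int(row[2])
        if end < start:
            raise ValueError(f"end before start in region: {row}")
        regions.append(Region(chrom, start, end))
    return regions


def overlaps(a: Region, b: Region) -> bool:
    """True if two regions share at least one position (inclusive ends assumed)."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def count_associated(fois: List[Region], gfs: List[Region]) -> int:
    """Number of FOIs that overlap at least one GF (simple quadratic scan)."""
    return sum(any(overlaps(f, g) for g in gfs) for f in fois)
```

For example, `count_associated(parse_regions(foi_text), parse_regions(gf_text))` gives the observed number of associations for one GF. The quadratic scan is kept for clarity; a sorted or indexed interval structure would be preferable for genome-scale inputs.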
Tracks in the UCSC genome browser are represented by tables (Karolchik et al., 2004) available for download. As such, they represent an ideal mechanism for assembling genome annotation into a single database and for querying associations of FOIs with a GF of interest. We host selected UCSC tracks (Supplementary Material) in a MySQL database available to all GenomeRunner users, but there is also an option for local installation of this database. GenomeRunner's database is synchronized with UCSC data monthly, and new GFs are added for GenomeRunner's analysis as they become available.
Besides simple associations of individual FOIs with GFs, GenomeRunner estimates the likelihood of observing associations with individual GFs for a set of FOIs versus a set of random regions with the same characteristics. By default, GenomeRunner runs enrichment tests against the whole genome as a background; a user-defined background can be loaded instead. Each simulation selects the same number of random points from within the background and correlates them with the same GF. The parameters of the random simulations are evaluated, and the observed number of associations of FOIs with a GF is compared with the Gaussian distribution of the random simulations. A user has the option to generate a file with random features/SNPs and run it as input to evaluate the performance of the Monte Carlo simulations.
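The enrichment procedure described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not GenomeRunner's implementation: the background is a single contiguous coordinate range rather than a whole genome, FOIs are single points, the number of simulations and the two-sided P-value are choices made for this sketch, and the names `count_hits` and `enrichment` are hypothetical.

```python
import math
import random
import statistics
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # (start, end), inclusive ends assumed


def count_hits(points: List[int], gf_intervals: List[Interval]) -> int:
    """Number of points that fall inside at least one GF interval."""
    return sum(any(s <= p <= e for s, e in gf_intervals) for p in points)


def enrichment(foi_points: List[int], gf_intervals: List[Interval],
               background: Interval, n_sim: int = 1000,
               seed: int = 0) -> Dict[str, float]:
    """Compare observed FOI-GF associations with random expectation.

    Each simulation draws as many random points from the background as
    there are FOIs and counts their associations with the same GF. The
    observed count is then compared with a Gaussian whose mean and
    standard deviation are estimated from the simulated counts.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    bg_start, bg_end = background
    observed = count_hits(foi_points, gf_intervals)

    simulated = []
    for _ in range(n_sim):
        random_points = [rng.randint(bg_start, bg_end) for _ in foi_points]
        simulated.append(count_hits(random_points, gf_intervals))

    mean = statistics.mean(simulated)
    sd = statistics.stdev(simulated)
    if sd == 0:
        # No variation among simulations: the Gaussian comparison is undefined.
        return {"observed": observed, "mean": mean, "sd": sd,
                "z": float("nan"), "p": float("nan")}

    z = (observed - mean) / sd
    # Two-sided tail probability of the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    return {"observed": observed, "mean": mean, "sd": sd, "z": z, "p": p}
```

A positive `z` indicates more associations than expected by chance (enrichment) and a negative `z` fewer (underrepresentation). Feeding randomly generated points in as the FOIs should give `z` values near zero, which is one way to check how the simulation behaves.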
GenomeRunner was developed using Visual Basic 2010 and the ALGLIB add-on (http://www.alglib.net), and is distributed as an executable program along with source code.
As a proof of principle of GenomeRunner's potential to discover associations, we ran an enrichment analysis of all 5190 probes recognizing non-coding RNAs on Affymetrix's Human Gene 1.0 ST array. GenomeRunner found them enriched within genes (P=3.04E-05) but significantly underrepresented in exon regions (P=5.25E-21), confirming recently published findings (Rearick et al., 2011). Another positive control was validation of the association between H3K4me2 and DNase hypersensitive sites (Birney et al., 2007; P < 1.00E-32). More examples of GenomeRunner analyses are shown in Supplementary Material.
GenomeRunner is designed to scan a large set of genomic features, including non-gene regions such as ncRNAs and epigenetic marks, in search of correlations with genomic FOIs. The open-ended structure of GenomeRunner can and will be adapted to be included in the Galaxy suite. Currently, human and mouse genome annotations are available for GenomeRunner, and new organisms will be added in the near future. Besides automatic correlation of FOIs with known genomic features, GenomeRunner calculates enrichment against either the whole genome or a user-defined background. In summary, we provide the scientific community with a tool for automated exploration of statistically significant associations between experimental FOIs and annotated genomic features.
Funding: National Institutes of Health (grant # 5P20RR020143-08 to J.D.W.) and a pilot project subaward (to M.G.D.) to develop GenomeRunner.
Conflict of Interest: None declared.