Almost 20 vertebrate genomes have been fully sequenced up to date. Gene annotation of the human, mouse and several other genomes reaches high confidence levels (Pruitt
et al.,
2007), and functional classification databases provide valuable information to understand the biological processes associated with different groups of genes in these genomes. Gene Ontology (GO) (Ashburner
et al.,
2000), KEGG (Kanehisa
et al.,
2006,
2008), OMIM (Boyadjiev and Jabs,
2000; Hamosh
et al.,
2002,
2005) and OBO Cell Ontology (Smith
et al.,
2007), are just some of the most widely used functional annotation databases, which have enabled intriguing discoveries during the last decade (Hvidsten
et al.,
2001; King
et al.,
2003).
The classical approach to functional inference identifies annotation terms that are significantly over- or underrepresented within a given class of genes; over- or underrepresentations are identified by comparing the count of occurrences for each annotation term to the expected value, which usually arises from considering the number of genes assigned to each category in the complete genome. Several tools have been developed to perform the classification analysis that are mainly based on the hypergeometric test [among them BiNGO (Maere
et al.,
2005), GO::TermFinder (Boyle EI,
2004) and GOToolBox (Martin,
2004)] or Fisher's exact test, which relies on properties of the hypergeometric distribution [like GOstat (Beissbarth and Speed,
2004) and FatiGO (Al-Shahrour
et al.,
2004,
2007)].
However, the vast majority of the genome consists of non-protein-coding (non-coding) sequences. Functional non-coding sequences may be associated with protein-coding sequences by either directly or indirectly regulating the expression of protein-coding genes, or playing structural roles in chromosome architecture or encoding RNA genes. In any case, annotation databases for non-coding elements are still in their infancy. In particular, there are at least two databases that store and openly share functional annotation of candidate gene regulatory elements in vertebrates, tested
in vivo in mice and zebrafish—Vista Enhancer Database (VED) (Pennacchio
et al.,
2006) and CONDOR (Woolfe
et al.,
2007). However, ~1000 elements profiled in these databases represent only a small fraction of gene regulatory elements in a vertebrate genome, which are expected to exceed the number of exons (~200 000) (Waterston
et al.,
2002). In practice, this precludes the application of these databases to the functional annotation of non-coding elements, which could be represented by a set of non-coding SNPs (Schwarz
et al.,
2008), a set of transcription factor binding sites from ChIP-chip experiments (Hu
et al.,
2007), or a set of candidate regulatory elements scattered across a vertebrate genome (Ovcharenko
et al.,
2005; Woolfe
et al.,
2005), for example. A sensible solution to this problem proposes to infer the function of a given non-coding element from that of the gene it belongs to or the closest neighboring gene; this strategy is especially well justified for promoter or UTR elements. However, the interpretation of the results is not always straightforward—promoter elements only represent a small component of the complex gene regulatory machinery, also constituted by distant intergenic and intronic elements (Machon
et al.,
2002; Nobrega
et al.,
2003), which do not necessarily regulate the gene they are inserted in or close to (Lettice
et al.,
2003; Santagati
et al.,
2003).
Uncertain association of putative regulatory elements and genes aside, gene annotation databases can be useful to characterize non-coding elements. While the association with the gene is often straightforward for promoters and UTR elements, intronic elements are commonly associated with the gene containing them and intergenic elements are usually associated with the nearest gene. After that, it is reasonable to infer the function of non-coding elements by examining the function of the corresponding set of genes. In a classical approach, the number of genes assigned to a given functional category in this set is compared with the number of genes assigned to that category in the entire genome, and deviations are evaluated according to a statistical test. The problem with this logic is the implicit assumption that the probability of sampling a particular annotation term is equal to the fraction of genes associated with it in the genome, which does not depend on the total number of non-coding elements a particular gene is associated with. Basically, a gene with many non-coding elements and a gene with zero non-coding elements are assumed to have equal probability of discovery through the analysis of their non-coding DNA space, which is obviously wrong and leads to a GO ascertainment bias. As a result, non-coding elements will be predicted in some loci of the genome more often than in others purely by chance, and any random subset of non-coding DNA may appear significantly enriched or depleted for some annotation terms, i.e. the above-mentioned strategy for an indirect functional analysis is biased due to the variable locus length. To correct for the GO ascertainment bias, within the context of functional inference on non-coding elements, the probability of a given annotation term should be set proportional to the fraction of the length of non-coding DNA assigned to it, which is strongly correlated with the length of the locus that contains it, and is highly variable across different loci (
Supplementary Fig. 1).
The aim of this work is to evaluate the effect of the variability in locus length on the functional analysis of non-coding DNA. We consider the total population of non-coding DNA elements in the human genome and the annotation terms attributed to their neighbor genes, and assess whether a set of non-coding DNA elements randomly sampled from the genome will appear artificially enriched and/or depleted for any annotation term. In our study, we report systematic false positive associations for a particular set of GO categories; the choice of the GO database is just exemplary. Finally, we propose a statistical method and a set of correction coefficients to perform an unbiased functional analysis for a set of non-coding elements in a genome.