In promoters, we found 174 highly conserved motifs. Among these, some could be immediately recognized as known regulatory elements. For example, the three strongest motifs correspond to binding sites for the oncogene regulator ELK-1, the cell-cycle regulator Myc and the mitochondrial respiratory chain regulator NRF-1. Overall, 59 discovered motifs showed strong matches to known motifs and 10 showed weaker matches (see Supplementary Information), together accounting for 72% of the 123 previously known motifs that we assembled from the TRANSFAC database (Supplementary Information). (By contrast, a comparable collection of random motifs would match only ~5% of the TRANSFAC motifs; see Supplementary Information.)
The remaining 105 discovered motifs represent potentially new regulatory elements. The list includes numerous notable examples, some with conservation scores much higher than for most known motifs. For example, the newly discovered motif M4 (ACTAYRNNNCCCR) occurs 520 times in human promoter regions, of which 317 (61%) are conserved, and the new near-palindromic motif M8 (TMTCGCGANR) occurs 368 times, of which 236 (64%) are conserved.
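The occurrence and conservation counts quoted above can be illustrated with a minimal Python sketch. It assumes that the consensus strings use the standard IUPAC degenerate nucleotide codes and that only one strand is scanned; the function names and the toy sequence are hypothetical, and this is not the discovery pipeline used in the study.

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def consensus_to_regex(consensus):
    """Compile a degenerate consensus into a regex.

    The zero-width lookahead lets overlapping occurrences be reported.
    """
    body = "".join(IUPAC[base] for base in consensus.upper())
    return re.compile("(?=(" + body + "))")


def find_occurrences(consensus, sequence):
    """Return 0-based start positions of all matches on the given strand.

    Assumption: reverse-strand matches are ignored in this sketch.
    """
    pattern = consensus_to_regex(consensus)
    return [m.start() for m in pattern.finditer(sequence.upper())]


def conservation_rate(n_conserved, n_total):
    """Fraction of motif occurrences that are conserved."""
    return n_conserved / n_total


# Toy promoter fragment (hypothetical) scanned for M8.
hits = find_occurrences("TMTCGCGANR", "GGTCTCGCGAGAGG")

# Conservation rates recomputed from the counts quoted in the text.
m4_rate = conservation_rate(317, 520)
m8_rate = conservation_rate(236, 368)
```

Recomputing the rates from the quoted counts gives roughly 0.61 for M4 and 0.64 for M8, in agreement with the percentages in the text.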
In the absence of specific information about the role of these putative motifs, we used two approaches to demonstrate that most of the new motifs are likely to be biologically meaningful. First, we correlated the presence of motifs with the tissue specificity of gene expression. We reasoned that the genes controlled by a common regulator would often (although not always) show enriched expression in specific sets of tissues. For each motif, we defined the set S1 of genes with conserved occurrences of the motif and an equal-sized control set S2 of genes in which the motif occurs in the human genome but is not conserved (see Methods). Using gene expression data from 75 human tissues (ref. 24), significant enrichment in one or more tissues (z-score > 4.0) was seen for 59 of the 69 (86%) known motifs and 53 of the 105 (50%) new motifs (Fig. 2). In contrast, the control sets show little or no enrichment across the same tissues (Supplementary Fig. S2). For example, the best-conserved new motif, M4, shows enrichment in haematopoietic cells, motif M27 (consensus ending in KCCAR) is enriched in trachea and lung, and motif M29 (CTTTAAR) is enriched in brain-related tissues such as pons, parietal lobe and cingulate cortex.
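The enrichment statistic itself is defined in Methods and is not reproduced here. Purely as an illustration of how a z-score threshold of 4.0 might be applied, the following sketch assumes a binomial null model in which each gene in a set is expressed in a given tissue with a background probability p. The function names, the background fractions and the toy counts are all hypothetical.

```python
import math


def enrichment_z(k, n, p):
    """z-score of observing k tissue-expressed genes in a set of n genes.

    Assumption: a binomial null with background probability p, using the
    normal approximation (mean n*p, variance n*p*(1-p)).
    """
    expected = n * p
    sd = math.sqrt(n * p * (1 - p))
    return (k - expected) / sd


def enriched_tissues(counts, n, background, threshold=4.0):
    """Return the tissues whose z-score exceeds the threshold.

    counts: tissue -> number of genes in the set expressed in that tissue
    background: tissue -> background expression fraction (hypothetical)
    """
    return [
        tissue
        for tissue, k in counts.items()
        if enrichment_z(k, n, background[tissue]) > threshold
    ]


# Toy example (made-up numbers): a set of 100 genes and two tissues.
toy_counts = {"lung": 30, "brain": 12}
toy_background = {"lung": 0.1, "brain": 0.1}
significant = enriched_tissues(toy_counts, 100, toy_background)
```

Under this assumed model, the same function would be applied to both the conserved set S1 and the control set S2, so that enrichment seen only in S1 can be attributed to conservation rather than to motif presence alone.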
Top 50 of 174 discovered motifs in human promoters
Figure 2 Tissue specificity of expression for genes containing discovered motifs. For each of the 174 motifs, we defined the set of genes whose promoters contain conserved occurrences of the motif, and tested for enriched expression in 75 human tissues.
Second, we examined positional bias of the motifs relative to the TSS. Although the analysis covered a 4-kb region surrounding the TSS, the discovered motifs preferentially occur in the human genome within ~100 bases of the TSS, and their conserved occurrences across all four species show an even more marked enrichment in this region (Fig. 3), consistent with the motifs being involved in transcription initiation. Within this overall trend, certain motifs show distinctive positional preferences. Approximately 28% of the known motifs and 35% of the new motifs show significant positional preference (see Methods). For example, the two strongest new motifs (M4 and M8) tend to occur at distances centred around −89 and −62 bp upstream of the TSS, respectively (Fig. 3). Overall, 89% of the known motifs and 69% of the new motifs show tissue specificity, positional bias or both. Taken together, these results strongly suggest that the new promoter motifs are likely to be biologically meaningful.
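The positional analysis described above can be sketched in a few lines of Python. The sketch assumes that motif hits and the TSS are given as genomic coordinates, that negative offsets denote positions upstream of the TSS, and that a ±100-bp window is of interest; the function names and toy coordinates are hypothetical, and the significance test used in Methods is not reproduced.

```python
from statistics import median


def relative_positions(hit_coords, tss, strand="+"):
    """Signed offsets of motif hits from the TSS (negative = upstream).

    Assumption: for genes on the minus strand, larger genomic
    coordinates lie upstream, so the sign is flipped.
    """
    sign = 1 if strand == "+" else -1
    return [sign * (coord - tss) for coord in hit_coords]


def fraction_within(offsets, window=100):
    """Fraction of hits lying within `window` bases of the TSS."""
    return sum(1 for o in offsets if abs(o) <= window) / len(offsets)


# Toy example (made-up coordinates): TSS at 1000 on the plus strand.
offsets = relative_positions([911, 938, 2500], tss=1000)
near_tss = fraction_within(offsets)
centre = median(offsets)
```

A histogram of such offsets, pooled over all promoters containing a motif, is one simple way to visualize the kind of peak reported for M4 and M8.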
Figure 3 Discovered promoter motifs show positional bias with respect to transcriptional start site (TSS). a, Distribution of distance from TSS for all occurrences in human genome peaks within 100 bp before TSS. b, Distribution for conserved occurrences.
In addition, several of the motifs tend to appear in multiple copies in the promoter. For example, 17.5% of genes containing motif M4 have more than one copy of M4 within 200 bp of each other (27-fold increase compared with 0.66% expected at random), and 10% of genes containing motif M8 have multiple copies of this motif within 200 bp (compared with 0.26% expected by chance). Such clustering is a common feature of several known regulatory motifs.
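The clustering figures can be checked with a short sketch. It assumes that motif positions within one promoter are given as integer coordinates and that two copies count as clustered when their start positions are at most 200 bp apart; the function names are hypothetical, and the random expectation (0.66% and 0.26%) is taken from the text rather than recomputed.

```python
def has_cluster(positions, max_gap=200):
    """True if any two motif copies lie within max_gap bp of each other.

    After sorting, it is enough to compare neighbouring positions.
    """
    ordered = sorted(positions)
    return any(b - a <= max_gap for a, b in zip(ordered, ordered[1:]))


def fold_enrichment(observed_pct, expected_pct):
    """Ratio of the observed percentage to the percentage expected at random."""
    return observed_pct / expected_pct


# Fold increases recomputed from the percentages quoted in the text.
m4_fold = fold_enrichment(17.5, 0.66)
m8_fold = fold_enrichment(10.0, 0.26)
```

With the quoted percentages, the ratio for M4 is about 27, matching the text, and the corresponding ratio for M8 is about 38.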