Accurate transcription initiation of eukaryotic genes requires a DNA region, referred to as the core promoter, which includes the transcription start site (TSS) and immediately flanking sequences. Most eukaryotic genes, including all protein-coding genes, are transcribed by RNA polymerase II (RNAPII) and are referred to as class II genes. Class II core promoters generally extend from ~40 base pairs (bp) upstream (-40) to ~40 bp downstream (+40) of the TSS (+1) and contain different combinations of various functional DNA motifs referred to as core promoter elements. These core promoter elements direct the recruitment and assembly of the class II basal/general transcription factors (TFIIA, TFIIB, TFIID, TFIIE, TFIIF and TFIIH) and RNAPII into a functional pre-initiation complex (PIC) at the TSS and thus determine the intrinsic “basal” (i.e., unregulated) transcription activity of the core promoter (
Roeder, 1998). The specific core promoter sequence also influences the transcriptional response of a given gene to particular enhancers and gene-specific transcription regulators. Thus, the information contained within the DNA sequence of the core promoter is critical for the proper regulation of gene-selective transcription in eukaryotes albeit via mechanisms that remain poorly understood (
Smale and Kadonaga, 2003).
Class II core promoter elements have been best characterized in metazoan genes. They include: (i) the TATA box located at -30 relative to the TSS (+1), which is directly bound by the TATA-binding protein (TBP) subunit of the TFIID complex; (ii) the initiator (INR) element located at, or immediately adjacent to, the TSS, which is recognized by the TBP-associated factors TAF1 and TAF2 of the TFIID complex; (iii) the TFIIB recognition element (BRE) immediately flanking the TATA box and directly bound by TFIIB; and (iv) the downstream promoter element (DPE) centered at +30 downstream of the TSS, which is recognized by the TFIID subunits TAF6 and TAF9 (see for consensus sequences). Additional, less extensively characterized core promoter sequences downstream of the TSS of specific viral and metazoan genes have been reported to influence core promoter activity and also appear to be recognized by TFIID/TAFs (
Smale and Kadonaga, 2003;
Lim et al., 2004;
Deng and Roberts, 2005;
Lewis et al., 2005;
Lee et al., 2005; and references therein). Notably, none of the core promoter elements identified thus far is ubiquitous or universally required for transcription.
In yeast, core promoters remain poorly characterized and, except for the TATA box (which in S. cerevisiae is generally located between -40 and -120 relative to the TSS), the other metazoan core promoter elements are generally thought to be absent. Nevertheless, specific DNA sequences have long been known to determine the position of the TSS in a small number of yeast genes. These include the purine (R)-rich consensus sequence RRYRR, where the central pyrimidine (Y) corresponds to the initiation site, and the consensus sequence TCRA, where C and/or R is the initiation site (
Chen and Struhl, 1985;
Hahn et al., 1985; Mosch et al., 1992;
Hampsey, 1998;
Smale and Kadonaga, 2003). More recently, an extended A-rich consensus sequence A(A-rich)₅NYA(A/T)NN(A-rich)₆ has been derived from a 5’-SAGE analysis of TSSs in 2231 yeast genes (Zhang and Dietrich, 2005). Although these sequences have been referred to as yeast “initiators”, their incidence in core promoters genome-wide is unclear and they are thought to function differently from metazoan INR elements (reviewed in
Smale and Kadonaga, 2003). Here we use the term initiator (INR) to refer exclusively to the mammalian consensus INR sequence YYANWYY. Significantly, as in higher eukaryotes, yeast TAFs are important for core promoter-dependent transcription regulation, suggesting the existence of still unidentified TAF-dependent core promoter motifs in yeast (
Green, 2000).
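The consensus sequences quoted above (YYANWYY, RRYRR, TCRA) are written in the standard IUPAC nucleotide code. As an illustration only, and not a description of the procedure used in this study, the following minimal Python sketch shows how such a consensus can be expanded into a regular expression and located in a sequence. The function names and the example sequence are hypothetical.

```python
import re

# Standard IUPAC nucleotide codes needed for the consensus sequences in the text.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}


def consensus_to_regex(consensus):
    """Expand an IUPAC consensus into a compiled regular expression.

    The pattern is wrapped in a lookahead so that overlapping
    occurrences are all reported by finditer().
    """
    body = "".join(IUPAC[base] for base in consensus.upper())
    return re.compile("(?=(" + body + "))")


def find_motif(consensus, sequence):
    """Return (0-based start, matched text) for every occurrence."""
    pattern = consensus_to_regex(consensus)
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence.upper())]


# Hypothetical sequence, for illustration only.
example = "GGCTCAGTCTTCC"
print(find_motif("YYANWYY", example))  # mammalian INR consensus
print(find_motif("RRYRR", example))    # yeast purine-rich consensus
```

Because the search is purely sequence-based, a hit says nothing about function; whether a match lies at or near an annotated TSS has to be assessed separately.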
The binding of TBP/TFIID to the core promoter is a critical step in stable PIC assembly. Accordingly, a widely accepted model for PIC assembly at class II promoters posits that direct binding of TBP to the -30 region is an essential (and generally the first) step, which nucleates the recruitment of the other general/basal factors and ultimately RNAPII. In this model, TBP-DNA interactions are essential for PIC assembly whether or not a TATA box is present; this is supported by the fact that TBP binds specifically and functionally to a wide variety of DNA sequences that diverge significantly from the canonical TATA box sequence TATAAA (
Hahn et al., 1989;
Singer et al., 1990;
Wobbe and Struhl, 1990;
Wiley et al., 1992;
Zenzie-Gregory et al., 1993;
Aso et al., 1994;
Kraus et al., 1996;
Weis and Reinberg, 1997;
Patikoglou et al., 1999). The binding of TBP to various TATA sequences induces a dramatic DNA bend (
Patikoglou et al., 1999) and is stabilized by cooperative interactions with TFIIB and TFIIA, which contact flanking DNA, and with TAFs, which interact with the INR and other downstream core promoter elements (reviewed in
Hahn, 2004).
The universality of this model, however, and more specifically the essential role of TBP-DNA interactions in PIC assembly, has been challenged by in vitro transcription experiments in mammalian systems indicating that the TATA-binding activity of TBP is dispensable, while TAFs are essential for basal and activated transcription from an INR-dependent “TATA-less” core promoter (
Martinez et al., 1994;
1995). Furthermore, basal transcription from both mammalian and
Drosophila TATA-less core promoters requires additional cofactors distinct from TAFs/TFIID and the general transcription factors (
Martinez et al., 1998;
Willy et al., 2000). Thus, TATA-less core promoters that lack AT-rich sequences in the -30 region and do not stably bind TBP are likely to assemble PICs via alternative pathways and to be regulated by distinct mechanisms (
Smale and Kadonaga, 2003). However, the number of such bona fide TATA-less genes in eukaryotic genomes remains unclear.
Computational analyses of 205 experimentally-defined core promoters and 1941 putative core promoter regions in
Drosophila indicated that about 43% and 33.9%, respectively, contained the TATA box consensus sequence TATAAA or a sequence matching 5 out of the 6 consensus nucleotides, while about 67% contained the Drosophila INR element (Kutach and Kadonaga, 2000; Ohler et al., 2002). This suggested for the first time that TATA elements are relatively infrequent, and INR-containing promoters relatively abundant, in the
Drosophila genome. Interestingly, recent analyses of
Saccharomyces genomes also revealed that the canonical TATA box (TATAWAWR) is present in only ~20% of yeast genes (Basehoar et al., 2004), suggesting that specific transcription initiation at most yeast promoters might rely on other, as yet unidentified, core promoter elements.
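The criterion "TATAAA or a sequence matching 5 out of the 6 consensus nucleotides" amounts to allowing at most one mismatch (a Hamming distance of 1 or less) to the consensus. The Python sketch below illustrates that criterion and how a fraction of TATA-containing promoters could be tallied from it. It is an assumption-laden illustration rather than the procedure of the cited studies: the function names are hypothetical, and no positional restriction (such as a window around -30) is applied.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def near_matches(sequence, consensus="TATAAA", max_mismatches=1):
    """Return (0-based start, window, mismatches) for each window of the
    sequence that differs from the consensus at no more than
    max_mismatches positions."""
    sequence = sequence.upper()
    k = len(consensus)
    hits = []
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        distance = hamming(window, consensus)
        if distance <= max_mismatches:
            hits.append((i, window, distance))
    return hits


def fraction_with_motif(promoters, consensus="TATAAA", max_mismatches=1):
    """Fraction of promoter sequences with at least one (near-)match."""
    if not promoters:
        raise ValueError("no promoter sequences supplied")
    positive = sum(
        1 for seq in promoters if near_matches(seq, consensus, max_mismatches)
    )
    return positive / len(promoters)


# Hypothetical sequences, for illustration only.
print(near_matches("GGTATATAGG"))
print(fraction_with_motif(["CCTATAAACC", "GGGGCCCCGG"]))
```

Relaxing the match from 6/6 to 5/6 increases the number of chance hits in AT-rich DNA, so a frequency computed this way depends strongly on the stringency and on the region searched.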
Similar studies of the frequency of the TATA box in human promoters have yielded apparently conflicting results. Analyses of ~1800 experimentally characterized promoters in the eukaryotic promoter database (EPD) indicated that 11.6% (Bajic et al., 2003), 21.8% (
Gershenzon and Ioshikhes, 2005), and 76% (
Trinklein et al., 2003) have a TATA box. However, the EPD is relatively small and appears “enriched” in TATA-containing core promoters that are more amenable to experimental TSS mapping techniques (discussed in
Gershenzon and Ioshikhes, 2005). Indeed, analyses of larger databases, including the database of transcription start sites (DBTSS,
Suzuki et al., 2001a,
2004), obtained by aligning the 5’ ends of full-length cDNAs to the human genome sequence, revealed a smaller proportion of TATA-containing genes, although the actual percentages still varied greatly between studies: 2.6% (
Fitzgerald et al., 2004), 10.4% (
Gershenzon and Ioshikhes, 2005), 11-17% (
Kimura et al., 2006;
Jin et al., 2006), and up to 64% (
Trinklein et al., 2003). Estimates of the frequency of other core promoter elements in human genes also varied between studies. One study found no preferential positioning for the consensus INR and DPE sequences in 13,010 human promoters (
Fitzgerald et al., 2004), while two more recent studies found 49-63% INR-containing promoters, 22-25% BRE-containing promoters, and 12-25% DPE-containing promoters (
Gershenzon and Ioshikhes, 2005;
Jin et al., 2006).
Here, in an attempt to address some of the ambiguities noted above, we performed genome-scale computational analyses of human core promoters in the UCSC GoldenPath (15,685 genes) and DBTSS (10,271 genes) databases and compared the annotated biological functions of human genes with different core promoter structures. In contrast to previous analyses, we searched human core promoters both with the canonical 8-mer TATA consensus sequence (TATAWAWR) and with a list of 532 different 8-mer TATA-like elements that fit the structural definition of the TATA-TBP interface (
Patikoglou et al., 1999) and most often occur at -30 in human core promoters. Our results are consistent with and extend previous observations on the frequency of TATA elements in human core promoters and further identify DNA motifs selectively enriched in TATA-less core promoters. In addition, we show that human genes with distinct core promoter structures, as defined by the presence or absence of TATA and/or INR elements, tend to control different biological processes. Unexpectedly, elements matching the metazoan consensus INR sequence also cluster specifically in the TSS region of many yeast genes, suggesting that specific transcription initiation at a large fraction of eukaryotic promoters, from yeast to human, might involve similar INR elements and thus might be more conserved than previously thought.
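To make the notion of "core promoter structure" concrete, the Python sketch below classifies a single promoter by the presence or absence of a TATAWAWR match near -30 and a YYANWYY match near the TSS. It is a simplified illustration and not the pipeline used in this work. The search windows (-40 to -20 for the TATA box, -5 to +3 for the INR), the function names, and the test sequence are all assumptions chosen for the example, and the list of 532 TATA-like 8-mers is not reproduced.

```python
import re

# Canonical consensus sequences from the text, expanded from IUPAC codes.
TATA = re.compile(r"(?=TATA[AT]A[AT][AG])")          # TATAWAWR
INR = re.compile(r"(?=[CT][CT]A[ACGT][AT][CT][CT])")  # YYANWYY


def to_index(position, tss_index):
    """Convert a promoter coordinate to a 0-based string index.

    Promoter coordinates have no position 0: -1 is immediately
    followed by +1, and +1 is the base at tss_index.
    """
    if position == 0:
        raise ValueError("promoter coordinates have no position 0")
    return tss_index + position if position < 0 else tss_index + position - 1


def has_motif(pattern, sequence, tss_index, first, last):
    """True if a match starts at a promoter coordinate in [first, last]."""
    lo = max(to_index(first, tss_index), 0)
    hi = to_index(last, tss_index)
    return any(lo <= m.start() <= hi for m in pattern.finditer(sequence.upper()))


def classify(sequence, tss_index, tata_window=(-40, -20), inr_window=(-5, 3)):
    """Label a promoter as TATA+/- and INR+/- (windows are assumptions)."""
    tata = has_motif(TATA, sequence, tss_index, *tata_window)
    inr = has_motif(INR, sequence, tss_index, *inr_window)
    return ("TATA+" if tata else "TATA-") + "/" + ("INR+" if inr else "INR-")


# Hypothetical promoter: TATAAAAG starting at -31 and TCAGTCT with its A at +1.
promoter = "G" * 19 + "TATAAAAG" + "G" * 21 + "TCAGTCT" + "G" * 20
print(classify(promoter, tss_index=50))
```

The resulting four classes (TATA+/INR+, TATA+/INR-, TATA-/INR+, TATA-/INR-) can then be compared with respect to gene annotation; the width of each window trades sensitivity against the rate of chance matches.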