|Home | About | Journals | Submit | Contact Us | Français|
Proximal promoters have a major impact on transcriptional regulation. Studies of the sequence-based nature of this regulation usually require collection of proximal promoter sequences for large sets of co-regulated genes. We report a newly implemented web service that facilitates extraction of user specified regions around the transcription start site of all annotated human, mouse or rat genes. The transcription start sites have been identified computationally by considering alignments of a large number of partial and full-length mRNA sequences to genomic DNA, with provision for alternative promoters. The service is publicly available at http://biowulf.bu.edu/zlab/PromoSer/.
Recent advances in biotechnology have opened the door for large scale and high throughput analysis approaches to the genomes of many organisms. Using microarray technology for example, patterns of similar expression profiles (conditionally or temporally) have been linked to shared regulatory mechanisms (1–5). The methods used to analyze and elicit these control mechanisms vary. Common approaches include searching for novel cis-elements using expectation maximization (6) or Gibbs sampling (7) algorithms. Alternatively, known motifs are used to look for significant clusters (8) of these elements or for recurring patterns that seem to correlate well with certain clusters of genes (9,10).
The common prerequisite for all computational analysis methods is the availability of promoter sequence data. It is well known that enhancer and suppressor control elements can exist at sites tens of thousands of bases upstream or even downstream of the transcription start site (TSS) (11). In many cases however, the essential control elements are present within the proximal promoter a few hundred to a couple of thousand bases upstream of the TSS (12–14).
The reliable identification of the TSS and the extraction of the proximal promoter is not a trivial task (15) and seems to be the bottleneck for many large scale projects that attempt to analyze microarray data, because it has not yet been automated to a satisfactory level. For example, if a researcher identifies a few hundreds of genes with similar expression patterns, it would require tedious script parsing to obtain their promoter sequences. Since most sequences on a microarray are not full length, one at least has to search for the corresponding entry in RefSeq, align the RefSeq entry to the genome and extract the upstream sequence. Due to the wide applications of microarrays, there have been substantial redundant efforts in such promoter finding activities, and sometimes the biologist may simply take the upstream sequence corresponding to the microarray entry, which may in fact correspond to a coding region.
The bioinformatics community's response to this problem takes three forms: one is to collect databases of experimentally verified TSSs as in the Eukaryotic Promoter Database (16). The data set is of high quality but is also relatively small due to the experimental verification requirement. Another approach is to generate and sequence libraries of full-length mRNA molecules using a variety of biochemical methods such as cap-trapping and oligo-capping (17–19). Large-scale projects have yielded rich sets of such full-length mRNA sequences (19–21). The availability of the draft assemblies of an increasing number of mammalian genomes, currently human (22), mouse (23) and rat (Rat Genome Sequencing Consortium), has allowed for yet a third approach where large sets of truncated as well as full-length mRNA sequences are clustered by sequence similarity and aligned to the genome. The TSS is identified as the furthest 5′ aligned position of the cluster (19,24).
Until now, however, the available resources have not been directly helpful to researchers looking for a large set of promoters. Possible reasons include: (i) the systems only provide the full-length mRNA sequence without localizing it to the genome; (ii) the dataset is too small and many genes sought are not included; and (iii) the user interface simply does not allow for batch jobs of many queries. In this study, we aim to provide such a resource. The service is publicly accessible at http://biowulf.bu.edu/zlab/PromoSer/. A typical application of our resource is to couple with the analysis of microarray results.
PromoSer is a freely accessible web-based service to facilitate the extraction of a large number of proximal promoter sequences. The user supplies a list of mRNA accession numbers (without version numbers) and selects the required sequence range around the TSS. Upon request execution, the user will receive a FASTA formatted text file of the sequences. If the indicated range overlaps with the transcribed region of the immediate upstream gene, the user will be notified and given the choice of retrieving only intergenic sequences. A genome assembly can have gaps. Gaps with known lengths are treated as regular nucleotides (marked with ‘N’). Gaps with unknown lengths are considered ‘breaks’ in the genome assembly; we will notify the user if such a gap occurs in the sequence range requested by the user and return the sequence only up to the gap. Currently, the user can query for any genomic (non-organelle) mRNA from the human (Homo sapiens), mouse (Mus musculus) and rat (Rattus norvegicus) genomes. The system utilizes the most recent genome assemblies of each organism and the mRNA set will be updated frequently to keep pace with an expanding GenBank mRNA collection. By 1 June 2003, we should have entire sets of 1, 2 and 5kb upstream sequences for most commonly used Affymetrix chips available for download at the PromoSer website.
To allow for fast interactive response times, alignments have been pre-computed and stored into a database. To construct the database, we first downloaded the most recent assemblies of the human (22) genome (14 November 2002), mouse (23) genome (February 2002) and rat genome (Rat Genome Sequencing Consortium, November 2002) from the UCSC (25) genome browser (http://genome.ucsc.edu/). These sequences had already been masked with RepeatMasker and Tandem Repeat Finder. During the compute-intensive alignment phase, masked regions were excluded from consideration as likely TSSs.
We then downloaded all available mRNA sequences for the three genomes. These include all available TSS flanking sequences from the Eukaryotic Promoter Database (16), all EST and non-EST mRNA sequences from GenBank (the dbEST and nr databases) and from RefSeq (26). We also downloaded the publicly available set of full-length cDNA sequences from RIKEN (20). In addition we downloaded the available DBTSS (19) extensions to the RefSeq sequences. The human mRNA dataset contains a large subset of full-length cDNA sequences deposited by the Institute of Medical Science, University of Tokyo, Japan.
Using a powerful cluster of 128 dual-processor compute nodes and the efficient BLAT tool (27), each of these >9000000 mRNA or EST sequences were aligned to their corresponding genomes and localized to specific chromosomal regions. BLAT is a local alignment tool, which means it occasionally can produce spurious high scoring short alignments; therefore, the alignments were then scored and filtered according to the following criteria:
Alignments that satisfied the filter criteria were scored based on match, mismatch and indel counts. Currently we only keep the best genomic alignment for each query mRNA or EST sequence. Table Table11 shows the number of sequences considered and the number of alignments retained after the filtering process. The percentage of aligned human sequences from EPD is much higher than those for mouse and rat, possibly reflecting the quality of genome assembly. A sharp reduction in the number of ESTs can be observed due to the exclusion of non-spliced ESTs.
All the sequences that hit the same genomic region in the same orientation and overlapped fully or partially were grouped into one cluster extending from the 5′ most genomic position to the 3′ most position. Sequences that shared a minimum of 80 bases of transcribed region were linked together producing a graph. We resolve all disconnected components of the graph, which represent independent groups of transcripts within this cluster. This manipulation is necessary to untangle interleaved transcripts and recover genes that are embedded within the introns of larger genes. Table Table22 shows the current number of clusters thus obtained. Clusters consisting purely of ESTs are considered of the lowest quality and assigned a quality level 1. Clusters that contain a single non-EST sequence are assigned quality level 2. Those that have >1 non-EST sequence but no full-length sequences are given a quality level 3; for this purpose, sequences presumed to be full length but had >10 bases truncated from their 5′ side were downgraded to ‘ordinary’ mRNA. All other clusters were assigned quality level 4. The TSS prediction is the 5′ most genomic position of each alignment within the cluster and upstream of the TSS of a full-length sequence if available. If multiple TSS positions >20bp apart were found, they were reported as alternative promoters. Except in quality 4 clusters, individual ESTs are not considered for alternative promoters and only the 5′ most position from all the ESTs in the cluster is considered a potential TSS.
All that information was pre-computed and stored in a highly indexed MySQL database. A web-based user interface allows users to submit queries using almost all available GenBank accession IDs (for the supported organisms and referencing an mRNA or EST sequence) to extract promoters of the required genes. Users may request up to 2000 sequences per operation and may specify a large range for the promoter region (10000 bases upstream of the TSS and 1000 bases downstream). In case of multiple promoters, the user has the choice of extracting all of them or only the one that corresponds to the 3′ most TSS or the 5′ most TSS (representing the most conservative and most aggressive degrees of extension, respectively). Alternatively, the user may choose to extract only the longest extension that is supported by the largest number of sequences in the cluster. If the requested region overlaps with another cluster on the same chromosome that is upstream of the cluster in consideration, the user may wish to ignore this fact or stop extraction at the boundary of the upstream cluster, which can be restricted to the same strand or be on either one.
There are a number of options in promoter extraction. We believe that the choice and information should be passed to the users so that they would have the freedom to decide on the course of action in case of ambiguity. This is in contrast with adopting certain solutions that would inevitably seem inappropriate for the purposes of one user or the other. In the result page, we display a summary table indicating various statistics of each extracted promoter region, e.g. the starting and ending coordinates of the predicted TSS, the quality and size of the cluster to which the promoter belongs, the number of supporting sequences and the extent of genomic extension (in base pairs).
PromoSer is an easy-to-use service for extracting high quality promoter sequences in a batch mode. Without PromoSer, the common approach has usually been to work with a small set of promoters and to extract them manually. RefSeq and LocusLink have often been the resource of choice for these situations, but it requires a fair bit of manual work to trace and extract the genomic region of the promoters. A drawback of this method has been well known but perhaps not fully appreciated. Many sequences in RefSeq are not full length; they are often truncated before the 5′ end of the actual transcript. It has previously been reported (19) that ~34% of RefSeq mRNA sequences could be extended toward the 5′ direction by 87 bases on average, using alignments with sequences obtained from a full-length mRNA sequence library.
Compared to earlier efforts, PromoSer has compiled a larger and more varied data set. This revealed a surprising sequel to the previous reports. Using PromoSer, we found that at least 63% of human RefSeq entries (58% and 17% for mouse and rat, respectively) could be further extended towards the 5′ end. The TSS position was shifted, on average, by a surprising 16981 bases upstream on the chromosome. The average number of sequences overlapping an extendable RefSeq sequence was 123. For the mouse genome, the shift averaged 6801 bases and the clusters contained on average 47 sequences. In the case of the rat genome, an average shift of 8880 bases upstream from that identified by aligning a RefSeq sequence alone was found, with on average 14 sequences overlapping an extendable RefSeq entry. Moreover, 17% of human RefSeq, 12% of mouse RefSeq and 4% of rat RefSeq sequences could be extended by at least 2000 bases on the chromosome. Figure Figure11 provides a histogram of the amount of genomic extension obtained by PromoSer for RefSeq sequences in each organism. It is worth noting, however, that the rat genome is still in an early draft stage and the figures reported throughout the paper are very likely to change as resources for the rat genome mature to the levels of the human and mouse genomes. Even without relying on any EST data, ~40% of human RefSeq and 46% of mouse RefSeq sequences can be extended towards the 5′ end by utilizing recent full-length mRNA data only.
There are a number of interesting examples in which utilizing clusters of alignments produces an unexpected large change in the TSS location prediction. In one extreme case, MMP26 (matrix metalloproteinase 26, LocusID: 56547) is a well-characterized gene. Figure Figure22 contains snapshots from the UCSC genome browser that illustrates this gene. The RefSeq entry for MMP26 (NM_021801) is curated and confirmed to be complete only at the 3′ end. PromoSer localized NM_021801 to a cluster on chromosome 11 that also contained four mRNA sequences and several spliced ESTs (Fig. (Fig.2A).2A). All these sequences aligned well with NM_021801 in its coding regions (Fig. (Fig.2B).2B). Two ESTs, BG189720 and BG198356, extended NM_021801 in the 5′ direction. BG189720 was spliced into five regions, the four regions in the 3′ end aligned well with NM_021801, and the 5′ most region was about 80 bases long and aligned at a position more than 220000 bases upstream of NM_021801. The other EST BG198356 is 97% identical to BE189720. Interestingly, BLAT aligned BG198356 to the genome in two pieces, even though it is 100% identical to BG189720 at the break point of these two pieces. Thus, the 5′ fragment is classified as an unspliced EST in the UCSC browser (a snapshot is shown in Fig. Fig.2C).2C). We note that BG189720 is a high quality sequence obtained using a random gene activation method (28) that can induce activation patterns not normally observed in the cell lines used. Therefore, it is likely that there is an alternative TSS more than 220000 bases upstream of the TSS defined using RefSeq. For sequences in this cluster, PromoSer reports two alternative promoters, one according to the RefSeq sequence NM_021801 and the other one defined by the 5′ end of BG189720.
In another example, Figure Figure33 shows a case where a RefSeq sequence (NM_033543) was extended by more than 26700 bases upstream using mRNA data, where the mRNA sequence AK023602 overlaps the RefSeq sequence completely and extends it in both the 5′ and 3′ directions. Several ESTs that overlap and cover the entire cluster also support this large 5′ extension.
Large-scale computational analysis of transcription regulation is a powerful and promising technology that should provide us with a better understanding of the nature of the intricate network of regulatory proteins that provide living cells with their remarkable properties. Biological results clearly show that the regulation mechanism is a multilevel high complexity system that involves physical, informational, proximal, distal, upstream and downstream cis-regulatory elements on the DNA, as well as protein/protein, protein/DNA and protein/small molecule interacting factors. Hopeless as it may look, computational analysis has nevertheless focused on the most obvious and likely the most revealing component of this system, that is, the proximal promoter. Defined operationally as the region immediately upstream of the transcription start site and extending for a few hundred to a couple thousand bases, this region still remains an elusive target for the exact characterization of its structure.
Much has been learned by grouping sets of promoter regions and comparing their sequences to look for shared motifs of short (5–20bp) cis-elements. Both grouping and analysis can each be done in a large number of ways. The data from which groups and clusters can be identified can also come from many sources, most notably and commonly from microarray experiments. Yet, a common element in all of these methods is the need for an accurate set of proximal promoter sequences to analyze. Until recently this task seemed like the bottleneck of the computational pipeline. To the best of our knowledge, resources for accurate large-scale promoter extraction did not exist publicly.
A program previously developed, PEG, was based on computational text analysis of GenBank mRNA records, tracing the annotations and iteratively stitching and composing the promoter using a complex algorithm (29). However, the program requires non-trivial installation on a local UNIX computer, which is not feasible for experimental biologists. A simpler approach (the EZ-Retrieve web server) is based on parsing NCBI UniGene, RefSeq and LocusLink entries for annotated TSSs (30). However, the TSSs for many genes are not annotated, and the mRNA TSS record annotations can be error prone and unreliable. It is known that mRNAs are commonly truncated before the 5′ end of the molecule. This truncation exists even in the highly curated RefSeq (19,31) set of reference sequence data. A java-based tool Toucan allows online genomic sequence retrieval from the Ensembl database (including orthologous sequences) and cis-element analysis (32). It assumes the most 5′ position of the annotated transcript to be the TSS and therefore is likely to suffer from similar shortcomings as the above two methods.
PromoSer has set out to fill this gap by providing a publicly accessible and easy to use service that relies on a large data set. By utilizing nearly all major public data sets of full-length cDNA sequence information, PromoSer achieves both coverage and accuracy. We have made a conscious decision not to include TSS annotations in existing databases or de novo promoter predictions due to their undetermined accuracy. PromoSer will continue to be updated frequently to utilize the most recent data sets. As more mammalian genomes come off the sequencing pipeline, they will also be incorporated. Future directions for PromoSer include the consideration of cross-species conservation and the ability to localize and identify the TSS for a user supplied set of sequences.
Although the backbone for PromoSer shares much with the UCSC Genome Browser, we decided against using the UCSC alignment information as provided for a number of reasons. Firstly, our dataset consists of a more heterogeneous collection than exists in the Genome Browser and is expanding. If we simply add on to the UCSC set, we risk conflicts due to non-uniform filtering criteria of the two methods that would be difficult to resolve. Secondly, we plan to keep the service up-to-date by frequent updating and incorporation of new datasets as they emerge. Thirdly, a separate server gives us finer control over the quality of alignments that are considered for inferring the TSS. Obviously, PromoSer and the Genome Browser serve different purposes and need not handle things exactly the same way.
The utility of accurately identified promoter sequences is illustrated by the recent work of Ohler et al. on the core promoters in the Drosophila genome. Several new regulatory motifs have been identified and the computational TSS prediction has also been improved with the more accurate training data (33). Likewise, PromoSer should prove useful in studying mammalian gene regulation, and aid the development of computational tools for promoter prediction (34–37).
We wish to thank Martin Frith, Peter Haverty and Joel Graber for thought provoking discussions and advice and the sharing of helpful resources. We also would like to thank the reviewers for valuable feedback. This work has been supported in part by NSF grants DBI-0078194 and MRI DBI-0116574 and NIH grant 1P20GM066401-01.