PromoSer is a freely accessible web-based service that facilitates the extraction of large numbers of proximal promoter sequences. The user supplies a list of mRNA accession numbers (without version numbers) and selects the required sequence range around the TSS. Upon request execution, the user receives a FASTA-formatted text file of the sequences. If the indicated range overlaps with the transcribed region of the immediate upstream gene, the user is notified and given the choice of retrieving only intergenic sequences. A genome assembly can have gaps. Gaps with known lengths are treated as regular nucleotides (marked with ‘N’). Gaps with unknown lengths are considered ‘breaks’ in the genome assembly; we notify the user if such a gap occurs in the requested sequence range and return the sequence only up to the gap. Currently, the user can query for any genomic (non-organelle) mRNA from the human (Homo sapiens), mouse (Mus musculus) and rat (Rattus norvegicus) genomes. The system utilizes the most recent genome assemblies of each organism and the mRNA set will be updated frequently to keep pace with an expanding GenBank mRNA collection. By 1 June 2003, we should have entire sets of 1, 2 and 5 kb upstream sequences for most commonly used Affymetrix chips available for download at the PromoSer website.
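The gap handling described above can be illustrated with a minimal Python sketch. It is not the PromoSer implementation: the function names, the 0-based half-open coordinates, the plus-strand-only logic and the representation of unknown-length gaps as a list of break positions are all assumptions made for illustration.

```python
def extract_region(chrom_seq, tss, upstream, downstream, breaks=()):
    """Return (sequence, truncated) for a plus-strand gene.

    Assumptions (hypothetical, for illustration only):
    - coordinates are 0-based and the window is half-open;
    - a break position b means an unknown-length gap sits just before index b;
    - known-length gaps are already present in chrom_seq as runs of 'N'.
    """
    start = max(0, tss - upstream)
    end = min(len(chrom_seq), tss + downstream)
    truncated = False
    for b in breaks:
        if start < b <= tss:        # break upstream of the TSS: keep only the TSS side
            start, truncated = b, True
        elif tss < b < end:         # break downstream of the TSS: stop before it
            end, truncated = b, True
    return chrom_seq[start:end], truncated


def to_fasta(header, seq, width=60):
    """Format one record as FASTA text with fixed-width sequence lines."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    chrom = "AAAACCCCGGGGTTTT"
    seq, truncated = extract_region(chrom, tss=12, upstream=8, downstream=2, breaks=(8,))
    print(to_fasta("example|truncated=%s" % truncated, seq), end="")
```

In this sketch the `truncated` flag stands in for the notification shown to the user when a break falls inside the requested range.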
To allow for fast interactive response times, alignments have been pre-computed and stored in a database. To construct the database, we first downloaded the most recent assemblies of the human (22) genome (14 November 2002), mouse (23) genome (February 2002) and rat genome (Rat Genome Sequencing Consortium, November 2002) from the UCSC (25) genome browser (http://genome.ucsc.edu/). These sequences had already been masked with RepeatMasker and Tandem Repeats Finder. During the compute-intensive alignment phase, masked regions were excluded from consideration as likely TSSs.
We then downloaded all available mRNA sequences for the three genomes. These include all available TSS flanking sequences from the Eukaryotic Promoter Database (16), all EST and non-EST mRNA sequences from GenBank (the dbEST and nr databases) and from RefSeq (26). We also downloaded the publicly available set of full-length cDNA sequences from RIKEN (20). In addition, we downloaded the available DBTSS (19) extensions to the RefSeq sequences. The human mRNA dataset contains a large subset of full-length cDNA sequences deposited by the Institute of Medical Science, University of Tokyo, Japan.
Using a powerful cluster of 128 dual-processor compute nodes and the efficient BLAT tool (27), each of these >9 000 mRNA or EST sequences was aligned to its corresponding genome and localized to a specific chromosomal region. BLAT is a local alignment tool, which means it can occasionally produce spurious high-scoring short alignments; therefore, the alignments were then scored and filtered according to the following criteria:
- EPD sequences (which are genomic) had to match at ≥95% identity over the length of the query sequence.
- All other sequences >250 bases had to have their full length aligned to the genome, minus ≤50 bp to allow for poly-A tail truncation. Sequences <250 bp had to align for ≥80% of their length.
- In addition to the length requirements, the alignments had to achieve a minimum match identity to the genomic region they aligned to, according to the sequence type: EST, >90%; ‘regular’ mRNA, >95%; full-length mRNA, >97%.
- Only spliced EST sequences were retained, to reduce the danger of genomic contamination of the EST library from which the sequence was obtained.
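The criteria above can be restated as a small predicate. The following Python sketch is an illustration under stated assumptions rather than the code used to build the database: the `Alignment` record, its field names and the sequence-type labels are hypothetical, identity for EPD sequences is assumed to be measured over the whole query, and sequences of exactly 250 bases (not covered by the stated rules) are arbitrarily treated with the short-sequence rule.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    kind: str            # hypothetical labels: "epd", "est", "mrna", "full_length"
    query_len: int       # length of the query sequence in bases
    aligned_len: int     # number of query bases covered by the alignment
    identity: float      # percent identity to the genomic region
    spliced: bool = True


# Minimum identity (exclusive) per sequence type, as listed in the text.
MIN_IDENTITY = {"est": 90.0, "mrna": 95.0, "full_length": 97.0}


def passes_filter(a: Alignment) -> bool:
    """Return True if the alignment satisfies the filtering criteria."""
    if a.kind == "epd":
        # Assumption: identity is computed over the full query length.
        return a.identity >= 95.0
    if a.kind == "est" and not a.spliced:
        return False                                  # unspliced ESTs are dropped
    if a.query_len > 250:
        long_enough = a.aligned_len >= a.query_len - 50   # poly-A allowance
    else:
        # Assumption: exactly 250 bases falls under the short-sequence rule.
        long_enough = a.aligned_len >= 0.8 * a.query_len
    return long_enough and a.identity > MIN_IDENTITY[a.kind]


if __name__ == "__main__":
    print(passes_filter(Alignment("est", 400, 380, 92.0)))
    print(passes_filter(Alignment("mrna", 1000, 940, 99.0)))
```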
Alignments that satisfied the filter criteria were scored based on match, mismatch and indel counts. Currently, we keep only the best genomic alignment for each query mRNA or EST sequence. Table shows the number of sequences considered and the number of alignments retained after the filtering process. The percentage of aligned human sequences from EPD is much higher than that for mouse and rat, possibly reflecting the quality of the genome assemblies. A sharp reduction in the number of ESTs can be observed due to the exclusion of non-spliced ESTs.
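As an illustration of keeping only the best alignment per query, consider the Python sketch below. The text does not give the scoring formula, so the linear score and its unit penalties are purely hypothetical placeholders, as are the dictionary keys used to describe an alignment.

```python
def score(matches, mismatches, indels, mismatch_penalty=1, indel_penalty=1):
    """Hypothetical linear score; the actual weights are not stated in the text."""
    return matches - mismatch_penalty * mismatches - indel_penalty * indels


def best_alignment_per_query(alignments):
    """Keep the highest-scoring alignment for each query accession.

    `alignments` is an iterable of dicts with the (assumed) keys
    "query", "chrom", "matches", "mismatches" and "indels".
    On ties, the alignment seen first is kept.
    """
    best = {}
    for aln in alignments:
        s = score(aln["matches"], aln["mismatches"], aln["indels"])
        q = aln["query"]
        if q not in best or s > best[q][0]:
            best[q] = (s, aln)
    return {q: aln for q, (_, aln) in best.items()}


if __name__ == "__main__":
    hits = [
        {"query": "Q1", "chrom": "chr1", "matches": 900, "mismatches": 10, "indels": 2},
        {"query": "Q1", "chrom": "chr7", "matches": 700, "mismatches": 40, "indels": 9},
    ]
    print(best_alignment_per_query(hits)["Q1"]["chrom"])
```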
Distribution of successfully aligned regions
All sequences that hit the same genomic region in the same orientation and overlapped fully or partially were grouped into one cluster extending from the 5′ most genomic position to the 3′ most position. Sequences that shared a minimum of 80 bases of transcribed region were linked together, producing a graph. We resolve all disconnected components of the graph, which represent independent groups of transcripts within this cluster. This manipulation is necessary to untangle interleaved transcripts and recover genes that are embedded within the introns of larger genes. Table shows the current number of clusters thus obtained. Clusters consisting purely of ESTs are considered of the lowest quality and assigned quality level 1. Clusters that contain a single non-EST sequence are assigned quality level 2. Those that have >1 non-EST sequence but no full-length sequences are given quality level 3; for this purpose, sequences presumed to be full length but with >10 bases truncated from their 5′ side were downgraded to ‘ordinary’ mRNA. All other clusters were assigned quality level 4. The TSS prediction is the 5′ most genomic position of each alignment within the cluster and upstream of the TSS of a full-length sequence if available. If multiple TSS positions >20 bp apart were found, they were reported as alternative promoters. Except in quality 4 clusters, individual ESTs are not considered for alternative promoters and only the 5′ most position from all the ESTs in the cluster is considered a potential TSS.
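The clustering steps in this paragraph can be sketched in Python as follows. This is a simplified illustration, not the pipeline's code: transcripts are assumed to be given as lists of half-open exon intervals on the plus strand, the union-find structure is one of several ways to obtain connected components, the sequence-type labels are hypothetical, and the way nearby TSS positions are merged (chaining positions that lie within 20 bp of the previous one) is an assumption, since the text does not specify it.

```python
def shared_bases(exons_a, exons_b):
    """Number of transcribed bases shared by two transcripts (exon overlap)."""
    return sum(max(0, min(e1, e2) - max(s1, s2))
               for s1, e1 in exons_a for s2, e2 in exons_b)


def components(transcripts, min_shared=80):
    """Group transcripts into connected components of the linking graph."""
    names = list(transcripts)
    parent = {n: n for n in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if shared_bases(transcripts[a], transcripts[b]) >= min_shared:
                parent[find(a)] = find(b)   # link the two groups
    groups = {}
    for n in names:
        groups.setdefault(find(n), set()).add(n)
    return list(groups.values())


def quality_level(kinds):
    """Quality level from member types ("est", "mrna", "full_length").

    Rules are applied in the order given in the text; truncated full-length
    sequences are assumed to have been relabelled "mrna" beforehand.
    """
    non_est = [k for k in kinds if k != "est"]
    if not non_est:
        return 1
    if len(non_est) == 1:
        return 2
    if "full_length" not in non_est:
        return 3
    return 4


def alternative_tss(positions, min_gap=20):
    """Report plus-strand TSS positions that lie >min_gap bp from the previous one."""
    reported, previous = [], None
    for pos in sorted(positions):
        if previous is None or pos - previous > min_gap:
            reported.append(pos)
        previous = pos
    return reported


if __name__ == "__main__":
    # "C" lies within an intron of "A" and shares no transcribed bases with it.
    t = {"A": [(0, 100), (500, 600)], "B": [(10, 150)], "C": [(200, 300)]}
    print(sorted(sorted(g) for g in components(t)))
    print(quality_level(["est", "mrna", "full_length"]))
    print(alternative_tss([100, 105, 118, 200, 215]))
```

In the example, transcript "C" is separated from "A" even though it falls inside the genomic span of "A", which mirrors the recovery of genes embedded within the introns of larger genes.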
Number of clusters after combining overlapping alignments in the same orientation
All of this information was pre-computed and stored in a highly indexed MySQL database. A web-based user interface allows users to submit queries using almost all available GenBank accession IDs (for the supported organisms and referencing an mRNA or EST sequence) to extract promoters of the required genes. Users may request up to 2000 sequences per operation and may specify a large range for the promoter region (up to 10 000 bases upstream of the TSS and 1000 bases downstream). In case of multiple promoters, the user has the choice of extracting all of them or only the one that corresponds to the 3′ most TSS or the 5′ most TSS (representing the most conservative and most aggressive degrees of extension, respectively). Alternatively, the user may choose to extract only the longest extension that is supported by the largest number of sequences in the cluster. If the requested region overlaps with another cluster on the same chromosome that is upstream of the cluster under consideration, the user may ignore this fact or stop extraction at the boundary of the upstream cluster, which can be restricted to the same strand or be on either strand.
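The extraction options can be made concrete with a short Python sketch. It is a hedged illustration only: the function names, the mode strings, the mapping from TSS position to number of supporting sequences and the plus-strand-only window calculation are assumptions, and the tie-breaking rule for the best-supported TSS (the 5′ most among equally supported positions) is one reading of "the longest extension that is supported by the largest number of sequences".

```python
MAX_UPSTREAM = 10_000    # limits stated in the text
MAX_DOWNSTREAM = 1_000


def choose_tss(candidates, mode, strand="+"):
    """Select TSS positions from {position: number of supporting sequences}.

    Hypothetical modes: "all", "5prime", "3prime", "best_supported".
    """
    positions = sorted(candidates)
    if mode == "all":
        return positions
    # On the minus strand the 5' most position has the largest coordinate.
    five, three = (positions[0], positions[-1]) if strand == "+" \
        else (positions[-1], positions[0])
    if mode == "5prime":
        return [five]
    if mode == "3prime":
        return [three]
    if mode == "best_supported":
        top = max(candidates.values())
        tied = [p for p in positions if candidates[p] == top]
        return [tied[0] if strand == "+" else tied[-1]]   # 5' most among ties
    raise ValueError("unknown mode: %r" % mode)


def promoter_window(tss, upstream, downstream, upstream_cluster_end=None):
    """Plus-strand window [start, end), optionally clipped at an upstream cluster."""
    if upstream > MAX_UPSTREAM or downstream > MAX_DOWNSTREAM:
        raise ValueError("requested range exceeds the allowed limits")
    start = max(0, tss - upstream)
    end = tss + downstream
    if upstream_cluster_end is not None and upstream_cluster_end > start:
        start = min(upstream_cluster_end, tss)   # stop at the upstream cluster boundary
    return start, end


if __name__ == "__main__":
    tss_support = {100: 2, 150: 5, 300: 5}
    print(choose_tss(tss_support, "best_supported"))
    print(promoter_window(5000, 1000, 100, upstream_cluster_end=4600))
```

Passing `upstream_cluster_end=None` corresponds to ignoring the overlap, while supplying the boundary of the nearest upstream cluster (on the same strand or on either strand, as the user prefers) restricts the result to the intergenic portion.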
There are a number of options in promoter extraction. We believe that the choice and information should be passed to the users so that they would have the freedom to decide on the course of action in case of ambiguity. This is in contrast with adopting certain solutions that would inevitably seem inappropriate for the purposes of one user or the other. In the result page, we display a summary table indicating various statistics of each extracted promoter region, e.g. the starting and ending coordinates of the predicted TSS, the quality and size of the cluster to which the promoter belongs, the number of supporting sequences and the extent of genomic extension (in base pairs).