The total collection of records that PromoSer handles is over 10 million. To efficiently map these onto their corresponding genomes, BLAT (7) was used on a 256-processor Linux cluster. To improve sensitivity, the standalone version of BLAT was used to compare all sequences against each chromosome. The mapping results of every sequence were compared to determine the best alignment, or in some cases where several mappings were nearly equally good, the best alignment and those within 1% of its score were retained.
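The retention rule described above can be illustrated with a minimal sketch. The function name, the `(target, score)` tuple layout and the interpretation of "within 1%" as a relative cutoff on the best score are assumptions made for illustration; the scoring scheme itself is not reproduced here.

```python
def select_alignments(hits, tolerance=0.01):
    """Keep the best-scoring hit plus any hit within `tolerance` of its score.

    hits: list of (target, score) tuples for one query sequence
          (hypothetical layout, not the BLAT output format).
    """
    if not hits:
        return []
    best = max(score for _, score in hits)
    # Assumption: "within 1%" is taken relative to the best score.
    cutoff = best * (1.0 - tolerance)
    kept = [hit for hit in hits if hit[1] >= cutoff]
    return sorted(kept, key=lambda hit: hit[1], reverse=True)


hits = [("chr1", 1000), ("chr5", 995), ("chr9", 900)]
print(select_alignments(hits))
```

In this example the two near-equal mappings are retained and the clearly weaker one is dropped.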
Furthermore, transcript sequences are canonically supposed to be 5′ to 3′ oriented. This rule is broken in about 50% of the EST records, and the annotation of the direction is notoriously unreliable. For reliable orientation identification, which is essential for the clustering, the alignments were checked against the genome to determine the intron orientation based on the GT/AG rule. For unspliced sequences, the original sequence was examined for the presence of a polyA tail or, if none was found, a polyA signal. The location of the tail or the signal was used to infer the transcript orientation. The tail length was subtracted to determine a more accurate percentage identity score of an alignment (which is only a part of the scoring scheme used for optimal alignment selection).
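The two orientation checks can be sketched as follows. This is an illustration only: the function names, the minimum tail length of 10 bases, the 50-base search window and the use of the AATAAA hexamer as the polyA signal are assumptions, not parameters taken from PromoSer.

```python
import re


def orientation_from_introns(introns):
    """Infer the genomic strand of a spliced alignment from the GT/AG rule.

    introns: list of (first_two_bases, last_two_bases) read from the
             genomic plus strand. A minus-strand GT...AG intron appears
             as CT...AC when read on the plus strand.
    """
    forward = sum(1 for pair in introns if pair == ("GT", "AG"))
    reverse = sum(1 for pair in introns if pair == ("CT", "AC"))
    if forward > reverse:
        return "+"
    if reverse > forward:
        return "-"
    return None  # no introns or conflicting evidence


def orientation_from_polya(seq, min_tail=10, window=50):
    """Infer whether an unspliced record is stored sense or antisense.

    min_tail and window are hypothetical thresholds.
    """
    seq = seq.upper()
    # A polyA tail at the 3' end means the record is stored 5' to 3'.
    if re.search(r"A{%d,}$" % min_tail, seq):
        return "sense"
    # A leading polyT run is the reverse complement of a polyA tail.
    if re.search(r"^T{%d,}" % min_tail, seq):
        return "antisense"
    # Fall back to the polyA signal (assumed here to be AATAAA).
    if "AATAAA" in seq[-window:]:
        return "sense"
    if "TTTATT" in seq[:window]:
        return "antisense"
    return None
```

The tail check is tried before the signal check, mirroring the order given in the text.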
In the initial design of PromoSer, alignments were subjected to strict criteria in order to be used. This resulted in the exclusion of a significant fraction of sequences and the inability to locate their promoters. To improve performance and retain accuracy, the filtering process was performed in three stages. First, a low-stringency (80% identity over at least 100 bases) initial filter was applied to select alignments used in the clustering. Once clustered, a second, stringent filter (90% identity for EST and 95% otherwise, and no more than 5 unaligned bases at the start of the sequence) was used to select alignments that may be used to predict the TSS. Finally, a third pass heuristically assigns the alignments that did not pass the initial filter to the smallest cluster that fully overlaps that alignment. Those guessed alignments are marked and reported as being a guess when the user searches for them. With that scheme, PromoSer coverage increases to about 90% of the attempted sequences instead of the 28% reported previously, with improvement mainly in the EST category.
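The three filtering stages can be summarized in a short sketch using the thresholds quoted above. The `Alignment` record and its field names are hypothetical, and "smallest cluster" is assumed here to mean the cluster with the shortest genomic span.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    name: str
    kind: str            # "EST", "mRNA" or "RefSeq"
    start: int           # genomic start of the alignment
    end: int             # genomic end of the alignment
    identity: float      # percentage identity
    aligned_bases: int   # number of aligned bases
    unaligned_5p: int    # unaligned bases at the start of the sequence


def passes_initial(a):
    """Stage 1: low-stringency filter used to admit alignments to clustering."""
    return a.identity >= 80.0 and a.aligned_bases >= 100


def passes_stringent(a):
    """Stage 2: stringent filter for alignments that may predict a TSS."""
    required = 90.0 if a.kind == "EST" else 95.0
    return a.identity >= required and a.unaligned_5p <= 5


def rescue_cluster(a, cluster_spans):
    """Stage 3: index of the smallest cluster span fully containing `a`.

    cluster_spans: list of (start, end) tuples. Returns None if no
    cluster fully overlaps the alignment.
    """
    containing = [
        (end - start, i)
        for i, (start, end) in enumerate(cluster_spans)
        if start <= a.start and a.end <= end
    ]
    return min(containing)[1] if containing else None
```

An alignment placed by `rescue_cluster` would additionally be flagged as a guess, as described in the text.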
To cluster sequences, all alignments that overlap a genomic region and are transcribed in the same orientation are collected. They are then separated into sub-clusters based on the sharing of transcribed regions. Finally, each sub-cluster is examined to determine if it contains independent subcomponents that are connected through a single EST. If so, the sub-cluster is broken up into its subcomponents. Such ESTs have been observed when the clusters were visualized using the cluster viewer (described below) and they could be artifacts of the EST library.
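One way to picture the sub-clustering step is as a graph problem: alignments are nodes, two alignments are linked when they share transcribed sequence, sub-clusters are connected components, and a bridging EST is one whose removal disconnects its component. The sketch below follows that reading; it is an assumed formulation, not the PromoSer implementation, and it presumes that all inputs already lie in the same region and orientation.

```python
from itertools import combinations


def share_exon(exons_a, exons_b):
    """True if any aligned block of one transcript overlaps one of the other."""
    return any(s1 < e2 and s2 < e1 for s1, e1 in exons_a for s2, e2 in exons_b)


def components(nodes, edges):
    """Connected components of an undirected graph (iterative DFS)."""
    seen, result = set(), []
    for node in nodes:
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:
            current = stack.pop()
            if current in comp:
                continue
            comp.add(current)
            stack.extend(edges[current] - comp)
        seen |= comp
        result.append(comp)
    return result


def sub_clusters(exons_by_name):
    """Group alignments into sub-clusters by shared transcribed regions."""
    names = list(exons_by_name)
    edges = {name: set() for name in names}
    for a, b in combinations(names, 2):
        if share_exon(exons_by_name[a], exons_by_name[b]):
            edges[a].add(b)
            edges[b].add(a)
    return components(names, edges), edges


def bridging_ests(cluster, edges, is_est):
    """ESTs whose removal splits the sub-cluster into separate parts."""
    bridges = []
    for name in sorted(cluster):
        if not is_est(name):
            continue
        rest = cluster - {name}
        rest_edges = {m: edges[m] & rest for m in rest}
        if len(components(rest, rest_edges)) > 1:
            bridges.append(name)
    return bridges


exons = {
    "mrna1": [(0, 100)],
    "mrna2": [(50, 150)],
    "est1": [(120, 130), (300, 320)],   # links two otherwise separate groups
    "mrna3": [(310, 400)],
    "mrna4": [(350, 450)],
}
clusters, edges = sub_clusters(exons)
print(bridging_ests(clusters[0], edges, lambda n: n.startswith("est")))
```

Here `est1` is reported as the single connecting EST, so the sub-cluster would be split into the two groups on either side of it. Note that this simple test also flags an EST whose removal isolates a single sequence; a stricter version might require each resulting part to contain several members.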
After clustering, candidate TSSs are identified as the 5′-most position of transcripts passing the stringent filter (see above) plus the 5′-most position in the cluster overall. Sites within 20 bp are grouped and the 5′-most one is retained. Instead of an overall cluster quality score, individual TSSs are now assigned a quality and a support score as follows: A TSS that coincides with an EPD-identified TSS has a quality score of 4. A TSS that comes from a RefSeq sequence is given quality 3, one from an mRNA record a quality of 2 and one from an EST only a quality score of 1. If no evidence supports a site (e.g. the 5′-end of a sequence that is known to be truncated) it gets a score of 0. The support score is the count of sequences that contribute to the TSS prediction. For quality 2 and above, ESTs are excluded from this count.
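The grouping and scoring rules above can be sketched as follows. Several details are assumptions made for illustration: coordinates are on the plus strand (so the 5′-most site is the smallest position), the 20 bp distance is measured from the retained 5′-most site of each group, a group is treated as coinciding with an EPD site if any of its members matches an EPD position, and only sequence records are counted towards support.

```python
RANK = {"RefSeq": 3, "mRNA": 2, "EST": 1}


def score_tss(candidates, epd_sites=frozenset(), window=20):
    """Group candidate TSSs and assign (position, quality, support).

    candidates: list of (position, kind), where kind is "RefSeq", "mRNA",
                "EST", or None for a site with no supporting evidence.
    epd_sites:  set of positions of EPD-identified TSSs (hypothetical input).
    """
    groups = []
    for pos, kind in sorted(candidates, key=lambda c: c[0]):
        if groups and pos - groups[-1]["tss"] <= window:
            groups[-1]["members"].append((pos, kind))
        else:
            # The first (5'-most) site of a new group is the one retained.
            groups.append({"tss": pos, "members": [(pos, kind)]})

    scored = []
    for group in groups:
        kinds = [k for _, k in group["members"] if k is not None]
        quality = max((RANK[k] for k in kinds), default=0)
        if any(p in epd_sites for p, _ in group["members"]):
            quality = 4
        if quality >= 2:
            # ESTs are excluded from the support count at quality 2 and above.
            support = sum(1 for k in kinds if k != "EST")
        else:
            support = len(kinds)
        scored.append((group["tss"], quality, support))
    return scored


candidates = [(1000, "EST"), (1010, "mRNA"), (1015, "EST"),
              (1200, "EST"), (1500, None)]
print(score_tss(candidates))
# [(1000, 2, 1), (1200, 1, 1), (1500, 0, 0)]
```

The first three sites collapse into one TSS at position 1000 with mRNA-level quality, and only the mRNA counts towards its support.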
The cluster is finally annotated based on its largest RefSeq or mRNA sequence. The locations of gaps longer than 500 bases and other clusters upstream are noted as possible boundaries to promoter sequence extraction.
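As an illustration of how such boundaries might be applied, the sketch below clips a requested upstream window at the nearest noted boundary. The text only states that boundaries are noted; the clipping behaviour, the function name and the plus-strand coordinate convention are assumptions.

```python
def upstream_window(tss, length, gaps, upstream_cluster_ends, min_gap=500):
    """Return (start, end) of the upstream region, clipped at boundaries.

    gaps: list of (start, end) genomic gaps; only gaps longer than
          `min_gap` bases that end at or before the TSS act as boundaries.
    upstream_cluster_ends: end positions of other clusters upstream.
    """
    boundaries = [end for start, end in gaps
                  if end - start > min_gap and end <= tss]
    boundaries += [end for end in upstream_cluster_ends if end <= tss]
    # The nearest boundary to the TSS is the largest coordinate.
    start = max([tss - length] + boundaries)
    return max(start, 0), tss


print(upstream_window(10000, 2000, [(8500, 9200), (9900, 9950)], [7000]))
# (9200, 10000)
```

In this example the 700-base gap limits the window, whereas the 50-base gap and the more distant upstream cluster do not.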