Mapping bidirectional promoters in the mouse genome
Following the approach we used for the human genome, we systematically mapped bidirectional promoters in the mouse genome. These promoters were defined by their position between two oppositely oriented transcription units whose transcription start sites (TSSs) were no more than 1000 bp apart. All transcripts used in the analysis came from one of three repositories:
• The UCSC List of Known Genes [5]
• Spliced EST data from the GenBank dbEST database [6]
• mRNA transcripts from GenBank
As discussed in [3], mapping bidirectional promoters from the Known Gene annotations is straightforward because of the quality of these gene descriptions. Initially, all genes are represented as clusters of overlapping transcripts. Each cluster extends from the farthest 5′ coordinate to the farthest 3′ coordinate of any included transcript. Neighboring clusters are then examined with respect to the distance between, and the orientation of, their 5′ ends. If the 5′ ends of two genes are no more than 1000 bp apart and the genes are transcribed in opposite directions, the region between them is considered to be a bidirectional promoter.
Identifying bidirectional promoters from other annotation sources in the mouse genome can be more complex because the current transcripts are diverse and fragmented. For instance, both the spliced ESTs and the GenBank mRNA transcripts contain multiple overlapping segments of transcribed regions, which are frequently updated as new information becomes available. To handle the complexity of the spliced EST data, we applied an algorithm that extracts bidirectional promoters passing a series of conditional tests. These included conformity to the rules of distance and orientation.
Furthermore, transcripts were classified as intergenic or intragenic by comparison with the Known Genes as a reference track. Additional criteria requiring majority agreement with the orientation of co-localized ESTs and with the orientation of Known Genes are described in Yang and Elnitski (2007) [3].
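The Known Gene mapping rule described above (cluster overlapping transcripts, then test neighboring clusters for distance and orientation) can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used in the study: the `Transcript` record, the function names, and the restriction to immediately adjacent clusters are assumptions made here, and the additional EST-specific tests are not modeled.

```python
from dataclasses import dataclass

MAX_TSS_DISTANCE = 1000  # bp; the distance cutoff stated in the text


@dataclass
class Transcript:  # hypothetical record type, for illustration only
    chrom: str
    start: int   # leftmost genomic coordinate
    end: int     # rightmost genomic coordinate
    strand: str  # "+" or "-"


def cluster_transcripts(transcripts):
    """Merge overlapping same-strand transcripts into clusters."""
    clusters = []
    for t in sorted(transcripts, key=lambda t: (t.chrom, t.strand, t.start)):
        last = clusters[-1] if clusters else None
        if (last and last.chrom == t.chrom and last.strand == t.strand
                and t.start <= last.end):
            last.end = max(last.end, t.end)  # extend to the farthest coordinate
        else:
            clusters.append(Transcript(t.chrom, t.start, t.end, t.strand))
    return clusters


def find_bidirectional_promoters(transcripts, max_dist=MAX_TSS_DISTANCE):
    """Return (chrom, left, right) regions between head-to-head neighbors."""
    clusters = sorted(cluster_transcripts(transcripts),
                      key=lambda c: (c.chrom, c.start))
    promoters = []
    for left, right in zip(clusters, clusters[1:]):
        if left.chrom != right.chrom:
            continue
        # Head-to-head pair: minus-strand gene on the left, plus-strand gene
        # on the right. The 5' end of a minus-strand cluster is its rightmost
        # coordinate; the 5' end of a plus-strand cluster is its leftmost one.
        if left.strand == "-" and right.strand == "+":
            gap = right.start - left.end
            if 0 <= gap <= max_dist:
                promoters.append((left.chrom, left.end, right.start))
    return promoters
```

Only adjacent clusters in coordinate order are compared, so overlapping genes on opposite strands are skipped in this sketch; a fuller treatment would need to handle such cases explicitly.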
The mapping algorithm identified 5,647 candidate bidirectional promoter regions in the mouse genome. This number is similar to the number of candidate bidirectional promoters identified in the human genome using a similar strategy [3]. In both genomes, the number of bidirectional promoters was larger than previously reported [1], as a result of updated gene annotations and the use of spliced EST data. The validity of these candidate regions was assessed by comparison to the RIKEN CAGE dataset [7]. The CAGE technique captures the true 5′ ends of transcripts, allowing a direct comparison to our bidirectional promoters by their coordinates in the mouse genome. Figure 1 shows the results; a bidirectional promoter is considered fully validated when a CAGE transcript flanks both sides of the promoter region. In the human genome, bidirectional promoters from the Known Gene, mRNA, and EST data are validated at 96%, 78%, and 81%, respectively (Figure 1, upper panel), while in the mouse genome, bidirectional promoters from the Known Gene, mRNA, and EST data are validated at 95%, 40%, and 65%, respectively (Figure 1, lower panel). The low validation score for mouse mRNA appears to reflect an incomplete description of mouse genes in the mouse genome assembly mm5 (May 2004).
Figure 1 Validation of bidirectional promoters using the RIKEN CAGE dataset. Pie charts depict the number of bidirectional promoters with CAGE transcripts that correspond to detectable transcripts on both sides (black), only one side (gray), or no evidence (white).
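The three validation outcomes in Figure 1 (both sides, one side, no evidence) can be illustrated with a small tally. The sketch below is an assumption-laden illustration rather than the procedure applied to the RIKEN data: the tuple layouts, the function names, and in particular the `window` tolerance of 100 bp are hypothetical choices introduced here.

```python
from collections import Counter


def validation_status(promoter, cage_tags, window=100):
    """Classify one promoter as supported on 'both', 'one', or 'none' sides.

    promoter  : (chrom, left, right) region between two head-to-head genes
    cage_tags : iterable of (chrom, position, strand) CAGE 5' ends
    window    : hypothetical tolerance in bp around each promoter boundary
    """
    chrom, left, right = promoter
    # A minus-strand tag near the left boundary supports the left gene.
    left_ok = any(c == chrom and s == "-" and abs(p - left) <= window
                  for c, p, s in cage_tags)
    # A plus-strand tag near the right boundary supports the right gene.
    right_ok = any(c == chrom and s == "+" and abs(p - right) <= window
                   for c, p, s in cage_tags)
    return {2: "both", 1: "one", 0: "none"}[left_ok + right_ok]


def summarize(promoters, cage_tags, window=100):
    """Count promoters in each validation category (cf. the pie charts)."""
    return Counter(validation_status(p, cage_tags, window) for p in promoters)
```

The fraction of promoters in the "both" category corresponds to the validation percentages quoted in the text, under whatever flanking definition is actually used.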
Comparison of human and mouse bidirectional promoter sets
Bidirectional promoters are ancient features, exhibiting orthology from human to Fugu rubripes. To compare the co-occurrence of bidirectional promoters in the human and mouse genomes, we mapped human genes regulated by bidirectional promoters to the mouse genome and assessed whether the corresponding mouse gene also formed a bidirectional promoter with its 5′ neighbor. Of 1637 Known Genes, as shown in Figure 2, 41% were associated with bidirectional promoters in the mouse genome by the same gene name. An additional 4% were added from GenBank mRNA and 7% from the spliced ESTs. Roughly 7% of the set had a gene in the mouse genome but showed no evidence of a bidirectional promoter. The remaining 40% could not be mapped to the mouse using this method. The table of tumor suppressor genes in human and mouse shows the orthologous pairs of mouse genes corresponding to ten human genes involved in cancer that have bidirectional promoters. For four of these genes (BRCA2, ERBB2, FANCA, and FANCF), the gene pair is much farther apart in mouse than in human; from these data we predict that the four mouse genes will prove to be positioned closer together than they currently appear. The table of molecular functions lists the GO terms for genes that are regulated by bidirectional promoters in human but not in mouse, implying that regulatory changes could alter the expression of these genes between species. It should be noted that strategies such as ours, which map orthologs by gene name, provide high-confidence assignments but underestimate the number of orthologous bidirectional promoters in the human and mouse genomes. We have further demonstrated this point by mapping orthologous gene pairs regulated by bidirectional promoters in twelve species using rigorous genomic alignment information [9].
Figure 2 Orthologous mapping of human bidirectional promoters to mouse. Promoter orthology was determined by identifying orthologous genes in mouse and checking for evidence of bidirectional promoters. Genes that had a 5′ neighbor transcribed in the opposite
Tumor suppressor genes in human and mouse
Molecular function (P < 0.05) of human genes having a unique bidirectional promoter not detected in mouse
Although bidirectional promoters are orthologous between humans and mice, they exhibit sparse conservation signals in multi-species alignments. This is a slightly surprising result, given that sequence conservation is a reliable marker for functional elements. Nevertheless, it is possible that alternative methods may reveal similarities in bidirectional promoters across species.
To test for sequence characteristics that may reveal subtle similarities between the sets of human and mouse bidirectional promoters, we calculated a log-likelihood score called Regulatory Potential (RP). The RP score was used in ESPERR (Evolutionary and Sequence Pattern Extraction through Reduced Representations) [10] to capture information in sequence alignments over seven vertebrate species. This method has been shown to discriminate regulatory regions from nonfunctional regions with an accuracy of 80% [10].
The RP score cumulative distribution functions plotted in Figure 3 reveal that regulatory potential scores are similar for bidirectional promoters defined by Known Genes, ESTs, and mRNA in both human and mouse. The similarity in profiles exhibited by all three datasets for each species indicates that sequence characteristics are similar in bidirectional promoter regions, both across species (human vs. mouse) and across datasets (Known Genes, mRNA, and ESTs). The strategy used to map these gene pairs across species reliably identifies orthologous genes that are characterized by name. Therefore, the conclusions should not change as more data are added.
Figure 3 RP score cumulative distribution functions for bidirectional promoters in human and mouse. Bidirectional promoters identified from Known Genes (KG), mRNA, and ESTs all yield similar scores in both human and mouse genomes. RP scores were calculated based
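A cumulative distribution comparison of this kind can be sketched as follows. The empirical CDF is standard; the maximum vertical gap between two empirical CDFs (the quantity underlying the Kolmogorov-Smirnov statistic) is added here only as one possible numeric summary of "similar profiles" and is not a measure reported in the text. Function names are illustrative.

```python
from bisect import bisect_right


def ecdf(scores):
    """Return the empirical CDF F(x) = fraction of scores <= x."""
    ordered = sorted(scores)
    n = len(ordered)

    def F(x):
        return bisect_right(ordered, x) / n

    return F


def max_cdf_gap(scores_a, scores_b):
    """Largest vertical distance between the two empirical CDFs.

    The gap can only change at observed scores, so it is enough to
    evaluate both CDFs at every distinct value in either sample.
    """
    Fa, Fb = ecdf(scores_a), ecdf(scores_b)
    return max(abs(Fa(x) - Fb(x)) for x in set(scores_a) | set(scores_b))
```

A gap near 0 indicates nearly overlapping curves, as described for the Known Gene, mRNA, and EST sets; a gap near 1 indicates almost complete separation.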
Discriminating functional elements based on RP scores
Having established the orthology of bidirectional promoters between human and mouse, we now turn to the problem of discriminating functional elements in the human genome. We again make use of RP scores, which have proven useful for discriminating functional from nonfunctional elements; their ability to discriminate among types of functional elements, however, remains unknown.
To test the hypothesis that sequence characteristics differ between classes of functional elements, thereby allowing these classes to be discriminated, we compared RP scores for human bidirectional promoters to those for other functional regions, including enhancers, unidirectional promoters, unbounded promoters, non-promoters (i.e. tail-to-tail regions), coding regions, and neutral regions.
The cumulative distribution functions of RP scores for the different functional classes are shown in Figure 4. We observe that:
Figure 4 Cumulative distribution functions of RP scores for different functional classes. These include bidirectional promoters (red, green, blue), non-bidirectional promoters (purple) and unbounded promoters (light blue, pink, light green). Other functional elements
• As expected, neutral regions (represented by ancestral repeats) separated very distinctly from functional regions such as enhancers.
• Despite the fact that bidirectional promoters do not have a strong signal for sequence conservation, they have slightly higher RP scores than enhancers. This is notable because the enhancers used in this analysis are enhancers of genes involved in essential developmental processes, such as neurogenesis [11], which are characterized by strong signals of sequence conservation known as Multi-species Conserved Sequences (MCSs) [12].
• Bidirectional promoters have high RP scores, similar to unidirectional promoters, which are promoter regions that are defined by two genes in a head-to-tail configuration. Like bidirectional promoters, unidirectional promoters are bounded on both sides by exons.
• High scores are not a feature of all promoter regions. For example, unbounded promoters, which are promoters having no neighboring upstream gene, tend not to have high RP scores. We examined unbounded promoter regions with no upstream gene within 1,000, 5,000, and 10,000 bp and found moderately low RP scores for all three classes. Furthermore, the range of these scores was indistinguishable from that of non-promoter regions.
• Coding regions score nearly as well as bidirectional promoters. This suggests that the types of nucleotide substitutions and the “word” content of bidirectional promoters and coding regions may be governed by the same rules, despite the fact that coding regions are strongly conserved and bidirectional promoters are not.
Prediction of bidirectional promoters from RP scores
Figure 4 shows that bidirectional promoter regions tend to have higher RP scores than either non-promoter or unbounded promoter regions. Another way to see this is to plot the class-conditional density functions p(x|C), where x is the RP score and C is a functional class; p(x|C) is simply the probability density function of RP scores restricted to the functional class C. Given the class-conditional density functions p(x|C1) and p(x|C2) for classes C1 and C2, respectively, we can construct a likelihood ratio classifier that maps an RP score x to a functional class using the rule:

\[
x \mapsto
\begin{cases}
C_1 & \text{if } \dfrac{p(x \mid C_1)}{p(x \mid C_2)} > \mu, \\[1ex]
C_2 & \text{otherwise,}
\end{cases}
\tag{1}
\]

where μ is a decision threshold.
The performance of this classifier for different values of the threshold μ is summarized by a Receiver Operating Characteristic (ROC), which is a plot of sensitivity against (1 − specificity). We constructed two such classifiers: one to discriminate bidirectional promoters from non-promoters, and the other to discriminate bidirectional promoters from unbounded promoters.
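A likelihood ratio classifier of this form, together with the sensitivity and specificity values that make up an ROC curve, can be sketched as below. The text does not state how the class-conditional densities were estimated, so the smoothed histogram estimator used here is an assumption, as are the function names, the bin settings, and the convention that class 1 is the "positive" class.

```python
def histogram_density(scores, lo, hi, bins, pseudocount=1.0):
    """Estimate p(x|C) as a histogram on [lo, hi] with add-one smoothing.

    The pseudocount keeps every bin strictly positive, so likelihood
    ratios are always defined. Values outside the range fall in the
    nearest edge bin.
    """
    width = (hi - lo) / bins
    counts = [pseudocount] * bins

    def bin_index(x):
        return min(bins - 1, max(0, int((x - lo) / width)))

    for s in scores:
        counts[bin_index(s)] += 1
    total = sum(counts) * width  # normalizer so the density integrates to 1

    def density(x):
        return counts[bin_index(x)] / total

    return density


def make_lr_classifier(p1, p2, mu=1.0):
    """Return a rule assigning class 1 if p1(x)/p2(x) > mu, else class 2."""
    def classify(x):
        return 1 if p1(x) / p2(x) > mu else 2
    return classify


def sensitivity_specificity(classify, positives, negatives):
    """Sensitivity on class-1 examples and specificity on class-2 examples."""
    tp = sum(classify(x) == 1 for x in positives)
    tn = sum(classify(x) == 2 for x in negatives)
    return tp / len(positives), tn / len(negatives)


def roc_points(p1, p2, positives, negatives, thresholds):
    """One (1 - specificity, sensitivity) point per threshold mu."""
    points = []
    for mu in thresholds:
        clf = make_lr_classifier(p1, p2, mu)
        sens, spec = sensitivity_specificity(clf, positives, negatives)
        points.append((1 - spec, sens))
    return points
```

Setting `mu=1.0` gives the maximum likelihood rule; sweeping `mu` from small to large values traces the ROC curve from the upper right corner to the lower left. In practice the densities would be estimated on training data and the ROC computed on held-out test data.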
Distinguishing bidirectional promoters from non-promoters
We constructed a likelihood-based classifier to distinguish bidirectional promoters from non-promoters; this is a two-class classification problem, in which the two classes are bidirectional promoters (BP) and non-promoters (NP).
The class-conditional probability distributions p(x|BP) and p(x|NP) are shown in panel (a) of the class-conditional density figure (here “BP” denotes the class of bidirectional promoters, and “NP” denotes the class of non-promoters). The corresponding ROC curve is shown in panel (a) of the ROC figure. A maximum likelihood classification rule (obtained by setting μ = 1 in the likelihood ratio classifier (1)) yielded a test set accuracy of 74%, a specificity of 92% (relatively high), and a sensitivity of 65% (relatively low), as shown in the table of classifier performance on test data. The ROC curve reveals that the sensitivity can be raised above 80% at the cost of reducing the specificity below 80%.
(a) Class-conditional probability density functions p(x|BP) (bidirectional promoters) and p(x|NP) (non-promoters). (b) Class-conditional probability density functions p(x|BP) (bidirectional promoters) and p(x|UBP1000) (unbounded promoters).
(a) Receiver operating characteristic (ROC) for classifier that discriminates bidirectional promoters from non-promoters. (b) Receiver operating characteristic (ROC) for classifier that discriminates bidirectional promoters from unbounded promoters.
Performance of classifiers on test data
Distinguishing bidirectional from unbounded promoters
We constructed a likelihood-based classifier to distinguish bidirectional promoters from unbounded promoters (specifically, the class of promoters with no upstream gene within 1000 base pairs); this is a two-class classification problem, in which the two classes are bidirectional promoters (BP) and unbounded promoters with no upstream gene within 1000 bp (UBP1000).
The class-conditional probability distributions p(x|BP) and p(x|UBP1000) are shown in panel (b) of the class-conditional density figure (here “BP” denotes the class of bidirectional promoters, and “UBP1000” denotes the class of promoters with no upstream gene within 1000 base pairs). The corresponding ROC curve is shown in panel (b) of the ROC figure. A maximum likelihood classification rule (obtained by setting μ = 1 in the likelihood ratio classifier (1)) yielded a test set accuracy of 80%, a specificity of 81% (relatively high), and a sensitivity of 67% (relatively low), as shown in the table of classifier performance on test data. The ROC curve reveals that the sensitivity can be raised above 80% at the cost of reducing the specificity below 75%.
Multiple class prediction
We then tackled a more challenging problem, constructing a classifier that distinguishes the following four classes: bidirectional promoters, unbounded promoters, enhancers, and non-promoters.
Bidirectional promoters and unbounded promoters are enriched in CpG islands, while enhancers and non-promoters are depleted in CpG islands. Furthermore, bidirectional promoters and enhancers tend to have relatively high RP scores compared to unbounded promoters and non-promoters. It follows that by making use of both features (presence of CpG islands and RP score), we may be able to separate the four classes. We therefore implemented a two-stage hierarchical classifier, shown in the accompanying figure.
The first stage looks only at the CpG island feature: if CpG islands are present, the instance is passed to the left child at level 2 (node N2), while if CpG islands are not present, the instance is passed to the right child at level 2 (node N3). The first stage also produces a classification outcome Z1: if the instance was passed to the left child, then Z1 = 1; otherwise Z1 = 0. Ideally, instances that end up in node N2 should be either bidirectional or unbounded promoters, while instances that end up in node N3 should be either enhancers or non-promoters.
The second stage then refines the classification. Node N2 uses a support vector machine to separate bidirectional from unbounded promoters based on two features: the presence of CpG islands and the RP score. Node N3 uses a decision tree to separate enhancers from non-promoters based on one feature, the RP score (these two classes cannot be distinguished by the presence of CpG islands, so that feature would not be helpful). A decision tree was used at node N3 because it gave better results than a support vector machine. There is a classification outcome Z2 associated with each node at level 2. For node N2, Z2 = 1 implies that the instance is classified as a bidirectional promoter, while Z2 = 0 implies that the instance is classified as an unbounded promoter. For node N3, Z2 = 1 implies that the instance is classified as an enhancer, while Z2 = 0 implies that the instance is classified as a non-promoter. The overall classification is then given by the pair (Z1, Z2) as follows:
• (1, 1): bidirectional promoter
• (1, 0): unbounded promoter
• (0, 1): enhancer
• (0, 0): non-promoter
Algorithm for classifying regions into one of four classes: bidirectional promoter, unbounded promoter, non-promoter, or enhancer.
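The control flow of the two-stage classifier can be sketched as follows. Only the routing on the CpG island feature and the mapping from (Z1, Z2) to a class label come from the description above. The level-2 models are passed in as plain callables, so a trained support vector machine or decision tree could be wrapped and supplied; the `rp_threshold` helper and its cutoff values are hypothetical stand-ins introduced purely so that the example runs, not trained models or values from the study.

```python
# Mapping from the outcome pair (Z1, Z2) to a class, as described in the text.
CLASS_BY_OUTCOME = {
    (1, 1): "bidirectional promoter",
    (1, 0): "unbounded promoter",
    (0, 1): "enhancer",
    (0, 0): "non-promoter",
}


def classify_region(has_cpg_island, rp_score, n2_model, n3_model):
    """Two-stage hierarchical classification of one genomic region.

    n2_model(has_cpg_island, rp_score) -> 0 or 1  (stands in for the SVM)
    n3_model(rp_score)                 -> 0 or 1  (stands in for the tree)
    """
    # Stage 1: route on the CpG island feature alone.
    z1 = 1 if has_cpg_island else 0
    # Stage 2: refine within the selected branch.
    if z1 == 1:
        z2 = n2_model(has_cpg_island, rp_score)   # node N2
    else:
        z2 = n3_model(rp_score)                   # node N3
    return CLASS_BY_OUTCOME[(z1, z2)]


def rp_threshold(cutoff):
    """Hypothetical stand-in model: outcome 1 if the RP score exceeds cutoff."""
    def model(*features):
        rp_score = features[-1]  # the RP score is always the last feature
        return 1 if rp_score > cutoff else 0
    return model


# Example with made-up cutoffs (assumptions, not values from the study).
n2 = rp_threshold(0.05)
n3 = rp_threshold(0.0)
label = classify_region(has_cpg_island=True, rp_score=0.1,
                        n2_model=n2, n3_model=n3)
```

Keeping the level-2 models as interchangeable callables mirrors the design described in the text, where a different learner was chosen for each node according to which performed better.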