Application of IAMMS to Rod- and Cone-specific Promoters
Input to IAMMS consisted of the upstream regions of 11 rod-specific, 12 cone-specific, and 84 non-photoreceptor genes (see table for a list of rod/cone-specific genes, and additional file 1 online for background genes). The flowchart of the IAMMS algorithm is shown in figure 1 (see Methods for details). The first step involved an iterative alignment procedure conducted on all rod, cone, and non-photoreceptor promoters. This step resulted in a dataset of 71,195 conserved motifs between 8 and 150 bp in length. Each entry of the dataset contains the nucleotide sequence, the location of each motif occurrence with respect to the transcription start site, the strand, and the promoter from which each occurrence originated. To illustrate the composition of the dataset, we plotted motif length against the number of occurrences of each motif in photoreceptor promoters (figure 2; background occurrences are not shown). The color map represents the number of motifs with each length/frequency combination. As may be expected, motif length is inversely related to the number of occurrences.
Rod-specific and cone-specific genes
Figure 1 A block diagram of the iterative alignment/modular motif selection (IAMMS) algorithm used to identify putative functional sites in photoreceptor promoter regions. Boxes represent the input/output of each successive step. Arrows show flow. Circles show ...
Figure 2 3D histogram representing features of potential motifs after the iterative alignment. The vertical and horizontal axes show the number of non-overlapping occurrences of a motif and the motif length in nucleotides (nt), respectively. Color shows the number ...
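The structure of a dataset entry and the length/frequency tabulation behind the histogram can be sketched as follows. This is a minimal illustration, not the IAMMS implementation: the class names, field names, and the toy motifs and positions are all assumptions made for the example, and the sketch counts every listed occurrence rather than only non-overlapping occurrences in photoreceptor promoters.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Occurrence:
    promoter: str   # promoter (gene) in which the occurrence was found
    position: int   # bp relative to the transcription start site
    strand: str     # "+" or "-"


@dataclass
class Motif:
    sequence: str
    occurrences: list = field(default_factory=list)


def length_frequency_histogram(motifs):
    """Count motifs for each (motif length, number of occurrences) pair."""
    return Counter((len(m.sequence), len(m.occurrences)) for m in motifs)


# Toy entries with made-up promoters and positions (hypothetical values).
toy_motifs = [
    Motif("ACGTACGTAC", [Occurrence("geneA", -120, "+"),
                         Occurrence("geneB", -75, "-")]),
    Motif("TTGACCAATG", [Occurrence("geneA", -300, "+"),
                         Occurrence("geneC", -40, "+")]),
    Motif("CACACACA", [Occurrence("geneB", -510, "-"),
                       Occurrence("geneC", -900, "+"),
                       Occurrence("geneD", -15, "+")]),
]
histogram = length_frequency_histogram(toy_motifs)
```

Each key of `histogram` corresponds to one cell of the length/frequency grid, and its value to the color of that cell.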
Analysis showed that the majority of motifs identified after the first step were repeat sequences. The motifs occurring most frequently (> 25 occurrences) were primarily simple repeats. All longer motifs (> 19 bp) were highly similar to microsatellites and interspersed repeats, as revealed by comparison to a database of known repeats (RepBase). Repeat sequences were filtered out at step 2.
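As a rough illustration of how simple repeats can be recognized, the sketch below flags a motif when it consists of a short unit repeated in tandem. This heuristic is a substitute chosen for the example and is not the RepBase comparison used in step 2; it does not detect interspersed repeats, and the function names and the `max_unit` and `min_copies` defaults are assumptions.

```python
def smallest_repeat_unit(seq):
    """Return the shortest prefix whose tandem repetition (the last copy
    may be truncated) reproduces seq exactly."""
    seq = seq.upper()
    for unit_len in range(1, len(seq) + 1):
        unit = seq[:unit_len]
        repeated = unit * (len(seq) // unit_len + 1)
        if repeated[:len(seq)] == seq:
            return unit
    return seq  # only reached for an empty string


def is_simple_repeat(seq, max_unit=6, min_copies=3):
    """Flag seq as a simple repeat if its repeat unit is at most max_unit bp
    long and seq spans at least min_copies copies of that unit."""
    unit = smallest_repeat_unit(seq)
    return len(unit) <= max_unit and len(seq) >= min_copies * len(unit)
```

For example, `is_simple_repeat("CACACACACA")` is true because the sequence is five copies of `CA`, whereas a sequence such as `CCTTTAAGCCCT` is not flagged.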
After repeat filtering, the remaining motifs (those inside and immediately above the marked box in figure 2) were evaluated for potential enrichment in rod or cone photoreceptors (step 3). Since we are interested in motifs that occur in the promoters of only one photoreceptor cell type, motifs with occurrences in both rod and cone promoters were classified as ambiguous and excluded from consideration during this step. To evaluate enrichment of a motif compared to background, we assume a binomial distribution of k_r rod-specific (or k_c cone-specific) promoters drawn from the total number of promoters that contain occurrences. A Bonferroni correction for multiple hypothesis testing is applied to the resulting p-value to give an E-value, as described in the Statistical annotation section of Methods. The top-scoring motifs identified during this step were subjected to phylogenetic analysis (step 6) and compared to known motifs using the Transcriptional Element Search System (TESS; step 7).
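The enrichment test described above can be sketched as a one-sided binomial tail probability followed by a Bonferroni-style correction. The details here are assumptions rather than the published procedure: the success probability is taken to be the fraction of cell-specific promoters in the compared set, the E-value is taken to be the p-value multiplied by the number of motifs tested, and all function and parameter names are hypothetical. The exact definitions are given in the Statistical annotation section of Methods.

```python
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def enrichment_e_value(k_cell, n_with_motif, n_cell_promoters,
                       n_total_promoters, n_tests):
    """Return (p-value, E-value) for observing at least k_cell cell-specific
    promoters among the n_with_motif promoters that contain the motif.

    Assumption: under the null hypothesis each promoter containing the motif
    is cell-specific with probability n_cell_promoters / n_total_promoters.
    """
    p_success = n_cell_promoters / n_total_promoters
    p_value = binomial_tail(k_cell, n_with_motif, p_success)
    return p_value, p_value * n_tests  # Bonferroni-style correction
```

As a usage example, a motif found in 5 cone-specific promoters and in no other promoter, evaluated against 12 cone plus 84 background promoters, would be scored with `enrichment_e_value(5, 5, 12, 96, n_tests)`, where `n_tests` is the number of motifs evaluated.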
Figure 3 shows representative examples of top-scoring cone- and rod-enriched motifs identified during step 3, after phylogenetic analysis and comparison with TESS. The cone-enriched motif is 13 bp in length and has 5 occurrences in cone-specific promoters and none in rod promoters (non-photoreceptor occurrences are not shown). The cross-species conservation score (CSCS) for each occurrence is shown in the last column. Four occurrences have a negative CSCS. A negative CSCS means that the predicted occurrence is more conserved than surrounding sequences of the same length (see Methods for details). Comparison with known photoreceptor-specific motifs indicated that this sequence is similar to the preferred binding site for the Retinoid X Receptor (RXR). Involvement of RXR in cone-specific expression is well established [2], but binding sites for this transcription factor in cone photoreceptor promoters have not yet been identified, making this prediction valuable for planning experimental studies.
Figure 3 Example of cone (A) and rod (B) enriched DNA motifs after statistical annotation. Columns from left to right give gene MGI Symbol, cell-specific expression patterns (C, cone; R, rod; background matches are removed for this figure), start position of motif ...
The rod-enriched motif (figure 3B) is 12 bp in length and has 6 occurrences in rod promoters. Cross-species conservation shows that the Pde6a, Gnb1, and Nr2e3 occurrences are phylogenetically conserved (a cross-species alignment is not available for the region containing the Pde6g occurrence, and thus no score is reported). According to TESS, this motif is similar to a c-Myb binding site. The prediction that c-Myb may have a function unique to one type of photoreceptor is consistent with publicly available microarray data (see Methods). We found that c-Myb is between 2.6- and 7.6-fold enriched in cones compared to rod photoreceptors.
After step 3, IAMMS identified a total of 6 motifs (3 rod- and 3 cone-enriched) with E < 2.5. Since no position filtering was applied to identify these motifs, we refer to them as position-independent. All position-independent rod- and cone-enriched motifs, sorted by E-value, are shown at the top of figure 4. The highest-scoring rod prediction at the top of figure 4 contains two 5 bp invariant core regions separated by two ambiguous positions (CCTTTNNGCCCT; rod-enriched position-independent, row 1). The position variance of this prediction is remarkably small (± 45) considering that no position-based selection was applied to identify this sequence. The top-scoring cone motif contains a core region 5 bp in width (aGGGTTca). It occurs in 8/12 cone promoter sequences with no discernible bias in position. Detailed information on the position and phylogenetic conservation of each occurrence is available as additional data (files 1
Figure 4 Highest scoring rod (left) and cone (right) enriched motifs returned after statistical annotation in IAMMS step 3 (position independent) and IAMMS step 5 (position dependent). From the left, columns give the motif logo, the fraction of rod/cone specific ...
Motifs classified as ambiguous during step 3 were subjected to position-based clustering (step 4). As described previously, we acted under the hypothesis that occurrences of a motif near the transcription start site, and those occurring in clusters around a preferred position, are more likely to be functional. One example of the clusters selected by the hierarchical clustering algorithm is shown in figure 5A. This particular motif has 55 occurrences, plotted as triangles according to their 1-dimensional position relative to the transcription start site. The algorithm divides these occurrences into clusters, denoted by blue ovals. A cone-enriched cluster just upstream of the transcription start site is shown in pink. This cluster contains occurrences from 5/12 cone-specific promoters and from only 4/84 non-photoreceptor promoter regions.
Figure 5 (A) Occurrences of a sample ambiguous motif (triangles) analyzed using position cluster discovery. The horizontal axis represents position relative to the putative transcription start site. The vertical position of occurrences was offset to ease viewing. ...
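Position-based clustering of occurrences along a single axis can be illustrated with a small sketch. It assumes single-linkage clustering cut at a fixed distance, which in one dimension is equivalent to sorting the positions and splitting wherever two neighbours are further apart than the cutoff. The linkage criterion and the `max_gap` value are assumptions made for this example; the clustering actually used in step 4 is described in Methods.

```python
def cluster_positions(positions, max_gap=50):
    """Group 1-D positions (bp relative to the transcription start site)
    into clusters, starting a new cluster whenever the gap to the previous
    sorted position exceeds max_gap."""
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)   # close enough: extend current cluster
        else:
            clusters.append([pos])     # gap too large: start a new cluster
    return clusters
```

With made-up positions, `cluster_positions([-500, -90, -97, -94, -480, -10])` groups the three occurrences near -94 bp into one cluster, which could then be tested for rod or cone enrichment on its own.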
After motifs were broken into position-dependent clusters, we used the same statistical procedure described above to select those clusters enriched in rod or cone promoters (IAMMS, step 5). Figure plots the ratio between cell-specific and total occurrences (vertical axis) against the total number of promoters with at least one occurrence (horizontal axis). Points are colored based on the number of motifs with a given combination, in a similar manner to figure . The cone-enriched cluster cAGAAG shown in figure is one of the motifs represented by the point marked in figure . This point lies just inside the gray region representing a statistical threshold of p = 0.005 that was used to classify motifs as enriched in rod (or cone) specific promoter sequences. A motif corresponding to the known cone-specific cis-element ROP2 is also represented by a point in the gray region of figure . Figure shows the same representation as figure for rod-specific motifs. A previously characterized rod-specific motif, NRE, is represented by a point that lies just inside the gray region (marked in figure ), indicating the biological relevance of motifs represented in this region.
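The boundary of the region defined by the p = 0.005 threshold can be illustrated by computing, for each total number of promoters n, the smallest number of cell-specific promoters k whose binomial tail probability does not exceed the threshold; the corresponding ratio k/n is the value plotted on the vertical axis. This sketch assumes a one-sided binomial test with a user-supplied background rate. The function names and the rate used in the usage note are hypothetical.

```python
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def min_enriched_count(n, background_rate, alpha=0.005):
    """Smallest k such that P(X >= k) <= alpha for X ~ Binomial(n,
    background_rate), or None if even k = n does not reach the threshold."""
    for k in range(n + 1):
        if binomial_tail(k, n, background_rate) <= alpha:
            return k
    return None
```

For instance, with an assumed background rate of 0.125 (12 cone promoters among 96), a cluster present in 5 promoters would need at least 4 of them to be cone-specific, and a cluster present in only 2 promoters could not reach the threshold.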
A detailed view of the NRE-like motif identified after step 5 is shown in the left panel of figure 6. The predicted motif contains a core region (aTGCTGa). The occurrence in the Rho promoter at -88 bp (occurrences are enumerated below the logo in figure 6) has already been validated experimentally [23]. Two sample cross-species phylogenetic alignments are shown below the functional alignment in figure 6 (Pde6b and Rho). In both cases, the occurrences are very highly conserved relative to the surrounding sequence.
Figure 6 Predicted rod (left) and cone (right) enriched motifs. Notation is the same as in figure 3. Cross-species alignments for Pde6b, Rho (left), Cnga3, and Opn1mw (right) occurrences are shown on the bottom. All occurrences are highly conserved across species ...
Another known transcription factor binding site detected in this study corresponds to the recently discovered cone-specific sequence ROP2, shown in the right panel of figure 6. This prediction contains an occurrence in the Opn1mw promoter that was recently shown to be required for cone-specific expression [24]. Previously unknown occurrences of ROP2 were predicted in the promoters of Opn1sw and Cnga3. The newly discovered occurrence in the Opn1sw promoter shows remarkable position conservation relative to the transcription start site when compared with the known Opn1mw occurrence: -94 and -97 bp, respectively, a difference of only 3 bp. Selected phylogenetic alignments (figure 6, right panel, bottom) show that the occurrences in the Cnga3 and Opn1mw promoters are highly conserved through evolution. In addition to increasing confidence in our predictions, the detection of ROP2 provides new candidate targets for a cis-element that is relevant to cone specificity.
The 12 highest scoring (by E-value) rod- and cone-enriched position-dependent predictions are shown at the bottom of figure 4. The example given in figure 5A (cAGAAG) can be found among the cone-enriched motifs in row 7. Among the high-scoring motifs, 6 rod and 3 cone predictions are similar to known motifs whose specific binding positions (with the exception of NRE) are not known, including four putative initiator (INR-like) elements, NRE, an IL-6 effector, an RXR binding site, ROP2, a putative TATA-like motif, and an Engrailed homeodomain binding site. Phylogenetic conservation is relatively high for several of the elements, including two conservation scores less than -1 for cone-enriched predictions (TATA-like: -1.08 and En2: -1.35). As we show in the next section, many of these motifs are corroborated by motifs predicted by DME and/or BioProspector.
Comparison with DME and BioProspector
To increase confidence in our predictions, we compared motifs discovered using IAMMS to those discovered using two existing de novo motif discovery algorithms, DME and BioProspector. For both of these algorithms, a smaller section of the upstream region was used (500 bp of upstream sequence and 100 bp of UTR) to make the comparison more similar to the IAMMS position-clustering implementation. To obtain useful results, promoter regions had to be repeat masked prior to analysis. Since the rod- and cone-specific sets are too small to be compared directly against each other, cone promoters were compared against the combined set of background and rod promoters to evaluate cone enrichment. The same approach was used to identify rod-enriched predictions.
The top 10 motifs for each motif length between 6 and 10 bp (DME) or 6 and 12 bp (BioProspector) were compared with the top IAMMS predictions. This comparison is shown in figure 7. Predictions made by IAMMS and confirmed by DME or BioProspector are highlighted in yellow (DME), blue (BioProspector), or orange (both DME and BioProspector). It is interesting to note that rod predictions from DME and BioProspector agreed with IAMMS in a much higher proportion of cases (nearly 80%) than cone predictions (just under 50%). This difference results from a much lower rate of agreement between IAMMS and BioProspector in cone sequences. In contrast to BioProspector, the rate of agreement between IAMMS and DME is similar in rods and cones (47% in cones, 57% in rods). We conclude that, although the two methods use different underlying algorithms, results obtained using DME are more similar to those of IAMMS than are results obtained using BioProspector.
Figure 7 Comparison of rod (top) and cone (bottom) specific predictions made by IAMMS to those made by either DME (yellow), BioProspector (blue), or both DME and BioProspector (orange). For each prediction, the consensus sequence is given using IUPAC ambiguity ...
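One simple way to decide whether two consensus sequences written with IUPAC ambiguity codes agree is to slide one along the other and require that every aligned pair of symbols shares at least one nucleotide. The sketch below illustrates this idea only; it is not the matching criterion used to build the comparison, the function names and the `min_overlap` default are assumptions, and reverse complements are not considered.

```python
# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def compatible(a, b):
    """True if two IUPAC symbols share at least one nucleotide."""
    return bool(set(IUPAC[a]) & set(IUPAC[b]))


def motifs_agree(m1, m2, min_overlap=5):
    """True if m2 can be placed at some offset along m1 so that at least
    min_overlap positions overlap and every overlapping pair is compatible."""
    m1, m2 = m1.upper(), m2.upper()
    for offset in range(-(len(m2) - min_overlap), len(m1) - min_overlap + 1):
        start = max(0, offset)
        stop = min(len(m1), offset + len(m2))
        pairs = [(m1[i], m2[i - offset]) for i in range(start, stop)]
        if len(pairs) >= min_overlap and all(compatible(a, b) for a, b in pairs):
            return True
    return False
```

For example, `motifs_agree("CCTTTNNGCCCT", "TTTAAGCC")` is true because the shorter sequence fits the longer consensus at an offset of 2. Note that a criterion of this kind is permissive for consensus sequences containing many ambiguous positions.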
Overall, of 40 rod- and cone-specific predictions, 25 (over 60%) were confirmed by either DME or BioProspector and 11 (nearly 30%) were confirmed by both. Major predictions, including the ROP2 binding site, Initiator, TATA-like, and IL-6 (discussed in detail below), were corroborated by at least one motif discovery algorithm. The initiator-like and TATA-like predictions were identified by all 3 algorithms, increasing our confidence in these predictions.