Rice genome projects have generated in-depth genomic datasets and a comprehensive set of genomic upstream sequences. However, promoter sequences from other grass genomes have become available only sporadically. Comparative or computational biology approaches were therefore restricted to studies of individual pairs of interest and limited by the availability of only a few hundred grass promoter sequences. Our knowledge of cis-regulatory elements in monocotyledonous plants is limited to the small number of transcription factor binding sites that have been reported and deposited in plant motif databases. These few dozen known motifs contrast sharply with findings that higher plant genomes encode on average more than 1,500 transcription factors [36
With the completion of the sorghum genome, a genome-wide assessment of regulatory sites in rice and sorghum upstream sequences has now become feasible. In this survey, we employed approaches based on two different tools, PhyloCon and FASTCOMPARE. Both tools and approaches have been successfully applied to motif discovery in many non-plant organisms, including yeast and mammals. In addition, PhyloCon has previously been applied with success to cis-element analysis in genome survey sequences of Brassica oleracea vs. Arabidopsis thaliana.
FASTCOMPARE is based on the 'network-level conservation' principle. This presupposes that regulatory circuitries are largely conserved between two evolutionarily related species and that functional network motifs can be detected by their higher global, genome-wide conservation rate compared to non-functional sequences. Evolutionary conservation of functional elements is also assumed in phylogenetic footprinting, which discovers motifs from a group of orthologous gene pairs. For the analysis based on PhyloCon, the orthologous groups that are compared and combined result from a prior selection of orthologous mate-pairs by co-expression analysis.
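The network-level conservation idea can be illustrated with a minimal Python sketch. It scores each k-mer by how unexpectedly often it occurs in both upstream sequences of an orthologous pair, using a hypergeometric tail probability. This is not the FASTCOMPARE implementation: the function names, the input format (two dictionaries keyed by a shared, hypothetical ortholog-pair identifier), and the omission of reverse complements and dyadic motifs are simplifying assumptions made here.

```python
from math import comb


def kmer_gene_sets(upstreams, k):
    """Map each k-mer to the set of pair ids whose upstream sequence contains it."""
    index = {}
    for pair_id, seq in upstreams.items():
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            index.setdefault(seq[i:i + k], set()).add(pair_id)
    return index


def hypergeom_tail(overlap, n1, n2, total):
    """P(X >= overlap) for two sets of sizes n1 and n2 drawn from `total` pairs."""
    upper = min(n1, n2)
    favourable = sum(
        comb(n1, x) * comb(total - n1, n2 - x) for x in range(overlap, upper + 1)
    )
    return favourable / comb(total, n2)


def conservation_scores(upstreams_a, upstreams_b, k):
    """Return {k-mer: p-value}; small values mean unexpectedly high conservation.

    Assumption: both dictionaries use the same key for the two members of an
    orthologous pair (species A and species B).
    """
    pairs = set(upstreams_a) & set(upstreams_b)
    idx_a = kmer_gene_sets({p: upstreams_a[p] for p in pairs}, k)
    idx_b = kmer_gene_sets({p: upstreams_b[p] for p in pairs}, k)
    scores = {}
    for kmer in idx_a.keys() & idx_b.keys():
        set_a, set_b = idx_a[kmer], idx_b[kmer]
        overlap = len(set_a & set_b)
        scores[kmer] = hypergeom_tail(overlap, len(set_a), len(set_b), len(pairs))
    return scores
```

A k-mer that is present in many upstream sequences of both species but rarely in the same orthologous pair receives a p-value near 1, whereas a k-mer whose occurrences coincide across orthologs receives a small one. Ranking by this value is one plausible way to separate conserved candidates from background.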
Analysis of cis-regulatory elements is notoriously error-prone due to small motif sizes and motif degeneracy. Our study was designed to select functional candidate sites and motifs that are associated with transcriptional activity. Co-expression was derived from correlations exceeding the threshold that defines the top 1% of background similarities. Additionally, our clique approach required every group member to show a significant expression correlation with every other group member. In our survey, we restricted motif searches to 5'-upstream sequences of 2 kb (for k-mer searches) or 3 kb (for PhyloCon) (i) to model current knowledge of plant promoter sizes and (ii) to focus on plant core promoters, which presumably contain most functional elements. Although functional enhancers and cis-elements in, e.g., mammalian promoters have been reported up to several tens of thousands of bases from transcription initiation sites (TIS), plant promoters seem to be more compact [38]. In addition, chance co-occurrences increase strongly with sequence length, in particular for smaller k-mers and degenerate motifs. Larger upstream sequences would thus have adverse effects by accumulating false positives or reducing statistical power.
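The co-expression filter described above can be sketched as follows. The sketch assumes Pearson correlation as the similarity measure and a list of precomputed background correlations; neither detail is specified in this section, and all function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def background_cutoff(background_correlations, top_fraction=0.01):
    """Smallest correlation that still belongs to the top `top_fraction` of the background."""
    ranked = sorted(background_correlations, reverse=True)
    n_top = max(1, int(len(ranked) * top_fraction))
    return ranked[n_top - 1]


def is_coexpression_clique(genes, profiles, cutoff):
    """True only if every pair of genes in the group reaches the cutoff."""
    return all(
        pearson(profiles[a], profiles[b]) >= cutoff
        for a, b in combinations(genes, 2)
    )
```

Requiring all pairs to pass, rather than each gene to have at least one correlated partner, is the stricter reading of the clique criterion: a single uncorrelated pair rejects the whole group.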
Results reported in this study can be divided into two categories: conserved sites and motifs. PhyloCon position-specific scoring matrices (PSSMs) are supported by their conservation between orthologous promoters and their simultaneous co-occurrence in genes with similar expression. Sizes of cis-elements in plants are comparable to those in non-plant species and typically range between 6 and 12 base pairs [5]. The mean size of PhyloCon PSSMs detected in this study was considerably larger (37 bp). Hence, these PSSMs likely represent concrete conserved sites rather than generalized statistical models for transcription factors. Large sizes for phylogenetic footprints in grasses are consistent with a previous study of 288 pairwise maize and rice comparisons and 56 three-way rice, maize and sorghum comparisons, in which a minimum motif size of at least 20 bp was found to be significant [19]. Such long PhyloCon sites may be composed of two or more motifs whose close proximity is required for functionality in the respective co-expressed group. Alternatively, some of the detected sites could represent other signals associated with transcriptional activity, such as mRNA stability signals or miRNA target sites, for which longer sizes have been reported [39]. Complementary to these long conserved regions, many of the detected network-level conserved motifs represent candidates for transcription factor binding sites. After clustering the individual detected sites, a total of 3,809 non-redundant motifs were found. The rice genome contains more than 1,600 genes encoding transcription factors, and a similar number of cis-regulatory motifs could be expected [37]. However, some of our motifs may still be too specific, and one transcription factor may bind to several related motifs. Consistent with this assumption, we observed sequence variability at only a few k-mer positions, indicating that the scoring functions favor specific k-mers or overrepresented k-mers with an overall low occurrence rate in the genome. Furthermore, many of these motifs were obtained from dyadic motif searches that converged to motifs with highly specific spacer sequences. For these long motifs, similar considerations may apply as for the PhyloCon sites discussed above. Taking this into account, the number of motifs reported in this study is close to the number of transcription factors present in rice. On the other hand, our method may have missed transcription factor binding sites that tolerate high degeneracy. Similar findings for highly degenerate motifs have been reported for a FASTCOMPARE analysis in yeast [13]. Nevertheless, our list of motifs provides the most comprehensive analysis of cis-elements in a grass genome to date.
In previous studies, the functionality of motifs has been confirmed by a variety of approaches. Many surveys have reported an association of motifs with particular biological processes. For large-scale analyses, gene ontologies or metabolic pathways were correlated with particular motifs. In this study, however, we were only able to detect a few such associations, and all enrichments were in very broad biological categories, e.g. 'transcription' (see Methods, results not shown). Missing associations likely result from limitations of the current rice GO annotation. In our search, we found at least one GO term of the category 'biological process' for only 755 RAP2 rice genes (2.7%). Similarly, only 1,376 rice genes (4.9%) could be mapped to KEGG pathways. In total, a functional annotation has been found for less than 5% of all rice genes. The sparse data basis and the low resolution of the current rice GO annotation, which mostly assigns top-level terms, are the most probable causes for the limited success in detecting significant enrichments.
Several findings support the functionality of our motifs. PhyloCon sites are associated with conservation and co-expression. Despite the limited availability of experimentally verified cis-regulatory elements in grasses, we find numerous matches to known plant motifs or sites in public databases and literature reports. These include many variants of the ACGT motif, such as the G-box or the ABA response element, as well as ethylene response elements, among others. Interestingly, some top-scoring motifs do not match previously published elements and may represent novel cis-regulatory motifs. The number of motifs that two rice genes have in common correlates positively with their expression similarity. This is consistent with the combinatorial nature of transcription regulation [40] and strongly indicates that a large fraction of the detected motifs is associated with control of transcription. Control may be exerted through transcription factor binding sites or, as discussed above, through miRNA target sites or signals for mRNA stability.
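One simple way to inspect the relationship between shared motifs and expression similarity is to group gene pairs by the number of motifs they share and compare the mean similarity per group. The Python sketch below is an illustration only. It assumes a mapping from gene to its set of motif identifiers and a precomputed pairwise similarity table keyed by unordered gene pairs; both data structures and the function name are hypothetical and not taken from the study.

```python
from collections import defaultdict
from itertools import combinations


def mean_similarity_by_shared_motifs(gene_motifs, expression_similarity):
    """Return {number of shared motifs: mean expression similarity of those pairs}.

    gene_motifs:            {gene id: set of motif ids found upstream of the gene}
    expression_similarity:  {frozenset({gene_a, gene_b}): similarity value}
    """
    buckets = defaultdict(list)
    for a, b in combinations(sorted(gene_motifs), 2):
        n_shared = len(gene_motifs[a] & gene_motifs[b])
        buckets[n_shared].append(expression_similarity[frozenset((a, b))])
    return {n: sum(values) / len(values) for n, values in sorted(buckets.items())}
```

If the means rise with the number of shared motifs, the data agree with a positive correlation. A formal test, for example a rank correlation over all pairs, would be needed to establish significance.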
In summary, the motifs reported in this study provide researchers with a prioritized list of candidates for a gene of interest and can guide experimental designs for numerous sorghum and rice genes. Additional grass genome projects, for instance Brachypodium distachyon, a wheat relative, and maize, are well advanced and can be expected to deliver important and information-rich comparative genome templates in the future [41]. This will enable and stimulate whole-genome comparative studies between three or more grass genome sequences. In particular, comparisons between two closely related grasses, maize and sorghum, will allow (i) access to branch-specific motifs and, at the same time, (ii) the identification of motifs common to the monocot clade.