Understanding the intricate and finely tuned process of gene regulation in vertebrate development remains a major challenge facing post-genomic research. In order to begin to understand how genomic information can coordinate regulatory processes, we have adopted an approach integrating comparative genomics and a medium-throughput functional assay. Nearly 1,400 non-coding DNA sequence elements were identified that exhibit extreme conservation throughout the vertebrate lineage. Despite a degree of overlap, less than half of the non-coding ultra-conserved regions (109 out of 256) identified from the mouse and human genomes [21] are present in this set. Most, if not all, of the CNE sequences appear to be associated with genes involved in the control of development, many of them transcription factors. A significant proportion of genes identified in this study are homologous to those identified in the sea urchin and other invertebrates as master regulators of early development, leading us to believe that they interact in GRNs. Consequently, it is extremely likely that the CNEs identified compose at least part of the genomic component of GRNs in vertebrates, acting as critical regions of regulatory control for their associated genes. Such regions would mediate up- or down-regulation of expression, effecting a cascade of downstream events.
In agreement with current GRN models, and given the function of many of the genes we have identified in our analysis, it is logical to speculate that CNEs consist of modules of binding sites for transcription factors. However, the model of CNEs as transcription factor binding sites, even for large numbers of transcription factors, does not fully explain their high sequence identity across vertebrates, given that transcription factor binding sites are generally rather short and exhibit a level of redundancy. Consequently, we have not ruled out the possibility that the CNEs may have a completely different mode of action or act in numerous different ways.
The relative positions and order of CNEs within a cluster are completely conserved in all vertebrate genomes we have analysed (generally mouse, rat, human, and Fugu), together with some degree of proportional compaction in the Fugu genome. This suggests that the CNEs might play a role in structuring the genomic architecture around trans-dev genes, which in turn may lead to an additional level of transcriptional control. Further evidence that genomic architecture may be important comes from the fact that the trans-dev genes are generally located in regions of low gene density.
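The two observations above, conserved order and proportional compaction, can be illustrated with a minimal sketch. The coordinates and element names below are hypothetical and are not taken from our dataset; the functions only show one way such a check might be expressed.

```python
from typing import Dict


def order_conserved(pos_a: Dict[str, int], pos_b: Dict[str, int]) -> bool:
    """Return True if the shared elements occur in the same order
    (or in fully reversed order, i.e. on the opposite strand) in both genomes."""
    shared = [name for name in pos_a if name in pos_b]
    order_a = sorted(shared, key=lambda name: pos_a[name])
    order_b = sorted(shared, key=lambda name: pos_b[name])
    return order_a == order_b or order_a == order_b[::-1]


def compaction_ratio(pos_a: Dict[str, int], pos_b: Dict[str, int]) -> float:
    """Ratio of the span covered by the shared elements in genome A
    to the span covered by the same elements in genome B."""
    shared = [name for name in pos_a if name in pos_b]
    span_a = max(pos_a[n] for n in shared) - min(pos_a[n] for n in shared)
    span_b = max(pos_b[n] for n in shared) - min(pos_b[n] for n in shared)
    return span_a / span_b


# Hypothetical start coordinates (bp) for three elements of one cluster.
human = {"cne_1": 100_000, "cne_2": 400_000, "cne_3": 900_000}
fugu = {"cne_1": 10_000, "cne_2": 28_000, "cne_3": 60_000}
shuffled = {"cne_1": 10_000, "cne_2": 60_000, "cne_3": 28_000}

print(order_conserved(human, fugu))      # order preserved
print(order_conserved(human, shuffled))  # order broken
print(compaction_ratio(human, fugu))     # human span / Fugu span
```

Treating a fully reversed order as conserved is an assumption of this sketch, made so that a cluster lying on the opposite strand is not counted as rearranged.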
Alternatively, despite the lack of EST data, it is possible that CNEs are transcribed and work at the RNA level. A number of other ideas on the evolutionary mechanisms responsible for “ultra-conservation” have been suggested [21], involving decreased mutation rate, increased DNA repair, and multiply-overlapping transcription factor binding sites, but without more functional studies such hypotheses remain speculative. Whatever their mode of action, the striking degree of conservation displayed by this set of CNEs suggests they play critically important functional roles.
Having established a “map” of the major locations of CNEs in the genome, we were able to take a more sensitive alignment approach in a number of these regions in order to identify additional CNEs (rCNEs). The distinction between CNEs and rCNEs is purely a bioinformatics one, based on our search parameters, and we have no reason to believe that there is any functional distinction between the two sets of elements. We selected a number of elements (both CNEs and rCNEs) as candidates for functional analysis. Data from our functional assay of 25 elements from four different developmental genes demonstrate that a significant proportion can act as enhancers, inducing expression of a GFP reporter gene in a tissue-specific manner. The observed expression patterns differ among elements, but are reproducible for individual elements. Enhanced GFP expression domains frequently coincide with endogenous expression domains of the trans-dev gene most closely associated with a particular element, although in several instances, expression of GFP was induced in a tissue in which the most closely associated developmental gene is not normally expressed. This is not surprising because we are assaying elements out of context and individually. Thus, in our assay, we may have excluded another regulatory sequence in the region that under normal circumstances acts to silence the enhancer activity of an element in a specific tissue. Indeed GRN models would predict that a number of different regulatory regions must interact in order to precisely effect a particular spatiotemporal pattern of expression. One of our future directions will therefore be to assay the combinatorial effects of injecting a number of elements together. Alternatively, we may have associated a CNE with the wrong gene, particularly where there are two or more trans-dev genes in the same region (see below).
Whilst it is straightforward to assign CNEs unequivocally to the SOX21 and PAX6 genes based on their location in the genome, the situation is more complex for elements in the vicinity of the SHH and HLXB9 genes, which are situated in close proximity to each other in the human, rodent, and Fugu genomes. This is exacerbated by the fact that some CNEs may also be found within or around neighbouring genes. This phenomenon has been described for both the PAX6 [65] and PAX9 [32] genes, as well as for the SHH gene [30], where a long-range enhancer in the intron of a neighbouring, unrelated gene regulates SHH expression in developing limb buds and demonstrates the large genomic distances over which regulatory regions may act. This enhancer is identified as a CNE in our dataset and, despite its established mode of action, is located much closer to the HLXB9 gene (200 kb in human and 12 kb in Fugu) than to SHH (1,000 kb in human and 60 kb in Fugu). Furthermore, a number of elements are located directly 5′ of the HLXB9 gene, whilst others are found located further upstream, in introns of the next gene, KIAA0010. Although we strongly suspect that all these elements are associated functionally with the HLXB9 gene (e.g., KIAA0010_1 directs expression prominently to the notochord, an expression domain of the zebrafish HLXB9 orthologue), we cannot rule out the possibility that they may associate with the SHH gene (also expressed in the notochord), which lies a few genes downstream. There are a number of cases where a CNE cluster is located close to more than one trans-dev gene, illustrating the value of correlating endogenous expression pattern with CNE enhancer activity. However, it should be noted that in order to build GRN maps for the elements, it is desirable but not essential to know which element associates functionally with which gene.
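The difficulty with proximity-based assignment can be made concrete using the distances quoted above for the SHH limb enhancer. The sketch below is illustrative only: the nearest-gene rule is a deliberately naive heuristic, not the assignment procedure used in this study, and the function names are hypothetical.

```python
import math

# Distances (kb) from the SHH limb enhancer to each candidate gene,
# as quoted in the text above.
DIST_KB = {
    "human": {"HLXB9": 200, "SHH": 1000},
    "fugu": {"HLXB9": 12, "SHH": 60},
}


def nearest_gene(distances: dict) -> str:
    """Naive assignment: pick the gene with the smallest distance."""
    return min(distances, key=distances.get)


def compaction(gene: str) -> float:
    """Fold reduction of the enhancer-to-gene distance from human to Fugu."""
    return DIST_KB["human"][gene] / DIST_KB["fugu"][gene]


for species, distances in DIST_KB.items():
    # The naive rule picks HLXB9 in both species, although the enhancer
    # is known to regulate SHH.
    print(species, nearest_gene(distances))

# Both distances shrink by the same factor (about 17-fold) in Fugu,
# consistent with proportional compaction.
print(round(compaction("HLXB9"), 1), round(compaction("SHH"), 1))
print(math.isclose(compaction("HLXB9"), compaction("SHH")))
```

A nearest-gene rule therefore mis-assigns this enhancer in both genomes, which is why expression data are needed alongside genomic position.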
Our confidence in the correctness of our gene assignment for the elements tested in this study is borne out by the results of our functional analysis. For the elements that we have associated with PAX6 and SOX21, there is a good correlation between tissues that express the gene endogenously and tissues induced by the associated co-injected elements to express GFP, i.e., the major sites of endogenous gene expression are highly represented in our mosaically expressing embryos (e.g., eye, hindbrain, and spinal cord for PAX6; forebrain, midbrain, hindbrain, and spinal cord for SOX21; see ). However, for elements in the vicinity of the HLXB9, KIAA0010, and SHH genes, GFP expression overlaps less often with expression domains of the associated gene to which the element has been assigned. As mentioned above, this reduced correlation with endogenous expression of their “associated” genes is probably due to the difficulty of assigning genes to elements in this region of relatively high trans-dev gene density.
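One simple way to express the correlation described above is to compare the set of endogenous expression domains with the set of tissues in which GFP was observed. The sketch below is a hypothetical illustration and not the scoring scheme used in this study. The endogenous PAX6 domains are those named in the text, while the GFP domains are invented for the example.

```python
from typing import Iterable, List, Tuple


def concordance(endogenous: Iterable[str], gfp: Iterable[str]) -> Tuple[float, List[str]]:
    """Return the fraction of endogenous domains that also show GFP,
    together with the sorted list of GFP domains outside the endogenous set."""
    endogenous_set, gfp_set = set(endogenous), set(gfp)
    recovered = endogenous_set & gfp_set
    ectopic = sorted(gfp_set - endogenous_set)
    fraction = len(recovered) / len(endogenous_set) if endogenous_set else 0.0
    return fraction, ectopic


# Major endogenous PAX6 domains, as listed in the text.
pax6_endogenous = {"eye", "hindbrain", "spinal cord"}
# Hypothetical GFP domains for one co-injected element (illustration only).
gfp_domains = {"eye", "hindbrain", "notochord"}

fraction, ectopic = concordance(pax6_endogenous, gfp_domains)
print(fraction)  # fraction of endogenous domains recovered
print(ectopic)   # domains with GFP but no endogenous expression
```

Reporting the ectopic domains separately keeps the two cases discussed in the text apart: expression that matches the assigned gene, and expression that may indicate a wrong assignment or a missing silencer.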
It is likely that we have missed some developmental regulators in our whole-genome analysis owing to the stringency of our search parameters. Both the RUNX2 and WNT1 genes, for instance, share conserved non-coding sequences in humans and fish but were excluded because they failed to satisfy our stringent whole-genome search parameters. We may also have missed some elements because they were inadvertently hidden during the process used to mask coding sequence. Nevertheless, this is the first comprehensive attempt to identify the most highly conserved non-coding sequences common to all vertebrates. The use of the compact Fugu genome sequence, with its large evolutionary divergence from mammals, was critical in providing an exceptionally low degree of background noise in comparisons at the level of whole-genome and genomic regions.
As with any high-throughput approach, our functional screen has limitations. Since there is a negligible background level of GFP expression from our reporter construct alone, as well as from our other negative controls (see ), the expression we see is most likely to be directly attributable to the enhancer properties of the CNEs. However, since GFP is a relatively stable protein [67], down-regulation of expression will not be detected during the time course of this screen; instead, expression of GFP by a particular cell indicates that expression was stimulated at some previous point in that cell's lineage. False negatives are a further limitation of the assay, e.g., tissues that develop from few cells will be under-represented and late-developing tissues or cell types (after 24 h) will be missed completely in this screen, since there is a delay between the time of onset of GFP transcription and the time when GFP fluorescence is detectable.
The proportion of screened embryos that showed GFP expression varied from around 4% (SOX21_21) to around 44% (SHH_6); this is probably due to many factors, e.g., variations in the embryonic stage at the time of injection and stochastic variations from embryo to embryo with regard to which cells the injected DNA is segregated into during cleavage. However, by combining expression data from a number of expressing embryos (an average of 30 embryos per positive element), we can gain insight into the overall pattern of reporter gene expression prescribed by each element.
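The pooling step described above can be sketched as follows. This is a minimal illustration under assumed inputs: each expressing embryo is represented as the set of tissues in which GFP was seen, and the embryo data shown are hypothetical rather than results from our screen.

```python
from collections import Counter
from typing import Dict, Iterable, List


def tissue_frequencies(embryos: List[Iterable[str]]) -> Dict[str, float]:
    """For one element, return the fraction of expressing embryos that
    show GFP in each tissue. Each embryo counts at most once per tissue."""
    counts = Counter(tissue for tissues in embryos for tissue in set(tissues))
    total = len(embryos)
    return {tissue: count / total for tissue, count in counts.items()}


# Hypothetical mosaic embryos scored for a single element.
embryos = [
    {"notochord"},
    {"notochord", "muscle"},
    {"muscle"},
    {"notochord"},
]

frequencies = tissue_frequencies(embryos)
print(frequencies["notochord"])  # 3 of 4 embryos
print(frequencies["muscle"])     # 2 of 4 embryos
```

Because each mosaic embryo reveals only part of the pattern, tissues that recur across many embryos are better supported than those seen once.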
In addition to seeing GFP expression in “expected” domains (with respect to the associated gene), GFP expression was also often detected in tissues in which the associated gene is not normally expressed (e.g., muscle cells for SHH_6 and notochord for SOX21_1; see ). This might be due to incorrect association of gene to element (see above); alternatively, it might reflect the importance of genomic context for function of CNEs and rCNEs. It is possible that certain regions of the genome function as silencers or suppressors, repressing the transcription-stimulating activity of enhancer elements. In our assay we are testing the autonomous enhancing function of our CNEs independent of their normal genomic context. Whilst this enables us to screen rapidly for function in an unconstrained context, it might also result in a loss of the endogenous negative constraints. It will be interesting to determine the combinatorial language of CNEs, and to uncover the importance of genomic context for their function.
Conserved non-coding sequences are likely to function as negative as well as positive regulatory elements. Indeed, it is possible for a conserved non-coding element to act as either an enhancer or repressor of transcription depending on what factors are bound to it [68]. Whether any of our CNEs can function as negative regulatory elements is an interesting question that is beyond the scope of the present study.
Zebrafish are the ideal model vertebrate for this screen. These sequences are, by definition, highly similar between mammals and fish, and the data generated are therefore relevant to any vertebrate. Given that CNE DNA can easily be generated from any vertebrate species (given its high degree of sequence identity), subtle differences between CNE sequences may be tested functionally in this system. Zebrafish embryos are both readily produced and easily visualised, allowing convenient live screening throughout development. Their transparency makes the embryos ideally suited to GFP analysis and the problems associated with mosaicism in this screen are relatively easily overcome by injecting large numbers of embryos. Technical advances, such as the use of meganuclease injection, may facilitate this further.
The combination of a comparative genomics approach together with functional screening of conserved elements produces a large and complex dataset. Efficient access to, and integration and interrogation of, these bioinformatics and functional data are crucial, and of increasing interest to the scientific community, in beginning to characterise GRNs in vertebrates. To this end, we have submitted all CNE DNA sequences from Fugu to the EMBL nucleotide database and are developing a publicly available relational database in order to store, curate, and analyse data from this study as well as data generated from ongoing identification and characterisation of rCNEs surrounding trans-dev genes.
We have identified an important set of highly conserved non-coding vertebrate sequences that associate with developmental regulators and have provided evidence that at least some of them demonstrate regulatory function. They are likely to be implicated in genetic disease, as has already been shown for the SHH gene [30]. Their distal location from coding sequence, often megabases away, makes them strong candidates as causative agents in position effect and breakpoint disorders [69]. They are amongst the most highly conserved of all sequences in vertebrate genomes yet they are completely unrecognisable in invertebrates. Given their strong association with genes involved in developmental regulation, they are most likely to contain the essential heritable information for the coordination of vertebrate development.