In our previous work, we calculated a relative distance difference (RDD [5]; Methods) value to assess whether the genomic distances between consecutive UCEs change less than those between other adjacent genomic elements (i.e. genes and exons) in the human, mouse and dog genomes. The analysis showed that, in addition to an extreme level of sequence conservation, UCEs also display strong conservation of mutual genomic distances among mammalian species [5]. The conservation of distance between pairs of UCEs [1] is also found between evolutionarily more distant vertebrates, but the distributions of RDD values show persistent and distinct two-peak profiles in all mammal – non-mammal comparisons, with one peak close to zero and another at a more negative value [5]. Only a small number of UCEs [1] was used in the previous analysis; a larger data set is therefore warranted to validate these findings.
To facilitate this investigation, we constructed a data set integrated from three independent works [1]. A direct element-by-element comparison shows that two-thirds of the non-exonic UCEs from data set [1] do not overlap with HCEs from either of the two other data sets [see Additional file 1]. The smallest data set of ~1,400 conserved non-coding elements (CNEs) [2] had the highest fraction of overlaps (~80%) with data sets [1] and [3], whereas the set of ultraconserved regions (UCRs) [3] has ~50% overlap with the others. We combined these three published data sets [1] to form an integrated data set consisting of 7,570 distinct highly conserved elements (HCEs) in the human genome. We used BLASTn with non-stringent parameters, together with criteria for order and genomic distance conservation, to locate all occurrences of the same HCEs in the mouse, rat, chicken, frog, zebrafish, fugu and tetraodon genomes [see details in Methods; Additional file 2]. The resulting number of orthologous HCEs that can be located uniquely in the different genomes is variable: more than 95 percent of human HCEs could be anchored to the rodent genomes, 71 percent to the chicken genome, and around 24 to 30 percent to the fish genomes [see Additional file 3]. In the comparisons with the human genome, more than 99 percent of HCEs were found linked with at least one other HCE in all other genomes, including linkage with a considerable number of HCEs in the fish genomes. More than sixty percent of HCEs were found ordered together with at least 5 individual elements [see Additional file 4], which indicates a tendency for HCEs to preserve their order among vertebrate species.
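The element-by-element comparison of data sets described above can be illustrated with a minimal sketch. It assumes each element is a half-open (chromosome, start, end) interval and counts an element as overlapping if it shares at least one base with any element of the other set; the function names, the overlap criterion and the toy coordinates are illustrative assumptions, not the procedure used for Additional file 1.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def overlap_fraction(query, reference):
    """Fraction of query elements that overlap at least one reference element."""
    if not query:
        return 0.0
    hits = sum(1 for q in query if any(overlaps(q, r) for r in reference))
    return hits / len(query)


# Toy coordinates (hypothetical, for illustration only)
set_a = [("chr1", 100, 300), ("chr1", 1000, 1200), ("chr2", 50, 250)]
set_b = [("chr1", 250, 400), ("chr2", 300, 500)]

frac_a_in_b = overlap_fraction(set_a, set_b)  # one of three elements in set_a overlaps set_b
frac_b_in_a = overlap_fraction(set_b, set_a)  # one of two elements in set_b overlaps set_a
```

The fraction is asymmetric by construction, which is consistent with a small data set showing a high overlap fraction against larger ones while the reverse fraction is much lower.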
We calculated RDD values between pairs of HCEs and compared them with RDD values for pairs of genes and for exons of these genes. Similar to what has been reported for mammalian comparisons [5], the absolute relative distance differences (|RDD|) were significantly lower for HCE pairs than for pairs of genes or exons [see Additional file 5; Wilcoxon unpaired test, p value < 2.2e-16]. The median |RDD| for HCE pairs in the human-chicken comparison was 0.46, which is about half that for gene pairs (0.91) and exon pairs (0.95) [see Additional file 5]. The difference between distance conservation of HCE pairs and gene pairs is most pronounced for the human – zebrafish comparison, in which the median |RDD| for HCE pairs is only 32 percent of the median |RDD| for gene pairs. HCE-HCE absolute distance differences are also significantly smaller than exon-exon distance differences (within genes); the latter differ only slightly from the gene-gene relative distance differences.
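As a rough illustration of how median |RDD| values can be compared between element classes, the sketch below assumes a symmetric form of the relative distance difference: the query distance minus the human distance, divided by the mean of the two. The exact RDD formula is given in the Methods and in [5] and is not reproduced in this section, so this form is an assumption; it is chosen only because it is bounded between -2 and 2, which is compatible with the peak at -1 to -2 described in the text. All distances below are hypothetical.

```python
from statistics import median


def rdd(d_human, d_query):
    """Relative distance difference (ASSUMED form): (d_query - d_human) / mean distance."""
    mean_distance = (d_human + d_query) / 2
    if mean_distance == 0:
        raise ValueError("both distances are zero; RDD is undefined")
    return (d_query - d_human) / mean_distance


def median_abs_rdd(distance_pairs):
    """Median |RDD| over (human distance, query distance) tuples."""
    return median(abs(rdd(h, q)) for h, q in distance_pairs)


# Hypothetical distances in bp: (human, query genome)
hce_pairs = [(1000, 1000), (2000, 1800), (5000, 4000)]
gene_pairs = [(10000, 2000), (40000, 5000), (8000, 8000)]

hce_median = median_abs_rdd(hce_pairs)
gene_median = median_abs_rdd(gene_pairs)
ratio = hce_median / gene_median  # below 1 when HCE distances are better conserved
```

Under this assumed form, a pair whose distance shrinks strongly in a compact genome approaches -2, while a pair with an unchanged distance gives exactly 0.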
The RDD distribution profiles were also markedly different for the three pair comparisons (HCE-HCE, gene-gene, and exon-exon). The RDD distributions for HCE pairs show distinct two-peak profiles, with one peak close to zero and another at a more negative value. RDD values for gene pairs, in contrast, show only one peak skewed toward more negative values. The distributions of exon RDD values are wider than those for both HCE and gene pairs. The distributions of all three data types show a peak at relatively low RDD values (-1 to -2) for all four human – chicken/fish comparisons [see Additional file 6]. However, the distribution of HCE RDD values consistently shows an additional, dominant peak around zero, indicating the existence of a subset of HCE-HCE pairs whose distances have been conserved across vertebrate evolution. Even for Fugu and Tetraodon, whose genome sizes are only around 13 percent of that of the human genome, the result indicates that around 30 percent of the analyzed HCE pairs have largely unaltered distances (i.e. |RDD| within ± 0.116~0.409) compared to the human genome [see Additional file 7].
A total of 403 HCE pairs are shared by the five non-mammalian species and the human genome. The two-peak distribution profiles of RDD values still persist, with one peak close to zero and another peak (or 'shoulder') at a more negative value, as shown by mapping this integrated data set of HCEs onto the five non-mammalian genomes (Figure ). Most of these common HCE pairs are unique and linked with each other in the query genomes, as they are in the human genome [see Additional file 8]. HCEs have been reported to be unique and clustered in the human genome [1]; here we see a similar tendency in the non-mammalian genomes.
RDD distribution for HCE pairs common to the five genomes.
Inter-HCE regions with distinctive distance conservation patterns
In addition to the persistent two-peak distribution profiles, a remaining question is whether any other characteristics pertain to the regions bounded by the HCE pairs common to the human and five non-mammalian genomes. To test this, we divided the 403 HCE pairs into two groups using a partitioning clustering method based on the matrix of absolute RDD (|RDD|) values for the human – non-mammalian comparisons (Methods). RDD values of group one HCE pairs are centered around zero (Figure ), whereas those of group two are more widely scattered around a more negative value (Figure ). The distances between group two pairs (mean 46 Kb) are significantly longer than the distances between group one pairs (mean 2.8 Kb) [see Additional file 9; Wilcoxon test p value = 2.2e-16]. The |RDD| value of two consecutive HCEs has been reported to be positively correlated with the distance between the pair [5]; we see here a reflection of the same correlation. We call the inter-HCE regions IHRs and classify them into two types based on this partitioning result [see Additional file 10]. We obtained 188 IHR1s (IHRs bordered by HCE pairs with relatively small |RDD| values) and 215 IHR2s (IHRs bordered by two consecutive HCEs with larger |RDD| values). All 403 HCE pairs are also detected in the rodents. An intriguing observation is that, for any pair-wise comparison among the eight genomes, the median |RDD| values for HCE pairs of IHR2s are consistently much higher than those of IHR1s [Figure , see Additional file 11]. Given the persistence of this distinct distance conservation in the two groups of IHRs, it is difficult to assume that such a profile is the result of random assortment. Rather, it seems more likely that subsets of HCE pairs have followed different evolutionary paths with respect to genomic distance conservation.
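The two-group partition of HCE pairs can be sketched as follows. The specific partitioning method is described in the Methods and is not restated here, so this example substitutes a simple stand-in: an exhaustive 2-medoid search with Manhattan distance over rows of |RDD| values (one row per HCE pair, one column per human – non-mammalian comparison). The function names and the toy matrix are hypothetical.

```python
from itertools import combinations


def manhattan(u, v):
    """Sum of absolute coordinate differences between two |RDD| rows."""
    return sum(abs(a - b) for a, b in zip(u, v))


def two_medoids(rows):
    """Exact 2-medoid partition found by trying every pair of rows as medoids."""
    n = len(rows)
    dist = [[manhattan(rows[i], rows[j]) for j in range(n)] for i in range(n)]
    best_cost, best_pair = None, None
    for i, j in combinations(range(n), 2):
        # Cost = total distance from every row to its nearer medoid
        cost = sum(min(dist[k][i], dist[k][j]) for k in range(n))
        if best_cost is None or cost < best_cost:
            best_cost, best_pair = cost, (i, j)
    i, j = best_pair
    labels = [0 if dist[k][i] <= dist[k][j] else 1 for k in range(n)]
    return best_pair, labels


# Hypothetical |RDD| matrix: rows are HCE pairs, columns are five comparisons
abs_rdd_matrix = [
    [0.05, 0.10, 0.08, 0.12, 0.09],
    [0.02, 0.04, 0.06, 0.03, 0.05],
    [0.10, 0.15, 0.11, 0.09, 0.14],
    [1.20, 1.50, 1.40, 1.60, 1.55],
    [0.90, 1.10, 1.30, 1.45, 1.35],
    [1.40, 1.30, 1.50, 1.70, 1.65],
]

medoids, labels = two_medoids(abs_rdd_matrix)
```

In this toy matrix the first three rows (small |RDD| in every comparison) fall into one group and the last three into the other, mirroring the split into pairs bordering IHR1s and IHR2s. The exhaustive search is only practical for small inputs; an iterative partitioning algorithm would be the usual choice for larger matrices.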
Median |RDD| for HCE pairs of IHRs. The median |RDD| of IHR2s was much higher than that of IHR1s for every pair-wise genome comparison.
We subsequently defined IHRs as intergenic when their flanking HCEs are intergenic in the human genome with no genes in between. For both IHR1 and IHR2 groups, intergenic IHRs are more numerous than those of any other category, accounting for 40 percent of IHR1s and 49 percent of IHR2s, respectively (Table ). We further calculated genomic distances between intergenic IHRs and their closest neighboring genes. Using the distance to the closest gene for statistical analysis, the average distance is 113 Kb for intergenic IHR1s and 150 Kb for intergenic IHR2s (Table ). A high percentage of intergenic IHRs are more than 10 Kb away from the nearest genes [see Additional file 12].
Number of HCE pairs with different genomic locations
Distance between intergenic IHRs and their nearest genes.
We also identified a few human genomic regions that are spanned by the same type of IHRs, indicating that the distance variation of HCEs in these regions is probably associated [see Additional file 13]. An intriguing observation is that ten IHR1s are clustered in a region close to 1 Mb, and the corresponding eight HCE pairs are all located in intergenic regions.
Enrichment of DNA repeat sequences
The human genome has a much greater proportion of repeat sequences, and a correlation between genome size and repeat content is believed to exist. We therefore asked whether there are any differences in the enrichment of human DNA repeat sequences between IHRs and random genomic regions. As both the number and length of the two groups of IHRs are different, we used randomly selected regions to test the significance. Repeat sequences appear more frequently in IHR2s than in IHR1s. Compared with the corresponding sets of random regions, repeat sequences are significantly less frequent in IHR1s (43 percent versus 74 percent for the random background; Table ; p value < 0.001) and slightly more frequent in IHR2s (97 percent versus 94 percent for the random background; Table ; p value = 0.052). Here, we found a correlation between repeat sequences and the length expansion of IHRs. The smaller fraction of IHR1s containing repeat sequences may reflect evolutionary pressure against either transposon-derived sequence in these regions or the distance-distorting effect of including longer repeat sequences between the bordering HCEs, thereby maintaining the shorter IHR1 length.
Percentage of repeated base pairs within IHRs.
We also found that both types of IHRs contain significantly fewer SINE (4.3% for IHR1s, 11.0% for IHR2s), LINE (2.4% for IHR1s, 13.4% for IHR2s) and LTR (0.6% for IHR1s, 4.7% for IHR2s) sequences than the random backgrounds (Table ; p value < 0.001); however, both types of IHRs are significantly enriched in low complexity DNA sequences (4.9% for IHR1s, 0.7% for IHR2s) (Table ; p value < 0.001 for IHR1s; p value = 0.016 for IHR2s). We also tested the enrichment of long transposon-free regions (TFRs) in IHR1s and IHR2s. TFRs have been reported to be associated with both protein coding genes and UCEs [12]. Of the 188 IHR1s, 60 percent intersect with TFRs (2.6% for the random background), and 52 percent of the 215 IHR2s intersect with TFRs [see Additional file 14; 12% for the random background]. Both groups of IHRs show a significant enrichment of TFRs compared with randomly selected regions, indicating a complex relationship between TFRs and distance conservation.
Unexpected enrichment of indel variation
Since HCEs are highly conserved not only at the sequence level but also in their genomic organization (e.g. order and distance), we suspected that IHRs might not tolerate extensive rearrangements. We therefore asked whether there are any differences in the distribution of human indel (i.e. insertion and deletion) polymorphisms in the IHRs.
Mills et al. recently identified a set of small indels from three different human populations. As a negative control, we used randomly selected genomic regions matched in number and length to the corresponding IHR1s and IHR2s, respectively. The frequency with which the random samples had higher average scores than the IHRs provided the basis for the statistical significance. Neither IHR type is depleted in small indels, and IHR2s are actually significantly enriched. We found that 16 percent of IHR1s (30; p value = 0.241, 27 for the random background) versus 81 percent of IHR2s (174; p value < 0.001, 156 for the random background) contain small indels [Table ; see Additional file 15]. Neither result matches expectation: considering the highly conserved length of IHR1s, fewer IHR1s than random background regions would be expected to contain indels, and about as many IHR2s as randomly selected regions would be expected to contain indels. For the regions with indels, we calculated the percentage of insertion/deletion base pairs over the whole length of the corresponding IHRs, and found no significant differences for either type of IHR compared with the randomly selected human genomic regions [Table ; see Additional file 15]. Previous work has suggested that genome-wide indel rates are not uniform and that indel events are not neutral [14]. Investigations of human indels indicated that most indels have arisen from the most recent variation events [15]. Despite the overrepresentation of indels in human IHRs, the fact that the length of IHRs remains more highly conserved among vertebrate genomes than the distance between gene or exon pairs suggests that the distance between consecutive HCEs is under strong selection pressure and is important for HCEs to exert their biological function.
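The random-background comparison used in this section can be sketched as an empirical, one-sided test. The example below assumes a single linear toy genome, uses the fraction of regions containing at least one indel as the score, and draws length-matched random regions uniformly; the iteration count, the seed, the scoring function and all coordinates are assumptions for illustration and are not taken from the Methods.

```python
import random
from bisect import bisect_left


def fraction_with_feature(regions, sorted_features):
    """Fraction of half-open (start, end) regions holding at least one feature."""
    hits = 0
    for start, end in regions:
        i = bisect_left(sorted_features, start)  # first feature at or after start
        if i < len(sorted_features) and sorted_features[i] < end:
            hits += 1
    return hits / len(regions)


def empirical_p_value(regions, features, genome_length, n_iter=1000, seed=0):
    """Share of length-matched random region sets scoring >= the observed set."""
    rng = random.Random(seed)
    feats = sorted(features)
    lengths = [end - start for start, end in regions]
    observed = fraction_with_feature(regions, feats)
    at_least = 0
    for _ in range(n_iter):
        sample = []
        for length in lengths:
            start = rng.randrange(genome_length - length + 1)
            sample.append((start, start + length))
        if fraction_with_feature(sample, feats) >= observed:
            at_least += 1
    return observed, at_least / n_iter


# Hypothetical indel positions and regions on a 1 Mb toy genome
indels = [1000, 2000, 3000, 4000, 5000]
ihrs = [(990, 1010), (1990, 2010), (2990, 3010), (3990, 4010), (4990, 5010)]

observed_fraction, p_enriched = empirical_p_value(ihrs, indels, 1_000_000)
```

An empirical p value of zero from such a procedure is normally reported as being below one over the number of iterations. Testing for depletion rather than enrichment would count the random sets scoring at or below the observed value instead.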
Enrichment of human indels within IHRs.
Conserved sequences within IHRs
A previous observation is that |RDD| and sequence conservation are to some extent positively correlated [5]. We used the data sets of phastCons elements provided by the UCSC online server to test the conservation characteristics within the IHRs. Because no phastCons data are presently available from the UCSC online service for Tetraodon and Fugu, these two genomes were excluded from the sequence conservation analysis.
The correlation between the percentage of conserved sequence and human IHR length is stronger for IHR1s than for IHR2s. The conservation percentage is below 50 percent in almost all IHR2s, even in short IHR2s with lengths close to those of IHR1s (Figure ). Among the IHR1s, some have a high percentage of conserved DNA sequence, whereas others do not. Considering the generally high degree of distance conservation of the IHR1s, their length might have been under a higher level of evolutionary constraint than the DNA sequences within the regions.
Correlation between the percentage of conserved sequence and the length of the two groups of IHRs. Circles represent the data for IHR1 and stars for IHR2.
In the human genome, the average length of conserved elements is nearly the same in IHR1s and IHR2s (73 bp for IHR1s and 76 bp for IHR2s, respectively; Table ). However, the average distance between two consecutive conserved elements is about 2.7 times longer in IHR2s than in IHR1s (327 bp for IHR1s and 894 bp for IHR2s; Table ), resulting in a lower average sequence conservation for IHR2s. The same tendency was found for the other three genomes. The phastCons element data were derived by a multiple species alignment algorithm [17], and the length of a given conserved fraction varies little across the compared species and therefore contributes little to the distance differences between species. No significant difference was observed in the length distribution of conserved fractions between the two groups of IHRs in the human genome (Table ; p value = 0.1798), indicating that potentially functional sequences of similar length, with lower sequence conservation, occupy the space of both groups of IHRs.
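The reasoning above (similar element lengths, longer gaps, hence a lower conserved percentage) can be made concrete with a small sketch. It assumes conserved elements within one IHR are given as sorted, non-overlapping half-open (start, end) intervals; the function names and the two toy regions are hypothetical and were chosen only to roughly echo the reported averages.

```python
def element_stats(elements):
    """Mean element length and mean gap between consecutive elements (bp)."""
    lengths = [end - start for start, end in elements]
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(elements, elements[1:])]
    mean_length = sum(lengths) / len(lengths)
    mean_gap = sum(gaps) / len(gaps) if gaps else None
    return mean_length, mean_gap


def conserved_percentage(elements, region_start, region_end):
    """Percentage of the region covered by conserved elements."""
    covered = sum(end - start for start, end in elements)
    return 100.0 * covered / (region_end - region_start)


# Hypothetical conserved elements inside one short and one long region
ihr1_elements = [(0, 70), (400, 475), (800, 874)]
ihr2_elements = [(0, 76), (970, 1046), (1940, 2016)]

ihr1_stats = element_stats(ihr1_elements)  # similar element length, short gaps
ihr2_stats = element_stats(ihr2_elements)  # similar element length, long gaps
ihr1_pct = conserved_percentage(ihr1_elements, 0, 900)
ihr2_pct = conserved_percentage(ihr2_elements, 0, 2100)
```

With element lengths held nearly constant, the conserved percentage is driven almost entirely by the spacing between elements, which is the pattern described for IHR2s relative to IHR1s.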
Length of conserved fractions and distance between two consecutive conserved fractions within IHRs.
IHRs and CpG islands
Both groups of IHRs are significantly enriched for CpG islands compared with the corresponding random backgrounds in the human genome: about 10 percent of IHR1s (0.5% for the random background) and 14 percent of IHR2s (2.3% for the random background) were found to contain CpG islands [see Additional file 16, Additional file 17; p value < 0.001]. We further examined the percentage of IHR length occupied by CpG islands, which averages 45% for IHR1s and 7% for IHR2s; this difference between IHR1s and IHR2s is significant [see Additional file 16; Wilcoxon test, p value < 5.5e-06].
For both IHR1s and IHR2s with CpG islands, the pair-wise genomic location classes of the flanking HCEs are significantly under-represented only in the "intronic-intronic" class [see Additional file 18; Hypergeometric test, p value = 0.0024]. This is readily understood: exonic sequences reside between such HCEs, and promoter elements (i.e. CpG islands) are less likely to be located in exonic regions. We next checked the genomic environment of the intergenic-intergenic IHRs with CpG islands: eleven intergenic IHR1s and fifteen intergenic IHR2s were found to contain CpG islands (Table ). Eight intergenic IHR1s with CpG islands are more than 8 Kb away from the closest gene. Of the fifteen intergenic IHR2s with CpG islands, only three lie less than 10 Kb from the nearest gene. A high percentage of intergenic IHRs are more than 10 Kb away from the nearest genes.
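A hypergeometric test for under-representation of one location class among IHRs with CpG islands can be sketched as a lower-tail probability. The mapping of quantities is an assumption: the population is all IHRs, the successes are IHRs of the class of interest, the draws are IHRs containing CpG islands, and k is the number of CpG-containing IHRs in that class. The per-class counts are not given in this section, so the numbers below are purely a toy example.

```python
from math import comb


def hypergeom_lower_tail(k, population, successes, draws):
    """P(X <= k) for X ~ Hypergeometric(population, successes, draws)."""
    lowest = max(0, draws - (population - successes))  # smallest possible X
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(lowest, k + 1)
    )
    return favourable / comb(population, draws)


# Toy counts (hypothetical): 10 regions, 4 in the class, 3 drawn, 0 drawn from the class
p_toy = hypergeom_lower_tail(0, 10, 4, 3)
```

A small lower-tail probability indicates that the class appears among the drawn regions less often than expected by chance; the upper tail would be used to test for enrichment instead.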
HCEs are frequently found in relatively gene-poor regions [3], and their distances are conserved among the mammalian genomes [5]. Our data show that IHRs shared by the six vertebrates are also enriched in gene-poor regions of their genomes. CpG islands are generally associated with human promoters [18], and most promoter-associated CpG islands that have been reported are located within 2 Kb of transcription start sites [19]. The enrichment of CpG islands in the IHRs over the random background genomic regions suggests that potential target genes may exist; the long distance between the IHRs and the nearest gene indicates either that putative targets might be located within a wider genomic range, or that the CpG islands residing in the IHRs, together with the two flanking HCEs, could perform important roles as regulatory blocks or in other, unknown functions.