Our new measurement of sequence conservation calculates the sequence similarity based on conserved segments. To obtain the conserved segments, we apply the local alignment software CHAOS [16
] to a pair of human-mouse orthologous non-coding sequences (see Methods for details). The aligned human-mouse segments outputted from CHAOS are called conserved segments. Note that these conserved segments may not be in the same order as that in the input human-mouse sequences (Figure ). We use CHAOS instead of other local alignment software because CHAOS has been shown to correctly align regulatory elements in distant species [16
] in long sequences. We do not use global alignment methods to define conserved segments because a recent study has shown that three most popular methods cannot align a portion of coding regions consistently [7
]. Moreover, conserved regions with ICS are difficult to align well by one single alignment.
With this definition of conserved segments, we implement the following three-step procedure to calculate the conservation score for an m-kilobase (kb) long human region. Assume this human region is in the non-coding region of the gene H1. The ortholog of H1 in the mouse genome is M1. First, we apply the CHAOS software to identify conserved segments in the non-coding region of M1, by comparing this m-kb long human region with the non-coding region of M1. Here and in the whole paper, the non-coding sequence of a gene includes the upstream sequences until the closer endpoint of the 5' adjacent gene, the intron sequences of this gene, and the downstream sequences until the closer endpoint of the 3' adjacent gene. Depending on which codon is closer to this gene under consideration, the endpoint could be the start codon or the stop codon of the adjacent genes. Second, for any m-kb long mouse region starting from a conserved mouse segment, we calculate the score of the mouse region and the m-kb long human region, by summing the scores of the aligned conserved segments within this pair of m-kb long regions. We obtain the score of a pair of aligned conserved segments from the CHAOS output. Third, we define the conservation score of the m-kb long human region as the best score obtained at the second step. The corresponding m-kb long mouse region with the best score is claimed as the mouse homologous region of the human region. If the above H1 has multiple mouse orthologs, we will use the non-coding regions of all of the orthologs to carry out the above three-step procedure.
We applied the above procedure to the 172 TFBS-clustered regions and 1470 DHS regions. For each human region, we identified the best mouse homologous region. In the following, we described our observations regarding these homologous regions and compared these regions with the homologous regions defined by the UCSC genome browser and those defined by BLAST and MegaBLAST [15
More than 17.0% human functional regions contain ICS
We found that many human functional regions contain ICS instead of contiguously conserved sequences. In 121 (70.3%) TFBS-clustered functional regions and 250 (17.0%) DHS functional regions, there are two or more ICS that are separated by divergent sequences (Figure ). In 112 of the 121 (92.6%) TFBS-clustered regions and in 162 of the 250 (64.8%) DHS regions, the overall sequence similarity between the human functional region and the corresponding mouse homologous region is less than 70%. Note that the homologous regions based on the new measurement often comprise the best mouse BLAST hits and the best MegaBLAST hits when using the corresponding human regions as queries against the mouse sequences, which shows that the homologous regions are most likely orthologous [14
]. It is thus clear that the current conservation measurement that requires contiguous sequence similarity may claim that 11% (17%*64.8%) of these functional regions are not conserved. That is, without a new measurement of sequence conservation, at least 11% of the conserved functional regions would be missed by the current methods.
Figure 2 The number of conserved segments in the human functional regions. The x-axis is the number of conserved segments. The y-axis is the percentage of the regions that contain the corresponding number of conserved segments. Here the percentage refers to the (more ...)
To see what may contribute to the number of ICS in functional regions, we checked whether the number of ICS in a functional region is determined by the length of the functional region. First, for the TFBS-clustered regions, the length of the region has a low correlation coefficient, 0.516, with the number of the ICS in the region. Moreover, the longest regions do not have the largest numbers of the ICS (Figure ). This may suggest that the ICS represent the intrinsic functional sequences while the sequences between the adjacent ICS represent the non-functional sequences. Second, for the DHS regions, it is clear that length of a region does not determine the number of ICS in the region. For the DHS regions, the average length and the median length of the regions is 393 bp and 251 bp, respectively, which is similar to the size of many known CRMs. However, we found that 250 (17.0%) DHS regions still contain two or more ICS. Moreover, 26.4% of the DHS regions containing two or more ICSs are shorter than the average length of the DHS regions. Note that, by chance, it is common for a long region of a few thousand bps to contain several ICS and it is not usual for a short sequence of 393 bp to contain two or more ICS. Thus, the observation from the DHS regions clearly indicates that the ICS may be functional segments in these regions. Because these 26.4% of the DHS regions are short, most likely these regions are functional units and the conserved segments in each of these units work together to perform a function.
Figure 3 The length versus the number of conserved segments in a region. The x-axis is the length of the functional regions and y-axis is the number of conserved segments found in these regions. (A) TFBS-clustered functional regions. (B) DNase I Hypersensitivity (more ...)
We next analyzed the length and the conservation of ICS in the functional regions (Figure ). We noticed that, in 57 (33.1%) TFBS-clustered regions and in 905 (61.6%) DHS regions, the average length of the ICS are less than 50 bp. Meanwhile, for the TFBS-clustered regions, the average similarity of the ICS compared with their aligned counterpart, defined as the percentage of the identities based on the CHAOS alignments, is 83.7%, with 68.1% as the minimum and 94.1% as the maximum. For the DHS regions, the average similarity of the ICS compared with their orthologous counterpart, is 81.3%, with 61.5% as the minimum and 100% as the maximum. From the length and the conservation of the ICS in functional regions (Figure ), it is clear that the length distributions of the ICS in TFBS-clustered regions and in DHS regions are on the same scale. That is, the length of the ICS is not related to the length of the whole functional regions, which partially agrees with the observation that the number of the ICS in a region does not depend on the length of the regions.
The distribution of the length and the overall percentage of identities of the conserved segments. (A) and (B) are for TFBS-clustered regions. (C) and (D) are for DHS regions.
We also investigated whether these discontiguously conserved regions are biologically meaningful. We scanned the discontiguously conserved regions using the known motifs in the TRANSFAC database and using stringent score cutoff to define TFBSs (p-value < 0.0001). We found that ICS in both the TFBS-clustered regions and the DHS regions contain conserved TFBSs. On the other hand, we did not find conserved TFBSs in the sequences between adjacent ICS in these regions. For instance, we found seven conserved TFBSs in the two ICS in the DHS region id-1244 (chr11:130824648-130824895). In another example, we found more than 3 TFBSs on average in each of the seven ICS in the TFBS-clustered region id-211591 (chr1:149712431-49714909). These putative TFBSs in the ICS support that these ICS may be responsible for the function of these regions.
Our procedure provides mouse homologous regions that are more similar to the human regions
From the above analyses, we already know that there exist a large number of conserved regions with ICS. Here we show that our procedure provides mouse homologous sequences that are more similar to the human functional sequences compared with the aligned mouse sequences from genome alignments, another advantage of the new measurement.
For every human functional region, we obtained the mouse homologous regions as above. In every mouse homologous region, we defined a CHAOS sequence as the sequence comprised of the conserved segments and other sequences at both sides to cover the entire human region. For example, the CHAOS sequence in Figure is the mouse sequence from the location S' to the location E'. We also obtained the corresponding mouse region from the UCSC genome browser, which is aligned with exactly the entire human region. We call these mouse sequences UCSC sequences (Figure ). To measure the similarity of the CHAOS sequences and the corresponding human sequences, we use LAGAN [17
] to align the CHAOS sequences with the human sequences. For the UCSC sequences, we use the alignments obtained from the UCSC genome browser. We define the similarity as the percentage of aligned identical nucleotides in the alignments.
Figure 5 A CHAOS sequence and the corresponding UCSC sequence. The two sequences between the two pairs of dotted lines are the CHAOS sequence and the UCSC sequence for the same human functional region. In more detail, in the left panel, assume S and E are the (more ...)
We found that CHAOS sequences are often more similar to the corresponding human sequences than the UCSC sequences are to the human sequences. In 139 out of 172 (80.8%) TFBS-clustered regions and in 842 out of 1470 (57.3%) DHS regions, the CHAOS sequences are more similar to the corresponding human sequences than the UCSC sequences. For these 139 TFBS-cluster regions and 842 DHS regions, the CHAOS sequences have on average 22.2% and 29.9% more identities than the UCSC sequences, respectively. This shows that the new measurement may be a better way to measure the similarity of orthologous region for non-coding sequences. It also implies that the current conservation studies may have missed many conserved regions by calculating conservation scores based on genome alignments.
Besides the above 139 TFBS-clustered regions and 842 DHS regions, we found 2 TFBS-clustered regions and 61 DHS regions for which the UCSC sequences are as similar to the human sequences as the CHAOS sequences. Moreover, in 31 (18.0%) TFBS-clustered regions and 567 (38.6%) DHS regions, the UCSC sequences are more similar to the corresponding human sequences. Note that the CHAOS sequences may be not so similar to the human sequences as the UCSC sequences, since the CHAOS sequences are identified based on the conserved segments only. In the following, we wanted to investigate whether this was the case.
We found that there were four types of functional regions where the UCSC sequence was more similar to the human sequence (Figure ). First, for 16 TFBS-clustered regions and 229 DHS regions, the UCSC sequences were not from non-coding regions of the orthologous genes. Thus, for these regions, the CHAOS sequences are in fact better than the UCSC sequence to measure the conservation. Second, in 8 of the remaining 15 TFBS-clustered regions and in 134 of the remaining 338 DHS regions, the difference between the percentages of identities in UCSC sequences and in CHAOS sequences is less than 5%. Such small differences are mostly due to the different end positions in the genome. Third, for the remaining 7 TFBS-clustered regions and the remaining 204 DHS regions, 3 TFBS-clustered regions and 197 DHS regions have only one conserved segment from the CHAOS software. These individual conserved segments have 15% higher identity than the UCSC sequences compared with the human regions, which shows that the UCSC sequences may be misleading by misaligning the most highly conserved sub-regions in functional regions. Fourth, for the remaining 4 TFBS-clustered regions and 7 DHS regions, the UCSC sequences and the CHAOS sequences do overlap more than 80%. The difference is caused by the repeats and exons. Note that the CHAOS sequences do not include repeats and exons while the UCSC sequences can include them. When the corresponding CHAOS sequences only match part of the human regions while the UCSC sequences can match the whole human regions with repeats and exons, the UCSC sequences will have better overall similarity than the CHAOS sequences although they overlap significantly.
Four types of functional regions where the UCSC sequences have a higher overall similarity to the human sequences than the CHAOS sequences.
In summary, in at least 90.1% (139+16 out of 172) TFBS-clustered regions and 72.9% (842+229 out of 1470) DHS regions, the CHAOS sequences are more similar to the human sequences than the UCSC sequences are to the human sequences, in terms of percent identity in the sequence alignments. Such a dominant performance from the new measurement confirms that genome alignments based on contiguous sequence similarity may misalign many conserved regions. For the remaining regions, although the UCSC sequences are the same or more similar to the human sequence, they often misaligned the most conserved sub-regions. Thus, it is questionable that UCSC sequences provide better counterparts in the remaining regions.
Genome alignments misaligned many functional regions
In the previous section, we have shown that many UCSC sequences are not from the non-coding regions of the orthologous genes. We have also shown that the UCSC sequences are not as similar to the human sequences as the CHAOS sequences in most regions. Moreover, in the regions where the UCSC sequences are more similar, we found that the most conserved segments in the UCSC sequences may be misaligned. We found two factors can contribute to this.
First, the genome alignment considers contiguous sequence similarity, which makes it difficult to align some local regions. For instance, due to genome rearrangements during evolution, some parts of a functional region are kept in the original 5'-3' direction while other parts are inverted to 3'-5' directions. Thus, the overall sequence similarity based on genome alignments for true orthologous regions is too low to be identified. Therefore, genome alignments may poorly align these regions across species. For instance, the DHS region id-2404 (chr16:61153038-61153304) shares 75.4% identities with the CHAOS sequence (chr8:102219277-102219500) and shares 49.8% identities with the UCSC sequence (chr8:102870981-102871214). The much lower percent identity from the UCSC sequence is due to the fact that the segment (chr16:61153109-61153157) and the segment (chr16:61153242-61153294) in this DHS region are inverted in the mouse genome. In the CHAOS sequence, the two segments are aligned with two segments chr8+:102219348-102219395 and chr8-:102219440-102219395, which occur in the positive strand and negative strand, respectively ("+" and "-" following the chromosome name mean the positive and negative strand, respectively). In the UCSC sequence, the two segments are aligned with two segments, chr8+:102871048-102871095 and chr8+:102871168-102871203, in the positive strand.
Second, the genome alignments are targeting genome scale sequence similarity and thus may sacrifice the alignment quality of short functional regions. For instance, for the DHS region id-5225 (chr5 :142205165-142205746), we found that there is a conserved segment of 101 bp long with 83% identity to its orthologous region in the mouse genome. The genome alignment at UCSC aligned this region with all gaps. It is clear that, to provide better genome scale matches, the genome browser cannot guarantee to align the corresponding sequences for short regions.
BLAST and MegaBLAST neglect many conserved segments
The comparison of the CHAOS sequences with the UCSC sequences in previous sections shows that the aligned sequences from the UCSC genome browser may be misleading when considering the evolution of a local region. Because we are trying to identify the most similar regions around an orthologous mouse gene for a human query sequence, it is also necessary to determine the difference between our approach and BLAST, the basic tool for the same purpose using contiguous sequence similarity [14
For the above 172 TFBS-clustered regions and the 1470 DHS regions, we applied BLAST with the default parameters to output hits from the non-coding regions of the orthologous mouse genes. Note that the orthologous mouse genes were obtained from the Mouse Genome Informatics database (MGI) [18
]. We found that the best BLAST hits in the mouse were always included in the corresponding CHAOS sequences, for all human regions with significant BLAST hits (the smallest E-value < 1E-10; see Table for details). The inclusion of the best BLAST hits in the CHAOS sequences supports that our procedure identified reliable "orthologous" regions when the query sequences were "conserved".
The comparison of the CHAOS sequences with the BLAST hits.
To show the benefit of measuring the sequence similarity based on conserved segments without considering divergent sequences in a region, we further examined the human regions with significant mouse BLAST hits. We found that in 71 out of 76 (93.4%) TFBS-clustered regions and 167 out of 270 (61.9%) DHS regions, there were one or more CHAOS segments that were missed by BLAST (Table ). In the remaining regions, the CHAOS segments had a one-to-one correspondence with the BLAST hits, including the hits with E-values larger than 1E-10. Note that the human query sequences are experimentally verified to be functional. It is most likely that all the ICS in such regions, especially in the short DHS regions, are working together to perform functions. Therefore, BLAST missed many ICS by considering conserved segments individually. It also implies that many significantly conserved regions could be missed by BLAST if there is no individual significant hit. On the other hand, the identified ICS in a region together may tell us new functions of the region.
We also applied discontinuous MegaBLAST [15
] to identify mouse homologous regions for these human functional regions. Discontinuous MegaBLAST is claimed to be able to identify more divergent sequence similarity than BLAST. By using a recommended parameter combination "-A 50 -t 21 -W 11 -N 1", we can only identify hits in 94 out of 172 TFBS-clustered regions and in 306 out of 1470 DHS regions (Table ). Among them, only 76 TFBS-clustered regions and 246 DHS regions have MegaBLAST hits with an E-value less than 1E-10. In all these regions, our procedure identified all MegaBLAST hits as conserved segments. In 71 out of the 76 (93.4%) TFBS-clustered regions and in 72 out of the 246 (29.3%) DHS regions, our procedure identified more conserved segments than MegaBLAST. Although the results may be affected by the default parameters we used, the fact that MegaBLAST missed conserved segments in so many regions shows that MegaBLAST cannot identify many conserved regions with ICS.
The comparison of the CHAOS sequences with the discontiguous MegaBLAST hits.
At least 12.8% human functional regions are conserved in mouse
In the previous sections, we have shown that it is necessary to extend the current conservation measurements to consider only the conserved segments in a region. Here we want to estimate the percentage of human functional regions conserved in the mouse based on our new conservation measurement and the functional regions mentioned above.
To determine whether a human region is conserved in mouse (see Methods for detail), we first calculated how conserved a random human non-coding region is in the mouse genome. Figure shows the distribution of the conservation score of a random 1 kb long human non-coding repeat-free sequence. Note that this conservation score is a sum of the similarity scores of individual CHAOS segments in a pair of 1 kb long regions. Since it is estimated that at least 3.5% of human non-coding sequences are under constraint [19
], we choose the top 3.5% cut-off in this distribution to define the conserved regions. Note that we assume the constrained sequences are conserved here, which may not be true for some short constrained sequences. Please also keep in mind that, the 3.5% may be an underestimate of the constraint sequences in the human genome [20
]. For any region with a conservation score in the top 3.5% of scores for random regions, we claim this region is conserved between human and mouse. We next calculated the conservation score of a functional human region. By comparing these two distributions, we found that about 12.8% of (22 out of 172) TFBS-clustered regions are conserved in mouse, and about 19.2% of (279 out of 1470) DHS regions are conserved in mouse.
Figure 7 The distributions of the conservation scores. The distribution of the conservation score of a random 1 kb long human non-coding region aligned with mouse sequence is plotted with a solid line. The distributions of the conservation score of functional (more ...)
If we consider the contiguous sequence similarity for the TFBS-clustered conserved regions, the percentage of sequence identity is from 46.1% to 78.0%, with a median of 67.4%. For the DHS conserved regions, the percentage of sequence identity is from 23.8% to 92.1%, with a median of 69.8%. It is thus evident that many conserved regions are neglected by the current conservation studies.