In this study we identify, consolidate, and analyze all the 100% identity repeated sequences in the human genome that have a length of at least 300 bp. The result of our analysis, the IRB, comprises around 2% of the total reference human genome, and includes potential recombinogenic sites which overlap important functional and structural elements such as SDs, common repeats, and genes.
Because almost half of the total bp in the human genome (45%) corresponds to common repeats, it is not surprising that common repeats comprise 54% of the IRB bp. We observed an enrichment of LINEs over SINEs in the IRB compared to the total genome, as well as an underrepresentation of DNA transposable elements relative to the total genome. These biases could be explained by size differences among these elements. SINEs have a length of about 100-400 bp, and the average size for all LINE1 copies (the most abundant LINE elements) is 900 bp (overall, LINEs are about 6-8 Kb long) [24]. In the same way, DNA transposon fossils range from 2-3 Kb for the autonomous type and from 80-3000 bp for the non-autonomous type [25]. Because our analysis looks for 100% identity repeats of at least 300 bp, this length threshold may lower the observed proportion of SINEs relative to LINEs, and of DNA transposons relative to their share of the total genome. Another explanation for the overabundance of LINEs over SINEs and the underrepresentation of DNA transposons might be related to the percentage identity threshold: a single mismatch can split a stretch of identical sequence into fragments shorter than the detection minimum of 300 bp, so that the algorithms used overlook the region. Therefore, SINEs, which have a shorter average size than LINEs, could be underrepresented in the IRB due to slight variations in their sequence; the same explanation could apply to DNA transposons.
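The effect of the two thresholds can be illustrated with a minimal sketch. It is not the algorithm used to build the IRB; it assumes two copies of an element that are already aligned position by position, and the function name `identical_runs`, the helper `with_mismatch`, and the toy sequences are hypothetical. It shows how one mismatch removes a 350 bp element from detection while a 900 bp element still yields identical stretches above 300 bp.

```python
def identical_runs(a: str, b: str, min_len: int = 300):
    """Return half-open (start, end) intervals where two pre-aligned,
    equal-length sequences match exactly for at least min_len positions."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    runs = []
    start = None  # start of the current run of identical positions
    for i, (x, y) in enumerate(zip(a, b)):
        if x == y:
            if start is None:
                start = i
        else:
            # A mismatch closes the current run; keep it only if long enough.
            if start is not None and i - start >= min_len:
                runs.append((start, i))
            start = None
    # Close a run that extends to the end of the sequences.
    if start is not None and len(a) - start >= min_len:
        runs.append((start, len(a)))
    return runs


def with_mismatch(seq: str, pos: int) -> str:
    """Copy of seq with a single substitution at pos (toy helper)."""
    sub = "A" if seq[pos] != "A" else "C"
    return seq[:pos] + sub + seq[pos + 1:]


# Toy elements: a SINE-sized copy (350 bp) and a LINE1-sized copy (900 bp).
short_elem = "ACGT" * 87 + "AC"   # 350 bp
long_elem = "ACGT" * 225          # 900 bp

short_runs = identical_runs(short_elem, with_mismatch(short_elem, 175))
long_runs = identical_runs(long_elem, with_mismatch(long_elem, 450))
print(short_runs)  # no run reaches 300 bp
print(long_runs)   # two runs of 450 and 449 bp survive
```

In this toy case the short element disappears entirely after a single substitution, whereas the long element is still represented by two identical stretches, which is the bias described above.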
An enrichment of satellite type DNA was also detected in our dataset. Satellite DNA is known to be present in several centromeric and pericentromeric regions in the human genome [26]. For example, alpha satellite DNA is found on all human chromosomes, while beta satellite DNA is normally present in tandem arrays on acrocentric chromosomes, covering hundreds of Kb. In addition, telomeric DNA accounts for many Kb located at the termini of human chromosomes. Even though mutations can exist in satellite sequences, long stretches of satellite type DNA still satisfy the established 100% identity and 300 bp thresholds. As a result, more satellite type DNA elements would be included in the IRB, explaining the observed enrichment.
SDs are another interesting feature of the human genome. Because SDs are large, highly identical sequences interspersed throughout the genome, it is expected that most of the IRB bp fall within this classification. In fact, we observed that 80% of the IRB overlapped with SDs, with 66% of the ISTs overlapping SDs of >99% identity. Correspondingly, ~33% of the total SD bp overlap with the IRB bp. These numbers reinforce the general idea that the IRB contains potentially recombinogenic sites, as SDs are known substrates of homologous recombination events [3, 4].
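The two percentages above differ because the same shared bp are divided by two different totals (the IRB bp and the SD bp). A minimal sketch of such a base-pair overlap calculation is given below. It assumes half-open intervals on a single chromosome; the function names and the toy coordinates are hypothetical and do not reproduce the real IRB or SD annotations.

```python
def merge(intervals):
    """Merge overlapping or touching half-open intervals (start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_bp(intervals):
    """Number of distinct bp covered by a set of intervals."""
    return sum(end - start for start, end in merge(intervals))


def overlap_bp(a, b):
    """Number of bp covered by both interval sets (two-pointer sweep)."""
    a, b = merge(a), merge(b)
    i = j = 0
    shared = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            shared += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return shared


# Toy coordinates only, chosen to show the asymmetry of the two fractions.
irb = [(0, 100), (200, 300)]
sds = [(50, 250), (400, 600)]

shared = overlap_bp(irb, sds)
frac_of_irb = shared / total_bp(irb)  # share of IRB bp inside SDs
frac_of_sds = shared / total_bp(sds)  # share of SD bp inside the IRB
print(shared, frac_of_irb, frac_of_sds)
```

With these toy intervals, half of the smaller set lies inside the larger one, while only a quarter of the larger set is covered, mirroring the direction of the asymmetry reported for the IRB and SDs.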
A major result of our analysis concerns the presence of genes in the IRB. We found 296 genes which are completely contained within ICs. Of the 296 genes, 145 are classified as non-coding RNAs. Of these, approximately one third are annotated as miRNAs, accounting for ~3% of all human miRNAs. This is an interesting result because it has been observed that miRNAs play important roles in many biological processes such as cell growth and differentiation, apoptosis, and gene regulation [27]. In this sense, it could be possible to correlate and/or make predictions of potential disease phenotypes based on the knowledge that these genes are prone to rearrangements. Indeed, it has been reported that frequent deletions of the miRNA genes miR15 and miR16 occur in patients with chronic lymphocytic leukaemia, suggesting a possible role for these miRNAs in the generation of this type of cancer [28]. On the other hand, given that miRNAs function as fine-tuners of gene expression, it would be interesting to analyze the role of genomic rearrangements that include miRNAs throughout evolution.
Among the 296 identical genes identified, we found members of the Golgin subfamily A and the Double homeobox family, which have been described as gene families with a significant copy-number expansion in human relative to the Rhesus macaque genome [29]. Another interesting observation is that expression in testis has been reported for one third of the protein-coding genes detected in the IRB. Of these genes, 92% (22 genes) are members of the cancer-testis antigen family located on the X chromosome. It is known that most of the cancer-testis genes located on this chromosome are members of families that fall within complex regions of direct and inverted repeats, and have been reported to be undergoing expansion through duplication events [30].
We are aware that the number of genes detected in the IRB could be an underestimate of the total number of identical genes in the human genome, since some identical genes may not meet the length threshold that we used for this analysis. In spite of this, the use of identical sequences enabled us to identify 26 inconsistent cases in the Ensembl database v50 human genome annotation. These include 5 identical genes with different annotated sizes but with the same description, 6 identical genes with different annotated descriptions, and 15 regions identical to a gene but not annotated as such. Given the stringent identity threshold used to construct the IRB, the IRB-based gene analysis could be used as a suitable tool for refining annotation details in many different databases.
It is important to note that the IRB includes identical pairs of long sequences, up to 88 Kb. The fact that no SNPs or indels were found is indeed odd, but these data are based directly on the reported sequence of the reference human genome. To date, the reference assembly has the highest degree of accuracy available for any sequenced organism, with a calculated error rate of 1/100,000 bp [31]. Any sequence or assembly errors in the reference would translate into errors in the IRB itself; however, this is not ascertainable in silico. Nonetheless, another plausible explanation for the high identity of the regions within the IRB is that they might have been duplicated recently in evolution; it could also be possible that they are undergoing frequent gene conversion. These regions might also be polymorphic within the human population. We found that around 73% of our ISTs overlapped CNV regions from the DGV, which correspond to 81% of the total IRB bp. It is worth noting that almost all of the overlapping ISTs were completely included within the CNV regions. An important observation is that ~89% of the overlapping IST-CNV sequence is catalogued as SDs. Previous studies have reported a significant association of CNVs with SDs [32], which might suggest an SD-mediated mechanism for the generation of these CNVs.
Most interesting is the degree of overlap that exists between the IRB and CNVs detected in other sequenced genomes. A comparison of the identified CNV regions in the Watson and Venter genomes revealed an overlap of 1,055,668 and 641,194 bp with the IRB, respectively. Moreover, a comparison among the DGV, Watson, and Venter CNVs revealed shared regions of copy-number change that overlap the IRB. Overall, these observations suggest that the ISTs might be participating as substrates for recombination events, which might ultimately lead to genomic rearrangements and copy-number changes. Following this hypothesis, we might expect to find CNV regions associated with the remaining 17,932 (27%) ISTs, which might not yet have been identified as CNVs, either due to technical limitations of current methods or limited population sampling.
Expanding the CNV-gene analysis of the IRB, we searched for possible gene copy-number variations in the Watson and Venter genomes by comparing 52 non-coding RNAs against the NCBI assembly. Using pair-wise alignments, we found that most of the genes analyzed had at least one identical hit in the three genomes, and most of them had a hit number close to the average of a control set of randomly chosen small fragments of the reference assembly. We found statistically significant duplication evidence for two genes in the Venter genome (two cases of novel 5S_rRNA) and three genes in the Watson genome (a novel misc_RNA, and two novel 5S_rRNA genes). We also had six cases where no hits were detected in the diploid genomes (five genes for Watson and one for Venter). These genes include a novel 5S_rRNA, a novel misc_RNA, and four novel copies of the U1 gene. We must be careful when interpreting these regions as possible deletions in the Venter or Watson genomes, mainly because the absence of hits for these genes could have been produced by sequencing errors, different coverage of the genomic regions, or sequence polymorphisms (single nucleotide polymorphisms (SNPs), insertions and deletions). Additional analysis revealed that of the 9 cases of genes which presented no hits or a higher number of hits, none overlapped any Watson or Venter reported CNVs (data not shown).
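As an illustration of comparing a gene's hit number against a control set, the sketch below computes a one-sided empirical p-value. This is an assumption made for illustration only: the statistical test actually used is not specified in this section, and the function name `empirical_p_upper` and the hit counts are hypothetical.

```python
def empirical_p_upper(observed: int, control: list) -> float:
    """One-sided empirical p-value: how often a control fragment has at
    least as many hits as the gene of interest. The +1 terms avoid
    reporting a p-value of exactly zero with a finite control set."""
    at_least = sum(1 for c in control if c >= observed)
    return (at_least + 1) / (len(control) + 1)


# Hypothetical hit counts for randomly chosen control fragments.
control_hits = [28, 30, 31, 29, 32, 30, 27, 33, 30, 31]

# A gene with a hit count far above the controls (candidate duplication)
# versus a gene with a typical hit count.
p_high = empirical_p_upper(65, control_hits)
p_typical = empirical_p_upper(30, control_hits)
print(p_high, p_typical)
```

A small control set bounds how small the p-value can be (here 1/11), so in practice a much larger control set would be needed before calling a duplication significant. A gene with zero hits would need the mirror-image lower-tail comparison and, as noted above, could equally reflect coverage gaps or sequence polymorphisms.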
It is tempting to speculate that the copy-number variable genes identified in the Watson and Venter genomes could be de novo duplications/deletions in either individual, or deletions/duplications in the reference assembly. However, it is important to highlight that significant local fluctuation in read depth across the Venter and Watson genomes and the NCBI assembly, together with the presence of SNPs and microindels, might limit the accuracy of in silico CNV prediction with our methodology. Nevertheless, the possibility of identifying novel CNVs for two recently sequenced genomes is a step towards the discovery of other new copy-number variable regions in the human genome. Further experiments must be performed to verify whether the predicted regions are true CNVs.
As a final remark, it is important to consider that the IRB need not be identical in different individuals, due to the presence of SNPs, microindels, and structural variation. Comparisons between the IRB of the reference human genome and those of recently published personal genomes will become feasible once newly sequenced genomes attain a degree of assembly confidence high enough to define their individual IRBs. For now, comparisons between the IRB and raw sequence reads have shed light on important functional and structural aspects of the identical repeated nature of the reference human genome assembly with regard to two other sequenced genomes. Furthermore, given that most of the resequenced personal genomes rely greatly on mapping sequence reads back to the reference assembly, the IRB of the reference assembly will also help to pinpoint highly identical regions in the new genomes.