Tracking subtelomere alleles with conventional DNA markers is currently very difficult. All but six of the most distal 30 kb euchromatic subtelomere segments are composed exclusively of segmental duplications, and for a significant number of subtelomeres the duplication regions can be far more extensive (hundreds of kilobases) as well as highly variable in size and duplication content among alleles. Most of this subtelomeric DNA lies outside of the 'Hapmappable' genome; because of the subtelomeric duplication content, following haplotypes in these regions with single nucleotide polymorphisms is virtually impossible with current high-throughput technologies. Our high-resolution analysis of subtelomeric duplication sequence content and organization demonstrates significant differences in the levels of sequence similarity between distinct subtelomere duplicon families as well as large variations in the types and sequence organization of duplicons present at particular subtelomeres. These differences may offer opportunities for distinguishing individual subtelomere alleles in the context of genomic DNA samples, ultimately permitting large-scale studies associating subtelomere haplotypes or haplotype combinations with particular phenotypes.
Our analysis of subtelomeric duplicon substructure and nucleotide sequence similarity provides a different and more detailed perspective on subtelomere sequence organization than the subtelomere paralogy analysis included as part of the Linardopoulou et al. [12] study. The starting point for our analysis was a comprehensive set of manually curated and physically mapped subtelomere sequence assemblies [6], and we incorporated all segmental duplications of the subtelomeric sequences (both non-subtelomeric and subtelomeric) into our duplicon definition and analysis strategy; this led to the systematic and comprehensive definition and sequence characterization of duplicons anchored to each subtelomere (Additional data files 6-47). The paralogy map derived from the Linardopoulou et al. [12] analysis does not incorporate non-subtelomeric homology blocks or the newer subtelomeric sequence included in our assemblies. Because of these differences, the paralogy blocks they define overlap with, but do not correspond to, any of the subtel-only blocks or subterminal blocks defined in this study (Additional data file 50). In addition, we determined raw percent nucleotide sequence similarity values directly from the pairwise blastn alignments of RepeatMasked sequence, rather than calculating this parameter from alignments of non-RepeatMasked DNA post-processed to exclude gaps and small insertions/deletions from alignment percent identity scoring [12]. This accounts for the generally higher divergence in our duplicon sequence alignments compared with those of Linardopoulou et al. [12], and helps to focus attention on sequence differences most likely to be useful for allelic and paralog discrimination.
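The effect of the two scoring conventions can be illustrated with a minimal sketch. It is not the pipeline used in either study: the function name `percent_identity`, the `count_gaps` flag, and the toy alignment are hypothetical, and RepeatMasking is not modeled. It only shows that scoring gap columns as differences (as a raw alignment percent identity does) yields a lower value than scoring the same alignment after gap columns are excluded.

```python
def percent_identity(aln_a: str, aln_b: str, count_gaps: bool = True) -> float:
    """Percent identity between two rows of a pairwise alignment.

    aln_a, aln_b : gapped sequences of equal length ('-' marks a gap).
    count_gaps   : hypothetical switch. True scores gap columns as
                   differences (raw identity over the full alignment
                   length); False drops gap columns before scoring.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    columns = 0
    for a, b in zip(aln_a.upper(), aln_b.upper()):
        is_gap = a == "-" or b == "-"
        if is_gap and not count_gaps:
            continue  # gap-excluded scoring: ignore indel columns
        columns += 1
        if not is_gap and a == b:
            matches += 1
    if columns == 0:
        raise ValueError("no scorable columns in alignment")
    return 100.0 * matches / columns


# Toy alignment (illustrative only): 14 columns, 11 matches,
# 1 mismatch, and a 2-column gap.
aln_a = "ACGTACGT--ACGT"
aln_b = "ACGTTCGTGGACGT"

raw = percent_identity(aln_a, aln_b, count_gaps=True)        # 11 / 14
gap_free = percent_identity(aln_a, aln_b, count_gaps=False)  # 11 / 12
print(f"raw: {raw:.1f}%  gap-excluded: {gap_free:.1f}%")
```

For this toy alignment the raw score is 11 matches over 14 columns, while the gap-excluded score is 11 matches over 12 columns, so the same pair of sequences appears more divergent under raw scoring.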
Duplicons and sets of adjacent duplicon blocks that comprise segmentally duplicated subtelomeric DNA were classified into several practically useful and perhaps biologically significant groups. Duplicon blocks that occur only in subtelomeric regions (Table ) can be used to develop sequence-based approaches to the analysis of subtelomere variation and subtelomeric somatic evolution of individual genomes, without interfering background signals from non-subtelomeric sites. Subterminal duplicon blocks of sequence (Table ) were defined that, together with six one-copy subterminal regions, comprise all of the cis-elements adjacent to terminal (TTAGGG)n tracts. These sequences are believed to be involved in telomere-specific and allele-specific (TTAGGG)n tract regulation [19], and are amongst the first non-(TTAGGG)n sequences expected to be affected by telomere dysfunction, aberrant telomere replication, and telomere instability. Delineating these sequences and analyzing their variation are crucial for understanding the role of human subtelomeres in telomere length regulation and telomere biology.
Subtelomeric duplicons are known to harbor protein-encoding genes and predicted protein-encoding genes as well as pseudogenes and many transcripts of unknown function [6,12,35] (H Riethman, unpublished). Known genes embedded in the subtelomere-specific duplicons and in the subterminal duplicons are listed in Tables and , respectively; a comprehensive listing of RefSeq matches with these duplicons is given in Additional data files 51 and 52. For several subtelomeric transcript families (IL9R, DUX4, FBXO25) functional evidence for protein expression from at least one transcript locus is available [38-40]. However, for most transcript families the evidence for encoded protein function relies upon the existence of one or more actively transcribed loci with open reading frames predicted to encode evolutionarily conserved proteins [41-44]. While these data strongly suggest that one or more members of each of these gene families encode functional protein, in most cases pseudogene copies of the respective gene family co-exist amongst the duplicons and a great deal of work lies ahead in terms of deciphering the functions of individual members of subtelomeric gene families as well as their evolution. In this light, it is important to note that only a single reference sequence has been sampled in this analysis, and given the abundant large-scale variation in these regions, there are certain to be many additional members of most of these gene families yet to be discovered in the human population.
One of the most intriguing transcript families embedded in the subtelomere repeat region is one predicted to encode odorant receptors [35,41], in subtelomere-specific duplicon block 2 (Table ). The highly variable dosage and polymorphic distribution of these genes in humans reflect a recent and evolutionarily rapid expansion of this gene family. Subtelomeric duplicon regions of yeast, Plasmodium, and trypanosomes are each associated with rapid duplication and generation of functional diversity in their embedded genes (discussed in [10]), and it is intriguing to speculate that similar mechanisms are active in human evolution. A very interesting transcript family of unknown function (CXYorf1-related) is embedded in subterminal duplicon block C (Table ); many of these transcripts are predicted to encode variants of an evolutionarily conserved open reading frame with one copy in the mouse genome [44]. This transcript family varies widely in both dosage and telomere distribution in individual genomes, and usually terminates less than 5 kb from the start of the terminal (TTAGGG)n tract; thus, individual telomeric transcription sites for this family might be differentially susceptible to position effects depending on local telomeric chromatin/heterochromatin status and on chromosome-specific telomere lengths.
From our analysis, it is clear that most subterminal duplicon sequences are more divergent than the large duplicons that exist more centromerically, both in nucleotide sequence similarity and in sequence organization. This divergence might be exploited to develop subterminal allele-specific PCR assays to track some of these sequences genetically in the context of total genomic DNA. For both the highly similar and the more divergent duplicon families, coupling quantitative PCR assays designed to amplify sequences across these regions with new bead-based single molecule characterization and sequencing methods [45,46] might provide an extremely powerful means for determining both the copy number and a global set of short-range subtelomere haplotypes within an individual genome. Thus, subtelomere variation might be linked with phenotypes at this level. Extending these global short-range sequence haplotypes into longer-range subtelomere allele haplotypes will be more challenging, and may require the isolation, detailed characterization, and perhaps complete sequencing of many additional variant subtelomere alleles.