In this study we show the efficacy of DNA capture sequencing and interrogation of variants in biologically important loci outside of the CCDS exome. These regions almost uniformly demonstrated decreased capture ability, as measured by average target coverage, when compared to the CCDS regions. Overall, the Illumina and SOLiD sequencing platforms showed similar biases in coverage of genomic subregions when measured relative to the CCDS. Importantly, capture ability appeared to be confounded by biases introduced by the sequencing technology and correlated with GC content of the target sequence, a known factor in short-read sequencing [37]. In particular, conserved UTR regions, which are approximately 30% GC, and regulatory regions, which are approximately 70% GC, had approximately half the depth of coverage of the CCDS regions, which are approximately 50% GC. When compared to WGS (non-capture) data, the same general biases were evident. However, the act of capturing the targeted regions seems to exacerbate the coverage bias by an additional 5 to 10%. The exceptions are the predicted exons and microRNA, where the coverage was higher than expected, and the UTR regions, where the coverage was as much as 25% lower than expected from the WGS data. This effect may be due to steric hindrance of probe-target binding introduced by secondary structure present in the UTR regions. These results imply that naively capturing biologically relevant loci other than the CCDS will require 20 to 40% more sequencing data than would be expected from the CCDS. It may be possible, however, to alter the capture reagent, perhaps by increasing the representation of some probes, to compensate for the empirically measured coverage biases and thus help normalize the coverage when capturing CCDS and other elements.
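The two quantities used in this comparison, GC content of a target sequence and mean coverage expressed relative to the CCDS, can be sketched as follows. This is a minimal illustration, not the analysis pipeline used in this study: the function names are hypothetical, and the coverage values in the example are placeholders chosen only to show the arithmetic.

```python
def gc_fraction(seq):
    """Fraction of G/C bases among the unambiguous A/C/G/T bases of a sequence."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # Sequences with no unambiguous bases (e.g. all N) are reported as 0.0.
    return gc / total if total else 0.0


def relative_coverage(mean_cov, reference="CCDS"):
    """Express each subregion's mean coverage as a ratio to the reference region."""
    ref = mean_cov[reference]
    return {region: cov / ref for region, cov in mean_cov.items()}


# Placeholder coverage values for illustration only; not results from this study.
example_cov = {"CCDS": 40.0, "UTR": 20.0, "regulatory": 22.0}
rel = relative_coverage(example_cov)

for region, ratio in rel.items():
    print(f"{region}: {ratio:.2f}x of CCDS coverage")
```

A ratio of 0.5 in this representation corresponds to a subregion receiving half the depth of coverage of the CCDS.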
To our knowledge, this is the first targeted-sequence capture study of a genome-wide, diverse set of biologically important elements. It allows the investigation, at a fraction of the cost of whole-genome sequencing, of variant densities in functionally relevant loci that have hitherto gone undetected. Using both Illumina and SOLiD sequencing, we demonstrate the ability to find variants across a significantly larger target region than the CCDS. Because capture sequencing enables high levels of sequence coverage, we were able to discover rare (private) variants in each sample using amounts of data similar to those used by low-coverage, whole-genome techniques, which are better suited to common variant discovery.
Illumina sequencing consistently showed higher variant densities than SOLiD sequencing. This discrepancy is likely due to differences in the variant filtering parameters used for the two sequencing types. However, it may also reflect the inherently higher accuracy of SOLiD sequencing [37]. Importantly, when measured relative to the CCDS variant density, different subregions showed remarkably similar variant densities for both sequencing platforms. Variant densities, however, were found to vary among subregions of the genome, likely due to the evolutionary conservation and base composition of these regions. The evolutionarily conserved CCDS exome and UTR regions showed variant densities of 1/1,600 to 1/1,850 bp, considerably lower than the whole-genome rate of 1/1,000 bp, which presumably reflects purifying selection acting to remove deleterious variants. Exons specific to RefSeq, which are not in the CCDS, showed an intermediate variant density of 1/1,200 bp. This is likely because these loci are less essential to the organism, and mutations in these regions are less likely to be deleterious. Unlike the coding regions, the regulome showed a variant density higher than that of the whole genome. While this is likely due to the GC content of the regulome, we found that C→T and G→A mutations were underrepresented as a proportion of all variants when compared to the CCDS. This is significant because 5-methyl-cytosine bases in CpG dinucleotides, which are over-represented in regulatory regions, are prone to spontaneous deamination to thymine [39]. This would indicate that there is strong selective pressure to maintain cytosine and guanine representation in the regulome compared to the CCDS exome.
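The "1 variant per N bp" densities quoted above can be compared on a common scale by expressing each as a ratio to the whole-genome rate. The sketch below is only an illustration of that arithmetic, with hypothetical function names, and uses the rounded densities stated in the text rather than the underlying variant counts.

```python
def bp_per_variant(n_variants, n_bases):
    """Average number of bases per variant, i.e. N in a density of 1/N bp."""
    return n_bases / n_variants


def relative_density(bp_region, bp_reference):
    """Variant density of a region as a ratio to a reference density.

    Both arguments are given as bases per variant, so a region with
    1 variant per 1,600 bp has 1000 / 1600 of the density of a reference
    with 1 variant per 1,000 bp.
    """
    return bp_reference / bp_region


WHOLE_GENOME = 1000  # 1/1,000 bp, as stated in the text

# Rounded densities quoted in the text, in bases per variant.
quoted = {
    "CCDS/UTR (upper bound)": 1600,
    "CCDS/UTR (lower bound)": 1850,
    "RefSeq-specific exons": 1200,
}

for region, bp in quoted.items():
    ratio = relative_density(bp, WHOLE_GENOME)
    print(f"{region}: {ratio:.2f}x the whole-genome variant density")
```

Ratios below 1 indicate fewer variants per base than the genome-wide average, which is the pattern described for the conserved coding and UTR regions.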
Of all the regions interrogated, the predicted exons showed the highest variant density, 1/660 bp. Although these exons have a higher GC content than the CCDS, it is considerably lower than that of the regulome, indicating that the increased mutability of GC-rich sequence cannot fully account for the variant density. However, we observed that the intronic variant density in WGS studies was also considerably higher than that of the whole genome. It has been reported that transcribed regions have higher variant densities than non-transcribed regions [40], and we surmise, therefore, that the observed variant density results from a combination of these regions being actively transcribed and their high GC content. As expected from the high variant density, predicted exon regions showed a slightly higher proportion of bases with faster-than-neutral evolution rates than intronic regions. Unexpectedly, predicted exons also showed a slightly higher proportion of conserved bases than intronic regions.
The 'exonization' of intronic elements is well documented [42], and computationally predicted exons have been detected in mature mRNA from RNA-seq experiments [46]. In this work we interrogated predicted exons that are flanked by canonical splice sites and exist within known CCDS genes, and thus are good candidates for inclusion in mature RNA and subsequent translation. Exons are thought to be protected from mutation [47], and the higher mutation rates in predicted exons may therefore be a source of evolutionary diversity.