Following initial method descriptions, current research is applying genome capture methods to a variety of questions. From disease causation and diagnosis to evolutionary comparison of ancient genomes, genome capture combined with massively parallel sequencing is a powerful investigative tool.
Medical sequencing
One of the more common exome capture experiments is the search for genetic variation underlying a particular disease. For some diseases, causative genes have been identified, and researchers can use custom captures to examine those genes for known and novel variants in their samples. For other diseases, whole exome capture is suitable, as the causative gene is unknown, or many different genes may contribute. Several recent studies have captured and sequenced different regions of individual genomes with known causative variants or genes. These proof-of-principle experiments demonstrate the utility, as well as some shortcomings, of capture followed by massively parallel sequencing. Ng et al. (26) used array-based hybridization to sequence 12 human exomes (~28 Mb). The study included four unrelated individuals with Freeman–Sheldon syndrome, a dominantly inherited rare Mendelian disorder. The investigators were able to identify variants in the known causative gene in each sample. Interestingly, the known causative gene was the only candidate following the application of numerous filters, including requiring a gene to have a novel variant in each sample. In their study of neurofibromatosis type 1, Chou et al. (29) used custom array capture and pyrosequencing to target the 280 kb region containing the NF1 gene, which is known to harbor causal dominant mutations. The authors captured DNA from two different samples with known genotypes, but were initially able to recover only a known single-base deletion. The other known variant, an Alu sequence insertion, was observed only after de novo assembly of unmapped reads. Additionally, the authors found many positions at which the captured genotypes did not agree with Sanger sequencing confirmations. They found that while some discrepancies were due to pyrosequencing errors, others were misalignments from the numerous pseudogenes of NF1, illustrating one of the potential pitfalls of the method. Hoischen et al. (30) also used array-based capture (~2 Mb) and pyrosequencing to re-identify known variants in five individuals with autosomal recessive ataxia. They were able to initially identify 6/7 known variants investigated; the seventh variant was visible only after adding three times more sequence, and even then in a low number of reads (2/9 reads contained the mutation). A known trinucleotide repeat variant was not included in the design, owing to the repetitive nature of such variants, and was therefore not recovered. Raca et al. (31) searched for two known variants causative for papillorenal syndrome using array-based capture targeting the causative gene, PAX2, as well as >100 candidate genes for other ocular disorders (370 kb), followed by pyrosequencing. They were able to identify a known substitution using the provided sequencing analysis software, but did not recover the known single-base deletion in a homopolymer run, despite seeing reads containing the variant. The authors concluded that the vendor-provided software was conservative when dealing with insertions/deletions in homopolymer runs, as pyrosequencing has a higher error rate with this type of sequence. Other analysis packages were able to identify the variant.
Although these studies were not designed to identify novel variants causative for disease, much can be learned from them. Importantly, not every known variant was recovered. This was due to low sequence depth at the variant position, as well as issues relating to repeat regions and alignment. One study estimated that the probability of detecting a causative variant in any given gene is ~86%, although this ignores non-coding and structural variants (26). In order to ensure sufficient allele sampling, as well as to prevent sequencing errors from appearing to be actual variants, all four studies use or recommend a minimum sequence depth threshold, ranging from 8- to 30-fold depth of coverage. These recommendations will affect the amount of sequencing required for a given capture size and will therefore affect the cost of the experiment.
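The link between depth threshold, allele sampling and sequencing volume can be illustrated with a short calculation. The Python sketch below is only a rough model and is not taken from any of the cited studies: the function names are hypothetical, the on-target fraction of 0.5 is an assumed placeholder, and reads at a heterozygous site are assumed to sample the two alleles independently with equal probability, ignoring sequencing error and uneven coverage.

```python
from math import comb


def required_sequence_mb(target_mb, mean_depth, on_target_fraction):
    """Raw sequence (in Mb) needed to reach mean_depth over a target of target_mb.

    on_target_fraction is the assumed share of sequenced bases that fall on
    the target; the value used below is illustrative, not a measured one.
    """
    return target_mb * mean_depth / on_target_fraction


def prob_variant_seen(depth, min_variant_reads, allele_fraction=0.5):
    """Binomial probability that at least min_variant_reads of `depth` reads
    carry the variant allele (heterozygous site by default)."""
    p_fewer = sum(
        comb(depth, k) * allele_fraction**k * (1 - allele_fraction) ** (depth - k)
        for k in range(min_variant_reads)
    )
    return 1 - p_fewer


# A ~28 Mb exome at 30-fold mean depth, assuming half of the bases are on target.
exome_mb = required_sequence_mb(28, 30, 0.5)

# Chance of seeing a heterozygous variant in at least 2 reads at 8- and 30-fold depth.
p_at_8 = prob_variant_seen(8, 2)
p_at_30 = prob_variant_seen(30, 2)
```

Under these assumptions, the raw sequence required scales linearly with both target size and depth threshold, which is why the choice of threshold feeds directly into experimental cost.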
Targeted capture has also been used to identify novel genes that cause hereditary disorders. Novel, putative causative variants have recently been discovered for a variety of disorders [sensory/motor neuropathy with ataxia (32), Clericuzio-type poikiloderma with neutropenia (33), familial exudative vitreoretinopathy (34), recessive non-syndromic hearing loss (35), talipes equinovarus, atrial septal defect, Robin sequence, persistent left superior vena cava (36)] using genome capture to target linkage regions from the affected families. The identified variants were almost all non-synonymous substitutions, but follow-up studies on additional unrelated samples using Sanger sequencing also identified insertions/deletions in the same genes (33, 35). Volpi et al. (33) identified a substitution that disrupted a splice site, resulting in an exon skip and a frameshift. Interestingly, Johnston et al. (36) were able to identify variants in two different families (one nonsense, one frameshifting insertion) without sequencing the probands, for which DNA was not available. These studies demonstrate the ability of genomic capture to discover different types of novel variants important for human disease.
In addition to custom capture studies, two whole exome studies have recently been reported. In the first, Choi et al. identified a novel coding variant in a consanguineous region of an affected individual. The variant was a homozygous missense substitution in SLC26A3, a gene in which mutations are known to cause congenital chloride-losing diarrhea (25). This genetic finding allowed the researchers to correct an earlier diagnosis of the patient's disorder. Variants in the same gene were present in other individuals, allowing the diagnosis to be corrected for them as well. In the second study, Ng et al. used exome capture to search for variants causing Miller syndrome in three unrelated families. They identified variants in DHODH in all three families, using filters for novel variants that fit inheritance models. Both studies showed that exome capture is an effective way to discover causative variants and genes and to correctly diagnose heritable disorders caused by variants in known genes.
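The filtering strategy mentioned in these studies (discard known variants, then require a gene to carry a novel variant in every affected sample) can be sketched as a set intersection. The Python below is a simplified illustration, not the pipeline used by any of the cited authors: the function name, the data layout, and the gene and variant identifiers are all hypothetical, and inheritance-model filters are omitted.

```python
def candidate_genes(variants_by_sample, known_variants):
    """Return genes that carry at least one novel variant in every sample.

    variants_by_sample: dict mapping sample name -> list of (gene, variant_id)
    known_variants: set of variant_ids already catalogued (for example in a
    public variant database); these are excluded as non-novel.
    """
    per_sample_genes = []
    for calls in variants_by_sample.values():
        novel_genes = {gene for gene, variant_id in calls
                       if variant_id not in known_variants}
        per_sample_genes.append(novel_genes)
    if not per_sample_genes:
        return set()
    # Only genes hit by a novel variant in all samples survive.
    return set.intersection(*per_sample_genes)


# Hypothetical calls for three unrelated affected individuals.
calls = {
    "sample1": [("GENE_A", "v1"), ("GENE_B", "v2"), ("GENE_C", "v3")],
    "sample2": [("GENE_A", "v4"), ("GENE_B", "v2")],
    "sample3": [("GENE_A", "v5"), ("GENE_C", "v6")],
}
known = {"v2"}  # v2 is treated as a previously catalogued variant

candidates = candidate_genes(calls, known)
```

In this toy data set only GENE_A has a novel variant in all three samples. Note that the variants need not be identical across samples, only located in the same gene, which suits the case of unrelated individuals carrying different mutations.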
Human evolution
Recent advances in the sequencing of ancient DNA have also benefited from targeted capture. Researchers used PEC to specifically target mitochondrial DNA from five Neandertal samples (21). The PEC method allowed complete coverage of the Neandertal mtDNA, using only 5–50 ng of amplified pyrosequencing library template. More recently, researchers used array-based capture to target, in Neandertal DNA, non-synonymous substitutions that have been fixed in humans since the divergence from the human/chimpanzee ancestor (37). Although the array-based capture did not have the low DNA requirements of PEC, the method allowed sequencing of a Neandertal sample containing 99.8% contaminating microbial DNA. Owing to the high contamination, this sample was unsuitable for shotgun sequencing, but targeted capture allowed recovery of almost all of the Neandertal sequence at the desired positions. The authors were then able to identify 88 substitutions that have become fixed in humans since the split from Neandertal, giving insight into what distinguishes us at the genetic, and perhaps molecular, level.
Exome capture has been used to investigate more recent variation as well. Researchers used whole exome capture to identify changes in allele frequency between high-altitude populations (Tibetans) and low-altitude populations (Han Chinese and Danes) (38). They were able to identify a number of genes likely to have been selected for as a part of adaptation to a high-altitude environment. Several of these genes were identified in other studies using microarray genotyping (39, 40). This suggests that exome capture techniques are accurate and useful for these types of allele frequency studies and would be especially useful for rarer SNPs that may not be included on the microarray platforms. Both recent and ancient genetic differences have been investigated using exome capture, giving a more complete view of our evolutionary history.
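As a simple illustration of comparing allele frequencies between two populations, the Python sketch below ranks sites by the absolute difference in alternate-allele frequency. It is a deliberately minimal stand-in: the statistic actually used in the cited study is not described in this section, and the function names, site labels and counts are all hypothetical.

```python
def allele_frequency(alt_count, total_alleles):
    """Fraction of sampled chromosomes carrying the alternate allele."""
    return alt_count / total_alleles


def rank_by_frequency_difference(counts_a, counts_b):
    """Rank sites genotyped in both populations by |freq_a - freq_b|.

    counts_a, counts_b: dict mapping site -> (alt_count, total_alleles).
    Returns a list of (site, difference), largest difference first.
    """
    shared_sites = counts_a.keys() & counts_b.keys()
    differences = {
        site: abs(allele_frequency(*counts_a[site])
                  - allele_frequency(*counts_b[site]))
        for site in shared_sites
    }
    return sorted(differences.items(), key=lambda item: item[1], reverse=True)


# Hypothetical allele counts for a high-altitude and a low-altitude population.
high_altitude = {"site1": (90, 100), "site2": (10, 100)}
low_altitude = {"site1": (10, 100), "site2": (12, 100)}

ranked = rank_by_frequency_difference(high_altitude, low_altitude)
```

Sites at the top of such a ranking are only candidates; a real analysis would also need to account for sample size, population structure and genetic drift.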
Biological
Basic biology questions are also being investigated on a much greater scale than previously possible using genome capture. Although the genetic information in DNA is frequently the initial focus of genome studies, epigenetic modification of the DNA also plays an important role in the biological function of an organism. Two groups used genome capture with padlock probes (19) or array-based capture (41) to investigate DNA methylation using bisulfite sequencing. Both studies found the approach to be very accurate when compared with the standard capillary methods. The latter study also showed that sensitivity using array-based capture was high: 86–91% of targeted bases were covered by 10 or more reads. An additional study focussed not on methylation status, but on genetic variation at CpG sites, which are subject to a higher mutation rate via 5-methylcytosine deamination (17). Using padlock probes, the researchers were able to determine genotypes for ~65% of targeted bases. The accuracy was very high when compared with an independent genotype assessment. These CpG region studies show that capture is useful for focusing on the desired regions and is effective even on difficult (high GC content) regions.
Copy number variation (CNV) is another source of genetic variation implicated in disease. The detection of copy number changes is often performed using low-resolution methods, such as array-comparative genomic hybridization and single nucleotide polymorphism (SNP) microarrays. Conrad et al. (42) used targeted sequencing to capture breakpoint regions and identify the actual breaks at high resolution. They were able to identify breakpoints for a number of known CNVs and were then able to classify the breaks according to the likely repair mechanism involved. The authors point out that this method is useful for CNVs in simpler regions, as repeat elements and complex genomic regions present challenges both for capture and for post-sequencing alignment.
Capture is not limited to genomic DNA. Several studies have used targeted sequencing to investigate RNA as well. One group used padlock probes to target regions containing known RNA-editing sites (43). They were able to identify sites in 10 of 13 known edited genes, by comparing captures of genomic DNA and cDNA from various tissues. The authors chose 18 editing sites at random and confirmed 15 with capillary sequencing. This research showed that padlock capture techniques work with cDNA and can be used to identify sites of RNA editing. Hybridization capture was also shown to capture cDNA (44, 45). In (44), the authors captured both cDNA and genomic DNA with an array-based method. They then determined allele-specific expression using both data sets. In (45), the authors used solution hybridization to enrich cDNA from a set of genes of interest. They were able to effectively enrich these genes, suggesting that genes of low abundance could be detected without huge increases in total sequencing. Interestingly, they were also able to identify gene fusions, including fusions in which one gene was not targeted. Applying targeted sequencing to cDNA is another way to focus on specific questions, even without whole-genome sequence.