|Home | About | Journals | Submit | Contact Us | Français|
To determine whether massively parallel next-generation DNA sequencing offers rapid and efficient detection of disease-causing mutations in patients with monogenic inherited diseases. Retinitis pigmentosa (RP) is a challenging application for this technology because it is a monogenic disease in individuals and families but is highly heterogeneous in patient populations. RP has multiple patterns of inheritance, with mutations in many genes for each inheritance pattern and numerous, distinct, disease-causing mutations at each locus; further, many RP genes have not been identified yet.
Next-generation sequencing was used to identify mutations in pairs of affected individuals from 21 families with autosomal dominant RP, selected from a cohort of families without mutations in “common” RP genes. One thousand amplicons targeting 249,267 unique bases of 46 candidate genes were sequenced with the 454GS FLX Titanium (Roche Diagnostics, Indianapolis, IN) and GAIIx (Illumina/Solexa, San Diego, CA) platforms.
An average sequence depth of 70× and 125× was obtained for the 454GS FLX and GAIIx platforms, respectively. More than 9000 sequence variants were identified and analyzed, to assess the likelihood of pathogenicity. One hundred twelve of these were selected as likely candidates and tested for segregation with traditional di-deoxy capillary electrophoresis sequencing of additional family members and control subjects. Five disease-causing mutations (24%) were identified in the 21 families.
This project demonstrates that next-generation sequencing is an effective approach for detecting novel, rare mutations causing heterogeneous monogenic disorders such as RP. With the addition of this technology, disease-causing mutations can now be identified in 65% of autosomal dominant RP cases.
Massively parallel next-generation sequencing has revolutionized the speed and cost associated with generating large quantities of sequence data, making it a promising technology for detecting disease-causing mutations associated with monogenic diseases.1–5 The low-cost, high-throughput attributes of next-generation sequencing make it particularly attractive for use in highly heterogeneous monogenic diseases such as retinitis pigmentosa (RP) where the number of potential disease-causing genes and mutations is high, and many are still unknown.
RP is a multifaceted Mendelian form of inherited photoreceptor degeneration that is monogenic in most individuals and families, but extremely heterogeneous in patient populations. RP affects approximately 1 in 4000 individuals in the United States, Europe, and Japan. This translates into approximately 1.5 million people affected with RP worldwide.6–10 Data from the Beijing Eye Study suggest that the prevalence of RP in China may be even higher (approximately 1 in 1000).11 Studies in Japan, Denmark, and Kuwait show that RP is among the leading cause of blindness or visual impairment worldwide, accounting for 25% to 29% of cases in the working-age group (21–60 years).12–15
RP can be inherited in an autosomal dominant (adRP), autosomal recessive (arRP), or X-linked (XlRP) manner with rare mitochondrial and digenic forms also reported.16,17 Several syndromic disorders, such as Bardet-Biedl and Usher syndrome, have RP associated with them.8,17 To further complicate matters, there are several other forms of inherited retinal degeneration—Leber's congenital amaurosis and cone–rod dystrophy to name a few—which have overlapping phenotypes, and to some extent, overlapping genes and mutations.16,18,19
Significant progress has been made in determining the molecular causes of RP, but much remains to be done. To date, research has identified 20 adRP, 35 arRP, and 6 XlRP loci, several overlapping with each other (RetNet, http://www.sph.uth.tmc.edu/RetNet; provided in the public domain by the University of Texas Houston Health Science Center, Houston, TX). Most of the genes have been identified for these loci, (17, 18, and 2 respectively), but mutations still cannot be found in a large fraction of individuals with RP, indicating that many of the causative genes and mutations have not been identified yet. For example, after testing for mutations in the known adRP genes, we can identify mutations in only approximately 60% of a clearly defined cohort of patients with adRP.17,20–24 In populations other than those of Western European origin, the mutation detection rate is even lower.25–29
To determine whether massively parallel next-generation sequencing is an option for adRP mutation identification and discovery, we sequenced pairs of affected individuals from 21 autosomal dominant families without mutations in known adRP genes, by using a combination of two platforms: 454GS FLX Titanium (Roche Diagnostics, Indianapolis, IN); and GAIIx (Illumina/Solexa, San Diego, CA). One thousand amplicons corresponding to the coding sequences and intron–exon junctions of 46 candidate genes were sequenced as part of this pilot project. Variants were assessed for potential pathogenicity using bioinformatic annotation, dbSNP, and manual review. One hundred twelve of the most likely variants were validated and subjected to additional segregation and population analysis using conventional di-deoxy sequencing. Five definitive mutations were identified in the 21 families proving that massively parallel next-generation sequencing is an effective approach for determining the genes and mutations associated with RP.
A subset of 21 families was selected from a cohort of 230 adRP families that has been described previously (Bowne SJ, et al. IOVS 2007;48:ARVO E-Abstract 2334).20,21,23,24 Families without mutations were selected from the adRP cohort based on pedigree analysis and the availability of DNA (Table 1). Pairs of affected individuals with the lowest kinship coefficient within the family were selected from each pedigree. Six additional positive control samples were selected from our diagnostic laboratory for analysis (Table 2).
The research adhered to the tenets of the Declaration of Helsinki. Informed consent was obtained from each of the individuals tested. This study was approved by the Committee for the Protection of Human Subjects of the University of Texas Health Science Center at Houston and by the respective human subjects' review boards at each of the participating institutions.
Lymphoblast DNAs from four human population control collections (CEPH, Han people of Los Angeles, Mexican-American community of Los Angeles, and African American) were obtained from the Coriell Institute for Medical Research (Camden, NJ) or from the Centre d'Etude du Polymophisme (Paris, France).
Genes targeted for sequencing were (1) known causes of autosomal dominant RP, (2) known causes of other forms of retinal degeneration with overlapping phenotypes, or (3) potential disease-causing candidate genes selected from sensory cilium proteome studies, EyeSAGE data, and other retinal expression and protein interaction studies (Table 3) (Liu O, et al. IOVS 2006;47:ARVO E-Abstract 3725).30–33
In-house amplimer design algorithms were used in conjunction with Primer3 primer-selection software (http://sourceforge.net/projects/primer3/develop)34 to design 1000 PCR amplicons (average size, 283 bp) targeting all coding and noncoding exonic sequences of the 46 genes. PCR primers were ordered from IDT (Coralville, IA) in four sets, each with an M13, MID1, MID2, or MID3 primer tail.
Genomic DNA was subjected to whole-genome amplification (WGA; REPLI-g genome amplification service; Qiagen, Valencia, CA) before amplification of the targets. Only samples with an assessment rating indicating a >99.0% accuracy rate were used for the study. Each of the 1000 PCR amplicons was amplified individually with 10 ng of WGA DNA, PCR master mix (Amplitaq Gold Master Mix; Applied Biosystems, Inc., [ABI], Foster City, CA), 8% glycerol, and 2.4 picomoles of primer. Standard PCR cycling conditions were performed for 40 cycles with an annealing temperature of 60°C.
Before variant identification sequencing, WGA DNA from all 48 individuals was pooled and amplified to determine individual amplicon efficiencies. Each of the 1000 primer sets containing M13 tails was amplified independently. Equal volumes of PCR product from these amplimers (3 μL) were pooled and sequenced on one full 454/XLR plate (~1,000× coverage per amplicon, or ~20× per sample). Amplicons were classified into groups of high, medium, and low coverage based on the average 454GS FLX read depth as described in Table 4. Amplicon efficiency classifications and resulting input ratios were used for all subsequent sample sequencing library preparations.
PCR products corresponding to four individuals (two individuals per family), were each amplified with a different primer tail. PCR products were pooled using the ratios established during the PCR efficiency run (Table 4). PCR product-pool libraries were created according to the protocol outlined in the manufacturer's instructions, with omission of the DNA fragmentation and size selection steps. Briefly, 2 μg of each pool was size purified (Agencourt AMPure; Beckman Coulter Genomics, Danvers, MA), according to the manufacturer's protocol. Fragments were end polished with T4 PNK and T4 DNA polymerase (both from Roche Diagnostics) and the adapters ligated at 25°C for 15 minutes. Fragments were immobilized, filled in, and denatured to construct the single-stranded DNA library. A library of positive controls was created using the same methodology but the pooled PCR product corresponding to the six mutation-positive DNAs was used without regard for the primer tail used in amplification.
Emulsion PCRs were performed according to the manufacturer's instructions (Roche Diagnostics). Briefly, library DNA fragments were captured on beads and then resuspended in amplification mix and prepared in oil (GS FLX Titanium emPCR kit; Roches Diagnostics). The emulsified beads were amplified and then recovered and washed (GS FLX Titanium emPCR Breaking Kit; Roche Diagnostics). Bead enrichment was performed via a biotinylated enrichment primer and streptavidin-coated magnetic beads. Sequencing primers were annealed to the bead-bound, single-stranded template DNA fragments and sequenced according to the manufacturer's protocol (Roche Diagnostics). The positive control library was run on one-half of a 454GS FLX XLR plate, while the remaining libraries were run on one fourth of a 454GS FLX XLR plate.
One paired-end GAIIx library containing PCR product from each of the 48 DNA samples was constructed. Excess primers and primer dimers were removed using PCR purification columns (QIAquick; Qiagen), according to the manufacturer's protocol. Ten micrograms of the PCR product was concatenated (Quick Ligase Kit; New England Biolabs, Ipswich, MA) with 7.2% PEG-8000. Concatenated DNA was then nebulized according to the manufacturer's protocol (Illumina/Solexa).
Sheared DNA was end repaired (DNA Terminator End Repair kit; Lucigen, Middleton, WI), purified (QIAquick columns; Qiagen), and verified on 1.2% gels (FlashGels; Lonza, Basel, Switzerland). An adenosine was added to the 3′ end of concatenated products (Klenow Fragment (3′→5′ exo−; New England Biolabs) at 37°C for 30 minutes. Adenosine-tailed DNA was column purified (MinElute; Qiagen) and ligated to an adapter-oligo mix at room temperature for 15 minutes. Ligated reactions were purified with the purification columns followed by gel purification using 1.2% agarose and extraction (MinElute Gel Extraction Kit; Qiagen). PCR enrichment was performed using DNA polymerase master mix (Phusion; New England Biolabs). Reactions were denatured at 98°C for 30 seconds followed by five cycles of 98°C for 10 seconds, 65°C for 30 seconds, and 72°C for 30 seconds with a final extension at 72°C for 5-minute PCR reactions were purified on the minicolumns, and the 300- to 400-bp library fragment was isolated by gel purification.
Cluster Generation Kit ver. 2 (Illumina/Solexa) and the manufacturer's protocol were used to generate paired ends, which were then sequenced on the GAIIx (Illumina/Solexa) platform (SBS Sequencing Kit, ver. 3; Illumina/Solexa) according to the manufacturer's protocol.
Sequencing data files (SFF format) were converted to FASTA format using sffinfo. Individual samples were separated using cross_match to match M13, MID1, MID2, and MID3 primer sequences with the first 20 bases of every read. Manifests of read names for each primer tail were composed, and sequences were extracted from the master SFF file for each individual.
Read sequences were extracted from individual SFF files using sffinfo and then aligned to the Hs36 reference sequence using BLAT (ver. 32 × 1) (http://genome.ucsc.edu/cgi-bin/hgBlat/ provided in the public domain by the University of California Santa Cruz).35 Alignments with <90% identity, a score of <25, or mapping to multiple locations in the genome (with the same score), were discarded. SNPs and indels were detected in the BLAT alignments using VarScan.36 A minimum of 10× coverage and at least 25% of reads supporting the variant allele was required for variant calling. Substitutions at sites with base quality <15 were discarded. Artifacts from regions homologous to MID primer tails were avoided by discarding variants called within 10 bp of the beginning or end of the read.
GAIIx reads were aligned to the Hs36 reference sequence using Bowtie (ver. 0.9.8, http://bowtie-bio.sourceforge.net).37 Reads were required to have a single best alignment (−m 1) with no more than two high-quality mismatches to the reference sequence. Bowtie alignments were parsed to identify substitutions at positions with base quality of 15 or higher. The variant allele with the greatest read support was called for all positions at which variants were detected.
To detect small indels, GAIIx reads were aligned to the Hs36 reference sequence (Novoalign, ver. 2.03.12; Novoalign, Selangor, Malaysia). VarScan36 was used to call indels based on unique read alignments. Indels with ambiguous positions or flanked by homopolymers were removed. Validation of 454GS FLX indels required that GAIIx indel calls be the same type, size (within 1 bp), and position (±5 bp).
The 1000 amplicon PCR primers with M13 tails were used for all additional analyses of potential disease-causing variants. Genomic DNA was amplified using Taq master mix (AmpliTaq Gold 360 Master Mix; ABI) and standard amplification conditions. PCR product was treated with exonuclease I and shrimp alkaline phosphatase (ExoSapIt; USB, Cleveland, OH) before sequencing. Clean PCR product was sequenced as described previously with dye termination chemistry (BigDye Terminator ver. 1.1; ABI) and M13 primers.38 Sequence reactions were treated with a reaction cleanup kit (BigDye XTerminator; ABI), according to the manufacturer's protocol, and run on one of two automated capillary sequencers (3100 Avant or 3730XL; ABI). Sequence analysis was then performed with one of three commercial software programs (Sequencing Analysis, Variant Reporter, or SeqScape; ABI).
The sample selected for this project included a subset of family members from our previously described AdRP Cohort (Bowne SJ, et al. IOVS 2007;48:ARVO E-Abstract 2334;).21,23,24,38,39 Families in the adRP cohort, based on pedigree analysis, have a high likelihood of having the autosomal dominant form of RP. Analysis of pedigrees for the likelihood of dominance versus X-linked or recessive inheritance did show some families with ad:Xl likelihood odds ratios of less than 0, indicating that the disease in a few families could be caused by mutations in an X-linked gene with clinical expression in carrier females (Table 1).
A proband from each family was tested previously for mutations in the complete coding regions of CA4, CRX, FSCN2, IMPDH1, NRL, PRPF31, RDS, RHO, ROM1, RP9, and TOPORS, and in mutation hot spots of RP1, PRPF3, PRPF8, NR2E3, and SNRNP200. Likely disease-causing mutations were identified in 141 of the 230 families. Only families without previously identified mutations were considered for this study.
Twenty-one families were selected for this project based on pedigree analysis and availability of family member DNAs. Two affected individuals from each family were selected to have the lowest kinship coefficient possible—that is, the most distant, available affected relatives with a common ancestor carrying a putative adRP mutation. This increased the probability that any shared variant identified in this project would be associated with disease, not just identical by chance. The demographic and pedigree characteristics of the families and the kinship coefficient of the two selected family members are shown in Table 1.
Six individuals with known mutations were also selected to use as positive controls in this study (Table 2). These individuals had a variety of mutation types in several different genes, thereby testing the identification rate of next-generation sequencing for different classes of DNA variants.
Each of the 46 genes selected for this project was (1) a known cause of adRP, (2) a known cause of other forms of retinal degeneration with phenotypes overlapping adRP, or (3) a potential disease-causing candidate gene selected from sensory cilium proteome studies, EyeSAGE data, and other studies of retinal expression and protein interaction (Liu Q, et al. IOVS 2006;47:ARVO E-Abstract 3725).30–33 The genes selected, their chromosomal location, and the associated diseases, if any, are listed in Table 3.
PCR primers corresponding to 1000 amplicons were designed to amplify all the coding and noncoding exonic sequences for each of the 46 genes selected. PCR primers were manufactured in sets of four with each set containing the same genome-specific sequence and one of four different tail sequences (M13, MD1, MD2, and MD3). These four tail sequences allowed PCR product from four individuals to be combined after amplification, while retaining the ability to distinguish the four individuals on sequence assembly.
Genomic DNA from each of the 48 individuals tested in this study was amplified by WGA before target amplification. A 454GS FLX amplicon efficiency test was performed on a pool of the DNAs to optimize product pooling such that each amplimer represented in the sequencing library was relatively equivalent (Table 4).
Eleven patient pool libraries were constructed for analysis on the 454GS FLX sequencing system. Each of these pools corresponded to two sets of affected family pairs that had been amplified individually with the four different, tailed target primers. The six positive control samples were pooled to form one library that was not sorted by MD tail.
One concatenated, paired-end library was constructed for analysis (PE sequencer; Illumina/Solexa). Concatenation and shearing of the PCR products before library construction enabled access to those regions that, due to the short read length of GAIIx sequencing, would otherwise not be sequenced, and also gives more random distribution of all positions in a particular sequence. This feature introduces less quality bias based on position of the variation within a given sequence. The paired-end library pooled all 42 unknown affected individuals and the six positive controls. Since this library was not sortable, it was used only for variant confirmation, not individual variant identification.
Sequence reads corresponding to the pooled positive control library were aligned to Hs36 and analyzed for the presence of each known mutation. Five of the six mutations were detected at a read frequency of 6% to 13%, which is in accordance with the predicted detection rate of 8% (Table 5). The sixth variant, a large 47-kb deletion of PRPF31 and several flanking genes present in VCH029, was not detected using this technology, as expected.
Approximately 1.5 million 454GS FLX sequence reads were separated by individual, by using primer tails with typically 90% to 95% of reads identified unambiguously. Alignment of the sequence reads to Hs36 resulted in an average sequence depth of 70×, with 93% of reads mapping to the 46 target genes and identification of more than 9000 variants (Fig. 1, Table 6).
The list of unfiltered variants was compared with the nonpathogenic variants identified in positive controls on the assumption that variants other than the one pathogenic mutation would not be disease-causing in other individuals. All variants found in the positive controls were removed, as were any nonpathogenic variants found in dbSNP (http://www.ncbi.nlm.nih.gov/SNP/ provided in the public domain by the National Institutes of Health, Bethesda, MD). Unfortunately, automated removal of variants in dbSNP was not possible, as a small portion of the variants in dbSNP are truly pathogenic.
The remaining 783 intermediate variants were analyzed to remove duplicate variants found in the same family leaving 420 unique variants. These variants were then annotated and prioritized based on their location in a gene (exon, intron, and splice-site) and on the predicted transcript or protein alteration and by manual assessment of the potential of the affected gene to cause RP. The resulting 112 variants were classified as potentially pathogenic, thereby warranting additional analysis (Fig. 2).
The 66.7 million 36-bp reads from the pooled GAIIx library (Illumina/Solexa) were aligned to Hs36 with an average sequence depth of 125× (Table 6). Variants were called and compared with the list of the 454GS FLX variants. To confirm a 454GS FLX variant required GAIIx read coverage of at least 100× and at least two variant-supporting reads. These data were confirmatory but not used in the initial identification of the 112 potential pathogenic variants.
The 112 potential pathogenic variants were subjected to a series of analyses to determine whether they were true variants, if they segregated with disease in the family in which they were identified, and whether they were present in unaffected controls.
Fluorescent di-deoxy capillary sequencing was used to determine whether the variants identified by next-generation sequencing were actually genomic variants. Genomic DNAs from the original affected family pair and from two additional family members (when available) were tested with the corresponding M13 tailed primers for the original 1000 amplimer amplifications. Traditional Sanger sequencing showed that 55 of the 112 potential pathogenic variants were artifacts of 454GS FLX sequencing. An additional four of the potential pathogenic variants did not amplify within the genomic regions specified for the variant and so were also assumed to be artifacts. With this strategy, 55 of the potential pathogenic variants were confirmed to be present in the identified individuals.
Once a variant was determined to be real, segregation analysis was used to assess its likelihood of it being disease-causing. If initial analyses showed correct segregation in the first set of four family members tested, then all available family DNAs were tested. Forty-three of the 53 confirmed variants did not segregate with disease in the family and hence were determined to be benign. Ten of the variants segregated with disease in all available family members (Table 7).
Three of the 10 segregating variants, KLHL7 p.A153V, RPGR p.G65D, and PRPF31 c.946–1, were identified and characterized in parallel laboratory testing and determined to be pathogenic.40,41 An additional variant in RPGR, p.G738*, was identified among the 10 segregating variants. Although not previously reported, RPGR p.G738*, like many other reported RPGR mutations, produces a premature termination codon in ORF15 and hence is most likely pathogenic. One additional segregating variant in GUCY2D, p.R838C, has also been reported to cause cone–rod dystrophy.42 No further testing was performed for these five disease-causing mutations (Fig. 3).
Ethnically matched control population DNAs were tested for the possible presence of the remaining five variants of unknown pathogenicity. Three of the variants, PRPF8 c.1–51G>A, PITPNM3 p.R703W, and TTC26 c.896+73G>T were found in control DNAs and hence are benign. The two remaining variants, PROM1 c.1302+3C>T and MRFP c.641+9G>A, were not found in the controls. The number of immediately available family member DNA samples was low (three and two, respectively) for the PROM1 and MRFP variants. Subsequent collection and testing of three additional VCH008 family members demonstrated that the PROM1 c.1302+3C>T variant does not segregate with disease. Collection and testing of three additional VCH025 family members also demonstrated that the MRFP variant does not segregate with disease. These data demonstrate that both the PROM1 and MRFP variants are benign.
Analysis of the individual reads from 454GS FLX sequencing identified 77 small, high-confidence indels ranging in size from 1 to 3 bp. Indels with ambiguous positions or flanked by homopolymers were removed and the remaining compared with GAIIx indel data. A total of 10 indels were identified with the GAIIx data (Table 8).
Indels were evaluated using the same fluorescent capillary sequencing strategy described above for the possibly pathogenic SNP variants. Traditional sequencing failed to confirm the presence of nine of the indels. The 10th indel, a 3-bp deletion in ORF15 of RPGR was not assessed, since RPGR is located on the X-chromosome and the family exhibited male-to-male transmission of RP. Furthermore, 3-bp deletions in ORF15 are common and usually benign.43–46
An excess of 9000 variants was identified in the 21 families analyzed in this project. Most of these variants were classified as benign on the basis of their presence in controls with definitive disease-causing mutations distinct from the novel variants. This massive reduction in variants requiring follow-up analyses, over 8000, stresses the importance of running control individuals and multiple family members, if possible, when using next-generation sequencing to detect mutations in families with inherited diseases.
Additional laboratory analyses were performed for 112 possibly pathogenic variants found by 454GS FLX sequencing. The presence of 53 (47%) of these variants was confirmed by traditional sequencing, whereas the remaining 59 (53%) variants were found to be false positives. A large fraction of the false-positive variants occurred in a polynucleotide runs, which is a known limitation of 454 sequencing methodology. This result suggests that additional stringency should be used when identifying variants in polynucleotide runs to reduce the number of false positives.
The pooled GAIIx sequencing data were not used in the initial phases of the project, but were compared to the list of 112 variants to determine whether cross-platform comparisons might also reduce the number of false-positive variants. When compared with the pooled GAIIx data, 85% of the confirmed variants were present, but, 55% of the false positives were also seen in the GAIIx reads. This finding suggests that cross-platform comparisons may be useful for prioritizing variants for subsequent follow-up, but should not be used as an exclusive requirement for variant identification.
Next-generation sequencing of the 1000 amplicons corresponding to 46 candidate genes resulted in identification of five pathogenic mutations in the 21 families tested (Table 7, Fig. 3). As expected, three of these mutations are in genes reported to be associated with either autosomal dominant RP or autosomal dominant cone–rod dystrophy (Birch DG, et al. IOVS 2006;47:ARVO E-Abstract 1037).21,40–42 Somewhat surprising was the identification of two mutations in RPGR. It has been known for some time that mutations in RPGR cause X- linked RP, but the high frequency of symptomatic female carriers is just beginning to be appreciated.43,45,47
Identification of five additional mutations brings the known-mutation frequency of our AdRP cohort up to 64% (Fig. 4). That is, we can identify the disease-causing mutations in 148 of the 230 adRP cohort families.
This project demonstrates that next-generation sequencing can be an effective tool for determining the pathogenic mutation in inherited disease families with highly heterogeneous causes. The large number of those variants proven to be artifacts identified in the limited region of the genome tested during this project raises concerns that the use of next-generation sequencing for larger genomic regions, such as a complete exome or genome, will be daunting. Coupled with the wide genetic variation known to exist in humans, this project makes it evident that, without the ability to perform segregation analysis, it will be extremely difficult to distinguish rare pathogenic variants from rare benign variants.
The authors thank Gita Dangol, Kaylie Webb, and Joe Ray for technical assistance.
Supported by the Foundation Fighting Blindness, National Institutes of Health Grants EY007142 and EY12910, and U54 HG003079 to the Center for Large Scale Genome Sequencing and Analysis.
Disclosure: S.J. Bowne, None; L.S. Sullivan, None; D.C. Koboldt, None; L. Ding, None; R. Fulton, None; R.M. Abbott, None; E.J. Sodergren, None; D.G. Birch, None; D.H. Wheaton, None; J.R. Heckenlively, None; Q. Liu, None; E.A. Pierce, None; G.M. Weinstock, None; S.P. Daiger, None