Of 7028 disorders with suspected Mendelian inheritance, 1139 are recessive and have an established molecular basis. Although individually uncommon, Mendelian diseases collectively account for ~20% of infant mortality and ~10% of pediatric hospitalizations. Preconception screening, together with genetic counseling of carriers, has resulted in remarkable declines in the incidence of several severe recessive diseases including Tay-Sachs disease and cystic fibrosis. However, extension of preconception screening to most severe disease genes has hitherto been impractical. Here, we report a preconception carrier screen for 448 severe recessive childhood diseases. Rather than costly, complete sequencing of the human genome, 7717 regions from 437 target genes were enriched by hybrid capture or microdroplet polymerase chain reaction, sequenced by next-generation sequencing (NGS) to a depth of up to 2.7 gigabases, and assessed with stringent bioinformatic filters. At a resultant 160× average target coverage, 93% of nucleotides had at least 20× coverage, and mutation detection/genotyping had ~95% sensitivity and ~100% specificity for substitution, insertion/deletion, splicing, and gross deletion mutations and single-nucleotide polymorphisms. In 104 unrelated DNA samples, the average genomic carrier burden for severe pediatric recessive mutations was 2.8 and ranged from 0 to 7. The distribution of mutations among sequenced samples appeared random. Twenty-seven percent of mutations cited in the literature were found to be common polymorphisms or misannotated, underscoring the need for better mutation databases as part of a comprehensive carrier testing strategy. Given the magnitude of carrier burden and the lower cost of testing compared to treating these conditions, carrier screening by NGS made available to the general population may be an economical way to reduce the incidence of and ameliorate suffering associated with severe recessive childhood disorders.
Preconception testing of motivated populations for recessive disease mutations, together with education and genetic counseling of carriers, can markedly reduce disease incidence within a generation. Tay-Sachs disease [TSD; Online Mendelian Inheritance in Man (OMIM) accession number 272800], for example, is an autosomal recessive neurodegenerative disorder with onset of symptoms in infancy and death by 2 to 5 years of age. Formerly, the incidence of TSD was 1 per 3600 Ashkenazi births in North America (1, 2). After 40 years of preconception screening in this population, however, the incidence of TSD has been reduced by more than 90% (2–5). Although TSD remains incurable, therapies are available for many severe recessive diseases of childhood. Thus, in addition to disease prevention, preconception testing could enable perinatal diagnosis and treatment, which can profoundly diminish disease severity.
Although individual Mendelian diseases are uncommon in general populations, collectively, they account for ~20% of infant mortality and ~10% of pediatric hospitalizations (6, 7). Over the past 25 years, 1139 genes that cause Mendelian recessive diseases have been identified (8). To date, however, preconception carrier testing has been recommended in the United States only for five of these: fragile X syndrome (OMIM #300624) in selected individuals; cystic fibrosis (OMIM #219700) in Caucasians; and TSD, Canavan disease (OMIM #271900), and familial dysautonomia (OMIM #223900) in individuals of Ashkenazi descent (9–13). A framework for the development of criteria for comprehensive preconception screening can be inferred from an American College of Medical Genetics (ACMG) report on expansion of newborn screening for inherited diseases (14). Criteria included test accuracy and cost, disease severity, highly penetrant recessive inheritance, and whether an intervention was available for those identified. These criteria are also relevant for expansion of preconception carrier screening. Hitherto, important criteria precluding extension of preconception screening to most severe recessive mutations or the general population have been cost [defined in that report as an overall analytical cost requirement of <$1 per test per condition (14)] and the absence of accurate, sensitive, scalable technologies.
Target capture and next-generation sequencing (NGS) have shown efficacy and, recently, scalability for resequencing human genomes and exomes, providing an alternative potential paradigm for comprehensive carrier testing (15–22). In genome research, an average depth of sequence coverage of 30-fold has been accepted as sufficient for single-nucleotide polymorphism (SNP) and nucleotide insertion or deletion (indel) detection (15–22). However, acceptable false-positive and false-negative rates for routine use in clinical practice are more stringent and are driven by the intended purpose for which the data are to be used. Data demonstrating the sensitivity and specificity of genotyping of disease mutations, particularly polynucleotide indels, gross insertions and deletions, copy number variations (CNVs), and complex rearrangements, are very limited (20–22). In particular, the accuracy of disease mutation genotypes derived from NGS of enriched targets has been uncertain.
A recent workshop provided recommendations for qualification of new methodologies for broader population-based carrier screening (23). These were high analytical validity, concordance in many settings, high throughput, and cost-effectiveness (including sample acquisition and preparation). Here, we report the development of a preconception carrier screen for 448 severe recessive childhood disease genes, based on target enrichment and NGS that meets most of these criteria, and use of the screen to assess carrier burden for severe recessive diseases of childhood.
The carrier test reported herein was based on several hypotheses. First, cost-effectiveness was assumed to be critical for test adoption. The incremental cost associated with increasing the degree of multiplexing was assumed to decrease toward an asymptote. Thus, very broad coverage of diseases was assumed to offer optimal cost-benefit. Second, comprehensive mutation sets, allele frequencies in populations, and individual mutation genotype-phenotype relationships have been defined in very few recessive diseases. In addition, some studies of cystic fibrosis carrier screening for a few common alleles have shown decreased prevalence of tested alleles with time, rather than reduced disease incidence (24, 25). These two lines of evidence suggested that very broad coverage of mutations offered the greatest likelihood of substantial reductions in disease incidence with time. Third, physician, patient, and societal adoption of screening was assumed to be optimal for the most severe and highly penetrant childhood diseases, before conception and where the anticipated clinical validity and clinical utility of testing was clear (26). Therefore, diseases were chosen that would almost certainly change family planning by prospective parents or affect antenatal, perinatal, or neonatal care. Milder recessive disorders, such as deafness, and adult-onset diseases, such as inherited cancer syndromes, were omitted, as were conditions lacking strong evidence for causal mutations (26).
Database and literature searches and expert reviews were performed on 1123 diseases with recessive inheritance of known molecular basis (8, 27, 28). In general, diseases were selected to meet ACMG guidelines for genetic testing for rare, highly penetrant disorders (26). Assessment of the clinical validity and utility of testing was primarily based on literature review and was challenging for some disorders because of the paucity of data. Several subordinate requirements were gathered: In view of pleiotropy and variable severity, disease genes were included if mutations caused severe illness in a proportion of affected children. All but six diseases that featured genocopies (including variable inheritance and mitochondrial mutations) were included. Diseases were not excluded on the basis of low incidence. Diseases for which large population carrier screens exist were included, such as TSD, hemoglobinopathies, and cystic fibrosis. Mental retardation genes were not included in this iteration. Four hundred and forty-eight X-linked recessive and autosomal recessive diseases, encompassing 437 genes, met these criteria (table S1). The disease type was cardiac for 8, cutaneous for 45, developmental for 46, endocrine for 15, gastroenterological for 3, hematological for 15, hepatic for 3, immunological for 29, metabolic for 142, neurological for 122, ocular for 12, renal for 25, respiratory for 8, and skeletal for 28. Note that these genes, although a good representative set, require further assessment of clinical readiness before translation into clinical testing.
Array hybridization with allele-specific primer extension was initially favored for expanded carrier detection because of test simplicity, cost, scalability, and accuracy, as has recently been described (29). To be well suited for array-based screening, however, most carriers must be accounted for by a few mutations, and most disease mutations must be nucleotide substitutions (8, 27, 28). Of 215 autosomal recessive disorders examined, only 87 were assessed to meet these criteria. Most recessive disorders for which a large proportion of burden was attributable to a few disease mutations were limited to specific ethnic groups. Indeed, 286 severe childhood autosomal recessive diseases encompassed 19,640 known disease mutations (8, 27, 28). Given that the Human Gene Mutation Database (HGMD) lists 102,433 disease mutations (27), a number that is steadily increasing, a fixed-content method appeared impractical. Other concerns with array-based screening for recessive disorders were type 1 errors in the absence of confirmatory testing and type 2 errors for disease mutations other than substitutions (complex rearrangements, indels, or gross deletions with uncertain boundaries). A serendipitous discovery (discussed below) that supported this decision was an unexpectedly high number of characterized mutations that are misannotated.
The effectiveness and remarkable decline in cost of exome capture and NGS for variant detection in genomes and exomes suggested an alternative potential paradigm for comprehensive carrier testing. Four target enrichment and three NGS methods were preliminarily evaluated for multiplexed carrier testing. Preliminary experiments suggested that existing protocols for Agilent SureSelect hybrid capture (15) and RainDance microdroplet polymerase chain reaction (PCR) (16) but not Febit HybSelect microarray-based biochip capture (30) or Olink padlock probe ligation and PCR (31) yielded consistent target enrichment. Therefore, workflows and software pipelines were developed for comprehensive carrier testing by hybrid capture or microdroplet PCR, followed by NGS (Fig. 1). Baits or primers were designed to capture or amplify 1,978,041 nucleotides (nt), corresponding to 7717 segments of 437 recessive disease genes by hybrid capture and microdroplet PCR, respectively. Targeted were all coding exons and splice site junctions, and intronic, regulatory, and untranslated regions known to contain disease mutations (table S2). In general, baits for hybrid capture or PCR primers were designed to encompass or flank disease mutations, respectively. Primers were also designed to avoid known polymorphisms and to minimize nontarget nucleotides. To capture or amplify both the normal and the disease mutation alleles, we also designed custom baits or primers for 11 gross deletion disease mutations for which boundaries had been defined (table S3). A total of 29,891 120-mer RNA baits were designed to capture 98.7% of targets. Fifty-five percent of 101 exons that failed bait design contained repeat sequences (table S4). Primer pairs (10,280) were designed to amplify 99% of targets (table S5). Twenty exons failed primer design by falling outside the amplicon size range of 200 to 600 nt.
An ideal target enrichment protocol would inexpensively result in at least 30% of nucleotides being on target, which corresponded to ~500-fold enrichment with ~2-million-nucleotide target size. This was achieved with hybrid capture after one round of bait redesign for underrepresented exons and decreased bait representation in over-represented exons (Table 1). An ideal target enrichment protocol would also give a narrow distribution of target coverage and without tails or skewness (indicative of minimal enrichment-associated bias). After hybrid capture, the sequencing library size distribution was narrow (Fig. 2A). The aligned sequence coverage distribution was unimodal but flat (platykurtic) and right-skewed (Fig. 2B). This implied that hybrid capture would require oversequencing of most targets to recruit a minority of poorly selected targets to adequate coverage. As expected, median coverage increased linearly with sequence depth. The proportion of bases with greater than zero and >20× coverage increased toward asymptotes at ~99 and ~96%, respectively (Table 1 and Fig. 2C). Targets with low (<3×) coverage were highly reproducible and had high GC content (table S6). This suggested that targets failing hybrid capture could be predicted and, perhaps, rescued by individual PCRs.
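The relationship between the on-target fraction and fold enrichment stated above can be checked with a short calculation. The following is a minimal sketch: the target size is the 1,978,041 nt given in the text, whereas the genome size of 3.1 × 10^9 nt and the function name `fold_enrichment` are assumptions made for illustration.

```python
# Assumption: approximate haploid human genome size, not taken from the text.
GENOME_SIZE_NT = 3.1e9
# Target size reported in the text for the 7717 enriched segments.
TARGET_SIZE_NT = 1_978_041


def fold_enrichment(on_target_fraction: float,
                    target_nt: float = TARGET_SIZE_NT,
                    genome_nt: float = GENOME_SIZE_NT) -> float:
    """Observed on-target fraction divided by the fraction expected by chance."""
    expected_fraction = target_nt / genome_nt
    return on_target_fraction / expected_fraction


if __name__ == "__main__":
    # 30% of nucleotides on target corresponds to roughly 500-fold enrichment.
    print(f"{fold_enrichment(0.30):.0f}-fold")
```

Under these assumptions, 30% on target works out to about 470-fold, consistent with the ~500-fold figure quoted for a ~2-million-nucleotide target.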
Given the need for highly accurate carrier detection, we required >10 uniquely aligned reads of quality score >20 and >14% of reads to call a variant (20, 21). The requirement for >10 reads was highly effective for nucleotides with moderate coverage. For heterozygote detection, for example, this was equivalent to ~20× coverage, which was achieved in ~96% of exons with ~2.6 gigabases (Gb) of sequence (Fig. 2C). The proportion of targets with at least 20× coverage appeared to be useful for quality assessment. The requirement for ≥14% of reads to call a variant was highly effective for nucleotides with very high coverage and was derived from the genotype data discussed below. A quality score requirement was important when NGS started, but is now largely redundant.
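The read-count and read-fraction requirements described above can be sketched as a simple filter. This is an illustrative sketch only, not the pipeline used in the study: the function and constant names are hypothetical, and because the text states the thresholds both as strict (">10", ">14%") and non-strict ("≥14%") inequalities, the use of "at least" comparisons here is an assumption.

```python
MIN_VARIANT_READS = 10        # supporting reads required (from the text)
MIN_QUALITY = 20              # minimum quality score per supporting read
MIN_VARIANT_FRACTION = 0.14   # minimum fraction of reads showing the variant


def passes_variant_filter(variant_read_qualities: list[int],
                          total_depth: int) -> bool:
    """Return True if a candidate variant meets all three thresholds.

    variant_read_qualities: quality scores of the uniquely aligned reads
        that support the variant (hypothetical input format).
    total_depth: total number of uniquely aligned reads at the position.
    """
    if total_depth <= 0:
        return False
    # Only reads meeting the quality threshold count as support.
    supporting = sum(1 for q in variant_read_qualities if q >= MIN_QUALITY)
    enough_reads = supporting >= MIN_VARIANT_READS
    enough_fraction = supporting / total_depth >= MIN_VARIANT_FRACTION
    return enough_reads and enough_fraction
```

The two thresholds act in different coverage regimes, as the text notes: the read-count floor dominates at moderate depth (12 supporting reads at 24× pass, 9 at 18× do not), whereas the fraction floor dominates at very high depth (12 supporting reads at 200× fail).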
In theory, microdroplet PCR should result in all cognate amplicons being on target and should induce minimal bias. In practice, the coverage distribution was narrower than hybrid capture but with similar right skewing (Fig. 2D). However, these results were complicated by ~11% recurrent primer synthesis failures. This resulted in linear amplification of a subset of targets, ~5% of target nucleotides with zero coverage and a similar proportion of nucleotides on target to that obtained in the best hybrid capture experiments (~30%; Table 1). Hybrid capture was used for subsequent studies for reasons of cost.
Multiplexing of samples during hybrid selection and NGS had not previously been reported. Six- and 12-fold multiplexing was achieved by adding molecular bar codes to adaptor sequences. Interference of bar code nucleotides with hybrid selection did not occur appreciably: The stoichiometry of multiplexed pools was essentially unchanged before and after hybrid selection. Multiplexed hybrid selection was found to be ~10% less effective than singleton selection, as assessed by median fold enrichment. Less than 1% of sequences were discarded at alignment because of bar code sequence ambiguity. Therefore, up to 12-fold multiplexing at hybrid selection and per sequencing lane (equivalent to 96-plex per sequencing flow cell) was used in subsequent studies to achieve the targeted cost of <$1 per test per sample.
Several NGS technologies are currently available. Of these, the Illumina sequencing-by-synthesis (SBS) and SOLiD sequencing-by-ligation (SBL) platforms are widely disseminated and have throughput of at least 50 Gb per run and read lengths of at least 50 nt. Therefore, the quality and quantity of sequences from multiplexed, target-enriched libraries were compared with SBS (GAIIx singleton 50-mer) and SBL (SOLiD3 singleton 50-mer; Table 1). SBS- and SBL-derived 50-mer sequences (and alignment algorithms) gave similar alignment metrics (Table 1). When compared with Infinium array results, specificity of SNP genotypes by SBS and SBL was very similar (SBS, 99.69%; SBL, 99.66%), reflecting both target enrichment and multiplexed sequencing (Fig. 3).
Given approximate parity of throughput and accuracy, consideration was given to optimal read length. Unambiguous alignment of short-read sequences is typically confounded by repetitive sequences, but this was not relevant for carrier testing, because targets overwhelmingly contained unique sequences. The number of mismatches tolerated for unique alignment of short-read sequences is highly constrained but increases with read length. The vast majority of disease mutations are single-nucleotide substitutions or small indels. However, comprehensive carrier testing also requires detection of polynucleotide indels, gross insertions, gross deletions, and complex rearrangements. A combination of bioinformatic approaches was used to overcome short-read alignment shortcomings (Fig. 4). First, with the Illumina HiSeq SBS platform, we used the novel approach of read pair assembly before alignment (99% efficiency) to generate longer reads with high quality scores (148.6 ± 3.8 nt combined read length and increase in nucleotides with quality score >30 from 75 to 83%). This was combined with generation of 150-nt sequencing libraries without gel purification by optimization of DNA shearing procedures and use of silica membrane columns. Omission of gel purification was critical for scalability of library generation. Second, we reduced the penalty on polynucleotide variants, rewarding identities (+1) and penalizing mismatches (−1) and indels [−1 − log(indel length)]. Third, gross deletions were detected both by perfect alignment to mutant junction reference sequences and by local decreases in normalized coverage (normalized to total sequence generated; C. H. Hu, personal communication). Previous studies have identified CNVs on the basis of changes in regional coverage along a chromosome in an individual sample (20, 21). 
However, concomitant analysis of normalized coverage in batches of samples appears to circumvent the need for adjustment for GC content (32), allowing more accurate detection of segmental losses. This was illustrated by identification of eight known gross deletion disease mutations (Fig. 5). Furthermore, seeking perfect alignment to mutant junction reference sequences obviates low alignment scores when short reads containing polynucleotide variants are mapped to a normal reference. This was illustrated by identification of 11 gross deletion mutations for which boundaries had been defined (table S3). It is anticipated that these approaches could be extended to gross insertions and complex rearrangements but will require additional analytical validation.
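Two of the approaches described above, the logarithmic indel penalty and batch-wise comparison of normalized coverage, can be illustrated with a short sketch. It is not the aligner or the deletion caller used in the study. The function names are hypothetical, the base of the logarithm is not stated in the text (natural log is assumed), and comparing each sample with the batch median is an assumed way of analyzing samples concomitantly.

```python
import math
from statistics import median


def indel_penalty(length: int) -> float:
    """Gap score of -1 - log(indel length); natural log is an assumption."""
    if length < 1:
        raise ValueError("indel length must be at least 1")
    return -1.0 - math.log(length)


def alignment_score(matches: int, mismatches: int,
                    indel_lengths: list[int]) -> float:
    """+1 per identity, -1 per mismatch, logarithmic penalty per indel."""
    return (matches - mismatches
            + sum(indel_penalty(n) for n in indel_lengths))


def normalized_coverage(reads_on_target: int, total_reads: int) -> float:
    """Coverage of one target normalized to total sequence for the sample."""
    return reads_on_target / total_reads


def batch_coverage_ratios(batch: dict[str, float]) -> dict[str, float]:
    """Each sample's normalized coverage relative to the batch median.

    A ratio near 0.5 would be consistent with a heterozygous deletion and a
    ratio near 0 with a homozygous or hemizygous deletion.
    """
    reference = median(batch.values())
    return {sample: value / reference for sample, value in batch.items()}
```

With this scoring, a 50-nt read containing a 3-nt deletion and two mismatches scores about 40.9, compared with 40 if each deleted nucleotide were penalized as a mismatch; the advantage of the logarithmic penalty grows with indel length. Because every sample in a batch is compared at the same target, sequence-dependent effects such as GC content largely cancel in the ratio.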
On the basis of these strategies and our previous experience of genotyping variants identified in next-generation genome and chromosome sequences (20, 21, 33, 34), a bioinformatic decision tree for genotyping disease mutations was developed (Fig. 4). Clinical utility of target enrichment, SBS sequencing, and this decision tree for genotyping disease mutations was assessed. SNPs in 26 samples were genotyped by both high-density arrays and sequencing. The distribution of read count–based allele frequencies of 92,106 SNP calls was tri-modal, with peaks corresponding to homozygous reference alleles, heterozygotes, and homozygous variant alleles, as ascertained by array hybridization (Fig. 6B). Optimal genotyping cutoffs were 14 and 86% (Fig. 6B). With these cutoffs and a requirement for 20× coverage and 10 reads of quality ≥20 to call a variant, the accuracy of sequence-based SNP genotyping was 98.8%, sensitivity was 94.9%, and specificity was 99.99%. The positive predictive value (PPV) of sequence-based SNP genotypes was 99.96% and negative predictive value (NPV) was 98.5%, as ascertained by array hybridization. As sequence depth increased from 0.7 to 2.7 Gb, sensitivity increased from 93.9 to 95.6%, whereas PPV remained ~100% (Fig. 6A). Areas under the curve (AUCs) of the receiver operating characteristic (ROC) for SNP calls by hybrid capture and SBS were calculated. When genotypes in 26 samples were compared with genome-wide SNP array hybridization, the AUC was 0.97 when either the number or the percent reads calling a SNP were varied (Fig. 6, C and D). When the parameters were combined, the AUC was 0.99. For known substitution, indel, splicing, gross deletion, and regulatory alleles in 76 samples, sensitivity was 100% (113 of 113 known alleles; table S7). The higher sensitivity for detection of known mutations reflected manual curation. The 20 known indels were confirmed by PCR and Sanger sequencing. 
Notably, substitutions, indels, splicing mutations, and gross deletions account for the vast majority (96%) of annotated mutations (27).
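The genotyping cutoffs and performance measures reported above can be expressed compactly. The sketch below is illustrative and is not the decision tree of Fig. 4: the function names are hypothetical, the treatment of values falling exactly on a cutoff is an assumption, and any confusion-matrix counts passed to `performance` are supplied by the caller rather than taken from the study.

```python
def genotype(variant_reads: int, total_reads: int,
             low: float = 0.14, high: float = 0.86,
             min_depth: int = 20) -> str:
    """Classify a site by the fraction of reads carrying the variant allele."""
    if total_reads < min_depth:
        return "no_call"            # below the 20x coverage requirement
    fraction = variant_reads / total_reads
    if fraction < low:
        return "hom_ref"
    if fraction > high:
        return "hom_var"
    return "het"


def performance(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Standard measures computed from a confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }
```

As a usage example, a site at which 64 of 138 reads carry the variant is classified as heterozygous, and one at which 38 of 39 reads carry it is classified as homozygous variant.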
Unexpectedly, 14 of 113 literature-annotated disease mutations were either incorrect or incomplete (table S7) (35–39). PCR and Sanger sequencing confirmed that the 14 variants and genotypes called by NGS were correct. For example, sample NA07092, from a male with X-linked recessive Lesch-Nyhan syndrome (OMIM #300322), was characterized as a deletion of HPRT1 exon 8 by complementary DNA (cDNA) sequencing (40), but had an explanatory splicing mutation (intron 8, IVS8+1_4delGTAA, chrX:133460381_133460384delGTAA; Fig. 7A). NA09545, from a male with X-linked recessive Pelizaeus-Merzbacher disease (PMD; OMIM #312080), characterized as a substitution disease mutation [PLP1 exon 5, c.767C>T, P215S (41)], was found to also feature PLP1 gene duplication [which is reported in 62% of sporadic PMD (42); Fig. 7B]. NA02057, from a female with aspartylglucosaminuria (OMIM #208400), characterized as a compound heterozygote, was homozygous for two adjacent substitutions (AGA exon 4, c.482G>A, R161Q, chr4:178596918G>A and exon 4, c.488G>C, C163S, chr4:178596912G>C in 38 of 39 reads; Fig. 8), of which C163S had been shown to be the disease mutation (43). Although one allele of NA01712, a compound heterozygote (CHT) with Cockayne syndrome type B (OMIM #133540), had been characterized by cDNA analysis as a deletion of ERCC6 exon 9 [c.1993_2169del, p.665_723del, exon 9 del, chr10:50360915_50360739del (44)], no decrease in normalized exon 9 read number was observed despite more than 300× coverage (Fig. 5G). Instead, 64 of 138 NA01712 reads contained a nucleotide substitution that created a premature stop codon (Q664X, chr10:50360741C>T). Both ERCC4 mutations described in CHT NA03542 were absent in at least 130 aligning reads (44). However, the current study used DNA from Epstein-Barr virus (EBV)–transformed cell lines in which somatic hypermutation has been noted (45). In particular, ERCC4, a DNA repair gene, is a likely candidate for somatic mutation. 
Including these results, the specificity of sequence-based genotyping of substitution, indel, gross deletion, and splicing disease mutations was 100% (97 of 97).
The average carrier burden of mutations causing severe childhood recessive diseases was assessed in 104 DNA samples. All variants meeting the filtering criteria described above and flagged as disease mutations in HGMD were enumerated. Seventy-four percent of these, however, were accounted for by 47 substitutions each with an incidence of ≥5%, of which 20 were homozygous in samples unaffected by the corresponding disease (table S8). These were omitted. Literature support for pathogenicity was evaluated for the remaining variants flagged as disease mutations in HGMD. Variants were retained as disease mutations if they had been shown to result in loss of activity in a functional assay, were the only variants detected in affected individuals and absent in controls, and/or were predicted to result in a premature stop codon or loss of a substantial portion of the protein (Fig. 4). In total, 27% (122 of 460) of literature-cited disease mutations were omitted, because they were adjudged to be common polymorphisms or sequencing errors or because of a lack of evidence of pathogenicity. New, putatively deleterious variants (variants in severe pediatric disease genes that create premature stop codons or coding domain frameshifts) were quantified: 26 heterozygous or hemizygous new nonsense variants were identified in 104 samples (table S9). Including the latter, 336 variants were retained as likely disease mutations.
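The retention criteria described above can be summarized as a triage rule. The following sketch is a simplification for illustration only. In the study, pathogenicity was assessed by literature review rather than by an automated rule, and the class, field, and function names here are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FlaggedVariant:
    """A variant flagged as a disease mutation (hypothetical record layout)."""
    name: str
    cohort_incidence: float                 # fraction of samples carrying it
    loss_of_function_in_assay: bool = False
    only_variant_in_affected_absent_in_controls: bool = False
    truncating: bool = False                # premature stop or major protein loss


def retain_as_disease_mutation(variant: FlaggedVariant,
                               max_incidence: float = 0.05) -> bool:
    """Omit common variants, then require at least one line of evidence."""
    if variant.cohort_incidence >= max_incidence:
        return False                        # treated as a common polymorphism
    return (variant.loss_of_function_in_assay
            or variant.only_variant_in_affected_absent_in_controls
            or variant.truncating)


if __name__ == "__main__":
    # Proportion of literature-cited disease mutations omitted in the study.
    print(f"{122 / 460:.0%}")
```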
The average carrier burden of severe recessive substitutions, indels, and gross deletion disease mutations, after exclusion of one allele in compound heterozygotes, was 2.8 per genome (291 in 104 samples). The carrier burden frequency distribution was unimodal with slight right skewing (Fig. 7C). The range in carrier burden was surprisingly narrow (zero to seven per genome, with a mode of two; Fig. 7C).
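The burden arithmetic above can be reproduced in a few lines. In this minimal sketch, counting at most one allele per gene is an assumed approximation of "exclusion of one allele in compound heterozygotes", the function names are hypothetical, and the only study values used are the totals of 291 mutations and 104 samples.

```python
from collections import Counter


def carrier_burden(mutations: list[tuple[str, str]]) -> int:
    """Burden for one sample from (gene, variant) pairs.

    Assumption: counting each gene once removes the second allele in
    compound heterozygotes.
    """
    return len({gene for gene, _variant in mutations})


def summarize(burdens: list[int]) -> dict[str, float]:
    """Mean, mode, and range of per-sample carrier burdens."""
    mode_value, _count = Counter(burdens).most_common(1)[0]
    return {
        "mean": sum(burdens) / len(burdens),
        "mode": mode_value,
        "min": min(burdens),
        "max": max(burdens),
    }


if __name__ == "__main__":
    # Totals reported in the text: 291 mutations in 104 samples.
    print(round(291 / 104, 1))  # prints 2.8
```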
As exemplified by cystic fibrosis, the carrier incidence and mutation spectrum of individual recessive disorders vary widely among populations (46). However, although group sizes were small, no significant differences in total carrier burden were found between Caucasians and other ethnicities, between males and females, or between affected and unaffected individuals (after correction for compound heterozygosity in those affected). Hierarchical clustering of samples and disease mutations revealed an apparently random topology, suggesting that targeted population testing is likely to be ineffective (Fig. 7D). Adequacy of hierarchical clustering was attested to by samples from identical twins being nearest neighbors, as were two disease mutations in linkage disequilibrium.
We have described a screening test for carriers of 448 severe childhood recessive illnesses consisting of target enrichment, NGS, and bioinformatic analyses, which worked well in a research setting. Specificity was 99.96%, and a sensitivity of ~95% was attained with hybrid capture at a sequence depth of 2.5 Gb per sample. Because enrichment failures with hybrid capture were reproducible, they may be amenable to rescue by individual PCR or probe redesign. Alternatively, microdroplet PCR should theoretically achieve a sensitivity of ~99%, albeit at higher cost (16, 47). The test was scalable, modular, and amenable to automation, with batches of 192 samples and a turnaround of 2 weeks. The time to first result could be reduced substantially with microdroplet PCR and third-generation sequencing. At high volume, the overall analytical cost of the hybrid enrichment-based test was $378, achieving the requirement of <$1 per test per condition and approximating that expended on treatment of severe recessive childhood disorders per U.S. live birth (14, 29). Although the analytical cost will decrease as the throughput of NGS improves, test interpretation, reporting, genetic counseling, and stewardship of mutation databases will confer considerable additional costs.
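The cost criterion can be verified directly from the figures given above. This is a brief sketch in which the function name `cost_per_condition` is hypothetical and the inputs are the $378 analytical cost and the 448 conditions stated in the text.

```python
ANALYTICAL_COST_USD = 378.0   # overall analytical cost per test (from the text)
CONDITIONS_SCREENED = 448     # number of diseases covered by the screen


def cost_per_condition(total_cost: float = ANALYTICAL_COST_USD,
                       conditions: int = CONDITIONS_SCREENED) -> float:
    """Analytical cost divided evenly across the screened conditions."""
    return total_cost / conditions


if __name__ == "__main__":
    per_condition = cost_per_condition()
    print(f"${per_condition:.2f} per condition")
    print("meets <$1 criterion:", per_condition < 1.0)
```

At $378 for 448 conditions, the analytical cost is about $0.84 per condition, below the <$1 per test per condition requirement.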
Having established technical feasibility in a research setting, the next phases of carrier test development will be refinement of the list of diseases, automation, software implementation, report development, and, most important, validation in a realistic testing situation featuring investigator blinding and less manual review. For example, genes associated with severe cognitive developmental disorders may merit inclusion. Although technical standards and guidelines have been established for laboratory-developed genetic testing for rare disorders in accredited laboratories (26), there are several challenges in their adoption for NGS and bioinformatic-based testing of ~500 conditions. For example, specific national standards for quality assurance, quality control, test accessioning and reporting, and proficiency evaluation do not currently exist. Addressing crucial issues such as specificity and false positives is complex when hundreds of genes are being sequenced simultaneously. For certain diseases, such as cystic fibrosis, reference sample panels and metrics have been established. For diseases without such materials, it is prudent to test as many samples containing known mutations as possible. In setting up and validating the screen, it would also be necessary to test examples of all classes of mutations and situations that are anticipated to be potentially problematic, such as mutations within high GC content regions, simple sequence repeats, and repetitive elements.
The ethical, legal, and social implications of comprehensive carrier testing warrant much discussion. These issues, in turn, are influenced by the scope and setting in which testing is proposed. The ideal age for recessive disease screening is in early adulthood and before pregnancy (48, 49). One possibility would be voluntary community-based population testing. This would have an advantage over testing in a hospital setting, where information about carrier testing often is communicated during pregnancy or after the birth of an affected child (50). Community-based carrier testing has had high uptake, without apparent stigma or discrimination and with substantial reductions in the frequencies of tested disorders (3, 48, 49, 51–54). After stakeholder discussions, the cost-effectiveness and clinical utility of offering community-based carrier testing would require detailed assessment. Examination of the results of existing population-based carrier screening programs for TSD and cystic fibrosis could provide templates for such analyses.
Rapid adoption of comprehensive carrier testing is likely by in vitro fertilization clinics, where screening of sperm and oocyte donors has high clinical utility, lower counseling burden, and small incremental cost (55). Early adoption is also likely in medical genetics clinics, where counseling resources already exist, to screen individuals with a family history of inherited disease. Although the data reported herein are preliminary, the apparent random distribution of mutations in individuals argues against screening different populations for different diseases. The most significant hurdles to implementing comprehensive carrier screening will be facile interpretation of results, reporting in a manner comprehensible by physicians and patients, education of the public of the benefits and limitations of screening, and provision of genetic counselors.
Currently, a two-stage approach is used for preconception carrier screening of couples, with confirmatory testing of all positive results. However, this has been in a setting of testing individual genes for specific mutations where positive results are rare. The requirement for at least 10 high-quality reads to substantiate a variant call resulted in a specificity of 99.96% for single-nucleotide substitutions (which is the limit of accuracy for the gold standard method used) and 100% for about 200 known mutations and new indels in our screening method. It appeared, therefore, that confirmatory testing of all single-nucleotide substitutions and indels was unnecessary. Obviously, inclusion of controls in each test run and random sample retesting will be required. Experience with polynucleotide indels, copy number variants, gross insertions and deletions, and complex rearrangements is as yet insufficient to draw firm conclusions. However, detection of perfect alignments to mutant reference sequences appeared to be robust for identification of gross insertions and deletions. We noted, however, that identification of larger polynucleotide indels was influenced in some sequences by the particular alignment seed, suggesting that additional refinement of alignment parameters is needed.
We found an unexpectedly high proportion of literature-annotated disease mutations that were incorrect, incomplete, or common polymorphisms. Differentiation of common polymorphisms from disease mutations requires genotyping a large number of unaffected individuals. Severe, orphan disease mutations should be uncommon (<1% incidence) and should not be found in the homozygous state in unaffected individuals. Unexpectedly, we found that 74% of “disease mutation” calls were accounted for by substitutions with incidences of ≥5%, of which almost one-half were homozygous in samples unaffected by the corresponding disease. Also unexpected was the finding that 14 of 113 literature-annotated disease mutations were incorrect. Thus, for many recessive diseases, HGMD, dbSNP, OMIM, and the literature are insufficient arbiters of whether variants are disease mutations. We have shown NGS of samples from affected individuals to be a powerful method for error correction: More than three-quarters of errors in mutation identification were Sanger sequencing interpretation errors or incorrect imputation of genomic mutations from cDNA sequencing. Key advantages of NGS are clonal derivation (facilitating unambiguous detection of heterozygous and indel variants), maintenance of phase information (allowing haplotype derivation for adjacent variants), and highly redundant coverage (resulting in extremely low consensus error rates). Thus, although we have shown that it is technically feasible to undertake comprehensive analysis of recessive gene sequences, sequencing of many unaffected and affected samples will be required to establish an authoritative disease mutation database. Specifically, current reference resources contain common polymorphisms that are annotated as disease mutations and erroneous disease mutations. Without reference database improvements, the clinical utility of comprehensive carrier testing will be limited. 
Aside from nonsense mutations and premature stop codons in known disease genes and the study of affected individuals, additional bioinformatic approaches will be needed to distinguish rare benign variants from pathogenic variants: Amino acid substitution characteristics such as physicochemical and evolutionary conservation and location (where tertiary structure is known) are useful but not definitive. For many rare variants, functional assays will need to be developed to assess pathogenicity rigorously. Establishment of an authoritative database of disease mutations is clearly needed and represents a nascent bottleneck in progress toward prevention, diagnosis, and treatment of recessive diseases. In the interim, clinical interpretation of the functional importance or pathogenicity of variants will be challenging for many recessive diseases.
A first estimate of the average carrier burden of disease mutations (substitutions, indels, and gross deletions) causing severe childhood recessive diseases was determined: In 104 unrelated individuals, it was 2.8 per genome. Several qualifications of this burden estimate should be noted. First, as discussed, an adequate compilation of pathogenic mutations does not currently exist, and strong evidence of pathogenicity was absent for some of the variants referred to as disease mutations. Second, the burden estimate excluded new, rare, missense variants of unknown significance (VUSs), some of which are likely to be pathogenic. The burden of nonconservative, nonsynonymous, uncommon (<5% incidence) VUS was ~11 per sample. Additional strategies are needed to triage these variants. Third, many individuals in our cohort were affected by one of these diseases. Although a correction was made for compound heterozygote and homozygote alleles, the burden estimate did not correct for other potential selection biases. Fourth, we did not assess gross deletions or other copy number variants beyond limited CNV array hybridization and examination of coverage changes in a small number of known deletions. Nevertheless, a burden of 2.8 per genome agreed with theoretical estimates of reproductive lethal allele burden (56). It also concurred with severe childhood recessive carrier burdens that we obtained by analyzing published individual genomes [2 substitution disease mutations in the Quake genome and a monozygotic twin pair (21, 57), 5 each in the YH and Watson genomes (58, 59), 4 in the NA07022 genome (60, 61), and 10 in the AK1 genome (20)]. The range in carrier burden was surprisingly narrow (zero to seven per genome). 
Given the large variations in SNP burden and incidence of individual disease alleles among populations, it will be of great interest to evaluate variation in the burden of severe recessive disease mutations among human populations and how this has been influenced by population bottlenecks.
Finally, the technology platform described herein is agnostic with regard to target genes or clinical setting. A variety of medical applications for this technology exist beyond use in preconception carrier screening. For example, comprehensive newborn screening for treatable or preventable Mendelian diseases would allow early diagnosis and institution of treatment while neonates are asymptomatic. Early treatment can have a profound impact on the clinical severity of conditions and could provide a framework for centralized assessment of investigational new treatments before organ failure. In some cases, such as Duarte variant galactosemia, molecular testing would be superior to conventional biochemical testing. Organ or symptom menu-based diagnostic testing, with masking of nonselected conditions, is anticipated to assist clinical geneticists and pediatric neurologists, because current practice often involves costly, sequential testing of numerous candidate genes. Given impending identification of new disease genes by exome and genome resequencing, the number of disease genes is likely to increase substantially over the next several years, requiring incremental expansion of the target gene sets.
In summary, a technology platform for comprehensive preconception carrier screening for 448 recessive childhood diseases is described. Combining this technology with genetic counseling could reduce the incidence of severe recessive pediatric diseases and may help to expedite diagnosis of these disorders in newborns.
Criteria for disease inclusion for preconception screening were broadly based on those for expansion of newborn screening, but with omission of treatment criteria (14). Thus, very broad coverage of severe childhood diseases and mutations was sought to maximize cost-benefit, potential reduction in disease incidence, and adoption. A Perl parser identified severe childhood recessive disorders with known molecular basis in OMIM (8). Database and literature searches and expert reviews were performed on the resultant diseases (8, 27, 28). Six diseases with extreme locus heterogeneity were omitted (OMIM #209900, #209950, Fanconi anemia, #256000, #266510, #214100). Diseases were included if mutations caused severe illness in a proportion of affected children, despite variable inheritance, mitochondrial mutations, or low incidence. Mental retardation and mitochondrial genes were excluded. Four hundred and thirty-seven genes, representing 507 recessive diseases, met these criteria, of which 448 diseases were severe (table S1).
Target enrichment was performed with 104 DNA samples obtained from the Coriell Institute (Camden, NJ) (table S7). Seventy-six of these were known to be carriers or affected by 37 severe, childhood recessive disorders. The latter samples contained 120 known disease mutations in 34 genes (63 substitutions, 20 indels, 13 gross deletions, 19 splicing, 2 regulatory, and 3 complex disease mutations). They also represented homozygous, heterozygous, compound heterozygous, and hemizygous disease mutation states. Twenty-six samples were well characterized, from “normal” individuals, and two had previously undergone genome sequencing (21).
For Illumina GAIIx SBS, 3 μg of DNA was sonicated by Covaris S2 to ~250 nt with 20% duty cycle, 5 intensity, and 200 cycles per burst for 180 s. For Illumina HiSeq SBS, shearing to ~150 nt was by 10% duty cycle, 5 intensity, and 200 cycles per burst for 660 s. Bar-coded sequencing libraries were made per the manufacturer’s protocols. After adaptor ligation, Illumina libraries were prepared with AMPure beads (Beckman Coulter) rather than by gel purification. Library quality was assessed by optical density and electrophoresis (Agilent 2100).
SureSelect enrichment of 6-, 8-, or 12-plex pooled libraries was per Agilent protocols (15), with 100 ng of custom bait library, blocking oligonucleotides specific for paired-end sequencing libraries, and 60-hour hybridization. Biotinylated RNA library hybrids were recovered with streptavidin beads. Enrichment was assessed by quantitative PCR (Life Technologies) of four targeted loci (CLN3, exon 15, Hs00041388_cn; HPRT1, exon 9, Hs02699975_cn; LYST, exon 5, Hs02929596_cn; PLP1, exon 4, Hs01638246_cn) and one nontargeted locus (chrX: 77082157, Hs05637993_cn), measured before and after enrichment.
RainDance RDT1000 target enrichment was as described and used a custom primer library (16, 46): Genomic DNA samples were fragmented by nebulization to 2 to 4 kb and 1 μg mixed with all PCR reagents but primers. Microdroplets containing three primer pairs were fused with PCR reagent droplets and amplified. After emulsion breaking and purification by MinElute column (Qiagen), amplicons were concatenated overnight at 16°C and sequencing libraries were prepared. Sequencing was performed on Illumina GAIIx and HiSeq2000 instruments per the manufacturer’s protocols, as described (20, 21).
For SOLiD sequencing, DNA (3 μg) was sheared by Covaris to ~150 nt with 10% duty cycle, 5 intensity, and 100 cycles per burst for 60 s. Bar-coded fragment sequencing libraries were made with Life Technologies protocols and reagents. Taqman quantitative PCR was used to assess each library, and an equimolar six-plex pool was produced for enrichment with Agilent SureSelect and a modified protocol. Before enrichment, the six-plex pool was single-stranded. Furthermore, 1.2 μg of pooled DNA with 5 μl (100 ng) of custom baits was used for enrichment, with blocking oligonucleotides specific for SOLiD sequencing libraries and 24-hour hybridization. This was the first targeted capture of a multiplex library for SOLiD sequencing, and this protocol has not been subsequently pursued. Alternative methods have been demonstrated to reduce the noise associated with bar coding and enrichment. Sequencing was performed on a SOLiD 3 instrument with one quadrant on a single sequencing slide, generating singleton 50-mer reads.
The bioinformatic decision tree for detecting and genotyping disease mutations was predicated on experience with detection and genotyping of variants in next-generation genome and chromosome sequences (20, 21, 33, 34) (Fig. 4). Briefly, SBS sequences were aligned to the National Center for Biotechnology Information (NCBI) reference human genome sequence (version 36.3) with GSNAP and scored by rewarding identities (+1) and penalizing mismatches (−1) and indels [−1 − log(indel length)]. Alignments were retained if covering >95% of the read and scoring >78% of maximum. Variants were detected with Alpheus with stringent filters (≥14% and ≥10 reads calling variants and average quality score >20). Allele frequencies of 14 to 86% were designated heterozygous and >86% homozygous. Reference genotypes of SNPs and CNVs mapping within targets were obtained with Illumina Omni1-Quad arrays and GenomeStudio 2010.1. Indel genotypes were confirmed by genomic PCR of <600-bp regions flanking the variants and Sanger sequencing.
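The scoring and filtering rules in this paragraph can be written out as a short sketch. It is an illustration of the stated thresholds, not the GSNAP or Alpheus implementation. Several points are assumptions made here: the logarithm is taken as natural, the maximum score is taken as the read length, and the read-count and allele-fraction bounds are treated as inclusive (following the "at least 10 reads" requirement and the 14 to 86% heterozygous range). All function names are hypothetical.

```python
import math


def alignment_score(identities, mismatches, indel_lengths):
    """+1 per identity, -1 per mismatch, -1 - log(length) per indel.

    The base of the logarithm is not stated in the text; natural log is assumed.
    """
    indel_penalty = sum(1 + math.log(length) for length in indel_lengths)
    return identities - mismatches - indel_penalty


def keep_alignment(read_length, aligned_length, score):
    """Retain alignments covering >95% of the read and scoring >78% of maximum.

    The maximum score is assumed to equal the read length (all identities).
    """
    return (aligned_length / read_length > 0.95
            and score / read_length > 0.78)


def call_genotype(variant_reads, total_reads, mean_quality):
    """Apply the variant filters, then genotype by allele fraction."""
    if total_reads == 0:
        return "no call"
    fraction = variant_reads / total_reads
    # Filters: >=10 variant reads, >=14% of reads, average quality score >20.
    if variant_reads < 10 or fraction < 0.14 or mean_quality <= 20:
        return "no call"
    # 14 to 86% is designated heterozygous, >86% homozygous.
    return "homozygous" if fraction > 0.86 else "heterozygous"


# Illustrative values only.
print(alignment_score(48, 2, []))    # 46
print(keep_alignment(50, 50, 46))    # True
print(call_genotype(80, 160, 30))    # heterozygous
print(call_genotype(150, 160, 30))   # homozygous
```

The 14% lower bound is well below the 50% expected for a heterozygote. It tolerates uneven allele sampling and enrichment bias, while the read-count and quality filters exclude most sequencing errors.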
SBL sequence data analysis was performed with BioScope v1.2. Fifty-nucleotide reads were aligned to NCBI genome build 36.3 with a seed-and-extend approach (max-mapping). A 25-nt seed with up to two mismatches is first aligned to the reference. Extension can proceed in both directions, depending on the footprint of the seed within the read. During extension, each base match receives a score of +1, whereas mismatches get a default score of −2. The alignment with the highest mapping quality value is chosen as the primary alignment. If two or more alignments have the same score, then one of them is randomly chosen as the primary alignment. SNPs were called with the BioScope diBayes algorithm at medium stringency setting (61). diBayes is a Bayesian algorithm that incorporates position and probe errors, as well as color quality value information for SNP calling. Reads with mapping quality of <8 were discarded by diBayes. A position must have at least 2× or 3× coverage to call a homozygous or heterozygous SNP, respectively. The BioScope small indel pipeline was used with default settings and calls insertions of size ≤3 nt and deletions of size ≤11 nt. In comparisons with SBS, SNP and indel calls were further restricted to positions where at least 4 or 10 reads called a variant.
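The seed-and-extend scoring described here can be illustrated with a simplified sketch. It is not the BioScope algorithm: it works in base space rather than SOLiD color space, places the seed at the start of the read and extends only rightward, allows no gaps, and scores the seed with the same +1/−2 scheme. These simplifications, the function names, and the example sequence are all assumptions made for illustration.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def seed_hits(read, reference, seed_len=25, max_mismatches=2):
    """Reference offsets where the read's first seed_len bases align
    with at most max_mismatches mismatches."""
    seed = read[:seed_len]
    return [offset
            for offset in range(len(reference) - seed_len + 1)
            if hamming(seed, reference[offset:offset + seed_len]) <= max_mismatches]


def extend(read, reference, offset, seed_len=25, match=1, mismatch=-2):
    """Extend rightward from the seed and return (best_score, aligned_length).

    Each match adds +1 and each mismatch -2; the extension is cut at the
    position where the running score peaks.
    """
    window = reference[offset:offset + len(read)]
    score = sum(match if r == g else mismatch
                for r, g in zip(read[:seed_len], window[:seed_len]))
    best_score, best_length = score, seed_len
    for pos in range(seed_len, len(window)):
        score += match if read[pos] == window[pos] else mismatch
        if score > best_score:
            best_score, best_length = score, pos + 1
    return best_score, best_length


# Made-up 85-nt reference and a 50-nt read copied from offset 20.
REFERENCE = ("GATTACAGGCTTAACCGTAGCTAGGATCCATGCAAGTCTGACCTGAATCG"
             "GATACGTTAGCCATGGTCAATGCGTACCTGAAGTC")
READ = REFERENCE[20:70]

print(20 in seed_hits(READ, REFERENCE))  # True
print(extend(READ, REFERENCE, 20))       # (50, 50)
```

With a mismatch penalty of −2, a mismatch at the last base of the read is trimmed rather than aligned, because including it would lower the score. A mismatch in the middle of the read is retained, because the matches that follow recover the penalty.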
PCR primers were designed to amplify 100 to 300 nt upstream and downstream of each variant or indel with PrimerQuest (Integrated DNA Technologies). Targeted regions were amplified from 100 ng of genomic DNA, and resultant PCR amplicons were analyzed for predicted size by LCGX (Caliper Life Sciences). Amplicons of appropriate size were Sanger-sequenced in both the forward and the reverse directions with the same primers used for PCR amplification. Analysis was performed with the Mutation Surveyor (SoftGenetics) software package.
Fig. S1. One end of five reads from NA01712 showing ERCC6 exon 17, c.3536delA, Y1179fs, chr10:50348476delA.
Fig. S2. One end of five reads from NA20383 showing CLN3 exon 11, c.1020G>T, E295X, chr16:28401322G>T (black arrow).
Fig. S3. One end of five reads from NA16643 showing HBB exon 2, c.306G>C, E102D, chr11:5204392G>C (black arrow).
Table S1. Four hundred and forty-eight severe pediatric recessive diseases, encompassing 437 genes, that met criteria for carrier screening.
Table S2. Sequences and genome coordinates of 29,891 Agilent SureSelect 120-mer RNA baits for hybrid capture of 7616 (98.7%) of 7717 segments of 437 genes causing severe recessive pediatric disorders.
Table S3. Custom Agilent SureSelect RNA baits for hybrid capture of 11 gross deletion DMs with defined boundaries.
Table S4. Repeat content of 55 exons (5773 nt, 46.27%) failing RNA bait design due to repetitive sequences.
Table S5. Sequences and genome coordinates of 10,280 primer pairs for microdroplet PCR (RainDance) of 7717 segments of 437 genes causing severe recessive pediatric disorders.
Table S6. Coordinates, genes, and GC content of 40 exons with recurrent coverage <3×.
Table S7. Confirmed and corrected disease mutations (DMs) in 104 DNA samples, together with enrichment technologies and sequencing platforms used to characterize them.
Table S8. Variants reported in HGMD to be disease mutations that occurred with incidence >5% in 104 samples by target enrichment and second-generation sequencing or that were assessed to be homozygous in unaffected samples, indicative that they were polymorphisms.
Table S9. Severe recessive pediatric disease-causing mutations (DMs) identified in 104 samples by target enrichment and second-generation sequencing.
We thank M. Chandler and M. Spain, who envisioned universal preconception screening, and the many physicians and geneticists who refined the concept and candidate disease list, particularly H. H. Ropers and C. J. Saunders. This work is dedicated to Christiane. A deo lumen, ab amicis auxilium.
Funding: This work was funded by grants from the Beyond Batten Disease Foundation and NIH (RR016480 to F.D.S.), and by in-kind support from Illumina Inc., Life Technologies, and British Airways PLC.
Author contributions: C.J.B. led the project, contributed computer programming and data analysis, and wrote the manuscript. D.L.D. contributed to study design; performed literature research, target enrichments, sequencing, genotyping, and data analysis; and wrote the manuscript. N.A.M. carried out data pipelining, software development, and bioinformatics. S.L.H. carried out literature research and data analysis and contributed to target enrichment and sequencing. E.E.G. performed literature research, target enrichment, and sequencing. J.M. provided data analysis. R.J.L. provided target enrichment and sequencing and assisted with data analysis. L.Z. performed sequencing. C.C.L. designed the SOLiD sequencing and data analysis. J.E.W. provided sequencing and genotyping. H.E.P. performed SOLiD 3 sequencing. F.D.S. assisted in project management and provision of resources. V.S. performed data pipelining and bioinformatic analysis. G.P.S. designed the HiSeq sequencing. R.W.K. provided oversight of sequencing operations. S.F.K. conceived and designed the study, wrote the manuscript, and carried out data analysis.
Competing interests: L.Z. is an employee of Illumina Inc. At the time the research was performed, G.P.S. was an employee of Illumina Inc. C.C.L., H.E.P., and V.S. are employees of Life Technologies. U.S. patent application 20090183268 entitled “Methods and systems for medical sequencing analysis” was filed by the National Center for Genome Resources on July 16, 2009. This application has claims related to this work. The other authors declare no competing interests.
Accession numbers: Nucleotide sequences are deposited in the NCBI at SRA026957.1. Nucleotide variants may be searched at http://hematite.ncgr.org.