|Home | About | Journals | Submit | Contact Us | Français|
Exome sequencing has identified the causes of several Mendelian diseases, although it has rarely been used in a clinical setting to diagnose the genetic cause of an idiopathic disorder in a single patient. We performed exome sequencing on a pedigree with several members affected with attention deficit/hyperactivity disorder (ADHD), in an effort to identify candidate variants predisposing to this complex disease. While we did identify some rare variants that might predispose to ADHD, we have not yet proven the causality for any of them. However, over the course of the study, one subject was discovered to have idiopathic hemolytic anemia (IHA), which was suspected to be genetic in origin. Analysis of this subject’s exome readily identified two rare non-synonymous mutations in PKLR gene as the most likely cause of the IHA, although these two mutations had not been documented before in a single individual. We further confirmed the deficiency by functional biochemical testing, consistent with a diagnosis of red blood cell pyruvate kinase deficiency. Our study implies that exome and genome sequencing will certainly reveal additional rare variation causative for even well-studied classical Mendelian diseases, while also revealing variants that might play a role in complex diseases. Furthermore, our study has clinical and ethical implications for exome and genome sequencing in a research setting; how to handle unrelated findings of clinical significance, in the context of originally planned complex disease research, remains a largely uncharted area for clinicians and researchers.
The advent of the capture of most known exons in the genome (the exome), followed by high-throughput sequencing, has already led to the identification of the causes of many Mendelian disorders (Bilguvar et al., 2010; Choi et al., 2009; Erlich et al., 2011; Johnston et al., 2010; Ng et al., 2010a; Ng et al., 2010b; Pierce et al., 2010), including from just a single individual (Haack et al., 2010). As the cost of exome (and even whole genome) sequencing decreases, it is becoming increasingly feasible to use exome sequencing in the diagnostic realm for Mendelian diseases with already identified genetic mutations, rather than only to investigate the genetic basis of unknown Mendelian disorders. Indeed, one of the pioneering studies examining the clinical utility of whole-exome sequencing demonstrated how an unexpected genetic diagnosis of congenital chloride diarrhea in a patient suspected with Bartter syndrome can be made from sequencing data (Choi et al., 2009). Furthermore, exome sequencing may also be an important tool for investigating complex traits. However, such studies, whether family-based or case series-based, are only just now being performed to our knowledge. Therefore, it remains an open question how much exome sequencing can advance our understanding of genetics of complex traits, particularly when sample sizes and/or pedigrees are relatively small.
To begin to address this issue, we focused on a complex illness, Attention Deficit/Hyperactivity Disorder (ADHD, OMIM: #143465). ADHD is a common disorder affecting more than 1 in 20 children in the U.S., with as many as 50% remaining symptomatic into adulthood, along with much heterogeneity in the presentation of the illness (CDC, 2010; Petersen et al., 2009). Genetic factors are thought to play a large role in the etiology of the disorder, but studies thus far remain inconclusive (Mick et al., 2010; Neale et al., 2010a; 2010b). We have hypothesized that rare, family-specific genetic variants may account for some of the “missing heritability” of ADHD, and we have presented evidence for the involvement of mGluR5 in one family (Elia et al., 2010). This is consistent with recent studies which showed that an excess of deleterious rare low-frequency nonsynonymous mutations reside in the human population (Li et al., 2010). Given the multifactorial nature of complex diseases, especially neuropsychiatric diseases, one way to reduce the complexity of searching for disease genes is to focus on rare variants with potentially higher penetrance, identified from clearly familial cases (Cirulli and Goldstein, 2010). Familial segregation of ADHD in a small family such as this one suggested to us that some major-effect genetic variants could be present and we set out to determine whether exome sequencing could shed light on the genetic basis of ADHD in this one family. Although we did identify several such rare variants, we have not yet proven that the identified variants cause the ADHD (alone or in combination), and we remark on the reasons for why this has been difficult. Our study also led to the unrelated characterization of the genetic basis of a case of idiopathic hemolytic anemia (IHA), which ended up being a clearly Mendelian trait (Pyruvate Kinase Deficiency of Red Cells, OMIM #266200). This study highlights the coming wave of unrelated findings and the resolution of “idiopathic” diseases with whole exome and whole genome sequencing.
The samples used in our study all came from the same pedigree ascertained in clinics at the University of Utah. The collection and genomic analysis of the DNA were approved by the Institutional Review Board at the University of Utah, and written informed consent was obtained from all study participants. Blood samples were collected and genomic DNA extracted using alkaline lysis and ethanol precipitation (Gentra Puregene, Qiagen Corp., USA). DNA was quality-checked on agarose gels and quantified according to standard protocols.
The clinical records of patient 84060 were reviewed with his permission, allowing for collection of data related to his hemolytic anemia. ADHD Research scales and assessments included: Parents Rating Scale (PRS) (Ward et al., 1993), Wender Utah Rating Scale (WURS) (Ward et al., 1993), Wender-Reimherr Adult Attention Deficit Disorder Scale (WRAADDS) (Rosler et al., 2010b), Conners Adult ADHD Rating Scale - Investigator Version (CAARS-Inv) (Rosler et al., 2010a), Clinical Global Impression - Improvement (CGI-I), Clinical Global Impression - Severity (CGI-S), Wisconsin Personality Disorders Inventory-IV (Smith et al., 2003), Personality Disorder Questionnaire (Hyler et al., 1992), Iowa Personality Disorder Screen (Langbehn et al., 1999), CNS Vital Signs (CNSVS) (Gualtieri and Johnson, 2006), Weissman Social Adjustment Scale - Self Report (SAS-SR) (Weissman et al., 2001), Sheehan Disability Scale (Leon et al., 1997), Childhood Symptom Scale - Self Report Form (ChSS-SRF) (Murphy and Adler, 2004), and Disruptive Behavior Rating Scale - Retrospective Adult Form (DBRS-RAF) (Friedman-Weieneth et al., 2009).
Exome capture for the three males was carried out in January 2010 using the commercially available Agilent SureSelect Human All Exon v1 in solution method as per the manufacturer guidelines (Agilent). This method is designed to target all human exons, regions totaling approximately 38 Mb, in a single tube. The DNA from the unaffected mother was obtained at a later date, allowing us to use the newly released SureSelect Human All Exon v.2 Kit, which targets approximately 44 Mb, covering 98.2% of the CCDS database. For both captures, the pure and high molecular weight genomic DNA samples were randomly fragmented at BGI-Shenzhen by Covaris, resulting in DNA fragments with a base pair peak of 150 to 200 bp. Adaptors were then ligated to both ends of the resulting fragments. The adaptor-ligated templates were purified by the Agencourt AMPure SPRI beads and fragments with insert size of approximately 250bp being excised. Extracted DNA was amplified by ligation-mediated PCR (LM-PCR), purified, and hybridized to the Agilent SureSelect Library for enrichment. Hybridized fragments were bound to the streptavidin beads whereas non-hybridized fragments were washed out after a 24-hour hybridization. Captured LM-PCR products were subjected to Agilent 2100 Bioanalyzer to estimate the magnitude of enrichment. Paired end sequencing was performed using the Illumina Genome Analyzer IIx platform with read lengths of 76 base pairs, providing at least 20× average coverage at the targeted region. The unaffected mother was sequenced with read lengths of 90 base pairs due to technological advancements during the course of the study, at an average coverage of 30× at the targeted region. Raw image files were processed by Illumina Pipeline v1.6 for base-calling with default parameters. FASTQ files were produced from the pipeline for downstream sequence data analysis.
We identified variants with four pipelines.
The first pipeline was performed at BGI-Shenzhen. Sequence reads were aligned to human reference genome builds hg19 using SOAPaligner 2.20 (Li et al., 2008) with a maximum of two mismatches. The consensus genotypes in target regions (based on the SureSelect Human All Exon Kit) were called by SOAPsnp version 1.03 (Li et al., 2009). Single nucleotide variant (SNV) results were filtered as followed: Base quality ≥20, depth from 4–200, copy number estimate <2, and distance between two adjacent SNVs no less than 5. Insertions/deletions (indels) <5 base pairs were detected using SOAPindel. The resulting variant genotypes relative to the human reference assembly were identified.
BWA (Li and Durbin, 2009) version 0.5.8 was used to align the sequencing reads, with default parameters, to the human reference genome sequence build 37 (hg19), downloaded from the 1000 Genomes Project website (1000 Genomes Consortium, 2010). BWA implements a backward search with Burrows-Wheeler Transform to efficiently align short reads with perfect identity against references. Alignments were converted from SAM format to sorted, indexed BAM files with SamTools (Evjenth et al., 2009). The Picard tool (http://sourceforge.net/projects/picard/) was used to remove invalid alignments and remove duplicate reads from the BAM files. The BAM files were re-aligned with the GATK IndelRealigner too (McKenna et al., 2010). Genotypes were called by the GATK UnifiedGenotyper and the IndelGenotyperV2. Based on the recommendations from the authors of GATK (see http://www.broadinstitute.org/gsa/wiki/index.php/Best_Practice_Variant_Detection_with_the_GATK_v2), we removed variant calls based on having any of the following criteria: 1) SNVs with-in clusters (3 SNVs within 10 bp of each other); (2) more than four reads with mapping quality of zero (MQ0) and more than 10% of reads with mapping quality of zero; (3) strand bias (SB) higher than or equal to −1.0; (4) SNV quality score less than 30; (5) quality-by-depth (QD) score less than 5.0; (6) largest Contiguous Homopolymer Run of Variant Allele (HRun) more than 5; or (7) SNVs around a potential indel. Finally, we removed all variants with depth coverage less than 6.
Similar to Method II, BWA-SW was used to align the sequencing reads to hg19 with default parameters; duplicates were removed by picard and local re-alignment was done by GATK. Instead of using the variant call method provided by GATK, we used SNVer (http://snver.sourceforge.net/) for detecting SNVs in each sample (Wei et al., 2011). We considered the mapped short reads with mapping quality >20, and counted only bases with base quality >30. Because of the cleaning step by picard and GATK, SNVer assumed a small sequencing/mapping error rate of 0.01 for the cleaned mapped short reads in making variant calls. SNVer sets the number of haploids to 2 for analysis of individual samples. We set the variant allele frequency threshold >0 for detecting both rare and common SNVs. SNVer provides multiplicity control, and we performed Bonferroni correction and controlled the family-wise error rate at 0.05 level. After calling SNVs in each sample, we selected those SNVs which are shared by the three patients but absent in the healthy control (mother). Then, the SNVs called were extracted and annotated by ANNOVAR (Wang et al., 2010b) for further analysis.
SNVs were verified using GNUMAP (Clement et al., 2009). GNUMAP utilizes position-weight matrices and a modified Needleman-Wunsch algorithm to probabilistically align reads to multiple locations on a reference genome. Alignment was performed for the four subjects using human genome version hg19. GNUMAP’s SNP-caller outputs coverage by base at each locus in the genome and uses a p-value statistic to call single nucleotide polymorphisms (SNPs). USeq’s Alleler software package (Nix et al., 2008) (http://useq.sourceforge.net/) was used to isolate non-synonymous variants. SamTools was used to visualize read pile-ups for manual verification and analysis of these variants.
All DNA samples were genotyped on the Illumina Human610-Quad version 1 SNP arrays (Illumina, San Diego, CA) with ~610,000 markers (including ~20,000 non-polymorphic markers) at the Center for Applied Genomics, Children’s Hospital of Philadelphia. Total genomic DNA extracted from whole blood was used in the experiments. Standard data normalization procedures and canonical genotype clustering files provided by Illumina were used to process the genotyping signals. Copy number variants (CNVs) were detected using PennCNV (Wang et al., 2007) and genomic wave adjustment (Diskin et al., 2008), with default parameters. Concordance between SNPs from the arrays and SNPs from exome sequencing was determined by calculating the percentage of variants from exome sequencing also with the same genotype derived from the SNP arrays.
Variants were functionally annotated using the ANNOVAR software (Wang et al., 2010b). The variants reduction pipeline was also applied to identify a list of prioritized variants, but without filtering out variants in segmental duplication regions or conserved regions.
SNVs were validated using standard Sanger sequencing or Sequenom iPlex platform (http://www.sequenom.com/iplex).
The preparation of red cell hemolysates (to remove leucocytes and thrombocytes) and the measurements of enzymatic activities of pyruvate kinase, hexokinase, and glucose-6-phosphate dehydrogenase were performed according to standard methods (Beutler, 1984).
The pedigree collected in our study is shown in Figure 1, with a mother, father, and four children. The father and two sons in this pedigree have ADHD, combined hyperactive and inattentive subtype, meeting DSM-IV-Text Revision and the Utah criteria for ADHD (Rosler et al., 2008), whereas the mother was evaluated and was determined to be unaffected. The status of the two sisters and more distantly related relatives is unknown, as they refused to participate in the research. The father and two sons also participated in an adult ADHD clinical trial in the Psychiatric Research Clinic at the University of Utah (Marchant et al., 2011). The clinical trial was a double-blind placebo-controlled crossover study of methylphenidate transdermal system (MTS) for adult ADHD. It was noted by the psychiatrists (G.J.L., F.R., and R.R.) that all three subjects were very similar in presentation, with all three being moderately to severely impaired with their ADHD symptoms (with a score of 4 or greater on the CGI-Severity Scale for ADHD at both Screening and Baseline visits), and all improving with active medication but not placebo. Supplementary Table 1 (see the attached patient information and supplementary tables and figures) shows ADHD assessment rating scores during the clinical trial at the following assessments: 1) baseline, 2) after treatment with active medication, and 3) after treatment with placebo.
Given that some prior studies have found large CNVs associated with ADHD (Elia et al., 2010), we initially screened for such CNVs in the family we studied. Using the data from the high-density SNV arrays, we performed CNV analysis to identify CNVs shared by all three ADHD cases. Four deletions were shared among the three individuals (Supplementary Table 5), but all of these CNVs appear to be common in the population based on Database for Genomic Variants, and thus likely not associated with ADHD or idiopathic anemia in this pedigree.
In an effort to understand the genetic basis of the ADHD exhibited in this pedigree, we initially used exome capture (with Agilent SureSelect v.1) and sequenced the DNA of the index patient (84060), his father (92157), and his brother (84615) (Figure 1). To evaluate the sensitivity of exome sequencing to different data analytical strategies, we applied four computational pipelines to identifying genetic variants from the exome sequencing data (see Materials and Methods). Comparative analysis of the results from four pipelines may teach us about the sensitivity of data interpretation to the utilization of analytical approaches, highlighting the computational challenges in utilizing exome sequencing in clinical diagnosis. Approximately 20,000 single nucleotide variants (SNVs) were detected in each of the males being sequenced at a mean average coverage of ~20×, using four different computational pipelines for variant detection (Table 1, Figure 2, and Supplementary Figures 1 and 2). Given that our initial analyses revealed the desirability of filtering against an unaffected relative, we sequenced the mother’s exome at a later time, using a newer version of the exon capture (Agilent SureSelect v.2) and with a higher mean sequencing coverage of ~30×. We focused for the analysis presented here on the ~38 MB target region common to both platforms (Supplementary Tables 2–3). Given the experimental design, we benefited from the fact that we could check the relatives to verify calls, thus effectively tripling at a minimum the number of sequencing reads. This was particularly important in areas of relatively low coverage, given that the mean sequencing coverage ranged from 20–30×, which resulted in 10 or more reads in 67–75% of the target bases in each of the 4 individuals, but which was improved overall by having the sequence from the relatives. For example, after lowering the coverage threshold, there were 28,125,119 base pairs covered with six or more reads among all the three affected males. We also checked the accuracy of the exome sequencing by comparing to the Illumina Human610 SNP arrays with ~590,000 SNP markers and ~20,000 copy number markers. The concordance rate between the SNP results from the arrays and the exome sequencing variant calls was above 99.8% for all samples (Supplementary Table 4). We also verified by a combination of Sanger sequencing or Sequenom genotyping 19 of the SNVs identified by the SOAP pipeline as shared in the two brothers and father (Supplementary Figures 3–26). All 19 of these SNVs validated, demonstrating a very low rate of false positives for SNVs. On the other hand, the tested version of SOAPindel was not reliable for indel-calling, as 10 out of 13 called microindels were not validated with Sequenom genotyping. We therefore switched to the GATK pipeline for indels, as it has been shown to be more reliable in this regard (Depristo et al., 2011).
We made the assumption that any causative common variants with large effect sizes would have already been found in genome-wide association studies (GWAS). Therefore, we generated a list of rare [minor allele frequency (MAF) <1% in the 1000 Genomes project] candidate variants shared in the father and two sons that were not found in the mother (see Figure 3). It is important to note that the three different pipelines used to analyze all four exomes produced different sets of variant calls, although the vast majority of such variants (n=1,426) were found by all three pipelines (Figure 3). We also assessed the frequency of these variants in ~6,000 exomes already sequenced in other projects at BGI and Baylor (Supplementary Table 6). Nonsynonymous rare variants in ATP7B, CSTF2T, ALDH1L1, and METTL3 appeared to be the most plausible candidates for an association with ADHD in this pedigree, as these genes have been shown to be brain-expressed (at the level of RNA and/or protein) and several are implicated in possible neuropsychiatric conditions (Barley et al., 2009; Cocco et al., 2009; Elleuch et al., 2010; Kurian et al., 2011; Shankarling et al., 2009). However, to our knowledge, exome sequencing has not yet unambiguously identified many (if any) diseases variants shown to be definitely causative by themselves for complex diseases. After sequencing two parents with two sons, with the three males of the family sharing a well-phenotyped and similar version of a complex disease, we can obtain a prioritized list of potential candidate variants, but we have no certainty that any one of them can cause ADHD in the family, either by themselves or in aggregate. As it is well known among psychiatrists that there is extreme heterogeneity of ADHD in terms of its presentation, we decided to sequence this family due to the marked similarity in phenotype in the three males and what appeared to be an enhanced enrichment of ADHD in this family, based on reports from family members regarding other members of the family. We have unfortunately not been able to expand this pedigree further, as several members of the family have refused to participate in the genetics research. When we chose this family for sequencing, we did not anticipate the difficulty that we would encounter with recruiting other members of the family. Therefore, given the relatively small size of the family and the refusal of other family members to come in for an assessment, we were unable to define exactly the inheritance pattern in this family, although it seemed very reasonable to start with the assumption that variants causing the ADHD would be shared between the father and two sons.
Immediately after we obtained the first set of exome sequencing data, one of the patients (patient 84060, a 28-year old Caucasian male) revealed that he had previously been diagnosed with idiopathic hemolytic anemia. After further reviewing his medical history, we recognized a long record of intermittent jaundice since childhood, and he had always been chronically anemic, with hemoglobin (Hgb) levels ranging from 10–12 g/dL (normal range 13–16). As a youth, he had been informed that he had idiopathic hemolytic anemia, but no genetic test had ever been performed to the best of his knowledge. He had visited a hematologist, who had ordered several clinical tests, but he was unaware of the results from those tests. We therefore decided to test the clinical utility of exome sequencing in solving the idiopathic hemolytic anemia in this patient. Further details of his clinical course are presented in Supplementary Information (Patient Presentation).
We began by analyzing the non-synonymous and frame-shift insertions/deletions in this individual, as such mutations are much more likely to affect protein function. There were ~6,000 such SNVs in 84060 (Table 1). A series of procedures (Figure 4) was implemented in the ANNOVAR pipeline for the purpose of reducing the number of candidate variants. Unlike previous studies on recessive diseases, we did not utilize dbSNP filtering, because causal variants for well-studied diseases [such as those that cause hearing loss or Crohn’s disease (Wang et al., 2010a)] are likely already catalogued in dbSNP. We used an MAF>1% filter for the 1000 Genomes Project for a similar reason (see Figure 4). After these filtering steps, we began with the assumption of either an autosomal recessive (compound heterozygotes or homozygotes) or X-linked mode of inheritance. Under the autosomal model, we required at least two mutations from different parents in the same gene, given that the idiopathic hemolytic anemia was not observed in either of his parents, or his brother or sisters.
The X-linked inheritance model did not reveal any mutations of functional significance. Using the autosomal recessive model we generated a list of variants in 43 genes (Supplementary Table 7). Non-immune mediated hereditary hemolytic anemia with unaltered red blood cell (RBC) morphology is mainly caused by mutations in genes involved in red cell metabolism (glycolysis, hexose monophosphate shunt, and the purine salvage pathway) (van Solinge, 2010)). A thorough literature search for the 43 candidate genes revealed evidence supporting the mutations in PKLR (pyruvate kinase, liver, and RBC isoform 2) as being well known to cause hemolytic anemia. Patient 84060 is a compound heterozygote for two very rare mutations in PKLR. These two mutations are c.1022G>C substitution in exon 8, encoding a Gly341Ala amino acid change, and a c.1706G>A substitution in exon 12, encoding an Arg569Gln amino acid change (Figures 5–6). The protein encoded by this gene is pyruvate kinase (PK), which catalyzes the transphosphorylation of phosphoenolpyruvate into pyruvate and ATP. PK acts at a rate-limiting step of glycolysis. Defects in this enzyme due to gene mutations are known to cause chronic hereditary nonspherocytic hemolytic anemia (CNSHA or HNSHA, OMIM ID #266200) (van Wijk and van Solinge, 2005; Zanella et al., 2007). From the list of 43 candidates, the patient was also homozygous for a mutation in ULK1 (NM_003565:exon23: c.A2446G:p.T816A). The ULK1 serine threonine kinase is known to be a critical regulator of mitochondrial and ribosomal clearance during the final stages of erythroid maturation (Kundu et al., 2008), but there is no evidence that mutation in this gene can result in anemia. None of the other genes had any known connection to hemolytic anemia.
The c.1706G>A exon 12 p.Arg569Gln mutation in PKLR was present only in the mother’s exome, whereas the c.1022G>C exon 8 p.Gly341Ala mutation was found only in the father’s exome, thus confirming an autosomal recessive mode of inheritance. Both mutations were confirmed by Sanger sequencing in the respective family members (Supplementary Figures 27–33). Neither mutation was found in 50 control Caucasian subjects sequenced in other projects; however, the c.1706G>A exon 12 p.Arg569Gln mutation was found rarely in the following datasets: 1) listed in dbSNP (version 130), 2) appeared in the 1000 Genomes project with a frequency of 0.1%, 3) seen in one out of 40 whole-genomes from the Complete Genomics Diversity Panel (http://www.completegenomics.com/sequence-data/download-data/), and 4) was seen six times in 5,680 exomes sequenced for other projects at BGI. On the other hand, the c.1022G>C exon 8 p.Gly341Ala mutation was completely unique, not being found in any of the above datasets. No other deleterious mutations in PKLR were found in patient 84060. These two mutations have been submitted and updated on the PKLR mutation database http://www.pklrmutationdatabase.com/.
After we had determined that the patient might be anemic due to PK deficiency, we obtained his permission to contact his hematologist and obtain his medical records. His records revealed a very recent medical work-up for the idiopathic hemolytic anemia, with the hematologist having concluded from a large panel of tests that the patient had pyruvate kinase deficiency, based on an abnormally low red blood cell pyruvate kinase enzyme activity. Further clinical evidence in favor of PK deficiency is the improved clinical outlook of the patient as a result of the splenectomy, which is frequently observed in patients with hemolytic anemia due to PK deficiency (Baronciani and Beutler, 1995). This clinical improvement was accompanied by an increased number of reticulocytes of 11.8% (presplenectomy: 3.3%), also typically seen in PK-deficiency (Mentzer et al., 1971).
Despite the clinical results, we decided to test more systematically for the effect of these mutations on pyruvate kinase enzymatic activity with the blood of 84060 and an age-matched Caucasian control, as these two mutations in PKLR had been reported 1–2 times in the literature in association with PK deficiency (Baronciani and Beutler, 1995; Fermo et al., 2005; van Wijk et al., 2009) but never in this particular combination (Table 2). Red blood cells of the patient showed lowered PK activity. In addition, the increased activities of hexokinase (HK) and glucose-6-phosphate dehydrogenase (G6PD), two other red blood cell age-dependent enzymes, were indicative of the presence of a red blood cell population of relatively young age. Therefore, since the activity of PK is strongly dependent on the age of the red blood cell, with the youngest cells showing the highest activity (Rijksen et al., 1990), PK activity should be considered to be reduced even more when taking into account the reticulocytosis in subject 84060 (11.8%).
PK is a key enzyme in glycolysis. The sustenance of the red blood cell entirely depends on this pathway for the generation of metabolic energy (ATP). Both mutations, p.G341A and p.R569Q, are located at different subunit interfaces of the tetrameric enzyme, and the lowered activity of the enzyme leads ultimately to premature removal of the red blood cell from the circulation. Residues of the subunit interface are considered to be crucial for the enzyme’s response upon binding of its allosteric activator fructose-1,6-bisphosphate (Valentini et al., 2002). Both mutations were predicted to be deleterious by several complementary nsSNV scoring algorithms, including SIFT (Ng and Henikoff, 2003), PolyPhen version 2, LRT, PhyloP, and MutationTaster, using scores obtained from the dbNSFP database (Liu et al., 2011) (Table 3). Gly341 is a highly conserved residue at the interface between two A domains of opposing subunits in the tetrameric enzyme (A/A’ subunit interface). This is a conserved region of the protein involved in binding of phosphoenolpyruvate and ADP/ATP, and divalent cation binding (Munoz and Ponce, 2003). This mutation has been described only once in a PK-deficient Caucasian patient from the USA (Baronciani and Beutler, 1995). This latter patient carried the common p.R486W variant in trans and had received multiple blood transfusions, so the level of his or her PK deficiency could never be accurately measured. Rather, the authors deduced that this mutation was likely to have a negative effect on PK activity by conducting red cell PK assays on the patient’s mother and sister, although the actual data were not reported. Arg569 is a solvent exposed residue in the C/C’ subunit interface, located 5 residues from the C-terminal end of red blood cell PK. The arginine side chain is directed away from the surface and does not make specific hydrogen bonds or charge interactions. Introduction of a glutamine side chain at this position could cause unfavorable interactions with residues of the same or opposite subunit. However, there are also several putative conformations into which glutamine can mold that are predicted to affect its neighboring residues only to a minimal extent. We and others have previously described a patient carrying the R569Q change (Fermo et al., 2005; van Wijk et al., 2009). In one study the second mutation was never detected (Fermo et al., 2005) whereas in the other study two related patients (brother and sister) inherited the p.569Q variant together with a p.L374P variant. Both of these patients were PK-deficient although clinically mildly affected (van Wijk et al., 2009).
Altogether, our results indicate that it is highly likely that the two identified changes in PKLR, in this particular combination, represent the main cause for the idiopathic hemolytic anemia as observed in this patient.
Although this research study was initially begun as a method to discover a genetic cause for ADHD, during the course of the study we determined an unrelated explanation for this patient’s idiopathic anemia. We do not specifically refer to this as an “incidental finding,” because we did search for the cause of the anemia, once we knew that the research subject had this illness as well. Rather, we call this simply an unrelated finding. As a research test, our exome sequencing did not meet the standards necessary for a clinical test as required by Clinical Laboratory Improvement Amendments (CLIA). Therefore, we did not use our research results clinically to inform the research subject of his carrier status for two autosomal recessive mutations in PKLR. Instead, after consultation with our Institutional Review Board, we conveyed the results to the patient’s hematologist, so that the clinician could determine whether further CLIA-certified genetic testing might be warranted. It is important to note that one of the mutations was found with a low frequency (~0.1%) in the 1000 Genomes Project and dbSNP(v130), so these databases should not be used simply as a filter for any recessive mutations, given that heterozygote carriers of some mutations are likely to be found in these databases. In addition, a very rare mutation might only have manifestations in the context of compound heterozygosity with another modifying mutation, such as seen here with the low frequency (0.1%) mutation in PKLR on the other allele.
We acknowledge that any hematologist could have (and did) diagnose this patient with pyruvate kinase deficiency, based on a biochemical test. Importantly however, such results should be interpreted with care (Wijk et al., 2004) and, ideally, be confirmed by DNA analysis of PKLR. It was only in the research setting that we uncovered the fact that this is the only person currently known to carry both the p.G341A and R569Q changes, thus allowing us to conclude that this combination of mutations leads to impaired PK activity in such a way that the red blood cell life span is compromised, which is supported by the biochemical enzymatic assay results. As exome (and eventually whole genome) sequencing enters the clinic, there will be much opportunity to map and document human genetic variation, as illustrated here and in another recent paper related to hemoglobinopathies (Giardine et al., 2011).
One lesson from our analysis is that sequencing just four members of a family is likely insufficient to prove causality of any particular variant in a complex disease such as ADHD, particularly when the affected members of the family are closely related, such as the case here. A further problem is that some variants in neuropsychiatric illness might only have 50% penetrance (Mitchell and Porteous, 2011). Strategies looking for de novo mutations in sporadic cases of neuropsychiatric illness might prove more successful, although proving causation in a de novo case, rather than just association, will still be quite difficult. A very recent report in 20 autism de novo trios suggested a possible association with candidate variants in four probands out of the twenty trios, along with other possible variants for additional cases, but the study did not prove that any of the variants cause the illness (either by themselves or in aggregate with other unreported variants in the genome) (O’Roak et al., 2011). Also, given the possible oligogenic nature of autism, supported in part by recent copy number variant studies (Sanders et al., 2011), sequencing only a relatively small number of autism de novo trios (or even a large number of single cases) will not be able to prove causality of any discovered alleles, in light of penetrance issues and environmental influences. Rather, to provide more compelling evidence for causality, it is critically important to show segregation of the variants with phenotype in multiple family members in large pedigrees living in the same geographic region, including from affected distantly related individuals (such as cousins), so as to increase the power of detection and to help sort out the rate of penetrance. Given the extreme heterogeneity of ADHD and other neuropsychiatric conditions, we surmise that “intersection filtering” of the exomes (or even whole genomes) of completely unrelated individuals will be less successful, unlike the situation with rare Mendelian syndromes with monogenic causes and extremely high penetrance (Ng et al., 2010a; Rope et al., 2011), particularly given ongoing debates in regards to polygenic versus oligogenic modes of inheritance for these illnesses. We therefore believe it is critically important to control for environmental and ethnic/population stratification differences by focusing on large pedigrees living in the same geographic region.
We filtered some candidate rare variants against exome sequencing databases at BGI and Baylor to confirm whether they are truly rare variants. We did this because, despite our efforts to filter candidate variants by dbSNP and the 1000 Genomes Project, a large fraction of candidate variants were found in other exomes to be relatively common and are thus unlikely to be disease causing. This analysis suggests that the current catalog of less common variants in the 1000 Genomes Project is not comprehensive enough, perhaps due to the relatively low sequencing coverage and relatively small sample size. To identify genes for complex diseases from small families, given the large number of candidate variants, the analytical strategy depends heavily on a well-annotated catalog of rare variants, preferably collected from well-phenotyped research subjects. Recent studies suggest that there are 144,000 new variants per genome, after the first 15 individuals have been sequenced (Pelak et al., 2010). Considering the total number of variants from the 1000 Genomes Project (~30 million variants), it is clear that a catalog of rare/less common variants is far from complete, so pooling variant calls from multiple genome centers to produce a catalog of rare variants (MAF<0.1% or 0.5%) will help to take advantage of whole exome or whole genome sequencing data.
There are several limitations to our study. One of them is the need for additional sequencing to achieve a higher depth of coverage in the pedigree, as we found that exome sequencing at an average coverage of 20–30× only resulted in 10 or more reads per base pair in ~67–75% of the target region, whereas the emerging consensus at the large sequencing centers appears to be that it is better to aim for 20 or more reads per base pair in >80% of the target region.. Obviously, we plan to sequence in more depth and in more pedigrees and also to look for the identified rare variants in case-control studies, but this will require more time and funding to accomplish. Also, in the rapidly moving field of DNA sequencing, we have been grappling with the decision of whether it is worth the cost to sequence more of the exomes, or just to switch to whole genomes, given that we cannot predict a priori that the causative variants for this complex disorder will reside entirely in the exons. We have not conducted a formal cost-benefit analysis of our study, so we can only state here that if the cost of sequencing continues to plummet (and if capturing costs stay largely unchanged), it is likely that whole genome sequencing will soon become competitive with exome sequencing and even with the cost of currently marketed single gene mutational assays, which can sometimes range into thousands of US dollars. Also, another limitation of exome capture in general and thus also for our study is precisely the fact that not everything is efficiently captured, necessitating much more sequencing and expense to obtain enough data on low-capture regions as we have shown herein; this is not a problem with whole genome sequencing given that there is no capture step. It therefore seems likely that whole genome sequencing will replace exome sequencing and single gene diagnostic tests for Mendelian diseases at some point in the near future, assuming that the clinical standards for this can be established.
Whether genome sequencing can deliver any clinical benefits for complex traits is still largely unknown. The nature of these complex diseases, especially neuropsychiatric diseases, makes it possible that a single major-effect gene will not explain disease phenotypes in most subjects, with this point being highlighted at least theoretically in studies of complex phenotypes in model organisms (Buckler et al., 2009; Ehrenreich et al., 2010). It is therefore still an open question how much of the inheritance of neuropsychiatric disease will be explained by single rare variants in the context of an oligogenic model. Appropriate identification of multiple candidate variants, which together might lead to disease phenotypes in a potential epistatic and/or stochastic manner (incomplete penetrance) (i.e., a polygenic model), would certainly be more challenging in comparison to our recent effort to identify a single-gene cause for a new rare Mendelian syndrome, which we have tentatively named Ogden Syndrome, in honor of where that family resides (Rope et al., 2011).
For now, our results are meant to highlight the difficulties that our group (outside of a large sequencing center) has encountered thus far with trying to use exome sequencing in one small family to figure out the genetic basis of a complex neuropsychiatric illness. We anticipate that we and others will move to whole genome sequencing as the costs continue to plummet, given that it is obvious from the ENCODE (Birney et al., 2007; ENCODE Project Consortium et al., 2011) and other projects that there are a large number of non-coding variants, i.e., regions outside the exome, which could play a role in the pathogenesis of complex diseases. Besides the potential need to perform whole-genome sequencing, novel bioinformatics methods that can help functionally interpret genetic variants are of paramount importance to truly utilize next-generation sequencing in clinical practice for complex traits. In this regard, we and our collaborators have recently introduced a probabilistic search tool (VAAST) for identifying disease-causing variants in personal genome sequences (Rope et al., 2011; Yandell et al., 2011), and we are applying this and other new tools to the analysis of this and other datasets for rare Mendelian and complex disorders.
We thank Matthew Bainbridge and Richard Gibbs at Baylor for access to unpublished exome data. We also thank David A. Nix and the University of Utah Bioinformatics Shared Resource. This work was funded by University of Utah Department of Psychiatry funds to G.J.L. and R.R., along with funds from F.R. J.X. is supported by NIH/NHGRI K99HG005846. B.M. and M.Y. were supported by NHGRI 1RC2HG005619. K.W. and H.H. were funded by a Pilot/Methodological Study Award from NIH/NCRR Grant UL1 RR025774.
Author Contributions: The first two authors (G.J.L. and T.J.) should be regarded as joint first authors. G.J.L., K.W., T.J., W.W., P.M.B., J.X., L.T., M.C., Y.L., P.Z., Y.L., B.M., Z.W., and W.E.J. analyzed the exome and SNP genotyping data. T.J. and colleagues at BGI performed all Sanger sequencing and Sequenom genotyping. J.T.G. and K.W. performed CNV analysis. J.E., F.R., R.R., and G.J.L. contributed phenotyping data. R.V.W. and W.W.v.S. performed biochemical assays of PK and contributed expertise in the PK field. G.J.L. and K.W. conceived of the project, supervised all data analysis, and wrote the manuscript. All authors provided comments on the manuscript.
The authors do not have any competing financial conflicts of interest to report.
Gholson J. Lyon, Department of Psychiatry, University of Utah, Salt Lake City, Utah 84132, USA and NYU Child Study Center, New York University, New York, New York 10016, USA.
Tao Jiang, BGI-Shenzhen, Shenzhen, 518083, China.
Richard Van Wijk, Clinical Chemistry and Haematology, University Medical Center Utrecht, Utrecht, 3583 CX, Netherlands.
Wei Wang, Department of Computer Science, New Jersey Institute of Technology, Newark, New Jersey 07102, USA.
Paul Mark Bodily, Brigham Young University, Provo, Utah, USA.
Jinchuan Xing, Eccles Institute of Human Genetics, Department of Human Genetics, University of Utah, Salt Lake City, Utah 84112, USA.
Lifeng Tian, Center for Applied Genomics, Children’s Hospital of Philadelphia, Philadelphia, Pennsylvania 19104, USA.
Reid J. Robison, Utah Foundation for Biomedical Research, Salt Lake City, Utah 84107, USA.
Mark Clement, Brigham Young University, Provo, Utah, USA.
Yang Lin, BGI-Shenzhen, Shenzhen, 518083, China.
Peng Zhang, BGI-Shenzhen, Shenzhen, 518083, China.
Ying Liu, BGI-Shenzhen, Shenzhen, 518083, China.
Barry Moore, Eccles Institute of Human Genetics, Department of Human Genetics, University of Utah, Salt Lake City, Utah 84112, USA.
Joseph T. Glessner, Center for Applied Genomics, Children’s Hospital of Philadelphia, Philadelphia, Pennsylvania 19104, USA.
Josephine Elia, Center for Applied Genomics, Children’s Hospital of Philadelphia, Philadelphia, Pennsylvania 19104, USA.
Fred Reimherr, Department of Psychiatry, University of Utah, Salt Lake City, Utah 84132, USA.
Wouter W. van Solinge, Clinical Chemistry and Haematology, University Medical Center Utrecht, Utrecht, 3583 CX, Netherlands.
Mark Yandell, Eccles Institute of Human Genetics, Department of Human Genetics, University of Utah, Salt Lake City, Utah 84112, USA.
Hakon Hakonarson, Center for Applied Genomics, Children’s Hospital of Philadelphia, Philadelphia, Pennsylvania 19104, USA and Department of Pediatrics, University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania 19104, USA.
Jun Wang, BGI-Shenzhen, Shenzhen, 518083, China.
William Evan Johnson, Brigham Young University, Provo, Utah, USA.
Zhi Wei, Department of Computer Science, New Jersey Institute of Technology, Newark, New Jersey 07102, USA.
Kai Wang, Center for Applied Genomics, Children’s Hospital of Philadelphia, Philadelphia, Pennsylvania 19104, USA and Zilkha Neurogenetic Institute, Department of Psychiatry and Preventive Medicine, University of Southern California, Los Angeles, California 90089, USA.