Next-generation sequencing technologies promise to expedite disease gene discovery, and they allowed us to identify known and novel pathogenic variants in our patients. Although costly, exome sequencing is practical because it interrogates the 1.5% of the genome that contains approximately 95% of pathogenic variants [12]–[28]. To assess the utility of exome sequencing in an active clinical setting, we selected 15 patient samples representing 7 different genetic conditions. Each disorder had previously been mapped to a chromosomal locus, but candidate gene sequencing had failed to identify the pathogenic variant. For six disorders, an average autozygous block of 4.4 Mb (0.13% of the human genome) contained only one novel homozygous variant, which rendered disease gene identification straightforward. Even in the case of infantile parkinsonism-dystonia syndrome, for which autozygosity and linkage mapping were only partially informative, a manageable list of 19 candidate variants was assembled simply by assuming mutation homogeneity. Prior mapping data and thorough knowledge of the patient implicated a single variant from this list (SLC6A3 IVS9+1G>T).
For four of the conditions, more than one affected individual was available for exome sequencing. Even in the absence of mapping data, the identification of the putative pathogenic variant would still have been unambiguous. When we examined the shared, novel homozygous variants in affected individuals, we found one, and only one, that was not homozygous in any other (unaffected) individuals in the study. Thus, the assumption of mutation homogeneity obviates the need for SNP genotyping and mapping; we reach the same conclusion by exome sequencing of multiple affected individuals without the added time and expense of SNP genotyping.
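The filtering logic described in this paragraph can be sketched in a few lines of Python. This is an illustrative sketch only, not the pipeline used in the study: the `Variant` tuple representation, the function name `candidate_variants`, and the toy sample data are all assumptions introduced for the example.

```python
from typing import Dict, Set, Tuple

# Hypothetical representation of a variant: (chromosome, position, ref, alt).
Variant = Tuple[str, int, str, str]


def candidate_variants(
    affected: Dict[str, Set[Variant]],
    others: Dict[str, Set[Variant]],
) -> Set[Variant]:
    """Novel homozygous variants shared by every affected sample
    and homozygous in no other sample."""
    if not affected:
        return set()
    # Variants that are homozygous in all affected individuals.
    shared = set.intersection(*affected.values())
    # Variants that are homozygous in at least one other individual.
    seen_elsewhere: Set[Variant] = set().union(*others.values())
    return shared - seen_elsewhere


# Toy data (invented for illustration): each set holds one sample's
# novel homozygous variants.
v1 = ("chr1", 1000, "G", "T")
v2 = ("chr2", 2000, "A", "C")
v3 = ("chr3", 3000, "C", "G")
v4 = ("chr4", 4000, "T", "A")

affected = {"patient_1": {v1, v2, v3}, "patient_2": {v1, v2}}
others = {"other_1": {v2}, "other_2": {v4}}

candidates = candidate_variants(affected, others)
print(candidates)
```

In this toy case only `v1` survives: `v3` is not shared by both affected samples, and `v2` is also homozygous in another individual.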
The number of novel homozygous variants in each individual was surprisingly small. Average inbreeding coefficients of 4% and 2.5% in the Lancaster Amish and Mennonite populations, respectively, suggested that a small but significant fraction of variation would be homozygous. On average, we found only 21 novel homozygous variants per sample across the exome. Of these variants, only 12 were predicted to be potentially pathogenic (missense, nonsense, splice site). This represents 3.7% of all novel variants per exome. For the two disorders where a singleton was sequenced, we identified only 6 potentially pathogenic novel variants that were homozygous in the patient but in no other samples. Since our total sample size was small (15 individuals), we expect that future studies that leverage accumulated exome data will allow us to sequence single individuals to identify rare, uniquely homozygous pathogenic variants. In the outbred population, a strategy that scans for homozygosity or compound heterozygosity for novel variants in the same gene should yield equally manageable candidate gene lists.
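One way to picture the per-gene scan proposed for outbred populations is sketched below. It is a simplified illustration under stated assumptions, not a method from the study: the `Call` record, the genotype labels `"hom"` and `"het"`, and the gene names are hypothetical. Two heterozygous variants in one gene are treated here as possible compound heterozygosity, although without phase information they could lie on the same chromosome.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set


class Call(NamedTuple):
    """Hypothetical record for one novel variant call in one individual."""
    gene: str
    variant_id: str
    genotype: str  # "hom" or "het" (assumed labels)


def recessive_candidate_genes(calls: Iterable[Call]) -> Set[str]:
    """Genes with a homozygous novel variant, or with two or more
    distinct heterozygous novel variants (possible compound heterozygosity)."""
    hom_genes: Set[str] = set()
    het_by_gene: Dict[str, Set[str]] = defaultdict(set)
    for call in calls:
        if call.genotype == "hom":
            hom_genes.add(call.gene)
        elif call.genotype == "het":
            het_by_gene[call.gene].add(call.variant_id)
    compound = {gene for gene, ids in het_by_gene.items() if len(ids) >= 2}
    return hom_genes | compound


# Toy calls (invented for illustration).
calls = [
    Call("GENE_A", "a1", "het"),
    Call("GENE_A", "a2", "het"),
    Call("GENE_B", "b1", "het"),
    Call("GENE_C", "c1", "hom"),
]

genes = recessive_candidate_genes(calls)
print(sorted(genes))
```

Here `GENE_A` qualifies through two distinct heterozygous variants and `GENE_C` through a homozygous one, while `GENE_B`, with a single heterozygous variant, is dropped.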
Among the fifteen individuals studied, we found 4200 different novel autosomal sequence variants, roughly 62% of which have pathogenic potential. We infer that 3.6% of these variants were non-pathogenic changes because they were homozygous in one or more unaffected individuals. As more Amish and Mennonite exomes are analyzed within a clear clinical context, our ability to determine pathogenicity will improve. The exome data also provided a broader view of the genetic disease burden within these populations. We have catalogued 94 known pathogenic sequence variants within the Plain populations that should be detectable by exome sequencing. Of these, 11 were represented in at least one individual. On average, each individual harbored 1.4 known Plain pathogenic alleles (range, 0–4). We also compared our exome results against the Human Gene Mutation Database (HGMD) [58]. Carrier status for 113 HGMD mutations that cause phenotypes not yet encountered in the Plain populations was detected in our patients. These data permit us to generate a more comprehensive molecular differential diagnosis list when faced with a new clinical phenotype. It is notable that several HGMD-DM mutations were homozygous in one or more patients, casting some doubt on the pathogenicity of these variants. Similar results have been reported elsewhere and highlight the need for better curation of mutation databases [59].
Critics will argue that we have failed to exclude all possible variants due to incomplete coverage. Our sequencing metrics show excellent, albeit incomplete, exome coverage. On average, 91.8% and 86.6% of the targeted exome was sequenced to a depth of 10× and 20×, respectively. While incomplete coverage is a potential hazard, our study design minimized this risk. Prior mapping analyses narrowed the focus to a vanishingly small 0.13% of the genome. Within these mapped regions, we discovered only one novel homozygous variant. This is significantly better coverage and stronger evidence than we and others have demonstrated for disease gene identification prior to the advent of exome sequencing. Additionally, our SNP filtering strategies might be questioned; dbSNP is polluted with many pathogenic variants, and their numbers continue to grow as data accrue. Our current conservative strategy used dbSNP 129 and the 1000 Genomes Project to filter exome variants. The risk to our analyses is relatively small since we study very rare and highly penetrant alleles. Nonetheless, in our population and elsewhere, local population-specific variant databases will prove most useful for inferring pathogenicity.
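The coverage figures quoted above are fractions of targeted bases sequenced to at least a given depth. A minimal sketch of that calculation follows, assuming per-base depths are already available as a list of integers; the function name `fraction_covered` and the toy depths are hypothetical and do not reproduce the study's data or tooling.

```python
from typing import Sequence


def fraction_covered(depths: Sequence[int], min_depth: int) -> float:
    """Fraction of targeted bases with read depth >= min_depth."""
    if not depths:
        raise ValueError("depths must contain at least one targeted base")
    return sum(depth >= min_depth for depth in depths) / len(depths)


# Toy per-base depths for ten targeted positions (invented for illustration).
depths = [0, 5, 12, 25, 30, 8, 22, 40, 15, 10]

at_10x = fraction_covered(depths, 10)
at_20x = fraction_covered(depths, 20)
print(f"{at_10x:.1%} at 10x, {at_20x:.1%} at 20x")
```

Restricting `depths` to the bases inside a mapped interval would give the coverage of that interval alone, which is the quantity that matters when arguing that a candidate region has been adequately searched.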
Pathogenicity is difficult to prove, but for four conditions we provide ample functional data to demonstrate abrogation of protein function. In vitro studies of protein localization and function in mammalian cells provide further confirmation that the homozygous variants identified were indeed pathogenic. The predicted consequence of the BRAT1 c.638_639insA frameshift variant is a protein truncated at amino acid position 401. Overexpression of this truncated protein revealed loss of nuclear localization and protein instability (87.7% decrease relative to wild-type). An alternative disease mechanism, nonsense-mediated mRNA decay, has not been investigated. Others have shown that knockdown of BRAT1 results in p53-induced apoptosis [33]. This is consistent with the neurodegeneration observed in patients with the BRAT1 variant. When transfected into mouse IMCD3 cells, the mouse counterpart of the CRADD c.382G>C variant disrupts interaction with mouse Pidd (its normal binding partner) and forms dense aggregates when co-expressed with wild-type mouse Pidd. This is in contrast to a pattern of uniform colocalization of wild-type Cradd and Pidd throughout the cytoplasm and nucleus. Mouse Snip1 protein, when transiently overexpressed in IMCD3 cells, localizes to the nucleus in a punctate pattern consistent with transcriptional complexes, while mutant mouse Snip1 (p.Glu353Gly, corresponding to human p.Glu366Gly) localizes to the nucleus, but with a more aggregated distribution. Western blotting showed this structure to be unstable. For the HARS c.1361A>C variant, we demonstrate that reaction velocity (Vmax) for aminoacylation of human tRNAHis with histidine is reduced nearly two-fold. While these studies cannot prove pathogenicity beyond a shadow of doubt, the totality of evidence is compelling and strongly suggests that we have identified the disease-causing alleles. The associations of FLVCR1 and SLC6A3 variants with disease have previously been established.
We provide no further evidence for pathogenicity of the TUBGCP6 c.5458T>G variant. However, the primary microcephalies result from disruption of the centrosomal complex during mitosis, reducing the neural progenitor pool during development [39]–[41]. Centrosomal proteins such as CDK5RAP2 and CENPJ/CPAP interact with the γ-tubulin ring complex (γ-TuRC) to regulate microtubule nucleation [42]–[44]. TUBGCP6 (GCP6) is a component of the human γ-TuRC [45], where it is required for CDK5RAP2 to activate microtubule nucleation [43]. We predict that defects in TUBGCP6, CDK5RAP2 and CENPJ cause primary microcephaly by similar mechanisms.
All seven variants described in this paper interfere with neurological development. Functional studies of BRAT1 and SNIP1 will expand our understanding of epilepsy and also deepen our knowledge of DNA damage repair and transcriptional regulation in cortical development and neuronal survival. Selective degeneration of photoreceptors and dorsal column afferents caused by FLVCR1 mutations suggests an unusual vulnerability of these cells to deranged heme transport, or may reveal an altogether different function of the FLVCR1 protein. Despite its ubiquitous expression, a defect in aminoacylation by HARS selectively damages elements of afferent sensory systems and, by unknown mechanisms, predisposes to episodic psychosis and sudden death. Abnormalities of neuroblast proliferation and migration caused by TUBGCP6 mutations fit nicely with our existing knowledge about centrosomal complexes, microtubular arrays, and cortical growth, but also introduce new questions about the diverse brain morphologies linked to various specific tubulin-associated proteins and the role of these proteins in early eye development. Finally, further studies on the connection between CRADD and general intelligence will certainly change our understanding of the microanatomical and molecular bases of cognition. Small, focused studies such as those described herein will be a steady engine of progress for understanding the specific connections between genes and the human brain.
Ultimately, genomics can only shape medical practice within the context of regional particulars and clinical facts. Our local, patient- and family-based approach to gene discovery stands in stark contrast to the prevailing model of genomic research, where the people who produce genotype data are frequently separated from those who collect and analyze clinical facts, and both struggle to translate genetic knowledge into primary care. Although we are focused on specific regional populations [1], [3], these studies reveal concepts of broad biological and economic relevance [8], [11]. The discovery of rare, highly penetrant alleles among small social groups may prove more useful than large genome-wide association studies for revealing the basic genetic foundations of complex disease, particularly when these alleles can be viewed against a background of population-specific genetic variation. Even at current prices, microarray analyses to detect copy number abnormalities coupled with exome sequencing are an order of magnitude cheaper than the standard workup for a complex patient at a tertiary medical center. Thoughtful and appropriately scaled application of these genetic technologies to other regional populations should yield similar economic and clinical benefits in the years ahead.