Kabuki syndrome is a rare, multiple malformation disorder characterized by a distinctive facial appearance (Supplementary Fig. 1), cardiac anomalies, skeletal abnormalities, immunological defects, and mild to moderate mental retardation. Originally described by Niikawa et al.5 and Kuroki et al.6 in 1981, Kabuki syndrome has an estimated incidence of 1 in 32,000 (ref. 7), and about 400 cases have been reported worldwide. The vast majority of reported cases have been sporadic, but parent-to-child transmission in more than half a dozen instances8 suggests that Kabuki syndrome is an autosomal dominant disorder. The relatively low number of cases, the lack of multiplex families, and the phenotypic variability of Kabuki syndrome have made the identification of the underlying gene(s) intractable to conventional approaches of gene discovery, despite aggressive efforts.
We sequenced the exomes of ten unrelated individuals with Kabuki syndrome: seven of European ancestry, two of Hispanic ancestry, and one of mixed European/Haitian ancestry (Supplementary Fig. 1, Supplementary Table 1). Enrichment was performed by hybridization of shotgun fragment libraries to custom microarrays, followed by massively parallel sequencing1–3. On average, 6.3 gigabases of sequence were generated per sample to achieve 40× coverage of the mappable, targeted exome (31 megabases). As in our previous studies, our analyses focused primarily on nonsynonymous (NS) variants, splice acceptor and donor site mutations (SS), and coding indels (I), anticipating that synonymous variants were far less likely to be pathogenic. We also predicted that variants underlying Kabuki syndrome are rare, and therefore likely to be novel. Novelty was defined here by absence from all datasets used for comparison, including dbSNP129, the 1000 Genomes Project, exome data from sixteen individuals previously reported by us2,3, and ten exomes sequenced as part of the Environmental Genome Project (EGP).
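The filtering logic described above can be illustrated with a minimal sketch. It assumes each variant is keyed by chromosome, position, reference allele, and alternate allele, and that each comparison dataset has been loaded as a set of such keys; the `Variant` type, the effect labels, and the function names are hypothetical and are not taken from the pipeline used in this study.

```python
from typing import Iterable, List, NamedTuple, Set, Tuple

Key = Tuple[str, int, str, str]  # (chromosome, position, ref allele, alt allele)


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str
    gene: str
    effect: str  # hypothetical labels, e.g. "nonsense", "missense", "synonymous"


# Functional classes retained for analysis (NS/SS/I in the text); labels are assumptions.
RETAINED_EFFECTS = {"nonsense", "missense", "splice", "frameshift", "inframe_indel"}


def variant_key(v: Variant) -> Key:
    """Key used to look a variant up in the comparison datasets."""
    return (v.chrom, v.pos, v.ref, v.alt)


def novel_candidates(variants: Iterable[Variant],
                     comparison_sets: Iterable[Set[Key]]) -> List[Variant]:
    """Keep NS/SS/I variants that are absent from every comparison dataset."""
    known: Set[Key] = set().union(*comparison_sets)
    return [v for v in variants
            if v.effect in RETAINED_EFFECTS and variant_key(v) not in known]
```

Pooling the comparison datasets into one set reflects the definition of novelty as absence from all of them: a variant seen in any single dataset is excluded.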
Under a dominant model in which each case was required to have at least one novel NS/SS/I variant in the same gene, only a single candidate gene (MUC16) was shared by all ten exomes (Table 1, row 4; Supplementary Table 2). However, MUC16 was considered likely to be a false positive due to its extremely large size (14,507 aa), since a gene of this size is more likely to carry novel variants by chance. Potential explanations for our failure to find a compelling candidate gene in which novel variants are observed in all affected individuals included: (a) that Kabuki syndrome is genetically heterogeneous, and therefore not all affected individuals will have mutations in the same gene; (b) that we failed to identify all mutations in the targeted exome; and (c) that some or all causative mutations were outside of the targeted exome, e.g., in non-coding regions or unannotated genes. To allow for a modest degree of genetic heterogeneity and/or missing data, we conducted a less stringent analysis by looking for candidate genes shared among subsets of affected individuals. Specifically, we searched for subsets of x out of 10 exomes having ≥ 1 novel variant in the same gene, for x = 1 to 10. For x = 9, 8, and 7, novel variants were shared in three genes, six genes, and sixteen genes, respectively (Table 1, row 4). However, there was no obvious way to rank these candidates.
| Table 1. Number of genes common to any subset of x affected individuals |
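The subset analysis can be sketched as a counting problem. The sketch assumes that "common to any subset of x affected individuals" means a gene carries at least one novel variant in at least x cases; the input mapping and function names are illustrative and not part of the original analysis.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def genes_shared_by_at_least(novel_genes_per_case: Dict[str, Iterable[str]],
                             x: int) -> Set[str]:
    """Genes with >= 1 novel variant in at least x of the cases."""
    # Deduplicate per case so a gene with several variants in one case counts once.
    counts = Counter(gene
                     for genes in novel_genes_per_case.values()
                     for gene in set(genes))
    return {gene for gene, n in counts.items() if n >= x}


def subset_table(novel_genes_per_case: Dict[str, Iterable[str]]) -> Dict[int, int]:
    """Number of candidate genes for each subset size x = 1..N."""
    n_cases = len(novel_genes_per_case)
    return {x: len(genes_shared_by_at_least(novel_genes_per_case, x))
            for x in range(1, n_cases + 1)}
```

By construction the candidate count can only grow as x decreases, which is why relaxing the requirement from all ten cases to a subset enlarges the candidate list.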
We speculated that genotypic and/or phenotypic stratification would facilitate the prioritization of candidate genes identified by subset analysis. Specifically, we assigned a categorical rank to each Kabuki case based on a subjective assessment of the presence of, or similarity to, the canonical facial characteristics of Kabuki syndrome (Supplementary Fig. 1) and the presence of developmental delay and/or major birth defects (Supplementary Table 1). The highest ranked case was one of a pair of monozygotic twins with Kabuki syndrome. We then categorized the functional impact (i.e., nonsense versus nonsynonymous substitution, splice-site disruption, frameshift versus in-frame indel) of each novel variant in candidate genes shared by each subset of two or more ranked cases. Manual review of these data highlighted distinct, novel nonsense variants in MLL2 in each of the four highest ranked cases. On sequential analysis of phenotype-ranked cases with a loss-of-function filter, MLL2 is the only candidate gene remaining after addition of the second individual (Table 2, row 5, column "+2"). No novel variant in MLL2 was found in the Kabuki case ranked 5th, such that the number of candidate genes drops to zero after the fourth individual (Table 2, row 5). However, a 4-bp deletion was found in the case ranked 6th and nonsense variants in the cases ranked 7th and 9th. Thus, exome sequencing identified a nonsense substitution or frameshift indel in MLL2 in seven of the ten Kabuki cases.
| Table 2. Number of genes common in sequential analysis of phenotypically ranked individuals |
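The sequential analysis of ranked cases amounts to a running intersection of per-case candidate gene sets. The sketch below assumes that each case has already been reduced to the set of genes in which it carries a novel loss-of-function variant; the function name and data layout are illustrative, and the placeholder gene names in the usage example are invented.

```python
from typing import Dict, Iterable, List, Set, Tuple


def sequential_candidates(ranked_cases: List[str],
                          lof_genes_per_case: Dict[str, Iterable[str]]
                          ) -> List[Tuple[str, Set[str]]]:
    """Intersect candidate genes case by case, from highest to lowest rank.

    Returns, for each case added, the genes still shared by every case so far.
    """
    remaining = None
    trace = []
    for case in ranked_cases:
        genes = set(lof_genes_per_case.get(case, ()))
        remaining = genes if remaining is None else remaining & genes
        trace.append((case, set(remaining)))
    return trace


# Toy input: only the MLL2 pattern follows the text; other gene names are placeholders.
toy = {
    "rank1": {"MLL2", "GENE_A", "GENE_B"},
    "rank2": {"MLL2", "GENE_C"},
    "rank3": {"MLL2"},
    "rank4": {"MLL2", "GENE_A"},
    "rank5": {"GENE_D"},
}
counts = [len(genes) for _, genes in
          sequential_candidates(["rank1", "rank2", "rank3", "rank4", "rank5"], toy)]
```

In this toy input the candidate count falls to one when the second case is added and to zero when the fifth is added, which illustrates how a single case without a detected variant empties a strict intersection.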
Retrospectively, if we apply a loss-of-function filter to the subset analysis of exome data (Table 1, row 5), at x = 7, MLL2 is the only candidate gene. We also developed a post hoc ranking of candidate genes based on the functional impact of the variants present (“variant score”) and the rank of the cases in which each variant was observed (“case score”). When applied to the exome data as a combined metric, MLL2 emerges as the top candidate (Supplementary Fig. 2).
In parallel with these analyses, we applied genomic evolutionary rate profiling (GERP)9 to exome data. GERP uses mammalian genome alignments to define a rejected substitution (RS) score for each variant, regardless of functional class. We have previously shown that the quantitative ranking of candidate genes by the RS scores of their novel variants can facilitate the exome-based analysis of Mendelian disorders10. In subset analysis with GERP-based ranking, MLL2 remains on the candidate list up to x = 8, ranking 3rd in a list of 11 candidate genes at this threshold (Table 3, Supplementary Fig. 3). Interestingly, the additional MLL2 variant contributing to this analysis (such that MLL2 is still considered at x = 8) is a synonymous substitution with an RS score of 0.368 in the 5th ranked case.
| Table 3. Analysis of exome variants using genomic evolutionary rate profiling |
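Ranking candidate genes by the RS scores of their novel variants can be sketched as follows. The text does not state how per-variant scores are combined into a per-gene value, so the use of the mean here is purely an assumption made for illustration, as are the function name and input layout.

```python
from statistics import mean
from typing import Dict, List


def rank_genes_by_rs(rs_scores_by_gene: Dict[str, List[float]]) -> List[str]:
    """Order genes by the mean RS score of their novel variants, highest first.

    The mean is an assumed aggregate; genes with no scored variants are skipped.
    """
    gene_score = {gene: mean(scores)
                  for gene, scores in rs_scores_by_gene.items() if scores}
    return sorted(gene_score, key=lambda gene: gene_score[gene], reverse=True)
```

Because RS scores are defined for every variant regardless of functional class, this kind of ranking can include synonymous variants that a functional-class filter would discard.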
We sought to confirm all novel variants identified in MLL2, particularly because loss-of-function variants identified through massively parallel sequencing have a higher prior probability of being false positives. All seven loss-of-function variants in MLL2 were validated by Sanger sequencing. We further analyzed the three cases in which we did not initially find a loss-of-function variant in MLL2, first by array comparative genomic hybridization (aCGH) to detect any gross structural changes, and then by Sanger sequencing of all exons of MLL2 to guard against false negatives from exome sequencing. Since an average of 96% of coding bases in MLL2 were called at sufficient quality and coverage for single-nucleotide variant detection, we anticipated that any missed variants were more likely to be indels, because confident indel detection in short-read sequence data requires higher coverage. Indeed, although aCGH did not find any structural variants in the region, Sanger sequencing did identify frameshift indels in two of these three cases (ranked 8th and 10th).
Ultimately, loss-of-function mutations in MLL2 were identified in nine out of ten cases in the discovery cohort (), making it a compelling candidate for Kabuki syndrome. For validation, we screened all 54 exons of MLL2 in 43 additional cases by Sanger sequencing. Novel nonsynonymous, nonsense or frameshift mutations in MLL2 were found in 26 of these 43 cases ( and Supplementary Table 3). In total, through either exome sequencing or targeted sequencing of MLL2, 33 distinct MLL2 mutations were identified in 35 of 53 families (66%) with Kabuki syndrome ( and Supplementary Table 3). In each of twelve cases for which DNA from both parents was available, the MLL2 variant was found to have occurred de novo. Three mutations were found in two cases each; one of these was confirmed to have arisen de novo in one of the cases, indicating that some mutations are recurrent. Novel MLL2 mutations (K4527X and T5464M) were also identified in each of two families in which Kabuki syndrome was transmitted from parent to child. None of the additional MLL2 mutations were found in 190 control chromosomes from individuals of matched geographical ancestry.
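The cohort tallies reported in the text can be checked with a few lines of arithmetic. The counts below are taken directly from the text; the variable and function names are illustrative.

```python
def percent(part: int, whole: int) -> int:
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


# (cases with an MLL2 mutation, cases screened), as reported in the text.
discovery = (9, 10)
validation = (26, 43)

# Combine the two cohorts.
combined = (discovery[0] + validation[0], discovery[1] + validation[1])

# Cases in which no MLL2 mutation was found.
unresolved = combined[1] - combined[0]

summary = {
    "discovery_pct": percent(*discovery),
    "validation_pct": percent(*validation),
    "combined_pct": percent(*combined),
    "unresolved": unresolved,
}
```

The combined figure of 35 of 53 families follows from adding the nine discovery cases to the 26 validation cases.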
Our results strongly suggest that mutations in MLL2 are a major cause of Kabuki syndrome. MLL2 encodes a large 5,262-residue protein that is part of the SET family of proteins, of which Trithorax, the Drosophila homologue of MLL, is the best characterized11. The SET domain of MLL2 confers strong histone 3 lysine 4 methyltransferase activity and is important in the epigenetic control of active chromatin states12. Murine loss of Mll2 on a mixed 129Sv/C57BL/6 background slows growth, increases apoptosis and retards development, leading to early embryonic lethality, due in part to mis-regulation of homeobox gene expression13. However, no morphological defects have been reported in Mll2+/− mice13.
Most of the MLL2 variants identified in Kabuki cases are predicted to truncate the polypeptide chain before translation of the SET domain. Accordingly, though it is not certain whether Kabuki syndrome results from haploinsufficiency or a gain of function at MLL2, haploinsufficiency seems to be the more likely mechanism. Deletion of chromosome 12q12-q13.2, which encompasses MLL2, has been reported in a child with characteristics of Noonan syndrome14. However, we re-analyzed this case using oligo aCGH (including 21 probes that cover MLL2) and found the distal breakpoint to be located ~700 kb proximal to MLL2 (data not shown). Interestingly, all of the pathogenic missense variants identified herein are located in regions of MLL2 that encode C-terminal domains. This suggests that missense variants elsewhere in MLL2 could be better tolerated or, alternatively, are embryonic lethal.
For the 18 of 53 cases for which no novel protein-altering variant was found, it is possible that non-coding or other missed mutations in MLL2 are responsible instead. Alternatively, Kabuki syndrome could be genetically heterogeneous, and further analysis of these cases by exome sequencing may elucidate additional genes for Kabuki syndrome and potentially explain some of the phenotypic heterogeneity seen in this disease. Notably, 9 of 10 individuals in the discovery cohort (90%), but only 26 of 43 individuals in the replication cohort (60%), were ultimately found to have mutations in MLL2. It is therefore possible that the careful selection of canonical Kabuki cases for the discovery cohort enriched for a shared genetic basis. This underscores the importance of access to deeply phenotyped and well-characterized cases.
In summary, we applied exome sequencing of a small number of unrelated cases to discover that mutations in MLL2 underlie Kabuki syndrome. As predicted in previous analyses2,3, allowing for even a small degree of genetic heterogeneity or missing data significantly confounds exome analysis by increasing the number of candidate genes consistent with the model of inheritance. To facilitate the prioritization of genes under such criteria, we stratified data by ranked phenotypes and found that MLL2 was prominent in the higher ranked cases. However, nine of the ten Kabuki cases in the discovery cohort were ultimately found to have MLL2 mutations, such that stratification by phenotype was of less importance than it originally appeared. Nonetheless, the sequential analysis of ranked cases may have reduced the probability of confounding due to genetic heterogeneity. All of the MLL2 mutations found in the discovery set via exome sequencing were loss-of-function variants. As a result, MLL2 ranked highly among candidates assessed by predicted functional impact. Such a pattern will likely occur for some, but not all, Mendelian phenotypes subjected to this approach. We anticipate that the further development of strategies to stratify data at both the genotypic and phenotypic level will be critical for exome and whole genome sequencing to reach their full potential as tools for discovery of genes underlying Mendelian and complex diseases.