To demonstrate the applicability of the proposed analytic framework, we provisionally binned 2016 genes implicated in Mendelian disorders, implemented a computational analytic pipeline, and explored the output from 80 whole genome sequences. In this first attempt at binning the genome (Supplemental Table S1), 161 genes were assigned to Bin 1, 1798 genes were assigned to Bin 2b, and 57 genes were assigned to Bin 2c. We emphasize that the binning of genes used in this study is provisional and used for illustrative purposes; the final population of bins will change over time and the choices made by our group and others may well differ.
We then explored parameters (AF cut-offs and effect of the mutation) used to select variants for further manual review (). The total number of variants selected () is decreased 10–20 fold using AF filters of <5% or <1% (). Selecting for protein-altering variants (missense, nonsense, frameshift, and splice site) further decreases this number (). However, the resulting numbers are still incompatible with the small chance of an individual having a Mendelian disorder; thus, the vast majority of variants with <5% AF must have minimal clinical consequences. When selecting only predicted truncating (nonsense, frameshift, and splice site) variants, the number identified per patient is more consistent with realistic expectations ().
Selection of variants based on allele frequency and predicted effect on the translated protein
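The effect of these selection parameters can be illustrated with a minimal sketch. The `Variant` record, the effect labels, and the toy variant list below are hypothetical and are not taken from the study's pipeline; the sketch only shows how an AF cut-off and an effect class jointly narrow the set of variants sent to manual review. AF is expressed as a fraction, so 0.05 corresponds to 5%.

```python
from dataclasses import dataclass

# Effect classes named in the text; the string labels are assumptions.
PROTEIN_ALTERING = {"missense", "nonsense", "frameshift", "splice_site"}
TRUNCATING = {"nonsense", "frameshift", "splice_site"}


@dataclass(frozen=True)
class Variant:
    gene: str
    effect: str
    allele_frequency: float  # fraction, e.g. 0.05 == 5%


def count_selected(variants, max_af=None, effects=None):
    """Count variants strictly below max_af whose effect is in effects.

    Passing None for either argument disables that filter.
    """
    selected = 0
    for v in variants:
        if max_af is not None and v.allele_frequency >= max_af:
            continue
        if effects is not None and v.effect not in effects:
            continue
        selected += 1
    return selected


# Toy data for illustration only (not from the 80 genomes).
toy_variants = [
    Variant("GENE_A", "synonymous", 0.30),
    Variant("GENE_A", "missense", 0.02),
    Variant("GENE_B", "missense", 0.004),
    Variant("GENE_C", "nonsense", 0.001),
    Variant("GENE_D", "intronic", 0.03),
]

all_variants = count_selected(toy_variants)
below_5pct = count_selected(toy_variants, max_af=0.05)
below_1pct = count_selected(toy_variants, max_af=0.01)
altering_below_5pct = count_selected(toy_variants, 0.05, PROTEIN_ALTERING)
truncating_below_5pct = count_selected(toy_variants, 0.05, TRUNCATING)
```

Each successive filter can only shrink the count, which mirrors the stepwise reduction described above.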
Clearly, the sensitivity of the algorithm is decreased by the exclusion of rare missense mutations. To address this issue we queried a local instance of HGMD for variants in these genes annotated as “DM” and identified 871 unique variants (771 missense) among the 80 whole genome sequences. On average there were 74 (range 61–106) “DM” variants per person (), which is strikingly similar to the report of the 1000 Genomes Project Consortium that individuals are heterozygous for 50–100 variants classified as disease causing in HGMD.9
Nevertheless, this large number of putatively disease-causing mutations is surprising, given the very low probability of a Mendelian disorder truly being present in any of the subjects.
Analysis of mutations annotated as “DM” in HGMD
Since 88% of the unique “DM” variants were missense substitutions, we hypothesized that these variants could comprise a subset of the ~150 missense variants per person identified in Bins 1, 2b, and 2c with <5% AF (). Surprisingly, there was minimal overlap between the less common missense variants and “DM” variants detected in the 80 genomes (), and upon further review, 251 of the 871 unique “DM” variants (29%) had >5% AF. As a result, 78% of “DM” variants per person were >5% AF (). This finding is similar to a previous report that 74% of HGMD variants identified in 448 genes implicated in severe recessive diseases of childhood were variants with >5% AF.11
Although some of these variants could represent recessive alleles that are relatively frequent in certain populations, this explanation cannot account for the vast majority of these variants.
To further assess the pervasiveness of misleading database errors, we queried the 1000 Genomes Project allele frequencies and found allele frequencies for 1811 out of 74,694 “DM” variants (mostly substitution variants). Of these, 1152 had <1% AF, 299 had 1–3% AF, 95 had 3–5% AF, and 265 (~0.35% of all “DM” variants) had >5% AF (). The small subset of variants with >5% AF comprised the majority of “DM” variants identified in a given genome sequence, simply because of the prevalence of these variants in the general population; in subsequent analyses we restricted HGMD variants to those with <5% AF.
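The frequency breakdown above can be checked with a short calculation. The counts are those reported in the text; the function name `af_bin` and the treatment of a variant at exactly 1%, 3%, or 5% AF (assigned to the higher bin) are assumptions made for illustration, since the text does not state how boundary values were handled.

```python
def af_bin(af):
    """Assign an allele frequency (as a fraction) to one of four bins.

    Boundary handling is an assumption: values exactly at a cut-off
    fall into the higher bin.
    """
    if af < 0.01:
        return "<1%"
    if af < 0.03:
        return "1-3%"
    if af < 0.05:
        return "3-5%"
    return ">5%"


# Counts of "DM" variants per AF bin, as reported in the text.
dm_counts = {"<1%": 1152, "1-3%": 299, "3-5%": 95, ">5%": 265}
total_dm_variants = 74694

with_af = sum(dm_counts.values())
pct_common_of_all = 100 * dm_counts[">5%"] / total_dm_variants
pct_common_of_with_af = 100 * dm_counts[">5%"] / with_af

print(f"'DM' variants with AF data: {with_af}")
print(f">5% AF as share of all 'DM' variants: {pct_common_of_all:.2f}%")
print(f">5% AF as share of those with AF data: {pct_common_of_with_af:.1f}%")
```

The bins sum to the 1811 variants with available frequencies, and the 265 variants with >5% AF correspond to roughly 0.35% of all “DM” variants.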
The final algorithm selected variants according to the following criteria: 1) presence in a binned gene, 2) <5% AF, and either 3) annotation as a disease-causing mutation (“DM”) in HGMD or 4) predicted to be truncating. Variants were further analyzed for zygosity to assign single heterozygous variants in recessive genes to a separate category for carrier status (Bin R). When we applied this algorithm to the 80 genomes, a total of 1391 variants (906 unique variants) were selected. The per-person averages were 1.5 variants in Bin 1 genes, 6.4 variants in Bin 2b genes, 0.2 variants in Bin 2c genes, and 9.2 variants considered to imply carrier status for recessive disorders ( and Supplemental Table S2).
Numbers of variants selected by the informatics algorithm
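The selection criteria and the carrier-status rule can be sketched as follows. This is an illustrative reading of the criteria as stated, not the pipeline used in the study: the `Call` record, the `gene_bins` and `recessive_genes` lookups, the effect labels, and the toy calls are all hypothetical, and the rule "exactly one selected heterozygous call in a recessive gene goes to Bin R" is an assumed interpretation of "single heterozygous variants in recessive genes".

```python
from collections import defaultdict
from dataclasses import dataclass

TRUNCATING = {"nonsense", "frameshift", "splice_site"}


@dataclass(frozen=True)
class Call:
    gene: str
    effect: str
    allele_frequency: float  # fraction, 0.05 == 5%
    hgmd_dm: bool            # annotated "DM" in HGMD
    homozygous: bool


def passes_filters(call, gene_bins, max_af=0.05):
    """Criteria 1 and 2, and either 3 or 4."""
    return (
        call.gene in gene_bins
        and call.allele_frequency < max_af
        and (call.hgmd_dm or call.effect in TRUNCATING)
    )


def categorize(calls, gene_bins, recessive_genes, max_af=0.05):
    """Group selected calls by bin, moving lone heterozygous calls in
    recessive genes to the carrier category "R"."""
    per_gene = defaultdict(list)
    for call in calls:
        if passes_filters(call, gene_bins, max_af):
            per_gene[call.gene].append(call)

    categories = defaultdict(list)
    for gene, gene_calls in per_gene.items():
        lone_het = len(gene_calls) == 1 and not gene_calls[0].homozygous
        if gene in recessive_genes and lone_het:
            categories["R"].extend(gene_calls)
        else:
            categories[gene_bins[gene]].extend(gene_calls)
    return dict(categories)


# Toy inputs for illustration only.
gene_bins = {"GENE_A": "1", "GENE_B": "2b", "GENE_C": "2b"}
recessive_genes = {"GENE_B", "GENE_C"}
calls = [
    Call("GENE_A", "nonsense", 0.001, False, False),    # Bin 1
    Call("GENE_B", "missense", 0.002, True, False),     # lone het: Bin R
    Call("GENE_C", "missense", 0.003, True, False),     # two calls in gene
    Call("GENE_C", "frameshift", 0.0005, False, False), # stay in Bin 2b
    Call("GENE_A", "missense", 0.20, True, False),      # excluded: AF
    Call("GENE_X", "nonsense", 0.001, False, False),    # excluded: not binned
    Call("GENE_B", "missense", 0.01, False, False),     # excluded: not DM
]

result = categorize(calls, gene_bins, recessive_genes)
counts = {category: len(v) for category, v in result.items()}
```

In this sketch, two heterozygous calls in the same recessive gene remain in the gene's bin as a possible compound heterozygote; whether they are in cis or in trans cannot be decided by the filter and is left to the manual review.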
The variants selected by the algorithm were then manually reviewed using a combination of OMIM, PubMed, Google Scholar, UCSC genome browser, and locus-specific databases to assess the evidence for pathogenicity or to reclassify the variants selected from the 80 genomes. Variants were reclassified if two variants identified in an individual likely comprised a single complex substitution allele or comprised a single common haplotype. In many cases, variants annotated as “DM” in HGMD were reclassified as VUS or likely polymorphisms. In other cases, the type of variant or its location within a specific transcript was inconsistent with a pathogenic effect. Zygosity was reassessed when it was determined that two variants were likely to be in cis or that only one of the selected variants in a gene was likely to be pathogenic; in these cases, the remaining heterozygous variant was reassigned to Bin R. shows examples of binned variants, reclassified variants, and variants removed from consideration. Several detailed examples are described in the Supplemental Materials. A list of binned variants from the 61 publicly available genomes is available in Supplemental Table S3.
Examples of selected variants, reclassified variants, and variants removed from consideration after human review
After review, 705 variants were removed from consideration and 71 were reassigned to carrier status. Differing percentages of variants were reclassified or removed from consideration in each bin category () and lower proportions of novel variants were removed () compared to HGMD “DM” variants (). In all, 279 of the 358 unique variants removed from consideration were HGMD “DM” variants. After the final analysis, the revised per-person averages were 0.3 variants in Bin 1 genes, 2.6 variants in Bin 2b genes, 0.06 variants in Bin 2c genes, and 5.5 variants considered to imply carrier status ( and Supplemental Table S2).
Results of the manual review of variants selected by the informatics algorithm