Here we predict human haploinsufficient genes by integrating diverse genomic, evolutionary and function properties that we show are characteristic of haploinsufficiency. We further validate those predictions with independent experimental and clinical data. The framework of this study is outlined in . While various, but not necessarily mutually exclusive, physiological mechanisms have been proposed to underlie haploinsufficiency, including dosage threshold effects and altered stoichiometry of a macromolecular complex 
, our approach does not assume that one or other of these mechanisms predominates.
Outline of the prediction framework.
We first compiled a list of known human HI genes and a catalog of HS genes. Known HI genes were collated from literature 
. The catalog of HS genes was generated from genes disrupted in a loss-of-function manner in control individuals used in genome-wide association studies by CNVs detected in data from the Affymetrix 6.0 chip (see Methods
and Figure S1
). We identified 2,676 putative HS genes seen in any control individuals and 1,079 seen in two or more controls (Table S1
and Figure S2
), and used the latter set in most downstream analyses Thus the final list of HI and HS genes contains 301 and 1,079 genes respectively.
To systematically assess the difference in properties between HI and HS genes, we gathered a large number of annotations describing the evolutionary, functionary and interaction properties of genes (see Methods
) and examined the distribution of each individual property in HI and HS genes. We found that HI genes have consistently more conserved coding sequence (human-macaque dN/dS, p
3.12e-26), less mutable promoter (p<1e-30), paralogs with lower sequence similarity (p
1.84e-9), longer spliced transcript (p<1e-30), longer 3′UTR (p
2.63e-12), higher expression during early development (p
1.10e-15), higher tissue specificity in expression (p
2.29e-6), more interaction partners in both a protein-protein interaction network (p<1e-30) and a gene interaction network (p<1e-30) and higher chances of interacting with other known HI genes (p<1e-30) and cancer genes (p<1e-30) (). Some biological insights can be gained from these comparisons. For instance, the average sequence identity to the closest paralog of HS genes is significantly higher than that of HI genes, suggesting functional compensation between recently duplicated genes that shield each other from reduction in dosage. Interestingly, growth rate of yeast heterozygous deletion strains does not seem to differ between their HI human homologs and HS human homologs, probably reflecting the vast functional differences between the majority of yeast and human genes, except those involved in highly conserved cellular processes.
Properties that distinguish HI genes from HS genes.
The highly significant differences in genomic, evolutionary, functional and network properties between HI and HS genes suggest they may be predictive of haploinsufficiency. However, since each annotation is only available for a fraction of genes in the genome (Table S2
), there is a trade-off between the increase in prediction performance by considering more properties and the decrease in the genomic coverage of genes we could predict. Therefore, we aimed to select a small number of most predictive properties that are relatively ‘orthogonal’ in the kind of information they provide (Methods
and Table S3
). After evaluating many different possible combinations of predictor variables (Figure S3
) we selected a model comprising of dN/dS between human and macaque, promoter conservation, embryonic expression and network proximity to known HI genes. The training data with all these information available consisted of 234 HI genes and 326 HS genes.
We are interested in the relative contributions of each selected property in the model. We used linear discriminant analysis (LDA) as the supervised classifier, which, given multi-dimensional data and class labels, finds the linear combination of the given dimensions (linear discriminant) that maximizes the inter-class variance. We scaled each property to the same variance before entering the model so that their contribution can be measured by the coefficients of the resulting linear discriminant (). We found that proximity to known HI genes in the probabilistic gene network is the single most heavily weighted predictor of haploinsufficiency.
Assessment of model performance.
To evaluate the performance of the model, we adopted a 10 fold cross-validation strategy and calculated the area under the ROC curve (AUC) () and Matthew correlation coefficient (MCC) (Methods
) as measurements of prediction performance. The model achieved an AUC of 0.81 and a MCC of 0.50. We showed that this prediction accuracy is not appreciably affected by the calling threshold used to define the CNVs that underpin the HS gene catalog, nor by the frequency threshold in controls used to define HS genes (Figure S6
). We also showed that prediction is not appreciably improved by using a Support Vector Machine classifier, which is more computationally intensive (Figure S7
). We demonstrated that combining the predictor variables together generated a more predictive model than considering any of the individual predictor variables in isolation (max AUC
0.78 for network proximity to known HI genes, Figure S4
). We also confirmed our hypothesis that contrasting known HI and inferred HS genes should be more predictive of HI than simply contrasting known HI genes to the rest of the genome (max AUC
0.75, Figure S5
We then used the model to estimate a probability of being HI ( p(HI) ) for all protein-coding genes in the genome for which all four selected properties were available (12,443 genes, over half of the total protein-coding genes) (Table S5
, Dataset S1
), the distribution of which is clearly bimodal, with a large peak near 0.2 and a much smaller peak at 1 (, left). The distributions of p(HI) for the HI and HS training sets are clearly differentiated (, right). It is not possible to assess how well-calibrated these probabilities are, as the fraction of human genes that exhibit HI is not known. We therefore sought to validate these predictions using indirect approaches that examined the distribution of p(HI) in independent gene sets enriched for HI. As there is no credible estimation of the number of human HI genes, in some of the following validation analyses we arbitrarily labeled the genes in the top 10% of p(HI) as being predicted HI genes. However, the results were robust against this threshold being varied by at least a factor of at least 2 (data not shown).
Predicted probability of being haploinsufficient across the genome.
First, we asked if genes implicated in human dominant diseases were enriched in our predicted HI genes relative to recessive-disease-causing genes. We retrieved 571 and 772 genes implicated in dominant and recessive disease from the OMIM and hOMIM database 
, respectively, with no information regarding haploinsufficiency (and thus not included in our training data), and compared the distribution of predicted p(HI) against each other. The HI status could be predicted for 392 dominant genes and 606 recessive genes, of which 87 and 39 were predicted as being HI, respectively. This 4.14 fold enrichment is highly significant (p
4.46e-13, Fisher's exact test). Simply comparing the distribution of p(HI) values for these dominant and recessive genes also shows a highly significant shift towards high p(HI) values in dominant relative to recessive genes (p
4.44e-16, Mann-Whitney U test; ). Second, we asked if heterozygous knockout of the orthologs of predicted human HI genes are more likely to cause severe phenotypic abnormalities in mice. For this purpose, we extracted a list of 1,523 mouse genes whose heterozygous knockout cause various abnormal phenotypes from MGI database 
, mapped them onto orthologous genes in humans, removed orthologs to genes in our training genesets and extracted the predicted p(HI) for the remainder. HI status could be predicted for the orthologs of 1,063 of these genes and 260 (24.5%) of them were predicted HI, indicating a 2.45 fold enrichment (p<1e-30, Fisher's exact test ) (). If focusing on those genes of which the heterozygous LOF phenotypes involve prenatal lethality (MP:0002080), the fold of enrichment increased to 4.38 (p
3.60e-12, Fisher's exact test) (28 predicted as HI out of 64 that could be predicted).
Enrichment of predicted HI genes in dominant genes relative to recessive genes.
Enrichment of predicted HI genes in orthologs of mouse haploinsufficient genes and mouse haplolethal genes.
Using haploinsufficient gene predictions to assess pathogenicity of deletions
We investigated how our gene-based predictions of haploinsufficiency might be used to discriminate between benign and pathogenic genic deletions. We considered that a natural way to score the probability of a deletion causing a haploinsufficiency phenotype is to generate a LOD (log-odds) score comparing the probability that none of the genes covered by the deletion will cause haploinsufficiency with the probability that at least one of the genes will cause haploinsufficiency, as shown schematically in . This LOD score is calculated using the formula below:
Higher LOD scores indicate deletions that are more likely to be pathogenic as a result of haploinsufficiency. Note that this score assumes that there is no statistical interaction between the genes.
Calculation of deletion-based LOD scores and the distribution of LOD score of control individuals and pathogenic de novo deletions.
We then considered how these deletion-based haploinsufficiency scores might be used to assess whether a genic deletion observed in a patient might cause their disease. One way of framing probabilistically this intuitively simple question is to estimate the opposing probability, that the deletion is unrelated to the patient's disease status. We can relate this to the probability of drawing an individual at random from a healthy control population with a deletion at least as pathogenic as the deletion in the patient. We can estimate this probability empirically as the proportion of healthy controls with a genic deletion having the same or greater haploinsufficiency score.
To test this approach, and to avoid circular reasoning, we generated a set of gene haploinsufficiency predictions using a slightly smaller set of HS genes identified in a large subset of the GWAS controls. The performance of the HI predictions using the slightly smaller HS gene training data was very similar to that of the full predictive model described in the previous section, as assessed by ten-fold cross validation (Text S1
). We then identified LOF deletions in the remainder (N
2,322) of the GWAS controls 
, which had not been used to train the predictive model, and determined the distribution of the maximal deletion haploinsufficiency score (based on the new gene haploinsufficiency predictions) observed in each control individual. We investigated whether the distribution of maximal LOD scores is significantly different between European-American (E-A) and African-American (A-A) GWAS controls. We observed that there was not a significant difference in median haploinsufficiency scores in E-A and A-A populations (p
0.71, Mann-Whitney U test), although the E-A controls have a slightly longer tail of more pathogenic deletions (e.g. a higher proportion of E-A controls have deletions with LOD scores greater than 3, Table S4
). The 50%, 90%, 95% and 99% percentiles of the distribution for maximal LOD score for E-A and A-A controls cohorts are listed in .
Percentiles of the distribution of maximal LOD scores seen in controls.
We calculated a LOD score for each of 487 pathogenic de novo
deletions submitted by clinical geneticists to the DECIPHER database 
. We focused exclusively on deletions known to be de novo
variants, as we infer that their pathogenicity has been ascribed primarily on the basis of their inheritance status. The distributions of maximal LOD scores in GWAS controls and LOD scores of pathogenic DECIPHER deletions are shown in . The pathogenic deletions have strikingly significantly higher LOD scores than deletions observed in GWAS controls (p<1e-30, Mann-Whitney U test). We observed that for 92% of the pathogenic deletions there was a probability of less than 5% of drawing an individual at random from our control population with a genic deletion of equal or greater LOD score, and for 83% of pathogenic deletions there was a less than 1% probability.
We computed ROC curves to compare three different approaches for discriminating between pathogenic deletions and deletions seen in controls: (i) our LOD scores, (ii) the length of the deletion, and (iii) the number of genes in the deletion (). These ROC curves clearly show that the haploinsufficiency LOD score is the best metric for discriminating between pathogenic deletions in patients and deletions seen in controls. We provide a script and input files to calculate LOD scores and make comparisons with control data (Protocol S1
Comparison of different metrics for assessing deletion pathogenicity.
Haploinsufficient gene predictions and loss-of-function sequence variation
We investigated whether the gene-based probabilities of haploinsufficiency that we have generated are of general utility across different forms of genetic variation. If this is indeed the case then we should expect that genes harbouring loss-of-function substitutions or small indels in apparently healthy individuals should not have a high p(HI). We identified 349 genes as having LOF substitutions and indels in 12 recently sequenced exomes 
, of these, we could estimate p(HI) for 176 that were not also in the HS training set (and thus represent a fair set for independent comparisons). These genes are highly significantly enriched among genes with low probabilities of exhibiting haploinsufficiency (p
1.06e-20 when comparing to the genome, and p<1e-30 when comparing to known HI genes, Mann-Whitney U test). This result implies that there are not substantial differences between genes that tolerate whole gene deletions and those that tolerate smaller loss-of-function variants. Moreover, by studying the allele frequency spectrum apparent in a large gene-resequencing dataset that has been extensively studied for patterns of selective constraint 
we observed that nonsynonymous variants in genes more likely to exhibit haploinsufficiency are highly significantly skewed towards rarer variants than nonsynonymous variants in genes less likely to exhibit haploinsufficiency, in both African Americans and European Americans (p
2.9e-7 and 4.0e-3 respectively, one-sided Mann-Whitney U test, see also Figure S9
) reflecting greater, on average, selective constraint on genes we predict to exhibit haploinsufficiency (Text S1