Exome and whole-genome sequencing of patients is becoming a major approach for uncovering the molecular basis of uncharacterized rare human Mendelian disease loci. In this report, we identified disease and design factors that influence the statistical power of this approach, established an analytical framework that quantitatively links these factors to statistical power, and validated the model by computational simulation. As expected, the statistical power of identifying disease genes is affected by both experimental conditions and intrinsic features of the diseases. Importantly, based on our model, for recessive Mendelian diseases the vast majority of disease genes can be readily identified when a moderate number of patients with the same disease are sequenced and analyzed together. This is true even when the heterogeneity of the disease is high. For example, for a recessive disease, a power of 0.89 can be reached for identifying a gene responsible for as little as 5% of the disease population by sequencing 200 unrelated patients. In contrast, the power for dominant diseases is substantially lower: more than 1000 patients must be sequenced to achieve comparable power. This result is significant because it indicates that the molecular basis of the vast majority of uncharacterized recessive disease loci can be resolved using the exome sequencing approach.
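To make the dependence of power on sample size and heterogeneity concrete, the following is a minimal sketch of a power calculation for a count-based recessive test. It is not the model of this report: it simply assumes that the number of patients carrying ≥2 qualifying mutations in a gene is binomial, and the names `frac_explained`, `p_null`, and `alpha`, as well as all numeric values, are illustrative assumptions. Its output is therefore not expected to reproduce the power values quoted above.

```python
import math


def binom_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p), summing the pmf in log space."""
    if k <= 0:
        return 1.0
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0 if k <= n else 0.0
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_pmf)
    return min(total, 1.0)


def critical_count(n, p_null, alpha):
    """Smallest carrier count k whose null tail probability is <= alpha."""
    for k in range(n + 1):
        if binom_upper_tail(k, n, p_null) <= alpha:
            return k
    return n + 1  # no attainable count is significant at this alpha


def power(n, frac_explained, p_null, alpha):
    """Probability that the carrier count reaches the critical count.

    Assumption: a patient carries >=2 qualifying mutations in the gene either
    because the gene causes their disease (probability frac_explained) or by
    chance (probability p_null, hypothetical value).
    """
    p_alt = frac_explained + (1.0 - frac_explained) * p_null
    return binom_upper_tail(critical_count(n, p_null, alpha), n, p_alt)


if __name__ == "__main__":
    alpha = 0.05 / 20000  # Bonferroni over an assumed ~20,000 genes
    for n in (50, 100, 200):
        print(n, round(power(n, frac_explained=0.05, p_null=1e-4, alpha=alpha), 3))
```

Under these assumptions, power rises with sample size at a fixed level of heterogeneity, which is the qualitative behavior described above for recessive diseases.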
Our framework can provide guidance for both experimental design and data analysis. In general, a proper combination of sufficient sample size, capture sequencing coverage, cutoff for variant identification, stringency of variant filtering, and inclusion of genetic mapping information is important for maximizing the success of exome sequencing experiments. However, the strategies for tackling recessive and dominant diseases are quite different. For recessive diseases, the key factor is sample size. Based on our model, genes underlying highly heterogeneous recessive diseases can be identified by sequencing a moderate number of patients. In contrast, because the Tr statistic, which counts the number of individuals with ≥2 mutations, is already quite effective for recessive diseases, reducing false-positive mutations through aggressive allele frequency filters and bioinformatic filtering has only a minor impact on power. For dominant diseases, the key factor is reducing the number of candidate variants. Both aggressive filters and genetic mapping should be implemented to maximize the exclusion of variants and thereby improve power. Increasing sample size, although positively correlated with power, has limited impact for highly heterogeneous dominant diseases. Other than variant filtering and sample size, a common factor important for experimental design is the underlying heterogeneity of the disease. To increase power, it is highly desirable to minimize heterogeneity. This may be achieved by grouping patients based on their clinical phenotype. In addition, repeating the analysis after excluding samples with already identified causative mutations can be informative. An often overlooked but potentially confounding factor to be considered during data analysis is the length of the gene. Because larger genes accumulate more rare variants by chance, it is important to adjust the statistical significance of findings for gene size. To facilitate the ranking of putative disease genes, the binomial test p-values proposed in our report can be calculated for each candidate gene, providing a unified metric for ranking genes similar to that used in genome-wide association study (GWAS) analyses. To facilitate experimental design, statistical power estimation, and p-value calculation, an online calculator has been developed and can be accessed at http://exomepower.ssg.uab.edu
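The gene-length adjustment and the ranking by binomial p-value can be illustrated with a short sketch. It is not the online calculator or the exact test of this report; it assumes that the chance of a patient carrying a qualifying variant in a gene grows with coding length through a single hypothetical per-base rate (`rate_per_bp`), and the gene names and counts are placeholders.

```python
import math


def binom_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p), summing the pmf in log space."""
    if k <= 0:
        return 1.0
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0 if k <= n else 0.0
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_pmf)
    return min(total, 1.0)


def gene_pvalue(carriers, n_patients, coding_length_bp, rate_per_bp):
    """One-sided binomial p-value for seeing at least `carriers` patients.

    Assumption: under the null, each base independently yields a qualifying
    variant in a patient with probability rate_per_bp (hypothetical), so
    longer genes have a higher chance of a hit.
    """
    p_null = 1.0 - (1.0 - rate_per_bp) ** coding_length_bp
    return binom_upper_tail(carriers, n_patients, p_null)


def rank_genes(observations, n_patients, rate_per_bp):
    """Sort genes by ascending p-value; observations maps gene -> (carriers, length)."""
    scored = [(gene, gene_pvalue(c, n_patients, length, rate_per_bp))
              for gene, (c, length) in observations.items()]
    return sorted(scored, key=lambda item: item[1])


# Placeholder data: (patients carrying a qualifying variant, coding length in bp).
observations = {"GENE_A": (4, 1500), "GENE_B": (4, 30000), "GENE_C": (1, 1500)}
ranked = rank_genes(observations, n_patients=200, rate_per_bp=1e-6)
threshold = 0.05 / 20000  # Bonferroni over an assumed ~20,000 genes
for gene, p in ranked:
    print(gene, f"{p:.3g}", "significant" if p <= threshold else "not significant")
```

In this toy example, GENE_A and GENE_B have the same number of carriers, but the much longer GENE_B ranks last because that many carriers is unremarkable for a gene of its size.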
Our results show that, for rare monogenic Mendelian diseases, it would be feasible to apply the exome-sequencing approach to discover causative genes even when a substantial level of genetic heterogeneity exists among patients. This can be achieved by conducting rigorous statistical tests that can evaluate the statistical significance of identified mutations present in a small portion of a relatively large collection. Therefore, a key to identifying genetically heterogeneous rare Mendelian disease genes is to collect large samples of patients and analyze the sequence data together. As patients with mutations in individual disease genes are rare, it will be more efficient and powerful to combine samples with the same disease from multiple collections for sequencing. In effect, the study of rare disease is not unlike the study of common diseases in which investigators form large consortiums to achieve a sufficiently powered sample size. Given a sufficient number of samples, the lack of extended family data, a major bottleneck for linkage-based disease gene mapping approaches, does not pose a substantial problem for exome sequencing.
Admittedly, in this work we adopted a simple statistical framework, whereas real RMMD exome data analyses often involve applying a number of filters. There are several directions in which a more advanced statistical framework could be established. First, the current framework assumes there is only a single mutation filter. In real data analysis there is often an array of filters, each with a different set of criteria, that are applied in combination. How best to combine these filters and adjust the p-values accordingly is an interesting question. Second, the current framework adopts simple mutation count statistics. It may be useful to take into account the strengths of different types of mutations and the phenotypic differences among patients, as is done in the weighted sum statistics and the post hoc score developed by Ng et al. Third, explicit modeling of disease heterogeneity, either phenotypic or genetic, should be explored as well. Fourth, the proposed test for recessive diseases simply requires that at least two mutations are present in the same gene, because haplotype information for these mutations is typically unavailable. It is possible, with improved genotype and haplotype calling algorithms or longer sequencing reads, that haplotype information can be estimated or observed, and thus one can improve the recessive test by requiring the two mutations to be on different chromosomes. Fifth, mutation filters may be applied based on allele frequencies. Our discussion was mostly focused on strict filters, which assume that disease-causing mutations are not present in any healthy individuals. While this is likely true for dominant diseases and very rare recessive diseases, it may not be true for rare recessive diseases with a moderate prevalence, in which case mutations may be present in healthy individuals in the heterozygous state. In that case, filters based on a certain allele frequency cutoff may be more appropriate. Sixth, software tools that predict variant pathogenicity, such as PolyPhen2, SIFT, and MutationTaster, are often used. The statistical properties of these filters may be studied in future research. Seventh, while this work is primarily focused on exome sequencing, the main results are also applicable to the analysis of the genic portion of whole genome sequencing for rare diseases. Finally, many successful discoveries of disease-causing genes of RMMD by exome sequencing capitalize on rich family information. For example, rare recessive diseases often run in highly inbred families in which patients often carry a common homozygous mutation. While our model is designed for exome sequencing of unrelated individuals with rare Mendelian diseases, it offers insights into two factors that may explain the high rate of success of familial exome sequencing. First, this would be a special case with zero genetic heterogeneity (R = 1). Second, very strict filtering criteria requiring disease-causing mutations to be homozygous can be used, resulting in very small m
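The difference between the current recessive criterion (at least two mutations in the same gene) and the phase-aware refinement suggested in the fourth direction above can be sketched as follows. The `Variant` record and both functions are hypothetical illustrations, not part of the framework in this report; in particular, treating unphased heterozygous calls as uninformative in the phase-aware test is an assumption made here for simplicity.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Variant:
    """A qualifying (post-filter) variant in one gene of one patient."""
    position: int
    alt_copies: int                   # 1 = heterozygous, 2 = homozygous
    haplotype: Optional[int] = None   # 0 or 1 if a heterozygous call is phased


def passes_unphased(variants: List[Variant]) -> bool:
    """Current criterion: at least two mutant alleles in the gene, phase ignored."""
    return sum(v.alt_copies for v in variants) >= 2


def passes_phased(variants: List[Variant]) -> bool:
    """Refined criterion: mutant alleles on both chromosome copies.

    Assumption: unphased heterozygous calls are ignored, which is conservative.
    """
    haplotypes_hit = set()
    for v in variants:
        if v.alt_copies == 2:
            return True               # homozygous: both copies carry the variant
        if v.haplotype is not None:
            haplotypes_hit.add(v.haplotype)
    return haplotypes_hit == {0, 1}


def count_patients(patients: Dict[str, List[Variant]], criterion) -> int:
    """Number of patients whose variants in the gene satisfy the criterion."""
    return sum(1 for variants in patients.values() if criterion(variants))


# Placeholder cohort for a single gene.
patients = {
    "P1": [Variant(100, 1, 0), Variant(250, 1, 1)],  # two hets in trans
    "P2": [Variant(100, 1, 0), Variant(300, 1, 0)],  # two hets in cis
    "P3": [Variant(400, 2)],                         # homozygous
    "P4": [Variant(500, 1)],                         # single unphased het
}
print(count_patients(patients, passes_unphased), count_patients(patients, passes_phased))
```

Patient P2 illustrates the point: two mutations on the same chromosome satisfy the count-based test but leave one functional copy of the gene, so a phase-aware test would exclude this patient.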