In this set of simulations we have attempted to illustrate some of the more important aspects and principles of designing studies that apply MPS to common human disease. Our overall approach is modeled on the two-stage GWAS design, in which cases and controls are assayed on dense (500K–1M) SNP chips, followed by a second stage in which a smaller number of SNPs, consisting of the top N (e.g., 10,000) SNPs ranked by p-value, are tested on a (usually) larger independent set of cases and controls. For MPS studies, our design begins with a first stage in which whole-exome sequencing is performed on samples from individuals in exceptionally high-risk families (or other extreme phenotypes). In our case, the second stage is not a set of SNPs or variants but a sequencing screen of the relatively small number of genes that meet the criteria applied in the first stage, carried out in a larger number of individuals with a relatively inexpensive high-throughput strategy such as HRM, DHPLC, or even targeted capture MPS. Within this framework we have examined the trade-off between the power to detect true disease loci and the vast number of false positives that will be generated in any MPS experiment. More importantly, we have examined the efficacy and efficiency of a variety of filtering strategies in reducing the number of false positives without dramatically reducing the power to detect the true disease susceptibility loci. Liu and Leal have also explored strategies for two-stage designs in the context of whole-exome sequencing of a series of cases and controls, with individual interesting variants or genes evaluated in a second stage. In their case, the two primary strategies compared were to evaluate individual variants in the second stage or to re-sequence the genes in which those variants were identified. They concluded that sequence-based replication is generally advantageous if the stage I sample size is relatively small, as many variants will not be identified in the initial sequencing. The methods proposed by ourselves and by Price et al. both incorporate filters based on allele frequency and in silico analysis of sequence variants in order to create an aggregate test of the hypothesis that rare variants in a gene contribute to disease risk that is more powerful than methods based simply on counting the numbers of variants observed in cases and controls. The problem we address here differs from the case-control design examined in the above studies in at least two respects. First, we assume that the individual disease alleles are much rarer than those considered above and would not be amenable (even in aggregate) to a case-control approach; second, we consider the whole exome rather than a specific gene or pathway, as is typically done in the case-control analyses above. It is not clear how these methods would perform when applied to 20,000 genes; the sample size required to detect the (even large) effects of many such alleles, after correction for the multiple comparisons inherent in a genome-wide approach, would likely be prohibitive. Kryukov et al. have specifically examined the power of whole-exome sequencing studies using extreme phenotypes and concluded that, for reasonable effect sizes, detecting the effects of rare alleles in individual genes would be possible, although the sample sizes would be in the thousands for whole-exome sequencing, and the number of individuals that would need to be phenotyped to allow adequate selection of extremes would be substantially larger. We have therefore focused our paper on the problem of vanishingly rare variants of large effect and the use of family studies to identify the specific genes harboring such variants.
There are a large number of possible genetic architectures potentially underlying each disease, and in most cases we can only make educated guesses about the true genetic basis. In the analyses presented here, however, we have explicitly or implicitly assumed several key features. First, we assume that a substantial proportion of the “missing” genetic variance is due to individually rare alleles that confer a moderate to high increased risk of disease (5–20 fold). Second, we have assumed that the pedigrees available for whole-exome sequence analysis are likely to be segregating a pathogenic mutation in one such gene, although not all cases in the pedigree are necessarily due to this mutation (i.e., phenocopies). Additionally, we have assumed that each sequenced exome will contain a large number of rare missense variants that are independent of disease and a smaller number of protein-truncating variants, even after filtering by frequency and by in silico analyses. We have assumed that this filter would reduce the number of rare missense variants by 70%; although this may seem somewhat arbitrary, our experience in analyzing a number of genes shows that this level of filtering is easily achievable even using available multiple-species protein sequence alignments. Further, such a filter can easily be adjusted to provide more (or less) stringent filtering by requiring different degrees of evolutionary sequence conservation and/or more radical changes of the affected residue.
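The frequency and in silico filtering step described above can be sketched as follows. This is a toy illustration, not the simulation code used in the study: the data layout, the 0.001 frequency cutoff, and the 30% in silico pass rate are our assumptions (the text assumes only that roughly 70% of rare missense variants are removed).

```python
import random

def filter_variants(variants, freq_cutoff=0.001, insilico_pass_rate=0.30,
                    rng=None):
    """Toy two-step variant filter (illustrative parameters).

    Step 1: discard variants whose population allele frequency exceeds
    freq_cutoff. Step 2: keep each surviving missense variant with
    probability insilico_pass_rate (i.e., ~70% are removed, standing in
    for a conservation/deleteriousness filter); protein-truncating
    variants always pass step 2.
    """
    rng = rng or random.Random(0)
    kept = []
    for v in variants:
        if v["freq"] > freq_cutoff:
            continue  # too common to be the sought high-risk allele
        if v["type"] == "truncating" or rng.random() < insilico_pass_rate:
            kept.append(v)
    return kept
```

Raising or lowering `insilico_pass_rate` corresponds to tuning the stringency of the conservation requirement, as discussed above.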
Taken as a whole, the results presented in , , , and demonstrate that while the choice of an appropriate strategy will depend on a variety of factors, the optimal degree of filtering depends on the sample size as well as on the number of individuals sequenced in each pedigree; there is no single optimal strategy. As shows, when stringent filtering based on multiple variants in the same gene in different pedigrees is applied (e.g., N3RV), the number of false positives is approximately the same for a fixed number of pedigrees no matter how many individuals are sequenced. This indicates that the concordance aspect is less important, since a given variant has only to be concordant in a single pedigree, so there is a balance between the number of exomes sequenced and the additional filtering. In contrast, for the looser filter N1RV there are large differences in the number of false genes as a function of the number of individuals sequenced per pedigree. also demonstrates that requiring multiple rare, potentially pathogenic variants (based on a simple bioinformatics filter) in the same gene as a criterion for selection into stage II sequencing is a very effective strategy for reducing the number of false positives (and hence the cost). Of course, this will also reduce the number of true genes identified, with the magnitude of the reduction depending on the true underlying genetic architecture.
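As a concrete reading of these multi-pedigree filters, the sketch below selects genes for stage II only when qualifying (within-pedigree concordant) rare variants are observed in at least k distinct pedigrees. The interpretation of the NkRV labels as thresholds on the number of pedigrees, and the pair-list input format, are our assumptions for illustration.

```python
from collections import defaultdict

def select_genes(pedigree_hits, min_pedigrees):
    """Keep a gene for stage II screening only if qualifying rare
    variants (concordant among that pedigree's sequenced cases) were
    observed in at least min_pedigrees distinct pedigrees.

    pedigree_hits: iterable of (pedigree_id, gene_name) pairs.
    """
    peds_per_gene = defaultdict(set)
    for ped, gene in pedigree_hits:
        peds_per_gene[gene].add(ped)
    return {g for g, peds in peds_per_gene.items()
            if len(peds) >= min_pedigrees}

hits = [(1, "GENE_A"), (2, "GENE_A"), (3, "GENE_A"), (1, "GENE_B")]
assert select_genes(hits, 3) == {"GENE_A"}            # stringent (N3RV-like)
assert select_genes(hits, 1) == {"GENE_A", "GENE_B"}  # loose (N1RV-like)
```

The stringent threshold discards the singleton hit in GENE_B, illustrating how the same mechanism removes false positives and, if a true gene is hit in too few pedigrees, true positives as well.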
Under the models examined, one point that is evident from our results is that as the number of pedigrees increases, the cost of the study increases more rapidly than the number of susceptibility genes identified, particularly when only a single individual is sequenced per pedigree. This is true under a variety of different filtering strategies.
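A back-of-the-envelope model of why cost outpaces yield, under the purely illustrative assumption of a fixed pool of true susceptibility genes with each pedigree segregating a mutation in one gene chosen uniformly at random:

```python
def expected_genes_found(n_pedigrees, n_true_genes=50):
    """Expected number of distinct true genes represented among the
    pedigrees; saturates as additional pedigrees increasingly
    rediscover genes already found."""
    return n_true_genes * (1.0 - (1.0 - 1.0 / n_true_genes) ** n_pedigrees)

def cost_per_true_gene(n_pedigrees, exomes_per_pedigree=1,
                       exome_cost=1.0, n_true_genes=50):
    """Stage I sequencing cost per expected true gene discovered
    (arbitrary cost units; stage II screening costs omitted)."""
    total_cost = n_pedigrees * exomes_per_pedigree * exome_cost
    return total_cost / expected_genes_found(n_pedigrees, n_true_genes)
```

Under these default assumptions, the cost per true gene at 100 pedigrees is roughly double that at 10 pedigrees: stage I cost grows linearly while the expected yield of distinct genes saturates.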
Often the choice of strategy will be determined by the availability of a sufficient number of suitable pedigrees and, beyond that, by the ability to obtain DNA of sufficient quantity and quality from the appropriate members of each pedigree when two or three individuals per pedigree are to be sequenced. In many cases it is easier to obtain a larger number of pedigrees suitable for analysis in stage II. In this regard, it is useful that several designs are roughly equivalent in terms of cost and the number of true genes identified. In choosing which cases within a pedigree to sequence in stage I, there is typically a trade-off between power and false positive rate. If the cases are too closely related (e.g., siblings), the concordance filter cannot effectively exclude false positive genes; on the other hand, if they are too distantly related, particularly without intervening affected relatives and for a common disease, the probability that the two individuals do not share a true high-risk mutation is increased, and the concordance filter will have a higher likelihood of rejecting a true disease gene.
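The first half of this trade-off can be quantified with the standard coefficient of relationship: the closer the pair, the larger the fraction of one case's rare benign variants expected to be shared identical-by-descent, and hence the weaker the concordance filter. The variant count in the usage note is an arbitrary illustration.

```python
# Probability that a rare variant carried by one individual is also
# carried by a given relative, via the standard coefficient of
# relationship (siblings 1/2, avuncular 1/4, first cousins 1/8).
RELATIONSHIP_COEFF = {
    "siblings": 1 / 2,
    "avuncular": 1 / 4,      # aunt/uncle with niece/nephew
    "first_cousins": 1 / 8,
}

def expected_shared_benign(n_rare_variants, relationship):
    """Expected number of rare, disease-independent variants shared by
    descent between two sequenced cases, i.e., false positives that
    survive the within-pedigree concordance filter."""
    return n_rare_variants * RELATIONSHIP_COEFF[relationship]
```

With, say, 300 rare benign variants per exome, siblings would be expected to share about 150 of them by descent, but first cousins fewer than 40, which is why more distantly related pairs filter false positives more effectively, at the cost of a higher chance that they do not share the true mutation at all.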
We have made both explicit and implicit assumptions in our simulations, including the number of genes in the genome, the distribution of variants across those genes, the proportion of pathogenic variants of a given type, and the sensitivity of MPS in detecting true pathogenic variants. Although inaccuracies in these assumptions may affect some of the finer details of our results, we believe that the overall conclusions of our study are sound. Clearly the biggest factor influencing our ability to identify novel susceptibility alleles for complex human disease is the underlying genetic architecture, which unfortunately is essentially unknowable, although epidemiological and other data, such as linkage studies, can often provide rough guides. If much of the genetic variance is due to many rare alleles of relatively modest effect in many different genes, and if the disease is common, different approaches will likely have to be developed to identify these genes.
Our study shows that the choice of the appropriate design and filtering strategies will likely depend on many factors, and there is no “one-size-fits-all” recommendation. The choice of design depends on the funds available, the ability to identify high-risk families such as those typified in our simulations, and the ability to obtain DNA samples from the best individuals within each pedigree. Sequencing three cases per family does not add much additional variant filtering compared with sequencing two individuals per pedigree, and thus does not meaningfully reduce the overall cost. Moreover, this strategy results in lower power because of the exclusion of true genes due to the higher probability that one of the three cases is a phenocopy. We note, however, that there may be situations in which it would be desirable to sequence three cases, for example if only siblings were available. Overall, our results provide some general guidelines indicating that a reasonable fraction of moderate- to high-penetrance genes can be identified for complex diseases with practical and economical study designs. As the costs of MPS drop for both whole-exome and whole-genome approaches, different strategies may become economically feasible. In particular, it may be possible to perform the second stage on larger numbers of genes using targeted sequence capture, or all available families could be screened in the first stage. In either case, there will still be a need for effective filtering strategies, particularly for whole-genome sequencing. Although the strategy and the sequencing method employed are clearly important, the key to success in identifying novel susceptibility genes for common disease will ultimately rest on the availability of large series of well-characterized families with many cases of the disease of interest and with the appropriate collections of biospecimens available for study.
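The power loss from sequencing a third case can be made concrete under a simple independence assumption: each affected case is a phenocopy with some fixed probability, independently of the others. Real pedigrees violate this independence, so the numbers below are illustrative only, and the 0.2 phenocopy rate is our choice, not a value from the simulations.

```python
def p_all_carriers(n_sequenced, phenocopy_rate):
    """Probability that every sequenced affected case actually carries
    the pedigree's high-risk mutation, so that the within-pedigree
    concordance filter does not reject the true gene.

    Assumes each case is independently a phenocopy with probability
    phenocopy_rate (a simplification)."""
    return (1.0 - phenocopy_rate) ** n_sequenced
```

With a phenocopy rate of 0.2, the true gene survives concordance in 64% of pedigrees when two cases are sequenced but only about 51% when three are, mirroring the power loss described above.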
We recommend that before embarking on whole-exome or whole-genome studies in complex human diseases, careful consideration be given to the concepts discussed in this paper under a set of disease-specific, plausible genetic models. We encourage interested readers to use the supplemental data to repeat these analyses with relative costs of exome sequencing versus candidate gene screening, and with sample sizes, pertinent to their own situations. To assist in this effort, the simulation program used in this study is available from the authors on request.