Genome-wide association studies have been successful in identifying common single-nucleotide polymorphisms (SNPs) that contribute to complex genetic disease [Manolio, 2010]. They are thus supportive of the common disease/common variant hypothesis, which states that frequent SNPs contribute to widespread disease. However, only a portion of the heritability as estimated from twin, adoption, and family studies can be explained by loci identified in genome-wide association studies. Although both the validity and the accuracy of the heritability estimates are questionable [Ziegler and König, 2010, ch. 6], it is widely believed that there is missing heritability [Maher, 2008; Eichler et al., 2010]. This hypothesis, if true, suggests that other genetic mechanisms, such as gene-gene interaction, epigenetics, and rare variants, contribute to disease susceptibility. Indeed, although the exploration of these factors has just begun, there already is increasing evidence that gene-gene interaction and epigenetics play a role [Cordell, 2009; Petronis, 2010].
The investigation of rare variants (rare generally corresponds to a minor allele frequency [MAF] < 1%) is complicated because relatively few rare variants are well represented in current genome-wide association arrays; in addition, the methods used for genome-wide association analysis have low power for low-MAF SNPs unless the effect size is large, and untyped rare variants are poorly tagged by common SNPs. The combination of low MAFs and poor tagging properties makes rare variants unsuitable for analysis with the microarrays used in genome-wide association studies [Asimit and Zeggini, 2010]. However, with the development of novel technologies for high-throughput next-generation sequencing [Metzker, 2009; Meyerson et al., 2010], it is now possible to sequence regions of interest, the exome, or even the entire genome. Unfortunately, the costs are still high, there is a trade-off between cost and accuracy, and the post-processing and analysis of sequence data are challenging.
The reasons that multiple rare variants might play a role in complex genetic disease have been summarized by Bansal et al. [2010]: With the recent expansion of the human population, a large number of segregating, functionally relevant rare variants have emerged that mediate phenotypic variation. Furthermore, multiple rare variants within the same gene contribute to largely monogenic disease [Fitze et al., 2002; Easton et al., 2007]. It is therefore reasonable to assume that the same genetic mechanisms that operate for monogenic disease also apply to common complex disease. Finally, and most important, sequencing studies that focus on specific genes have shown that collections of rare variants can indeed associate with particular phenotypes [Bansal et al., 2010, Table 1].
Several strategies for identifying rare variants that contribute to disease susceptibility have been proposed and include the study of families and studies that place increasing emphasis on other structural variants, such as insertions, deletions, inversions, or translocations [Manolio et al., 2009, Box 1]. The study of large families is a promising approach in this context. Specifically, the study of extended pedigrees has several advantages over the study of unrelated subjects. First, some rare variants can be observed at higher frequencies in extended pedigrees than in the general population. Second, rare variants that segregate with reasonably high penetrance in extended pedigrees can provide a linkage signal, so that deep sequencing is required not for the entire genome but only for those chromosomal regions that show a linkage signal. Thus the sequencing effort is substantially reduced. Third, one can expect larger effect sizes of the rare variants. Fourth, the results are simpler to interpret because the rare variants, together with the disease, run within the families and therefore provide proof of the genetic basis. Finally, families are generally simpler to follow up than individuals.
As an alternative to the study of families, the investigation of subjects with unusual phenotypes or from the extreme ends of the phenotype distribution can be reasonable, as can studies in isolated or founder populations or of subjects of recent African ancestry. These study designs generally lead to analyses that are analogous to those of standard case-control studies. In addition, population-based cohort studies might be of interest to obtain unbiased estimates of population parameters. In all scenarios involving unrelated subjects, the statistical analysis of rare variants remains challenging because of low power. One approach to overcoming the problem of low power is to pool rare variants for analysis; methods of this kind are generally called collapsing methods.
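To make the idea of pooling concrete, the following minimal sketch collapses the rare variants of one region into a single carrier indicator per subject. It is an illustration under stated assumptions, not a specific published method: genotypes are assumed to be coded as counts (0, 1, or 2) of the minor allele, the 1% MAF threshold follows the working definition of rare given above, and the names allele_frequency and collapse_rare_variants are hypothetical.

```python
from typing import List, Sequence


def allele_frequency(allele_counts: Sequence[int]) -> float:
    """Frequency of the counted allele, given per-subject counts of 0, 1, or 2."""
    return sum(allele_counts) / (2 * len(allele_counts))


def collapse_rare_variants(
    genotypes: Sequence[Sequence[int]], maf_threshold: float = 0.01
) -> List[int]:
    """Pool rare variants into one 0/1 carrier indicator per subject.

    genotypes[i][j] is the minor-allele count of subject i at variant j
    (an assumed coding). A variant is treated as rare if its sample allele
    frequency is above 0 and below maf_threshold.
    """
    n_variants = len(genotypes[0])
    rare_columns = [
        j
        for j in range(n_variants)
        if 0 < allele_frequency([row[j] for row in genotypes]) < maf_threshold
    ]
    # A subject is a carrier if any rare variant in the region is present.
    return [int(any(row[j] > 0 for j in rare_columns)) for row in genotypes]


# Toy data: 5 subjects, 3 variants. The threshold is raised to 0.2 only
# because a sample this small cannot contain a variant with MAF below 1%.
toy_genotypes = [
    [0, 1, 0],
    [0, 1, 0],
    [1, 0, 0],
    [0, 2, 0],
    [0, 1, 0],
]
toy_indicator = collapse_rare_variants(toy_genotypes, maf_threshold=0.2)
print(toy_indicator)  # [0, 0, 1, 0, 0]
```

In the toy data only the first variant qualifies as rare (sample frequency 0.1), so only the third subject is flagged as a carrier. The resulting indicator can then be compared between cases and controls like a single binary exposure.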
Several novel collapsing approaches were proposed during Genetic Analysis Workshop 17 (GAW17), and these were often compared with already published collapsing approaches. Furthermore, GAW17 contributors compared various approaches discussed in the literature using the simulated data provided for the workshop. Our aim in this paper is to provide a comprehensive overview of published collapsing methods and the permutation approaches most studies require to assess statistical significance. For simplicity, we restrict the description of the statistical approaches to dichotomous phenotypes.
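As a rough illustration of how a permutation approach can assess significance for a dichotomous phenotype, the sketch below shuffles case-control labels and recomputes a test statistic on a collapsed carrier indicator. The statistic (absolute difference in carrier proportion between cases and controls), the function names, and the default number of permutations are assumptions chosen for illustration; they are not taken from any particular method reviewed here.

```python
import random
from typing import Sequence


def carrier_rate_difference(carrier: Sequence[int], status: Sequence[int]) -> float:
    """Absolute difference in carrier proportion between cases (1) and controls (0)."""
    cases = [c for c, s in zip(carrier, status) if s == 1]
    controls = [c for c, s in zip(carrier, status) if s == 0]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))


def permutation_p_value(
    carrier: Sequence[int],
    status: Sequence[int],
    n_permutations: int = 1000,
    seed: int = 0,
) -> float:
    """Empirical p-value obtained by permuting case-control labels.

    Shuffling keeps the numbers of cases and controls fixed, so the null
    distribution reflects no association between carrier status and disease.
    """
    rng = random.Random(seed)
    observed = carrier_rate_difference(carrier, status)
    shuffled = list(status)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if carrier_rate_difference(carrier, shuffled) >= observed:
            hits += 1
    # Adding 1 to numerator and denominator keeps the estimate above zero.
    return (hits + 1) / (n_permutations + 1)


# Toy example: all 10 cases are carriers, none of the 10 controls are.
toy_status = [1] * 10 + [0] * 10
toy_carrier = [1] * 10 + [0] * 10
toy_p = permutation_p_value(toy_carrier, toy_status, n_permutations=200)
```

The add-one correction is a common convention for permutation p-values; the smallest attainable value is 1 / (n_permutations + 1), so the number of permutations bounds the resolution of the test.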