In this work, we propose a specialised method to identify multiple rare mutations underlying a genetically heterogeneous disease. Analysis of real data and power simulations show that the proposed weighted-sum method performs very well compared to existing methods. This demonstrates that the use of specialised analytical methods can improve power to identify genetic components of complex (genetically heterogeneous) diseases. On the other hand, it must be kept in mind that the power of such specialisation comes at the cost of generality, and the methods must therefore be used in combination with other strategies covering other biological scenarios, such as the common-variant common-disease scenario. It must further be noted that all methods using the grouping approach (i.e. CMC, CAST and weighted-sum) are sensitive to misclassification of which allele is treated as the mutation (i.e. the disease-related allele). If disease-related alleles from some variants are grouped with wild-type alleles from other variants, a true signal may be hidden. As stated in the Background section, it may be natural to treat e.g. non-synonymous substitutions, frameshift indels and very rare alleles as mutations, but when there is no information with which to classify the alleles, grouping methods may not be useful. Instead, the idea from the CMC method can be used: the variants that can be grouped are analysed with a grouping statistic (e.g. the weighted-sum method), and all other variants are analysed variant by variant or by multivariate analysis.
The weighted-sum method is designed for resequencing data, since this technology allows rare mutations to be observed directly. The use of inferred haplotypes from tag SNP studies is a current approach to evaluating unobserved variants, but this approach fails when the unobserved variants are rare; the tag SNP approach is hence not suited to the scenario of multiple rare disease mutations. Alternatively, familial linkage studies are a strategy to identify mutations underlying genetically heterogeneous diseases, but when the marginal effect of each mutation is low, it may be difficult to obtain a sufficient number of affected individuals to detect a disease association.
The weighted-sum method can be adapted to a wide range of study designs, for example by: (A) Using the posterior probability of each genotype rather than the most probable genotype. (B) Analysing mutations in conserved areas by weighting each mutation according to the measure of conservation; this extends a previously proposed conservation-based selection criterion. (C) Analysing continuous traits by testing for correlation between genetic ranks (or scores) and the trait measure. Furthermore, the weighted-sum method can be used for other types of data that can be grouped according to function. Such data include, for example, methylation measures, where multiple regions/sites can be methylated in promoter regions (i.e. the CpG islands). Note that ranking can be omitted from the test procedure, so that the test statistic is the sum of the genetic scores ($\gamma_i$) of all affected individuals, rather than the sum of their ranks. In the tests performed in this study, the two procedures yield very similar results (results not shown), but we prefer the ranking procedure because it is robust to outliers.
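The difference between the rank-based and the score-based statistic can be illustrated with a minimal sketch. The function names and the toy scores are hypothetical and chosen only for illustration; ties are given average ranks here, which is an assumption rather than a detail taken from the text.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = mean_rank
        pos = end + 1
    return result


def rank_sum(scores, affected):
    """Sum of the ranks of the genetic scores of affected individuals."""
    r = ranks(scores)
    return sum(ri for ri, a in zip(r, affected) if a)


def score_sum(scores, affected):
    """Sum of the raw genetic scores of affected individuals (no ranking)."""
    return sum(s for s, a in zip(scores, affected) if a)


# Toy data: six individuals, the last three are affected.
affected = [False, False, False, True, True, True]
scores = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
# Same data, but one affected individual has an extreme score.
scores_outlier = [0.0, 0.5, 1.0, 1.5, 2.0, 250.0]

print(rank_sum(scores, affected), rank_sum(scores_outlier, affected))
print(score_sum(scores, affected), score_sum(scores_outlier, affected))
```

In this toy example the rank sum is unchanged by the extreme score, whereas the score sum is dominated by it. In either case the statistic would still need to be compared with its null distribution, for instance by permuting the affected/unaffected labels.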
The mutation weights can be chosen in an infinite number of ways. We suggest using the estimated standard deviation of the total number of mutations in the sample (including affected and unaffected individuals), under the null hypothesis of no frequency differences between affected and unaffected individuals. This choice of weight ensures that all variants in a group contribute equally to the weighted sum under the null hypothesis. The weight of each mutation is determined by its frequency among the unaffected individuals only. In this way, a mutation that is common among unaffected individuals has a lower weight than a mutation that is rare among unaffected individuals. If further information about the mutations is available, it may be incorporated in the weights. Such information could include the estimated impact of a mutation or a measure of conservation of the surrounding region (as discussed above).
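A minimal sketch of such a weighting scheme is given below. The exact formula is not stated in this section, so the form used here is an assumption: the mutation frequency is estimated among unaffected individuals with add-one smoothing, and the weight is a binomial-style standard deviation over all individuals. The function names and the genotype encoding (0, 1 or 2 mutant alleles per individual and variant) are hypothetical.

```python
import math


def mutation_weights(genotypes, affected):
    """Weight per variant, from the smoothed frequency among unaffected.

    genotypes[j][i] is the number of mutant alleles (0, 1 or 2) carried
    by individual j at variant i; affected[j] is True for affected.
    """
    n_total = len(genotypes)
    unaffected = [g for g, a in zip(genotypes, affected) if not a]
    n_unaffected = len(unaffected)
    weights = []
    for i in range(len(genotypes[0])):
        m_unaffected = sum(g[i] for g in unaffected)
        # Assumed smoothing: add one mutant allele and two alleles in total,
        # so that the estimated frequency is never exactly 0 or 1.
        q = (m_unaffected + 1) / (2 * n_unaffected + 2)
        weights.append(math.sqrt(n_total * q * (1 - q)))
    return weights


def genetic_scores(genotypes, weights):
    """Per-individual score: mutation counts divided by variant weights."""
    return [sum(c / w for c, w in zip(g, weights)) for g in genotypes]


# Toy data: variant 0 is unseen among the unaffected, variant 1 is common.
genotypes = [[1, 0], [0, 1], [0, 1], [0, 1]]
affected = [True, True, False, False]
weights = mutation_weights(genotypes, affected)
scores = genetic_scores(genotypes, weights)
print(weights, scores)
```

Because each mutation count is divided by its weight, the mutation that is rare among the unaffected individuals contributes more to a carrier's score than the common one. Additional information, such as a conservation measure, could be folded in by multiplying the corresponding term by a further factor.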
Analysis of pathways can be done in two different ways. One way is to use the pathway as a group, and run the test on the entire pathway. On the other hand, for large pathways, it may be beneficial to use a method that allows a gene with a strong signal to have a high impact on the combined pathway test statistic ($T$). If a pathway contains $G$ non-overlapping genes, a method to do this is to use the weighted-sum method on each gene, and combine the resulting p-values ($\pi_1, \ldots, \pi_G$) with the Fisher product test statistic

$$T = -2 \sum_{g=1}^{G} \ln \pi_g .$$

Since $\pi_1, \ldots, \pi_G$ are i.i.d. uniform(0,1) distributed under the null hypothesis, $T$ is $\chi^2$-distributed with $2G$ degrees of freedom, and can be evaluated accordingly. This method allows for fast analysis of different pathways, using the results from the gene analysis, and can thereby assist in the functional analysis of a disease association study.
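The combination of per-gene p-values with the Fisher product statistic, in its standard form $T = -2 \sum_g \ln \pi_g$, can be sketched as follows. The function name and the example p-values are hypothetical. Because the degrees of freedom are always even ($2G$), the chi-square tail probability has a closed form, so the sketch needs only the standard library.

```python
import math


def fisher_combined_pvalue(gene_pvalues):
    """Combine independent per-gene p-values with Fisher's method.

    Returns (T, p), where T = -2 * sum(ln p_g) and p is the upper-tail
    probability of a chi-square distribution with 2G degrees of freedom.
    """
    G = len(gene_pvalues)
    if G == 0:
        raise ValueError("at least one p-value is required")
    if any(not (0.0 < p <= 1.0) for p in gene_pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    t = -2.0 * sum(math.log(p) for p in gene_pvalues)
    # For even degrees of freedom 2G the chi-square survival function is
    # exp(-t/2) * sum_{k=0}^{G-1} (t/2)^k / k!
    half = t / 2.0
    term = 1.0
    total = 1.0
    for k in range(1, G):
        term *= half / k
        total += term
    return t, math.exp(-half) * total


# Hypothetical per-gene p-values for a pathway with three genes.
t_stat, pathway_p = fisher_combined_pvalue([0.001, 0.40, 0.75])
print(t_stat, pathway_p)
```

A single gene with a very small p-value can dominate $T$, which is the property that makes this combination attractive for large pathways. The calculation assumes that the per-gene p-values are independent, which is why the genes are required to be non-overlapping.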
Simulating inheritance of a genetically heterogeneous disease can be performed in different ways. To ensure that all variants have a low effect, we have chosen to simulate all variants within a group with the same population attributable risk (PAR). An alternative scenario is to simulate all variants in a group with the same relative risk (RR), and let the PAR vary according to the mutation frequency. Under this scenario, a single common mutation, or a few, may carry a large part of the total risk, and the scenario is hence equivalent to one with a single or a few disease-contributing variants. A few common variants carrying a relatively large risk is exactly what studies using panels of SNPs are designed for, and our focus has therefore been on scenarios where the disease risk cannot be explained by a few variants. Note further that all investigated methods are able to identify cases where a few mutations carry a large part of the total risk. We have further included the comparison of the Encode populations, to cover a scenario where the mutation frequencies are distributed according to an actual population.
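The contrast between the two simulation scenarios can be illustrated numerically. The sketch below assumes the standard relation between PAR, relative risk and the frequency of the mutation, PAR = q(RR - 1) / (1 + q(RR - 1)); whether the simulations in this study use exactly this relation is not stated in this section, and the function names and numerical values are hypothetical.

```python
def relative_risk_from_par(par, freq):
    """Relative risk implied by a given PAR at mutation frequency freq."""
    return 1.0 + par / ((1.0 - par) * freq)


def par_from_relative_risk(rr, freq):
    """PAR implied by a given relative risk at mutation frequency freq."""
    excess = freq * (rr - 1.0)
    return excess / (1.0 + excess)


frequencies = [0.001, 0.01, 0.1]

# Scenario 1: same PAR for every variant; rarer variants need a larger RR.
same_par = [relative_risk_from_par(0.005, q) for q in frequencies]

# Scenario 2: same RR for every variant; the common variant carries most
# of the attributable risk.
same_rr = [par_from_relative_risk(2.0, q) for q in frequencies]

print(same_par)
print(same_rr)
```

With a fixed relative risk, the PAR of the most common variant in this toy example is roughly two orders of magnitude larger than that of the rarest one, which is the sense in which a few common mutations can carry most of the total risk.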
In summary, we show that the weighted-sum method is powerful for identifying multiple rare mutations underlying genetically heterogeneous diseases. Under some genetic scenarios, 1000 affected and 1000 unaffected individuals are sufficient to identify e.g. a gene with a PAR of only 1%, corresponding to an odds ratio of 1.1. These findings thus demonstrate that resequencing studies have the potential to identify important genetic associations, provided specialised analysis methods are used.