Using data simulated by GAW17, in the current study we compared population-based and family-based designs for their ability to identify rare causal variants, as well as gene-level association. We found that the population-based and family-based designs can result in the identification of different causal variants and genes. Because the same underlying simulated model was used for both the family- and population-based data sets, these results suggest that both of these designs have roles in the discovery of rare variant association.
By comparing the identified and unidentified causal genes (Tables and ), we found several interesting characteristics. Each design identified particular genes most of the time: KDR and FLT1 in the population-based data, and VEGFA and VEGFC in the family-based data. In the family-based data, both KDR and FLT1 have five polymorphic causal variants, whereas VEGFA and VEGFC have only a single causal variant each. Given the expected performance of linkage, one might expect it to work better in genes with multiple variants. However, VEGFA and VEGFC show larger effects (β = 1.21 and 1.36, respectively); thus the ability to detect the VEGF genes may reflect the effect size more than the number of variants. On the other hand, the methods we used to identify gene-Q1 association in the population-based data rely largely on the probability of capturing rare variants; thus higher power for genes with more rare variants (KDR and FLT1) is not surprising.
When comparing SNP association and the measured genotype approach, we found that power is related to MAF (Additional file 1). When MAF is similar, these two methods show no difference. On the other hand, the two data sets identify different SNPs. Because similar approaches are used, this difference is likely due to the design. The results suggest that for SNPs that are rare in a population, a family-based design may provide an opportunity to enrich the rare SNPs, thus increasing the power to detect the SNP-phenotype association (e.g., C6S2981 and C4S4935). However, a family-based sample may lack polymorphism by chance. In this case, population sampling may be advantageous (e.g., C4S1877 and C4S1889).
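The trade-off described above can be made concrete with a small calculation. The following is a minimal sketch, not part of the analysis reported here; it assumes Hardy-Weinberg proportions, independent chromosomes, and random mating, and the function names are hypothetical. It shows the chance that a rare allele appears at all in a sample of unrelated individuals, and how carrier frequency rises among the offspring of a known carrier.

```python
def prob_polymorphic(maf: float, n_individuals: int) -> float:
    """Chance that at least one of the 2*n sampled chromosomes carries
    the minor allele (assumes independent chromosomes)."""
    if not 0.0 <= maf <= 1.0:
        raise ValueError("maf must be between 0 and 1")
    return 1.0 - (1.0 - maf) ** (2 * n_individuals)


def population_carrier_freq(maf: float) -> float:
    """Chance that a random individual carries at least one copy
    (assumes Hardy-Weinberg proportions)."""
    return 1.0 - (1.0 - maf) ** 2


def offspring_carrier_prob(maf: float) -> float:
    """Chance that a child of one heterozygous carrier and a randomly
    chosen mate carries at least one copy (assumes random mating)."""
    # No copy from the carrier parent (1/2) and none from the mate (1 - maf).
    return 1.0 - 0.5 * (1.0 - maf)


if __name__ == "__main__":
    maf = 0.001  # illustrative value only, not taken from the GAW17 data
    print(prob_polymorphic(maf, 100))    # few individuals: allele is often absent
    print(population_carrier_freq(maf))  # roughly 2 * maf for a rare allele
    print(offspring_carrier_prob(maf))   # about one half within the family
```

Under these assumptions, a rare allele that enters a pedigree is carried by about half of a carrier's children, whereas a small sample that happens not to include a carrier has no power for that variant at all.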
When comparing linkage and association results from the family-based data (Table and Additional file 1), we noticed that FLT4 were identified by linkage, but the causal SNPs on these two genes were either nonpolymorphic or had no power to be identified even at the 0.01 level in the association test. Thus, when analyzing family-based data, linkage analysis may be advantageous in the identification of causal regions by using other genetic variations in the same region.
We also compared the association results at the SNP and gene levels from the population-based data (Table and Additional file 1). It appears that gene-level association is not likely to be detected when SNP-level association is lacking. Collapsing the information from the rare SNPs in a particular gene may not enhance power or provide additional information in the way that linkage analysis does.
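To illustrate what collapsing rare SNPs within a gene means, the following is a minimal sketch of a generic collapsing approach; it is not necessarily the method used in this study. The MAF cutoff, the function names, and the toy data are all assumptions made for illustration. Each individual is coded as a carrier or non-carrier of any rare variant in the gene, and the quantitative trait is then compared between the two groups.

```python
from math import sqrt
from statistics import mean, variance


def collapse_rare(genotypes, mafs, maf_cutoff=0.01):
    """Return 1 for individuals carrying any variant with MAF below the
    cutoff, else 0. Rows of `genotypes` are individuals, columns are SNPs,
    and entries are minor-allele counts. The cutoff is an assumed value."""
    rare_idx = [j for j, maf in enumerate(mafs) if maf < maf_cutoff]
    return [int(any(row[j] > 0 for j in rare_idx)) for row in genotypes]


def welch_t(trait, carrier):
    """Welch t statistic comparing trait means of carriers and non-carriers.
    Returns None when a group is too small or the standard error is zero."""
    carriers = [y for y, c in zip(trait, carrier) if c]
    others = [y for y, c in zip(trait, carrier) if not c]
    if len(carriers) < 2 or len(others) < 2:
        return None
    se = sqrt(variance(carriers) / len(carriers) + variance(others) / len(others))
    if se == 0:
        return None
    return (mean(carriers) - mean(others)) / se


# Toy data, invented for illustration: 6 individuals, 3 SNPs in one gene.
toy_genotypes = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 2, 0], [0, 0, 0], [1, 1, 0]]
toy_mafs = [0.005, 0.2, 0.008]  # the second SNP is common and is ignored
toy_trait = [2.0, 0.5, 1.8, 0.4, 0.6, 2.2]

toy_carrier = collapse_rare(toy_genotypes, toy_mafs)
toy_t = welch_t(toy_trait, toy_carrier)
```

Because the collapsed indicator is built only from the rare variants observed in the sample, it carries no signal when those variants are absent or individually uninformative, which is consistent with the observation above.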
Taken together, these results suggest that neither the family-based nor the population-based analysis we used is sufficient on its own to identify causal variants in next-generation sequence data, especially in the context of rare variants. Given that the family-based design offers a variety of advantages that are unavailable with unrelated individuals (such as testing segregation with disease rather than just co-occurrence) and may enrich rare variants, it may also be valuable for genome-wide SNP scanning for novel causal variants. Population- and family-based designs can be complementary and should both be considered in future genome-wide association studies.