Clearly, some contributing genes (FLT1, VEGFA, PRKCA) could be detected by an examination of rare variants in the unrelated subjects. However, it turned out that our second goal (determining whether stratified analyses or a population covariate would be more effective in dealing with population stratification in these data) gave rise to the most interesting results.
FLT1 contained multiple rare variants with large effect on Q1. The signals from this gene were strong in the European and Yoruba populations and present in the Asian populations. (The rare variants in FLT1 were not significant in the Luhya sample, even if 200 replicates were meta-analyzed.) Because the signal was present in subsamples representing more than 84% of the data, pooling the data and using population as a covariate maximized power.
In contrast, VEGFA, found in an analysis of the Luhya sample, was not near the top of the list in the combined analysis. It was not until we meta-analyzed 50 replicates of the full data (total N = 34,850) that this gene surpassed the 10^−6 significance threshold (p = 1.4 × 10^−14). By comparison, meta-analysis of the first 50 replicates of the Luhya subjects alone (total N = 5,400) resulted in an extremely low p-value (p = 2.1 × 10^−94). This is because the rare variant for VEGFA is private to the Luhya population. As a result, including samples from other populations merely introduces noise into the signal. It is interesting to note that VEGFA corresponds to the highest linkage signal found in a linkage analysis of the family data [4].
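To make concrete how combining replicates can push a modest per-replicate signal past a stringent threshold, here is a minimal sketch using Stouffer's equal-weight z-score method. This is an assumption for illustration only: the text does not state which meta-analysis method was used, the function names are hypothetical, and the method treats the replicates as independent, which may not hold when replicates share genotypes.

```python
from math import erfc, sqrt


def stouffer_combine(z_scores):
    """Combine per-replicate z-scores with equal weights (Stouffer's method).

    Assumes the z-scores are independent; this is an illustrative
    assumption, not a description of the analysis in the text.
    """
    if not z_scores:
        raise ValueError("need at least one z-score")
    return sum(z_scores) / sqrt(len(z_scores))


def upper_tail_p(z):
    """One-sided upper-tail p-value for a standard normal z-score.

    erfc is used instead of 1 - cdf(z) so that very small p-values
    do not round to zero prematurely.
    """
    return 0.5 * erfc(z / sqrt(2.0))


if __name__ == "__main__":
    # Hypothetical input: the same modest z-score in each of 50 replicates.
    combined = stouffer_combine([2.0] * 50)
    print(combined, upper_tail_p(combined))
```

Under these assumptions the combined z-score grows with the square root of the number of replicates, so a signal that is consistent but unremarkable in each replicate becomes extreme after combination.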
We note that the phenotypes were modeled identically for the different populations. As a consequence, one might have believed a priori that a combined analysis (perhaps not even using population as a covariate) might be the most powerful approach. However, as illustrated by the results for VEGFA, this need not be the case. This suggests that it might be worthwhile to analyze multipopulation data both ways (stratified and adjusted), despite the multiple testing penalty.
In these data we also found that the two tested approaches to collapsing performed similarly, particularly for the top signals. This simply suggests that in these data the outliers in phenotype were not also outliers in terms of the count variable for any genes. Clearly, this cannot be generalized to other genetic models.
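For readers unfamiliar with collapsing, the sketch below shows two common gene-level summaries of rare variants: a per-subject count of rare minor alleles and a carrier indicator. These are not necessarily the two approaches compared here; the function names, the genotype coding (0, 1, or 2 minor alleles per variant), and the frequency threshold are all assumptions made for illustration.

```python
def minor_allele_freqs(genotypes):
    """Minor-allele frequency per variant.

    genotypes: list of per-subject lists, each entry the number of
    minor alleles (0, 1, or 2) the subject carries at that variant.
    """
    n_subjects = len(genotypes)
    n_variants = len(genotypes[0])
    return [
        sum(row[j] for row in genotypes) / (2 * n_subjects)
        for j in range(n_variants)
    ]


def collapse_counts(genotypes, maf_threshold=0.01):
    """Per-subject count of minor alleles at rare variants in one gene.

    A variant is treated as rare when 0 < MAF < maf_threshold
    (the threshold is a hypothetical default).
    """
    freqs = minor_allele_freqs(genotypes)
    rare = [j for j, f in enumerate(freqs) if 0 < f < maf_threshold]
    return [sum(row[j] for j in rare) for row in genotypes]


def collapse_indicator(genotypes, maf_threshold=0.01):
    """1 if the subject carries any rare minor allele in the gene, else 0."""
    return [1 if c > 0 else 0 for c in collapse_counts(genotypes, maf_threshold)]
```

The two summaries differ only for subjects who carry more than one rare allele in the gene, so they would be expected to give similar results when such subjects are few or do not have extreme phenotypes.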
Finally, although the top signal in each analysis was a true signal, there were many more highly significant false positives than we would have expected. We learned from the analysis of Q4 that it is unlikely that these spurious results were a completely random effect of using multiple replicates of the same genotypes. Two other possible causes come to mind. First, rare variants carried by individuals with extreme phenotypes could give rise to such results. We tested this idea by performing some analyses that included the individual from the CEPH sample (NA7347) who had an extreme Q1 value (>5 standard deviations above the mean) in nearly every replicate. We found that multiple genes that were not included in the model, but for which this individual was the only carrier of rare variants, became significant. Second, the signals could actually be in the data, although they were not included in the generating model. We note that many false positives in these data have been reported as consistently arising under a variety of analysis methods. For a detailed discussion of this aspect of the data, see Luedtke et al. [6].
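The phenotype screen described above (a value more than 5 standard deviations above the mean) can be sketched as follows. The function name and the use of the sample standard deviation are assumptions; the text does not say how the phenotype was standardized.

```python
from statistics import mean, stdev


def flag_extreme_high(values, n_sd=5.0):
    """Return indices of values more than n_sd sample SDs above the mean.

    Uses the sample standard deviation (an assumption) and therefore
    needs at least two values. Returns an empty list when all values
    are identical.
    """
    mu = mean(values)
    sd = stdev(values)
    if sd == 0:
        return []
    return [i for i, v in enumerate(values) if (v - mu) / sd > n_sd]


if __name__ == "__main__":
    # Hypothetical phenotype values: 99 subjects at 0 and one at 100.
    phenotypes = [0.0] * 99 + [100.0]
    print(flag_extreme_high(phenotypes))
```

Note that the outlier itself inflates both the mean and the standard deviation, so in small samples a single extreme value cannot reach a 5 SD cutoff; a robust scale estimate would be one alternative.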