Traditionally, investigators examining gene regions or specific candidate genes might genotype hundreds of SNPs, possibly perform tag SNP selection, and test each SNP for association with disease or disease-related traits. Unfortunately, this approach necessitates multiple-testing correction, which substantially reduces power. PC analysis has been suggested as an exploratory approach that parses the information contained in a large number of correlated SNPs into a smaller number of orthogonal PCs that can be analyzed for association instead of individual SNPs [3, 4]. A significant omnibus test of PCs indicates statistical association between a given region, as represented by the SNPs genotyped, and disease outcomes. However, PC analysis cannot identify the specific SNPs contributing to the association, so individual variants must still be tested to isolate them. We introduce a PC-based clustering method that retains many of the favorable attributes of PC regression but allows identification of the subset of SNPs contributing to the evidence for association, which reduces the multiple-testing burden. We compared the traditional PC approach with the PC-clustering method using the NARAC data and demonstrated that PC-clustering identifies variants in the 3.2-Mb MHC region contributing to RA risk and variation in RA-related traits.
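The traditional PC approach described above can be illustrated with a minimal sketch. It is not the analysis code used in this study: the function names, the simulated genotypes, and the use of a linear-model F test for a quantitative trait are all assumptions made for illustration (a binary outcome such as RA status would instead call for logistic regression and a likelihood ratio test). The sketch retains the fewest PCs whose cumulative share of variance reaches a threshold and then tests them jointly.

```python
import numpy as np

def leading_pc_scores(genotypes, threshold=0.80):
    """Scores and loadings for the fewest PCs reaching `threshold` of total variance."""
    # Standardize each SNP column (assumes no monomorphic SNPs).
    z = (genotypes - genotypes.mean(axis=0)) / genotypes.std(axis=0)
    # Squared singular values are proportional to the PC variances.
    u, s, vt = np.linalg.svd(z, full_matrices=False)
    share = np.cumsum(s**2) / np.sum(s**2)
    k = min(int(np.searchsorted(share, threshold)) + 1, len(s))
    return u[:, :k] * s[:k], vt[:k].T

def omnibus_f(scores, trait):
    """Joint F test of all retained PCs against an intercept-only model."""
    n, k = scores.shape
    design = np.column_stack([np.ones(n), scores])
    beta, *_ = np.linalg.lstsq(design, trait, rcond=None)
    rss_full = np.sum((trait - design @ beta) ** 2)
    rss_null = np.sum((trait - trait.mean()) ** 2)
    f_stat = ((rss_null - rss_full) / k) / (rss_full / (n - k - 1))
    return f_stat, (k, n - k - 1)

def simulate_genotypes(n=300, n_snps=40, n_factors=5, seed=0):
    """Hypothetical correlated 0/1/2 genotypes driven by a few latent factors."""
    rng = np.random.default_rng(seed)
    latent = rng.normal(size=(n, n_factors))
    raw = latent @ rng.normal(size=(n_factors, n_snps)) + rng.normal(size=(n, n_snps))
    return np.digitize(raw, [-1.0, 1.0]).astype(float)

genotypes = simulate_genotypes()
scores, loadings = leading_pc_scores(genotypes)
# Hypothetical quantitative trait that depends on the first PC.
trait = scores[:, 0] + np.random.default_rng(1).normal(size=genotypes.shape[0])
f_stat, dof = omnibus_f(scores, trait)
print(scores.shape[1], "PCs retained; F =", round(float(f_stat), 2), "on", dof, "df")
```

Each column of `loadings` has one entry per SNP, so a significant omnibus test in this sketch points to the region as a whole and not to any particular SNP.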
While traditional PC analysis makes it possible to analyze only the subset of PCs that represent most of the variation in a candidate region, each PC is still a linear combination of all SNPs in the data set, which makes significant PCs difficult to interpret. Upon inspection of the 29 PCs from the full model found to be significantly associated with RA status, we found that the 822 eigenvector loadings on these PCs ranged from -0.148 to 0.145, with most close to 0. Thus, we were only able to infer from PC analysis that variation in the MHC region, as represented by these 822 SNPs, is strongly associated with RA risk. Identifying the specific SNP(s) driving significant associations between PCs and phenotypes would require testing all 822 SNPs individually for association. In contrast, the PC-based clustering algorithm we employed reduced the 822 SNPs to 188 discernible SNP clusters that also accounted for 80% of the regional variation. Because each cluster is a subset of the 822 SNPs analyzed, the clusters allow unique identification of the SNPs that may contribute to the evidence for association. For example, of the 24 SNP clusters associated with RA status, Cluster 1 and Cluster 23 were the most significant. Cluster 1 represents a distinct set of SNPs covering ~883 kb of the 3.2-Mb region examined, while Cluster 23 covers a non-overlapping region of ~295 kb. Cluster 1 represents SNPs flanking HLA-C and HLA-B, whereas Cluster 23 comprises SNPs surrounding the HLA-DRA, HLA-DRB5, and HLA-DRB1 loci. In fact, rs3099844 and rs2857595, both in Cluster 1, were previously identified by Lee et al. [5] as belonging to a haplotype associated with anti-CCP-positive RA, and 98% of cases in the present study were anti-CCP positive. Additionally, rs2395175 in Cluster 23 ranked among the top ten SNPs for association with RA in a recent genome-wide association study by Plenge et al. [8].
The clustering algorithm also identified 36 SNP clusters found to be associated with variation in RFUW among RA cases. The most significant associations included Clusters 2, 5, 20, 24, and 183. Clusters 2, 5, 24, and 183 are composed of SNPs located in the chromosomal region between HLA-A and HLA-C, with Clusters 2 and 5 capturing the specific variation in and around HLA-C. Interestingly, Yen et al. demonstrated that HLA-C alleles may modulate the pattern of RA progression [10]. Moreover, Lee et al. found rs887464 in Cluster 183 to be associated with RA affection [5]. Cluster 20, composed of nine SNPs, represents variants located within and proximal to HLA-DQB2. Previous examination of genes in the MHC class II region, conditional on the HLA-DRB loci, has shown the HLA-DQB2 locus to have a vital role in RA [11, 12]. As RA is heterogeneous in terms of the progression of joint destruction [13], further examination of the SNPs in these clusters may provide information regarding genetic determinants of RA progression or symptom severity.
While our PC-based clustering method offers the interpretability that a traditional PC approach lacks, there are other issues to be considered. First, we required more clusters than PCs to satisfy the 80% explained-variance threshold, which increased the degrees of freedom used in the omnibus test of association. The additional degrees of freedom usually result in reduced power to detect global association compared with the traditional PC approach. This may be because, while PCs are orthogonal (mutually uncorrelated), the cluster components formed by the clustering algorithm are oblique. At each iteration, PC1 and PC2 are computed from the distinct set of SNPs assigned to a given cluster, so the first PC of one cluster may be correlated with the first PC of another cluster. Thus, although each SNP is assigned to the cluster with which it has the highest squared correlation, every SNP shares some degree of correlation with the clusters to which it was not assigned. This underlying correlation among clusters may be indicative of the correlation pattern among SNPs, although not necessarily of haplotype blocks, and may thus better reflect the true relationship among the variants within the MHC candidate region, but it may also result in slightly reduced power to detect association.
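The oblique nature of cluster components can be seen in a small simulation. The sketch below is a simplified illustration and not the clustering algorithm as implemented in this study: it takes a fixed partition of SNPs (omitting the iterative splitting on PC1 and PC2), computes one component per cluster as the first PC of that cluster's SNPs, and assigns each SNP to the cluster with which it has the highest squared correlation. The function names and the simulated continuous data, which stand in for standardized genotypes, are assumptions.

```python
import numpy as np

def first_pc(block):
    """Scores on the first principal component of a block of (centred) columns."""
    centred = block - block.mean(axis=0)
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    return u[:, 0] * s[0]

def cluster_components(z, labels):
    """One component per cluster: the first PC of the SNPs assigned to it."""
    ids = np.unique(labels)
    comps = np.column_stack([first_pc(z[:, labels == c]) for c in ids])
    return comps, ids

def squared_correlations(z, comps):
    """Matrix of squared correlations, SNPs in rows and cluster components in columns."""
    p = z.shape[1]
    return np.corrcoef(z.T, comps.T)[:p, p:] ** 2

def simulate_two_blocks(n=500, per_block=6, seed=0):
    """Hypothetical data: two SNP blocks driven by correlated latent factors."""
    rng = np.random.default_rng(seed)
    f1 = rng.normal(size=n)
    f2 = 0.6 * f1 + 0.8 * rng.normal(size=n)
    block_a = f1[:, None] + 0.5 * rng.normal(size=(n, per_block))
    block_b = f2[:, None] + 0.5 * rng.normal(size=(n, per_block))
    return np.hstack([block_a, block_b]), np.repeat([0, 1], per_block)

z, labels = simulate_two_blocks()
comps, ids = cluster_components(z, labels)
r2 = squared_correlations(z, comps)
# Each SNP goes to the cluster with the highest squared correlation.
new_labels = ids[np.argmax(r2, axis=1)]
# Unlike ordinary PCs, the two cluster components are correlated with each other.
between = np.corrcoef(comps.T)[0, 1]
print("assignments unchanged:", np.array_equal(new_labels, labels))
print("correlation between cluster components:", round(float(between), 2))
```

In this simulation every SNP stays in its original cluster, yet its squared correlation with the other cluster's component is well above zero, which mirrors the residual between-cluster correlation discussed above.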