The SFVT approach is a novel method for analyzing the association of highly polymorphic proteins with any phenotypic characteristic. It was specifically designed to facilitate the analysis of disease association with HLA gene products in the human MHC, but could be applied to any analogous situation, such as the association of viral pathogen sequence variation and measures of virulence. In developing the approach, we have created a catalog of relevant sequence features and their variant types for all HLA class I and class II proteins, enabling the efficient mapping of information obtained in traditional high-resolution HLA typing studies to known structural and functional elements of the protein. The current compendium is completely extensible, as new sequence features can easily be added without changing the current list, and has the potential to be easily automated so that existing data sets can be re-analyzed quickly.
There are two major advantages of this approach to association analysis. First, it focuses on molecular variants that are likely to have biological importance: structural features of the three-dimensional protein, functional domains and combinations of the two. This information is not explicitly utilized when information is only analyzed at the level of the whole allele. Analysis of individual polymorphic amino acids is similar to the SFVT in the extreme case where each amino acid is considered as an SF. However, this does not take into account any information that puts each residue into structural and functional context. In this regard, Salamon et al. used set covering computation to describe all the possible unique combinations of amino acids in HLA DQA1-DQB1 proteins distinguishing predisposition to Type 1 diabetes (12). While this method is unbiased to the extreme, it is far more computationally intensive than the SFVT method and does not take advantage of biological knowledge relating to protein structure and function.
Second, the SFVT method can increase the statistical power of smaller data sets. In a given case–control cohort, rare disease-associated alleles may not show statistical significance. However, these alleles may harbor causative SFVTs in common with other associated alleles. Grouping alleles that share such an SFVT increases the effective sample size and decreases the degrees of freedom, leading to increased statistical power when such cohorts are analyzed using this approach.
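The pooling argument can be illustrated with a minimal sketch. The counts below are hypothetical toy numbers, not data from the SSc cohort, and the function name `fisher_exact_greater` is introduced here only for illustration. The sketch assumes that no individual carries both alleles, so carrier counts can simply be added, and it uses a one-sided Fisher exact test as a stand-in for whatever test a given study would actually apply.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher exact P-value for enrichment of carriers among cases.

    2x2 table layout:
        a = cases carrying the variant,    b = cases not carrying it
        c = controls carrying the variant, d = controls not carrying it
    """
    cases, controls, carriers = a + b, c + d, a + c
    total = cases + controls
    denom = comb(total, carriers)
    upper = min(cases, carriers)
    # Sum hypergeometric probabilities of tables at least as extreme as observed.
    tail = sum(
        comb(cases, k) * comb(controls, carriers - k)
        for k in range(a, upper + 1)
    )
    return tail / denom


# Hypothetical cohort: 100 cases and 100 controls (toy numbers only).
N_CASES, N_CONTROLS = 100, 100

# Two rare alleles assumed to share the same variant type: (case carriers, control carriers).
allele_x = (6, 1)
allele_y = (5, 1)

p_x = fisher_exact_greater(allele_x[0], N_CASES - allele_x[0],
                           allele_x[1], N_CONTROLS - allele_x[1])
p_y = fisher_exact_greater(allele_y[0], N_CASES - allele_y[0],
                           allele_y[1], N_CONTROLS - allele_y[1])

# Pool carriers of either allele (assumes no individual carries both).
pooled_cases = allele_x[0] + allele_y[0]
pooled_controls = allele_x[1] + allele_y[1]
p_pooled = fisher_exact_greater(pooled_cases, N_CASES - pooled_cases,
                                pooled_controls, N_CONTROLS - pooled_controls)

print(f"allele X alone: P = {p_x:.4f}")
print(f"allele Y alone: P = {p_y:.4f}")
print(f"pooled by shared variant type: P = {p_pooled:.4f}")
```

With these toy counts neither allele reaches P < 0.05 on its own, whereas the pooled variant type does. The example says nothing about effect sizes in real cohorts; it only shows why grouping alleles by a shared variant type can recover an association that rare alleles cannot show individually.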
The analysis of the SSc data set illustrates the applicability of this method. By including the entire polypeptide sequence as one of the sequence features, this analysis confirms the positive association seen with HLA-DRB1*1104 (previously DR5) in Caucasian and Hispanic SSc patients. This allele contains the amino acid sequence 26F, 28D, 70D and 78Y in the peptide-binding pocket 4. These residues are also seen in HLA-DRB1*0804, which is associated with SSc in African Americans (29). Sequence variation in peptide antigen binding pockets 4 and 7 has been shown to greatly influence the type of peptides presented by a particular DRB1 allele, e.g. myelin basic protein bound by HLA-DRB1*1501 versus DRB1*0401 (32). Intriguingly, there is a significant difference in risk between these two alleles in our analysis of SSc as well.
Only three of the seven alleles in Table showed significant disease associations at the P < 0.05 level, and yet many of them carry an SFVT strongly associated with disease. For example, DRB1*0804 (primarily an African allele) carries the same sequence at this composite sequence feature as DRB1*1104 (a Caucasoid allele) and yet does not demonstrate a significant P-value on its own due to the relatively small sample size. However, when tested as an allele in the African-American SSc patients versus race-matched controls, it was strongly associated with disease, similar to DRB1*1104 in European-derived Caucasians and Hispanics (29). When the association is tested on the basis of the composite sequence feature, then DRB1*0804 and DRB1*1104 individuals are combined, yielding a highly significant result. Thus the SFVT approach identified stronger associations with sub-regions of the protein than with entire alleles, although ethnicity also needs to be considered.
Distinguishing between the effects on disease of specific amino acids and SFs is not simple, due to the well-known complex LD patterns of the polymorphic amino acids in the HLA classical genes (31, and G. Thomson et al., in preparation). Conditional analyses can help to disentangle causative effects from those due to LD with a causative agent, but these are complex, as many different comparisons must be made. In this data set, a preliminary conditional analysis shows that while residue 30 appears in many sequence features associated with the disease, its contribution appears to be by virtue of LD with residues 26 and 28, not due to an independent effect. For amino acids with very high LD it may be impossible to distinguish the causative amino acid(s). Cross-ethnic studies can aid in such cases if there is a sufficiently different LD pattern, but for some amino acids this is not the case.
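To make the LD problem concrete, the following minimal sketch computes the standard pairwise r² statistic between two residue-level variants from a list of haplotypes. The haplotypes and the function name `ld_r2` are hypothetical and chosen only for illustration; this is not the conditional analysis used in the study, only a way to see why two positions in complete LD cannot be separated by association testing alone.

```python
def ld_r2(haplotypes, pos_a, res_a, pos_b, res_b):
    """r^2 between 'pos_a carries res_a' and 'pos_b carries res_b'.

    haplotypes: list of dicts mapping residue position -> amino acid.
    """
    n = len(haplotypes)
    p_a = sum(h[pos_a] == res_a for h in haplotypes) / n
    p_b = sum(h[pos_b] == res_b for h in haplotypes) / n
    p_ab = sum(h[pos_a] == res_a and h[pos_b] == res_b for h in haplotypes) / n
    d = p_ab - p_a * p_b  # classical disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a residue is fixed or absent")
    return d * d / denom


# Hypothetical haplotypes (toy data): the residue at 30 always travels with the residue at 26.
complete_ld = [{26: "F", 30: "Y"}] * 4 + [{26: "L", 30: "C"}] * 4

# Hypothetical haplotypes in which the two positions vary independently.
no_ld = (
    [{26: "F", 30: "Y"}] * 2 + [{26: "F", 30: "C"}] * 2
    + [{26: "L", 30: "Y"}] * 2 + [{26: "L", 30: "C"}] * 2
)

r2_complete = ld_r2(complete_ld, 26, "F", 30, "Y")
r2_none = ld_r2(no_ld, 26, "F", 30, "Y")
print(r2_complete, r2_none)
```

When r² is at or near 1, every carrier of one variant also carries the other, so any association test gives the same result for both. A population with a different LD pattern, in which the two variants occur apart, is what makes a cross-ethnic comparison informative.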
These results suggest that not only is HLA-DRB1*1104 associated with disease risk for systemic sclerosis, but so are known structural and functional elements, especially beta strands 2 and 3, alpha helix 2 and peptide binding pockets 4 and 7. An inspection of the predicted structure of the high-risk allele shows several bulky aromatic amino acids protruding into the peptide-binding groove of risk alleles that could dramatically influence the repertoire of peptides bound by HLA-DRB1 (Fig. B). These data are consistent with the hypothesis that the ability of particular HLA class II alleles to bind and present auto-antigenic peptides determines the extent of T-cell help stimulated, which in turn determines the degree of auto-antibody production. This was suggested by the analysis of 28 Japanese SSc patients who were tested for the presence of anti-topoisomerase antibodies (19). The authors noted that both the presence and amount of anti-topoisomerase were associated with those HLA-DRB1 alleles containing the linear sequence 67. This sequence includes residues that make up peptide binding pockets 4 and 7. In our analysis, three of the four risk alleles contain this sequence. In addition, our analysis finds a more extensive and select group of localized amino acids that appear to be responsible for the underlying association. The analysis to determine if the same HLA-DRB1 SFVT association is found with auto-antibody presence in our cohort is ongoing. Moreover, there is strong linkage disequilibrium (LD) within HLA, and associations of SSc with HLA-DQB1*0301 and related alleles have been seen, particularly in patients who have anti-topoisomerase antibodies (24). We have completed the SFVT assignments for HLA-DQB1 and -DQA1 and are now analyzing them for association with SSc.
In conclusion, we have developed a novel approach for the analysis of HLA genetic associations in which the proteins are broken down into smaller sequence features. These sequence features can be large (e.g. complete protein domains) or small (e.g. single amino acids), they can be based on structural features (e.g. a particular beta strand) or functional features (e.g. a peptide binding region), they can be continuous or discontinuous with respect to the linear sequence, and they can overlap with each other. Once the sequence features have been defined, the corresponding variant types found in the population of HLA alleles are identified. This approach then allows for the independent analysis of disease association with any SFVT regardless of which HLA alleles carry the variation. To test the hypothesis that this approach will both provide stronger statistical correlations with disease and highlight the critical functional parts of the HLA molecule, we used it to analyze the correlations of HLA-DRB1 sequence features in a cohort of study participants with systemic sclerosis. We show that a sequence feature composed of specific amino acid residues in peptide binding pockets 4 and 7 of HLA-DRB1 (residues 26, 28, 30, 37, 67, 70, 71, 86) best explains the molecular determinant of HLA-DRB1*1104-associated disease risk for systemic sclerosis. Although this study is focused on the analysis of HLA, this approach can be applied in any circumstance in which associations between sequence polymorphisms and phenotypic characteristics are being investigated.
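The bookkeeping behind this summary can be sketched in a few lines. The example below is an illustration under stated assumptions, not the catalog or pipeline used in this study: the allele names, residues, feature names, and the functions `variant_types` and `assign_sfvts` are all hypothetical. It shows how a sequence feature, defined simply as a set of residue positions that may be discontinuous and may overlap with other features, partitions alleles into variant types.

```python
from collections import defaultdict

# Hypothetical toy data: allele name -> residue at selected positions.
# These are NOT real HLA sequences; they only illustrate the bookkeeping.
ALLELES = {
    "allele_A": {26: "F", 28: "D", 30: "Y", 70: "D"},
    "allele_B": {26: "F", 28: "D", 30: "Y", 70: "Q"},
    "allele_C": {26: "L", 28: "E", 30: "C", 70: "D"},
    "allele_D": {26: "F", 28: "D", 30: "H", 70: "D"},
}

# A sequence feature is a named tuple of positions (hypothetical names).
# Features may be single residues, discontinuous, and overlapping.
FEATURES = {
    "sf_single_26": (26,),
    "sf_pair_26_28": (26, 28),
    "sf_discontinuous": (26, 28, 70),
}


def variant_types(alleles, positions):
    """Group alleles by the residues they carry at the given positions."""
    groups = defaultdict(list)
    for name, residues in alleles.items():
        motif = "_".join(f"{pos}{residues[pos]}" for pos in positions)
        groups[motif].append(name)
    return dict(groups)


def assign_sfvts(alleles, features):
    """Return {feature name: {variant type: sorted allele names}}."""
    return {
        feature: {
            vt: sorted(names)
            for vt, names in variant_types(alleles, positions).items()
        }
        for feature, positions in features.items()
    }


sfvts = assign_sfvts(ALLELES, FEATURES)
for feature, types in sfvts.items():
    print(feature, types)
```

In this toy example, `allele_A` and `allele_D` differ at position 30 but share one variant type of the discontinuous feature, so they would be tested together for that feature while remaining distinct at the whole-allele level. Adding a new feature only requires a new entry in `FEATURES`; existing assignments are unaffected.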