We describe a novel approach to genetic association analyses with proteins sub-divided into biologically relevant smaller sequence features (SFs), and their variant types (VTs). SFVT analyses are particularly informative for study of highly polymorphic proteins such as the human leukocyte antigen (HLA), given the nature of its genetic variation: the high level of polymorphism, the pattern of amino acid variability, and that most HLA variation occurs at functionally important sites, as well as its known role in organ transplant rejection, autoimmune disease development and response to infection. Further, combinations of variable amino acid sites shared by several HLA alleles (shared epitopes) are most likely better descriptors of the actual causative genetic variants. In a cohort of systemic sclerosis patients/controls, SFVT analysis shows that a combination of SFs implicating specific amino acid residues in peptide binding pockets 4 and 7 of HLA-DRB1 explains much of the molecular determinant of risk.
Proteins of the major histocompatibility complex (MHC; HLA in humans) participate in wide-ranging immunological processes. The class I (HLA-A, -B and -C) and class II (HLA-DR, -DQ and -DP) molecules are widely expressed on hematopoietic and non-hematopoietic cell surfaces, and function to bind short peptides in such a way that the combination of peptide and MHC are recognized by clonotypic T-cell receptors, resulting in T-cell activation. Class I molecules also function as the ligands for natural killer (NK) receptors.
MHC molecules are extremely polymorphic. The extensive polymorphism of HLA sequences reflects the need for the presentation of diverse repertoires of peptides for effective immune surveillance. The IMGT/HLA Database (1) (www.ebi.ac.uk/imgt/hla/) describes over 2000 class I and 1000 class II alleles. The majority of the protein sequence variation occurs within discrete areas that are involved in peptide binding and T-cell receptor interaction. Other variations may affect interactions with the accessory proteins CD4 and CD8, or NK receptors.
This allelic variation in the ability of different MHC molecules to bind peptides and activate effector cells in the immune system underlies their association with infection, autoimmune disease, drug sensitivity and tissue transplantation success. In model systems, the peptide epitopes derived from infectious agents such as human immunodeficiency virus, Epstein-Barr virus and others have been elucidated, as have the specific MHC residues involved in their binding/presentation (2–5). In the case of autoimmune diseases, while many allelic associations are clear—e.g. HLA-B*2705/2/4/7 and ankylosing spondylitis (6,7), HLA-DRB1*0401/*0404/*0405 and rheumatoid arthritis (8,9), HLA-DRB1*0301 HLA-DQA1*0501 DQB1*0201 and DRB1*0401/2/4/5 DQA1*0301 HLA-DQB1*0302 and type 1 diabetes (10,11)—the actual antigenic peptides responsible for these associations are not. However, knowing the nature of the critical MHC amino acid residues involved can allow reasonable predictions about peptide epitopes. Such predictions are important for the design of novel vaccines and the understanding of autoimmunity.
Typically, the significant association of a given normal or pathologic immune response with one or more HLA alleles or haplotypes is based on statistical analysis followed by a manual inspection of linear sequence alignments with the goal of identification of those amino acid residues that occur more commonly in individuals with the given response. More specific analytic and computational approaches have been developed to efficiently identify combinations of amino acids that may be causative in differential disease risk (12–14). These approaches, however, do not explicitly take into account biological information about the MHC molecule under study.
Here we describe a novel method for the analysis of MHC/disease associations that additionally incorporates structural and functional information about the HLA molecules (antigenic peptide binding, TcR binding, etc.) to help illuminate the biological nature of disease associations based on variations in these functional ‘sequence features’ to augment allele-based association analyses. Sequence features may be defined based on purely structural, e.g. ‘α-helical segment 1’ or functional characteristics, e.g. ‘peptide binding’, or a combination of both. Sequence features can be large (e.g. the entire HLA-DRB1 polypeptide) or small (e.g. the loop between beta-strands 1 and 2 of HLA-DRB1), overlapping and non-contiguous (e.g. the peptide antigen binding pocket 7 of HLA-DRB1). There are no restrictions on what protein sub-region can be labeled as a sequence feature. Variation of each sequence feature is then based on the known primary sequences of all alleles of a given HLA molecule. Since this sequence variation can be seen in multiple alleles, we have termed this the ‘variant type’ for a given sequence feature. In genetic terms, the sequence feature variant type (SFVT) can be thought of as an ‘allele’ comprising a haplotype of particular amino acids within a single protein. A single allele of a particular HLA molecule can then be represented as an SFVT feature vector in which the number of dimensions corresponds to the number of sequence features defined for the locus. The resulting association studies can be automated to produce information that is (i) based on experimentally determined structure–function relationships as well as allele and individual amino acid level variation and (ii) statistically informative due to the opportunity to combine groups of individuals rather than separate them by HLA allele.
We have applied this SFVT analysis to a cohort of patients with systemic sclerosis (scleroderma, SSc). SSc is an autoimmune disorder characterized by organ fibrosis (skin, lungs, heart, kidneys), vasculopathy and the production of autoantibodies to nuclear antigens such as centromeric proteins, topoisomerase I, RNA polymerase III and others. Over the past 20 years, several groups have shown associations between particular HLA class II alleles and the presence of disease (15–23). These include HLA-DR3, -DR11 and -DR7. Strong associations have also been seen when subsets of SSc patients are analyzed. For example, expression of anti-centromere antibody is associated with DQB1*0501 and other DQB1 alleles that lack leucine at position 26 (24,25). Anti-topoisomerase is associated with the DR-DQ haplotype DRB1*1104 DQB1*0301 (24,26,27), and anti-U3-RNP with the DRB1*1302 DQB1*0604 haplotype (28).
Using the SFVT approach, we have been able to confirm and extend the previous association between SSc and HLA-DRB1 at the allele level. Moreover, sequence features corresponding to antigenic peptide binding pockets that predispose to disease risk and protection were delineated. These data provide evidence that this novel method of analyzing MHC association data can identify the most likely molecular determinants of disease association for future functional study.
Sequence features for all classical HLA class I and II proteins were defined at the amino acid level. The procedure used to define sequence features is described in detail in the Materials and Methods; representative sequence features for HLA-DRB1 are shown in Table 1.
Four general categories of sequence features were identified—structural, functional, sequence alteration and combinational. Structural SFs include protein domains (e.g. Hsa_HLA-DRB1_beta 1 domain) and secondary structure motifs (e.g. Hsa_HLA-DRB1_beta-strand 2). Figure 1A shows functional SFs that include protein regions known to provide some important biological function (e.g. Hsa_HLA-DRB1_peptide binding; Hsa_HLA-DRB1_TCR binding). Sequence alteration SFs are simply those single amino acid positions in which sequence variation has been observed (e.g. Hsa_HLA-DRB1_variant position 67). Combinational SFs were constructed by identifying the intersection between SFs of the other types, for example those residues in alpha helix 2 that are involved in peptide binding (Hsa_HLA-DRB1_alpha-helix 2_peptide binding).
Over 2000 unique protein sequence features for the 10 classical HLA class I and II loci were defined (Table 2). A complete list of all currently defined sequence features can be found in Supplementary Material, Table S1. This list is not intended to be static. As new functions are discovered for HLA proteins, new sequence features can be defined and appended to the list without affecting the identity of those previously defined.
For each HLA locus, one common and well-documented allele was selected as a reference, and the specific amino acid residues found at each position in a given sequence feature captured as the definition of variant type #1 (VT1). For example, for HLA-DRB1 the DRB1*0101 allele was chosen as the reference. The amino acid sequence for the ‘Hsa_HLA-DRB1_beta-strand 2_peptide antigen binding’ sequence feature in HLA-DRB1*0101 allele is 26L_28E_30C, defining Hsa_HLA-DRB1_SF153_VT1 (Table 3). Since HLA-DRB1*0102 and HLA-DRB1*0103 have the same amino acid sequences at these positions, they are also VT1 for SF153. However, HLA-DRB1*0113 has different amino acids at positions 26 and 30 and thus defined the second variant type Hsa_HLA-DRB1_SF153_VT2 as 26F_28E_30L and so on. Using this strategy, detailed comparisons of sequence similarity can be described that are not apparent using the standard allele nomenclature. For example, while DRB1*0301 and *0304 share the same amino acid sequence, and thus are of the same variant type (VT3), for SF153, the *0302 and *0303 alleles are distinct (VT4); *0307 defined yet another variant type (VT5) for SF153; *0701 and *0703 are related to *0113 through SF153_VT2 sequence identity. The sequence relationships between all known HLA alleles are now explicitly described for each sub-region of the encoded proteins. Since each SF is treated individually, a comprehensive assessment of sequence similarity is captured when the variant types are compared. Finally, the approach also reduces the complexity seen when each allele is treated as an independent entity. For example, in the case of HLA-DRB1, the ~700 known alleles are collapsed into just 11 distinct variant types for SF153, in theory leading to increased statistical power in disease association studies.
The complete set of SF and SFVT definitions is being made available through three public database resources that are committed to their maintenance and consistency (www.immport.org, www.ebi.ac.uk/imgt/hla/, www.ncbi.nlm.nih.gov/gv/mhc/). One of the authors (S.G.E.M.) maintains the IMGT/MHC database of new HLA alleles as they are defined and will create the relevant variant type vectors during their regular database version release process for distribution to other database resources, including ImmPort and dbMHC, and use by other investigators.
In order to assess the utility of the SFVT approach for HLA disease associations, we analyzed a data set of HLA typing information for a cohort of 1300 subjects with systemic sclerosis together with 1000 healthy controls. Four-digit typing data for HLA-DRB1 was used to prepare the complete SFVT feature vector for all 181 DRB1 sequence features. Chi-square analysis was used to determine which sequence features and corresponding variant types demonstrated significant association with the presence or absence of disease (see Materials and Methods for details).
Table 4 shows the 21 SFVTs with the most significant association with the disease. [A complete list of SFVTs with significant association (corrected P < 0.05) can be found in Supplementary Material, Table S2.] In some cases, the amino acid sequences from significantly associated SFVTs were over-represented in the control cohort, suggesting that these sequences might have a protective effect (e.g. Hsa_HLA-DRB1_SF155_VT2), with odds ratios <1. In other cases, the specific SFVT was over-represented in the disease cohort, suggesting that these sequences increased the risk of developing disease (e.g. Hsa_HLA-DRB1_SF98_VT3), with odds ratios >1. Variant types from sequence features of a wide range of sizes were identified, ranging from the entire polypeptide sequence to single amino acids. The comprehensive SFVT analysis necessarily produces some overlapping results. For example, Hsa_HLA-DRB1_SF163, SF164, SF165 and SF21 all describe different sequence features within the same amino acid stretch (65–72). In this case, the SFs share the same set of variant positions among the alleles typed in the cohort, resulting in identical odds ratios and P-values. The sequence features highlighted are those with the smallest number of common amino acids and can thus be thought of as ‘tagging’ this family of features. Similarly, Hsa_HLA-DRB1_SF161, SF74 and SF15 all include residue 37, which appears to be the sole contributor of disease association for this family of features. Sequence feature Hsa_HLA-DRB1_SF91 describes the single residue 58, which is biallelic (A/E) in this sample, leading to variant types with the same P-value and reciprocal odds ratios.
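To make the odds ratios and P-values above concrete, the following is a minimal sketch of a test on a single 2×2 table (SFVT present or absent, in cases and controls). The function name and the counts are invented for illustration, and the sketch assumes an uncorrected Pearson chi-square with one degree of freedom; the study's actual procedure, including the correction for multiple comparisons, is the one described in Materials and Methods.

```python
import math

def sfvt_2x2_test(case_with, case_without, ctrl_with, ctrl_without):
    """Pearson chi-square (1 df, no continuity correction) and odds ratio
    for one variant type counted in cases and controls."""
    a, b, c, d = case_with, case_without, ctrl_with, ctrl_without
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    odds_ratio = (a * d) / (b * c)
    # For 1 df, the chi-square upper tail equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, odds_ratio, p_value

# Hypothetical counts (not taken from the study).
chi2, odds_ratio, p_value = sfvt_2x2_test(30, 70, 10, 90)
print(f"chi2={chi2:.2f}, OR={odds_ratio:.2f}, P={p_value:.2e}")
```

An odds ratio above 1 corresponds to a variant type over-represented in cases (risk), and one below 1 to a variant type over-represented in controls (protection). For a biallelic feature the two variant types give reciprocal odds ratios, because swapping the "with" and "without" columns inverts the ratio.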
There is evidence for dependencies in association between amino acid positions that would not be apparent in strategies that investigate single polymorphic amino acid associations. For example, the most significant SFVT found is Hsa_HLA-DRB1_SF155_VT2, consisting of residues 26 and 28 (P-value 4.18 × 10⁻¹¹). Interestingly, position 26 was found to be a phenylalanine (F) residue in both risk and protective alleles, whereas position 28 was preferentially glutamic acid (E) in protective alleles and aspartic acid (D) in risk alleles. However, SFVTs for position 28 by itself were not as strongly associated with disease, with a P-value (8.15 × 10⁻⁶) more than five orders of magnitude larger. This indicates a dependency on position 26 for the strong association found with differences at position 28 (see the conditional analysis below).
The entire protein sequence from each HLA-DRB1 allele can also be considered as an SFVT, in essence replicating the traditional method of association testing. Thus, included in the list of significantly associated SFVTs is the allele HLA-DRB1*1104, which was found to be over-represented in the disease cohort with an odds ratio of 2.88 and a P-value of 6.58 × 10⁻¹⁰. This confirms previous studies on this (29) and other SSc cohorts (15,17,18,30).
Fourteen amino acid positions were found repeatedly in significant SFVTs. We generated a set of temporary sequence features (tSF) corresponding to all combinations of these amino acid positions, and performed the SFVT statistical analysis. Two protective tSFVTs (tSF155_VT2, tSF517_VT4) and five risk-associated tSFVTs (tSF1501_VT9, tSF1557_VT10, tSF1592_VT9, tSF1612_VT10, tSF1498_VT12) showed extremely low corrected P-values (<10⁻¹¹), and were composed of overlapping amino acid groups. For example, tSF1557_VT10 is composed of amino acids 28, 70, 71 and 86 and tSF1498_VT12 is composed of 26, 67, 71 and 86. We assembled another tSF composed of the union of eight amino acid residues (26, 28, 30, 37, 67, 70, 71, 86) in the most highly significant sequence features. We examined the amino acid sequence of this tSF (SFVT 13036_14) for all alleles with either statistically significant association with SSc or extreme odds ratios. We found that risk alleles with odds ratios substantially >1 tended to have specific amino acids at each of these positions, which were different from those found in protective alleles with odds ratios substantially <1 (Table 5). For example, protective alleles tended to have an asparagine or phenylalanine at residue 37, whereas the risk alleles tended to have a tyrosine at position 37. Thus, protective alleles are characterized by the sequence 26F_28E_30Y/L_37N/F_67L/I_70Q/D_71K/R_86G, whereas risk alleles have the sequence 26F_28D_30Y_37Y_67F/I_70D_71R_86V.
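The enumeration of temporary sequence features can be sketched as follows. This is an illustration only: it assumes that "all combinations" means every non-empty subset of the chosen positions, the function name is hypothetical, and because the fourteen positions are not listed in the text, the eight positions of the composite feature are used as the example input.

```python
from itertools import combinations

def enumerate_temporary_features(positions):
    """Return every non-empty subset of the given amino acid positions,
    each as a sorted tuple (one candidate temporary sequence feature)."""
    ordered = sorted(positions)
    features = []
    for size in range(1, len(ordered) + 1):
        features.extend(combinations(ordered, size))
    return features

# Positions of the composite feature named in the text.
composite_positions = [26, 28, 30, 37, 67, 70, 71, 86]
temporary_features = enumerate_temporary_features(composite_positions)
print(len(temporary_features))
```

Under this assumption, n positions yield 2^n − 1 candidate features, so fourteen positions would give 16,383 subsets, each of which would then be assigned variant types and tested like any other sequence feature.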
Finally, the SFVT results can be used to identify amino acid residues for a conditional analysis to quantify the risk contributed by each residue. A full conditional analysis is beyond the scope of this paper, but a study of amino acids 26, 28 and 30 has been done (Table 6) as an illustration. The combination of 26 and 28 has a stronger association with disease than either residue individually (26F_28E is the highest ranked SFVT effect); in fact residue 26 by itself is not associated with disease (uncorrected overall P-values: 26 and 28 P < 1.30 × 10⁻⁹; 26 alone P = 0.11; 28 alone P < 2.22 × 10⁻⁵, Table 6). Residue 30 is individually very strongly associated with disease (uncorrected overall P < 3.58 × 10⁻⁵) and found in a number of the highest ranked SFs. Residue 26 has moderate LD with 28 and high LD with 28 and 30, while 28 and 30 have very high LD (31). A series of conditional haplotype method (CHM) analyses show that both 26 and 28 together influence disease risk differentiation compared with each amino acid individually. The association of these two residues with protection from disease was highly significant (26 and 28 conditioned on 26F, P < 6.47 × 10⁻¹¹; conditioned on 28E, P < 6.64 × 10⁻⁷). The combination 26F_28E is very significantly protective compared with the remaining homogeneous risk set of all other observed haplotypes. CHM analyses also show that residue 30 has little influence on this protective effect (26F_28E_30L versus 26F_28E_30Y, P = 0.12), nor does it alter the risk effect of any other haplotypes of 26 and 28 (Table 6).
The SFVT approach is a novel method for analyzing the association of highly polymorphic proteins with any phenotypic characteristic. It was specifically designed to facilitate the analysis of disease association with HLA gene products in the human MHC, but could be applied to any analogous situation, such as the association of viral pathogen sequence variation and measures of virulence. In developing the approach, we have created a catalog of relevant sequence features and their variant types for all HLA class I and class II proteins, enabling the efficient mapping of information obtained in traditional high-resolution HLA typing studies to known structural and functional elements of the protein. The current compendium is completely extensible, as new sequence features can easily be added without changing the current list, and has the potential to be easily automated so that existing data sets can be re-analyzed quickly.
There are two major advantages of this approach to association analysis. First, it focuses on molecular variants that are likely to have biological importance—structural features of the three-dimensional protein, functional domains and combinations of the two. This information is not explicitly utilized when information is only analyzed at the level of the whole allele. Analysis of individual polymorphic amino acids is similar to the SFVT in the extreme case where each amino acid is considered as an SF. However, this does not take into account any information that puts each residue into structural and functional context. In this regard, Salamon et al., used set covering computation to describe all the possible unique combinations of amino acids in HLA DQA1-DQB1 proteins distinguishing predisposition to Type 1 diabetes (12). While this method is unbiased to the extreme, it is far more computationally intensive than the SFVT method and does not take advantage of biological knowledge relating to protein structure and function.
Second, the SFVT method can increase the statistical power of smaller data sets. In a given case–control cohort, rare disease-associated alleles may not show statistical significance. However, these alleles may harbor causative SFVTs in common with other associated alleles, thus allowing appropriate allele grouping and thereby increasing the sample size and decreasing degrees of freedom, leading to increased statistical power when such cohorts are analyzed using this approach.
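The grouping step behind this gain in power can be sketched as a simple tally. The allele names, variant type labels and counts below are placeholders invented for illustration, and the helper name is hypothetical; the point is only that alleles carrying the same variant type contribute to one shared count.

```python
from collections import Counter

def pool_counts_by_variant_type(allele_counts, allele_to_vt):
    """Sum per-allele counts into per-variant-type counts for one sequence feature."""
    pooled = Counter()
    for allele, count in allele_counts.items():
        pooled[allele_to_vt[allele]] += count
    return pooled

# Placeholder data: two rare alleles share a variant type with a common allele.
allele_to_vt = {
    "common_allele": "VT_a",
    "rare_allele_1": "VT_a",
    "rare_allele_2": "VT_a",
    "other_allele": "VT_b",
}
case_counts = {"common_allele": 40, "rare_allele_1": 3, "rare_allele_2": 2, "other_allele": 55}
pooled_cases = pool_counts_by_variant_type(case_counts, allele_to_vt)
print(dict(pooled_cases))
```

Four allele categories collapse into two variant type categories here, so the rare alleles are no longer tested in cells of two or three observations.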
The analysis of the SSc data set illustrates the applicability of this method. By including the entire polypeptide sequence as one of the sequence features, this analysis confirms the positive association seen with HLA-DRB1*1104 (previously DR5) in Caucasian and Hispanic SSc patients. This allele contains the amino acid sequence 26F, 28D, 70D and 78Y in the peptide-binding pocket 4. These residues are also seen in HLA-DRB1*0804 which is associated with SSc in African Americans (29). Sequence variation in peptide antigen binding pockets 4 and 7 has been shown to greatly influence the type of peptides presented by a particular DRB1 allele, e.g. myelin basic protein bound by HLA-DRB1*1501 versus DRB1*0401 (32). Intriguingly, there is a significant difference in risk between these two alleles in our analysis of SSc as well.
Only three of the seven alleles in Table 5 showed significant disease associations at the P < 0.05 level, and yet many of them carry an SFVT strongly associated with disease. For example, DRB1*0804 (primarily an African allele) carries the same sequence at this composite sequence feature as DRB1*1104 (a Caucasoid allele) and yet does not demonstrate a significant P-value on its own due to the relatively small sample size. However, when tested as an allele in the African-American SSc patients versus race-matched controls, it was strongly associated with disease, similar to DRB1*1104 in European-derived Caucasians and Hispanics (29). When the association is tested on the basis of the composite sequence feature, then DRB1*0804 and DRB1*1104 individuals are combined, yielding a highly significant result. Thus the SFVT approach identified stronger associations with sub-regions of the protein than with entire alleles although ethnicity also needs to be considered.
Distinguishing between the effects on disease of specific amino acids and SFs is not simple, due to the well-known complex LD patterns of the polymorphic amino acids in the HLA classical genes (31, and G. Thomson et al., in preparation). Conditional analyses can help to disentangle causative effects from those due to LD with a causative agent, but these are complex, as many different comparisons must be made. In this data set, a preliminary conditional analysis shows that while residue 30 appears in many sequence features associated with the disease, its contribution appears to be by virtue of LD with residues 26 and 28, not due to an independent effect. For amino acids with very high LD it may be impossible to distinguish the causative amino acid(s). Cross ethnic studies can aid in such cases if there is a sufficiently different LD pattern, but for some amino acids this is not the case.
These results suggest that not only is HLA-DRB1*1104 associated with disease risk for systemic sclerosis, but also known structural and functional elements, especially beta strands 2 and 3, alpha helix 2 and peptide binding pockets 4 and 7. An inspection of the predicted structure of the high-risk allele shows several bulky aromatic amino acids protruding into the peptide-binding groove of risk alleles that could dramatically influence the repertoire of peptides bound by HLA-DRB1 (Fig. 1B). These data are consistent with the hypothesis that the ability of particular HLA class II alleles to bind and present auto-antigenic peptides determines the extent of T-cell help stimulated, which in turn determines the degree of auto-antibody production. This was suggested by the analysis of 28 Japanese SSc patients who were tested for the presence of anti-topoisomerase antibodies (19). The authors noted that both the presence and amount of anti-topoisomerase was associated with those HLA-DRB1 alleles containing the linear sequence 67FLEDR71. This sequence includes residues that make up peptide binding pockets 4 and 7. In our analysis, three of the four risk alleles contain this sequence. But in addition, our analysis finds a more extensive and select group of localized amino acids that appear to be responsible for the underlying association. The analysis to determine if the same HLA-DRB1 SFVT association is found with auto-antibody presence in our cohort is ongoing. Moreover, there is strong linkage disequilibrium (LD) within HLA, and associations between HLA-DQB1*0301 and related alleles have been seen, particularly in patients who have anti-topoisomerase antibodies (24). We have completed the SFVT assignments for HLA-DQB1 and -DQA1 and are now analyzing them for association with SSc.
In conclusion, we have developed a novel approach for the analysis of HLA genetic associations in which the proteins are broken down into smaller sequence features. These sequence features can be large (e.g. complete protein domains) or small (e.g. single amino acids), they can be based on structural features (e.g. a particular beta strand) or functional features (e.g. a peptide binding region), they can be continuous or discontinuous with respect to the linear sequence, and they can overlap with each other. Once the sequence features have been defined, the corresponding variant types found in the population of HLA alleles are identified. This approach then allows for the independent analysis of disease association with any SFVT regardless of which HLA alleles carry the variation. In order to test the hypothesis that this approach will both provide stronger statistical correlations with disease and highlight the critical functional parts of the HLA molecule, we have used this approach to analyze the correlations of HLA-DRB1 sequence features in a cohort of study participants with systemic sclerosis and show that a sequence feature composed of specific amino acid residues in peptide binding pockets 4 and 7 of HLA-DRB1 (residues 26, 28, 30, 37, 67, 70, 71, 86) best explains the molecular determinant of HLA-DRB1*1104 associated disease risk for systemic sclerosis. Although this study is focused on the analysis of HLA, this approach can be applied in any circumstance in which associations between sequence polymorphisms and phenotypic characteristics are being investigated.
A protein sequence feature is any part of a protein that is expected to have some biological or experimental relevance and is defined by a specified combination of amino acid positions encoded relative to a consensus allele. There are virtually no restrictions to what constitutes a sequence feature. Sequence features may be defined based on purely structural characteristics (e.g. ‘α-helical segment 1’), functional characteristics (e.g. ‘peptide binding’) or a combination of both, and can vary in size from the entire protein to a single amino acid residue. Sequence features can be contiguous or discontinuous in the primary amino acid sequence. Each sequence feature is defined to be unique, though they can overlap with each other either partially or completely. The list of sequence features is expected to evolve; new sequence features can be added to the list as needed at any point in the future without affecting previously defined sequence features.
To define parts of each HLA protein that might correspond to the molecular determinants of disease association, four general categories of sequence features were proposed—structural, functional, sequence alterations and combinational. Structural sequence features are defined purely based on secondary and tertiary structural elements of the folded protein (e.g. beta strand 1). Functional sequence features are defined based on experimental evidence that specific amino acid positions play a defined role in a specific functional property of the protein (e.g. antigenic peptide binding). Sequence alteration sequence features are defined based on variations detected in the human population (e.g. amino acid position 67 in HLA-DRB1). Combinational sequence features combine structural and functional characteristics through their intersection as detailed below (e.g. peptide binding positions in beta strand 1). By convention, protein positions are numbered starting at the N-terminal amino acid of the mature HLA protein; leader sequences are numbered in the opposite direction with negative integers. Each sequence feature is given a descriptive name that reflects the categorical types, but is actually defined as the string of the specific amino acid positions that constitutes the sequence feature. In some cases, sequence features have been generated that correspond to the same amino acid position string (e.g. when a combinational sequence feature generates a single amino acid variant position). In those cases, the descriptive names are considered to be synonyms of the same sequence feature as defined by the amino acid string.
HLA protein structural features were defined based on two sources of information. Secondary structure sequence features were derived from annotations in relevant records of HLA 3D protein structures in the Protein Data Bank (www.pdb.org). Domain definitions, including transmembrane and cytoplasmic regions, were derived from motif information in the relevant UniProt record (www.uniprot.org). The specific HLA reference proteins and database records used are shown in Supplementary Material, Table S3.
HLA protein functional sequence features were defined based on specific functional annotation in relevant IMGT/3Dstructure-DB database records (33) (http://imgt.cines.fr/3Dstructure-DB). These functional features are focused on amino acid contact residues known to be involved in mediating specific non-covalent interactions with antigenic peptides, T-cell receptors, the CD4 and CD8 co-receptors, KIR proteins, beta 2-microglobulin (for class I), class II beta chains (for class II alpha chains) and class II alpha chains (for class II beta chains). For some HLA molecules, the interacting residue positions were verified and augmented based on experimental evidence from the literature. For each classical HLA locus, the interacting residue positions for several HLA proteins are identified from the IMGT/3Dstructure-DB database records and the literature. These functionally relevant residue positions were then mapped to the co-ordinates of the reference HLA allele sequence for the locus (Supplementary Material, Table S3) using the allele alignment data from the IMGT/HLA database. Thus, for each locus the residue positions known to be involved in a specific interaction were identified from several allele sequences and thus would constitute a functional sequence feature. For such functional sequence features, specific Gene Ontology terms were incorporated into the name in order to facilitate interoperability with other data sources utilizing related GO annotations, e.g. ‘peptide antigen binding’: GO:0042605 (22,34).
HLA protein sequence alteration sequence features were defined from the multiple sequence alignments for each HLA locus using all allele sequences contained in Release 2.24.0 (16 January 2009) of the IMGT/HLA database (1). Every position at which more than one amino acid was observed across the aligned alleles was defined as a single amino acid variation subtype.
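A minimal sketch of this scan over an alignment is shown below. It assumes equal-length aligned sequences indexed from position 1 and uses ‘*’ as a placeholder for an unknown residue; the function name and the three-residue toy alignment are hypothetical and are not IMGT/HLA data.

```python
def variant_positions(aligned):
    """Return the 1-based positions at which more than one residue is observed.

    aligned maps allele name -> aligned sequence; '*' marks an unknown residue
    and is ignored when deciding whether a position varies.
    """
    length = len(next(iter(aligned.values())))
    result = []
    for idx in range(length):
        residues = {seq[idx] for seq in aligned.values() if seq[idx] != "*"}
        if len(residues) > 1:
            result.append(idx + 1)
    return result

# Toy alignment for illustration only.
toy_alignment = {"allele_1": "LEC", "allele_2": "LEC", "allele_3": "FEL", "allele_4": "L*C"}
print(variant_positions(toy_alignment))
```

Each returned position would become one single-residue sequence feature for the locus.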
HLA combinational sequence features were defined by determining those amino acids that correspond to the overlap between all structural sequence features with all functional sequence features, i.e. the intersection of the positional definitions.
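The intersection can be sketched with plain set operations. In the example below the position lists for the structural and functional features are hypothetical stand-ins chosen so that the overlap is positions 26 and 28; the real definitions are those in Supplementary Material, Table S1. Results are keyed by the position tuple so that descriptive names resolving to the same positions end up grouped as synonyms.

```python
def combinational_features(structural, functional):
    """Intersect every structural feature with every functional feature.

    Both arguments map a descriptive name to an iterable of positions.
    Returns a dict: sorted position tuple -> list of combined names.
    Empty intersections are skipped.
    """
    combos = {}
    for s_name, s_positions in structural.items():
        for f_name, f_positions in functional.items():
            shared = tuple(sorted(set(s_positions) & set(f_positions)))
            if shared:
                combos.setdefault(shared, []).append(f"{s_name}_{f_name}")
    return combos

# Hypothetical position sets for illustration.
structural = {"beta-strand 2": range(23, 33), "alpha-helix 2": range(60, 80)}
functional = {"peptide antigen binding pocket 4": (13, 26, 28, 70, 71, 74, 78)}
combos = combinational_features(structural, functional)
for positions, names in combos.items():
    print(positions, names)
```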
In addition to the sequence feature descriptive name, which reflects the structural and functional characterization, and the definitions based on the amino acid position string, each sequence feature is annotated with a unique identifier that comprises the following information delimited by an underscore: the species name three-letter abbreviation, the locus name, sequence feature label in the form of ‘SF’ followed by a number that uniquely refers to the sequence feature for the given HLA locus. An example of a sequence feature ID would be Hsa_HLA-DRB1_SF155 for the beta-strand 2 positions involved in peptide binding at pocket 4 (the sequence feature descriptive name is Hsa_HLA-DRB1_beta-strand 2_peptide antigen binding pocket 4) defined by amino acid positions 26, 28 of HLA-DRB1. A complete list of all sequence features defined for all classical HLA class I and II proteins is provided in Supplementary Material, Table S1, and can be found at www.immport.org, www.ebi.ac.uk/imgt/hla/ and www.ncbi.nlm.nih.gov/gv/mhc/.
Once the sequence features were defined for each HLA protein, all of the SFVTs were determined. To define the different variant types, multiple sequence alignment data were obtained from the IMGT/HLA Database for each locus. For each sequence feature, the motif sequence of each allele was determined by combining the amino acid residues from the allele sequence at the amino acid positions defined by the sequence feature. The unique sequences were then identified and each was assigned a variant type (VT) ID, which comprises the following information delimited by an underscore: the species name three-letter abbreviation, the locus name, the sequence feature label and the variant type label in the form of ‘VT’ followed by a number that uniquely refers to the variant type for the sequence feature. VT1 refers to the motif sequence contained in the selected reference allele. An example of a unique SFVT would be Hsa_HLA-DRB1_SF155_VT2, which refers to the unique motif sequence type Phe, Glu at positions 26 and 28. Any allele that includes the same motif sequence is assigned the same variant type designation for that sequence feature. As with the sequence features themselves, the numbering order of variant types does not imply any biologically relevant relationships, but rather reflects the historical nature of the HLA nomenclature at a given locus. Alleles whose sequence information is unknown at any of the sequence feature amino acid positions are designated as carrying the ‘unknown’ variant type for the sequence feature.
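The variant type assignment can be illustrated with a short sketch. It assumes aligned sequences given as strings indexed from position 1, `*` for unknown residues, a reference allele with complete sequence, and VT numbers handed out in order of first appearance after the reference; the sequence feature ID, the allele names and the `_unknown` suffix are hypothetical.

```python
from typing import Dict, List


def assign_variant_types(
    alignment: Dict[str, str],
    positions: List[int],
    reference: str,
    sf_id: str,
    unknown: str = "*",
) -> Dict[str, str]:
    """Map each allele to a variant type ID for one sequence feature."""

    def motif(seq: str) -> str:
        # Concatenate the residues at the feature's 1-based positions.
        return "".join(seq[p - 1] for p in positions)

    vt_numbers = {motif(alignment[reference]): 1}  # VT1 is the reference motif
    assigned = {}
    for allele, seq in alignment.items():
        m = motif(seq)
        if unknown in m:
            assigned[allele] = f"{sf_id}_unknown"
            continue
        if m not in vt_numbers:
            vt_numbers[m] = len(vt_numbers) + 1
        assigned[allele] = f"{sf_id}_VT{vt_numbers[m]}"
    return assigned


# Toy data (hypothetical alleles, sequences and feature ID).
toy_alleles = {"ref": "AFCEG", "a2": "AFCEG", "a3": "ALCDG", "a4": "A*CEG"}
toy_vts = assign_variant_types(toy_alleles, [2, 4], "ref", "Hsa_HLA-X_SF1")
print(toy_vts)
```

Alleles `ref` and `a2` share the reference motif and receive VT1, `a3` receives VT2, and `a4` is flagged as unknown because one feature position is unsequenced.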
Once all of the variant types for each of the sequence features have been defined for a given allele, the traditional HLA allele can be represented as a SFVT feature vector in n-dimensional space in which n corresponds to the number of sequence features defined for the HLA locus in question (e.g. 181 dimensions for HLA-DRB1). The SFVT feature vector can then be used to test for significant associations with human populations segregated based on interesting phenotypic characteristics using χ2 statistical analysis.
To test the utility of the SFVT approach as a way to rapidly isolate the molecular determinants of disease associations, we utilized molecular HLA typing data from a large cohort of 1300 systemic sclerosis patients and 1000 healthy controls that was assembled at the University of Texas Health Science Center at Houston. The characteristics of this study group (ethnicities, SSc clinical sub-phenotypes, autoantibody subgroups and standard HLA case–control analyses, including χ2 and exact logistic regression with appropriate corrections for multiple testing) have been fully described (29). The ethnic makeup of the population is shown in Supplementary Material, Table S4. For this analysis, all ethnic groups were analyzed together, and no multivariate analysis for clinical sub-types (e.g. ‘diffuse’ versus ‘limited’ systemic sclerosis or presence of particular autoantibodies) was undertaken in this first analysis. The individuals in the cohort were typed for HLA-DRB1, HLA-DQA1 and HLA-DQB1 at the 4-digit level of resolution. The quality control of the data ensured that the 4-digit HLA types of the individuals were defined in congruence with current nomenclature standards (35). All subjects gave informed consent to be in the cohort and UT Houston and UT Southwestern Institutional Review Boards have approved studies with their data.
For each subject in the data set, the allele designation at each HLA locus typed was transformed into the SFVT feature vector that had been determined for that allele. The resulting data set matrix of subjects and SFVT feature vectors was used in the association analysis.
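A minimal sketch of this transformation is given below. It assumes that each subject carries a list of typed alleles at one locus and that every allele copy contributes its own row to the matrix; the row layout, allele names and variant type labels are hypothetical.

```python
from typing import Dict, List, Tuple

Subject = Tuple[str, str, List[str]]  # (subject ID, case/control status, alleles)


def to_sfvt_matrix(
    subjects: List[Subject],
    allele_vectors: Dict[str, Dict[str, str]],
    features: List[str],
) -> List[Tuple[str, str, List[str]]]:
    """Replace each typed allele by its SFVT feature vector."""
    rows = []
    for subject_id, status, alleles in subjects:
        for allele in alleles:
            vector = allele_vectors[allele]
            # Keep a fixed feature order so that columns line up across rows.
            rows.append((subject_id, status, [vector[sf] for sf in features]))
    return rows


# Toy lookup table and subjects (hypothetical values).
toy_vectors = {
    "X*0101": {"SF1": "VT1", "SF2": "VT1"},
    "X*0201": {"SF1": "VT2", "SF2": "VT1"},
}
toy_subjects = [
    ("s1", "case", ["X*0101", "X*0201"]),
    ("s2", "control", ["X*0101", "X*0101"]),
]
toy_matrix = to_sfvt_matrix(toy_subjects, toy_vectors, ["SF1", "SF2"])
print(toy_matrix)
```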
The general goal of the SFVT approach is to analyze the SFVT transformed data set so that sequence features and SFVTs exhibiting significant association with disease are identified in a robust, unbiased manner. The association test can be carried out using a number of standard test approaches, e.g. χ2 heterogeneity testing, or a resampling procedure. The approach involves the assessment of a relatively large number of SFVTs. In this study, several strategies were employed to avoid false positive associations: the generation of pseudo-replicate data sets, initial filtering based on significant sequence features common to both data sets, and P-value adjustments to control for multiple hypothesis testing. The following steps were used for the analysis:
The systemic sclerosis data set matrix was sampled randomly and partitioned into two subsets containing equal numbers of cases and controls (650 SSc patients and 500 controls in each pseudo-replicate data set). The two data sets were then analyzed separately to identify which SFs and SFVTs showed a skewed distribution in each, in order to select the SFVTs to be used for a combined final analysis.
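The partitioning step could look like the sketch below, which shuffles cases and controls independently and cuts each list in half. The function name and the fixed seed are assumptions for illustration; the original random sampling procedure is not specified beyond the description above.

```python
import random
from typing import List, Sequence, Tuple


def split_pseudo_replicates(
    cases: Sequence[str], controls: Sequence[str], seed: int = 0
) -> Tuple[Tuple[List[str], List[str]], Tuple[List[str], List[str]]]:
    """Randomly split cases and controls into two halves of equal size."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    shuffled_cases, shuffled_controls = list(cases), list(controls)
    rng.shuffle(shuffled_cases)
    rng.shuffle(shuffled_controls)
    half_cases = len(shuffled_cases) // 2
    half_controls = len(shuffled_controls) // 2
    first = (shuffled_cases[:half_cases], shuffled_controls[:half_controls])
    second = (shuffled_cases[half_cases:], shuffled_controls[half_controls:])
    return first, second


# Placeholder subject IDs matching the cohort sizes given in the text.
case_ids = [f"case{i}" for i in range(1300)]
control_ids = [f"control{i}" for i in range(1000)]
replicate_1, replicate_2 = split_pseudo_replicates(case_ids, control_ids)
print(len(replicate_1[0]), len(replicate_1[1]), len(replicate_2[0]), len(replicate_2[1]))
```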
For every sequence feature, a 2 × n contingency table was constructed in which the occurrences of the n SFVTs in systemic sclerosis cases and controls were compared. The chi-square statistic χ2 was calculated with k = n−1 degrees of freedom. The P-value from the χ2 distribution then gives the probability of observing the given distribution of the variant types of the sequence feature among the cases and controls, or more extreme values. The χ2 test was performed on each sequence feature using SAS®9 (SAS Institute, Inc., Cary, NC). The number of pairwise comparisons possible for n SFVTs is n(n−1)/2, which equals k(k + 1)/2. The corrected P-value was obtained by multiplying the P-value from the 2 × n χ2 test by the number of pairwise comparisons for that sequence feature. The sequence features whose corrected P-values were less than or equal to 0.01 in both half data sets were selected for further analysis.
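The calculation can be sketched in plain Python as follows. This is not the SAS procedure used in the study: the function names are hypothetical, the upper-tail probability is computed from the standard recurrence for integer degrees of freedom, every variant type is assumed to be observed at least once, and capping the corrected P-value at 1 is an added convention.

```python
import math
from typing import List, Tuple


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    # Closed forms for df = 1 and df = 2 start the recurrence.
    if df % 2:
        p, k = math.erfc(math.sqrt(x / 2)), 1
    else:
        p, k = math.exp(-x / 2), 2
    # sf(k + 2) = sf(k) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1)
    while k < df:
        p += (x / 2) ** (k / 2) * math.exp(-x / 2) / math.gamma(k / 2 + 1)
        k += 2
    return min(p, 1.0)


def sequence_feature_test(
    case_counts: List[int], control_counts: List[int]
) -> Tuple[float, float, float]:
    """Return (chi-square statistic, raw P, corrected P) for a 2 x n table."""
    n = len(case_counts)
    rows = [case_counts, control_counts]
    row_totals = [sum(r) for r in rows]
    col_totals = [a + b for a, b in zip(case_counts, control_counts)]
    total = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(n):
            expected = row_totals[i] * col_totals[j] / total
            stat += (rows[i][j] - expected) ** 2 / expected
    p_raw = chi2_sf(stat, n - 1)          # k = n - 1 degrees of freedom
    comparisons = n * (n - 1) // 2        # pairwise comparisons among n SFVTs
    return stat, p_raw, min(1.0, p_raw * comparisons)


# Toy counts for a sequence feature with three variant types (invented numbers).
toy_stat, toy_p, toy_p_corrected = sequence_feature_test([30, 10, 10], [10, 30, 10])
print(toy_stat, toy_p, toy_p_corrected)
```

For the toy table, n = 3 gives 2 degrees of freedom and 3 pairwise comparisons, so the corrected P-value is three times the raw one.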
For each selected sequence feature, the χ2 analysis was used again in a series of 2 × 2 tables to test whether the distribution of a particular variant type in comparison with all other variant types (i.e. Type ‘X’ versus non-Type ‘X’) of the sequence feature is associated with disease. The χ2 statistic was thus calculated for each variant type of the sequence feature on the original complete data set. The odds ratio for each variant type was also calculated. The P-value of each variant type of the sequence feature was adjusted for multiple comparisons by multiplying it by the number of comparisons for that SF, which in this case was the number of variant types of the sequence feature minus 1. A complete list of SFs, SFVTs, P-values and odds ratios from this analysis is provided in Supplementary Material, Table S2.
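A single Type ‘X’ versus non-Type ‘X’ comparison can be sketched as below. The function name and counts are hypothetical, no continuity correction is applied (the text does not state whether one was used), all four cells are assumed to be non-zero, and the corrected P-value is capped at 1 as an added convention.

```python
import math
from typing import Tuple


def variant_type_test(
    case_vt: int, case_other: int, control_vt: int, control_other: int, n_variant_types: int
) -> Tuple[float, float, float]:
    """Return (chi-square statistic, corrected P, odds ratio) for a 2 x 2 table."""
    a, b, c, d = case_vt, case_other, control_vt, control_other
    total = a + b + c + d
    # Shortcut formula for the Pearson chi-square statistic of a 2 x 2 table.
    stat = total * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_raw = math.erfc(math.sqrt(stat / 2))  # chi-square upper tail with 1 df
    p_corrected = min(1.0, p_raw * (n_variant_types - 1))
    odds_ratio = (a * d) / (b * c)
    return stat, p_corrected, odds_ratio


# Toy counts for one variant type of a feature with three variant types.
vt_stat, vt_p_corrected, vt_odds_ratio = variant_type_test(30, 20, 10, 40, n_variant_types=3)
print(vt_stat, vt_p_corrected, vt_odds_ratio)
```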
From the list of significant SFVTs obtained, all variant positions, among alleles in the cohort, that occurred in two or more significant SFVTs were selected. In addition, those variant positions with unknown sequence information in any subset of alleles in the cohort were filtered out, giving rise to the following set of potentially relevant amino acid positions: 11, 13, 14, 25, 26, 28, 30, 37, 58, 67, 70, 71, 74, 86. Temporary sequence features (tSFs) of length 2–14 positions were defined for all possible combinations derived from the selected 14 positions. The variant types of the tSFs (tSFVTs) were identified, and χ2 tests were performed on the tSFs and tSFVTs as described earlier. The χ2 P-values of the tSFs were adjusted to control for differences in degrees of freedom between tSFs. The χ2 P-values of the tSFVTs were adjusted to control for multiple hypothesis testing. Among the tSFVTs with odds ratios greater than or equal to 1.5 or less than or equal to 0.67, those that do not completely overlap with any other tSFVTs of similar odds ratios are provided in Supplementary Material, Table S5, along with their P-values and odds ratios.
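The enumeration of temporary sequence features amounts to listing every subset of the 14 positions that contains at least two of them. A minimal sketch, with a hypothetical function name, is:

```python
from itertools import combinations
from typing import Iterator, List, Tuple

SELECTED_POSITIONS = [11, 13, 14, 25, 26, 28, 30, 37, 58, 67, 70, 71, 74, 86]


def temporary_sequence_features(
    positions: List[int], min_length: int = 2
) -> Iterator[Tuple[int, ...]]:
    """Yield every combination of positions of length min_length up to all positions."""
    for size in range(min_length, len(positions) + 1):
        yield from combinations(positions, size)


tsf_count = sum(1 for _ in temporary_sequence_features(SELECTED_POSITIONS))
print(tsf_count)
```

The count is 2^14 − 14 − 1 = 16369, i.e. all subsets minus the 14 single positions and the empty set, which indicates the scale of the multiple testing burden that the subsequent P-value adjustments address.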
The CHM (14,36) was applied to a subset of amino acids of interest to illustrate the next step in analysis of the data, i.e. determining which amino acids are most likely causative in disease risk, as distinct from associations due to LD with causative amino acids. In CHM analysis, heterogeneity testing is performed to determine if stratification by an additional amino acid of specific haplotype combinations at one or more other amino acid sites affects the disease risk. Significant effects imply that the additional amino acid either itself directly affects disease risk, or is in LD with another amino acid that does.
The structure of HLA-DR1 in complex with a peptide antigen derived from influenza hemagglutinin (PDB ID: 1fyt) (37) was used in the analyses in Figure 1. Space-filled models were produced using MBT Protein Workshop (38) (http://mbt.sdsc.edu/software/applications). Visualization of the amino acid residues involved in the composite sequence feature associated with risk to and protection from SSc was performed using SwissPdb Viewer DeepView v4.0.1 (39) (http://spdbv.vital-it.ch/). Residue changes were performed using the DeepView mutate function, and for each amino acid the rotamer with the lowest clash score and highest probability was chosen. (See http://spdbv.vital-it.ch/mutation_guide.html for a description of how these parameters are calculated.) Images from DeepView were rendered using POV-Ray raytracer (http://www.povray.org/).
Conflict of Interest statement. None declared.
This work was supported by the National Institutes of Health [contracts N01-AI40076 and N01-AR02251; grants P50-AR054144, UL1-RR024148 and UL1-RR024982].