Several large studies using standard multivariable modeling have established the importance of molecular matching at HLA-A, B, C, and DRB1 for the outcome of HCT [1-5]. It is estimated that, on average, every additional mismatch is associated with a 10% decrement in survival after adult unrelated donor transplantation for good risk patients [2]. It is equally clear, however, that many patients, particularly those from minority groups, lack matched unrelated donors [20], and suitable mismatched donors need to be identified to offer transplants to these patients. The effect of HLA mismatching on GvHD, relapse, and transplant related mortality (TRM) is mediated by amino acid substitutions, several of which can be found in most mismatched alleles. In this study we identified 33 amino acid substitution locations that are associated with survival at day 100 post-transplant. Three of these locations, 97, 116, and 156, were present in all three HLA class I loci. Substitution locations 9, 77, and 95 were present on HLA-A and HLA-C mismatched antigens or alleles. The remaining locations were identified on mismatched antigens or alleles of a single locus only: HLA-A 43, 62, 63, 76, 114, 152, 166, and 167; HLA-B 109; and HLA-C 6, 11, 14, 21, 66, 80, 99, 163, and 173. The majority of the important amino acid substitutions identified in this study as associated with survival to day 100 are located on the alpha 1 or alpha 2 domains of the peptide binding site, encoded by exons 2 and 3 respectively, and are predicted to directly affect T-cell allorecognition [21-23]. The most common HLA mismatches associated with these amino acids are HLA-A*02:01/02:05, 02:01/02:06, 03:01/03:02, 01:01/11:01, 02:01/68:01, and 24:02/24:03; HLA-B*35:01/35:03 and 35:01/35:08; and HLA-C*01:02/02:02, 04:01/16:01, 05:01/07:04, 14:02/15:02, 03:03/04:01, 07:01/12:03, 06:02/07:01, 01:02/03:03, 01:02/15:02, 03:04/07:02, and 02:02/15:02. The identification of amino acid substitutions that are associated with a higher than average risk of failure in HCT, the so-called non-permissive amino acid substitutions, represents a first step towards the ultimate goal of identifying acceptable mismatches that could be used in the clinical setting to select suitable mismatched unrelated donors for patients lacking HLA-identical donors. However, additional studies using different datasets, as well as functional studies, are necessary to confirm these findings before they are implemented clinically.
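As a bookkeeping aid, the substitution locations listed above can be arranged per locus and counted. The short Python sketch below only transcribes the positions given in the text; the variable and function names are illustrative and are not part of the study's analysis.

```python
# Positions transcribed from the text above; names are illustrative only.
SHARED_ALL = {97, 116, 156}   # reported at HLA-A, HLA-B and HLA-C
SHARED_A_C = {9, 77, 95}      # reported at HLA-A and HLA-C
LOCUS_ONLY = {
    "HLA-A": {43, 62, 63, 76, 114, 152, 166, 167},
    "HLA-B": {109},
    "HLA-C": {6, 11, 14, 21, 66, 80, 99, 163, 173},
}


def positions_by_locus():
    """Combine shared and locus-specific positions into one set per locus."""
    table = {locus: set(own) | SHARED_ALL for locus, own in LOCUS_ONLY.items()}
    for locus in ("HLA-A", "HLA-C"):
        table[locus] |= SHARED_A_C
    return table


POSITIONS = positions_by_locus()
# Each (locus, position) pair counts as one substitution location.
total = sum(len(positions) for positions in POSITIONS.values())

for locus in sorted(POSITIONS):
    print(locus, sorted(POSITIONS[locus]))
print("total locus-position pairs:", total)
```

Counting each locus-position pair separately gives 14 locations for HLA-A, 4 for HLA-B, and 15 for HLA-C, which matches the total of 33 stated in the text.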
Initial insights into the importance of specific amino acid substitutions were based on the identification of individual patients and the isolation of cytotoxic T-cell clones directed against HLA subtypes absent in the donor [8, 9, 24]. Using a large dataset, Ferrara and collaborators [10] reported in 2001 that substitutions at position 116 of class I molecules increase the risk of acute GvHD and TRM. However, they did not attempt to distinguish the effects of substitutions in HLA-A, HLA-B, or HLA-C [10]. Recently, Kawase and collaborators [11] reported non-permissive HLA mismatches associated with acute GvHD in HCT patients from the Japan Marrow Donor Program (JMDP). In contrast to our study, Kawase's study population was composed of recipients with heterogeneous diagnoses and disease stages, and of donor-recipient pairs with mismatches at multiple HLA loci [11]. They conducted a traditional multivariate analysis to evaluate the effect of HLA one-locus allele mismatch on acute GvHD while adjusting for clinical factors (disease, treatment, and patient-related predictors) as well as mismatch status at other loci [11]. They found 4 non-permissive mismatches in HLA-A, 1 in HLA-B, 7 in HLA-C, 1 in DRB1, 1 mismatch associated with DRB1-DQB1, and 2 in HLA-DPB1 [11]. A similar model was used to analyze the impact of each amino acid substitution type at each position separately. However, they did not adjust for the multiple amino acid substitutions that commonly occur within a single HLA mismatch [11]. They found 2 non-permissive amino acid substitutions at HLA-A, at positions 9 and 116, and 6 non-permissive amino acid substitutions at HLA-C, at positions 9, 77, 80, 99, 116, and 156 [11]. More recently, the same group published an analysis of HLA mismatches that predict relapse and that overlap minimally with the mismatches associated with acute GvHD [25]. Functional studies have also been reported [12, 13]; however, their results conflict with the reports of Ferrara [10] and Kawase [11] and include only a small number of cases.
Our analysis differed from Kawase's [11] in several ways. First, we used a different endpoint, namely death by day 100, and restricted our analysis to patients with good or intermediate risk leukemia. By focusing the analysis on a more restricted and hence more homogeneous study population, we hypothesized that we would reduce variability due to disease variables and increase the power to detect variables that predict GvHD. Second, we used a new statistical method, random forest analysis, which has not previously been applied in HCT but which has several advantages over more conventional analysis methods, as demonstrated by our results. Using random forest analysis, we confirmed all non-permissive amino acid substitutions identified by Kawase et al. [11] as well as the few amino acid substitutions reported by other investigators [8-10, 24]. Although random forest analysis does not validate the interpretation of substitutions as permissive versus non-permissive and does not provide a p-value, the fact that we were able to identify these previously reported non-permissive amino acid substitutions by random forest and not by traditional multivariate analysis in our dataset supports the observation in other fields that random forest provides greater data analytic power. Furthermore, in addition to the 8 amino acid substitutions identified by Kawase et al. [11], we identified another 25 that had similar or higher importance scores in the random forest analysis. Future studies in different patient populations are required to confirm the importance of these amino acid substitutions in HCT. However, for the patient who needs an HCT today from an HLA-mismatched donor, the evolving literature suggests that using a donor who is mismatched with the recipient at positions 116 or 156 at any of the HLA class I loci, at position 9 at HLA-A or HLA-C, or at position 99 at HLA-C may increase the risk of early death and other adverse outcomes.
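The positions highlighted in the last sentence above can be written as a simple lookup. The following Python sketch is illustrative only: the table `FLAGGED`, the function `flag_substitutions`, and the example donor-recipient substitutions are hypothetical and are not a validated donor selection rule.

```python
# Positions named in the text as possibly increasing risk, keyed by locus.
# 116 and 156 apply to every class I locus, 9 to HLA-A and HLA-C, 99 to HLA-C.
FLAGGED = {
    "HLA-A": {9, 116, 156},
    "HLA-B": {116, 156},
    "HLA-C": {9, 99, 116, 156},
}


def flag_substitutions(substitutions):
    """Return the (locus, position) pairs that fall on a flagged position.

    `substitutions` is an iterable of (locus, position) pairs describing where
    the donor and recipient alleles differ (a hypothetical input format).
    """
    return sorted(
        (locus, pos)
        for locus, pos in set(substitutions)
        if pos in FLAGGED.get(locus, set())
    )


# Hypothetical example: four substitutions, two of which are flagged.
example = [("HLA-A", 9), ("HLA-B", 9), ("HLA-C", 99), ("HLA-C", 80)]
print(flag_substitutions(example))
```

In this example, position 9 at HLA-B and position 80 at HLA-C are not returned, because the sentence above names position 9 only for HLA-A and HLA-C and does not name position 80.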
A number of limitations of this study should also be mentioned. Although there were some notable commonalities, the three separate analytic techniques we applied to the same dataset identified different sets of clinical variables and amino acid substitutions associated with survival at day 100, highlighting the need for independent validation in multiple datasets and using multiple approaches. Also, we chose survival at day 100 as our primary endpoint because it is objective and likely most closely associated with acute GvHD. However, further studies should be done to investigate amino acid substitutions that have their maximal association with other outcomes and to determine permissive amino acid substitutions. Our analysis identified associations between amino acid substitutions and survival at day 100, but we cannot confirm their biologic importance. Only well-designed functional studies will show whether the specific amino acid substitutions identified affect T-cell allorecognition or function, or whether they are markers for other critical factors causing increased mortality. Other biological factors that affect HLA amino acid mismatches and T-cell allorecognition in HCT, such as the shape of the T-cell receptor repertoire, were not investigated in this study. Finally, although most of these amino acid locations have been identified in other studies, we acknowledge that some of them may only be markers of a specific allele mismatch rather than truly important locations that affect survival.
In conclusion, using random forest to analyze the largest currently available dataset of HCTs, we were able to confirm 13 previously identified class I amino acid substitutions and to identify 20 novel class I amino acid substitutions that are predictors of survival at day 100. Random forest analysis presents a robust statistical methodology for analysis of HLA-mismatching and outcome studies, capable of identifying important amino acid substitutions missed by other methods. Based on these results, random forest analysis may prove an equally valuable tool to evaluate other transplant outcomes of interest.
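For readers unfamiliar with how a forest-based importance score can rank predictors, the following self-contained Python sketch may help. It is a toy under stated assumptions, not the analysis performed in this study: the data are synthetic, each tree is a depth-one stump, and importance is measured as the drop in accuracy after permuting one predictor, which may differ from the importance score used here. All names and parameter values are hypothetical.

```python
import random


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def fit_stump(rows, labels, candidates):
    """Choose the candidate binary feature with the lowest weighted Gini impurity."""
    overall = sum(labels) / len(labels)
    best = None
    for f in candidates:
        left = [y for x, y in zip(rows, labels) if x[f] == 0]
        right = [y for x, y in zip(rows, labels) if x[f] == 1]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[0]:
            p0 = sum(left) / len(left) if left else overall
            p1 = sum(right) / len(right) if right else overall
            best = (score, f, p0, p1)
    return best[1:]  # (feature, event rate when x=0, event rate when x=1)


def fit_forest(rows, labels, n_trees, mtry, rng):
    """Grow stumps on bootstrap samples, each seeing a random subset of features."""
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        candidates = rng.sample(range(n_features), mtry)
        forest.append(
            fit_stump([rows[i] for i in idx], [labels[i] for i in idx], candidates)
        )
    return forest


def predict(forest, x):
    """Average the stump event rates and threshold at 0.5."""
    prob = sum(p1 if x[f] == 1 else p0 for f, p0, p1 in forest) / len(forest)
    return 1 if prob >= 0.5 else 0


def accuracy(forest, rows, labels):
    return sum(predict(forest, x) == y for x, y in zip(rows, labels)) / len(rows)


def permutation_importance(forest, rows, labels, rng):
    """Accuracy lost when one feature column is shuffled across subjects."""
    base = accuracy(forest, rows, labels)
    scores = []
    for f in range(len(rows[0])):
        column = [x[f] for x in rows]
        rng.shuffle(column)
        permuted = [x[:f] + [v] + x[f + 1:] for x, v in zip(rows, column)]
        scores.append(base - accuracy(forest, permuted, labels))
    return scores


# Synthetic data: feature 0 drives the event, features 1 and 2 are pure noise.
rng = random.Random(0)
rows = [[rng.randint(0, 1) for _ in range(3)] for _ in range(600)]
labels = [1 if rng.random() < (0.9 if x[0] == 1 else 0.1) else 0 for x in rows]

forest = fit_forest(rows, labels, n_trees=100, mtry=2, rng=rng)
importance = permutation_importance(forest, rows, labels, rng)
print([round(score, 3) for score in importance])
```

In this toy setting the informative feature receives a clearly positive score while the noise features score near zero. Like the importance scores discussed above, such a score ranks predictors but does not supply a p-value.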