Acute lymphoblastic leukemia (ALL) is the most common cancer in children and the incidence of ALL varies by ethnicity. Although accumulating evidence indicates inherited predisposition to ALL, the genetic basis of ALL susceptibility in diverse ancestry has not been comprehensively examined.
We performed a multiethnic genome-wide association study in 1605 children with ALL and 6661 control subjects after adjusting for population structure, with validation in three replication series of 845 case subjects and 4316 control subjects. Association was tested by two-sided logistic regression.
A novel ALL susceptibility locus at 10p12.31-12.2 (BMI1-PIP4K2A, rs7088318, P = 1.1×10−11) was identified in the genome-wide association study, with independent replication in European Americans, African Americans, and Hispanic Americans (P = .001, .009, and .04, respectively). Association was also validated at four known ALL susceptibility loci: ARID5B, IKZF1, CEBPE, and CDKN2A/2B. Associations at ARID5B, IKZF1, and BMI1-PIP4K2A variants were consistent across ethnicity, with multiple independent signals at IKZF1 and BMI1-PIP4K2A loci. The frequency of ARID5B and BMI1-PIP4K2A variants differed by ethnicity, in parallel with ethnic differences in ALL incidence. Suggestive evidence for modifying effects of age on genetic predisposition to ALL was also observed. ARID5B, IKZF1, CEBPE, and BMI1-PIP4K2A variants cumulatively conferred strong predisposition to ALL, with children carrying six to eight copies of risk alleles at a ninefold (95% confidence interval = 6.9 to 11.8) higher ALL risk relative to those carrying zero to one risk allele at these four single nucleotide polymorphisms.
These findings indicate strong associations between inherited genetic variation and ALL susceptibility in children and shed new light on ALL molecular etiology in diverse ancestry.
Acute lymphoblastic leukemia (ALL) is the most common childhood malignancy and a leading cause of death due to disease in children (1,2). A genetic basis of ALL susceptibility is supported by its association with certain congenital abnormalities (3,4) and, more recently, by genome-wide association studies (GWASs) identifying common variants at ARID5B (10q21.2), IKZF1 (7p12.2), and CEBPE (14q11.2) influencing ALL risk in children of European descent (5–9). In fact, the disease risk associated with these common variants is among the strongest for cancer susceptibility variants identified through GWASs (10), consistent with a relatively large impact of inherited genetic factors on the pathogenesis of this childhood malignancy (11). However, the loci reported in ALL GWASs thus far cumulatively account for only 8% of genetic variation in ALL risk (11), suggesting that additional susceptibility variants remain to be discovered in larger studies.
There is an extreme lack of population diversity in GWASs: 96% of subjects studied in GWASs so far are individuals of European descent (12,13). This near-exclusive focus on a select few ethnic groups raises a number of critical questions. For example, are findings from European-only GWASs transferable to other populations (14,15)? Can disease etiology differ among populations and thus be characterized by distinct genetic risk factors (16)? What is the contribution of ancestry-related genetic variation to ethnic differences in cancer prevalence (17,18)? These issues are of particular relevance to childhood ALL because the incidence of ALL varies substantially by ethnicity (14.8 per million person-years in African Americans, 35.6 per million person-years in European Americans, and 40.9 per million person-years in Hispanic Americans) (19,20), a difference at least partly attributable to population differences in inherited genetic variations [eg, ARID5B (21,22)].
To identify novel ALL susceptibility loci and also to evaluate the associations of known susceptibility variants in diverse populations, we examined 709059 single nucleotide polymorphisms (SNPs) for association with childhood ALL in a multiethnic GWAS of 1605 case subjects and 6661 control subjects, followed by three independent replications in 845 case subjects and 4316 control subjects of European American, African American, and Hispanic American ethnicity.
Two nonoverlapping series of childhood ALL case subjects and control subjects were included: the GWAS series and the replication series. In the GWAS series, we included 1605 B-precursor childhood ALL case subjects enrolled on the Children’s Oncology Group (COG) P9904/P9905 protocols (23), and 6661 unrelated subjects from the Multi-Ethnic Study of Atherosclerosis (MESA) (dbGAP phs000209.v9) were considered as non-ALL control subjects because the prevalence of adult survivors of childhood ALL is less than 1 in 10 000 in the United States (5). The replication study included three case–control series separately by ethnicity (Supplementary Data, available online): European Americans: 574 case subjects and 2601 control subjects (24,25); African Americans: 128 case subjects and 1075 control subjects (26); Hispanic Americans: 143 case subjects and 640 control subjects (27). ALL case subjects in the replication series were from the St. Jude Total Therapy XIIIB/XV and the COG P9906 protocols (5). Ethnicity was defined by genetic ancestry as described below. ALL molecular subtypes included MLL rearrangements, ETV6-RUNX1, TCF3-PBX1, BCR-ABL1, and hyperdiploid. Patients included in the genetic association analyses represented 85.3% (n = 1605 of 1882) of total enrolled participants on the COG P9904/P9905 treatment protocols and 83.1% (n = 845 of 1017) of participants of the COG P9906 and St. Jude Total Therapy XIIIB/XV protocols (Supplementary Figure 1, available online).
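The cohort sizes quoted above can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: the dictionary layout and the helper name `pct` are our own, and the counts are copied from the text.

```python
# Replication series as (case subjects, control subjects), copied from the text.
REPLICATION = {
    "European American": (574, 2601),
    "African American": (128, 1075),
    "Hispanic American": (143, 640),
}

GWAS_CASES, GWAS_CONTROLS = 1605, 6661

# Totals across the three ethnicity-specific replication series.
rep_cases = sum(cases for cases, _ in REPLICATION.values())
rep_controls = sum(controls for _, controls in REPLICATION.values())


def pct(n: int, total: int) -> float:
    """Percentage of enrolled participants retained, rounded to one decimal."""
    return round(100 * n / total, 1)


gwas_retained = pct(GWAS_CASES, 1882)  # COG P9904/P9905 enrollment
rep_retained = pct(rep_cases, 1017)    # COG P9906 + St. Jude XIIIB/XV enrollment
```

The three replication series sum to 845 case subjects and 4316 control subjects, which matches the totals given in the introduction.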
Genotyping of ALL case subjects was performed by using the Affymetrix (Santa Clara, CA) Human SNP Array 6.0 (COG P9904/P9905, the GWAS series) or the Affymetrix GeneChip Human Mapping 500K Array (St. Jude Total Therapy XIIIB/XV and COG P9906, the replication series). Non-ALL control subjects in both the GWAS and replication series were genotyped using Affymetrix Human SNP Array 6.0. Genotype calls (coded as 0, 1, and 2 for AA, AB, and BB genotypes) were determined by the Birdseed (Affymetrix Human SNP Array 6.0) (28) or BRLMM (Affymetrix GeneChip Human Mapping 500K Array) algorithms (29). Samples for which genotype was ascertained at less than 95% of SNPs on the array were deemed to have failed and were excluded from the analyses (Supplementary Figure 1, available online). SNP quality control procedures were performed on the basis of call rate, minor allele frequency, and Hardy–Weinberg equilibrium, and 197541 of 906600 SNPs were excluded during GWAS quality control (Supplementary Figure 2 and Supplementary Data, available online).
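As a rough illustration of the quality control quantities named above, the following sketch computes a call rate and a minor allele frequency from genotypes coded 0, 1, and 2. It is not the pipeline used in the study. The function names are hypothetical, missing calls are assumed to be represented as `None`, and only the 95% sample call-rate cutoff comes from the text (SNP-level thresholds are given in the Supplementary Data and are not reproduced here).

```python
from typing import Optional, Sequence

# From the text: samples genotyped at less than 95% of SNPs were excluded.
SAMPLE_CALL_RATE_MIN = 0.95


def call_rate(genotypes: Sequence[Optional[int]]) -> float:
    """Fraction of non-missing calls; None marks a missing call (assumption)."""
    if not genotypes:
        raise ValueError("empty genotype vector")
    called = sum(1 for g in genotypes if g is not None)
    return called / len(genotypes)


def sample_passes(genotypes: Sequence[Optional[int]]) -> bool:
    """True if the sample meets the 95% call-rate requirement."""
    return call_rate(genotypes) >= SAMPLE_CALL_RATE_MIN


def minor_allele_frequency(genotypes: Sequence[Optional[int]]) -> float:
    """MAF for one SNP, with genotypes coded as the count of B alleles (0/1/2)."""
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called genotypes")
    freq_b = sum(called) / (2 * len(called))
    return min(freq_b, 1.0 - freq_b)


# SNP counts reported in the text.
SNPS_ON_ARRAY = 906_600
SNPS_EXCLUDED = 197_541
SNPS_ANALYZED = SNPS_ON_ARRAY - SNPS_EXCLUDED
```

Subtracting the excluded SNPs from the array total gives the 709059 SNPs carried into the association analysis.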
This study was approved by the respective institutional review boards, and informed consent was obtained from parents, guardians, or patients, as appropriate.
European, African, Asian, and Native American genetic ancestry was determined by using STRUCTURE (version 2.2.3) (30,31) with HapMap CEU, YRI, CHB/JPT, and indigenous Native Americans (32) as reference populations, respectively. European Americans, African Americans, and Asian Americans were defined as having more than 95% European genetic ancestry, more than 70% African ancestry, and more than 90% Asian ancestry, respectively. Hispanic Americans were individuals for whom Native American ancestry was more than 10% and greater than African ancestry; the remaining subjects were grouped as “Others” (Supplementary Figure 3, available online).
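The ancestry-based group definitions above amount to a small decision rule, sketched below. The function name and argument layout are hypothetical, and the inputs are assumed to be STRUCTURE-style ancestry proportions that sum to 1. Under that assumption the four rules are mutually exclusive, so the order in which they are checked does not change the result.

```python
def classify_ancestry(european: float, african: float,
                      asian: float, native_american: float) -> str:
    """Assign an ethnicity label from genetic ancestry proportions (0 to 1).

    Thresholds are those stated in the text; the label strings are our own.
    """
    if european > 0.95:
        return "European American"
    if african > 0.70:
        return "African American"
    if asian > 0.90:
        return "Asian American"
    # Hispanic Americans: Native American ancestry above 10% and
    # greater than African ancestry.
    if native_american > 0.10 and native_american > african:
        return "Hispanic American"
    return "Other"
```

For example, a subject with 50% European, 5% African, and 45% Native American ancestry would be labeled Hispanic American, whereas a subject with 80% European, 15% African, and 5% Native American ancestry would fall into the "Other" group.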
In the GWAS, the association between genotypes at each of 709059 SNPs and ALL susceptibility was tested by comparing the genotype frequency between ALL case subjects and control subjects in a logistic regression model, after adjusting for the top four principal components inferred by EIGENSTRAT (33) to control for population stratification (Supplementary Figure 4, available online). To validate associations at four susceptibility loci previously identified in populations of European descent [ARID5B (5,6), IKZF1 (5,6), CEBPE (6), and CDKN2A/2B (8)], we focused on variants within 600 kb of the top SNP at each locus and applied a statistical significance threshold that corrected for the number of SNPs tested at each locus (n = 174, 241, 104, and 145, respectively). To agnostically search for novel susceptibility variants by GWAS, we applied a genome-wide statistical significance cutoff of P less than or equal to 5×10−8 and sought to verify SNPs meeting this threshold in independent replication series.
In the replication studies, we tested six SNPs at the BMI1-PIP4K2A locus separately in European Americans, African Americans, and Hispanic Americans by logistic regression with genetic ancestries as covariates. SNPs with P less than .05 in the replication series were considered validated.
A logistic regression model was also used to determine the independent association of multiple SNPs within the same locus, to examine the cumulative effects of multiple susceptibility variants, and to evaluate the effects of susceptibility variants in different age groups. Association between PIP4K2A SNP genotype and gene expression was assessed by a linear regression model in HapMap CEU lymphoblastoid cell lines [GSE7851 (34)] and in diagnostic blasts from European American children with ALL from St. Jude Total Therapy XIIIB/XV protocols (35,36).
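To make the two significance criteria concrete, the sketch below derives a per-locus threshold from the number of SNPs tested and checks P values against the genome-wide cutoff. The text states only that the threshold "corrected for the number of SNPs tested"; the use of a Bonferroni correction at a nominal level of .05 is our assumption, and all names are hypothetical.

```python
ALPHA = 0.05               # nominal level (assumption; not stated in the text)
GENOME_WIDE_CUTOFF = 5e-8  # from the text

# Number of SNPs tested within 600 kb of the top SNP at each known locus.
SNPS_PER_LOCUS = {"ARID5B": 174, "IKZF1": 241, "CEBPE": 104, "CDKN2A/2B": 145}


def bonferroni_threshold(n_tests: int, alpha: float = ALPHA) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


# Locus-specific thresholds for validating the previously reported loci.
LOCUS_THRESHOLDS = {
    locus: bonferroni_threshold(n) for locus, n in SNPS_PER_LOCUS.items()
}


def is_genome_wide_significant(p_value: float) -> bool:
    """True if P is at or below the genome-wide cutoff."""
    return p_value <= GENOME_WIDE_CUTOFF
```

Under this assumption the locus-specific thresholds fall between roughly 2×10−4 and 5×10−4, several orders of magnitude less stringent than the genome-wide cutoff used for discovery.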
R (version 2.15.1) statistical software was used for all analyses unless indicated otherwise, and a detailed description of statistical procedures is provided in the Supplementary Data (available online). All statistical tests were two-sided.
To comprehensively examine germline ALL susceptibility variants, we performed a GWAS in 1605 children with newly diagnosed B-precursor ALL and 6661 unrelated non-ALL control subjects of diverse ancestry (ie, European, African, Asian, and Native American genetic ancestry) (Supplementary Figures 3 and 4, available online). Controlling for population structure, we evaluated 709059 germline SNPs for differences in genotype frequency between ALL case subjects and control subjects.
We first focused on three susceptibility loci previously identified by GWAS in populations of European descent (5,6), namely ARID5B at 10q21.2, IKZF1 at 7p12.2, and CEBPE at 14q11.2, to compare the association signals among populations, particularly in those of non-European descent (Table 1; Supplementary Table 1, available online). At the ARID5B locus (ie, rs10821936), the association with ALL was consistent in all three genetically defined ethnicities: European Americans (P = 6.9×10−30; n = 972 case subjects and 1386 control subjects); African Americans (P = .004; n = 89 case subjects and 1363 control subjects); and Hispanic Americans (P = 3.8×10−11; n = 305 case subjects and 1008 control subjects). It was also evident in the multiethnic group (P = 5.9×10−46; n = 1605 case subjects and 6661 control subjects). The frequency of the ALL risk allele (C) at rs10821936 increased in the order of African Americans, European Americans, and Hispanic Americans, consistent with the ethnic differences in ALL incidence (21). Multivariable analyses adjusting for rs10821936 did not identify any additional independent association signal at this locus (Supplementary Figure 5, available online). The top SNP in IKZF1 (ie, rs11978267; P = 5.3×10−24 in the multiethnic group) was also associated with ALL risk across ethnicities. Interestingly, another cluster of SNPs further upstream of rs11978267 remained statistically significant even after controlling for rs11978267 (ie, rs10235226; P = 1.4×10−5 in the multiethnic group) (Supplementary Figure 5 and Supplementary Data, available online). Association at CEBPE SNPs was validated in the multiethnic group (ie, rs4982731; P = 9×10−12) (Table 1), but a multivariable model conditioning on the top SNP (rs4982731) did not support additional independent associations in this region (Supplementary Figure 5, available online). Another previously reported ALL risk locus at 9p21.3 (8) was also validated in our GWAS series (ie, rs1775631 at CDKN2A/2B; P = 1.4×10−5 in the multiethnic group).
In total, of 664 SNPs at these four loci, 79 remained statistically significant after correcting for multiple testing (Supplementary Table 1, available online).
Importantly, our multiethnic GWAS also discovered a novel ALL susceptibility locus at 10p12.31-12.2 that was not identified by previous studies in populations of European descent. As shown in Figure 1, Figure 2, and Table 2, six variants in the BMI1-PIP4K2A region exhibited genome-wide statistically significant associations with ALL. Four SNPs were clustered within the intronic region of the PIP4K2A gene; the other two were upstream of the COMMD3 and BMI1 genes and further distal to the centromere (Figure 2). The SNPs with the strongest association in each region were rs7088318 (PIP4K2A; P = 1.1×10−11) and rs4748793 (COMMD3/BMI1; P = 8.4×10−9), respectively (Table 2). Although both SNPs conferred a similar degree of increase in ALL risk (odds ratio [OR] = 1.4), they were independently associated with disease susceptibility (P < .0001) in a multivariable model adjusting for each other (Supplementary Figure 5, available online) and were located in distinct linkage disequilibrium (LD) blocks in all ethnic groups (Figure 2). Also, the frequency of the ALL risk allele at rs7088318 was highest in Hispanic Americans, followed by European Americans, and lowest in African Americans, in parallel with ALL incidence in these populations (20) (Table 2). To explore possible functional consequences of this PIP4K2A variation, we investigated the relationship between rs7088318 genotype and PIP4K2A mRNA expression. In lymphoblastoid cell lines derived from the HapMap CEU samples, the ALL risk allele (A) at rs7088318 was linked to higher PIP4K2A expression (P = .001; n = 55) (Figure 3). Consistently, the number of A alleles at this SNP was also positively associated with PIP4K2A expression in diagnostic blasts from children with ALL (P = .02; n = 228) (Figure 3), indicative of a cis-acting expression quantitative trait locus.
We next sought to validate the association at the novel susceptibility locus BMI1-PIP4K2A in three independent case–control series in an ethnicity-specific manner: European Americans (n = 574 case subjects and 2601 control subjects), African Americans (n = 128 case subjects and 1075 control subjects), and Hispanic Americans (n = 143 case subjects and 640 control subjects). The top PIP4K2A SNP, rs7088318, was statistically significantly associated with ALL susceptibility in all three ethnic groups: European Americans (P = .001), African Americans (P = .009), and Hispanic Americans (P = .04) (Table 2). The remaining five SNPs at this locus were all replicated in at least one ethnic group (Table 2), as was the independent association at rs4748793 (P = 3.5×10−4 for rs4748793, after adjusting for rs7088318 and genetic ancestry in the replication series).
The genetic underpinning of childhood ALL susceptibility is likely to be complex, and current evidence strongly favors a polygenic model of ALL risk (11). We next examined the combined effects of four genome-wide statistically significant loci (Figure 1) on ALL susceptibility by multimarker analyses on the basis of genotype at top SNPs at each locus: rs10821936 at ARID5B, rs11978267 at IKZF1, rs7088318 at PIP4K2A, and rs4982731 at CEBPE. In the combined GWAS and replication series (n = 2450 case subjects and 10 977 control subjects), there was a positive correlation (P = 1.6×10−5; correlation coefficient = 0.39, 95% confidence interval [CI] = 0.33 to 0.45) between the number of risk alleles at these four SNPs and relative ALL risk (ie, odds ratio, relative to subjects carrying 0–1 copy of the risk alleles) (Figure 4). For example, subjects with six to eight copies of risk alleles (n = 252 case subjects and 314 control subjects) were at ninefold (95% CI = 6.9 to 11.8) higher risk of developing ALL than those with zero to one copy of the risk alleles (n = 153 case subjects and 1753 control subjects). Cumulative effects of these variants were also estimated separately in the GWAS and replication series (Supplementary Figure 6, available online).
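The ninefold estimate can be approximated directly from the counts reported above. The sketch below computes a crude (unadjusted) odds ratio with a Wald (Woolf) 95% confidence interval from the 2×2 table. This is a simplification we have chosen for illustration, not the regression-based analysis used in the study, and the function name is hypothetical.

```python
import math


def odds_ratio_with_ci(case_exposed: int, control_exposed: int,
                       case_ref: int, control_ref: int,
                       z: float = 1.959964) -> tuple:
    """Crude odds ratio and Wald confidence interval from a 2x2 table.

    'exposed' = subjects carrying six to eight risk alleles;
    'ref'     = subjects carrying zero to one risk allele.
    """
    odds_ratio = (case_exposed * control_ref) / (control_exposed * case_ref)
    # Woolf standard error of the log odds ratio.
    se_log_or = math.sqrt(1 / case_exposed + 1 / control_exposed
                          + 1 / case_ref + 1 / control_ref)
    log_or = math.log(odds_ratio)
    lower = math.exp(log_or - z * se_log_or)
    upper = math.exp(log_or + z * se_log_or)
    return odds_ratio, lower, upper


# Counts from the text: 252 case and 314 control subjects with six to eight
# risk alleles; 153 case and 1753 control subjects with zero to one.
crude_or, ci_low, ci_high = odds_ratio_with_ci(252, 314, 153, 1753)
```

The crude odds ratio from these counts is approximately 9.2, in line with the reported ninefold increase. The crude interval need not match the reported 6.9 to 11.8, which presumably reflects a model with covariate adjustment.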
Finally, because the incidence of ALL is highly related to age with the majority of cases occurring in children aged 2 to 5 years (2), we examined the effects of ALL susceptibility variants by age. Combining GWAS and replication series, risk allele frequency at rs10821936 was higher in children who developed ALL before 10 years of age than in those diagnosed with ALL at ages older than 10 years (P = .02, .18, .007 in European Americans, African Americans, and Hispanic Americans, respectively) (Table 3), most evidently in hyperdiploid ALL (Table 3). Consistently, when we further classified children into age groups of those aged less than 5 years, those aged 5 to 10 years, and those aged greater than 10 years, there was a trend for decreasing allelic odds ratio (ie, relative risk of ALL conferred by each copy of the C allele at rs10821936) as age increased: 2.01 (95% CI = 1.85 to 2.19), 1.8 (95% CI = 1.6 to 2.02), and 1.48 (95% CI = 1.3 to 1.68), respectively (Supplementary Figure 7, available online). Similar results were observed when we restricted the analysis to hyperdiploid ALL (Supplementary Figure 7, available online). In contrast, the effects of IKZF1, CEBPE, and PIP4K2A variants did not differ between age groups (data not shown). Together, these results suggest possible modifying effects of age on genetic predisposition to ALL.
Non-European populations are indisputably underrepresented in GWASs (12,13). Recent GWASs in diverse populations reveal both similarities and differences in the genetic architecture of disease susceptibility among ethnic groups. We report here the first GWAS of ALL in multiethnic populations (including African Americans and Hispanic Americans), in which we discovered novel susceptibility variants at the BMI1-PIP4K2A locus and comprehensively compared associations at known susceptibility loci (ARID5B, IKZF1, CEBPE, and CDKN2A/2B) in different ethnic groups.
The discovery of BMI1-PIP4K2A susceptibility variants that were not detected by previous European-only GWASs of ALL (5–7) raises the question of whether population diversity improved power. When the disease variant is substantially more common in non-European populations, GWASs in these ethnic groups have greater power to discover such loci than GWASs in Europeans with the same sample size, as illustrated in the case of the KCNQ1 locus in type 2 diabetes (37) and by simulation using the 1000 Genomes data (38). However, this is unlikely to explain the BMI1-PIP4K2A findings, because these variants are actually less frequent in African Americans than in European Americans, although they are modestly more common in Hispanic Americans. Another plausible explanation is population differences in LD around the BMI1-PIP4K2A variants: if the causal variant is better tagged in African populations, including African Americans in the GWAS is likely to improve the sensitivity to detect the signal at the genome-wide threshold. At this locus, the LD pattern is similar between European Americans and Hispanic Americans but is much less extensive in African Americans, as expected (Figure 2). Lastly, population heterogeneity in the effect size of the risk allele can also influence the sample size required in a GWAS. In type 2 diabetes, the allelic risk at multiple disease variants is statistically significantly greater in the Japanese population than in the European population, although these variants are statistically significant in both ethnic groups (39). Interestingly, the per-allele odds ratio at rs7088318 was greater in African Americans and Hispanic Americans than in European Americans (Table 2), consistent with possibly improved power when these non-European populations are included in GWASs of ALL.
The population diversity in our GWAS also offered a unique opportunity to examine the genetic basis of ethnic differences in ALL incidence (20). Variants at the ARID5B, IKZF1, and BMI1-PIP4K2A loci were associated with ALL susceptibility across ethnic groups (ie, the SNP with the strongest association at each locus was statistically significant in all three populations), suggesting common causal variants across ancestral backgrounds. In contrast, CEBPE SNPs were strongly related to ALL risk in European Americans, with variable effects in non-European populations. Such disparities might reflect the existence of true population-specific disease variants but can also arise from population differences in genomic structure at these loci (differences in LD between tagging SNPs and causal variants). Further, the frequency of ALL risk variants at the ARID5B and PIP4K2A loci varies substantially by ethnicity in a pattern consistent with their possible contribution to ethnic differences in ALL incidence (21) (Tables 1 and 2).
The genetic basis of ALL is most likely to be polygenic (11). However, it should be noted that carrying ALL risk variants at merely four SNPs (ARID5B, IKZF1, CEBPE, and PIP4K2A) conferred a ninefold increase in disease susceptibility (Figure 4) and that these GWAS signals are concentrated in genes directly related to hematopoietic differentiation and development [ARID5B (40), IKZF1 (41), and CEBPE (42)]. We hypothesize that genetic predisposition to ALL might be largely mediated by robust effects of a modest number of key genes rather than cumulative effects of tens of thousands of variants with small effects (OR = 1.1–1.2), as seen in GWASs of other common diseases (43,44). In fact, it is estimated that variants in ARID5B, IKZF1, CEBPE, and CDKN2A/2B account for approximately one-third of ALL risk conferred by common genetic polymorphisms (11). The effect of the ARID5B susceptibility variant was particularly strong in younger children (Supplementary Figure 7, available online), suggesting possible variation in ALL genetic predisposition at different developmental stages. Interestingly, several of the GWAS hits are also frequently targeted by somatic aberrations in ALL cells [IKZF1 (45) and CEBPE (46)]. Susceptibility variants in ARID5B are also related to gross cytogenetic abnormalities in ALL blasts (ie, hyperdiploidy) (Table 3), consistent with prior reports from us and others (5,6,21,47). The C allele at rs10821936 confers a greater disease risk for this subtype of ALL (5), although the molecular mechanisms linking ARID5B to aneuploidy remain unclear. Nevertheless, these observations raise the possibility of interactions between inherited (germline) and acquired (somatic) genetic variations in the pathogenesis of ALL.
PIP4K2A is a member of the family of enzymes that catalyze phosphorylation of phosphatidylinositol-5-phosphate to form phosphatidylinositol-4,5-bisphosphate (PIP2), a precursor of the important second messenger molecule, PIP3. Upon B-cell receptor activation, PIP4K2A is directly recruited by BTK to the plasma membrane as a means of stimulating local PIP2 synthesis (48). Similarly, PIP5K enzymes also interact with the Rho-family small GTP-binding proteins (eg, Rac1) to regulate membrane PIP2 synthesis and PI3K and PLC signaling in B cells (49). Although these observations point to PIP4K2A as a plausible regulator of lymphoid cell differentiation, functional studies are warranted to determine the mechanisms linking PIP4K2A to leukemogenesis.
Our study was not without limitations. Further fine-mapping and/or resequencing of the causal variants will be required to completely characterize the contribution of BMI1-PIP4K2A variants to ALL etiology in the context of ethnicity. Future GWASs and/or admixture mapping with even larger samples of non-European populations are needed to comprehensively characterize genetic variants that predispose children to this most common childhood cancer and to fully understand the genetic basis of ethnic disparity in ALL. Nonetheless, we argue that a GWAS approach that includes multiethnic subjects is likely to be more effective in discovering ALL risk loci than analyses selectively procuring large samples in a single population, as suggested by observations from GWASs of other diseases (13,15,16,50).
This work was supported by the National Institutes of Health (grant numbers CA156449, CA21765, CA36401, CA98543, CA114766, CA98413, CA140729, and U01GM92666), in part by the Intramural Program of the National Cancer Institute, and by the American Lebanese Syrian Associated Charities (ALSAC).
Genome-wide genotyping of COG P9904/P9905 samples was performed by the Center for Molecular Medicine with generous financial support from the Jeffrey Pride Foundation and the National Childhood Cancer Foundation. S.P. Hunger is the Ergen Family Chair in Pediatric Cancer, and J.J. Yang is supported by the American Society of Hematology Scholar Award, the Alex's Lemonade Stand Foundation for Childhood Cancer Young Investigator Grant, and the Order of St. Francis Foundation.
H. Xu and W. Yang contributed equally to this work.
The study sponsors were not directly involved in the design of the study, the collection, analysis, and interpretation of the data, the writing of the manuscript, or the decision to submit the manuscript.
We thank the patients and parents who participated in the St. Jude and COG protocols included in this study, the clinicians and research staff at the St. Jude and COG institutions, and J. Pullen (University of Mississippi at Jackson) for assistance in classification of patients with ALL. We thank M. Shriver (Pennsylvania State University) for sharing SNP genotype data of the Native American references. Full acknowledgements of the dbGAP datasets are provided in the Supplementary Data (available online).