|Home | About | Journals | Submit | Contact Us | Français|
We conducted a genome-wide association study (GWAS) to discover single nucleotide polymorphisms (SNPs) associated with the severity of sickle cell anemia in 1,265 patients with either “severe” or “mild” disease based on a network model of disease severity. We analyzed data using single SNP analysis and a novel SNP set enrichment analysis (SSEA) developed to discover clusters of associated SNPs. Single SNP analysis discovered 40 SNPs that were strongly associated with sickle cell severity (odds for association >1,000); of the 32 that we could analyze in an independent set of 163 patients, five replicated, eight showed consistent effects although failed to reach statistical significance, whereas 19 did not show any convincing association. Among the replicated associations are SNPs in KCNK6 a K+ channel gene. SSEA identified 27 genes with a strong enrichment of significant SNPs (P < 10−6); 20 were replicated with varying degrees of confidence. Among the novel findings identified by SSEA is the telomere length regulator gene TNKS. These studies are the first to use GWAS to understand the genetic diversity that accounts the phenotypic heterogeneity sickle cell anemia as estimated by an integrated model of severity. Additional validation, resequencing, and functional studies to understand the biology and reveal mechanisms by which candidate genes might have their effects are the future goals of this work.
Phenotypic heterogeneity is characteristic of sickle cell anemia, a Mendelian disorder caused by homozygosity for the sickle HBB gene (glu6val). Among patients, different rates of hemolysis/vasculopathy and viscosity/vasoocclusion-related complications are typical [1–3], and account for a substantial reduction in life expectancy: In 1994, the median life expectancy for men and women with sickle cell anemia was 42 and 48 years, respectively . Despite many advances in care, the annual mortality still approaches 4%. To integrate individual disease complications into a comprehensive measure of severity, we developed a model of the associations among clinical and laboratory variables that scored disease severity as the risk of death within 5 years . This network was developed using data obtained from more than 3,400 subjects from the Cooperative Study of Sickle Cell Disease (CSSCD) , and its accuracy was validated in two unrelated sets of sickle cell patients. Recently, the network was also validated in a small European cohort of patients with sickle cell anemia .
This model did not include the genetic components that likely determine the substantial between-subject variability of sickle cell anemia complications. Preliminary work based on candidate gene analysis suggested that several genes might modulate the overall severity of the disease but was limited to the choice of candidate genes . We report here the results of a genome-wide association study (GWAS) of the severity of sickle cell anemia in 1,265 patients from the CSSCD based on the Illumina Human610-Quad SNP array. Using traditional single SNP analysis, we identified 40 SNPs strongly associated with disease severity (odds for association >1,000) and, of the 32 that we could analyze in an independent validation set of 163 patients, five replicated, eight showed consistent effects although failed to reach statistical significance, and 19 were not significant.
The small number of replicated associations that were significant is a limitation of the standard approach to GWAS that usually requires a very large sample size to detect associations with so-called genome-wide significance (P values < 10−8) to reduce chances of false positive findings . It is acknowledged that many more associations remain to be discovered from GWAS data , especially in rare diseases where sample sizes are constrained. Therefore, we developed a novel SNP set enrichment analysis (SSEA) to mine the GWAS data further. SSEA searches for regions of the genome that are enriched for sets of significantly associated SNPs and computes the probability of detecting by chance the number of SNPs associated with a phenotype in a window of a flexible number of SNPs, or in a gene. SSEA identified 18 genes with a strong enrichment of significant SNPs in the discovery set (P < 2 × 10−6) and 11 were replicated with varying degrees of confidence. Functional category enrichment analysis showed an enrichment of genes involved in heme binding.
The CSSCD remains the largest study of sickle cell disease yet done in the United States . Although it has been more than 30 years since the first patient was enrolled, it is unlikely to be replicated soon. In the CSSCD, phenotype and laboratory data were collected according to protocol, quality checked, and biological samples obtained from patients with sickle cell anemia . We used 1,265 sickle cell anemia patients, all African-Americans, from the CSSCD as our primary discovery set for the GWAS, using stored DNA for genotyping. Our secondary validation data sets were from much smaller contemporaneous sickle cell anemia studies and included African-Americans subjects being screened prospectively for the presence of pulmonary hypertension from the Duke University Pulmonary Hypertension Study  and from Boston Medical Center (BMC) . Some of the latter patients were previously included in the validation of the score of sickle cell anemia severity . In both the Duke and Boston studies, patients were at least 18 years old. In the CSSCD, patients were recruited regardless of symptoms in an attempt to capture the spectrum of disease severity. The Duke and BMC samples were clinic-based, and therefore, biased toward symptomatic disease. These studies were approved by the Institutional Review Boards of the participating institutions. Summary characteristics of the populations are described in Table I.
We used the Bayesian network described in  and available from http://www.bu.edu/sicklecell/downloads/Projects/ to compute the severity of disease of patients of the discovery and validation sets. The network uses 25 clinical and laboratory variables to assess the severity of disease by the risk of death within 5 years and its sensitivity and specificity were originally evaluated in two unrelated patient data sets. The network model includes well-known markers of disease severity such as LDH and systolic blood pressure that correlate with hemolytic anemia, and other complications of the disease such as stroke, painful episodes, and priapism that are known markers of severity. The quantitative score was used to generate a group of 1,088 patients with “mild disease” (score < 0.4) and 177 patients with “severe disease” (score ≥0.6 for ages 40 and younger, and score >0.8 for ages 40 and older) that were used as controls and cases for the GWAS. This grouping was based on the observation that the severity score has a U shape that changes in age groups (Fig. 1). The lower value of 0.4 appears to capture groups of mild disease patients for all age groups, and a score of 0.6 or higher is sufficient to characterize severe patients aged 40 or younger. For patients aged 40 and older, we restricted the definition of severe cases to a score of 0.8 or higher to reduce the risk of misclassification. The severity of disease in the 163 subjects of the validation set was scored in analogous way and lead to 95 mild cases (score < 0.4) and 68 severe cases (score > 0.6).
We genotyped the DNA samples of the CSSCD and Duke subjects with the Illumina Human610-Quad SNP array that includes more than 600,000 SNPs and covers ~60% of the genome of the HapMap Yoruban (r2 > 0.8) and 18,000 known genes. We genotyped the DNA samples from the BMC patients in an earlier phase of the study, with the Illumina HumanCNV370-duo bead chip that comprises ~350,000 SNPs common to the 610 array. Genomic DNA (0.5–1 μg depending on array) was analyzed on the Illumina arrays using the standard Illumina protocol and BeadStudio software was used for genotype calling using predetermined clustering provided by Illumina.
We excluded samples with gender inconsistencies and call rates <93%. We conducted identity by state analysis using the software PLINK  to confirm known familial relationships and to identify hidden familial relationships. Repeat genotyping of duplicate samples received from the CSSCD as well as repeat genotyping of the same tube of DNA was used to evaluate sample misidentification (estimated to be <5%). All suspect samples were excluded from the study (~10%). We conducted genome-wide principal component analysis of the subjects included in the discovery and validation sets with the program EIGENSOFT . The scatter plot of the first two principal components is displayed in Fig. 2.
We examined general, allelic, dominant, and recessive associations using Bayesian tests and scored the evidence of association of each model by its posterior probability. The details of the calculations are as described in Balding . and Sebastiani et al . and we assumed uniform probability on competitive models, so that the posterior odds of each model is equivalent to the Bayes factor (BF). To address the issue of multiple testing, we conducted extensive simulations to compute the expected number of false positive associations for different thresholds of the BF. The simulations showed that the false positive rate of the Bayesian decision rule changes with the allele frequency and suggested using a BF >1,000 to reduce the number of false positive associations to less than 1 in 10,000 independent tests. This procedure is also described in . Forty SNPs had a model of association with a BF >1,000 (Supporting Table 1). Of these 40 SNPs, only 32 had an estimated MAF in the validation set of at least 15% and could be included in the validation analysis. A SNP association was considered validated if the same genetic model showed significant associations in both the discovery and validation set, and the genetic effects were in the same direction. However, the threshold for significance in the validation set was reduced to BF >1 given that replication was conducted for 32 SNPs only. A SNP association was considered consistent if the genetic effect for the same genetic model was in the same direction in both the discovery and validation set, but statistical significance was not reached in the validation set. The results are in Tables TablesIIII and III. For comparisons, allele frequencies in the HapMap CEPH and Yoruban are also reported.
Traditional one-SNP-at-a-time methods are underpowered to discover those myriad variants that explain minor effects upon common complex phenotypes or diseases. Therefore, for better mining of the data, we developed a SNP set enrichment analysis or SSEA. SSEA computes the probability of detecting by chance the number of SNPs associated with a phenotype within a set of SNPs, for a given threshold of significance. The smaller this probability, the stronger the evidence that the SNPs set is globally associated with the phenotype. The probability is computed using the hypergeometric distribution
where N is the total number of SNPs analyzed, n is the total number of SNPs with BF > threshold, mg is the number of SNPs in the gth set, and kg is the number of SNPs in the gth set with a BF > threshold. We used a “gene-centric” definition of SNP sets that were defined as the set of SNPs in each gene transcript, including SNPs within 10 kb from the transcripts. We used a BF >3 to select significant SNPs in genes based on the following consideration: The mean number of SNPs per gene in the Illumina 610 is 25, and because the decision rule to accept an association as significant if the BF >3 has a false positive rate of about 2%, the expected number of false positive single SNP association per gene is on average 5%. This analysis identifies 91 genes with a SSEA P <0.001. The genes with significant SSEA scores were examined in the validation set and 20 of them had one or more SNPs significantly associated with sickle cell anemia severity. We report the 11 genes with a SSEA P < 10−6 in the discovery set and SSEA P <0.2 in the validation set in Table IV. The threshold 10−6 limits the expected number of false positive findings to less than 5% in the discovery step, whereas the threshold of 0.2 for replication limits the expected number of falsely replicated genes to 4 and a false discovery rate of 4/11 = 0.36. Note that SSEA searches for clusters of significant SNPs that are individually mildly significant. Therefore, this analysis is complementary to the single SNP analysis.
Table I summarizes the patients’ characteristics of the discovery and validation sets. As expected, patients with severe disease in the primary discovery study are on average older than milder cases, have a substantially higher prevalence of osteonecrosis (AVN), leg ulceration, stroke, and sepsis but essentially no difference in frequency of pain episodes and priapism and HbF concentration. The clinical characteristics of the contemporary validation set reflect the overall older age of this clinic population, the greater use of transfusion, and for LDH, different test characteristics and therefore, normal ranges in the discovery and validation samples. While widely separated in time from the CSSCD patients, the patients in the validation set had similar degrees of genetic admixture, and our analysis of the level of genetic diversity in the patients included in this study using principal component analysis  did not show differences in ancestry (Fig. 2). Full details of the principal component analysis are described in the manuscript .
Figure 3 displays the Manhattan plot with the results of the GWAS, using single SNP analysis. Forty SNPs reach our genome-wide significance threshold in at least one model of association (Supporting Table 1). Table II reports the eight SNPs that reach genome-wide significance in the discovery set and show consistent effects in the validation set. The results are shown for different models of genetic inheritance, and when more than one model reached genome-wide significance, the model with strongest significance and replication is reported. Table III reports the five SNPs that are associated with sickle cell anemia severity in the primary study and replicated in the secondary study. Besides the SNPs in KCNK6 and OTUD3, the other three are in intergenic regions with no known functional role.
Table IV shows the list of 11 genes that have a significant SSEA probability score <10−6. TNKS is the gene with strongest SSEA score in the discovery set, with 14 of 35 SNPs that were associated with a BF >3. The linkage disequilibrium heatmap in Fig. 4 shows the physical positions of the associated SNPs and their linkage disequilibrium map. The gene spans more than 240 Kb and the patterns of associations identify two blocks that would need to be followed by fine mapping or sequencing for SNP discovery.
Table V shows the top functional categories that were enriched of genes with SSEA score <0.001. The full list of 92 genes included in this analysis is in the Supporting Table 2. Noticeable gene categories are the set of three genes CYP2C18 (SSEA score 0.0003); CYP2C9 (SSEA score 5× 10−5); and CYP4F8 (SSEA score 10−5) that are members of the cytochrome P450 superfamily of enzymes. The full set of results is available also at http://www.bu.edu/sicklecell/projects/.
Phenotypic heterogeneity is characteristic of sickle cell anemia and is partially genetically influenced . GWAS are an unbiased approach toward discovering potential genetic modulators by seeking inherited polymorphisms associated with phenotypic traits. Single laboratory measurements, like HbF and pulmonary tricuspid regurgitant velocity on echocardiography, or single clinical events like stroke, priapism, osteonecrosis, pain, and leg ulceration, have been used as subphenotypes in genetic association studies [12,23–30]. We chose to use an integrated estimate of the severity of disease as a phenotype to better model global disease severity.
Our studies support the plausible notion that the actions of multiple genes determine the overall severity of sickle cell anemia. The modulatory effects of HbF and α thalassemia have been amply demonstrated (for a review see ) and genes modulating HbF concentration have been associated with the rate of acute painful episodes . GWAS using a novel phenotype that reflects the overall severity of disease defined by death suggests an association with genes whose properties are pertinent to the pathophysiology of sickle cell anemia.
KCNK6 is a member of a superfamily of K+ channel proteins. Widely expressed in CD34+ cells induced toward erythroid differentiation and in endothelium across different vascular beds, we found its expression upregulated when human pulmonary artery endothelial cells were exposed to sickle cell plasma . The vascular endothelium plays a major role in vasoocclusive processes in sickle cell disease via its interactions with sickle erythrocytes, leukocytes, and platelets making it an important potential target for genetic modulation . Also, sickle erythrocyte cation content is a critical determinant of HbS polymerization tendency that leads to cell damage and hemolysis. Whether or not this gene is expressed in reticulocytes or mature erythrocytes is unknown. An analysis of the reticulocyte transcriptome did not detect its expression among the 120 most expressed genes. However, cation transport channel genes like the Gardos channel (KCCN4) or KCl cotransporter (KCC1, KCC3, KCC4) channels that are known to be active in sickle reticulocytes were also not represented among the highly expressed genes of the normal reticulocyte transcriptome [32–36].
Tankyrase1, the protein product of TNKS, is a poly(ADP-ribose) polymerase that localizes to telomeres. Nuclear overexpression of TNKS leads to loss of telomere elongation, suggesting a role as a positive regulator of telomere length. Conversely, reducing TNKS expression using RNAi produces telomere shortening . As sickle cell anemia, as well as many other vascular diseases, are characterized by endothelial dysfunction, of interest is the role that telomerases might play in this process [38,39]. Endothelial senescence is modulated by telomerase activity, and the pro-oxidant environment in sickle cell anemia might also lead to telomere shortening. Additionally, hydroxyurea, an important drug therapy for sickle cell anemia that prolongs survival, affects telomere replication and maintenance through a mechanism that may involve the direct modification of TRF2, a member of the shelertrin complex modulated by TNKS. TRF2 can protect TRF1 from the effects of tankyrase1 .
Several genes associated with severity by SSEA have unknown functions. Two genes identified by SSEA have interesting functions: The protein kinase MAP3K13 may have a role in the JNK pathway that is activated by inflammation . The complement component 8 (C8A) is one of the components of the membrane attack complex, but is not required for inducing hemolysis or for killing gram-negative bacteria[42,43]. Sixteen SNPs in this gene are in the 610 array, and eight are significantly associated with disease severity, including the functional SNP rs652785 (CAA → AAA position 414, Q93K). Patients with the AA genotype have 3.2 times the odds for severe disease compared to patients with the AC or CC genotype.
Although our study is a first step using GWAS to help understand the genetic diversity that accounts for the phenotypic heterogeneity of sickle cell anemia, the small size of the validation set limits our ability to validate the findings of the discovery set. Sickle cell anemia is a rare disease in developed countries where extensive phenotype data and biological samples are most likely to be available for genetic studies, thereby constraining the sample size needed to achieve a traditional genome-wide level of significance for individual SNPs with small effect sizes. This also compromises the ability to assemble replication and validation populations of similar ethnic composition. Our results will require additional validation, resequencing of promising leads, and functional studies to understand the biology and reveal mechanisms by which candidate genes might have their effects. However, they form a base for other investigators to use in their work on this subject. Any therapeutic use of our observations are likely to be years away. Nevertheless, when we more fully understand the genetic differences among patients and the association with phenotypes of disease this might be prognostically useful.[27,29]
Contract grant sponsor: NIH/NHLBI; Contract grant numbers: R01 HL87681, R01 HL 068970.
Additional Supporting Information may be found in the online version of this article
Conflict of interest: Nothing to report