We report a genome-wide association (GWA) study of severe malaria in The Gambia. The initial GWA scan included 2,500 children genotyped on the Affymetrix 500K GeneChip, and a replication study included 3,400 children. We used this to examine the performance of GWA methods in Africa. We found considerable population stratification, and also that signals of association at known malaria resistance loci were greatly attenuated owing to weak linkage disequilibrium (LD). To investigate possible solutions to the problem of low LD, we focused on the HbS locus, sequencing this region of the genome in 62 Gambian individuals and then using these data to conduct multipoint imputation in the GWA samples. This increased the signal of association, from P = 4 × 10−7 to P = 4 × 10−14, with the peak of the signal located precisely at the HbS causal variant. Our findings provide proof of principle that fine-resolution multipoint imputation, based on population-specific sequencing data, can substantially boost authentic GWA signals and enable fine mapping of causal variants in African populations.
The malaria parasite Plasmodium falciparum kills on the order of a million African children each year1, and this is a small fraction of the number of infected individuals in the population1-3. In communities where everyone is repeatedly infected with P. falciparum, host genetic factors account for ~25% of the risk of severe malaria, that is, life-threatening forms of the disease3. The strongest known determinant of risk, hemoglobin S (HbS), accounts for ~2% of the total variation, implying that only a small fraction of genetic resistance factors have so far been discovered3. Identifying the genetic basis of protective immunity against severe malaria may provide important insights for vaccine development.
Here we examine the possibility of approaching this problem by genome-wide association (GWA) analysis. There are many unsolved methodological questions about how to conduct an effective GWA study in Africa4. High levels of ethnic diversity may result in false-positive associations owing to population structure. Variations in haplotype structure between different ethnic groups may reduce power to detect GWA signals, particularly when data are amalgamated across multiple study sites. Low LD implies the need for denser genotyping arrays than are currently available: a crude estimate is that an African GWA study with 1.5 million SNPs would have approximately the same statistical power as a European study with 0.6 million SNPs5, but this is based on HapMap data from a single ethnic group and a larger number of SNPs may be needed to achieve adequate power across different ethnic groups.
We carried out an initial GWA study in Gambian children that explores these methodological questions. Genotyping of ~500,000 SNPs was conducted on 1,060 cases of severe malaria and 1,500 population controls using the Affymetrix GeneChip 500K Mapping Array Set. The results reported here are based on a set of 402,814 SNPs in 958 cases and 1,382 controls that passed stringent quality control procedures. Access to these data may be requested online (see URLs section in Methods).
Subjects were recruited from an area of approximately 400 square miles in the Kombos region of The Gambia. Four ethnic groups, Mandinka, Jola, Wolof and Fula, accounted for 89% of cases and 86% of controls (Supplementary Table 1 online). Using Wright's FST across all autosomal SNPs, we found that differences between ethnic groups accounted for a small fraction of genetic variation within the population as a whole (FST = 0.004). The greatest differentiation was seen between Fula and Jola (FST = 0.007) and the least between Mandinka and Wolof (FST = 0.002) (Supplementary Table 2 online).
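As a rough illustration of the quantity reported above, Wright's FST for a single biallelic SNP can be written as (HT − HS)/HT, where HT is the expected heterozygosity of the pooled population and HS is the mean within-group heterozygosity. The sketch below assumes this simple single-SNP form; the text does not state which estimator or averaging scheme was used across SNPs, and the function name and allele frequencies are hypothetical.

```python
def fst_biallelic(freqs, sizes):
    """Wright's F_ST for one biallelic SNP, computed as (H_T - H_S) / H_T."""
    total = sum(sizes)
    # Size-weighted mean allele frequency across subpopulations
    p_bar = sum(p * n for p, n in zip(freqs, sizes)) / total
    # Expected heterozygosity of the pooled population
    h_t = 2 * p_bar * (1 - p_bar)
    # Size-weighted mean of the within-subpopulation expected heterozygosity
    h_s = sum(2 * p * (1 - p) * n for p, n in zip(freqs, sizes)) / total
    return 0.0 if h_t == 0 else (h_t - h_s) / h_t


# Hypothetical allele frequencies in four equally sized groups (not study data)
freqs = [0.30, 0.32, 0.28, 0.35]
sizes = [500, 500, 500, 500]
fst = fst_biallelic(freqs, sizes)
print(f"F_ST = {fst:.4f}")
```

Modest frequency differences of a few percentage points between groups, as in this made-up example, give FST values well below 0.01, the same order of magnitude as the values reported for the Gambian ethnic groups.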
To investigate the relationship between population structure and self-reported ethnicity, we carried out principal components analysis (PCA) of 100,715 SNPs, selected to reduce LD between markers (Fig. 1)6. The first two principal components distinguished Fula and Jola, and the third principal component separated the Mandinka and Wolof from others. Some individuals could be confidently assigned to a specific ethnic group, whereas others seemed to have a more complex ancestry. These findings were verified using the STRUCTURE7 program on 8,000 SNPs, which gave an optimal model of population structure with four genetic subpopulations corresponding to the four most common ethnic groups (Supplementary Fig. 1 online).
To place these findings in the context of global population structure, we compared the Gambian sample with populations studied by the HapMap project5,8. The Gambian sample can be clearly distinguished by PCA from the Yoruba people of Ibadan, Nigeria (a different part of West Africa) but is much closer to Yoruba than to European, Han Chinese or Japanese samples (Fig. 2a). Individual ethnic groups within The Gambia seem to have greater genetic diversity than the HapMap Yoruba sample (compare Fig. 1a with Fig. 2b, Supplementary Fig. 2 online). This may reflect the fact that Gambian samples were recruited from the general population, whereas the HapMap Yoruba samples were collected in a particular community from individuals with four Yoruba grandparents.
To evaluate the likelihood of false-positive GWA findings due to population structure, we conducted a trend test of association in cases versus controls on all SNPs, and compared observed χ2 values to expected values under the null hypothesis in a quantile-quantile plot (Fig. 3). The overdispersion of association test statistics (λ = 1.23) implies a high number of false-positive associations in the raw data, but this was greatly reduced by correction for self-reported ethnicity (λ = 1.07), and became negligible when the first three principal components from the eigenanalysis of population structure were entered as covariates in logistic regression analysis (λ = 1.02). For comparison, λ has been estimated to be 1.03–1.11 in case-control studies in the British population, a range considered acceptable for GWA analysis9. Thus, with appropriate statistical correction, false-positive GWA findings arising from the Gambian population structure can be reduced to a very low level.
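The overdispersion factor λ is conventionally computed as the median of the observed 1-d.f. test statistics divided by the median of the χ2 distribution with 1 d.f. (about 0.455). The following is a minimal sketch of that calculation on simulated statistics; the simulated values and variable names are illustrative assumptions, not the study data.

```python
import random
import statistics

# Median of a chi-square distribution with 1 d.f. (about 0.4549),
# obtained from the standard normal quantile: median = z_0.75 ** 2.
CHI2_1DF_MEDIAN = statistics.NormalDist().inv_cdf(0.75) ** 2


def genomic_inflation(chi2_values):
    """Return lambda: observed median statistic over its expectation under the null."""
    return statistics.median(chi2_values) / CHI2_1DF_MEDIAN


# Simulated null statistics (squared standard normal draws); illustrative only.
rng = random.Random(0)
null_stats = [rng.gauss(0.0, 1.0) ** 2 for _ in range(100001)]
lam_null = genomic_inflation(null_stats)

# Uniformly inflating every statistic by 1.23 scales lambda by the same factor.
lam_inflated = genomic_inflation([1.23 * x for x in null_stats])
print(f"null lambda = {lam_null:.3f}, inflated lambda = {lam_inflated:.3f}")
```

A value of λ close to 1 indicates that the bulk of the test statistics follow the null distribution, which is why λ is a convenient summary of the quantile-quantile plot.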
After PCA correction for population structure, we tested each SNP for disease association using an unguided genotypic test with 2 degrees of freedom (d.f.), as well as tests with 1 d.f. for models of dominance, recessiveness, heterozygous advantage and trend. Cluster plots were visually inspected on all potentially significant results, which yielded 139 SNPs with unequivocal genotype results in 100 independent regions of the genome with P < 10−4 (Supplementary Table 3 online), including 6 with P < 10−6 (Fig. 4). The strongest signal of association was close to the HBB gene on chromosome 11p15, where the HbS polymorphism is located, with 13 SNPs at P < 10−4 and a minimum of P = 3.9×10−7 by trend test. In the following sections, we examine the signal of association around the HbS polymorphism, evaluate other known and putative malaria resistance–associated genes and describe newly identified signals of association.
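The four 1-d.f. models differ only in the score assigned to each genotype. One simple way to see this is the Cochran-Armitage score test, in which trend, dominant, recessive and heterozygous-advantage models correspond to genotype scores (0, 1, 2), (0, 1, 1), (0, 0, 1) and (0, 1, 0). The sketch below is an uncorrected version of such tests, whereas the analysis described above included principal components as covariates in logistic regression; the genotype counts and function names are hypothetical.

```python
import math

# Genotype scores for (AA, Aa, aa) under each 1-d.f. model
MODEL_SCORES = {
    "trend": (0, 1, 2),
    "dominant": (0, 1, 1),
    "recessive": (0, 0, 1),
    "heterozygous advantage": (0, 1, 0),
}


def score_test(cases, controls, scores):
    """Cochran-Armitage test; cases/controls are genotype counts (AA, Aa, aa)."""
    totals = [r + s for r, s in zip(cases, controls)]
    n_cases, n_controls = sum(cases), sum(controls)
    n = n_cases + n_controls
    # Score statistic: N * sum(w_i * r_i) - R * sum(w_i * n_i)
    t = n * sum(w * r for w, r in zip(scores, cases)) \
        - n_cases * sum(w * m for w, m in zip(scores, totals))
    var_term = n * sum(w * w * m for w, m in zip(scores, totals)) \
        - sum(w * m for w, m in zip(scores, totals)) ** 2
    if var_term == 0:
        return 0.0, 1.0
    chi2 = n * t * t / (n_cases * n_controls * var_term)
    # Upper tail of chi-square with 1 d.f.: P(Z^2 > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical genotype counts (AA, Aa, aa); not taken from the study
cases = (700, 240, 18)
controls = (900, 430, 52)
for model, scores in MODEL_SCORES.items():
    chi2, p = score_test(cases, controls, scores)
    print(f"{model}: chi2 = {chi2:.2f}, P = {p:.2e}")
```

With binary scores the statistic reduces to the Pearson χ2 of the corresponding collapsed 2 × 2 table, so a single function covers all four models.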
HbS provides a benchmark for evaluating GWA methods, as the causal polymorphism responsible for the malaria-protective effect is known: it is a SNP (rs334) in the coding region of HBB on chromosome 11p15.4 which results in replacement of glutamic acid with valine at amino-acid residue 6 of the β-globin chain. When we genotyped rs334 on the same samples used in the GWA study, using the Sequenom iPlex platform, we found a much stronger signal of association (P = 1.3 × 10−28). This raises several questions: why was the GWA signal (P = 3.9 × 10−7) much weaker than the true effect; is there an effective way to increase the GWA signal; and is there an effective way to get from the GWA signal to identification of the causal variant?
To investigate these questions, we sequenced 111 kb in the center of the GWA signal on chromosome 11p15 in a reference panel of 62 randomly selected Gambian controls (see Methods). These reference data were used to impute genotypes for all ~2,500 individuals in the GWA study with the IMPUTE program10, and a trend test of association was conducted at each imputed SNP. Out of 202 SNPs examined across this 111 kb region, three imputed SNPs had stronger signals of association than any of the SNPs genotyped on the initial GWA scan (Fig. 5). The HbS causal polymorphism (rs334) stands out as the imputed SNP with the strongest association (P = 4.5×10−14), several orders of magnitude more significant than the strongest signal from SNPs that were directly genotyped (P = 3.9×10−7).
This result provides proof of principle that it is possible to identify the causal polymorphism within a GWA signal by regional sequencing followed by multipoint association mapping using model-based imputation, provided the appropriate reference panel is used. We observed two features of LD in this region of the genome in this population, which may together be favorable for fine mapping. First, over a 1-Mb region we identified 55 SNPs with D' = 1 in relation to the HbS causal polymorphism rs334 (Fig. 6): this is consistent with previous evidence that the HbS allele is associated with an extended haplotype as a result of recent positive selection5,11. Second, the region as a whole has weak LD with a well-known recombination hot spot12, and the correlation between rs334 and neighboring SNPs does not exceed an r2 of 0.36 (Fig. 6). In other words, there are no neighboring SNPs that are sufficiently strongly correlated with rs334 to imitate the true signal of association generated from the causal variant. We are still at an early stage of understanding how the process of fine mapping is affected by different patterns of natural variation in the human genome, and this example of extended haplotype within a region of generally low LD provides an interesting case study.
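The combination of D' = 1 with low r2 described above arises naturally when a rare allele sits entirely on a common haplotype background. The following sketch computes both measures from haplotype and allele frequencies using their standard definitions; the frequencies are hypothetical and are not estimates for rs334 or any SNP in the study.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D', r^2) for two biallelic SNPs.

    p_ab: frequency of haplotypes carrying allele A at SNP 1 and allele B at SNP 2
    p_a, p_b: frequencies of allele A and allele B
    """
    d = p_ab - p_a * p_b
    # Largest |D| attainable given the allele frequencies and the sign of D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d_prime, r2


# Hypothetical case: a rare allele (8%) found only on haplotypes that carry
# a common allele (40%) at a neighboring SNP.
d_prime, r2 = ld_stats(p_ab=0.08, p_a=0.08, p_b=0.40)
print(f"D' = {d_prime:.2f}, r2 = {r2:.2f}")
```

In this made-up example D' is 1 because one of the four possible haplotypes is absent, yet r2 is only about 0.13 because the two alleles have very different frequencies, so the common SNP is a poor proxy for the rare one.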
In general, the performance of imputation strategies depends on the overall information content the genotyped SNPs carry for the untyped SNPs in the region, which was estimated at only 40% for rs334 for our data. This may explain, in part, why the imputed association signal (P = 4.5×10−14) was weaker than the value obtained when we genotyped rs334 directly on the same samples (P = 1.3×10−28).
The HapMap Yoruba sample has been used as the basis for designing GWA genotyping arrays intended for African populations in general, and these data provide an example of where this approach may fail. When viewed at a macroscopic level, patterns of LD in The Gambia and the HapMap Yoruba sample are similar, both for the genome as a whole (Supplementary Fig. 3 online) and for the genomic region around the HbS locus (Supplementary Fig. 4 online). However, when we attempted to impute rs334 genotypes in our Gambian data using the HapMap Yoruba as the reference panel, we failed to identify any association signal (P = 0.06). This may be explained by the fact that the SNP on the Affymetrix array with the strongest LD with rs334 in The Gambia (rs11036238, r2 = 0.32) has negligible LD in the HapMap Yoruba samples (r2 = 0.009); conversely, the SNP in strongest LD with rs334 in the HapMap Yoruba samples (rs7936221, r2 = 0.35) is not in LD with rs334 in The Gambia (r2 = 0.005). This is consistent with evidence that the HbS allele has arisen independently in different African populations11,13,14. Although the HbS allele may not be representative of genomic variation as a whole, it highlights the possibility of local anomalies particularly in regions under strong selective pressure, and thus raises important questions about the design of an optimal SNP tagging strategy for African populations in general.
Taken together, these findings support the view that low LD in African populations can help to distinguish the causal polymorphism from neighboring polymorphisms. But they also highlight the importance of understanding regional variations in haplotype structure when designing and interpreting GWA studies in African populations, particularly for loci that are under selective pressure.
The GWA analysis did not identify any of the well-known erythrocyte variants that have been selected by malaria, other than HbS. This can partly be explained by population genetic factors; for example, the Duffy FY*O allele has reached fixation in The Gambia, whereas other variants such as those affecting hemoglobin C and southeast Asian ovalocytosis are rare or absent in this population. We might have expected associations at G6PD and HBA1-HBA2, the loci for glucose-6-phosphate dehydrogenase deficiency15-17 and α+thalassaemia18-20, respectively, but our GWA dataset had no SNP within 100 kb of G6PD and only one SNP within 50 kb of HBA1-HBA2.
To investigate G6PD in more detail, we used the Sequenom iPlex platform to genotype rs1050828, a coding polymorphism that has received considerable attention as a marker of protection against severe malaria15,17, although there are other polymorphisms associated with reduced G6PD enzyme activity that have been less well studied in malaria and could possibly also be involved21. The minor allele frequency of rs1050828 in the Gambian control sample was 0.03, considerably lower than in samples from Kenya (0.18) and Malawi (0.19) that we genotyped by the same method. Power to detect association with rs1050828 in The Gambia is affected by this low allele frequency, and the results were consistent with a modest protective effect but were not statistically significant: odds ratio (OR) for male hemizygotes 0.71 (95% CI = 0.34–1.49) and for female heterozygotes 0.79 (95% CI = 0.43–1.46). Even if it had been a strong effect, it would not have given a GWA signal because the best tagging SNP for rs1050828 on the Affymetrix 500K array had r2 = 0.06.
We also examined the ABO locus, where the functional variant is known and an effect has been conclusively replicated across different populations22. A previous study combining case-control and family-based analyses of ~9,000 individuals in three populations found that individuals who are not of blood group O (as defined by the functional variant rs8176719, a splice-site insertion in the ABO gene) have ~1.2-fold increased risk of severe malaria with a combined P value of 2 × 10−7 (ref. 22). We genotyped rs8176719 in our GWA sample and found an association that was entirely consistent with previous data (OR = 1.26, 95% CI = 1.11–1.44, P = 5×10−4) but which would not have reached our initial GWA significance threshold of P < 10−4. The lack of a GWA signal can be explained by the fact that the best tagging SNP had r2 = 0.15.
Other SNP associations have been reported for malaria but have not been conclusively replicated in large studies across different populations, and are mostly thought to be markers rather than true causal variants. At seven loci (CD36, CD40LG, CR1, ICAM1, IL22, NOS2, TNF) we genotyped nine candidate SNPs previously reported to show association with malaria (Supplementary Table 4 online). A weak association was identified at TNF for rs2516486 (P = 0.02) but this did not result in a GWA signal, as the best tagging SNP had r2 = 0.51. Other SNPs tested showed no significant association, but had they done so it might have been missed by GWA analysis, as all candidate SNPs were poorly tagged by the Affymetrix 500K array (median r2 = 0.45; range 0.01–0.61) (Supplementary Table 4).
In summary, the lack of GWA signals corresponding to previously reported malaria associations can at least in part be explained by low tagging efficiency of the Affymetrix 500K array in this population and other causes of low statistical power, particularly low allele frequencies. However these data also raise the question of how many previously reported associations may have been false positives. In some cases an authentic association may fail to replicate because the effect size was overestimated in initial reports (‘winner's curse’); because the frequency of the causal variant varies between populations; because LD between the marker SNP and the causal variant varies between populations; or because the effect is complex, for example, due to allelic heterogeneity or epistasis. These issues are currently being addressed by the MalariaGEN consortium in a multicenter study across 11 different malaria-endemic populations4.
From the above analyses it is clear that in the Gambian population the Affymetrix 500K array may fail to detect authentic resistance loci with weak effects, and that even strong genetic determinants may give relatively weak GWA signals. In the following analysis we focus primarily on GWA signals with P-values <10−4, although it will be important to follow up weaker GWA signals in future work.
We identified 19 regions of the genome in which there were ≥2 SNPs with P < 10−4 or a single SNP with P < 10−6 (Table 1). Three regions other than HBB had signals of P < 10−6. Chromosome 2q37 had five SNPs at P < 10−4 with a minimum of P = 5.1×10−7 in a recessive model: the closest genes are SPATA3, encoding a spermatogenesis-associated protein; LOC257407, encoding a hypothetical protein; PSMD1, encoding a proteasome subunit; and GPR55, encoding a G protein–coupled receptor. Chromosome 5p12 had three SNPs at P < 10−4 with a minimum of P = 4.5×10−7 in a heterozygous advantage model: this region has a number of genes encoding proteins of unknown function. Chromosome 14q21 had three SNPs at P < 10−4 with a minimum of P = 6.5×10−7 in a dominant model: this area seems to be a gene desert.
Signal plots for each of the regions listed in Table 1 are shown in Supplementary Figure 5 online. In addition to SNPs that were directly genotyped, Supplementary Figure 6 online shows imputed genotypes in each of these regions, with the caveat that imputation is based on the HapMap Yoruba reference panel and this can give erroneous results when applied to the Gambian population, as we show above for HbS.
Severe malaria is composed of several overlapping clinical entities, notably cerebral malaria and severe malaria anemia23,24, and there may be genetic effects that are specific for a particular subphenotype. We carried out separate principal component–corrected analyses on the 758 cases with cerebral malaria and 297 cases with severe malaria anemia (97 subjects had both cerebral malaria and severe malaria anemia). Results of the subphenotype association tests are included in Supplementary Table 3. Acknowledging the limitation of low sample size, there is an indication that some GWA signals may be specific for cerebral malaria. They include two signals of P ~ 10−5, one close to CAMTA1 (encoding calmodulin-binding transcription activator) and another within RYR2 (encoding a ryanodine receptor involved in calcium-dependent signaling). These loci are particularly notable in view of the importance of calcium signaling in cerebral function.
We conducted a replication study in an independent sample of 1,087 severe malaria cases and 2,376 controls for a number of loci containing genes of interest (Table 2) and for all of the regions listed in Table 1. The replication study used the Sequenom iPlex platform: at some loci it was not possible to type the SNP with the strongest association in the GWA study by this method. Analysis of the replication data was stratified by self-reported ethnicity and was integrated with the PCA-corrected GWA analysis using a weighted combination of log odds ratios. The combined sample after application of quality control filters to the GWA data comprised 2,045 cases and 3,758 controls.
At the HBB region on chromosome 11p15.4, a trend test for rs11036238 had OR of 0.61 (P = 3.9 × 10−7) in the GWA study, 0.65 (P = 6.8 × 10−7) in the replication study and 0.63 (95% CI = 0.55–0.72, P = 3.7 × 10−11) in the combined sample.
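One common way to combine odds ratios from two studies is inverse-variance weighting on the log scale (a fixed-effect meta-analysis). The sketch below applies that approach to the rs11036238 estimates quoted above. It is an approximation under stated assumptions: the weights actually used in the study are not given here, and the standard errors are back-calculated from the reported ORs and P values as if each P value came from a two-sided Wald test. Function names are hypothetical.

```python
import math
from statistics import NormalDist


def se_from_or_and_p(odds_ratio, p_two_sided):
    """Back-calculate the standard error of log(OR), assuming a Wald test."""
    z = -NormalDist().inv_cdf(p_two_sided / 2)
    return abs(math.log(odds_ratio)) / z


def combine_fixed_effect(estimates):
    """Inverse-variance weighted mean of log odds ratios.

    estimates: list of (odds_ratio, standard_error_of_log_or) pairs.
    Returns (combined_or, lower_95, upper_95).
    """
    weights = [1.0 / se ** 2 for _, se in estimates]
    log_or = sum(w * math.log(o) for (o, _), w in zip(estimates, weights)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    return (math.exp(log_or),
            math.exp(log_or - 1.96 * se),
            math.exp(log_or + 1.96 * se))


# ORs and P values for rs11036238 as quoted in the text
gwa = (0.61, se_from_or_and_p(0.61, 3.9e-7))
replication = (0.65, se_from_or_and_p(0.65, 6.8e-7))
combined_or, lower, upper = combine_fixed_effect([gwa, replication])
print(f"combined OR = {combined_or:.2f} (95% CI {lower:.2f}-{upper:.2f})")
```

Because the replication estimate has the slightly smaller back-calculated standard error, it receives slightly more weight, and the combined OR falls between the two study estimates. Exact agreement with the reported confidence interval and P value should not be expected, since the published analysis was stratified by ethnicity and corrected for principal components.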
A number of newly identified loci that showed notable associations in the combined sample are described in the Supplementary Note online. The strongest effects were observed on chromosome 17p13 (rs6503319: trend test OR = 1.21, 95% CI 1.12–1.31, P = 7.2 × 10−7) close to SCO1, which encodes a protein involved in cytochrome oxidase function; and on chromosome 7p12.2 (rs1451375: dominant model OR = 0.75, 0.66–0.85, P = 6 × 10−6; and rs7803788, model OR = 0.76, 0.68–0.85, P = 2.4 × 10−6) intronic to DDC, encoding dopa decarboxylase, which is involved in dopamine and serotonin synthesis. Notably, DDC has also been linked to malaria refractoriness in mosquitoes25. Although these findings are of considerable interest, they cannot be regarded as conclusive until they have been replicated at multiple study sites. Below we discuss the challenges of multicenter replication studies in African populations.
The application of GWA analysis to populations in Africa could provide fundamental insights into resistance to infectious disease and the genetic origins of common diseases26. The main conclusion of the current study is that this will require a different methodological approach than that used for GWA analysis in European or Asian populations. GWA analysis proceeds in three stages: first, discovery of regions of the genome with significant associations; second, multi-center replication studies; and third, fine mapping of causal variants. Here we consider each of these stages of analysis, and how they might be effectively implemented in Africa.
At the first stage of GWA analysis, screening many SNPs across the genome, a stringent threshold for statistical significance is used to reduce false-positive rates. There is theoretical debate about where to set this threshold but take, for example, the level of P < 5 × 10−7 used by the Wellcome Trust Case Control Consortium in the British population9,27,28. In practice it is difficult to achieve this threshold in Africa, because of weak LD between the marker SNPs that are genotyped and causal variants. We found, for example, that the GWA signal at the HbS locus barely reaches this level of significance, although the causal variant is common and confers a tenfold reduction in risk of severe malaria. It is likely that there are many authentic loci with modest genetic effects that are several orders of magnitude below this level of significance, as we have shown here for the ABO locus.
Multipoint imputation offers a potential solution to this problem. We found that the GWA signal around the HbS variant can be boosted by several orders of magnitude by imputation, from P = 3.9 × 10−7 to 4.5 × 10−14. Although GWA signals are often increased by imputation, this is a much larger effect than is commonly observed in European populations. It reflects the fact that no individual marker SNP is strongly correlated with the HbS causal variant but that it is possible to gather much additional information about the causal variant from haplotypic combinations. The ability of multipoint imputation to boost a GWA signal is potentially greatest in situations where there is no single marker SNP that is strongly correlated with the causal variant, and imputation could therefore be of particular importance in Africa.
Multipoint imputation cannot work effectively without accurate data on sequence variation and haplotype structure. This is critical in situations of low LD, where a significant proportion of common variants have no strong marker SNP, as is the case for HbS, and important genetic effects might be missed if these variants are not genotyped, or accurately imputed, at the first stage of GWA analysis. The genetic diversity found across Africa increases the imperative for the data underpinning imputation to be population specific. Population-specific data are needed because sequence variants and haplotypes can be highly localized within Africa; for example, the malaria resistance factor hemoglobin C (HbC) is found in some parts of West Africa but not others29,30.
Development of an optimal genome-wide SNP genotyping platform for use in Africa would help to strengthen the signals of association that are directly observed at the first stage of GWA analysis, as well as increase the accuracy of imputation. Although there have been steady improvements in genotyping platforms since the start of the present study, there is no platform that achieves genome coverage in Africa that is close to that in Europe31-34; indeed, the number of SNPs needed to achieve this cannot be precisely estimated until more complete sequencing information from African populations is available35. From initial HapMap data it was estimated that 1 SNP per 2 kb in Africa would give approximately the same tagging efficiency as 1 SNP per 5 kb in Europe, but as more information emerges about sequence variation in Africa, the estimated number of SNPs required to tag common variants is tending to increase5,35. The challenge is to determine the optimal number of genotyped SNPs that, when combined with genome-wide resequencing data from a representative sample of the same population, would allow accurate imputation of all common variants.
The second stage of GWA analysis is to replicate signals of association in large multicenter studies. The problem is that replication of association at multiple locations depends on the allele frequency of the marker SNP and the causal variant, as well as the LD between the marker SNP and the causal variant, being relatively constant across locations. The high extent of genetic diversity across Africa therefore creates uncertainty about whether associations with marker SNPs will replicate at different locations, even if there is a true causal variant. For example, we find that patterns of LD between the HbS causal variant and surrounding SNPs vary to such an extent between the Gambian and Yoruba populations that none of the GWA associations found at the HbS locus in the present study would be expected to replicate across West Africa. Associations with marker SNPs may also fail to replicate when the causal variant differs in frequency between locations.
It is therefore difficult to design effective multicenter replication studies without information about sequence variation and haplotype structure in the relevant African populations. However, because the main problem is the variable relationship between marker SNPs and causal variants, the problem of replication becomes greatly simplified once a shortlist of potential causal variants has been identified. This is an additional reason to carry out high-resolution multipoint imputation at the first stage of GWA analysis, as it allows putative causal variants to be tested directly in different populations.
The third stage of GWA analysis is to identify causal variants. In European populations it can be extremely difficult to distinguish causal variants from nonfunctional polymorphisms that are in strong LD with causal variants. This study provides proof of principle that, under the conditions of low LD found in Africa, it is possible to uniquely identify a causal variant at the first stage of GWA analysis by multipoint imputation, based on deep sequencing data that are population specific. Thus, although a considerable amount of work is needed to overcome the problem of low LD at the initial stages of GWA analysis in Africa, a byproduct of this work is that it might be possible to proceed relatively rapidly from a GWA scan and replication studies to the identification of causal variants. The major limiting factor, at all stages of GWA analysis in Africa, is the need for population-specific data on genome sequence variation. In the near future, this limiting factor will be overcome by advances in genome sequencing technologies, through initiatives such as the 1000 Genomes Project.
The study sample comprised 1,060 cases and 1,500 controls from a mixed urban and rural area of approximately 400 square miles in the Kombos region of The Gambia. Cases were children admitted to hospital with severe malaria: they had a median age of 4.3 years and 18% had a fatal outcome. Controls were newborns recruited from routine births at local health clinics. The control data were shared with a GWA study of tuberculosis (unpublished data). We could theoretically achieve a modest increase in power by selecting controls that had gone through childhood without developing severe malaria, but past medical history is difficult to ascertain with confidence in this population.
Severe malaria is made up of several overlapping clinical entities23,24,36. The cases analyzed here included 82% with cerebral malaria, 30% with severe malarial anemia and 11% with respiratory distress. By estimating the protective effect of the HbAS genotype in this study sample we can exclude high rates of diagnostic misclassification, which can arise when other severe diseases mimic the clinical features of severe malaria: we found ORs of 0.12 (95% CI = 0.07–0.21) for cerebral malaria, 0.10 (0.04–0.24) for severe malaria anemia, 0.08 (0.02–0.38) for respiratory distress and 0.09 (0.05–0.16) for severe malaria in general. These analyses are stratified for self-reported ethnic group using the Mantel-Haenszel test. Because the control group was chosen to represent all births, and because only a minority of children develop severe malaria, these ORs can be viewed as an estimate of relative risk in the general population.
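For a stratified analysis of this kind, the Mantel-Haenszel pooled odds ratio is the ratio of two sums taken over strata: Σ(a·d/n) divided by Σ(b·c/n), where a and b are exposed and unexposed cases, c and d are exposed and unexposed controls, and n is the stratum total. The sketch below shows that estimator only (not the associated test or confidence interval); the per-stratum counts are hypothetical and do not come from the study.

```python
def mantel_haenszel_or(strata):
    """Pooled odds ratio across strata.

    Each stratum is a tuple (exposed_cases, unexposed_cases,
    exposed_controls, unexposed_controls).
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator


# Hypothetical counts per ethnic group, with "exposed" meaning HbAS genotype
strata = [
    (5, 100, 50, 100),  # group 1: stratum OR = (5 * 100) / (100 * 50) = 0.1
    (2, 80, 20, 80),    # group 2: stratum OR = (2 * 80) / (80 * 20) = 0.1
]
pooled = mantel_haenszel_or(strata)
print(f"Mantel-Haenszel OR = {pooled:.2f}")
```

When every stratum has the same odds ratio, as in this constructed example, the pooled estimate equals that common value; when strata differ, larger strata contribute more to the estimate.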
The initial sample set consisted of 1,060 cases of severe malaria and 1,500 controls from the Kombos region of The Gambia, West Africa. Cases of severe malaria, defined essentially according to WHO criteria36, were recruited on admission to the Paediatric Department of the Royal Victoria Hospital in Banjul after obtaining informed consent. The main forms of severe malaria in Gambian children are cerebral malaria, severe malaria anemia and respiratory distress23,24. In this study we define cerebral malaria as a Blantyre coma score37 of ≤3, persisting for >30 min after cessation of a transient seizure or after correction of hypoglycaemia, in a child with asexual forms of P. falciparum on blood film and no other evident cause of coma. Severe malaria anemia is here defined as packed cell volume ≤15%, or hemoglobin ≤5 g/dl with asexual forms of P. falciparum on blood film. Controls were cord blood samples obtained from routine births at local health clinics in the Kombos region after obtaining informed consent. The study was approved by The Gambia Government/Medical Research Council Joint Ethics Committee and by the Oxford Tropical Research Ethics Committee. Ethics approval documents and informed consent forms are available upon request.
The program IMPUTE10 was used to infer the genotypes of SNPs in the Yoruba panel of the International HapMap Project. To avoid SNPs with genotyping errors in the Gambian study, we used only SNPs on the Affymetrix platform for the Gambian data with minor allele frequency >1% and missingness <5% in both case and control cohorts, and with Hardy-Weinberg equilibrium P ≥ 10−7 in controls as the input for IMPUTE. The results from IMPUTE were used in our association study in two ways: (i) the same spectrum of association tests was conducted at each imputed SNP where, for each individual, the genotype corresponding to the maximum posterior probability is assigned unless this probability is <0.9, in which case a missing genotype is assigned instead; (ii) the same spectrum of association tests was conducted at each imputed SNP where the posterior probabilities of the calls are used to average over the uncertainty in the inference of the genotypes. In the former analysis, only imputed SNPs that satisfy the genome-wide SNP quality control criteria (see ‘SNP filters’ in Supplementary Methods online) were considered; in the latter analysis, only imputed SNPs with relative statistical information >0.5 were considered.
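The two ways of using imputation output can be illustrated on a single individual's posterior genotype probabilities. In the sketch below, approach (i) is the thresholded best-guess call described in the text. For approach (ii), the expected allele dosage is shown as one simple way of carrying the uncertainty forward; this is an assumption, since the exact form of the probability-weighted test is not specified here. Function names and probabilities are hypothetical.

```python
def best_guess(posteriors, threshold=0.9):
    """Approach (i): most probable genotype (0, 1 or 2 copies of the coded allele),
    or None (missing) if its posterior probability is below the threshold."""
    genotype = max(range(3), key=lambda g: posteriors[g])
    return genotype if posteriors[genotype] >= threshold else None


def expected_dosage(posteriors):
    """Approach (ii), simplified: expected number of copies of the coded allele."""
    return posteriors[1] + 2 * posteriors[2]


# Hypothetical posterior probabilities for genotypes (0, 1, 2 copies)
confident = (0.95, 0.04, 0.01)
uncertain = (0.50, 0.45, 0.05)

for name, post in (("confident", confident), ("uncertain", uncertain)):
    print(name, best_guess(post), round(expected_dosage(post), 2))
```

The uncertain individual is dropped under approach (i) but still contributes partial information under approach (ii), which is the practical difference between thresholding calls and averaging over the posterior.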
The beta-globin region was sequenced in 62 randomly selected Gambian control individuals who had previously been genotyped on the Affymetrix 500K array. The sequenced region spans 110 kb from 5,179,297 to 5,289,530, and encompasses all five beta-globin genes (HBB, HBD, HBE1, HBG1, HBG2) and an olfactory receptor (OR51B1). To avoid false positives introduced by sequencing errors and as the underlying interest here is in detecting malarial association with common polymorphisms, we only define a polymorphic marker if the minor allele frequency of this marker exceeds 3% or, equivalently, if at least 4 minor alleles out of the 124 chromosomes are observed. On the basis of this definition we identified 202 SNPs in the sequenced region (66 of which were previously unknown SNPs), although we expect the true number to be higher.
To avoid edge effects in haplotype phasing and imputation, the data for each sequenced sample were extended by including SNPs from the Affymetrix array flanking both ends of the sequenced region, creating a 1-Mb region centered on rs334 (at 5,204,808 in build 35) from 4,705,000 to 5,705,000 spanning 453 SNPs in total. Haplotypes for this 1-Mb region were constructed using fastPHASE38, and were subsequently used as the reference panel for imputation using IMPUTE10.
Genotypes for all 453 SNPs in this 1-Mb region were imputed across the remaining 1,325 controls and all 958 cases. We carried out tests for malaria association on all the imputed SNPs and the results from the trend test are shown in Figure 5 as red dots. Association results for SNPs on the Affymetrix 500K array are shown in black. Rather than thresholding the imputed calls with a missingness filter, the association test averages over the imputation uncertainty by using the genotype posterior probabilities from the imputation. As before, the first three principal components estimated from the array genotypes using EIGENSTRAT were included as covariates in the test for association.
We thank the Gambian children and their parents and guardians who made this study possible; and the doctors, nurses and fieldworkers at the Royal Victoria Hospital, Banjul and other health clinics who assisted with this work. MalariaGEN's primary funding is from the Wellcome Trust (grant number 077383/Z/05/Z) and from the Bill & Melinda Gates Foundation, through the Foundation for the National Institutes of Health (grant number 566) as part of the Grand Challenges in Global Health initiative. The Wellcome Trust (Sanger Institute core funding) and the Medical Research Council (grant number G0600230) provide additional support for genotyping, bioinformatics and analysis. The MalariaGEN Resource Centre is part of the European Union Network of Excellence on the Biology and Pathology of Malaria Parasites.
Note: Supplementary information is available on the Nature Genetics website.
Reprints and permissions information is available online at http://npg.nature.com/reprintsandpermissions/