A high-throughput method that combines allele frequency determination by Pyrosequencing with a mathematical model was developed to estimate the MSP-119 haplotypes present in mixed malaria infections. After adjustment to a standard curve, Pyrosequencing yields accurate and precise estimates of the relative frequency of alleles in mixed infections. The haplotype-estimating algorithm uses maximum likelihood methods to determine the most probable combination of haplotypes, given the allele frequencies for an infection and the haplotypes known to be circulating in the population. It provides accurate estimates of the haplotypes present in infections with a lower multiplicity of infection (MOI; ≤3 types). For infections with a higher MOI (≥4 types), the algorithm gives statistically reasonable, but less accurate, estimates. The reduced accuracy at high MOI is primarily due to the inability of the algorithm to choose between several haplotype combinations with similar likelihoods.
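The idea of choosing the most probable haplotype combination for a set of allele frequencies can be illustrated with a minimal sketch. It is not the algorithm used in this study: the Gaussian measurement-error model, the value of `sigma`, the grid search over proportions, and the four three-SNP haplotypes `A` to `D` are all assumptions made for illustration, and every function name is hypothetical.

```python
from itertools import combinations, product
import math

# Hypothetical haplotypes over three biallelic SNPs (1 = alternative allele).
KNOWN_HAPLOTYPES = {
    "A": (0, 0, 0),
    "B": (1, 0, 1),
    "C": (0, 1, 1),
    "D": (1, 1, 0),
}

def predicted_freqs(haplotypes, proportions):
    """Expected alternative-allele frequency at each SNP for a haplotype mixture."""
    n_snps = len(haplotypes[0])
    return [sum(p * h[i] for h, p in zip(haplotypes, proportions))
            for i in range(n_snps)]

def log_likelihood(observed, predicted, sigma=0.05):
    """Log-likelihood under an ASSUMED independent Gaussian measurement error."""
    norm = math.log(sigma * math.sqrt(2 * math.pi))
    return sum(-0.5 * ((o - p) / sigma) ** 2 - norm
               for o, p in zip(observed, predicted))

def proportion_grid(k, steps=20):
    """Yield k-tuples of positive proportions summing to 1, in units of 1/steps."""
    for head in product(range(1, steps), repeat=k - 1):
        rest = steps - sum(head)
        if rest >= 1:
            yield tuple(c / steps for c in head) + (rest / steps,)

def rank_combinations(observed, known, max_moi=3, steps=20):
    """Rank every haplotype combination up to max_moi by its best log-likelihood."""
    results = []
    for k in range(1, max_moi + 1):
        for names in combinations(sorted(known), k):
            haps = [known[n] for n in names]
            best_ll, best_props = max(
                ((log_likelihood(observed, predicted_freqs(haps, props)), props)
                 for props in proportion_grid(k, steps)),
                key=lambda item: item[0],
            )
            results.append((best_ll, names, best_props))
    results.sort(key=lambda item: item[0], reverse=True)
    return results

# Allele frequencies consistent with a 70% A / 30% B infection.
ranked = rank_combinations([0.3, 0.0, 0.3], KNOWN_HAPLOTYPES)
best_ll, best_names, best_props = ranked[0]
print(best_names, best_props)
```

Inspecting the first few entries of `ranked` rather than only the top one shows how close the competing combinations are, which is the situation that becomes common at high MOI.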
Because MSP-119 is highly conserved, measures of MOI based on this locus are likely to be lower than those based on more polymorphic loci (e.g. MSP-1 block 2 or MSP-2). An algorithm that is most accurate at lower MOI may therefore be acceptable. In Mali, the vast majority of samples have low MOI (≤3 types) based on this locus. In 24 infections from six infants living in a high transmission area of western Kenya, the largest number of MSP-119 haplotypes observed in an infection was two; however, the largest number of clones picked per sample was four, and it is possible that higher MOIs would have been observed had more clones been picked [11]. At a population level, with large sample sizes, the inaccurate estimation of some haplotypes in a small number of high MOI infections is not likely to be statistically relevant. When individual histories are of interest, it may be possible to fine-tune the algorithm to estimate high MOI infections more accurately by using information about the haplotypes present in the low MOI infections that precede and follow the high MOI infection to choose the "best" answer out of several statistically "good" answers.
Identifiability problems will also increase as the number of circulating haplotypes in a population increases. Therefore, in areas of high transmission where there may be more circulating haplotypes, it may be necessary to restrict the algorithm to the most common haplotypes. With this restriction, the algorithm should be able to resolve most infections, but it will be unable to resolve infections that contain rare haplotypes. These rare haplotypes can then be identified using other methods such as PCR cloning.
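The growth of the identifiability problem can be made concrete with a short sketch. The function names and the two-SNP haplotypes are hypothetical; the only inputs are the number of circulating haplotypes and the maximum MOI considered.

```python
from math import comb

def candidate_combinations(n_haplotypes, max_moi):
    """Number of distinct haplotype sets of size 1..max_moi the algorithm must compare."""
    return sum(comb(n_haplotypes, k) for k in range(1, max_moi + 1))

def allele_freqs(mixture):
    """Alternative-allele frequency per SNP for a list of (haplotype, proportion) pairs."""
    n_snps = len(mixture[0][0])
    return [sum(p * h[i] for h, p in mixture) for i in range(n_snps)]

# The candidate set grows quickly with the number of circulating haplotypes.
for n in (5, 10, 20):
    print(n, candidate_combinations(n, max_moi=3))

# Two different infections that produce identical allele frequencies,
# so allele frequencies alone cannot distinguish them.
mix_cis = [((0, 0), 0.5), ((1, 1), 0.5)]
mix_trans = [((0, 1), 0.5), ((1, 0), 0.5)]
print(allele_freqs(mix_cis) == allele_freqs(mix_trans))
```

Restricting the haplotype list to the most common types shrinks the first number directly, and also removes some of the alternative mixtures that would otherwise tie with the true one.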
Similar expectation-maximization methods have been used to estimate haplotype frequencies in diploid human populations [20] and in pooled human DNA [22]. The expectation-maximization (EM) algorithm developed by Excoffier and Slatkin uses maximum likelihood methods to determine the most probable haplotype assignment given the observed sample genotypes and the estimated population haplotype frequencies (under the assumption of Hardy-Weinberg equilibrium). This method works best for large sample sizes, and uses several sets of starting conditions to avoid convergence on local maxima [20]. Stephens and colleagues use a Bayesian method to reconstruct haplotypes based on both the likelihood and an a priori assumption that unresolved haplotypes tend to be similar to known haplotypes [21]. The EM algorithm has recently been applied to resolving haplotypes in pooled human DNA samples [22]. As with the algorithm described in this study, haplotype estimation in pooled human DNA samples is most accurate when the pool consists of fewer individuals. Ito et al. achieved the most accurate estimates with pools containing fewer than four individuals [22], while Quade et al. achieved accurate estimates for up to ten pooled samples (using only two alleles at two loci) [23]. These studies indicate that lack of identifiability in samples with larger numbers of haplotypes is a common limitation of these types of algorithms.
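For readers unfamiliar with the EM approach to unphased diploid genotypes, the following is a textbook-style sketch under Hardy-Weinberg equilibrium. It is not the implementation of any of the cited studies; the genotype coding (count of the alternative allele per locus), the function name, the `start` parameter, and the example data are assumptions made for illustration.

```python
from itertools import combinations_with_replacement, product

def em_haplotype_frequencies(genotypes, n_loci, n_iter=100, start=None):
    """Estimate haplotype frequencies from unphased genotypes by EM.

    genotypes: tuples with the alternative-allele count (0, 1 or 2) per locus.
    start: optional dict of starting frequencies; uniform if omitted.
    """
    haplotypes = list(product((0, 1), repeat=n_loci))
    freqs = dict(start) if start else {h: 1 / len(haplotypes) for h in haplotypes}
    for _ in range(n_iter):
        counts = {h: 0.0 for h in haplotypes}
        for g in genotypes:
            # Unordered haplotype pairs compatible with this genotype.
            pairs = [(a, b)
                     for a, b in combinations_with_replacement(haplotypes, 2)
                     if tuple(x + y for x, y in zip(a, b)) == tuple(g)]
            # E-step: Hardy-Weinberg probability of each compatible pair.
            weights = [freqs[a] * freqs[b] * (1 if a == b else 2) for a, b in pairs]
            total = sum(weights)
            if total == 0:
                continue  # genotype impossible under current frequencies
            for (a, b), w in zip(pairs, weights):
                counts[a] += w / total
                counts[b] += w / total
        # M-step: renormalize expected haplotype counts.
        n = sum(counts.values())
        freqs = {h: c / n for h, c in counts.items()}
    return freqs

# Hypothetical sample: only the double heterozygotes (1, 1) are ambiguous.
sample = [(0, 0)] * 6 + [(2, 2)] * 4 + [(1, 1)] * 2
estimated = em_haplotype_frequencies(sample, n_loci=2)
print(estimated)
```

Passing different `start` dictionaries and comparing the results is one simple way to check for convergence on a local maximum, in the spirit of the multiple starting conditions mentioned above.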
The accuracy of Pyrosequencing allele quantification can be affected by several factors, including the presence of an "A" allele in the SNP and flanking bases that are identical to one or the other of the alternative alleles in the SNP (i.e. homopolymer formation). Given the A/T rich genome of Plasmodium, four out of six SNPs in MSP-119 contain an "A" allele. In addition, five of the six SNPs in the 19 kDa region form homopolymers with flanking bases. Therefore, it is important to adjust the allele frequencies to a standard curve to improve accuracy. However, since allele frequencies from replicate runs of the same sample on different days did not differ significantly, one standard curve can be used to adjust all the data (as opposed to generating a curve every day the assay is run).
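Adjustment to a standard curve can be sketched as follows. The linear form of the curve, the function names, and the calibration values are assumptions for illustration only; the actual curve may have a different shape and would be fitted per SNP from mixtures of known composition.

```python
def fit_standard_curve(expected, observed):
    """Least-squares line observed = intercept + slope * expected (ASSUMED linear)."""
    n = len(expected)
    mean_x = sum(expected) / n
    mean_y = sum(observed) / n
    sxx = sum((x - mean_x) ** 2 for x in expected)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(expected, observed))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def adjust(raw_frequency, slope, intercept):
    """Invert the standard curve and clip the result to the range [0, 1]."""
    corrected = (raw_frequency - intercept) / slope
    return min(1.0, max(0.0, corrected))

# Hypothetical calibration mixtures: true vs. measured frequency of one allele.
true_freqs = [0.0, 0.25, 0.5, 0.75, 1.0]
measured = [0.05, 0.275, 0.5, 0.725, 0.95]

slope, intercept = fit_standard_curve(true_freqs, measured)
print(adjust(0.275, slope, intercept))
```

Because replicate runs on different days did not differ significantly, the fitted `slope` and `intercept` could be stored once and reused for all samples rather than refitted for each run.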
Several methods have been used to determine allele frequencies in mixed malaria infections, including PCR cloning [11], real-time quantitative PCR (RTQ-PCR) [25], and proportional sequencing [26]. All of these methods, including Pyrosequencing, have advantages and disadvantages. PCR cloning gives definitive haplotypes; however, it is the most time-consuming and expensive of the methods, which significantly limits the number of samples that can feasibly be analyzed. In addition, because Plasmodium often uses codons different from those used by the competent bacteria used in cloning, not all sequences can be cloned efficiently. RTQ-PCR is more sensitive than Pyrosequencing at detecting very low frequency alleles (<5%); however, it has a lower throughput and requires more optimization than Pyrosequencing. Like RTQ-PCR, Pyrosequencing assays are designed to detect known polymorphisms. Methods that rely on sequencing an entire region or gene of interest (e.g. PCR cloning) are better for detecting new SNPs. Proportional sequencing is a method that estimates allele frequencies in mixed infections by measuring the peak heights in direct sequencing electropherograms [26]. Although this method has applications and accuracy similar to those of Pyrosequencing, it is more expensive and has a lower throughput [26]. Because Pyrosequencing sequences short stretches of nucleotides (10–20 bp), for certain very polymorphic loci (e.g. domain I of P. falciparum apical membrane antigen-1, another vaccine candidate antigen), it is not possible to place a sequencing primer every 20 bp. In this instance, proportional sequencing may be more appropriate. If MSP-119 haplotypes are of interest, allele frequencies from any of these methods can be used with the haplotype-estimating algorithm described here.
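Because the haplotype-estimating algorithm needs only per-SNP allele frequencies, output from the other methods can be reduced to the same form. The sketch below shows two such conversions; the function names and example values are hypothetical, and the simple peak-height ratio is a naive assumption rather than the normalization used in the cited proportional sequencing method.

```python
def freqs_from_clones(clone_haplotypes):
    """Alternative-allele frequency per SNP from a list of cloned haplotypes (0/1 tuples)."""
    n_clones = len(clone_haplotypes)
    n_snps = len(clone_haplotypes[0])
    return [sum(h[i] for h in clone_haplotypes) / n_clones for i in range(n_snps)]

def freq_from_peaks(height_alt, height_ref):
    """Naive allele frequency from two electropherogram peak heights at one SNP."""
    total = height_alt + height_ref
    if total <= 0:
        raise ValueError("peak heights must sum to a positive value")
    return height_alt / total

# Four hypothetical clones typed at two SNPs.
clone_freqs = freqs_from_clones([(1, 0), (1, 1), (0, 0), (1, 0)])

# Hypothetical peak heights for the alternative and reference alleles.
peak_freq = freq_from_peaks(300, 900)
print(clone_freqs, peak_freq)
```

Frequencies derived from a small number of clones are coarse (steps of one over the number of clones picked), which is worth keeping in mind when they are compared with Pyrosequencing estimates.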
The cost of equipment for Pyrosequencing is similar to that for standard DNA sequencing, which is now done in several sub-Saharan African countries, including Mali. Pyrosequencing may be suitable for other applications such as typing known single nucleotide polymorphisms in parasite genes that serve as molecular markers for drug resistant malaria.