The development and application of high-throughput genotyping methodologies for the malaria mosquito
Anopheles gambiae depends upon the identification of SNP markers. We have resequenced approximately 0.12% of the
An. gambiae genome in geographically diverse pools of
An. gambiae M- and S-forms, identifying 6,995 SNPs and 31 indels that could be mapped, and 36 indels that could not be precisely mapped (additional indels were inferred but could not be precisely identified or positioned due to their effect on sequence quality). Of the SNPs we identified, only 10% had been identified previously from sequencing of the PEST strain genome. This suggests that the sequencing of this strain has dramatically underestimated the true SNP frequency in
An. gambiae. Similarly, Morlais
et al., in sequencing of 3 lab strains (Yaoundé, L35, 4arr), found 324 SNPs in 26 loci (total 17 kbp) [
11]. Only 39% of these SNPs had been predicted by Ensembl (although Ensembl records an additional 42 not observed by Morlais
et al. [
11]).
By sequencing the gDNA of pooled individuals we substantially reduced the cost of the resequencing programme. Through comparison of allele frequencies estimated from pooled DNAs with those obtained from sequencing of individual templates it is apparent that pooling of template DNAs yields relatively accurate allele frequency estimates and a very low rate of false positives. Many low frequency SNPs that were identified through sequencing of individual DNA samples were missed in the sequencing of pooled templates. However, since low frequency SNPs perform poorly in detection of linkage disequilibrium [
29] this is unlikely to be problematical when identifying SNPs suitable for use in association mapping studies. Though essentially qualitative, our SNP confidence scores proved valuable predictors of false positive rates, and should be considered when choosing from the SNPs we have identified, noting that SNPs with category 3 confidence scores are much less likely to be truly polymorphic than those with confidence scores 1 and 2. In summary, pooling of gDNA templates provided a useful technique in permitting analysis of polymorphism at a large number of genes in a total of 20 individuals (as two pools of 10 each), at one tenth of the cost of individual sequencing. If cost-reduction is not a major consideration and/or if detection of low frequency polymorphisms is a primary concern, sequencing of individual templates or the use of a next generation technology, such as 454 pyrosequencing (454 Life Sciences), with pooled PCR products would be a preferred approach.
Nucleotide diversity estimates in our study are comparable to those obtained in other studies of
An. gambiae [
10,
11,
30,
31] or other mosquitoes [
32] (Table 1), particularly those employing similar sample sizes [
10,
11]. Indeed, the only study recording much lower diversity [
30] involves either extremely small sample sizes or loci in a known area of low recombination (Table 1). Notably, we observed the same pattern as Cohuet et al. [
11] with respect to X-chromosome diversity: even allowing for the smaller effective population size of the X chromosome (3/4 that of the autosomes), nucleotide diversity is low. However, we did not observe the dramatically lower diversity in X chromosome divisions 5 and 6 than in divisions 1–4 reported by Stump et al. [
30]. We suspect that the greater degree of mixing of distinct populations in our study might reconcile these findings, since slowly recombining regions will tend toward loss of diversity within, and increased differentiation among, populations. Mixing of populations will thus have a proportionately greater impact in such a genomic area since differentiation will inflate measures of diversity.
| Table 1. Estimates of nucleotide diversity in mosquitoes, obtained from different source populations, numbers of loci sequenced (N loci) and sample sizes (N). |
Polymorphism estimates based upon nucleotide diversity are less informative than the frequency of segregating sites for the design of high-throughput assays, where variable bases close to the SNP of interest can affect assay design and should therefore be avoided. On average we find a segregating site every 34 bp, a figure consistent with previous estimates from mosquitoes. Apart from the aforementioned exceptional figures associated with centromeres or a small sample, estimates of segregating site frequency for the studies cited in Table 1 range from 1 SNP per 29 bases to 1 SNP per 48 bases. The problems for assay design resulting from this high SNP frequency will frequently be exacerbated because SNPs show a clustered distribution. Unrecognised non-target SNPs in probe-binding sites can appear as null alleles in Illumina analyses [
33,
34]. Whilst their effects on the use of Affymetrix Genechips for genotyping are unknown, non-target SNPs are detrimental to gene expression profiling on this platform [
35,
36]; it is reasonable to assume they may also negatively affect genotyping accuracy. In addition to the impact of high SNP density, the effect of multiallelic SNPs must also be recognised for probe design. Multiallelic SNPs will also pose difficulties for genotyping with multiplex genotyping platforms as null alleles will be scored. Although null alleles can be recognised with some platforms, and controlled for [
33,
34], they could be problematical where not anticipated.
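To give a rough sense of scale for the assay design problem, the following is a minimal sketch, not a method used in this study, of how often a probe-binding window would be free of secondary SNPs. It assumes each base is independently polymorphic at the mean density, which is a simplification given the clustering noted above, so the figures are only indicative. The function name `prob_snp_free` and the 50 bp window are hypothetical; the 34 bp spacing is the average reported here, and the 25 bp window and 250 bp human spacing are the figures quoted elsewhere in this discussion.

```python
def prob_snp_free(window_bp: int, bp_per_snp: float) -> float:
    """Probability that a window of window_bp bases contains no segregating site.

    Assumes every base is independently polymorphic with probability
    1 / bp_per_snp (no clustering), so this is an approximation only.
    """
    if window_bp < 0:
        raise ValueError("window_bp must be non-negative")
    if bp_per_snp <= 1:
        raise ValueError("bp_per_snp must be greater than 1")
    per_base = 1.0 / bp_per_snp
    return (1.0 - per_base) ** window_bp


if __name__ == "__main__":
    # Mean spacing between segregating sites, in bp.
    densities = {
        "An. gambiae (this study)": 34.0,
        "human (Ensembl 50)": 250.0,
    }
    # 25 bp matches the short probe length mentioned in the text;
    # 50 bp is an arbitrary longer window chosen for comparison.
    for window in (25, 50):
        for label, spacing in densities.items():
            p = prob_snp_free(window, spacing)
            print(f"{window} bp window, {label}: P(no secondary SNP) = {p:.2f}")
```

Under this simple model, fewer than half of 25 bp windows would be expected to be SNP-free at one segregating site per 34 bp, compared with roughly nine in ten at one per 250 bp.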
GoldenGate assays have, to date, been successfully applied to a variety of species, including humans, honey bee [
37], cattle [
38], spruce [
39], soybean [
40] and barley [
41]. Conversion rates of assays have been consistently high for these species, indicating that secondary polymorphisms or unrecognised multiallelic SNPs have not had a major impact on study success. However, all of these species either exhibit low polymorphism or studies were undertaken on inbred lines. For example, in the human genome, SNPs occur on average at 250 bp intervals (Ensembl 50 human genome statistics). Therefore, the high SNP frequency in
Anopheles, and the coincident effect on GoldenGate assay design, is a far more significant problem than in previous studies. Indeed, according to Illumina's assay design tool, the majority of SNPs were unsuitable for GoldenGate assay probe design.
The
Anopheles/
Plasmodium Affymetrix Genechip, which was designed for gene expression studies, rather than as a genotyping tool, has been used to study the degree of differentiation between the M and S forms [
28]. Since the probe length for this assay is shorter (25 bp) than in the Illumina GoldenGate assay, the high SNP frequency may be less problematical. However, since the array was not designed specifically for genotyping it is difficult to assess the inherent difficulties posed by the high diversity and clustering in
Anopheles for this assay. Although quantitative extrapolation of our array design experience with Illumina to other platforms is difficult, it seems clear that for
Anopheles, and probably other mosquitoes or species with high rates of genomic diversity, high throughput SNP-typing will be negatively impacted, through loss of SNPs at the design stage and/or loss of data due to null alleles at the analysis stage. Whilst somewhat speculative, it also seems likely that confident assembly of short-read fragments into contigs or onto the template of an existing genome assembly in massively parallel sequencing runs [
42] will be rendered difficult if multiple SNPs are present in many fragments. Hopefully, a more comprehensive database of segregating sites in
An. gambiae might ameliorate this problem.
In the present dataset, SNP frequencies varied both physically and according to their location within or near gene classes. As reported elsewhere [
30] and predicted by lowered recombination rates within the regions, diversity was lower toward the centromeres of autosomes and on the X chromosome. Diversity was significantly higher in loci of the cytochrome p450 mono-oxygenase and carboxy/cholinesterase (COE) families than in the glutathione-S-transferases and control loci, with a segregating site every 26 bp in the p450s and COEs compared with every 34 bp overall. This higher SNP frequency is likely to exacerbate the problems for assay design in these gene families, especially given the significant SNP clustering in this genome. High rates of variability in human p450s have been reported [
43] but higher rates of polymorphism in mosquito p450s or COEs have not been previously identified.
A higher rate of insertion of transposable elements in xenobiotic-metabolising p450s of
Drosophila (in contrast to those p450s involved in ecdysone biosynthesis and developmental regulation) results in high rates of mutability of p450s [
44] indicating that the function of such p450s is more tolerant of polymorphism. Also in
Drosophila, enzymes involved in xenobiotic metabolism exhibit a higher nonsynonymous:synonymous (
dN/
dS) ratio than the average over the dataset (ω = 0.05 compared with ω = 0.045 overall,
P = 0.011 [
45]). The higher levels of
dN/
dS for xenobiotic enzymes may indicate that the higher polymorphism levels seen in p450s and COEs reflect less stringent selection at these loci than at others, perhaps because of flexibility in function among closely related gene family members.
The high diversity in
An. gambiae is likely related to large effective population size (
Ne). Nucleotide diversity is a product of mutation rate and
Ne and the highest recorded levels of polymorphism, for the urochordate
Ciona savignyi, are thought to be due to its high
Ne [
46]. The estimates of
Ne available for
An. gambiae suggest levels of
Ne equal to a few thousand [
47,
48]. However,
Ne is notoriously difficult to estimate accurately, particularly for species that often exhibit limited genetic population structure over wide geographic scales, such as
An. gambiae. Improved
Ne estimates would help elucidate the role of
Ne in explaining the high nucleotide diversity that we, and other authors, have observed.
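As a reminder of why Ne matters here, the standard neutral expectations (textbook population genetics, not results of this study) can be written as follows. The symbol μ, the per-site mutation rate per generation, is introduced here for illustration, and an equal sex ratio is assumed for the X-linked case:

$$
\pi_{\text{autosome}} \approx 4 N_e \mu, \qquad \pi_{X} \approx 3 N_e \mu, \qquad \frac{\pi_{X}}{\pi_{\text{autosome}}} \approx \frac{3}{4}
$$

Under these assumptions, diversity scales linearly with Ne for a given mutation rate, and X-linked diversity below three quarters of the autosomal value requires an explanation beyond the difference in Ne alone.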
In
Drosophila spp. recombination rates are positively correlated with nucleotide diversity [
49-
51], especially at a fine scale [
51], although the relative roles of selection and mutation generated by recombination in underpinning the pattern are controversial [
49-
51]. In
An. gambiae, the first major study to estimate recombination rate indicated a small recombination map length of 215 cM over the 278 Mb genome, or 0.78 cM/Mb [
1]. This is lower than typical average figures of 1–4 cM/Mb for most organisms and far less than the 19 cM/Mb recorded in the honey bee [
52]. Thus, broad-scale recombination estimates in
An. gambiae do not support a relationship between diversity and recombination rate. However, a more recent survey of recombination rate along the X chromosome recorded an overall average recombination rate of 1 cM/Mb, but with dramatic variation in local rates between 0.2 and 7 cM/Mb [
53] dependent on chromosome position. Thus a link between sporadically high recombination rates – perhaps involving recombination hotspots – and high, clustered diversity could apply in
An. gambiae. Fine-scale estimates of recombination rate are now required to permit investigation of how the interplay between recombination and selection determines diversity.