|Home | About | Journals | Submit | Contact Us | Français|
The outcome of an Entamoeba histolytica infection is variable and can result in either asymptomatic carriage, immediate or latent disease (diarrhea/dysentery/amebic liver abscess). An E. histolytica multilocus genotyping system based on tRNA gene-linked arrays has shown that genetic differences exist among parasites isolated from patients with different symptoms however, the tRNA gene-linked arrays cannot be located in the current assembly of the E. histolytica Reference genome (strain HM-1:IMSS) and are highly variable.
To probe the population structure of E. histolytica and identify genetic markers associated with clinical outcome we identified in E. histolytica positive samples selected single nucleotide polymorphisms (SNPs) by multiplexed massive parallel sequencing. Profile SNPs were selected which, compared to the reference strain HM-1:IMSS sequence, changed an encoded amino acid at the SNP position, and were present in independent E. histolytica isolates from different geographical origins. The samples used in this study contained DNA isolated from either xenic strains of E. histolytica trophozoites established in culture or E. histolytica positive clinical specimens (stool and amebic liver abscess aspirates). A record of the SNPs present at 16 loci out of the original 21 candidate targets was obtained for 63 of the initial 84 samples (63% of asymptomatically colonized stool samples, 80% of diarrheal stool, 73% of xenic cultures and 84% of amebic liver aspirates). The sequences in all the 63 samples both passed sequence quality control metrics and also had the required greater than 8X sequence coverage for all 16 SNPs in order to confidently identify variants.
Our work is in agreement with previous findings of extensive diversity among E. histolytica isolates from the same geographic origin. In phylogenetic trees, only four of the 63 samples were able to group in two sets of two with greater than 50% confidence. Two SNPs in the cylicin-2 gene (EHI_080100/XM_001914351) were associated with disease (asymptomatic/diarrhea p=0.0162 or dysentery/amebic liver abscess p=0.0003). This study demonstrated that there are genetic differences between virulent and avirulent E. histolytica strains and that this approach has the potential to define genetic changes that influence infection outcomes.
The eukaryotic parasite Entamoeba histolytica, the causative agent of amebiasis, is a major cause of morbidity and mortality worldwide, as well as a category B priority biodefense pathogen . In Dhaka, Bangladesh, surveys done in a cohort of children living in an urban slum showed evidence of E. histolytica infection (determined by detection of parasite antigen in either diarrhea or monthly surveillance stool) in 80% of the children tested . Host genetics can influence susceptibility to infectious disease and a single amino acid substitution in the host cytokine receptor homology domain 1 of LEPR and a difference in the leukocyte antigen class II allele expressed are associated with increased susceptibility to intestinal infection by the E. histolytica [3,4]. Symptomatic disease occurs in only a minority of E. histolytica infections (20%) in an unpredictable manner and an initially asymptomatic infection can over time convert to invasive disease (~12.5%), amebic liver abscess can occur years after travel to an endemic area [5,6]. It is hypothesized that both host and parasite factors contribute to the outcome of an E. histolytica. However, although progress has been made in both the identification and characterization of parasite virulence factors and in understanding the regulation of their gene expression, direct manipulation of the E. histolytica genome remains elusive, and the traits affecting parasite virulence have not been genetically mapped [8-17].
Despite this variations that occur within repeat-containing genes in the amoeba genome chitinase and serine-rich E. histolytica protein SREHP have been used to examine the link between E. histolytica genetics and disease [18-22]. The high rates of polymorphism however at the loci make it difficult to use them for this purpose and an association between some of these markers and virulence has not been proven in large scale studies [18,21]. However, based on the composition of highly repetitive tRNA arrays, E. histolytica has been shown to have distinct genotypes with different potentials to cause disease [23-27].
E. histolytica tRNA genes are unusually organized in 25 arrays containing up to 5 tRNA genes in each array, with intergenic regions between tRNA genes containing short tandem repeats (STRs) . A 6-locus (D-A, S-Q, R-R, A-L, STGA-D, and N-K) tRNA gene-linked genotyping system has shown that the number of STRs at these loci differ in parasite populations isolated from three clinical groups (asymptomatic, diarrhea/dysentery and liver abscess) [24,26]. The variations occurring in tRNA genotypes, even between the ameba strains isolated from the intestine and in the liver abscess of the same patient, suggest that not all strains of E. histolytica have the same capacity to reach the liver of the infected host . However, the diversity of tRNA linked STR genotypes occurring even in a restricted geographic region, and the frequent occurrence of novel genotypes, limit their usefulness to predict infection outcome or to probe the population structure of E. histolytica [25,29,30]. The extensive genetic polymorphism in the repeat sequences of SREHP, chitinase and tRNA arrays for instance could reflect slippage occurring during E. histolytica DNA replication as Tibayrenc et al. hypothesize that the parasites exist as clonal populations that are stable over large geographical areas and long periods of time [31,32].
Compared with other DNA markers, single nucleotide polymorphisms (SNPs) are genetically stable, amenable to future automated methods of detection, and in contrast to the highly repetitive tRNA arrays, their location can be mapped in the E. histolytica genome [33-35]. After the first sequencing and assembly of Entamoeba histolytica HM-1:IMSS genome was published by Loftus et al. Bhattacharya et al. amplified and sequenced 9kb of coding and non-coding DNA to evaluate the variability of E. histolytica SNPs in 14 strains and identified a link between some genotypes and clinical outcome . The advent of the next generation of high throughput genomic sequencing (NGS) technologies has provided more comprehensive opportunities to investigate variation in the genome of E. histolytica and clinical outcome by allowing the fast and efficient way to sequence laboratory-cultured ameba of clinical relevance [35,37]. These cultured strains were isolated from different geographical areas endemic for amebiasis and contained large numbers of “strain-specific” SNPs in addition to SNPs present in more than one strain . The sequence variations associated with virulence strains previously identified in the sequenced 9kb DNA (a synonomous SNP in XM_001913658.1the heavy subunit of the Gal/GalNAc lectin gene (894A/G), and SNPs in the non-coding DNA either between XM_652295.1 and XM_652296.2 sequences (236T/G, 240A/G and 561T/G) or 5’ of the Amoebapore C transcript XM_650937.2 (407A/C and 422A) seemed to be present only the two to four Bangladesh isolates sequenced by Bhattacharya et al. and were not present in the available international sequenced whole genomes .
The goal of this work was to develop a set of less variable markers to profile a large number of strains from different regions of the globe, therefore we selected additional non-synonomous SNPs which Bhattacharya et al. had shown to be less variable, to probe the population structure of E. histolytica in depth . The new SNPs were present with a frequency of 0.3-0.6 in the pool of geographically disparate E. histolytica parasites whose genomes had been sequenced. We restricted our SNP candidates for initial analysis to genes with the potential to be involved in the virulence of this parasite [8-17]. As our current hypothesis is that the development of disease is multifactorial, or polygenic, and involves a combination of parasite factors in the current work we selected several loci to test for their association with disease outcome in E. histolytica. These loci contained SNPs that resulted in non-synonomous changes to the encoded amino acids, were present in more than three of the sequenced E. histolytica genomes, and enriched either in strains originating from symptomatic or asymptomatic infections. We have shown that two of these SNPs were significantly associated with disease severity in Bangladesh isolates.
The genome sequencing projects of multiple E. histolytica strains performed at the J. Craig Venter Institute (JCVI) and at the Institute of Integrative Biology (University of Liverpool) provided the sequence data used for the identification of SNPs (Table (Table1)1) . A total of 10,855 SNPs within coding DNA were identified in the sequenced genomes (Additional file 1: Table S1). Each strain had approximately 1,500 homozygous and 1,000 heterozygous SNPs. Half of all the SNPs identified were unique and present in only one strain (“private” SNPs). Like Ghosh et al. we identified mainly dimorphic SNPs, while potential tri- and tetrazygote variants were very infrequent . This, however, may reflect a bias in SNP detection programs because Mukherjee et al. observed considerable heterogeneity in the ploidy of E. histolytica .
To validate the use of SNPs from next generation sequencing data, a set of 12 SNPs predicted by NGS were verified by conventional Sanger sequencing of PCR amplicons from three selected strains, MS96-3382 (MS indicates monthly stool; this strain was established from an asymptomatic infection), DS4-868 (DS indicates diarrheal/dysenteric stool; this strain was isolated from a symptomatic infection) (sequenced as described in Additional file 1: Table S1) and the reference sequence HM-1:IMSS (Table (Table2).2). Primers were designed to amplify the region containing each SNP. The primers used are detailed in Additional file 1: Table S2 and the amplicons are shown in Additional file 1: Table S3 (primer sequences underlined). PCR was performed with these primers on MS96-3382, DS4-868, and HM-1:IMSS genomic DNA as described in materials and methods. The amplified products were separated on a 2% agarose gel and DNA fragments of the correct size were gel purified and sequenced by Sanger sequencing. In all cases the results of the Sanger sequencing of the MS96-3382 and DS4-868 amplicons matched the sequence produced by the NGS (Table (Table2,2, Additional file 1: Table S1). The Sanger data from HM-1:IMSS also matched the reference genome however a SNP in the alcohol dehydrogenase gene (gene ID EHI_166490/XM_647170.2) was heterozygous in this HM-1: IMSS reference strain, which was not previously known (Table (Table2).2). We therefore concluded that E. histolytica single nucleotide polymorphisms studied here were accurately identified.
The resampling results described above indicated that SNPs were maintained within an E. histolytica strain under continuous culture. However, this does not exclude the possibility that a particular genotype may change in frequency within an endemic population. To test for association between SNPs and disease outcome, E. histolytica samples were collected from an area endemic for amebiasis (ICDDR and Rajshahi Medical College, Rajshahi, Bangladesh- Additional file 1: Table S4). Both field samples and xenic cultures established from asymptomatic and symptomatic infections were used as a source of DNA (19 amebic liver aspirates; 26 xenic cultures (14 established from asymptomatic infections and 12 from diarrheal); 20 E. histolytica positive samples from diarrheal stool; and 19 E. histolytica positive samples collected during monthly stool sample surveillance). We anticipated that the virulence of this parasite in humans may not be the direct target of selection, because invasive disease does not seem to confer an advantage to pathogen dissemination . To focus on potentially genetically stable SNPs, which were nevertheless variably present in the different stains, we selected non-synonomous SNPs in the available data that were present in at least four, but not more than nine genomes. This allowed us to select for polymorphic SNPs that frequently occur in ameba and may represent genetically stable or ancestral variants that remain at a frequency of 0.3 to 0.6 a frequency that gave us sufficient statistical power to detect 2x differences within the amebic population surveyed in this study. For a SNP to be considered a candidate for association with symptomatic disease it had to occur at a greater frequency in the isolates from symptomatic amebic infections. Twenty-one potentially informative loci were chosen for further analysis in a larger number of E. histolytica isolates as described in the methods section of this paper (Additional file 1: Table S5 and S6).
The 21 marker loci selected from whole genome sequencing data were used to genotype clinical isolates of E. histolytica. DNA isolated from three sources, stool samples, short term xenic cultures of parasites from stool and amebic liver abscess aspirates, was used as a template to amplify the 21 loci. PCR products were sequenced using Illumina sequencing technology and the resulting demuliplexed sequence reads aligned to reference sequences representing the genes to which each amplicon corresponds in order to determine the nucleotide(s) present in the sampled genomes (see Additional file 1: Table S7).
Five of the 21 targets were not consistently co-amplified in our PCR reactions. This could have been due to differences in primer efficiency or off-target amplification in the xenic culture and stool specimens that contain an undefined mixture of intestinal microflora or it may also be because the gene is missing from some isolates or highly divergent. These five loci were not included in later analyses that only used the 16 remaining loci. We selected only those samples where all target sequences were efficiently amplified and we observed good coverage of all the expected amplicon sequences. If one or more of the targets was missing, then the sample was eliminated (Additional file 1: Table S7). The final data set consisted of 63 or the original 84 samples (63% of asymptomatically colonized stool samples, 80% of diarrheal stool, 73% of xenic cultures and 84% of amebic liver aspirates) which passed quality control and had the greater than 8 fold sequence coverage needed to confidently call SNPs.
The libraries generated from stool samples and from polyxenic culture contained a greater number of reads that did not map to the E. histolytica amplicons than those obtained from amebic liver abscess aspirates. This was likely due in part to off-target amplification (Figure (Figure1)1) of gut flora, or a reduction in specificity because most of these samples did not undergo nested PCR amplification prior to library preparation. Samples isolated from amebic liver aspirates do not have associated bacterial flora, unlike pyloric abscesses, therefore a higher proportion of the template DNA is E. histolytica.
In the samples that passed quality control, the read depth for individual SNPs was >8x coverage; this was considered adequate for SNP verification. SNPs were scored as described in materials and methods. The results of the illumina sequencing and the presence of predicted and novel SNPs within the amplicon sequences was tabulated as homozygous Reference (the same as the reference HM-1:IMSS sequence at this position) heterozygous (contained both the HM-1:IMSS nucleotide and the variant nucleotide at this position) or homozygous Non-Reference (has only the variant base at this location) (Additional file 1: Table S8). In Figure Figure22 the diversity of the SNPs at each locus in both the original sequence data (genomes shown in Table Table1),1), and in the Bangladesh samples analyzed in this study, (extra details shown in Additional file 1: Table S9).
Our work supports previous finding of extensive diversity among E. histolytica isolates from the same geographic origin, in phylogenetic trees only four of the 63 samples could be grouped into two sets of two with greater than 50% confidence suggesting that the genotypes of individual parasites do not contain consistent phylogenetic signals (Figure (Figure3).3). Such a result suggests that the markers do not share the same genealogy, likely due to extensive recombination or re-assortment breaking down linkage between markers. The diversity of E. histolytica genome raises a concern in regard to later analysis as it raises the possibility that a rapid rate of evolution may drive any observed differences between E. histolytica genotypes in samples isolated in regions separated even by relatively small geographical distances.
To examine the potential effect of adaption to in vitro culture on the frequency of SNP alleles, and therefore how well transiently or long established cultured trophozoites represent the parasite population, SNP allele frequencies were compared between parasites genotyped directly from stool samples and those from cultured trophozoites (Additional file 1: Table S10).
In cultures originating from asymptomatic isolates five linked Non-Reference SNPs at the LCAT EHI_065250/XM_647310.1 locus were detected in 80% of the strains, these same SNPs occurred in only 16% of the E. histolytica positive stool samples from asymptomatic hosts (Figure (Figure4).4). This suggests that during establishment of E. histolytica cultures a strong selection pressure was exerted on sequence in linkage with the LCAT EHI_065250 gene. This could either cause growth failure of the strains with the Reference allele or the outgrowth of a minority genotype in mixed infections (previous studies using the short tandem repeats have indicated that mixed infections are rare however this possibility cannot be discounted ).
To confirm that this change in the distribution of the Non-Reference SNPs was due to culture conditions, and not due to the Bangladesh origin of these samples, the comparison was repeated using only the strains isolated and transiently established in culture at ICDDR, B and genotyped using MLST during this study. A similar trend was observed.
EHI_065250 (LCAT) belongs to a gene family that consists of ten genes; they range in identity from 82% to 51% (Additional file 2: Figure Figure1A1A and B); two are highly similar to EHI_065250 (82 and 81% identity). The primers used in SNP amplification were specific for EHI_065250 and did not amplify the other members of this gene family. The other LCAT gene sequences are sufficiently different that off-target amplification would be detected in the sequence alignments of the Illumina reads. Such off-target amplification was never observed, confirming that amplification was specific for the target EHI_065250 locus.
The effect on SNP genotype was only apparent for the LCAT EHI_065250 SNPs and the p value of the EHI_065250 SNPs was not sufficiently low to eliminate the possibility of false discovery (q value=0.32, Additional file 1: Table S10). Therefore the cultured strains were included in Table Table33 and the statistical association of SNPs with disease phenotype was determined using the complete dataset but confirmed using the data set with only clinical samples (Additional file 1: Table S11 Data Set 1 and 2).
The EHI_080100/XM_001914351.1 cylicin-2 locus contained two closely linked SNPs 1&2. These SNPs were significantly associated phenotype (Non-Reference SNP was present in 75% of ALA samples; positive samples or cultures isolated from the monthly survey stool 52% and in 16% of samples or cultures isolated from diarrheal stool; p=0.002; q=0.032; Figure Figure5).5). Both the Reference and the Non-Reference variants at this locus were present in asymptomatic samples however the Non-Reference variants were more frequent in liver abscess samples and less frequent in samples isolated from patients with diarrhea/dysentery (Figure (Figure55 pair wise comparison between asymptomatic and diarrhea/dysentery p=0.0182 and between amebic liver abscess and diarrhea/dysentery samples p=0.0003; q=0.0144).
Amebic liver abscess is a complication only found in adults whereas dysentery is more frequent in children. The liver aspirate samples in this study were collected from adults, at Rajshahi Medical College Hospital, Bangladesh. This is a geographically distinct location from the dysenteric and asymptomatic samples that were collected from children in Dhaka, Bangladesh. One goal of this study was to identify SNPs to type the virulence potential of the parasite in amebic liver aspirates; if SNPs occur at different frequencies in Dhaka and Rajshahi isolates they will appear as potential biomarkers of parasites with the potential to initiate amebic liver abscesses. The difference in SNP 1&2 frequency in both asymptomatic and diarrheal samples was replicated however in the sequenced genomes from diverse populations in Asia and South America (described in Table Table11 and Additional file 1: Table S6 and included in Data set 2 Additional file 1: Table S11) [24,29,35,39].
The previously discussed locus, LCAT EHI_065250, which contained five different SNPs (3–7), was also associated with symptomatic disease however possible selection in culture rendered the distribution less significant within the larger data set (Table (Table33).
The changes at both the LCAT EHI_065250 and the cylicin-2 EHI_080100 loci altered a potential phosphorylation site in the encoded protein sequence (NetPhos ), and are located at the C-terminal portion of the proteins (Figure (Figure6).6). Expression of EHI_065250 has been shown to be modulated in the mouse model of amebiasis, and to be under the control of the URE3-BP transcription factor [9,44]. EHI_080100 appears to be a novel member of the E. histolytica “promoter family” potential membrane proteins regulated by the transcription factor URE3-BP with highly similar promoters, and amino- and carboxyl-terminal sequences (sites of signal peptide and transmembrane domains) . EHI_080100 encodes a hydrophilic Glutamic acid/Lysine rich protein with an N-terminal Signal P and although annotated as cylicin-2, it is not an ortholog of the human gene . Its function remains unknown.
E. histolytica SNPs were identified in amebic DNA isolated from a Bangladesh population by amplicon sequencing. Non-Reference SNPs in the EHI_080100 cylicin-2 gene were significantly associated with the virulence phenotype (amebic liver abscess>asymptomatic>diarrhea or dysentery).
We initially analyzed the genetic diversity among 12 sequenced E. histolytica genomes that represented different geographical origins and disease manifestations, and selected a set of 21 polymorphic sites in coding regions where SNPs change the encoded amino acid. The distribution of these 21 non-synonymous SNPs in field isolates and cultured strains of E. histolytica were examined in samples collected from an endemic area in Bangladesh by multilocus sequence typing (MLST). Of 16 loci that passed quality control five were invariantor very infrequent in Bangladesh.
Our results are inconsistent with a model of clonality in E. histolytica populations. In a clonal population we would expect to see strong linkage disequilibrium between markers, since linkage would not be eroded by recombination and sexual reassortment. In fact, we saw only two identical genotypes in our sample, suggesting a considerable amount of recombination and/or reassortment. Our results support previous observations, based on short tandem repeat DNA sequences, of high diversity among genotypes even within limited geographical areas [18,21,24]. Due to this complexity, the number of whole genomes sequenced in the pilot studies, were not sufficient to predict accurately the SNPs associated with disease. However, 2 out of the 16 loci examined,(EHI_065250 and EHI_080100), were significantly associated with disease in isolates collected in Rajshahi and Dhaka, Bangladesh.
One caveat to this study was that the amebic liver abscess samples were collected in Rajshahi but the stools samples were collected at a different location (Mirpur, Dhaka); the differences in the grouping of liver abscess and stool E. histolytica could reflect geographical differentiation . Ali et al. have however previously described different genotypes in liver abscess and enteric samples from the same patients . This suggests a possible genetic selection for parasites with invasive capabilities. Based on our data we suggest a divergent rather than sequential model of the potential to cause severe disease . Non-Reference EHI_080100 SNPs were more frequent in asymptomatic than diarrhea/dysentery samples. However the Non-Reference SNP potentially predisposed the asymptomatic infection to initiate an amebic liver abscess rather than amebic colitis (p= 0.0182) as the Non-Reference EHI_080100 SNPs, were present with even higher prevalence, in samples from amebic liver abscess (p=0.0003, q=0.0144). Additional studies are needed to identify additional amebic biomarkers associated with invasive disease.
In both EHI_065250 and EHI_080100 the consequence of the Non-Reference polymorphisms was to change two amino acids within the C-terminal domains. The reason behind the association of these SNPs with invasive disease is not yet clear. The polymorphic genes have not previously been associated with a virulent phenotype, and other than the previously discussed change in at a potential phosphorylation site, there were no other predicted changes in protein function using the currently available bioinformatics tools (PolyPhen http://genetics.bwh.harvard.edu/pph2/http://sift.jcvi.org/www/SIFT_seq_submit2.html) [47,48].
EHI_080100 (cyclin-2) is present on a short region of contiguous DNA in the E. histolytica HM-1:IMSS genome assembly that could not be assembled into a larger contiguous DNA segment or sequence scaffold (Table (Table4).4). This suggests that the gene may be present in proximity to highly repetitive regions that prevent unambiguous assembly. Lorenzi et al. suggest that repeats and repeat-clusters are found at syntenic break points between E. histolytica and E. dispar and could act as recombination hot spots promoting genome rearrangement . This “informative” locus could therefore reside in regions of DNA prone to allelic imbalance. In addition, no E. dispar homologue has been found for EHI_080100, making this gene an interesting candidate for further studies.
The currently identified SNPs could act as genetic “markers” in incomplete linkage disequilibrium with neighboring DNA that contains causative or regulatory SNP (r-SNP) mutations that result in a modulation of gene expression. It is interesting to note that contiguous with the EHI_065250 gene is one of the genes encoding the intermediate subunit of the Galactose- and N-acetyl-D-galactosamine (Gal/GalNAc) inhibitable lectin (igl2) . The Gal/GalNAc inhibitable lectinis a well-characterised virulence factor in E. histolytica.
It is also possible that amino acids changes resulting from the SNPs directly influence the biological activity of the encoded protein and that these changes affect the ability of the trophozoite to invade its host. What has never been clear is the advantage to the E. histolytica parasite to the causation of invasive disease . It is conceivable that these SNPs result in a maladaptation of the coevolving genomes of parasite and host and are the reason for the expression of the virulent E. histolytica phenotype [3,4]. The effect of the changes described in this study on the stability and function of the encoded protein is currently under investigation.
E. histolytica does not follow the model of T. gondii that exists predominantly in a few main lineages . Rather, even in population from a single geographic location, majority of the individual parasites show unique genotypes. The number of tRNA-linked genotypes discovered, are likely to continue increasing in number and will enable the measurement of strain diversity. However, the results presented in this work support the hypothesis that a relationship exists between the genotype of an E. histolytica strain and parasite virulence. Unlike the tRNA-linked sequence types (Ali et al, 2012) which are merely surrogate markers for the prediction of infection outcomes, non-synonymous SNPs detected in the present study shows promise to identify parasite factors directly linked to infection outcomes . Although preliminary, our findings identified two candidate genes that may contribute to the pathogenesis of these parasites. The level of genetic variation we observed increases the importance of the SNPs we have linked to disease. We are currently investigating the impact of the non-synonomous changes on the function of these proteins.
To fully understand the genetics of this parasite, additional biomarkers will be needed to understand virulence and different outcomes of the disease at the genome level. In the absence of stable clonal populations deeper characterization of the variation in the E. histolytica genome requires sequences from additional ameba strains. Using the protocol described in this paper usable sequence data was gathered from approximately half of the field samples. This allowed the testing of the association of selected candidate SNPs within an endemic population. Given the large amount of variation that occurs, SNPs need to be carefully chosen to type the virulence potential in an E. histolytica MLST schema rather than to reflect parasite phylogeny. Future studies are needed which focus on the genome of the infecting parasite in conjunction with the genome of the infected host.
The Ethical Review Committee at ICCDR,B approved this study. Written informed consent was provided by all study participants and/or their legal guardians.
E. histolytica trophozoites isolated from patients of all age groups seen at the hospital for diarrheal diseases, or from children living in an urban slum area in Dhaka were established in culture at the International Centre for Diarrhoeal Diseases Research, Bangladesh (ICDDR,B). Polyxenic cultures were maintained in biphasic Robinson's medium at 37°C (listed in Additional file 1: Table S4) .
Recently collected stools from patients of all age groups seen at the hospital for diarrheal diseases, or from children living in an urban slum area in Dhaka were examined macroscopically for the presence of blood and mucus; a smear of feces in 0.9% saline was examined microscopically for the presence of erythrocytes, leukocytes, and E. histolytica trophozoites. The DNA was extracted using a slightly modified QIAamp DNA Stool Mini Kit protocol (Qiagen Inc., Valencia, CA) as described previously for specimens from ICDDR,B . Stool samples are also listed in Additional file 1: Table S4.
Aspirates from patients with amebic liver abscesses were obtained only from adults because ALA is an extremely rare complication in children . A presumptive diagnosis of ALA was based on clinical picture, ultrasound examination and positive serology using an E. histolytica antigen based ELISA (TechLab E. histolytica II) . Abscess fluid was obtained under ultrasound guidance from patients with ALA and was purified using the modified QIAamp DNA Stool Mini Kit protocol described above (samples are listed in Additional file 1: Table S4) .
Primers for these experiments were designed using the publically available Primer3 program and checked for specificity using the NCBI Primer-BLAST tool  (http://www.ncbi.nlm.nih.gov/tools/primer-blast/). All primers used in this study are listed in either Additional file 1: Table S2 or Table S4.
Whole genome sequencing of five of the E. histolytica strains used in this study was carried out at the J. Craig Venter Institute. These sequence traces are deposited athttp://http://www.ncbi.nlm.nih.gov/bioproject/9532 dbSNPsGenbank (http://www.ncbi.nlm.nih.gov/projects/SNP/) and AmoebaDB (http://amoebadb.org/amoeba/)[57,58]. This project is also fully described at the NCBI Bio Project page (Accession: PRJNA9532). Whole genome re-sequencing was performed at the Institute of Integrative Biology, (Centre for Genomic Research) University of Liverpool and results deposited at AmoebaDB [35,57]. For a complete list of E. histolytica genomes, sequencing technology and Sequencing Center see Table Table11 and Additional file 1: Table S1.
For genome-wide SNP detection at JCVI the sequenced strains were analyzed using the CLC Genomics Workbench 4.0.2 SNP detection component as described below (see SNP detection and validation of amplicon sequences). In genomes sequenced at the Centre for Genomic Research, SNPs were identified according to the methods described Weedallet al.. For a list of the SNP detection method used in each genome see Additional file 1: Table S1. SNPs are listed in Additional file 1: Table S5. The default in the table was Reference and only high confidence SNPs (sequence coverage >8) were identified as Non-Reference. At selected locations a visual inspection of available sequence traces was performed to identify lower confidence SNPs (Additional file 1: Table S6). To identify “ancestral” or genetically stable SNPs we selected SNPs that were present in more than three strains. To pick out SNPs linked to disease the SNPs were grouped according whether the sequenced genome was first isolated from patients with asymptomatic or symptomatic disease. The list of weighted selection criteria included whether the SNPs enriched asymptomatic or symptomatic isolates, if the SNP was present in repeat regions or large E. histolytica protein families, whether it was contained in genes with any potential role in virulence, or if orthologous sequences were present in the non-pathogenic but closely related species E. dispar . The selected SNPs are shown in Additional file 1: Table S6.
PCR amplifications were performed on a C1000 Thermal Cycler (Bio-Rad) using the High Fidelity Phusion DNA polymerase Master Mix (Finnzymes). Sample DNA (0.5μl) was added to a 25μl reaction mix containing 125pm of the designated primers (5 nM). After an initial denaturation step of 98°C, denaturation at 98°C for 10sec, annealing of primers at 50°C for 30sec and elongation at 72°C for 30sec was performed for 34 cycles. This was followed by a final extension at 72°C for 10min. The amplified products were separated on a 2% agarose gel and the DNA fragments of the correct size were gel purified and sequenced by Sanger sequencing (GENEWIZ, Inc).
For clinical samples and low copy number culture material, amplicons were generated by nested PCR (see Additional file 1: Table S2 and S3). PCR amplifications were carried out using Phusion High Fidelity DNA polymerase Master Mix (Finnzymes). 1μl of first round amplified DNA was used as template for the second round of amplification, using the same conditions as for the first round PCR with the exception that the annealing temperature was increased to 60°C and the nested PCR primers were used with tails that contained the unique “barcode” sequences and adaptors necessary for Illumina paired-end sequencing, as described by Meyer and Kircher (Additional file 1: Table S4) . DNA from cultured parasites was used directly as template for the second round PCR amplification only, as its more abundant template made nested PCR unnecessary.
After this step, the different PCR products amplified from original samples were pooled in groups of 5 or 6 and one μl was amplified using 200 nM of the IS4 primer and an indexing primer (Additional file 1: Tables S2 and S4) for an initial denaturation step of 98°C, denaturation at 98°C for 10sec, annealing of primers at 60°C for 20sec and elongation at 72°C for 20sec was performed for 34 cycles. This was followed by a final extension at 72°C for 10min.
The final amplification product was pooled and purified using the Qiagen MinElute 96 UF PCR purification kit according to manufacturer’s directions, and an aliquot of the amplified library and the accompanying negative controls were run on a 2% agarose gel for quality control. The final library was pooled and DNA concentration determined using a Quant-iT Kit (Invitrogen). Prior to submission for sequencing the size distribution of the DNA in the pooled library sample was examined for insert sizes and confirmed to be of the expected range (200–300bp) using an Agilent 2100 bioanalyzer.
An aliquot of the multiplexed libraries (5 pmol) was denatured and then processed with the Illumina Cluster Generation Station at the J. Craig Venter Institute, Rockville, MD (JCVI, MD, USA), following the manufacturers protocol. Libraries were sequenced on an Illumina GAII,run for 100 cycles to produce reads of 100bp. Images were collected over 120 tiles (one lane) which contained 715,000 ±60 clusters per tile.
After the run image analysis, base calling and error estimation were performed using Illumina/Solexa Pipeline (version 0.2.2.6). Perl scripts were used to sort and bin all sequences according to indexes CASAVA 1.6 (Illumina).
Amplicon sequence analysis was performed using the high-throughput sequencing module of CLC Genomics Workbench 4.0.2. Raw read output for each indexed amplicon set (derived from samples as indicated in Additional file 1: Table S4) was cleaned by trimming of adaptor sequences, ambiguous nucleotides and low quality sequences with average quality scores less than 20. The remaining reads were used for reference assembly. To assess the level of redundancy and non-specific alignment in each individual dataset, an initial reference-based assembly was executed using the whole E. histolytica HM-1:IMSS reference genome (Genbank accession AAFB00000000). As some level of non-specific alignment occurred, the alignment conditions utilized for the final mapping of Illumina reads to the reference assembly were adjusted to require a global alignment of 80% identity over at least 80% of the specific concatenated reference assembly of the target sequences (see Additional file 1: Table S3). Default local alignment settings with mismatch cost of 2, deletion cost of 3 and insertion cost of 3 were used. Reads that were not assembled into contigs in the reference assembly were not analyzed. Consensus sequences derived from the reference assemblies for each amplicon set were utilized for SNP scoring and further phylogenetic analysis.
SNP detection in the amplified DNA was performed using CLC Genomics Workbench 4.0.2 SNP detection component, which is based on the Neighborhood Quality Standard (NQS) algorithm . To identify quality SNPs, putative SNPs were screened following specific criteria based on the read depth, minor allele frequency (10%), the quality of flanking regions and absence of other SNPs within 15-bp flanking regions. SNPs located in repetitive regions were also not considered. The central base quality score of ≥30 and average surrounding base quality score of ≥20 were set to assess the quality of reads at positions for SNP detection. A minimum coverage of 10 and a minimum variant frequency of two was required, and the variations compared against the reference sequence were counted as SNPs. The NQS algorithm looked at each position in the genome alignment to determine if there was a SNP at that position.
The sequences spanning the SNPs were extracted and the IUB base code guide used to describe heterologous bases (see Additional file 1: Table S8). At each locus the sum of the squared allele frequencies was subtracted from 1 to gauge the diversity (heterozygosity) in both the original sequenced genomes and the new MLST data (Figure (Figure2).2). The E. dispar Mercator whole genome alignment deposited in AmoebaDB was used to obtain the equivalent sequences where they existed in this related species (Additional file 1: Table S8) [57,61]. The statistical significance of SNP distribution or genotype group versus the phenotypic manifestation of disease (asymptomatic/diarrhea or dysentery/amebic liver abscess) was determined by use of a Chi-squared contingency test or Fisher’s Exact test using the Prism 5 program (GraphPad Software) and the resulting p values were corrected for multiple comparisons by use of the false discovery rate formula of Benjamini and Hochberg in the R program FDR online calculator made freely available by the SDM project [62,63]. To obtain the correction for multiple comparisons in the pairwise comparison the p-values of all possible combinations (i.e. asymptomatic vrs dysentery; asymptomatic vrs amebic liver abscess; dysentery vrs amebic liver abscess) for a given data set were combined prior to correction. A FDR of 10% was considered significant (http://sdmproject.com/utilities/?show=FDR_).
The authors have no competing interests to declare.
CAG conceived, designed, performed experiments, analyzed data and wrote the manuscript. WAP, IKMA, RH, and EC participated in the design of the study and also helped to write the manuscript. IKMA also preformed experiments. MK and FA collected samples and prepared DNA. SS, EF and EC conducted the next generation sequencing of amplicons and analysis of the resulting sequence data. GDW, NH and EC sequenced all genomes and discovered all SNPs described in this study. GDW helped in the writing of the manuscript. All authors read and approved the final manuscript.
Supplemental Tables. This file includes all supplemental tables mentioned in the text in an excel spreadsheet.
This investigation was supported by grant 5R01AI043596 from NIAID to WAP. This project has also been funded in part with federal funds from the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services under contract numbers N01-AI30071 and/or HHSN272200900007C.We wish to thank Dr Karen Beeson for her expert advice regarding next-generation sequencing technology, Drs. Cynthia Snider and Poonum Korpe for transportation of Bangladesh DNA samples and Dr. A. Mackey, Dr. B. Mann and Dr. M. Taniuchi for informative discussions. We also wish to thank Dr. B. Mann and C. B. Bousquet for careful reading of this manuscript.