To evaluate genetic variability among Entamoeba histolytica strains, we sequenced 9,077 bp from each of 14 isolates. The polymorphism rates from coding and noncoding regions were significantly different (0.07% and 0.37%, respectively), indicating that these regions are subject to different selection pressures. Additionally, single nucleotide polymorphisms (SNPs) potentially associated with specific clinical outcomes were identified.
A minority (~10%) of individuals infected with Entamoeba histolytica develop clinical symptoms (5). Whether this is due to variations in parasite genotypes is unknown. Based almost exclusively on analysis of highly repetitive loci, significant genetic diversity among E. histolytica isolates has been reported (1, 7, 8, 20-22). Since highly repetitive regions are prone to incorporating polymorphisms due to DNA slippage, analysis of these loci may overestimate population diversity in a species (15). Analyses of nonrepetitive loci have revealed very limited polymorphisms (3, 6). There is no definitive correlation between genotypes of E. histolytica strains and their clinical manifestations; however, three studies preliminarily indicated that genotypic patterns may be predictive of a clinical outcome (1, 16, 18). To get an improved perspective of genetic variability in nonrepetitive genomic regions, we sequenced 9,077 bp (6,621 bp coding and 2,456 bp noncoding) from each of 14 E. histolytica isolates.
Four E. histolytica laboratory strains and 10 clinical isolates were studied (Table 1). Laboratory and clinical isolates were cultivated under axenic and xenic conditions, respectively (9, 16). Each E. histolytica isolate was characterized by PCR amplification and sequencing of 13 genetic loci (4) (Table 2). PCRs contained 0.05 μg (phenol-chloroform) to 0.5 μg (GenomiPhi DNA) template DNA, 20 pmol of each primer, 1.5 mM MgCl2, 10 mM of each deoxynucleoside triphosphate, and 0.5 μl of Taq DNA polymerase and were cycled as follows: 95°C for 5 min; 35 cycles of 95°C for 1 min, 60°C for 1 min, and 72°C for 2 min; 72°C for 10 min; and a 4°C soak. PCR products were sequenced in their entirety, and sequences were aligned using CLUSTALW (www2.ebi.ac.uk/clustalw). For single-copy genes, sequences from GenBank and The Institute for Genomic Research (TIGR) E. histolytica (strain HM-1:IMSS) database (http://www.tigr.org/tdb/e2k1/eha1/) were the reference sequences (Table 2). For lectin and actin, which are each encoded by multicopy genes, a master template sequence was constructed (see Table 2), and single nucleotide polymorphisms (SNPs) identified in the genome sequence were considered inherent to the gene and not further noted.
We analyzed coding regions from housekeeping genes (ssu rRNA, cpn60, and genes encoding actin and γ-tubulin) and virulence genes (lectin [hgl3], the gene encoding amebapore C, rabE, and the gene encoding cysteine proteinase 5), since they may be subject to variable selection pressures. Two genes (encoding actin and lectin) are in multiple copies in the genome; the rest are single-copy genes. For the noncoding loci, introns and intergenic and/or promoter regions were analyzed; introns should have essentially no constraint against divergence, whereas intergenic and/or promoter regions may be under some selection pressures (11). We also amplified three polymorphic loci from each strain which showed significant polymorphisms (Table 3). We sequenced locus 1-2 and locus 5-6 for the laboratory strains, and, despite having identical amplicons, the strains were genetically unique (insertions and deletions of tandem repeat sequences and SNPs) (data not shown). Similar results with repetitive loci have been previously reported with amplicon sizes underestimating sequence diversity (7, 21); therefore, sequence analysis of these regions was not performed for the clinical isolates.
In the nonrepetitive genetic loci analyzed, 14 SNPs were identified: 5 within coding regions and 9 in noncoding regions (Table 3). The actin and lectin genes (both multicopy genes) showed the maximum polymorphisms, with four of the five SNPs occurring in these two genetic loci. The limited sequence variability in the coding regions we studied concurs with a recent comparative genomic hybridization analysis, where these genes were highly conserved among the strains tested (20). All SNPs in the coding regions were synonymous. This is incongruous with reports for Mycobacterium tuberculosis, where analysis of coding regions from seven housekeeping genes (8,318 bp of sequence) revealed 101 SNPs, of which only 36 were synonymous (2). Similarly, in Plasmodium falciparum, of the 48 SNPs detected in antigenic locus mspI, only 7 were synonymous (17). Whether these differences between our observations and others are due to technical (limited amount of sequence data, types of genes analyzed) or biological (codon bias, differential selection pressure) reasons is not clear at present. We searched the GenBank database for sequences of nonrepetitive coding regions from E. histolytica strains other than HM-1:IMSS. Of the approximately eight sequences identified (X84009, X83685, X79134, X79133, X83685, AY870660, AY460178, X82198), only the latter three have significant matches in the TIGR E. histolytica database, and only one (X82198) had an SNP that was synonymous. Further sequencing may be useful to clarify this issue. It is likely that each point mutation occurred just once in the phylogenetic history of the species. Since SNP markers are evolutionarily stable and unlikely to mutate again to either a novel or ancestral state, they are useful for evolutionary analyses (2).
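To make the notion of a synonymous SNP concrete, the following minimal sketch classifies a single-base change in a coding sequence by comparing the amino acid encoded before and after the substitution. It is an illustration only, not the procedure used in this study: the function name `is_synonymous`, the 0-based position convention, and the nine-base example sequence are hypothetical, and the sketch assumes the standard genetic code and an in-frame sequence starting at the first codon position.

```python
# Standard genetic code, written in the conventional T, C, A, G ordering.
# Assumption: the organism of interest uses the standard code.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def is_synonymous(ref_cds: str, pos: int, alt_base: str) -> bool:
    """Return True if replacing the base at 0-based `pos` with `alt_base`
    leaves the encoded amino acid unchanged.

    `ref_cds` must be an in-frame coding sequence (hypothetical input)."""
    offset = pos % 3                 # position of the SNP within its codon
    start = pos - offset             # index of the first base of that codon
    ref_codon = ref_cds[start:start + 3].upper()
    alt_codon = ref_codon[:offset] + alt_base.upper() + ref_codon[offset + 1:]
    return CODON_TABLE[ref_codon] == CODON_TABLE[alt_codon]


# Hypothetical example sequence: ATG GCT AAA (Met-Ala-Lys).
example_cds = "ATGGCTAAA"
third_position_change = is_synonymous(example_cds, 5, "C")  # GCT -> GCC
first_position_change = is_synonymous(example_cds, 3, "C")  # GCT -> CCT
```

In this toy sequence the third-position change keeps alanine, whereas the first-position change replaces alanine with proline, which is why third-position substitutions account for most synonymous SNPs.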
Using data from all genes studied, we identified 14 genotypic patterns among the 14 isolates. However, the types of polymorphisms we identified in nonrepetitive genomic regions were significantly different from those in highly repetitive regions. Sequence divergence in our study was limited to SNPs, in contrast to sequence analysis of repetitive regions, in which differences between isolates are largely due to copy numbers of 12- to 16-bp tandemly repeated regions (7). The underlying structure of the DNA influences replication errors, and DNA slippage can result in changes in the number of repeat regions, leading to a large degree of variability in repetitive loci (15). For phylogenetic analysis, nucleotide sequences for the six genetic loci with SNPs were combined and used to generate a dendrogram, which distinguished three of the asymptomatic isolates from the others (Fig. 1). Two asymptomatic isolates (MS26-21 and MS53-3046) clustered with the samples from diarrheal stools. Whether this indicates that these two isolates have unrecognized virulence potential remains to be investigated. Among the clinical isolates, there were six SNPs associated exclusively with isolates from asymptomatically infected individuals (the 894 SNP in the lectin gene; the 236, 240, and 561 SNPs in the intergenic region between 2.m00567 and 2.m00568; and the 407 and 422 SNPs in the upstream region of amebapore C) (Table 3). Additionally, there was one SNP (369 in the 128.m00017 intron) that in the clinical isolates was identified only in samples from diarrheal stools. However, using a two-sided Fisher exact test (Stata/SE 7.0; Stata Corporation, Texas), only the lectin 894 SNP was statistically significant in its association (P = 0.015).
The frequencies of SNPs in coding and noncoding regions (0.07% and 0.37%, respectively) were statistically significantly different (P = 0.0039) (as calculated above). A number of factors could influence this observation. First, SNPs are more common in microsatellite repeats; however, this was not a factor in our observation, as the SNPs in noncoding regions were not in microsatellite repeats. Second, codon bias restricts the extent of genetic variability that can occur in coding regions, especially in AT-rich organisms such as E. histolytica (12). A functional constraint in the coding regions was seen in our analysis, as all SNPs represented synonymous changes. Of the nine SNPs in the noncoding regions, seven were identified exclusively in clinical isolates, suggesting continued selection and evolutionary pressures. Noncoding regions are often targeted for population-based investigation because they represent rapidly evolving sequences, as they are subject to reduced selective constraints compared to coding regions (10, 11). Our results corroborate previous studies, where E. histolytica isolates differing at the locus encoding chitinase or serine-rich E. histolytica protein were identical at rRNA internal transcribed spacer regions and intergenic sequences between superoxide dismutase and actin 3 genes (6, 13). Similar epidemiological studies of Plasmodium falciparum have revealed paradoxical results in population structures partly due to the reliance of some studies exclusively on highly polymorphic genes (14, 15) versus data from housekeeping genes or introns (19). The overall data indicate that noncoding regions may be better gauges to assess evolutionary trends in E. histolytica.
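The comparison of SNP frequencies can be checked with a short calculation. The sketch below recomputes the two rates from the counts reported in this work (5 SNPs in 6,621 bp of coding sequence and 9 SNPs in 2,456 bp of noncoding sequence) and applies a two-sided Fisher exact test to the corresponding 2x2 table. It is a minimal pure-Python illustration, not the Stata procedure used in the study; the function name `fisher_exact_two_sided` and the convention of summing all tables no more probable than the observed one are assumptions of this sketch, and each sequenced position is treated as an independent observation.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]].

    The p-value is the total probability of all tables (with the same
    margins) that are no more probable than the observed table."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x "successes" in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(col1, row1)
    # Small relative tolerance guards against floating-point ties.
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


# Counts taken from the text: SNPs and sequenced base pairs per region.
coding_snps, coding_bp = 5, 6621
noncoding_snps, noncoding_bp = 9, 2456

coding_rate = 100 * coding_snps / coding_bp          # percent polymorphic
noncoding_rate = 100 * noncoding_snps / noncoding_bp  # percent polymorphic

# Rows: coding, noncoding. Columns: polymorphic, invariant positions.
p_value = fisher_exact_two_sided(
    coding_snps, coding_bp - coding_snps,
    noncoding_snps, noncoding_bp - noncoding_snps,
)
print(f"coding {coding_rate:.3f}%, noncoding {noncoding_rate:.3f}%, P = {p_value:.4f}")
```

Under these assumptions the noncoding rate is roughly five times the coding rate, and the test gives a P value well below 0.05, in line with the significance reported above.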
The E. histolytica isolates studied have been subjected to variable culture conditions from long-term (>30 years) axenic culture to short-term (1 to 3 years) xenic culture (Table 1). The different conditions did not significantly change the genetic composition of nonrepetitive regions. Overall, 4,617 bp and 580 bp of coding and noncoding regions, respectively, did not change among any of the isolates tested (a total of 72,758 bp). Additionally, the presence of SNPs at identical positions (237 in amebapore C and 894 and 1245 in the lectin gene) in both laboratory and clinical samples indicates that E. histolytica isolates have remained phylogenetically conserved during long-term tissue culture (Table 3). Furthermore, axenization did not trigger significant changes in the genes studied.
We propose that studies of noncoding, nonrepetitive regions might be the most informative for future population-based studies to develop an improved evolutionary and phylogenetic framework for E. histolytica.
Nucleotide sequence accession numbers. The sequences described in this work have been submitted to GenBank under the nucleotide sequence accession numbers AY956427 to AY956440.
This work was supported by a Stanford University Dean's fellowship for D.B. and grants from the NIAID (AI-053724 and AI-063470) to U.S. R.H. is an International Research Scholar of the HHMI and is also supported by a grant from the NIAID (AI-43596).
We gratefully acknowledge the help of Tomoyoshi Nozaki for critical reading of the manuscript, Sebastian Gagneux for statistical help, and all members of our laboratory, especially Jason A. Hackney for helpful suggestions and discussions.