|Home | About | Journals | Submit | Contact Us | Français|
Variable phenotypes have been identified for Entamoeba species. Entamoeba histolytica is invasive and causes colitis and liver abscesses but only in ~10% of infected individuals; 90% remain asymptomatically colonized. Entamoeba dispar, a closely related species, is avirulent. To determine the extent of genetic diversity among Entamoeba isolates and potential genotype-phenotype correlations, we have developed an E. histolytica genomic DNA microarray and used it to genotype strains of E. histolytica and E. dispar. On the basis of the identification of divergent genetic loci, all strains had unique genetic fingerprints. Comparison of divergent genetic regions allowed us to distinguish between E. histolytica and E. dispar, identify novel genetic regions usable for strain and species typing, and identify a number of genes restricted to virulent strains. Among the four E. histolytica strains, a strain with attenuated virulence was the most divergent and phylogenetically distinct strain, raising the intriguing possibility that genetic subtypes of E. histolytica may be partially responsible for the observed variability in clinical outcomes. This microarray-based genotyping assay can readily be applied to the study of E. histolytica clinical isolates to determine genetic diversity and potential genotypic-phenotypic associations.
Entamoeba histolytica causes amebic colitis and liver abscesses. Worldwide, 50 million people have invasive E. histolytica, and 100,000 die each year, making it the second most common cause of parasitic death in humans (71). In Dhaka, Bangladesh, where diarrheal diseases are the leading cause of death in children less than 6 years old, ~50% of children have serological evidence of exposure to E. histolytica by age 5 (34, 35). Infected children suffer from significant morbidity with malnourishment and growth delays. A number of interesting epidemiologic trends have been identified for amebic disease. The data, although limited, indicate that a minority (~10%) of individuals who become infected with E. histolytica progress to clinically overt disease; others remain asymptomatically colonized (29). In individuals with invasive disease, occurrence of amebic liver abscess (ALA) is 5 to 50 times less common than diarrhea (65). Geographically variable disease predilections have been observed, and invasive disease predominantly affects men (1, 2, 11, 61).
The extent of genetic diversity among E. histolytica clinical isolates is unclear. Studies have analyzed a small number of highly repetitive and polymorphic genetic loci by techniques, such as randomly amplified polymorphic DNA (RAPD), RNA arbitrarily primed PCR, and restriction fragment length polymorphism (RFLP), to conclude that there is significant genetic diversity (7, 18, 23, 31, 33, 64, 72, 73). Contradictory data were obtained by two studies; in one study, the lectin gene was sequenced, and in the other study, the intergenic region between the superoxide dismutase gene and the actin 3 gene was analyzed (9, 30). These studies found minimal genetic diversity among clinical isolates. Although limited, these data indicate that the extensive genetic diversity identified by analysis of repetitive regions may not be indicative of a genome-wide phenomenon. A number of studies have attempted to identify an association between parasite genotype and the clinical phenotype and although most have failed, two studies (one using RAPD analysis of amebic isolates and one using PCR and RFLP analyses of the serine-rich E. histolytica protein [SREHP]) did give some preliminary indication that it may be feasible to find a genotypic pattern predictive of a phenotypic outcome (7, 64). Host factors that contribute to variable disease manifestations have also been studied. In a study of Mexican mestizo adults and children, an increased frequency of HLA-DR3 was found to be associated with ALA (5, 6, 63). A recent study identified a potential protective association of the HLA class II allele DQB1*0601 and the heterozygous haplotype DQB1*0601/DRB1*1501 for intestinal amebiasis (26).
E. histolytica was recently reclassified into E. histolytica Schaudinn (EH) (previously described as pathogenic E. histolytica) and Entamoeba dispar Brumpt (ED) (previously described as nonpathogenic E. histolytica) (25). The two species are morphologically similar, exhibit ~98% identity at the rRNA level, and have similar host range and cell biology. However, their pathogenicity in vivo is vastly different. E. dispar colonizes humans, but only E. histolytica is able to cause invasive disease. To date, only four genes have been reported to be present in E. histolytica but absent (EhCP1, Ehapt2, and Ariel1) or significantly degenerate (EhCP5) in E. dispar (12, 67-69). However, the molecular basis of E. histolytica virulence and E. dispar nonvirulence is not yet known.
The parasite and host variables that contribute to the epidemiologic trends described above are not clear. Undoubtedly, there is a complex interplay between host genetics, immunity, enteric flora, nutrition, and parasite genetics that occurs and contributes to disease. Whether there are subtypes of EH that have higher or lower virulence potential or predilection for infection of certain organ systems is not known. The World Health Organization has prioritized efforts to determine whether functional subgroups of EH exist (71). Using an E. histolytica 11,328-clone genomic DNA microarray, we have genotyped four E. histolytica laboratory strains and two E. dispar laboratory strains. This first genome-wide analysis of Entamoeba strains reveals that genotypic fingerprints can be used to distinguish E. histolytica from E. dispar, identify genes restricted to virulent strains, and find potential genotypic-phenotypic associations.
Four EH strains and two ED strains were utilized in our study. The EH strains were as follows: EH (HM-1:IMSS), originally isolated in 1967 from a patient with amebic colitis in Mexico; EH (200:NIH) isolated from a patient with colitis in India in 1949; EH (HK-9) isolated from a patient with amebic dysentery in Korea; and EH (Rahman) isolated from an asymptomatic cyst passer in England in 1972. The ED strains were ED (SAW760), which was isolated from an adult human male in England in 1979, and ED (SAW1734), which was isolated from Ethiopian Jews in Israel in the 1980s (http://www.atcc.org/). All E. histolytica strains were grown under axenic conditions in Trypticase-yeast extract-iron-serum medium (TYI-S-33) with 10 to 15% adult bovine serum (Sigma), penicillin (100 U/ml), streptomycin (100 μg/ml) (Gibco BRL), and 1× Diamond's vitamins (Biosource International, Camarillo, Calif.) in 15-ml glass culture tubes at 37°C (60). E. dispar isolates were grown under xenic conditions (media supplemented with 200 μg of erythromycin per ml) (Sigma Aldrich, St. Louis, Mo.) as previously described (36). Cultures of mid-logarithmic or late-logarithmic-phase trophozoites were chilled on ice for 10 min and centrifuged at 430 × g at 4°C for 5 min, and genomic DNA was isolated using either a phenol-chloroform method or a Wizard Genomic DNA kit according to the manufacturer's directions (Promega Corporation, Madison, Wis.) (58). The EH and ED strains were verified by PCR analysis of the rRNA episome and strain-specific genes and by PCR and RFLP analyses of the SREHP using previously described methods (18, 62). The SREHP PCR products were digested with AluI, and results for all PCR products were compared to banding patterns in previously published studies (18). Primer sequences are available for review (see Table Table11 in the supplemental material at http://microarray-pubs.stanford.edu/entamoeba_diversity/index.shtml).
The Institute of Genomic Research (TIGR) (http://www.tigr.org/tdb/e2k1/eha1/) and Sanger Center (http://www.sanger.ac.uk/Projects/E_histolytica/) have sequenced E. histolytica (HM-1:IMSS) and E. dispar (SAW760) (47a). As of March 2004, there was 12X genome coverage of EH and ~2X genome coverage of ED.
We utilized 11,328 sequenced clones from the genomic library of EH (HM-1:IMSS) prepared by TIGR to generate our DNA microarrays. Approximately 62% of these clones were used in the final sequence assembly. Microarray construction was similar to previously described methods (19). Briefly, we PCR amplified the genomic insert directly from 5 μl of bacterial cultures using M13F and M13R primers that are part of the multiple cloning sites of the cloning vector. A total of 35 cycles of PCR was performed, with 1 cycle consisting of 1 min at 94°C, 1 min at 50°C, and 2 min at 72°C. PCR was performed with 2 mM MgCl2 in a Tetrad Thermo cycler (MJ Research, Waltham, Mass.). The products were ethanol precipitated, resuspended in water, dried, resuspended in 3× SSC buffer (1× SSC is 0.15 M NaCl plus 0.015 M sodium citrate), and printed on polylysine-coated glass slides (19). To check the accuracy of printing, quality controls were performed. Probes from the rRNA episome, EhActin, EhPFK, and EhATPase were used to verify clone orientation and contamination issues. One hundred nanograms of each amplified PCR product was labeled with Cy5-dUTP and hybridized on the arrays (see “Microarray hybridization” for details). Primer sequences are available for review (see Table Table11 in the supplemental material at http://microarray-pubs.stanford.edu/entamoeba_diversity/index.shtml).
The assembly, locus information, and final TIGR/Sanger Center gene annotation were kindly provided by Brendan Loftus (annotated on 12 February 2004). Clones from potentially contaminated plates and clones from the rRNA episome and repeat elements (Ehapt2 and EhRLE4) were identified. The final information was compiled and stored in the Stanford Microarray Database (SMD) (http://smd.stanford.edu/) for data retrieval and analysis. We also identified a subset of clones which were fully sequenced, used in the final assembly, contained one open reading frame (ORF) according to the TIGR annotation (>200-bp overlap at >98% identity), and were not from contaminated plates. These clones (2,802 good clones which contain 2,112 unique genes) were used for analyses in which we considered data from clones with a single ORF.
Genomic DNA (4 to 8 μg) from the EH and ED strains was resuspended in 30.5 μl of Tris-EDTA (pH 7.6), 3 μl of random nanomers (4 μg/μl) was added to the solution, and the solution was boiled for 1 min and cooled for 5 min. The DNA was mixed with 10× deoxynucleoside triphosphates (0.25 mM concentrations of dATP, dCTP, and dGTP and 0.09 mM concentration of dTTP), 1.5 μl of Cy5-dUTP, and 2 μl of Klenow fragment and incubated at 37°C for 4 h. A reference sample (350 ng of a 140-bp PCR product which labeled all spots on the array) was labeled similarly with Cy3-dUTP. The labeled products were mixed, purified with a Microcon 30 filter (Amicon/Millipore, Billerica, Mass.), and hybridized on arrays for 16 to 18 h at 65°C. The arrays were washed at room temperature sequentially for 2 min each with three solutions: 2× SSC and 0.1% sodium dodecyl sulfate, 1× SSC, and 0.2× SSC (19). The arrays were scanned using GenePix 4000B microarray scanner (Axon Instruments, Union City, Calif.). We performed a minimum of two replicate experiments per strain, except for EH (HM-1:IMSS) and ED (SAW760) for which five replicate arrays were generated. Clones with significant cross-hybridization with bacterial genomic DNA were identified by labeling and hybridizing bacterial genomic DNA on the arrays as described above.
Data were analyzed using ScanAlyze software (http://rana.lbl.gov/EisenSoftware.htm) to determine the fluorescence signal intensity of each dye at each spot. The intensities of channel 2 (Cy5-labeled Entamoeba genomic DNA) and channel 1 (Cy3-labeled reference DNA) measured the Cy5 and Cy3 signals, respectively.
For all arrays hybridized with Entamoeba genomic DNA, the channel 2 signal intensity was adjusted such that the mean channel 2/channel 1 intensity, or R/G ratio, was 1.0 for the entire array (19). For arrays hybridized with control DNA, nonnormalized R/G signal was used; the cutoff for positive hybridization was the mean R/G value plus 3 standard deviations.
Spots were filtered out on the basis of two criteria: poor-quality spots were discarded if they were flagged or had a channel 1 net mean intensity of <250, and spots with high noise/signal ratios were discarded if they had extremely different values in replicate arrays (average deviation/average of >0.4). Additionally, clones from contaminated plates and clones that cross-hybridized with bacterial genomic DNA were not analyzed.
For each EH and ED strain, the hybridization signal to the clones on the array was compared to the hybridization signal for the reference strain, EH (HM-1:IMSS), and the relative hybridization was analyzed by genomotyping analysis by the method of Charlie Kim et al. (GACK) (40). This algorithm generates a graded output and assigns a range of values from −0.50 to +0.50 in 0.05 increments. For a GACK value of +0.50, there is 100% likelihood that a gene is present; for a GACK value of −0.50, there is 100% likelihood that a gene is absent, divergent, or present at reduced genomic abundance. The GACK output was put into four categories (on the basis of Southern blot hybridizations and sequence analysis): A (absent or highly divergent; GACK values of −0.50 and −0.45), B (significantly divergent; GACK values of −0.45 to −0.15), C (moderately divergent; GACK values of −0.10 to +0.40), and D (highly conserved; GACK values of +0.45 and +0.50).
The average signal intensity for low-copy-number genes (1 or 2 loci), multicopy genes (6 to 10 loci), or high-copy-number repeat elements was calculated by analyzing a minimum of 10 clones in each category. Genes were selected on the basis of previously published literature and genome annotation. All clones analyzed contained a single ORF. To estimate the signal for specific regions of the rRNA episome, we chose clones that represented (i) a portion of the upper intergenic spacer (22,000 to 24,000 bp), (ii) a part of the transcribed unit (15,000 to 19,000 bp), and (iii) a part of the lower intergenic spacer (12,000 to 14,000 bp) (58). For regions with >2-kb sequence, clones with 100% match were selected. For regions with <2-kb sequence, clones with the highest E value were chosen, although these clones would always contain genomic sequences other than the sequence of interest. For each of the categories, the R/G signal for each EH and ED strain was retrieved, spot quality filters were applied, and the mean signal intensity ± standard error was calculated. Lists of the clones used in these analyses are available for review (see Table 4 in the supplemental material at http://microarray-pubs.stanford.edu/entamoeba_diversity/index.shtml).
To determine the average probe length of the Cy5-labeled genomic DNA, labeled genomic DNA was electrophoresed in an agarose gel. Cy5 and ethidium bromide were visualized with the Molecular Dynamics Typhoon 8600 multimode imaging system (excitation wavelengths of 633 and 493 nm, respectively, and emission filters of 670 and 620 nm, respectively) (Amersham Biosciences, Piscataway, N.J.).
To determine the reliability of the divergence categories, all clones from categories A, B, and C and 200 random clones from category D of the ED (SAW760) analysis were compared to those in the ED (SAW760) sequence database. BLASTN analysis was performed, and the match length and percent identity of the first hit was retrieved.
A Student's t test (two-tailed distribution and two-sample equal variance) was utilized. The values in comparisons of sample pairs were considered significantly different if the P value was <0.05.
Southern blotting was performed by the standard protocol (58). Genomic DNA (10 to 15 μg) was digested with EcoRI or HindIII, electrophoretically separated on 1% agarose gels, blotted onto Hybond N+ nylon membranes, UV cross-linked, and hybridized with radiolabeled probes. The probes were PCR amplified from the appropriate clone, and the probe sequence was verified, labeled with [α-32P]dATP with the Random Primed DNA labeling kit (Roche, Mannheim, Germany), and hybridized with Express-Hyb (Clontech, Palo Alto, Calif.). Blots were subjected to autoradiography, scanned, and prepared for publication using Adobe Photoshop (version 6.01; Adobe Systems, San Jose, Calif.). Blots were stripped using H2O-0.5% sodium dodecyl sulfate and reused for subsequent hybridizations. The probes used were: category A (ENTI070, ENTIL41, and ENTOU30), category B (ENTPF64, ENTON45, ENTNW18, and ENTNQ01), and category D (ENTOI63, ENTOE77, ENTOJ81, and ENTPJ36). Four loci (423.m00019, 271.m00049, 353.m00047, and 8.m00351) were PCR amplified when possible from all EH strains and ED (SAW760). PCR products were directly sequenced and analyzed using CLUSTALW. Primer sequences are available for review (see Table Table11 in the supplemental material at http://microarray-pubs.stanford.edu/entamoeba_diversity/index.shtml).
The GACK divergence output was used for clustering. Clones identified as being highly conserved in all strains were removed from the analysis to facilitate use of the clustering and visualization programs. The remaining clones (~2,389) were clustered using the program XCLUSTER (http://smd.stanford.edu/download/), using a Pearson correlation, uncentered metric algorithm, and hierarchical clustering. The clustering output was viewed using Java Treeview version 1.0.3. (Alok Saldahna, http://sourceforge.net/projects/jtreeview). To generate dendrograms based on genotypic differences, the clones were analyzed using the Mega 2.2 program (dMEGA2: Molecular Evolutionary Genetics Analysis Software, Arizona State University) (http://www.megasoftware.net/) using an unweighted pair group method with arithmetic mean (UPGMA) distance matrix algorithm and bootstrap values to estimate confidence intervals.
Nucleotide sequence data for the 423.m00019, 271.m00049, and 353.m00047 loci have been deposited in GenBank. The locus names and accession numbers in GenBank follow: Rahman_423.m00019, AY857547; NIH_353.m00047, AY857548; Rahman_353.m00047, AY857549; HK9_353.m00047, AY857550; and SAW_271.m00049, AY857551.
To perform CGH, we generated an E. histolytica genomic DNA microarray. As of February 2004, the genome annotation was in 888 assemblies with 9,938 predicted genes (Brendan Loftus and Neil Hall [TIGR], personal communication). Due to the compact nature of the Entamoeba genome, clones may contain more than one ORF. The number of clones with one annotated ORF was 2,802 (2,112 unique genes).
In order to determine whether the hybridization signal corresponded to genomic abundance, we analyzed the signal from clones with no genomic insert, low-copy-number genetic loci (1 to 2 loci), high-copy-number genetic loci (6 to 10 loci), and highly repetitive genomic elements (see Fig. Fig.11 in the supple-mental material at http://microarray-pubs.stanford.edu/entamoeba_diversity/index.shtml).The average probe length of Cy5-labeled genomic DNA was 500 to 1,500 bp, and all clones had comparable AT content allowing this comparison (data not shown). For clones with no genomic insert, the hybridization signal intensity was extremely low (R/G, 0.009 ± 0.002), indicating that Cy5-labeled E. histolytica genomic DNA did not cross-hybridize to the reference DNA. Signal intensities for 1-copy (R/G, 0.19 ± 0.01), 2-copy (R/G, 0.31 ± 0.05), and 6- to 10-copy-number genes (R/G, 1.65 ± 0.37) were proportional to the copy number. Clones with repetitive loci (Ehapt2 and EhRLE4) had high signal intensities (R/G, 1.03 ± 0.18 and 1.75 ± 0.16, respectively), although not proportional to the copy number, likely due to signal dampening from the adjacent nonrepetitive genomic regions. Clones from nontranscribed regions of the rRNA episome had half (R/G, 1.54 ± 0.17) the signal from transcribed regions (R/G, 3.95 ± 0.36) (each rRNA episome contains two identical transcribed repeats) (58). We determined the mean signal intensity for all clones on the array that reliably contained one ORF. Of the genes in EH (HM-1:IMSS), 13.8% had a mean signal intensity R/G of >1.0, indicating that these genetic loci were in high genomic abundance or were members of highly similar gene families.
When we compared E. histolytica and E. dispar hybridizations, we identified genetic loci known to be absent in ED compared to EH (CP1, Ariel1, and Ehapt2) as having significantly lower signal intensities in ED (SAW760) compared to EH (Fig. (Fig.1)1) (12, 67, 68). For highly abundant genomic loci (such as the cysteine proteinase genes and Ehapt2), although the signal for ED is markedly reduced than that in EH, it is not completely absent, since each clone contains additional genomic regions that will contribute to hybridization signal. The clones with CP5 also contained another potential ORF and therefore were not assessed. Genes highly conserved in the two species on the basis of homologues identified in the ED database, dipeptidylpeptidase 2 (DPP2), P glycoprotein 6 (PGP6), and actin-binding protein 2 (ABP2), had similar signal levels in EH and ED.
Analysis of the rRNA episome revealed significant differences between EH and ED (see Fig. Fig.22 in the supplemental material at http://microarray-pubs.stanford.edu/entamoeba_diversity/index.shtml).In EH (HM-1:IMSS), this 24.5-kb extrachromosomal DNA is present in 200 copies per genome and consists of two 5.9-kb identical inverted repeat regions that are transcribed and two nontranscribed regions of 9.2 kb (upper intergenic sequence) and 3.5 kb (lower intergenic sequence), respectively (58). Our analysis revealed that the upper and lower nontranscribed intergenic sequences had significantly higher signal in EH than ED. We performed sequence comparison of these regions with the ED (SAW760) genome sequence. The nontranscribed intergenic sequences of EH had little homology to comparable regions in ED (at most 91% identity over 64 to 141 bp). Thus, the low signal intensity of the rRNA nontranscribed region in ED accurately reflected the significant sequence dissimilarity of these two regions. Signal from the transcribed region of the rRNA episome was significantly lower in EH than in both ED strains (sequences of EH and ED exhibit 98.6% identity), and Southern blot analysis confirmed that the copy number of the rRNA episome was lower in EH than ED (see Fig. Fig.2B2B in the supplemental material at http://microarray-pubs.stanford.edu/entamoeba_diversity/index.shtml).
We genotyped four E. histolytica strains and two E. dispar strains (see Materials and Methods for details). There was high reproducibility among arrays hybridized with independently isolated genomic DNA. The mean correlation (± standard deviation) was 0.93 ± 0.03. The R/G signals obtained from hybridizations with each strain were compared to the R/G signals obtained for hybridization with EH (HM-1:IMSS) genomic DNA. For example, (R/GHK-9)/(R/GHM-1:IMSS) would indicate the relative hybridization for a given clone between EH (HK-9) and EH (HM-1:IMSS). Clones for which the experimental strain genomic DNA hybridized as well as EH (HM-1:IMSS) strain would represent genetic loci that were highly conserved. Clones that showed poor or no hybridization from the experimental strain genomic DNA would represent loci that were either absent, highly divergent so as not to cross-hybridize, or in significantly lower copy number in the experimental strain compared to EH (HM-1:IMSS). Data were assigned into functional divergence categories: A (absent or highly divergent), B (significantly divergent), C (moderately divergent), and D (highly conserved) (40).
Southern blot analyses were performed with EH (HM-1:IMSS) and ED (SAW760) genomic DNA and yielded the expected results (Fig. (Fig.22 [some data not shown]). For clones in the divergence categories A and B, the probes did not hybridize to genomic DNA from ED (SAW760). We used genome sequence data from ED (SAW760) to further analyze the validity of the divergence algorithm and performed BLASTN analysis on all clones in categories A, B, and C and 200 random clones in category D. For categories A, B, C, and D, the numbers of sequences that had no significant BLASTN match (E value of >1E−4) were 24.6, 13.3, 9.8, and 7.5%, respectively (Fig. (Fig.3A).3A). The 7.5% of sequences in category D without a BLAST match most likely represent incomplete genome sequence coverage of ED (~2X at present). The higher proportion of such sequences in the highly and moderately divergent categories indicated that those genomic loci had no matches in the ED database on the basis of divergence (rather than incomplete coverage alone). For those sequences with significant matches, the BLASTN results were analyzed (Fig. 3B and C). The average identity to the ED sequence database of clones from categories A, B, C, and D was 89% ± 0.4%, 90% ± 0.3%, 92% ± 0.2%, and 93% ± 0.3%, respectively. The mean match length for clones in categories A, B, C, and D was 191 ± 11, 242 ± 11, 273 ± 7, and 368 ± 16 bp, respectively. Therefore, in the highest divergence category (category A), although the query EH sequences were ~800 bp, the average match in the ED database was 191 bp with 89% identity. Thus, both the Southern blot hybridizations and large-scale sequence analyses indicated that the divergence categories were quantitatively valid. Although these analyses were performed for ED (SAW760), the validity of the categories should be applicable to other Entamoeba strains.
Using the divergence categories outlined above, we analyzed each EH and ED strain compared to the reference strain EH (HM-1:IMSS) (Table (Table1).1). The total number of clones analyzed for these strains ranged from 7,669 to 8,677. The two ED strains (SAW1734 and SAW760) exhibited the greatest divergence from EH (HM-1:IMSS) with 2.2 and 3.1% of clones in category A, 2.5 and 3.6% of clones in category B, and 8.6 and 10% of clones in category C, respectively. EH strains (200:NIH, HK-9, and Rahman) exhibited less divergence from EH (HM-1:IMSS) with 0.4, 0.2, and 0.1% of clones in category A, 1.1, 0.9, and 1.5% of clones in category B, and 5.2, 4.4, and 5.9% of clones in category C, respectively. The majority of clones were in category D (highly conserved). The numbers of clones in category D ranged from 93 to 94% for EH strains and 83 to 87% for ED strains, but the numbers were statistically different for the ED and EH strains (P value of 0.009). In order to determine whether the expected genomic regions contributed to the divergence patterns described above, we looked at the number of clones in category A that were from the nontranscribed region of the rRNA episome or the Ehapt2 or EhRLE4 locus, since these regions are highly divergent or missing in ED (Fig. (Fig.1)1) (59, 67). In the ED strains, 35 to 43.8% of the clones in category A were from the rRNA episome, and 3.9 to 2.3% contained either the Ehapt2 or EhRLE4 repeat region.
In order to obtain an overview of the comparative hybridization data, the divergence data from all usable clones were clustered with the XCLUSTER program using a Pearson correlation, uncentered metric algorithm, and hierarchical clustering (Fig. (Fig.4).4). Clones that were highly conserved in all strains were removed to facilitate the clustering program. Overall, 2,372 clones (27% of clones analyzed) were divergent in at least one strain of Entamoeba. Each Entamoeba strain had a unique genetic fingerprint compared to the EH (HM-1:IMSS) reference strain. Previous studies of EH clinical isolates relying almost exclusively on PCR-based analyses of a few highly polymorphic loci have shown that significant genetic diversity exists among EH strains (32, 33, 72, 73). The microarray data were generated from the analysis of >7,600 clones and suggest that genetic diversity exists on a genome-wide scale, including divergence in nonrepetitive genomic regions.
A total of 1,656 clones were divergent in any one ED strain; of these clones, 673 were divergent in both ED strains. In this latter set, we identified a set of 56 clones that were divergent in all ED strains, conserved in all EH strains, and in high genomic abundance in EH. These clones would represent good candidates for PCR-based diagnostic tests to distinguish EH from ED (Table (Table2).2). Some of the tests used to distinguish EH and ED at this time are based on sequence divergence of the nontranscribed regions of the rRNA episome (17, 22). Our analysis identified that 26 of the 56 clones that could distinguish EH from ED were from the rRNA episome, but we identified 30 clones that could serve as the basis for novel species-specific tests. A number of the diagnostic approaches used to distinguish between EH and ED at this time have demonstrated regional biases in sensitivity and specificity (11, 55). Diagnostic tests to identify ED should be based on genetic regions that were uniformly divergent in all ED strains compared to all EH strains (Fig. (Fig.4,4, sections 3 and 3A). Genetic regions that are divergent in both ED strains but are also divergent in other EH strains (Table (Table2)2) (Fig. (Fig.4,4, sections 2B and 2C) would not be suitable for diagnostic tests. However, regions uniquely divergent in one EH or ED strain could be used for strain-typing assays (Table (Table2)2) (Fig. (Fig.4,4, sections 1 and 1A) (see Tables Tables33 and 7 in the supplemental material at http://microarray-pubs.stanford.edu/entamoeba_diversity/index.shtml).
The three EH strains (Rahman, HK-9, and 200:NIH) were also clearly distinct from each other and from EH (HM-1:IMSS). Overall, 1,124 clones were divergent in any one EH strain. Clones that have sequence divergence in any EH strain would not be ideal for EH-specific tests. The genes they contain may not be ideal vaccine targets (if also divergent in any clinical EH isolates) and are likely not essential for trophozoite growth in tissue culture (Table (Table2).2). Of the three EH strains analyzed, EH (HK-9) was the least divergent (5.5% of clones divergent compared to the reference strain). The other two EH strains (200:NIH and Rahman) had 6.6 and 7.5% of divergent clones, respectively.
To generate genotypic dendrograms, the divergence categories were analyzed using the UPGMA algorithm (Fig. (Fig.5).5). The analysis revealed that we could successfully differentiate the genotypes of E. histolytica and E. dispar. Additionally, strain EH (Rahman), a less virulent strain as assessed by decreased monolayer destruction, disease in animal models, cytotoxicity, and the only EH strain isolated from an asymptomatic individual was the most divergent of the EH strains and clustered separately from the virulent EH strains with significant bootstrap values (13, 27, 48-51). The separation of the EH (Rahman) strain from the other EH strains could also reflect the geographic origins of the strains and be unrelated to parasite phenotypes.
Data from many more strains (especially clinical isolates) must be obtained before we can assess whether the genotype of avirulent EH strains is distinct from virulent EH strains. One previous study using RAPD also identified EH (Rahman) as distinct from virulent EH strains (64). However, unlike our study, they were not able to identify the genetic regions that differentiated EH (Rahman) from the other EH strains. The strain that aligned closest with EH (HM-1:IMSS) was EH (HK-9), which upon original isolation was invasive, but it is now considered to possess attenuated virulence (4, 15). The loss of virulence in EH (HK-9) may be an epigenetic phenomenon, related to gene expression, due to genomic changes, or have occurred during parasite isolation with selection of a less virulent subtype. For the purposes of this discussion, the assignment of virulence is based on the phenotype at the time of parasite isolation. All strains have been treated similarly (isolated from patients ≥30 years ago; both ED strains are xenic; and all four EH strains are axenic); thus, the overall selection pressures within the ED and EH strains should be comparable.
In order to assess whether any regions of the EH genome were more prone to be divergent, we analyzed the extent of divergence in assemblies. The 12X genome sequence coverage of EH (HM-1:IMSS) at this time has been organized into 888 assemblies (assembly size ranges from 297 to 2 kb; assemblies contain 1 to 151 ORFs); of these assemblies, 554 are on the array (Neil Hall and Brendan Loftus [TIGR], personal communication). The size of the assembly correlated with the number of predicted ORFs (R = 0.9766), and the number of predicted ORFs correlated with the number of clones on the microarray (R = 0.9063) (data not shown).
We analyzed all the clones on the microarray for which we had a definitive assembly and ORF assignment, a total of 4,555 clones. Assemblies were categorized by the number of Entamoeba strains that were divergent in that assembly. The majority of assemblies (39%) had no divergence in any Entamoeba strain, 16% had divergence in one strain, 19% had divergence in two strains, 12% had divergence in three strains, 9% had divergence in four strains, and 5.4% had divergence in all five Entamoeba strains analyzed (see Fig. Fig.33 in the supplemental material at http://microarray-pubs.stanford.edu/entamoeba_diversity/index.shtml). The assemblies in all five categories were characterized and were highly similar for the following: (i) mean ORF length (387 to 403 bp); (ii) percentage of ORFs that are hypothetical proteins (55 to 62%); (iii) average genomic abundance as assessed by signal intensity (R/G, 0.50 to 0.57); and (iv) the percentage of the assembly that is composed of simple (2.1 to 3.0%), low-complexity (9.7 to 12.6%), SINE (short interspersed nuclear element) (1 to 2.2%), or LINE (long interspersed nuclear element) (8.7 to 10.9%) repeats.
Therefore, the identification of 30 assemblies in which all five amebic strains were divergent but whose structural features and composition were apparently not different from the rest of the genome may indicate that these genomic regions are under different and higher divergence pressures than the remainder of the genome. Although we were able to identify a number of assemblies that apparently have high rates of divergence, we did not identify large portions of individual assemblies that were divergent in a given Entamoeba strain (data not shown). Thus, unlike bacterial genomes, large islands of genomic DNA do not appear to have been affected by the pressures that cause genetic divergence (39, 57). Since we do not have full genome coverage on the microarray, it is possible that there are larger regions of divergence that we did not identify due to lack of clones in a given genomic region. Additionally, single-nucleotide polymorphisms cannot be detected by our approach.
In order to identify genes associated with virulence, we analyzed the divergence data for genes divergent in ED compared to EH (Table (Table1).1). Overall, 1,640 to 1,817 loci were analyzed; the majority were highly conserved (94 to 98% in EH and 88 to 91% in ED), with 1,534 loci that were not divergent in any strain. The two ED strains (SAW1734 and SAW760) had the greatest number of divergent loci: 147 and 196, respectively. The EH strains had lower divergence: 111, 81, and 45 genes in 200:NIH, HK-9, and Rahman, respectively.
As previously mentioned, to date, only four genes have been identified as being absent (CP1, Ariel1, and Ehapt2) or highly divergent (CP5) in ED compared to EH (12, 67-69). We have now identified 80 novel genes that were divergent in both ED strains but conserved in all EH strains. The majority of these genes are hypothetical proteins; however, we identified an Arabidopsis thaliana avrRpt2-induced gene (AIG1) (565.m00023) (a plant gene involved in resistance to bacteria), a putative protein kinase (95.m00150), a peroxiredoxin 1-related gene (105.m00128), and a Ras family gene (395.m00027). We also identified nine genes that were divergent in the two ED strains and in the attenuated-virulent strain EH (Rahman) but conserved in virulent EH strains. Seven of the genes encoded hypothetical proteins; one gene coded for Rab geranylgeranyltransferase beta subunit (505.m00018), and one gene had a dihydrouridine synthase domain (34.m00240) (Table (Table22).
In order to determine whether despite their sequence dissimilarity, these genes encode functionally similar proteins in ED, we performed TBLASTX against the ED (SAW760) genome database on the 67 loci from the highly and significantly divergent categories and 30 random loci from the highly conserved category. The overall homology for the predicted proteins was much lower in the two divergent categories than in the highly conserved set (see Table 8 in the supplemental material at http://microarray-pubs.stanford.edu/entamoeba_diversity/index.shtml). Some genes from the highly or significantly divergent categories did have highly similar orthologues; these were likely identified as divergent on the basis of decreased copy number.
The overall paucity of genes restricted to virulent Entamoeba strains indicates that the presence or absence of a given gene (or sets of genes) may not fully explain the vastly different phenotypes of E. histolytica and E. dispar (Table (Table1).1). Expression differences in highly conserved genes, as known for the amebapore genes, undoubtedly play large roles in amebic pathogenic potential (46, 52). Since we did not have full genome coverage on our microarray, there are likely other genes restricted to EH that we did not identify. However, our array was composed of a random subset of genes, and a whole-genome analysis will likely reveal similar levels of divergence in ED. To validate that the genes we identified as divergent were functionally different, we checked the expression profiles of these genes. Genes divergent in ED (SAW760) had significantly lower expression levels than those in EH (HM-1:IMSS) (Ryan MacFarlane, personal communication).
In order to determine whether genes in category A were highly divergent or absent, we analyzed the 22 genes that were in this category for ED (SAW760). For each gene, we looked for homologues in the ED (SAW760) genome sequence database. Using these data, we identified 45% of genetic loci that appear to have diverged as a consequence of genetic drift, 23% that were divergent on the basis of copy number differences, and 32% that appear to be absent in the ED genome (although the limited 2X genome coverage makes it difficult to definitively state that a given gene is missing). On the basis of the CGH data and the limited sequence analysis, it appears that the majority of differences between ED (SAW760) and EH (HM-1:IMSS) resulted from genetic drift, rather than gene loss; however, a comprehensive genome analysis of ED will be necessary to make definitive conclusions.
We identified a number of loci that were divergent between EH strains including two hypothetical proteins (423.m00019 and 353.m00047) and a Ras guanine nucleotide releasing factor (271.m00049) (see Tables 7 and 9 in the supplemental material at http://microarray-pubs.stanford.edu/entamoeba_diversity/index.shtml). In order to confirm the CGH data, we PCR amplified and sequenced these loci (Table (Table3).3). For all loci categorized as highly divergent, PCR gave negative results (40). The average sequence identity for loci in categories B, C, and D were 78, 84, and 99%, respectively. Some variability was seen in the sequence identities for categories B and C. Since the array data were based on clones that contained ORFs and adjacent genomic sequences, the divergence we measured by CGH for 423.m00019 may have been in noncoding regions. Intergenic differences can, however, reflect functional differences (i.e., divergent promoters may affect expression levels). Thus, as a screening tool to identify divergent genetic loci, array-based CGH were valid for EH-EH comparisons.
Four EH strains and two ED strains were genotyped on the basis of CGH of >7,600 random genomic clones. On the basis of the identification of divergent genetic loci, each strain had a unique genetic fingerprint, and the degree of divergence correlated with virulence. The unique pattern for each strain indicates that there is significant diversity among laboratory strains of EH. Whether such diversity exists among clinical isolates is not clear. Epidemiologic studies have demonstrated extensive genetic diversity among clinical EH isolates using PCR-based analyses of highly repetitive and polymorphic loci (32, 33, 72, 73). Although these regions easily incorporate polymorphisms due to DNA slippage during replication, the extent of divergence may not be truly representative of divergence across the genome (42, 56). Our data confirmed that a great deal of the divergence in Entamoeba strains occurred at highly abundant genomic regions, but we also identified significant portions of the genome that were not in high abundance but that were divergent. Thus, the CGH data more accurately represented the overall genomic diversity among Entamoeba strains.
In addition to giving a genome-wide overview of the extent of genetic diversity, CGH allowed identification of genetic regions that were highly prone to divergence. We identified assemblies in which every EH and ED strain studied had some divergence, indicating that these genomic regions may be hot spots for genetic variability. No obvious structural component of these regions that would predispose them to diverge could be identified, although their organization in scaffolds is not known. Genomic architecture can predispose to divergence, as spontaneous subtelomeric deletions of Plasmodium chromosomes have been observed and genomic regions rich in tRNA genes or with transposon recognition sequences are more prone to integration (3, 21, 41, 44, 53). Irrespective of structural issues, genomic regions can be under different evolutionary pressures (introns and pseudogenes are rapidly evolving) or functional pressures (highly antigenic genes may be under immunogenic pressures), and these pressures may affect the overall divergence in a given genomic context (45, 47).
A number of factors likely contribute to this observed genome plasticity in Entamoeba. Retrotransposons, which are abundant in EH, play a significant role in genome evolution by their ability to move to new chromosomal locations (10, 59). Significant numbers of E. histolytica genes are in multiple copies or are members of highly homologous gene family members. Duplicate genes have higher rates of intron gain or loss in comparison to orthologues and are under accelerated evolutionary pressures either through decreased functional constraints or via positive selection (8, 16, 28, 38, 74). E. histolytica contains a number of episomally maintained DNA circles, which in other systems contribute to genome plasticity due to intramolecular homologous recombination between direct tandem repeats (20, 24). Whether genetic plasticity is enhanced in clinical scenarios to allow more-virulent strains to survive and disseminate or has resulted in a large subset of avirulent EH strains (which may account for the overall low rates of invasive E. histolytica infection) is not clear.
The genomic differences between E. histolytica and E. dispar were the most extensive. We identified a number of genes conserved in all EH strains and highly divergent in both ED strains, including a putative protein kinase, a peroxiredoxin 1-related gene, and a Ras family member. Peroxiredoxins are antioxidant enzymes that protect against reactive oxygen species and reactive nitrogen species (14, 37). Peroxiredoxin-null Saccharomyces cerevisiae cells are hypersensitive to oxidative stress and genomically unstable (70). Small GTP-binding proteins regulate many diverse processes in eukaryotic cells, including signal transduction, cell proliferation, cytoskeletal organization, and intracellular membrane trafficking. In Entamoeba, small GTP-binding proteins play roles in polarity, motility, phagocytosis, and perhaps encystation; ED may be affected in a number of cellular functions if functionally divergent in some of these genes (43, 54, 66). Nine genes were identified that were divergent in both ED and EH (Rahman) but conserved in all virulent EH strains. Most of these were hypothetical genes, but two had homologues in other systems: a Rab geranylgeranyltransferase and a gene with a dihydrouridine synthase domain. Functional studies will be necessary to elucidate the roles of these genes in amebic pathogenesis. The relatively low number of divergent loci in avirulent versus virulent Entamoeba is surprising, considering the vast biological differences between the two species. This implies that gain or loss of a gene may not be sufficient to confer pathogenicity. Thus, comparative genomics alone cannot unravel the complexities of virulence in these parasites. Expression levels, coordination, and timing, along with the genomic context, may all contribute to virulence and need further investigation.
Using CGH, we identified that the genomic fingerprint of an EH strain with attenuated virulence was phylogenetically distinct from virulent EH strains; it is an intriguing observation, because it implies that genotypes may be associated with phenotypes. If a similar association were identified in clinical isolates, it would have enormous clinical implications, since only those individuals with invasive subtypes of EH would need treatment. At this time, studies to determine the extent of genetic diversity among clinical isolates and potential genotype-phenotype correlations are under way. Our work has laid the foundation for pursuing genome-wide population studies of E. histolytica and understanding the impact of parasite genetics on human disease.
This work was supported in part by a Applied Genomics in Infectious Diseases training grant (5 T32 AI07502) and a Stanford University Dean fellowship for P.H.S., a Cellular and Molecular Biology Training Program grant (NIH 5 T32 GM007276) for R.C.M., a Stanford University Dean fellowship for D.B., and a Burroughs Wellcome collaboration grant and a grant from the NIAID (AI-053724) to U.S.
We gratefully acknowledge the help of Brendan Loftus, Neil Hall, and Iain Anderson (TIGR and Sanger Center) for access to EH clones and sequence and genome data. We also thank David Mirelman and Dan Eichinger for parasite strains, Sandeep Jaggi for help with data analysis, Charlie Kim for input on the GACK algorithm, Kevin Visconti for microarray printing, and all members of the lab for helpful suggestions and discussions.