|Home | About | Journals | Submit | Contact Us | Français|
Few transcriptional regulatory networks have been described in non-model organisms. In Entamoeba histolytica seminal aspects of pathogenesis are transcriptionally controlled, however, little is known about transcriptional regulatory networks that effect gene expression in this parasite. We used expression data from two microarray experiments, cis-regulatory motif elucidation, and a naïve Bayesian classifier to identify genome-wide transcriptional regulatory patterns in E. histolytica. Our algorithm identified promoter motifs that accurately predicted the gene expression level of 68% of genes under trophozoite conditions. We identified a promoter motif (A/TAAACCCT) associated with high gene expression, which is highly enriched in promoters of ribosomal protein genes and tRNA synthetases. Additionally, we identified three promoter motifs (GAATGATG, AACTATTTAAACATC/TC and TGAACTTATAAACATC) associated with low gene expression. The promoters of a large gene family were highly enriched for these motifs, and in these genes the presence of 2 motifs predicted low baseline gene expression and transcriptional activation by heat shock. We demonstrate that amebic nuclear protein(s) bind specifically to four of the motifs identified herein. Our analysis suggests that transcriptional regulatory networks can be identified using limited expression data. Thus, this approach is applicable to the multitude of systems for which microarray and genome sequence data are emerging.
Identification of gene regulatory networks is one promise of the post-genomic era. Identification of cis-regulatory elements and the patterns of gene expression they control become increasingly possible as large-scale expression studies and high-throughput genome sequencing are carried out. A number of approaches can be used to identify cis-regulatory elements that control expression of large numbers of genes. Bioinformatic techniques can identify putative regulatory elements that are conserved in the promoters of co-expressed genes, conserved in promoters in evolutionarily distant species, or both in concert (1–3). Computational identification of motifs can be coupled with Bayesian statistics to allow the identification of potential interactions between transcription factors. In Saccharomyces cerevisiae and Caenorhabditis elegans, computational approaches and extremely large expression data sets, comprising upwards of 255 experiments, were used to identify regulatory modules controlling gene expression during many conditions, including progression through the cell cycle, sporulation and osmotic stress, and development (4,5). Additionally these studies accurately predicted gene expression based on promoter content. Unfortunately, however, in most systems such detailed microarray data are not currently available.
Entamoeba histolytica is a protozoan parasite and the etiologic agent of amebic colitis and amebic liver abscess causing invasive disease in 50 million people annually (6). The amebic life cycle has two stages, a trophozoite form that causes invasive disease, and an encysted form that is the transmissible agent. Changes in amebic transcription underlie developmental pathways, parasite response to host stresses, drug resistance and tissue invasion (7–10). A number of recent studies have utilized microarray technology to characterize the amebic transcriptional profile associated with adhesion to collagen, parasite virulence, stage conversion and tissue invasion (8,11,12,13). Although global changes in gene expression were observed, the promoter elements controlling these transcriptional changes were not identified. Identifying regulatory pathways controlling transcriptional responses is key to understanding how and why amebae cause disease.
The basal transcriptional machinery in E. histolytica has been well characterized, including identification of a TATA box (TATTTAAAG/C) and an Initiator (Inr) element (AAAAATTCA) (14,15). In addition to the TATA box and Inr element, a third-core promoter element, the GAAC box (AA/TGAACT), is independently able to control the rate and site of transcription initiation (16,17). The presence of a third conserved core promoter element contributes to the unusual core promoter architecture in E. histolytica compared to other metazoan systems. A number of other regulatory elements and transcription factors have also been identified in E. histolytica. The most well characterized are the transcription factors upstream regulatory element 3-binding protein (URE3-BP), a calcium-sensitive EF hand protein and the E. histolytica enhancer binding proteins 1 and 2 (EhEBP1 and EhEBP2) (18–20). Additionally, the recent completion of the E. histolytica genome sequence indicates that canonical transcription factors are encoded in the genome (21). Thus, it appears that sequence-specific DNA-binding proteins control multiple aspects of basal and activated transcription in E. histolytica.
Although much work has been done in E. histolytica to characterize the regulation of a handful of genes, global transcriptional networks have not been identified. We have applied a gene regulatory network approach towards understanding coordinate control and regulation of gene expression in this parasite. Utilizing expression data from two microarray experiments, we identified cis-promoter motifs that correlated with the level of gene expression. In addition, we identified a set of three promoter motifs that when present in combination of 2 motifs were associated with increased gene expression in response to heat shock. Furthermore, by using electrophoretic mobility shift assays (EMSAs) we confirmed that a number of the motifs predicted by bioinformatic analyses specifically bind amebic nuclear protein(s).
Entamoeba histolytica (HM-1:IMSS) was grown axenically in trypticase-yeast extract-iron-serum (TYI-S-33) medium as previously described (22). Parasites were subjected to heat shock by exposure to 42°C for 1h. Viability of the heat shock treated trophozoites was determined by Trypan blue exclusion. For RNA isolation, amebae were washed once with TYI-S-33 medium to remove dead cells, chilled on ice for 10min, centrifuged for 10min at 430 × g, resuspended in TRIZOL reagent (Invitrogen), lysed by a syringe needle and RNA isolated following the manufacturer's protocol. RNA used for microarray analysis was further purified using the RNeasy kit (Qiagen), according to the manufacturer's protocol.
Expression analysis was performed using a custom E. histolytica array from Affymetrix, Inc. (Santa Clara, CA, USA), as previously described (8). Probes were designed according to standard Affymetrix chip design protocols (http://www.affymetrix.com/support/technical/other/custom_design_manual.pdf); up to 16 paired oligonucleotides were designed per gene. A total of 9435 of the 9938 genes predicted in the E. histolytica genome are represented on the microarrays. Repetitive sequences from retrotransposon elements, tRNA genes and the ribosomal RNA episomal circle were not included on the array. Probes were also designed for intergenic regions, though these probes were not considered in this analysis. Labeled cRNA for hybridization was prepared from 4μg of total RNA according to published Affymetrix protocol (http://www.affymetrix.com/support/technical/manual/expression_manual.affx). Hybridization and scanning were performed by the Stanford PAN facility according to Affymetrix protocols (http://cmgm.stanford.edu/pan/).
Two arrays from individual mid-log cultures of E. histolytica (HM-1:IMSS) trophozoites and two arrays from trophozoites subjected to heat shock as described above were included in this analysis. Raw data from the microarray scanner were loaded into the GCOS software (Affymetrix, Santa Clara, CA, USA). Data were scaled to have a mean value of 500. Data from probes designed to intergenic sequence were removed. The remaining scaled data were loaded into GeneSpring (Agilent Technologies, Palo Alto, CA, USA) and normalized per chip, to give a median expression value of 1, and yielding an approximately normal distribution. Normalized data were log2 transformed, giving a median value of 0, and the two replicates for each condition (untreated trophozoites and trophozoites subjected to heat shock) were each averaged.
The complete E. histolytica genome sequence was obtained from The Institute for Genome Research (TIGR, http://www.tigr.org/tdb/e2k1/eha1/). The amino acid sequence, nucleotide sequence and locations of all predicted open reading frames (ORFs) were retrieved from TIGR (download date February 21, 2006). This information was used to retrieve the region from −500 to −1 relative to the predicted translation start site for each ORF. We have sequence data for promoter regions of 7638 genes that are present on the microarray.
The MEME and MAST programs were downloaded from UCSD (http://meme.sdsc.edu). The MEME motif elucidation program was run with the command line arguments: –dna –mod zoops –minw 6 –maxw 16 –minsites 35 –nmotifs 30. This identifies 30 motifs found zero or one times in each promoter, with a width between 6 and 16 nucleotides. We created a custom background Markov model file with the nucleotide frequencies in the E. histolytica genome. This is necessary, as the mono- and dinucleotide frequencies of non-coding sequences in the E. histolytica genome are highly skewed (~80% A/T content), and the Markov chain created by MEME is sensitive to the background frequency of nucleotides. We used a custom Python program to determine the correlation coefficient between each pair of motifs. Motifs identified in both expression categories that had a correlation coefficient 0.7 were eliminated from the analysis, leaving a total of 22 motifs (Table 1). We used the MAST program to identify all occurrences of each motif in the promoter sequence database. Command line option used for the MAST program was: –ev 1000. This allows sequences with an e-value of less than 1000. The relatively high e-value for motifs is required for finding some of the shorter motifs. We again made use of the background Markov model file previously generated. Custom parsers for MEME and MAST were written in the Python programming language, and are available from the Biopython project (http://biopython.org/).
To create a hidden Markov model (HMM) of the Ehssp gene family, the amino acid sequences of the C terminal domain for 48 representative Ehssp genes (shown in bold in Supplementary Table S1) were aligned using clustalw (http://clustalw.genome.jp/). The HMMER package was downloaded from Washington University (http://hmmer.wustl.edu). The clustalw alignment was used as input to the hmmbuild program, for generating a profile HMM. This model was run through hmmcalibrate, which calibrates the search statistics for the HMM. We used the hmmsearch program to search all E. histolytica predicted ORFs for significant similarity (e-value < 1e − 7) to our HMM.
We used the hypergeometric distribution to determine the significance of enrichment for each motif identified in the two expression categories. Briefly, the number of occurrences of each motif in genes with high expression and genes with low expression were identified, as well as the number of times the motif occurred in the promoter database as a whole. This allowed us to determine the relative enrichment of each motif for the expression data set in which it was identified. To confirm the motifs we identified as being overrepresented in the appropriate expression set, we randomized the nucleotide positions of the motif and measured the overrepresentation of the shuffled motif in the appropriate expression category. Randomization was performed 1000 times, and the number of times the shuffled motif was significantly enriched, with a P-value less than or equal to the original motif, in the appropriate expression category was counted.
Custom Bayesian classifier libraries were written in Python. Bayes’ theorem, shown in the formula below, gives the conditional probability distribution of a random variable H given a second random variable E, given the marginal probability distribution of H, the conditional probability of E given H and the marginal probability distribution of E.
In our case, we want to know the probability of a gene being expressed given the presence of each motif in its promoter. For each motif, the likelihood of it being present in genes with high or low expression was determined.
To create our Bayesian classifier, we used the motifs we previously identified using the MEME program. The promoter database for the entire E. histolytica genome was then queried using the MAST program against this set of motifs. For ease of calculation, only the most 3′ occurrence of each motif in a promoter was used in our Bayesian classifier. To determine the accuracy of the Bayesian classifier, we created training and cross-validation sets by randomly removing 25% of the genes. We then trained the classifier on the training set. The classifier was then evaluated using the cross-validation set. The random partitioning was repeated 1000 times, to give an estimate of the accuracy of the Bayesian classifier. For each correct prediction by the classifier, we determined which motifs were present in the promoter. By identifying the most frequent combinations of motifs in the correct predictions, we were able to determine if there was any evidence for significant co-occurrence of the motifs.
Amebic nuclear proteins were obtained using the protocol described in (23). Briefly, 1 × 107 mid-to-late log-phase trophozoites were washed one time with ice-cold phosphate buffered saline (PBS) solution. Cells were harvested in ice-cold PBS and centrifuged for 5min at 430 × g. Cells were then resuspended in hypotonic buffer (10mM HEPES pH7.9, 1.5mM MgCl2, 10mM KCl, 6% NP-40) supplemented with protease inhibitors (500μM AEBSF, 1μM leupeptin, 1μM E-64d), and incubated 20min on ice. Nuclei were collected by centrifugation for 10min at 1000 × g at 4°C, resuspended in hypertonic lysis buffer (20mM HEPES pH7.9, 420mM NaCl, 1mM EDTA, 1mM EGTA, supplemented with the same protease inhibitor mix), and incubated on ice for 30min. After lysis, nuclear and membrane fractions were removed by centrifugation at 18000 × g for 20min at 4°C. The soluble fraction, containing nuclear protein, was snap frozen on dry ice, and frozen at −80°C. Protein content was determined using Bradford reagent (24).
Double-stranded oligonucleotide probes (Supplementary Table S2) were designed for each motif, such that the most common nucleotide was used for each position. About 50pmol of double-stranded oligonucleotide were labeled with α-32P dATP using Klenow fragment (Invitrogen) in the supplied buffer according to manufacturer's protocol. Binding reactions occurred in binding buffer (10mM TrisHCl pH7.9, 50mM NaCl, 1mM EDTA, 3% glycerol, 1mg/ml bovine serum albumin, 1mg/ml salmon sperm DNA) with 5μg of nuclear protein and 50fmol of labeled probe. Binding occurred for 30min at 30°C. For competition, 2- and 10-fold molar excess unlabeled oligonucleotide were added to the binding reactions prior to incubation at 30°C. Protein–DNA complexes were resolved on a 0.5 × TBE polyacrylamide gel, which was then vacuum-dried and exposed to a storage phosphor screen (Kodak). Phosphor screens were developed using software from Molecular Dynamics.
To identify the expression profile of mid-log phase E. histolytica (strain HM-1:IMSS) trophozoites, we used a custom short oligonucleotide microarray fabricated by Affymetrix, Inc (Santa Clara, CA, USA). These arrays were designed with probes targeting both predicted mRNAs (9435 of the 9938 predicted genes are represented on the array) and predicted intergenic sequences (8). Total RNA from two independent cultures was used to generate two expression profiles, which had a correlation coefficient of 0.96, indicating highly reproducible results. In addition, the hybridization intensity of the probes targeting coding sequences was significantly higher than those from intergenic sequences (Supplementary Table S3). Consistent with previous data using the same microarray, ~86–89% of probe sets showed hybridization above background levels and were considered ‘Present’ according to the Affymetrix GCOS software (Affymetrix, Inc., Santa Clara, CA, USA) (8). After removing probe sets targeting intergenic sequences and normalization, the data formed a unimodal approximately normal distribution (Figure 1A). There was excellent correlation between the microarray signal intensity and the amount of transcript as assessed by northern blot and semi-quantitative reverse transcriptase-polymerase chain reaction (RT-PCR) analysis ((8,12,25) and R. MacFarlane, G.M. Ehrenkaufer, and J.A. Hackney, unpublished data). Based on these results, we concluded that we could identify genes that differ in steady-state expression level in E. histolytica trophozoites.
Coordinate gene expression is often driven by conserved cis-promoter elements. To identify potential regulatory elements, we followed the bioinformatic procedure depicted in Figure 1B. The E. histolytica genome is highly compact (median intergenic size 326bp), amebic genes have very short 5′-untranslated regions (median 21bp), and almost all E. histolytica promoter elements identified to date are within 400bp of the start codon ((14,27) and J.A. Hackney, unpublished data). The nucleotide sequences of the region from −500 to −1 relative to the predicted start codon were retrieved from the E. histolytica genome sequence data set and should contain the majority of promoter regulatory regions. Although this genomic region will often extend into the adjacent gene, we do not believe this should pose a substantial problem with our algorithm, as elements identified in neighboring genes will not likely be significantly enriched in either expression category. Our analysis will not identify motifs that are present >500bp upstream from the start codon or those present downstream of the stop codon (in the 3′-untranslated region or adjacent genomic regions). In addition, as we are only examining steady-state levels of gene expression, our approach cannot address differences in mRNA stability or other post-transcriptional regulatory processes. Promoter regions that were >98% identical to another promoter and those for which <500bp of sequence was available (at the end of a contig for example) were not analyzed. Following these criteria, a total of 7638 unique promoter sequences were obtained.
In order to identify motifs which correlated with gene expression level in E. histolytica trophozoites, we used the MEME program (27) to identify conserved motifs in the promoter regions of genes with very high (defined as a log2 normalized signal 4; n = 630) and very low (defined as log2 normalized signal −4; n = 477) expression profiles (Figure 1A). Thirty motifs were identified in each set of promoters, with several motifs identified in both sets. We removed motifs identified in the promoters of both gene sets that had a correlation coefficient of 0.7, leaving a total of 22 motifs (Table 1). We then used the MAST program (28) to identify all occurrences of these 22 motifs in the promoters of all E. histolytica genes for which we have promoter sequences. Since we hypothesized that motifs found in each of these sets would be predictive of gene expression level, we would expect, for example, that motifs identified in promoters of highly expressed genes would be enriched in the promoters of this set of genes. Using the hypergeometric distribution, we found that 15 of the 22 promoter motifs were significantly over-represented in the promoters of genes of the appropriate expression data set (P < 0.01), compared to the rest of genes for which we have promoter sequences (Table 1). That is, the promoter motif was more frequently found in the appropriate data set than would be expected by random chance. However, the relative enrichment of a motif in a given gene expression data set was lower than data reported elsewhere (which is often as much as 100-fold enrichment (1)) most likely due to our analysis of only one expression condition instead of the complex data sets typically analyzed. Of the fifteen motifs identified as being significantly enriched in the gene expression data set of interest, seven were biased to the 3′ end of promoter regions (towards the start codon of the gene), seven were equally distributed along the 500bp promoter region, and only one was biased towards the 5′ end of the putative promoter region (Supplementary Figure S1). This suggests the motifs we identified likely represent motifs in the promoter regions of genes of interest, and not 5′ or 3′ elements from the neighboring gene. To determine whether the motifs were overrepresented because of the motif sequence, or merely because of nucleotide bias, we randomly permuted each motif 1000 times and determined whether the shuffled motif showed similar enrichment in the appropriate expression category (Supplementary Table S4). In many cases, no significant motifs were generated by the randomization. In a few cases, however, we did find randomly shuffled motifs that had P-values equal to or lower than the original motifs we identified, likely due to the low nucleotide variability of a motif (e.g. the A/TAAACCCT motif has low nucleotide variability and thus at least some of the 1000 randomly shuffled motifs are likely to substantially resemble the original motif).
We wished to further characterize the motifs described above in order to identify potential cooperative action. For this analysis, we used a Bayesian classifier to identify groups of motifs that were significantly predictive of gene expression levels. The concept of Bayesian statistics is that previously observed frequencies of event occurrences give an indication of the likelihood of those events occurring again. In our case, we are trying to determine the likelihood that a given gene will have a specific expression level based on whether or not it has one or more motifs in its promoter region. Bayesian statistics have been previously coupled with identification of promoter elements to identify patterns of gene regulation dependent on multiple transcription factor binding sites (4,5,29). Because transcription factors often coordinately influence gene expression, this type of analysis is better suited to identification of combinations of motifs than identification of the single motif alone. We hypothesized that we could use the motifs identified in promoters of genes with very high and very low gene expression levels to classify a larger set of genes with more moderate gene expression levels (Figure 1A and B). As expected, many of the motifs that were significantly enriched in promoters of genes with very high gene expression were also enriched in the promoters of more moderately expressed genes (defined as log2 normalized expression value 2.5; 1473 genes, 19% of genes for which we have promoter data; Supplementary Table S5). Likewise, most of the motifs significantly enriched in promoters of genes with very low gene expression were also enriched in the promoters of genes with low gene expression (defined as log2 normalized expression value of −2.5; 1153 genes, 15% of genes for which we have promoter data) (Supplementary Table S5). We can make no predictions about genes with median gene expression levels (log2 normalized expression values between −2.5 and 2.5) as the motifs we previously identified should only be indicative of low or high gene expression; thus these genes were not considered. We also removed genes whose promoters contain none of the motifs, as we cannot make any predictions about their expression levels. A final data set of 1584 genes was used in the Bayesian analysis. To assess the accuracy of our Bayesian classifier, we used a cross-validation strategy, randomly removing one quarter (396) of genes for later testing. We then trained the classifier on the remaining data set of 1188 genes, which includes genes with log2 normalized expression values 2.5 or −2.5 and at least one previously identified motif in their upstream regions. After training the Bayesian classifier, the expression levels of 396 genes in the cross-validation set were predicted using the motifs present in their promoters. We repeated this test 1000 times, each time removing a random set of 396 genes and training the classifier on the remaining 1188 genes. Overall, our prediction accuracy was 68 ± 2%. The P-value for correctly predicting this fraction of genes is 7.4 × 10−7, according to the binomial distribution. Highly expressed genes were more likely to be correctly predicted at 72 ± 2%, while low expressed genes were correctly predicted at 62 ± 2%. The difference in predictive power between the two expression categories is likely due to the greater number of genes with high gene expression in our data set. As Bayes’ theorem incorporates the background frequency of each class occurring and weights the predictions accordingly, our predictions are necessarily weighted toward the genes with high expression, which are the more common set. We believe our results compare favorably with other reports using Bayesian statistics to infer regulatory patterns considering the relatively small size of our gene expression data set (2 arrays versus ~255 arrays used in other studies) (4,5).
We wished to further characterize the promoter motifs that were used to correctly predict gene expression profiles in E. histolytica. For this analysis, we identified all occurrences of a given motif in the promoters of all genes for which we have sequence data, regardless of expression level. We identified a single motif (consensus sequence A/TAAACCCT) that was predictive of high gene expression in trophozoites (282 of 1473 highly expressed genes have at least one copy of this motif in their promoters, P < 0.01) (Table 2). This motif is very similar to an enhancer element in Schizosaccharomyces pombe (AAACCCT), which is found upstream of all S. pombe histone genes (30,31). In S. pombe this element is in direct repeats, whereas in E. histolytica gene promoters it is generally present in only a single copy. Although we did find this motif in the promoters of 4 of the 13 histone genes in E. histolytica, the prevalence was not statistically significant. However, in E. histolytica, this motif was highly enriched in the promoters of ribosomal proteins (72 of 134, P < 0.01), and tRNA synthetase genes (15 of 28, P < 0.01). Importantly, the occurrences of this motif were not restricted to a single type of tRNA synthetase or ribosomal protein subunit, suggesting the presence of the motif in these promoters is not simply through evolutionary conservation of duplicated promoter sequences (Supplementary Table S6).
In order to confirm that the motifs identified on the basis of bioinformatic analyses bind amebic nuclear proteins in a sequence-specific manner, we used EMSAs. We chose motifs such as M37, M40 and M41 that had significant enrichment within the appropriate expression category, were biased toward the 3′ end of the promoter, and had strong conservation at each position in the motif. We identified that E. histolytica trophozoite nuclear extract bound motifs M37 and M40, with a single band identified in the binding reaction, which was competed by 2- or 10-fold molar excess of cold oligonucleotide, but not by a similar excess of shuffled cold oligonucleotide designed to retain similar nucleotide content as the original motif (Figure 2A and B; Supplementary Table S2). For motif M41, three bands were observed upon interaction of the oligonucleotide containing the motif and amebic nuclear extract, however only one band (marked with an arrow) appears to be binding in a sequence-specific manner (Figure 2C). This confirms that our computational approach successfully identified promoter motifs that are recognized by specific DNA-binding proteins in E. histolytica.
We also identified a set of three motifs (M24, M23 and M9) that were overrepresented in promoters of genes with very low expression (Table 1, Figure 3A). Examination of the motif combinations most frequently found by our Bayesian classifier indicated that these motifs were likely to show significant co-occurrence in promoters of genes with low expression. Thus, we analyzed gene expression of all genes that have these motifs (either single or multiple motif combinations) in their promoter regions. We found that genes whose promoters contained only one of the three motifs, but not the other two, had expression levels close to median (Figure 3B). However, genes whose promoters contained 2 motifs had extremely low gene expression signal. Additionally, we found a strong propensity toward co-occurrence of these three motifs in the promoter of a given gene (P < 0.01, Supplementary Figure S2A). Genes that had 2 of these motifs were also more likely to be in the set of genes with moderately low gene expression than genes with only a single motif.
The motif M24 (GAATGATG) is novel, not previously described in E. histolytica or other organisms. The motif M23 (AACTATTTAAACATC/TC) is very similar to the amebic TATA box, but has additional flanking sequences. The M9 motif (TGAACTTATAAACATC) is composed of two motifs: M9A (TGAACT), similar to the amebic GAAC element and M9B, a novel motif (TATAAACATC) (14). The two parts of M9 appear to be inseparable in the context of the other two motifs, M23 and M24. M9B by itself is not significantly co-associated with M24 or M23 and also does not predict low gene expression levels (Supplementary Figure S2B).
A total of 58 genes have all three motifs (M9, M23 and M24) in their promoter regions (Supplementary Figure S3A) and 182 genes contain at least two of the three motifs (Supplementary Figure S3B). The majority of genes that contain 2 of these motifs are annotated as hypothetical proteins by The Institute for Genomic Research, but a number of those have significant BLASTP matches to a polymorphic, charged antigen, E. histolytica stress-sensitive protein 1 (Ehssp1) (32). Three subfamilies of the Ehssp gene family were previously identified, varying in domain composition, primarily of the charged medial domain (32). We created an HMM of the C-terminal domain shared by all three subfamilies. Searching the E. histolytica genome sequence using this HMM, we identified 253 members of the Ehssp gene family (Supplementary Table S1).
In the Ehssp gene family, members that have 2 of the M9, M23 and M24 promoter motifs, the motif position and organization were highly conserved (Supplementary Figure S3C). This stereotypical spacing was retained regardless of the relative divergence of the rest of the promoter sequences for each given gene (data not shown). However, Ehssp family genes that do not retain 2 motifs appear to be more divergent from each other than those that have retained the motifs (data not shown). This suggests that either the Ehssp genes that have 2 motifs are more recent gene duplication events than the Ehssp genes with 1 motif, or that Ehssp genes with 2 motifs were subjected to stronger evolutionary pressure than Ehssp genes with 1 motif. At present, we cannot distinguish between these two possibilities.
There are 43 genes that contain 2 of the promoter motifs M9, M23 and M24 but which do not belong to the Ehssp gene family (Supplementary Table S7). Unlike the genes in the Ehssp gene family, these genes do not tend to have low expression in trophozoites (median log2 normalized signal 0.0 ± 0.70). There were no apparent functional categories enriched within this set of genes (data not shown). The spacing of motifs M9, M23 and M24 in these promoters differs from the Ehssp genes: while there is a trend toward motifs M9 and M23 being found at the 3′ end of the promoter, the distribution throughout the promoters is more broad than in the Ehssp genes (Supplementary Figure S3D), and the relative ordering of the motifs is not as conserved as in the Ehssp genes (data not shown).
When the Ehssp gene family was first identified, it was shown that some members of this gene family were responsive to heat and oxidative stresses (32). However, the authors did not determine how many members of the Ehssp gene family were responsive to stress, or if there were other members of the Ehssp gene family that were not stress responsive. To determine this, we subjected two independent cultures of E. histolytica (HM-1:IMSS) trophozoites to heat shock for 1h at 42°C on different days, isolated RNA and hybridized it to two Affymetrix microarrays. The results from the two microarrays were comparable to each other (correlation coefficient = 0.9) and the data gave a unimodal approximately normal distribution (Supplementary Figure S4). Our data were comparable to previous analyses of the heat shock response in E. histolytica and other systems, with ~10% of the genes on the microarray up-regulated by 2-fold (33,34). Additionally, many genes previously identified as up-regulated under heat shock (including the heat shock proteins and Ehssp genes) were identified as such in our analysis (21,32,35). A scatter plot showing microarray expression data for all Ehssp genes (those with 2 motifs and 1 motif) and non-Ehssp genes with 2 motifs under standard culture and heat shock conditions is depicted in Figure 4. The Ehssp genes with 2 motifs were significantly more likely to be up-regulated and to a much higher degree by heat shock than either the Ehssp genes with 1 motif or the non-Ehssp family genes with 2 motifs (Figure 4 and Supplementary Table S7). Overall, the Ehssp family genes with 2 motifs had a 25.5 (±4.73)-fold increase in expression (±standard error) after heat shock, while the Ehssp family genes with 1 motif had a 3.77 (±0.82)-fold increase in expression (±standard error) after heat shock (P = 1.2 × 10−5 compared to the Ehssp genes with 2 motifs, Supplementary Table S7). The non-Ehssp family genes with 2 motifs had a 4.84 (±1.43)-fold increase in expression (±standard error) after heat shock (P = 5.1 × 10−5 compared to Ehssp genes with 2 motifs, P = 0.51 compared to Ehssp genes with 1 motif; Supplementary Table S7). Although some probes for the Ehssp family and non-Ehssp family genes with 1 motif are up-regulated under heat shock conditions, they are a significant minority of the data set. Whether this represents signal cross-hybridization between members of this gene family or that other motif(s) are responsible for the up-regulation of these genes is not clear at present.
The transcriptional up-regulation of the Ehssp genes with 2 motifs M9, M23 and M24 under heat shock conditions could be due to a number of scenarios. One possibility is that amebic nuclear proteins that function as repressors bind to the motifs under standard culture conditions, but do not bind under heat shock conditions. Thus a loss of repression and resultant induction of gene expression would occur under heat shock conditions. Alternatively, activators may bind to the motifs only under heat shock conditions, facilitating induction of gene expression. We decided to identify whether specific nuclear proteins bind to the M9 motif differentially under heat shock conditions to determine which of these models was correct. We chose the M9 motif because it (i) is highly overrepresented in heat-shock-responsive Ehssp genes, (ii) is novel and (iii) has a highly conserved distribution in the Ehssp gene promoters. We performed EMSA analysis with the M9 motif, both with E. histolytica nuclear extract from untreated trophozoites, and with nuclear extract from parasites exposed to heat shock at 42°C for 1h (Figure 5). A faint band was seen with the M9 motif and standard nuclear extract, but that band does not appear to represent a specific interaction. In contrast, under heat shock conditions, four bands were identified by EMSA analysis, two of which appear to be specific interactions to M9. Both 2- and 10-fold molar excess unlabeled oligonucleotide with M9 substantially competed the binding interaction, whereas competition with M9A mutant and M9B mutant were equally ineffective at competing the two specific bands, although two other bands were substantially competed (Figure 5B). This suggests that some amebic nuclear proteins bind M9 in a much stronger complex when bound to both sites of the motif, than to either part alone. Overall, the data suggest that protein complex(es) bind to M9 specifically under heat shock conditions and are likely involved in transcriptional activation. Further kinetic studies will be needed to determine the exact nature of the binding between the protein(s) and M9 motif and the potential roles of the M23 and M24 motifs in this binding and gene activation.
We have identified patterns of promoter motifs in genes with similar expression levels in the protozoan parasite E. histolytica using Bayesian algorithms. Despite the small size of our microarray data set, we identified a number of interesting motifs that were predictive of gene expression in E. histolytica trophozoites. Additionally, we identified a set of three motifs, which in combination of 2 motifs predicted low baseline gene expression and transcriptional up-regulation under heat shock. Finally, we have validated by EMSA analysis that a number of motifs we identified represent specific binding sites for amebic nuclear protein(s).
Microarray expression analysis is a powerful method to characterize transcriptional changes that regulate multiple aspects of pathogenesis. The large data sets produced by such experiments create the opportunity to find global regulatory patterns. Previous studies in S. cerevisiae and C. elegans generated statistical models of gene expression, predicted gene expression levels based on promoter motifs and identified regulatory modules controlling gene expression during the cell cycle, sporulation and osmotic stress, and developmental changes (4,5). These studies relied on exquisitely detailed expression data from multiple conditions and time points of interest with data from up to 255 microarrays in S. cerevisiae. In these situations, the vast amount of transcriptional profiling proved invaluable. However, such detailed information is unlikely to be available for most other systems of interest.
We have shown that substantially less expression data can still be successfully applied to the identification of global transcriptional networks. Using data from two microarrays hybridized with RNA from log-phase E. histolytica trophozoites we were able to identify promoter motifs that correlate with gene expression level. Although none of these motifs were restricted to promoters of genes within a given expression profile, several motifs were enriched in the promoters of genes with similar functions or gene family members, likely indicating biologically significant enrichment. Three of these motifs bound to specific protein(s) in nuclear extracts from E. histolytica trophozoites. The promoter motifs identified by this method should not be core promoter elements as these would be found in all genes, and several of the motifs identified are enriched in groups of co-regulated genes (the ribosomal protein genes, tRNA synthetases and the Ehssp gene family). Thus we would expect transcriptional regulators identified by this approach to be either general transcription factors, or transcriptional enhancers or repressors. One interesting motif identified in this analysis (A/TAAACCCT) strongly correlated with high gene expression under trophozoite conditions. In E. histolytica this motif is highly enriched in the promoters of several groups of genes, including ribosomal proteins and tRNA synthetases. Ribosomal protein genes often show co-regulated expression, as has been extensively detailed in S. cerevisiae (36). Additionally, two promoter motifs were identified in 95% of ribosomal protein genes in Toxoplasma gondii, though neither of these motifs is similar to the A/TAAACCCT motif identified in E. histolytica (37). An analogous promoter motif (AAACCCT) has been described in S. pombe, where it is present in tandem repeats and is found in the promoter of all histone genes (30).
We identified three promoter motifs (M24, GAATGAGT; M23, AACTATTTAAACATC/TC; and M9, TGAACTTATAAACATC) that in any combination of 2 motifs were highly predictive of low baseline gene expression in E. histolytica trophozoites. The M9, M23 and M24 motifs are overrepresented in the promoters of a large gene family homologous to Ehssp1, a stress-sensitive antigen (32). A majority (55%) of the 253 predicted genes in this family contain 2 of these promoter motifs. In promoters of the Ehssp gene family that contain 2 motifs, the position and spacing of these motifs is conserved. Interestingly, it appears that the Ehssp genes with 2 motifs appear to be more similar to each other than Ehssp genes with 1 motif, suggesting that they either have a more recent evolutionary origin, or that the Ehssp genes with 1 motif may no longer be subjected to the same degree of evolutionary pressure as the Ehssp genes with2 motifs.
Ehssp family genes with 2 motifs were highly up-regulated by heat shock. In contrast, the Ehssp family genes with 1 motif and the non-Ehssp family members with 2 motifs showed more modest up-regulation of gene expression after heat shock. We have demonstrated that the M9 motif binds amebic nuclear protein(s) in a specific manner in nuclear extracts prepared from heat-shocked trophozoites, whereas no substantial specific interaction was identified with nuclear extracts prepared from untreated trophozoites. Our model is that transcriptional activator(s) bind to the M9 motif under heat shock conditions, up-regulating transcription of the Ehssp genes. How binding of the potential activator(s) relates to the presence of other motif(s), and to the spacing between motif M9 and the transcription start site are yet to be determined.
Few complex transcriptional regulatory networks have been identified to date in other parasitic systems. Recent bioinformatic analysis of P. falciparum promoters relied on large-scale expression data combined with evolutionary conservation to identify 12 putative regulatory elements, two of which had been previously described (2). This work also identified an overrepresented group of promoter motifs, many of which had potentially opposing activities. The most complex transcriptional network characterized in a protozoan parasite to date was found in analysis of promoters from heat shock protein (hsp) genes in P. falciparum (3). This work identified a novel G-box motif that was conserved in the promoters of the related Plasmodium species, P. yoelii, P. berghei and P. vivax. Further in vitro analysis identified a second control element within the hsp86 promoter.
We have demonstrated a method for identification of promoter motifs in E. histolytica promoters relying upon small expression data sets. As more amebic expression profiles are determined, this analysis can be extended to identify novel genetic regulatory pathways involved in pathogenesis and developmental control. This work represents a major step forward in identifying transcriptional regulatory networks in protozoan parasites and indicates that this statistical method can be broadly applied in other organisms.
Supplementary Data is available at NAR Online.
We would like to thank all the members of the Singh lab, Swathi Rao, Jon Boyle and Jeroen Saeij for their comments and helpful discussion. J.A.H. was supported by NIH grants T32-AI07502 and T32-AI07328. G.M.E. was supported by NIH grants T32-AI07502 and AI-068899. U.S. was supported in part by NIH grants AI-53724 and AI-37941. U.S. is a member of the Stanford University Digestive Disease Center funded by the NIH grant, P30 DK56339. The authors have no competing interests. Funding to pay the Open Access publication charge was provided by NIAID.
Conflict of interest statement. None declared.