While intense research efforts have focused on understanding how gene expression is regulated in model organisms, there are thousands of species important to human health, the environment, and global economies whose transcriptional control mechanisms are not well represented by current biological models. One such species is the apicomplexan parasite responsible for the most lethal form of malaria in humans, Plasmodium falciparum. When the P. falciparum genome sequence was published in 2002, it was revealed that the nucleotide composition was unusually AT-rich (~80% AT on average, ~90% AT in intergenic regions) with approximately 60% of the predicted genes possessing no known function [1]. Furthermore, initial analyses of the genome using BLAST and profile-Hidden Markov Model searches suggested an apparent dearth of transcription factors [1], leading to much speculation that the parasite relied primarily on post-transcriptional regulatory mechanisms for control of its gene expression.
However, over the past 15 years, several investigators using traditional experimental approaches have identified, on a gene-by-gene basis, promoter regions, and in some cases specific sequence elements, that are important for proper gene expression [4]. Additionally, microarray expression data have shown that, for the majority of genes, transcript levels vary significantly between different stages of the parasite life cycle [13], and recent applications of more sensitive bioinformatic methods, such as two-dimensional hydrophobic cluster analysis coupled with profile-based search methods, have identified additional components of the core transcription machinery [15]. Thus, although post-transcriptional mechanisms such as anti-sense transcription [16], selective repression of transcript translation [20], or epigenetic mechanisms [23] are likely to play crucial roles in the regulation of parasite gene expression, a central role for transcriptional regulation in P. falciparum gene expression cannot yet be ruled out.
With the recent emergence of genomic sequences and associated transcriptome datasets for many species, in silico methods of cis-regulatory element discovery offer much promise for rapidly elucidating mechanisms of transcriptional control. This is especially true in non-model organisms such as P. falciparum, where traditional genetic and biochemical experimental methods have been slow to yield insights. Some of the most commonly used approaches include MEME [24], AlignACE [25], MDScan [26], and Weeder [27] (for a comprehensive review see [28]). Most of these methods use some type of statistical background-modeling approach to identify putative transcription factor binding sites as sequence motifs that occur in the promoter regions of co-expressed genes at a greater frequency than would be expected in a random set of promoter regions (i.e., the background). Although these methods have been successful when applied to organisms possessing well-annotated genomes with AT contents between 40% and 70% [29], we have found that they tend to produce an undesirably high number of false-positive regulatory elements when applied to AT-rich P. falciparum promoter sequences. Thus, to overcome the challenges that the AT-rich P. falciparum genome poses to in silico cis-regulatory element discovery, we have developed an algorithm called Gene Enrichment Motif Searching (GEMS).
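To illustrate why base composition matters for background modeling, the following minimal sketch computes the expected number of occurrences of a k-mer under a simple zero-order background, in which each base is drawn independently. This is an illustration only and is not the background model of any of the tools cited above. The function names, the 1,000-bp promoter length, and the example k-mer are hypothetical; the AT-rich frequencies assume the ~90% AT intergenic composition is split evenly between A and T, and the remainder evenly between C and G.

```python
from math import prod


def kmer_probability(kmer, base_freqs):
    """Probability of seeing `kmer` at one position when bases are independent."""
    return prod(base_freqs[base] for base in kmer)


def expected_occurrences(kmer, base_freqs, seq_len, n_seqs):
    """Expected count of `kmer` on one strand across `n_seqs` sequences of length `seq_len`."""
    positions = max(seq_len - len(kmer) + 1, 0)  # number of possible start positions
    return kmer_probability(kmer, base_freqs) * positions * n_seqs


# Assumed base frequencies (illustrative, not measured values).
AT_RICH = {"A": 0.45, "T": 0.45, "C": 0.05, "G": 0.05}
UNIFORM = {base: 0.25 for base in "ACGT"}

if __name__ == "__main__":
    kmer = "TATATA"  # hypothetical AT-only hexamer
    for label, freqs in (("AT-rich", AT_RICH), ("uniform", UNIFORM)):
        count = expected_occurrences(kmer, freqs, seq_len=1000, n_seqs=1)
        print(f"{label}: {count:.3f} expected occurrences per promoter")
```

Under these assumptions, an AT-only hexamer is expected about 34 times more often in the AT-rich background than in a uniform one (1.8 to the sixth power), so a motif finder whose background model underestimates AT content will tend to report such k-mers as over-represented.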
When applied to the P. falciparum genome, GEMS was able to identify putative cis-regulatory elements in the repeat-sequence-rich, base-biased genome by (1) using a hypergeometric-based scoring function to analyze empirical sequence data without the use of repeat masking, and (2) eliminating the guesswork of mismatch and similarity threshold selection by using an exhaustive parameter optimization routine to determine the best representation of putative cis-regulatory elements as position-weight matrices (PWMs).
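As a rough illustration of what a hypergeometric-based enrichment score can look like, the sketch below computes the upper-tail probability of finding at least k motif-containing promoters in a cluster of n genes, given that K of the N promoters in the genome contain the motif. The exact GEMS scoring function is not specified in this section, so this generic test is an assumption; the function names and all numbers in the usage example are hypothetical, and motif matching is simplified to an exact substring search on one strand rather than a PWM scan.

```python
from math import comb


def count_sequences_with_motif(motif, sequences):
    """Count sequences containing `motif` (exact match, one strand: a simplification)."""
    return sum(1 for seq in sequences if motif in seq)


def hypergeom_enrichment_pvalue(N, K, n, k):
    """P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: promoters in the genome, K: promoters containing the motif,
    n: promoters in the cluster, k: cluster promoters containing the motif.
    """
    if not (0 <= K <= N and 0 <= n <= N):
        raise ValueError("require 0 <= K <= N and 0 <= n <= N")
    upper = min(K, n)  # largest possible number of hits in the cluster
    if k > upper:
        return 0.0
    lower = max(k, 0)
    # math.comb returns 0 for impossible draws, so no extra bounds checks are needed.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(lower, upper + 1))
    return tail / comb(N, n)


if __name__ == "__main__":
    # Hypothetical counts: 5,000 promoters, 400 with the motif,
    # and a 50-gene cluster in which 15 promoters contain it.
    p = hypergeom_enrichment_pvalue(N=5000, K=400, n=50, k=15)
    print(f"enrichment p-value: {p:.3e}")
```

Because the counts come directly from the observed sequences rather than from a fitted base-composition model, a score of this kind compares a cluster against the genome's actual motif frequency, which is one plausible reason such an approach would tolerate AT-rich, repeat-rich sequence without masking.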
When applied to the promoter regions of genes contained within 21 functionally enriched co-expression gene clusters generated from P. falciparum life cycle microarray expression data using the semi-supervised clustering algorithm Ontology-based Pattern Identification (OPI) [30], GEMS identified 34 high-confidence putative cis-regulatory elements, including many of the cis-regulatory elements previously described in the P. falciparum literature. These 34 motif candidates were found in the promoter regions of genes associated with a wide variety of parasite processes, including sexual development, antigenic variation, cell invasion, sporozoite development, ribosome function, and DNA replication, thus supporting the hypothesis that cis-regulatory elements play an important role in the transcriptional control of a diverse array of P. falciparum biological processes. Additional support for the biological relevance of these motifs came from comparative genomic analyses of orthologous promoter sequences from rodent malaria species and from the detection of element positional enrichment relative to gene start codons. Furthermore, the function of a regulatory element associated with cell invasion genes described herein was characterized using reporter gene and electrophoretic mobility shift assays (EMSAs). Collectively, these results provide much-needed, robust starting points for the future biological characterization of cis-regulatory elements in P. falciparum and demonstrate more generally that in silico approaches to understanding transcriptional regulation mechanisms can be used successfully to predict regulatory elements in non-model organisms possessing unusual genome characteristics.