This study reports our findings on changes in gene expression between those stages of
T. brucei that can be readily studied in the laboratory. During natural infections in the vertebrate host,
T. brucei progresses from the actively dividing slender BF to the non-dividing stumpy BF. This form is primed for differentiation to procyclic forms in the insect host. We examined the expression level of nearly all predicted protein-coding genes of strain 927
T. brucei, as well as many RNA genes, in five different parasite populations. Most of the analyses reported here are restricted to nuclearly encoded transcripts, although some findings for mitochondrially derived transcripts are noted. Three biological replicates were used for each test condition. Three of the five sets were derived from
in vitro culture under highly standardized conditions: log-phase cultured bloodstream forms (cBF), log-phase cultured procyclic forms (PF-log), and stationary-phase procyclic forms (PF-stat) (see Additional file
1). The remaining two samples, slender BF and stumpy BF, were derived from infected rats and were likely to show more inter-sample variability, since they were grown for varying times in rats, and the animals became progressively less healthy after irradiation and infection. Since the stumpy BF populations were the last to be harvested and each contained a variable proportion of intermediately differentiated forms, these populations were the most biologically variable. They ranged from 76% to 93% morphologically stumpy and 68% to 88% of cells expressed the PF marker EP procyclin upon appropriate stimulation (see Table ). The slender BF populations showed less than 5% intermediate or stumpy forms and less than 1% expressed procyclin after stimulation.
| Table 1Characterization of bloodstream form T. brucei from animals |
The Nimblegen arrays that were hybridized with cDNA prepared from each parasite population contained multiple probes per gene (in most cases, eight), and three copies of these probe-sets per chip. After normalization to allow for cross-chip comparisons, we calculated a single value for each set of triplicate probes using Tukey's biweight formula. We then obtained a single gene level value using the Tukey biweight of the signals for the probes corresponding to each gene (Additional file
2). In most cases, this value represented the "average" of 72 data points collected for every protein-coding gene (45 for RNA genes) in each of the five growth conditions. Thus, as detailed in Methods, we were able to utilize robust statistical analyses that increase our confidence in the findings of the comparisons reported below.
Gene expression levels
The normalized expression values from the 8110 probe-sets corresponding to nuclear genes were hierarchically clustered using the TMeV software package and are shown graphically as a heat map in Figure . A small number of genes (those colored black or dark blue in the middle of the figure) showed no or very low expression under any condition, and a similar number (colored red) showed high expression levels in some (top) or all (bottom) conditions. However, the large majority of the genes showed low (blue) or moderate (green or yellow) expression levels. Each biological sample set showed a similar distribution of signal intensities obtained for the 8004 probe-sets corresponding to nuclear protein-coding genes, with the curves obtained from cBF and PF-stat (the most divergent samples) shown Figure . This figure also depicts the signals obtained for the 345 probe-sets that map to multicopy CDSs. This curve was skewed dramatically to the right, as compared to the total sample of probe-sets, confirming gene amplification as a clear strategy for increasing gene expression in T. brucei. Indeed, 20 of the 50 probe-sets detecting the highest signals map to two or more genes (see below and Table ). However, the signals did not follow in rank order of number of genes detected by the probe-sets.
| Table 2The 50 most highly expressed protein-coding genes |
For the total probe-sets, there is a large peak centered at ~4600 for the PF-stat and ~5800 for cBF. In both cases, the peak moves sharply down towards lower values, with a small shoulder at ~200. When the same arrays were probed with RNA derived from a different strain, this shoulder was more pronounced, and a corresponding increase in probe-sets with signal intensities less than 200 was observed (unpublished data). Since many of these probe-sets mapped to
VSGs, most of which are not conserved between strains, a signal level of ~200 can be taken as a generous estimate of background for non-transcribed regions. However, almost all regions of the
T. brucei genome are thought to be constitutively transcribed, and hence even those genes whose mRNAs are unstable would likely show a signal higher than this background. The number of probe-sets that failed to show a signal level of less than four times the "non-transcribed background" signal (
i.e., 800) in least one of the five different biological conditions was small (164 protein-coding genes). Even then,
VSG genes, which are subject to clonal variations in expression, accounted for all but 65 of this set. The majority (55) of the remaining genes have unknown function, with most of these (44) found only in
T. brucei, raising the possibility that they do not represent authentic genes. In addition, almost all of these low-expressing genes are located in sub-telomeric clusters of
VSGs or expression site associated genes (
ESAGs) or are immediately adjacent to convergent strand-switch regions where transcription terminates [
15,
16]. Even when considering only cBF parasites, only 133 predicted non-
VSG protein-coding genes had a signal <800, and of these only 22 had annotated function. Notably, most of these genes were expressed to higher levels in at least one other stage, and those that were low in all stages were located in a sub-telomeric or convergent strand-switch region context.
The protein-coding genes were ranked according to their maximal expression level in any stage, and broadly categorized according to their annotated function (Figure ). As suggested above, VSG genes and T. brucei-specific genes of unknown function accounted for most of the genes showing the lowest expression (86% of the bottom 2% and 64% of the bottom 5%). Conversely, genes which are more highly expressed tend to have some type of functional annotation; indeed many of these have been studied experimentally.
We identified the nuclear CDSs with the top 10% of signals for each biological condition. The genes were individually examined and placed into categories based on their annotation and/or their proteomic detection in specific sub-cellular fractions. Figure shows the distribution of these genes into broad categories as in Figure , while Figure shows the distribution of genes with ascribed function (yellow in Figure ) according to various categories of biological function or location for the each of the five different biological conditions. As indicated in Figure , above, genes with unknown function were under-represented in the highly expressed category, as compared to the whole genome, and this was most striking in the PF-log cells (Figure ). Overall, the categories with the largest numbers of highly expressed genes were translation, metabolism, or cytoskeleton, with the majority (99 out of 144) of the genes in the translation category encoding cytoplasmic ribosomal proteins. Genes involved in mitochondrial function were highly represented in the top expressors in the PF stages, but more modestly represented in BF stages. As expected, significantly higher proportions of ESAGs or genes related to them (GRESAGs) were expressed at high levels in all BF stages than in the PF stages. In general, relatively similar proportion of genes were found in each category in all biological stages, although the number of genes involved in metabolism was increased in PF-log and genes encoding cytoskeleton proteins decreased in stumpy BF. However, this does not mean that the same genes are expressed to high levels in all stages. For example, 13% of mRNAs highly expressed in cBF were more than 2-fold up-regulated as compared to PF and similarly 12% of mRNAs highly expressed in PF were more than 2-fold up-regulated as compared to cBF (see below).
Among the 800 most highly expressed nuclear CDSs in cBF, 307 were highly expressed in every condition examined. Surprisingly, 111 of these had unknown function. Thus, there is a large set of highly expressed
T. brucei-specific (20) or conserved (91) genes that have not been ascribed a function; including nine of the 50 most highly expressed genes in cBF (see Table ). Some of the "hypothetical" proteins encoded by these nine genes have been shown to exist by proteomic analyses (Tb11.01.2800, Tb10.6k15.1500) or other studies (Tb927.1.4600 [
13]). Conversely, two other "hypothetical genes" lie in intergenic regions between coding regions that are highly expressed, raising the possibility that these are not separate genes, but rather simply sequences within 3' UTRs of the neighboring genes. The first, as represented by Tb927.1.2540, is one of a set of seven almost identical putative genes (six of which are annotated as "hypothetical protein, unlikely") that are interspersed between the histone H3 genes. The second, as represented by Tb927.1.4590, corresponds to a set of genes that are interspersed between the highly expressed
CFB1 genes. Tb11.1190 encodes a putative protein which is composed of 49 repeats of 68 amino acids and is represented on the microarray by a single probe corresponding to a unique sequence at the C-terminus. Tb11.02.4690 specifies a 22 kDa protein with a signal sequence and three transmembrane domains. It is expressed to a much higher level than the flanking genes, indicating it is a distinct mRNA. The last of these nine genes, Tb927.4.1000, encodes a 25 kDa protein that is also expressed much more highly than the adjacent genes.
Differential gene expression
Comparison of the Tukey mean maximum and minimum signal levels for all probe-sets corresponding to nuclear genes revealed 122 genes that showed a greater than 10-fold change between two or more of the five biological conditions tested. Of these, 30 were
VSG,
VSG-related (
VR) genes, or
ESAG/GRESAGs, many of which are associated with antigenic variation. A total of 446 genes (including 161
VSGs and
ESAGs) showed more than 4-fold variation, while at the 2-fold level, 2105 genes (including 233
VSGs and
ESAGs) showed a statistically supported difference in expression. Thus, over one-fourth of all of the genes assessed on the microarray were differentially expressed (
i.e. q-value of <5% in multi-class significance analysis of microarrays (SAM)) between at least two conditions. This dataset was further reduced to those showing a >2-fold deviation from the mean of all five conditions in at least one sample, and by excluding all genes encoding VSG/VRs and
ESAG/GRESAGs (which were analyzed separately, see below). These 534 probe-sets were K-median clustered using settings indicated in the Methods to yield nine clusters which contained between 31 and 80 genes (Figure and Additional file
3). Each of these clusters represents a distinct pattern of gene expression, although some are similar. Overall, these highly regulated genes are enriched in those involved in metabolism, proteolysis, translation and
T. brucei-specific unknown functions, but under-represented in those genes that are conserved but have unknown function. However, individual clusters show differential enrichment in particular functional gene categories.
Figure shows the heatmap depicting gene clustering, while Figure shows overlaid expression patterns graphically for every gene in each cluster. Figure depicts specific examples that illustrate the expression patterns characteristic of the gene clusters. Genes in clusters 5, 6, 7 and 8 all have higher mRNA levels in BF than in PF, although the clusters each show subtle differences in expression pattern. For example, cluster 8 contains genes that encode mRNAs substantially down-regulated in PF-log and PF-stat, whereas in cluster 6 the down-regulation in PF is more modest and some genes begin to decrease expression in stumpy BF. Cluster 8 is enriched in genes involved in metabolism and adenosine transport and also includes several genes encoding 64-65 kDa invariant surface glycoproteins (ISGs) and procyclin-associated genes (PAGs). Cluster 6 contains genes encoding three 75 kDa ISGs, several proteases, and a substantial number of T. brucei-specific proteins of unknown function. Cluster 7 contains genes that are somewhat down-regulated in PF-log (relative to BF), but are expressed at even lower levels in PF-stat. Genes encoding proteins involved in interaction, metabolism, protein folding and protein transport or modification, are overrepresented in this cluster. Conversely, cluster 5 contains 60 genes that are down-regulated to a greater extent in PF-log than in PF-stat. This cluster is dominated by genes encoding proteases and T. brucei-specific proteins with unknown function. The former category includes several paralogues encoding homologues of the Leishmania gp63 surface protease.
Clusters 2, 3, and 4 show different patterns of up-regulation with respect to the two PF biological conditions. Cluster 2 contains genes that are up-regulated in PF-log, but not in PF-stat. For some genes this change in expression begins in stumpy BFs. The genes in this cluster are over-represented for those encoding proteins involved in interaction, metabolism, RNA processing, transcription, translation and cytoskeleton function. Their reduced expression in PF-stat is consistent with cessation of growth functions upon entry into stationary phase. Indeed we observed a significant accumulation of rRNA precursors in the PF-stat samples upon analysis on the Agilent BioAnalyzer (not shown). Conversely, cluster 3 contains genes that are up-regulated only in PF-stat and mostly have unknown function, including eight that are T. brucei-specific. It is possible that some of these gene products are involved in preparation for differentiation into epimastigotes, the next stage in the parasite life cycle. Finally, cluster 4 contains genes with higher expression levels in both PF-log and PF-stat and is enriched in genes encoding proteins involved in metabolism, proteolysis, or with unknown function but located on the cell surface or mitochondrion. As discussed in more detail below, this is consistent with the switch to mitochondrial pathways for energy generation in procyclics.
Cluster 1 contains the largest number of genes, with 82 members. The expression of these genes is lower in PF-stat (and stumpy BF in some cases), but the genes are expressed at higher levels in cBF and slender BF than seen in cluster 2. These genes are over-represented in those involved in metabolism, DNA replication/repair, protein folding, proteolysis and translation, consistent with their down-regulation in stationary-phase cells. Cluster 9, with 31 members, is the smallest cluster. The pattern of gene regulation is similar to cluster 1, but with some up-regulation in PF-log. Like cluster 1, this cluster is enriched in genes involved in DNA replication/repair.
The existence of these varied expression patterns implies a complex set of regulatory mechanisms operating at the RNA level to control the abundance of transcripts encoded by nuclear genes. The specific proteins involved in these processes are only beginning to be examined (see for example refs. [
13,
14,
17]).
In contrast to the analyses above that examined the transcripts showing the most variation in abundance, we also looked at the transcripts that showed the least variation. Genes such as these would provide excellent controls for studies of developmental changes in gene expression. We identified 830 genes with a maximum variation in expression between the five stages of <25% (see Additional file
2). As expected, genes encoding proteins involved in known stage-regulated processes such as glycolysis and electron transport are significantly under-represented. However, the group is slightly enriched for genes of unknown function. Genes encoding proteins involved in lipid or fatty acid metabolism are also over-represented, comprising 28% of the metabolic enzymes that showed little variation as opposed to 11% of all metabolic enzymes. Similarly, genes encoding proteins of the ubiquitin pathway represent 40% of all protease-related genes, but 85% of the subset of protease-related genes that showed little variation. Genes involved in histone acetylation or chromatin structure, such as Tb927.7.1690 and Tb927.4.2520 (which encode transcriptional silencer Sir2) also tended to maintain similar mRNA levels between life cycle stages.
Comparison of cBF and log phase PF forms
In order to identify differences in gene expression between specific conditions, we conducted pair-wise comparisons of specific datasets (including cBF versus slender BF, slender BF versus stumpy BF, cBF versus PF-log, and PF-log versus PF-stat) using SAM, setting the q-value to <5% and the fold-change to >2 (see Methods). Because VSG expression is both clonal and highly variable, VSGs are excluded from the gene tallies below, unless otherwise noted.
Comparing the signals between cBF and PF-log, 691 genes were found to be differentially expressed. When the stringency of the SAM analysis was reduced to a 1.7-fold change, 963 genes were detected. A further reduction to 1.5-fold identified 1508 genes--approximately 19% of the genome. Thus, a relatively large fraction of the genome encodes mRNAs that differ in abundance between these two stages. Figure shows a comparison of the functional categories of the genes showing >2-fold regulation; these are individually listed (along with their fold-changes in mRNA expression and q-values) in Additional file
4. Table itemizes those genes upregulated in cBF that have predicted functions (excluding
VSGs and
ESAGS, which are discussed below).
| Table 3Genes with functional annotation that are up-regulated in cBF as compared to PF-loga |
As can be seen in Figure , categories of genes where cBF show higher expression than PF-log cells include
ESAGs and
GRESAGs, uncharacterized proteins bearing interaction motifs (such as zinc fingers and leucine-rich repeats), and known surface and secreted proteins (in part because there are multiple distinct genes in several surface protein families). However, it is also interesting that a larger number of genes upregulated in cBF encode proteins with hypothetical status (conserved or
T. brucei-specific) that have signal sequences. This fits well with the finding that the secretory and endocytic systems are more active in BF than PF [
18]. However, unlike Koumandou et al. [
12], we did not find that mRNAs specifying proteins involved in secretory traffic were highly up-regulated in BF.
Categories with more representatives up-regulated in PF-log cells include those encoding mitochondrial proteins, metabolic proteins, and translation. It is known that the metabolism of PF (which can use both glucose and amino acids for energy metabolism) is more complex than BF (which are highly glycolytic) [
19], presumably accounting for the more diverse set of metabolic genes up-regulated in PF-log cells. In PF, the mitochondrion becomes enlarged with more fully developed cristae and the respiratory chain is active [
2]. These changes are reflected in the increased expression of a large number of genes (72) encoding products associated with the mitochondrion, including 27 that are of unknown function. In contrast, a single gene known to encode a mitochondrial protein is upregulated in cBF: the alternative oxidase. This oxidase is required for the glycerophosphate shuttle that allows glycolysis to continue [
20].
The comparison of mRNA abundance between these two stages led to the identification of several groups of interesting genes. These include those encoding nucleoside transporters NT2-NT7 which reside in an array immediately adjacent to the sub-telomeric
VSG cluster at the "right" end of chromosome 2. There they alternate with a set of iron-ascorbate oxidoreductase genes (see Tb927.2.6180, Tb927.2.6230, Tb927.2.6310) that have not been functionally characterized to our knowledge. The
NT genes were reported to be more highly expressed in BF than PF forms, a finding which we also observe [
21]. Interestingly, all of these oxidoreductase genes are also significantly more highly expressed in cBF than PF-log (3.3-13.2-fold, see Table ). Three additional iron/ascorbate oxidoreductase genes are found on chromosomes 5, 7, and 9 -- these are each expressed to similar levels in cBF and PF-log. Thus, the chromosome 2 region represents a rare cluster of genes encoding similarly regulated mRNAs. Another interesting case is that of
VSP1, an acidocalcisomal pyrophosphatase encoded by two tandemly linked genes (Tb11.02.4910 and Tb11.02.4930) with almost identical coding regions [
22]. The array data show that the two genes are reciprocally regulated, which may potentially be traced to their divergent 3' UTRs.
Comparison of gene expression in BF under different conditions
We compared the expression of all nuclear genes in slender BF isolated from infected animals with the expression in slender BF obtained by
in vitro culture (cBF) and the expression in stumpy BF from animals. In comparison of cBF and slender BF, other than a few
VSG genes, no gene showed a difference in expression that met our criteria of a 2-fold change and q-value < 5%. Additional file
5 lists those genes that showed more moderate (>1.5-fold) or less well-supported changes (q-value < 15). However, two
ISG64 genes (Tb927.5.1390 and Tb927.5.1430) showed a slightly lower (1.6-1.8-fold), but high-confidence increase in signal in slender BF. A few other genes showed similar (1.5-1.9-fold) changes in expression, but had somewhat lower confidence (q-value = 7.9). These included a CAMK group protein kinase (Tb927.7.6580), a nucleoside phosphorylase (Tb927.8.4430), a tryparedoxin (Tb927.3.5090) and two proteins with unknown function that had higher signals in cBF, and a
GRESAG4 that had a higher signal in slender BF. These data contrast with a previous microarray study examining 550 genes that found 35 were upregulated in cBF and 3 were upregulated in slender BF [
12]. None of those 38 genes correspond to the few genes that we identified above. Two sets of genes that we identified as modestly upregulated were on the previous array, but these were not observed to be upregulated in that analysis. The lack of consistency between the two studies in this regard could arise from differences in the strains or conditions (
e.g., medium, serum, use of intact
vs immunocompromised animals). Nonetheless both studies do suggest that
in vitro cultivation provides a reasonable model for analysis of most mRNAs in slender BF.
A comparison of the rapidly dividing slender BF with the non-dividing stumpy BF showed a total of 107 genes with at least a 2-fold change in signal in the arrays, not including
VSGs. About twice as many genes were up-regulated in slender forms (Table ) as were up-regulated in stumpy forms (Table ). The most prominent categories of genes showing increased signals in slender forms are those that are related to the cytoskeleton, including the flagellum (Figure ). Many of these genes are annotated as hypothetical proteins, but they were detected in the flagellar proteome [
23]. Non-dividing forms do not build new flagella or cytoskeleton. Additionally, several metabolic enzymes were up-regulated, predominantly those which are localized to the glycosome (a specialized peroxisome) or are involved in glycolysis. Conversely, the entire set of eight
ESAG9 genes in the 927 strain were upregulated in stumpy BF (from 2-fold to 30-fold), as were two genes that are related to
ESAG9. The function of
ESAG9 is not known; it was originally described as a gene found in a
VSG expression site (ES) in the closely related parasite
Trypanosoma equiperdum [
24]. At that time, the authors noted that a related
ESAG9 was transcribed independently of the
VSG ES. Seven of the eight annotated
ESAG9 genes encode proteins with a predicted signal sequence, but none of these contain predicted transmembrane domains, suggesting the
ESAG9s could encode a family of secreted proteins. The metabolic enzymes encoded by genes with higher mRNA levels in stumpy BF were predominantly mitochondrial, consistent with pre-adaptation for differentiation into insect forms. We also noted the increased mRNA for the
PAD1 and
PAD2 genes, which encode citrate transporters and were previously shown to be upregulated in stumpy forms of
T. brucei strain EATRO 2340 [
25].
| Table 4Genes showing increased expression in slender BF as compared to stumpy BFa |
| Table 5Genes showing increased expression in stumpy BF as compared to slender BF |
Comparison of gene expression of PF in different conditions
Unlike cBF,
in vitro cultured PF can be grown to stationary phase where they can persist for several days as viable cultures. Thus, we could directly compare the abundance of mRNAs in actively replicating (PF-log) and non-dividing (PF-stat) cells. A total of 895 genes showed differential expression (see Additional file
6), many more than the 107 genes differentially regulated between with the slender (log) versus stumpy (stationary) BF. About three times as many genes were up-regulated in PF-log as compared to PF-stat. As shown in Figure , this increase was reflected across almost all categories of genes, except for proteins categorized as unknown (both conserved and
T. brucei-specific). The most skewed group was genes annotated as encoding hypothetical proteins (conserved or
T. brucei-specific) that have predicted transmembrane domains -- many more such genes were upregulated in stationary phase than in log phase. For proteins with ascribed function, those associated with protein phosphorylation/dephosphorylation were enriched in stationary phase. As discussed above, some of changes in PF-stat may reflect the decrease in cellular growth functions, or perhaps preparation for development to epimastigotes. It is also possible that some transcripts with higher signals in PF-stat are simply those that decay most slowly.
Genes encoded by the mitochondrial genome
Several genes on the mitochondrial maxicircle genome are extensively remodeled by RNA editing to yield transcripts encoding components of mitochondrial respiratory complexes. Only 15 mitochondrial probe-sets could be designed (see Additional file
7). Four corresponded to both edited and unedited sequences, and six to never-edited sequences, including the two rRNAs. Three corresponded to edited sequences, two of which had corresponding unedited probe-sets. From this limited set, a few trends could be observed, which were compatible with prior literature [
26-
28]. For example, 12S and 9S rRNA, cytochrome b, cytochrome oxidase subunit I, and cytochrome oxidase subunit II (edited plus unedited) transcripts all increased in stumpy BF and further increased in PF-log, although some did not reach statistical significance until PF-log phase. Somewhat surprisingly, ATP synthase subunit 6 (edited), NADH dehydrogenase subunit 5, NADH dehydrogenase subunit 7 (edited) and NADH dehydrogenase subunit 8 (edited) all showed increased mRNA levels in stumpy BF, but decreased in PF-log. Unexpectedly, many of the signals reached their maximum in PF-stat. This could reflect a potential differential stability as compared to the nuclearly-encoded transcripts under conditions of growth arrest, and would be highlighted by the normalization procedure.
Gene families
We noted several tandem arrays of gene families containing non-identical genes that were differentially regulated. Three families encoding proteins with multiple transmembrane domains are depicted in Figure . The first cluster (Figure ) of genes are those in the recently described
PAD array of carboxylate transporters [
25]. In contrast to
PAD1 and
PAD2, which are induced in stumpy BF [
25], the other members of this gene family are either constitutively expressed at the mRNA level or more highly expressed in PF.
PAD5 and
PAD7 show an increase in expression from stumpy forms to PF-log and even higher expression in PF-stat.
PAD8 showed similar expression in all conditions except in PF-stat, which was ~1.5-fold increased over PF-log (q-value = 4.89). Figure shows an unrelated gene family on chromosome 10 that also encodes major facilitator proteins. These four genes show a high level of conservation with one another, with long stretches of amino acid identity. Two are most highly expressed in BF, whereas the other two show a more complex pattern of regulation. The final set of genes (Figure ) encodes a set of related proteins predicted to have four to five transmembrane domains, four of these genes are tandemly arrayed on chromosome 8. Here the mRNA abundances of the three most closely related genes are higher in the BF samples. In contrast, the first gene in the array and another more divergent, unlinked gene on chromosome 11 do not show this pattern, and have similar or higher expression in PF.
VSGs and ESAGs
The
T. brucei strain 927 genome contains approximately 1600
VSG genes (or pseudogenes), but each BF trypanosome expresses only a single VSG, which covers the surface of the parasite in a dense coat. Although
T. brucei possesses ~20
VSG ESs (located at telomeres of megabase- and intermediate-sized chromosomes) [
29], the expressed
VSG gene encoding the surface coat protein is located in the sole active ES. In BF, transcription initiates in all ESs, but attenuates rapidly in the inactive ESs, never reaching the downstream genes including the resident
VSG [
30]. Similarly, transcription of ESs initiates in PF, but transcript elongation is minimal [
31]. A relatively small number of apparently functional
VSG genes exist on the 11 megabase-sized chromosomes in
T. brucei. The minichromosomes also contain a reservoir of apparently functional
VSG genes, but only a few have been sequenced. In contrast, most
VSG genes reside in sub-telomeric arrays that are comprised of pseudogenes (which were not included on these microarrays) and atypical
VSG genes, which encode proteins that are neither clearly pseudogenes nor clearly functional [
11]. The pseudogenes provide the fuel for generating novel
VSG genes by mosaic gene conversion during antigenic variation, particularly later in infection [
32]. The VSG-related
VR genes are located not in the telomeric ESs or sub-telomeric arrays, but rather typically reside in chromosome-internal strand-switch regions and lack the 70-bp repeats typically found upstream of
VSG genes [
11,
32]. The telomeric ESs and sub-telomeric
VSG arrays also contain hundreds of
ESAGs, many of which are pseudogenes. However, a number of genes related to
ESAGs (
GRESAGs) have chromosomal-internal location (the nomenclature discriminating
ESAGs and
GRESAGs was not consistently applied as genes were named).
The microarray design used in this study, contained probes for 74
VSGs, 70 atypical
VSGs, and 46
VSGs that were unclassified on VSGdb [
33]; 21 sub-telomeric
ESAGs, 104 chromosome-internal
ESAGs and
GRESAGs, as well as 17
ESAGs from three
T. brucei strain 427 ESs (no
T. brucei strain 927 ESs have been annotated to date). This
VSG and
ESAG subset of genes was represented by a total of 357 probe-sets. Even though individual parasites express only one ES (containing a single
VSG and ~10
ESAGs) at a time, since the parasites have been maintained without regard for antigenic type, we expected that there would be diverse set of
VSG genes showing some expression at the population level. In addition, we expected that expression of these
VSGs and
ESAGs would vary between biological replicates, and indeed, a subset of
VSG and
ESAGs showed considerable variation in BF, but not PF (Figure ), probably reflecting antigenic variation within these populations. Thus, subsequent analyses were carried out on the 15 individual samples rather than on the mean of the biological conditions (see Additional file
8 for gene level data).
Hierarchical clustering of the 357 probe-sets (after log2-transformation of the normalized expression values) allowed us to define four distinct patterns of VSG gene and ESAG expression (marked A-D in Figure ). Interestingly, the distribution of VSG genes and ESAGs from different genomic locations within each group differed markedly (see Figure and ). Group A contained a large number (137) of VSGs not expressed in any sample, or only at low levels in some BF samples, exemplified by gene 1 in Figure . All these genes were located within sub-telomeric clusters and were likely not transcribed at any stages, except when translocated to the active expression site in small sub-populations of BFs. This group also included five ESAGs from T. brucei 427 ESs that presumably either reside in inactive expression sites or are not present in T. brucei 927.
A second group (B) contained 34 VSG genes and 54 ESAGs, which were expressed at substantially higher (but still relatively moderate) levels in BF and generally low levels in PF. Many of these showed variable expression levels in different biological replicates of the BF samples, indicative of expression from active ESs in sub-populations of BF. This group contained VSG and VR genes from sub-telomeric clusters (genes 2 and 3, in Figure ), as well as from chromosomal-internal locations (mostly VRs, e.g. gene 4). It also contained ESAGs and GRESAGs from the 427 ES, sub-telomeric clusters and chromosomal-internal loci. Of particular interest are several ESAG9 genes that are up-regulated only in stumpy BF (as discussed above). While this group of genes has many of the hallmarks of canonical VSG/ESAG expression from ESs, it should be noted that in many cases their signal levels in PF were substantially above background; suggesting that the genes are actively transcribed in PF, but the mRNAs are less stable than in BF.
Unexpectedly, a group (C) of 34 sub-telomeric
VSG and nine
ESAG genes showed variable expression levels in both BF and PF. The function of these
VSG genes is unclear, since they appear to encode both typical and atypical VSGs. In particular, one group of four tandemly-linked
VSG genes from an allele-specific region of chromosome 5 showed highest expression in PF-log cells (see gene 5 in Figure ). Group C contains both sub-telomeric and internal genes encoding ESAGs 2, 3, 5, and 11. Interestingly, several
ESAG11-related genes and a
VR are located in a tandem array on chromosome 4, where they are interspersed with genes encoding hypothetical proteins. Since the hypothetical proteins show very similar expression patterns to the adjacent
ESAG11-related genes, at least some may simply represent 3' UTRs of the neighboring genes. Interestingly, this cluster is located between rRNA and tRNA gene clusters, and would not be expected to be transcribed, since it appears to lack the modified chromatin found at typical RNA polymerase II transcription initiation sites [
15,
16]. The signal levels for all these genes is modest (<3000), and are lowest in stationary phase, suggesting that they may merely represent increased "background" transcription due their proximity to the actively transcribed RNA genes. However, this does not rule out functionality of this set of putative genes.
The final group (D) of
VSG and
ESAG genes were expressed in all life cycle stages. All nine
VSG/
VR genes in this group are located in chromosomal-internal loci: four are annotated as
VRs, four are atypical
VSGs and one is uncategorized. mRNA for several of the
VR genes has previously been detected in PF using PCR [
32]. One of the
VRs shows highest expression in PF-stat (gene 7 in Figure ). Three of the atypical
VSG genes show similar expression in all stages (e.g., gene 6 in Figure ). These genes, Tb927.4.5400, Tb927.4.5420 and Tb927.4.5430, along with a 4
th gene identical to Tb927.4.5420, are tandemly-linked to form a small cluster just 5' (and on the opposite strand) to the sub-telomeric
VSG cluster at the "right" end of chromosome 4. Interestingly, this cluster of genes is immediately downstream of a convergent strand-switch region that appears to contain an RNA polymerase transcription initiation site in both BF and PF [
16]. A large number of
ESAGs are also expressed in all life cycle stages; of these most are chromosomal-internal
GRESAG4 genes that have been shown previously to be expressed in PF [
34]. However, two
ESAG4s and an
ESAG7 from the 427 ES show this expression pattern, as do six sub-telomeric
ESAGs (two encode ESAG3, two ESAG5, one ESAG4 and one ESAG9-like). The functional significance of their expression in PF is unknown.
Of the 215
VSG genes examined, 43 showed expression levels above the bottom quartile (~4500) of all genes in at least one BF sample. These included 10 classified as encoding functional VSGs, and eight that were unclassified, but also included eight encoding atypical VSGs and 17
VRs. From these data it is apparent that at least some atypical
VSGs are expressed and hence likely to be functional. Indeed a query on GeneDB for
VSGs annotated as being detected in proteomic analysis of BF [
35] yielded seven genes, two of which are atypical VSGs. Only nine of the 43
VSG genes noted above were expressed below the 5
th percentile (~1300) in PF. These included five encoding typical VSGs (and two that were uncharacterized), but one gene encoded an atypical VSG, and one
VR gene also had this expression pattern. Thus, these data suggest that the functional diversity of
VSG and
VR genes is likely more complex that currently appreciated.