Pneumocystis carinii is a fungal microbe that is found in the lungs of laboratory rats [
1-
6].
P. carinii appears to be specific to rats because it is not found in other species of mammals and fails to establish itself when introduced into immunodeficient mice [
7], which have their own species of Pneumocystis, called
P. murina [
8].
P. carinii is morphologically and phylogenetically closely related to
P. murina, both of which are somewhat less closely related to the human pathogen,
Pneumocystis jirovecii, which causes Pneumocystis pneumonia in individuals with impaired immune function, such as patients suffering from Acquired Immunodeficiency Syndrome (AIDS) [
3,
9-
15].
P. carinii and
P. murina can cause pneumonia in their hosts, rats and mice, respectively, if these host animals lack a robust immune system [
16-
19].
While
P. carinii can cause disease in the absence of a normal immune system, rats that lack such a system are probably not its normal ecological niche. It has been established that
P. carinii organisms can persist for months in rats that are immunologically normal [
20]. Normal laboratory rats are often colonized by
P. carinii and show no obvious ill effects [
5,
6]. Likewise,
P. murina appears to be able to inhabit normal mice [
16,
17,
21-
23]. By analogy,
P. jirovecii would be expected to make its home in normal humans, and data showing colonization of healthy people by
P. jirovecii are accumulating [
24-
33].
None of the species of
Pneumocystis that have been studied have been observed to proliferate much outside of the airway of the mammalian host in which they are found, and Pneumocystis DNA is very scarce in environments apart from mammals [
34-
38]. Thus,
Pneumocystis species exhibit three features suggesting that they are obligate parasites of mammals: 1) they are extremely scarce outside of the mammalian lung; 2) they have fastidious growth requirements; and 3) they can colonize immunocompetent hosts.
Parasites employ various methods to survive in the face of host defenses. One such method is programmed antigenic variation, which allows a population of parasites to quickly produce an organism whose surface differs from that of the others in the population. The VSG antigenic variation system in the protozoan parasite
Trypanosoma brucei illustrates how gene families can be used to create phenotypic diversity within a population of eukaryotic parasites [
39-
46]. There are thousands of different VSG genes in the
T. brucei genome [
47]. These genes tend to be clustered together near telomeres. Only one VSG gene is transcribed in a given cell. The gene that is expressed changes frequently enough to make it probable that the host immune response, which is directed against the version of VSG present on the majority of parasites, does not destroy all of the parasites in the host. Changing which gene is expressed in
T. brucei is often accomplished via DNA recombination, which alters an expressed VSG gene by replacing some or all of its DNA with DNA from a silent VSG gene [
46,
48,
49].
The
P. carinii MSG (Major Surface Glycoprotein) gene family is much smaller than the
T. brucei VSG gene family, but exhibits structural and functional features similar to it. The
P. carinii genome contains approximately 80 MSG genes, which are located at the ends of each of 17 chromosomes [
50-
55]. Pairwise comparisons of eleven complete
P. carinii MSG genes showed that they are between 5% and 19% divergent, but share a number of features, including a length of approximately 3 kb, a lack of introns, and the presence of an invariant 5' sequence element called the CRJE, which is discussed further below [
55]. Other short invariant sequence elements reside at multiple locations within the bodies of the 11 fully sequenced MSG genes, which tend to be least variable at their 3' ends. Most of the 11 genes have been shown to be members of gene clusters containing up to 3 MSG genes. The genes within a cluster are as different from one another as they are from genes in different clusters, suggesting that selection and/or recombination has driven rapid diversification of
P. carinii MSG genes [
55].
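The pairwise divergence figures quoted above can be illustrated with a minimal sketch. It assumes the sequences have already been aligned to a common length, treats divergence as the percentage of mismatched ungapped columns, and uses toy sequences and hypothetical names (`msg_a`, `msg_b`, `msg_c`); the cited study may have computed divergence differently.

```python
from __future__ import annotations
from itertools import combinations


def percent_divergence(a: str, b: str) -> float:
    """Percent of mismatched columns between two pre-aligned sequences.

    Columns where either sequence has a gap ('-') are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = mismatches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x != y:
            mismatches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * mismatches / compared


def pairwise_divergence(seqs: dict[str, str]) -> dict[tuple[str, str], float]:
    """Divergence for every unordered pair of named sequences."""
    return {
        (n1, n2): percent_divergence(seqs[n1], seqs[n2])
        for n1, n2 in combinations(sorted(seqs), 2)
    }


# Toy aligned sequences (hypothetical, not real MSG data).
toy = {
    "msg_a": "ATGGCACGTA",
    "msg_b": "ATGGCTCGTA",
    "msg_c": "ATGACTCG-A",
}
result = pairwise_divergence(toy)
for pair, value in result.items():
    print(pair, round(value, 1))
```

Comparing within-cluster pairs against between-cluster pairs with a function like this is one way to check whether clustered genes are any more similar to each other than to genes elsewhere.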
MSG genes have been described in five other species of Pneumocystis, including the three that have received a species name other than "carinii":
P. murina (found in the laboratory mouse) [
56],
P. wakefieldiae (found in the laboratory rat) [
57] and
P. jirovecii (found in human beings) [
58,
59]. MSG sequences have also been reported from two additional presumptive Pneumocystis species (one from ferrets and one from a macaque) that do not yet have their own species name [
60,
61].
Studies on restriction enzyme fragment length polymorphism have shown that there is considerable variation in the MSG gene families present in
P. jirovecii organisms found in different human beings [
59]. These findings are consistent with the idea that MSG genes evolve rapidly. Compared to
P. jirovecii MSG genes, neither
P. carinii nor
P. murina MSG gene families exhibited much variation when studied by restriction enzyme analysis [
59]. Nevertheless, it is possible that MSG gene families are evolving relatively quickly in each species of
Pneumocystis, and that the more limited MSG diversity seen in
P. carinii and
P. murina reflects the isolation of laboratory rodents, a practice that would be expected to limit exposure to the populations of
Pneumocystis that live in wild rodents. While housing rodents in vivaria keeps exogenous microbes out, it would also tend to trap any endogenous parasites. It is common to find
P. carinii at low levels in laboratory rats that have not been deliberately exposed to the fungus, indicating that a particular population of
P. carinii can propagate within colonies of laboratory rats [
5]. Therefore, the reason that
P. carinii found in laboratory rats tend to be relatively genetically uniform may be that these microbes descended from those that were captured along with the rats that were used to establish laboratory colonies. By contrast, human beings would be expected to encounter multiple wild strains of
P. jirovecii.
Expression of MSG gene families has been studied primarily in
P. carinii, where several lines of evidence indicate that a single MSG gene is transcribed in a given
P. carinii genome at a given time. Restricted transcription is accomplished via a cis-dependent mechanism that involves a unique telomeric site in the genome called the expression site. Only the MSG gene adjacent to the expression site is represented by messenger RNA [
52,
53,
62,
63]. The MSG protein on the surface of
P. carinii organisms has been shown to vary and to be encoded by the MSG gene that is at the expression site [
64,
65]. The expression site contains the UCS (Upstream Conserved Sequence), a sequence found at the beginning of messenger RNAs encoding diverse MSG proteins [
62,
63] (Figure ). Immediately adjacent to and downstream of the UCS, there is a short sequence, called the CRJE, which is conserved among all MSG genes, by definition [
52,
62,
66].
CRJE stands for "Conserved Recombination Junction Element" because it could be involved in recombination events that cause the MSG gene at the expression site to change [
62]. The CRJE is present both at the expression site, at the junction between the UCS and the expressed MSG gene, and in MSG genes that are not attached to the expression site (donor MSG genes) (Figure ). The location and conservation of the CRJE suggest that it could function as a target of a site-specific event, such as a double-stranded break, which would be expected to increase recombination between the expression site and a donor MSG gene. However, there is no direct evidence for such events, and the role of the CRJE in recombination is still a matter of speculation. Nevertheless, the CRJE serves to identify MSG genes, which otherwise resemble MSR genes, a
P. carinii gene family that is not regulated by the MSG expression site [
66-
68].
A large variety of different MSG sequences have been observed at the expression site, indicating that recombination can install DNA from various silent donor MSG genes at this locus [
52,
53,
69]. The types and frequencies of the inferred recombination events are not clear, for two reasons. First, the fastidiousness of
P. carinii has prevented experiments
in vitro. Second, laboratory rats tend to be colonized by
P. carinii already, which limits the utility of experiments that seek to observe phenotypic or genotypic switching by introducing into a rat a small population of
P. carinii expressing a known MSG gene [
69]. An alternative approach to understanding the MSG system would be to acquire a better understanding of the gene family. If all of the genes in the family are identified, it may be possible to infer how changes are produced at the expression site. For example, if recombination completely replaces the MSG gene at the expression site with an MSG gene from a donor site, then the sequences found at the expression site will match those in the donor gene database. However, if recombination were to alter a segment of the MSG gene that is at the expression site, then there will be sequences linked to the UCS that do not exactly match any of the donor genes.
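The inference described above can be sketched as a small classifier. This is a simplified illustration, not the method of the study: it assumes all sequences are the same length with no insertions or deletions, considers only a single crossover between two donors, and uses hypothetical names (`d1`, `d2`) and toy sequences.

```python
from __future__ import annotations


def common_prefix_len(a: str, b: str) -> int:
    """Number of leading positions at which a and b agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def classify_expressed(expressed: str, donors: dict[str, str]):
    """Classify a sequence found linked to the UCS against known donor genes.

    Returns ("exact", name) for a full match to one donor (complete replacement),
    ("mosaic", (left, right)) if a prefix of one donor joined to a suffix of
    another reproduces the sequence (segmental replacement, single crossover),
    or ("novel", None) if neither explanation fits.
    """
    for name, seq in donors.items():
        if seq == expressed:
            return ("exact", name)

    length = len(expressed)
    same_len = {n: s for n, s in donors.items() if len(s) == length}
    prefix = {n: common_prefix_len(expressed, s) for n, s in same_len.items()}
    suffix = {n: common_prefix_len(expressed[::-1], s[::-1]) for n, s in same_len.items()}

    for left, p in prefix.items():
        for right, s in suffix.items():
            # The matching prefix and suffix must together cover the whole sequence.
            if left != right and p + s >= length:
                return ("mosaic", (left, right))
    return ("novel", None)


# Toy donor database (hypothetical sequences).
donors = {"d1": "AAAAAAAAAA", "d2": "CCCCCCCCCC"}
print(classify_expressed("AAAAAAAAAA", donors))
print(classify_expressed("AAAAACCCCC", donors))
print(classify_expressed("AAAAGCCCCC", donors))
```

A preponderance of "exact" results would favor complete replacement, while "mosaic" results would point to recombination that alters only a segment of the expressed gene. Real data would also require tolerance for sequencing errors, which this sketch omits.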
Understanding the MSG gene family at the sequence level will also aid in assessing the function of this family. If its function is to confer variability, then MSG genes will have evolved under the influence of selection for variation in the proteins they encode (positive selection), a prediction that can be tested by sequence analysis.
Sequence data pertaining to the
P. carinii MSG gene family are available, but the vast majority of the available data have not been analyzed [
70]. Analysis of these data is challenging for several reasons. First, most of the data are in the form of shotgun reads which require assembly. However, standard assembly procedures are not designed to assemble genes from gene families, and might join reads that came from different genes. A second possible complication stems from the doubtful clonality of
P. carinii populations. The organisms used to obtain genome sequence data came from the lungs of immunosuppressed rats that had been infected by constant exposure to other infected rats [
63]. This system of obtaining infected rats has been in operation for decades. Hence, more than one genetic strain of
P. carinii could have contributed to the DNA used to obtain genomic sequences, and a given MSG gene could be represented by more than one allele. Alleles are defined as different versions of the sequence located at a particular location on a chromosome (i.e., at a genomic locus). In the absence of gene flow between two populations that were genetically identical at separation, mutation will cause allelic polymorphism to arise over time. The formation of allelic polymorphism would be accelerated if selection were to favor cells that sustained mutation, as could be the case for MSG genes given their probable role in generating phenotypic variation.
P. carinii cells are thought to be haploid, and a given haploid cell can contain only one allele at each locus. Nevertheless, two strains might contain two different alleles of a particular gene at a particular locus. Therefore, if more than one genetic strain of
P. carinii contributed to the sequence data, then a given MSG gene could be represented by more than one allele. Assembly programs would tend to amalgamate alleles into a single consensus sequence, thereby obscuring an important aspect of the sequence data.
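A toy example may make the concern about consensus sequences concrete. The sketch below assumes reads that are already aligned and of equal length, and uses a simple majority-rule consensus as a stand-in for what an assembler does; real assemblers are far more elaborate, and the reads and the `min_count` threshold are hypothetical.

```python
from __future__ import annotations
from collections import Counter


def majority_consensus(reads: list[str]) -> str:
    """Most frequent base at each aligned column."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*reads))


def polymorphic_columns(reads: list[str], min_count: int = 2) -> list[int]:
    """Columns where at least two different bases are each seen min_count times.

    Requiring min_count > 1 is a crude guard against counting isolated
    sequencing errors as alleles.
    """
    sites = []
    for i, col in enumerate(zip(*reads)):
        supported = [base for base, n in Counter(col).items() if n >= min_count]
        if len(supported) > 1:
            sites.append(i)
    return sites


# Three reads from one hypothetical allele and two from another.
reads = ["ACGTAC"] * 3 + ["ACCTAC"] * 2
consensus = majority_consensus(reads)
sites = polymorphic_columns(reads)
print(consensus, sites)
```

The consensus reports only the majority allele, while the per-column counts retain the evidence that two versions of the locus are present.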
A third problem is posed by the presence of MSR genes in the P. carinii genome. MSR and MSG genes are distinct, but closely related, and analysis of sequence data must be performed in a way that excludes sequence reads from MSR genes.
In the studies described herein, the first 300 basepairs of MSG genes were selected for analysis. Although MSG genes are more than 3000 basepairs long, analyzing the first 300 bp of MSG genes offered two important practical advantages. First, this segment of an MSG gene is specifically amplifiable using the CRJE as a primer-binding site (Figure ). Hence, this approach avoids interference from MSR genes, which lack the CRJE. Second, the 300 bp amplicons are smaller than the average sequence read available in the largest database, that of the Pneumocystis genome project. Therefore, it seemed probable that sequence reads spanning the entire 300 bp would be numerous in the database, in which case it would be possible to cover the whole family without having to rely on assembly of contigs, which is problematic when dealing with gene families. Practical advantages aside, the 5' ends of MSG genes are of interest because recombination events that move DNA from donor genes to the gene at the expression site may be frequent in this region. Defining the full repertoire of donor MSG genes should allow this hypothesis to be tested in the future.
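The read-selection idea can be sketched as follows. This is an illustration under stated assumptions: the anchor string is a made-up placeholder and not the actual CRJE sequence, the window is assumed to begin immediately after the anchor, only the forward strand is searched, and matching is exact.

```python
from __future__ import annotations


def extract_downstream_windows(reads: list[str], anchor: str, window: int = 300) -> list[str]:
    """Keep reads that contain the anchor and span a full window downstream of it.

    Reads lacking the anchor (as MSR-derived reads would lack the CRJE) or
    ending before the window is complete are discarded, so no assembly is needed.
    """
    kept = []
    for read in reads:
        pos = read.find(anchor)
        if pos == -1:
            continue  # no anchor: not usable as an MSG 5' end
        start = pos + len(anchor)
        segment = read[start:start + window]
        if len(segment) == window:
            kept.append(segment)
    return kept


# Toy demonstration with a placeholder anchor and a 5-base window.
PLACEHOLDER_ANCHOR = "GATTACA"  # hypothetical, NOT the real CRJE sequence
toy_reads = [
    "TTGATTACAACGTACC",  # anchor present, full window available
    "GATTACAAC",         # anchor present, window truncated
    "CCCCCCCC",          # anchor absent
]
windows = extract_downstream_windows(toy_reads, PLACEHOLDER_ANCHOR, window=5)
print(windows)
```

In practice one would also search the reverse complement of each read and allow for mismatches in the anchor, both of which are left out here for brevity.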