The complete genomic sequence of an intracellular bacterial pathogen, Mycoplasma penetrans HF-2 strain, was determined. The HF-2 genome consists of a 1 358 633 bp single circular chromosome containing 1038 predicted coding sequences (CDSs), one set of rRNA genes and 30 tRNA genes. Among the 1038 CDSs, 264 predicted proteins are common to the Mycoplasmataceae sequenced thus far and 463 are M.penetrans specific. The genome contains the two-component system but lacks the essential cellular gene, uridine kinase. The relatively large genome of M.penetrans HF-2 among mycoplasma species may be accounted for by both its rich core proteome and the presence of a number of paralog families corresponding to 25.4% of all CDSs. The largest paralog family is the p35 family, which encodes surface lipoproteins including the major antigen, P35. A total of 44 genes for p35 and p35 homologs were identified and 30 of them form one large cluster in the chromosome. The genetic tree of p35 paralogs suggests the occurrence of dynamic chromosomal rearrangement in paralog formation during evolution. Thus, M.penetrans HF-2 may have acquired diverse repertoires of antigenic variation-related genes to allow its persistent infection in humans.
The family Mycoplasmataceae, which belongs to the order Mycoplasmatales under the class Mollicutes, contains the genera Mycoplasma and Ureaplasma. It is thought that the Mollicutes, which possess a notably small genome, may have evolved from a common ancestral Gram-positive bacterium with low G + C content like the Lactobacillus group, including Bacillus subtilis, by losing a considerable region of the genome (1–3).
All species of Mycoplasmataceae are obligate parasites and host specificity is quite strict for hosts including humans, rodents, artiodactyla, perissodactyla and birds. Mycoplasmal infections are most frequently associated with disease in the urogenital or respiratory tracts and, in most cases, mycoplasmas infect the host persistently. Like other parasites, many mycoplasma species display antigenic diversity, which has been noted in a variety of protein profiles (2) or colony immunoblotting (4). Antigenic variation is considered to be a strategy for persistence in the face of immune responses by the host.
Mycoplasma penetrans, a species of Mycoplasmataceae, infects humans in the urogenital and respiratory tracts. A typical feature of M.penetrans is penetration into human cells. Internalization of the organisms into the urothelium was detected in autopsy samples from an acquired immunodeficiency syndrome (AIDS) patient (5). Intracellular replication and persistence for at least 6 months has been observed in cultured cells (6). Long-term persistence of M.penetrans in patients has also been documented on the basis of its isolation from urine specimens collected at different times over the period of 1 year from the same children with human immunodeficiency virus (HIV) infection (7). In human disease, M.penetrans is associated mainly with HIV-1 infection, particularly in adults among the homosexual population in Europe and the USA, and among both homosexual and heterosexual males in South America. Anti-M.penetrans antibodies are found even in asymptomatic HIV carriers and in patients in the process of developing AIDS (5,8–14). Both the rapid decline in CD4-positive lymphocyte counts in M.penetrans-seropositive HIV-infected individuals (15) and the mitogenic effects of this organism on lymphocytes (16) imply a contribution of persistent M.penetrans infection to the deterioration of the immune system in HIV infection. On the other hand, M.penetrans infection has also been suggested to be a primary cause of human disease in non-HIV-related urethritis and respiratory disease (8,17). The M.penetrans HF-2 strain was isolated from a previously healthy HIV-negative patient suffering from severe respiratory distress caused by M.penetrans infection-associated systemic disease (17).
The whole genome sequences of Mycoplasma pneumoniae, Mycoplasma genitalium, Mycoplasma pulmonis and Ureaplasma urealyticum have been reported previously and their sizes range from 0.57 to 0.95 Mb (18–21). In contrast, the size of the M.penetrans genome was estimated to be ~1.3 Mb (unpublished data), suggesting that this organism may possess additional genetic information involved in its unique infection process. Therefore, we determined the complete nucleotide sequence of the M.penetrans HF-2 strain. Analysis of the genomic sequence revealed that M.penetrans possesses a large number of paralogous gene repertoires and chromosomal structures allowing for antigenic variation, and hence persistent infection, in human hosts. Here, we present the complete genomic sequence of M.penetrans strain HF-2, together with computational analysis of the genome.
The M.penetrans HF-2 strain was kindly provided by Dr Antonio Yáñez, Centro de Investigación Biomédica de Oriente-IMSS, Puebla City, Mexico. The HF-2 strain was isolated from the tracheal aspirate of a patient with primary M.penetrans infection. The sixth passage of organisms after primary isolation from the patient was used as the source of DNA for sequencing.
Genomic DNA was extracted from the M.penetrans HF-2 strain cultivated in PPLO broth (Becton Dickinson Microbiology Systems) supplemented with 10% (v/v) heat-inactivated horse serum (GIBCO/BRL), 0.5% glucose and 100 U/ml penicillin G (Meiji Seika Kaisha, Japan). After lysis of harvested mycoplasma cells, by incubating at 50°C for 2 h in a lysis buffer (50 mM Tris, 20 mM EDTA, pH 9.0, containing 0.1% SDS and 100 µg/ml of proteinase K), the lysate was extracted with phenol–chloroform and the DNA was precipitated with ethanol. The DNA was treated with RNaseA (Boehringer Mannheim Biochemicals) and purified by phenol–chloroform extraction and ethanol precipitation.
For the construction of shotgun DNA libraries, genomic DNA was sheared mechanically using Hydro-Shear (Gene Machines). The sheared DNA was blunt-ended, ligated into the SmaI site of the pUC18 plasmid and introduced into the Escherichia coli DH5α strain. Short (1–2 kb) and long (4–5 kb) insert libraries were constructed. Template DNA was prepared by PCR amplification of the insert DNA from colonies (22) and sequenced using the DYEnamic ET Dye Terminator Cycle Sequencing Kit and MegaBACE1000 (Amersham Biosciences).
We obtained 26 449 sequence reads from 1–2 kb insert templates and 9156 double-ended sequence reads from 4–5 kb insert templates. The data were assembled and edited using the Phred/Phrap/Consed package of base-calling, sequence-assembly and automated finishing software (University of Washington) (23). The assembly yielded data with an estimated redundancy of 11.73 and 35 gaps. The 35 gaps were resolved by sequencing PCR products amplified with appropriate primers. The DNA sequences covering gaps were assembled together with all other data to obtain a circular consensus sequence. For handling the sequence data, several shell and Perl scripts were developed. The assembled data were validated against the pattern of pulsed-field gel electrophoresis using BamHI and BglI.
Putative coding sequences (CDSs) were identified using the GLIMMER program (24), which was trained with CDSs of M.genitalium, M.pneumoniae and U.urealyticum in which UGA was allowed to be used as tryptophan. By using the program FramePlot (25), which was optimized for the mycoplasma genome, some predicted CDSs were excluded or modified because they either overlapped with other CDSs or had some additional sequence upstream of the predicted start codon. Homology for the amino acid sequences deduced from each CDS was checked using the program BLASTP (26) and the nr database from the National Center for Biotechnology Information (NCBI). The programs HmmerPfam (GCG Wisconsin package, Accelrys Inc.) with the Pfam motif database (27), TopPredII (28) and MacStripe (29) were used for the prediction of motifs, transmembrane domains and the membrane topology of proteins. All CDSs were classified into modified COGs (clusters of orthologous groups of proteins, NCBI) categories. For analyzing metabolic pathways, the Kyoto Encyclopedia of Genes and Genomes (KEGG) database was used. tRNA genes were detected by the tRNAscan-SE program (30).
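The practical effect of allowing UGA to encode tryptophan during CDS prediction can be illustrated with a minimal sketch. The function names and the toy sequence below are illustrative assumptions and are not part of the GLIMMER or FramePlot pipelines used in this study; the code simply translates a reading frame with the standard genetic code, optionally reassigning UGA (TGA in DNA) from stop to tryptophan.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def build_codon_table(uga_as_trp=True):
    """Return a codon -> amino acid map; optionally read TGA (UGA) as Trp."""
    codons = ("".join(c) for c in product(BASES, repeat=3))
    table = dict(zip(codons, STANDARD_AAS))
    if uga_as_trp:
        table["TGA"] = "W"  # reassignment assumed for mycoplasma CDSs
    return table


def translate(dna, table):
    """Translate frame 0 of a DNA string until the first stop codon."""
    dna = dna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = table.get(dna[i:i + 3], "X")  # X marks ambiguous codons
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Toy reading frame (invented for illustration): ATG TGA AAA TAA
toy = "ATGTGAAAATAA"
print(translate(toy, build_codon_table(uga_as_trp=True)))
print(translate(toy, build_codon_table(uga_as_trp=False)))
```

With the reassignment the toy frame is read through the internal TGA, whereas under the standard code translation stops there, which is why a gene finder trained without this rule would truncate many mycoplasma CDSs.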
The sequence data have been submitted to the EMBL/GenBank/DDBJ database under accession numbers AP004170–AP004174.
More detailed information on the M.penetrans genome is available at our web site (http://www.nih.go.jp/Mypet/).
The size of the M.penetrans genome, 1 358 633 bp, is larger than those of the four Mycoplasmataceae species, M.genitalium, M.pneumoniae, M.pulmonis and U.urealyticum, sequenced so far (18–20,31). The average G + C content is 25.7%, which is close to the 25.5% of the U.urealyticum genome, the lowest G + C content of all bacterial species published thus far. Mycoplasma penetrans has 1038 predicted CDSs, one set of 5S-23S-16S rRNA genes (MYPE20000–MYPE20020) and 30 tRNA genes (Fig. 1).
In M.penetrans, dnaA (MYPE10), dnaN (MYPE20), gyrB (MYPE30) and gyrA (MYPE40) are closely linked. The dnaA box, the specific recognition site for DnaA binding that is often present near the replication origin of most bacterial genomes, was not detected near the dnaA gene, as has also been demonstrated in M.genitalium, M.pneumoniae and U.urealyticum. On the other hand, a poly(C) sequence was found from position 696 255 to 696 269 in the genome. Both the G + C skew analysis, which is a method for predicting replication origins, and the transcriptional direction in the genome revealed that two inversion points are clearly located near the dnaA and the poly(C) sequences. These data suggest that the replication origin is located at the inversion point near dnaA, and that the poly(C) sequence near the other inversion point could be the replication terminator (Fig. 1).
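As a rough illustration of how skew inversion points can be located, the following sketch computes the windowed G + C skew, (G − C)/(G + C), and reports the windows at which the cumulative skew reaches its minimum and maximum. The function names, window size and toy sequence are assumptions made for this example, and the code is not the analysis program used in this study. In many bacterial genomes the two extremes of the cumulative skew fall near the replication origin and terminus, but which extreme corresponds to which depends on the strand and coordinate origin, so the output should be read only as candidate inversion points.

```python
def gc_skew_windows(seq, window=1000):
    """Return (G - C) / (G + C) for consecutive non-overlapping windows."""
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        skews.append((g - c) / (g + c) if g + c else 0.0)
    return skews


def cumulative_skew_extremes(skews):
    """Return the window indices of the cumulative skew minimum and maximum."""
    total = 0.0
    cumulative = []
    for s in skews:
        total += s
        cumulative.append(total)
    indices = range(len(cumulative))
    min_idx = min(indices, key=cumulative.__getitem__)
    max_idx = max(indices, key=cumulative.__getitem__)
    return min_idx, max_idx


# Toy sequence (invented): a C-rich half followed by a G-rich half.
toy = "ACCT" * 5 + "AGGT" * 5
skews = gc_skew_windows(toy, window=4)
print(cumulative_skew_extremes(skews))
```

For a real genome the window would typically be in the kilobase range, and the window indices would be multiplied by the window size to obtain approximate chromosomal coordinates.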
For ortholog identification, reciprocal best-hit pairs of genes were identified by pairwise BLASTP comparisons of the predicted M.penetrans proteins with those in M.genitalium, M.pneumoniae, M.pulmonis, U.urealyticum and B.subtilis. The number of reciprocal best-hit genes was 586, 433, 400, 390 and 383 for B.subtilis, M.pneumoniae, M.pulmonis, U.urealyticum and M.genitalium, respectively. Of the predicted proteins, 264 were found to be common to Mycoplasmataceae including house-keeping products such as DnaA, DNA polymerase III, DNA topoisomerase, ribosomal proteins and ATP synthase. Conversely, there were 463 M.penetrans-specific proteins compared with the four Mycoplasmataceae species, and 311 when compared with B.subtilis plus the four Mycoplasmataceae species. In addition, the size of the core proteome, which is defined as the number of functionally distinct protein families in the theoretical proteome encoded in the genome (32,33), was 847 in M.penetrans, considerably greater than in the other four Mycoplasmataceae species, where 469, 590, 721 and 580 proteins were represented in M.genitalium, M.pneumoniae, M.pulmonis and U.urealyticum, respectively. From these data, the core proteome in M.penetrans is clearly richer than that of the other four Mycoplasmataceae species. The M.penetrans-specific proteins in Mycoplasmataceae included the putative two-component response regulators (MYPE3960), possible sensory transduction histidine kinase (MYPE2360), deoxyuridine 5′-triphosphate nucleotidohydrolase (dUTPase, MYPE1000), guanosine 5′-monophosphate (GMP) synthase (MYPE1220), d-lactate dehydrogenase (D-LDH, MYPE2650), aldehyde dehydrogenase (MYPE3920, MYPE4110), anaerobic ribonucleoside-triphosphate reductase (MYPE4960), adenylosuccinate synthetase (MYPE5830), pyruvate decarboxylase (MYPE6890), enzymes for the orotate-related pathway (MYPE7840–7900), aspartate transcarbamoylase (MYPE7890) and flavodoxin (MYPE9120).
Since no two-component signal response regulator has been found in the other four Mycoplasmataceae species sequenced (18,19,34), it should be noted that M.penetrans is the first mycoplasma predicted to possess the two-component signal system that is commonly present in other bacteria.
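The reciprocal best-hit criterion used above for ortholog identification can be sketched as follows. This is a minimal illustration under stated assumptions: the input is taken to be a list of (query, subject, bit score) tuples already parsed from pairwise BLASTP output, the function names are hypothetical, and the protein identifiers in the usage example are invented placeholders rather than real locus tags.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    hits: iterable of (query, subject, bit_score) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Invented placeholder identifiers: genome A proteins vs genome B proteins.
a_vs_b = [("A1", "B9", 800.0), ("A1", "B1", 50.0),
          ("A2", "B1", 400.0), ("A3", "B3", 900.0), ("A4", "B3", 300.0)]
b_vs_a = [("B9", "A1", 790.0), ("B1", "A2", 410.0), ("B3", "A3", 880.0)]
print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

In this toy input A4 has B3 as its best hit, but B3 points back to A3, so A4 is not counted as an ortholog; this one-to-one requirement is what keeps paralog families from inflating ortholog counts.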
Predicted products of M.penetrans CDSs were categorized according to function and compared with the other four Mycoplasmataceae species (Table 1 and Supplementary Material). The results revealed that 108 genes (10.4% of the CDSs) involved in the translation process, such as ribosomal proteins, and 23 genes (2.2% of the CDSs) involved in transcription are present in M.penetrans. The relatively high number of translation- or transcription-related genes in M.penetrans is similar to the other Mycoplasmataceae (99–102 genes for translation and 14–18 genes for transcription). Of the M.penetrans CDSs, 428 were not categorized in the COG database.
To evaluate evolutionary divergence of M.penetrans from the other species of Mycoplasmataceae, the orthologous genes in M.penetrans, M.pneumoniae, M.genitalium, M.pulmonis and U.urealyticum were plotted with respect to their location. The results showed no linearity (data not shown). Similar results have been reported from comparative analysis of the U.urealyticum genome with those of M.pneumoniae and M.genitalium (19). The results suggest that rapid chromosomal rearrangement may have occurred in the Mycoplasmataceae genome during evolution.
It has been proposed that 256 genes constitute the minimal gene set necessary and sufficient for sustaining a functioning cell. This estimate was based on a comparative analysis of the genomes of M.genitalium and Haemophilus influenzae, and the study of gene knockouts of M.genitalium (21,35,36). Upon screening for gene content, we found that M.penetrans lacks uridine kinase (udk), a member of the proposed minimal gene set. Uridine kinase is a phosphotransferase that mediates phosphorylation of both uridine and cytidine to produce uridine- and cytidine-monophosphate (UMP/CMP) (Fig. 2). Mycoplasma penetrans also lacks the 5′-nucleotidase that metabolizes both UMP and CMP to uridine and cytidine, respectively. A previous study indicated that uracil phosphoribosyltransferase (Upp) is utilized in the conversion of uracil to UMP in Mycoplasma mycoides (37). The M.penetrans genome contains the upp gene (MYPE10300), so Upp-dependent uracil conversion may be the main pathway for UMP production in M.penetrans. A second pathway, via orotate-related metabolism for converting carbamoyl-phosphate to UMP, was also found in M.penetrans. Mycoplasma penetrans contains a set of genes for enzymes involved in orotate-related metabolism as follows: aspartate carbamoyltransferase (pyrB, MYPE7890), dihydroorotase (pyrC, MYPE7880), dihydroorotate oxidase (pyrD, MYPE7860 and 7870), orotate phosphoribosyltransferase (pyrE, MYPE7840), orotidine-5′-phosphate decarboxylase (pyrF, MYPE7850) and the pyrimidine operon regulatory protein (pyrR, MYPE7900). Orotate-related metabolism is linked to the arginine dihydrolase pathway, which is an ATP synthesis system. Mycoplasma penetrans contains all the genes for enzymes involved in the arginine dihydrolase pathway, namely, arginine deaminase (MYPE6080), ornithine carbamoyltransferase (MYPE6090) and carbamate kinase (MYPE6100).
The four Mycoplasmataceae species analyzed previously lacked either uridine kinase or 5′-nucleotidase; however, M.penetrans lacks both enzymes but has an orotate-related pathway, suggesting that UMP may be converted from both uracil and carbamoyl phosphate in this species. It should be noted that M.penetrans is the only member of Mycoplasmataceae thus far to possess orotate-related metabolism linked to the arginine dihydrolase pathway and pyrimidine metabolism (Fig. 2).
Essential enzymes in other pathways are also missing for pyrimidine metabolism in M.penetrans. For example, the presence of pyrimidine-5′-nucleotide nucleosidase for converting cytosine to CMP remains an open question. As in other species of Mollicutes, M.penetrans also lacks nucleoside-diphosphate kinase (ndk) that converts uridine/cytidine diphosphate (UDP/CDP) to uridine/cytidine triphosphate (UTP/CTP). Non-orthologous gene displacement (38) of these essential enzymes in Mollicutes is speculated.
One of the remarkable features of the M.penetrans genome is the existence of a number of paralogous gene families. In a BLAST score-based single-linkage clustering search using BLASTCLUST, proteins with >30% amino acid identity over 70% of their length were defined as paralogs. Search results showed that 264 (25.4%) of the 1038 proteins formed 63 families, ranging from two to 44 proteins per family (Table 2). Under the same BLAST analysis conditions, the M.genitalium, M.pneumoniae, M.pulmonis and U.urealyticum genomes were found to contain 27 (5.5%), 132 (19.1%), 100 (12.7%) and 56 (8.1%) paralogs, respectively. The relatively large number of paralogs could be one of the reasons for the larger genomic size of M.penetrans compared with the other four Mycoplasmataceae species. These paralog families include P35 lipoprotein homologs, CDSs of functionally unknown proteins, the ABC transporter ATP-binding protein, oxidoreductase, predicted permease and putative lipoproteins.
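The single-linkage grouping of paralogs described above can be sketched with a small union-find structure. This is not a reimplementation of BLASTCLUST: the function name is hypothetical, the input is assumed to be pre-parsed pairwise alignments, and coverage is assumed here to be the aligned length divided by the length of the longer protein, which may differ from how BLASTCLUST measures it. All protein names and values in the usage example are invented.

```python
def single_linkage_families(proteins, alignments,
                            min_identity=30.0, min_coverage=0.70):
    """Group proteins into families by single-linkage clustering.

    proteins: dict mapping protein name -> length in residues.
    alignments: iterable of (a, b, percent_identity, aligned_length).
    Returns families with two or more members, largest first.
    """
    parent = {name: name for name in proteins}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, identity, aligned_len in alignments:
        if a == b:
            continue
        # Assumption: coverage is measured against the longer protein.
        coverage = aligned_len / max(proteins[a], proteins[b])
        if identity > min_identity and coverage >= min_coverage:
            parent[find(a)] = find(b)  # one qualifying link merges families

    families = {}
    for name in proteins:
        families.setdefault(find(name), []).append(name)
    multi = [sorted(members) for members in families.values()
             if len(members) > 1]
    return sorted(multi, key=lambda members: (-len(members), members))


# Invented toy data: five proteins and four pairwise alignments.
lengths = {"p1": 100, "p2": 100, "p3": 110, "p4": 200, "p5": 90}
pairs = [("p1", "p2", 45.0, 90), ("p2", "p3", 35.0, 80),
         ("p3", "p4", 60.0, 100), ("p4", "p5", 25.0, 85)]
families = single_linkage_families(lengths, pairs)
print(families)
print(sum(len(f) for f in families) / len(lengths))
```

Because linkage is single, p1 and p3 end up in the same family through p2 even though they were never aligned directly; the last line gives the fraction of proteins belonging to any family, the same kind of ratio as the 264/1038 reported above.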
The largest paralogous gene family in the M.penetrans genome is the p35 gene family. The P35 lipoprotein is a surface-exposed molecule that is immunodominant and is used as the major antigen for serological diagnosis of M.penetrans infection (9,14,39,40). P35 has also been recognized as a phase-variable protein that has on and off phases with a frequency of variation ranging from 1.5 × 10⁻² to 4 × 10⁻³ per cell per generation (40). In addition to the P35 protein, several different-sized lipid-associated membrane proteins are also phase variable and contribute to M.penetrans antigenic variation (39,41). In the M.penetrans genome, the gene for a unique bona fide P35 lipoprotein was detected and has been designated MYPE6810. A total of 44 CDSs homologous to p35, including the p35 gene itself, were identified in M.penetrans. Of these 44 CDSs, 38 were predicted from their sequence similarity to the p35 lipoprotein gene to encode lipoproteins. More precisely, these 38 CDSs share a highly conserved amino acid sequence at the N-terminus containing a cysteine residue that binds to a fatty acid chain which anchors to the lipid bilayer of the cell membrane; however, the remaining six CDSs lack this sequence and are designated as signal-peptide-less p35 gene homologs. The 44 CDSs are clustered at four different loci in the genome. Among the 38 signal-peptide-containing CDSs, four are at the first locus from 335 914 to 355 279, the largest cluster consisting of 30 CDSs is at the second locus from 831 015 to 881 313, and the remaining four CDSs are at the third locus from 966 342 to 975 013. The six signal-peptide-less p35 gene homologs reside at the fourth locus, from 918 826 to 932 427 (Fig. 1). The degree of identity of each CDS to the P35 lipoprotein ranged from 34 to 70% of the amino acid sequence.
The genetic tree for this p35 gene family indicated several points as follows: (i) 30 CDSs in the largest locus are closely related to each other, (ii) three CDSs (MYPE2690, 2700 and 7400) in the first and the third loci are also closely related to the 30 CDSs in the largest locus and (iii) six signal-peptide-less p35 gene homologs (MYPE7020–7070) are positioned at a great distance from p35 itself (MYPE6810), but are close to the remaining five p35 homologs (MYPE2620, 2630, 7330, 7370 and 7380) in the first and the third loci (the genetic tree is visible on our web site). In paralogous gene families other than p35 (i.e. MYPE2560 family with 11 CDSs and MYPE8480 family with 7 CDSs), there is no correlation between sequence similarity and gene cluster in the chromosome. These data imply that gene duplication may have occurred at separate positions in the chromosome and dynamic chromosomal rearrangement, including recombination, may be involved in paralog cluster formation during evolution.
One of the major antigenic epitopes in the M.penetrans P35 lipoprotein has been characterized previously and was shown to have the amino acid sequence, FTGEAYSVWSAK (42). The epitope is recognized in ~66% of M.penetrans-seropositive individuals and two of three macaques infected with M.penetrans. The present study identified this sequence only in the p35 lipoprotein gene, suggesting that p35 paralogs identified in the genome may contain antigenic epitopes different from the P35 lipoprotein. Consequently, M.penetrans cells with a variety of antigenic epitopes may be able to escape from host antibody response, thus allowing persistent infection in humans.
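A search of this kind, checking which predicted proteins contain the exact epitope sequence, can be sketched in a few lines. The function name is hypothetical and the protein sequences in the usage example are invented placeholders, not real M.penetrans sequences; only the epitope string is taken from the text. The sketch assumes exact matching, so near-identical variants of the epitope would not be reported.

```python
EPITOPE = "FTGEAYSVWSAK"  # epitope sequence quoted in the text (42)


def proteins_with_epitope(proteins, epitope=EPITOPE):
    """Return the sorted names of proteins containing the exact epitope.

    proteins: dict mapping protein name -> amino acid sequence.
    """
    epitope = epitope.upper()
    return sorted(name for name, seq in proteins.items()
                  if epitope in seq.upper())


# Invented placeholder sequences for illustration only.
toy_proteins = {
    "protA": "MKKLLAFTGEAYSVWSAKDNE",   # contains the epitope
    "protB": "MKKLLAFTGEAYSVWTAKDNE",   # one substitution, no exact match
    "protC": "MSTNPKPQRK",
}
print(proteins_with_epitope(toy_proteins))
```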
In the M.penetrans genome, at least 50 lipoproteins (4.8% of all CDSs), including the P35 lipoprotein homologs were predicted. This number is comparable to the 4.3–7.0% of lipoproteins observed in the other four Mycoplasmataceae species.
Since much of the data shown here suggested chromosomal rearrangement for paralog cluster formation in the M.penetrans genome, transposable elements as potential chromosome modulators were sought. Twenty-one putative transposases for insertion sequences (ISs) and five copies of IS232 ATP-binding proteins were found in the M.penetrans genome. Their types were distinguishable from each other and were categorized into four different groups as follows: orthologs of transposase for (i) IS1202 (1556 bp), (ii) IS232 (1412 bp), (iii) IS1630 (1016 bp) and (iv) other ISs including truncated forms. In all, ~24 591 bp of sequence corresponding to 1.8% of the genome was identified as transposases for ISs. Inactivation of paralogous gene families by IS insertion was found in CDSs of unknown function in paralog clusters from MYPE7110 to MYPE7140 and from MYPE9600 to MYPE9630. Two CDSs (MYPE2900 and MYPE8180) for integrase/recombinase are present in the genome, both of which are thought to be involved in chromosomal rearrangement. As observed in many bacterial genomes, there is the possibility that horizontal transfer is involved in the emergence of paralogous gene families including the p35 family in M.penetrans. However, p35 and other paralogous gene clusters in M.penetrans are associated with neither integrase nor the tRNA gene loci that are, in some cases, hallmarks for phage-based horizontal transfer. The G + C content of these paralogous loci is similar to that of the whole chromosome. These data imply no association of mobile elements such as phages with the emergence of these paralogs or, if any were involved, drastic rearrangement of the genome might have occurred after horizontal transfer.
Several factors related to bacteria–host cell interactions were identified in the M.penetrans genome. Like M.pneumoniae and M.genitalium, M.penetrans is flask shaped and attaches to host epithelial cells with its attachment organelle, the tip structure. High molecular weight (HMW) proteins locate to the tip structure and associate with cytadherence in M.pneumoniae (43–46). Mycoplasma penetrans contains CDSs (MYPE470 and MYPE1570) that have similarities to either the cytadherence accessory protein HMW1 or the HMW2 ortholog in M.genitalium, but neither of them is an HMW protein. The CDS (MYPE1550), encoding a large predicted protein of 364 kDa, flanked by MYPE1570, is also an ortholog of both HMW2 of M.pneumoniae and the rhoptry protein of Plasmodium yoelii, which is involved in attachment to and invasion of red blood cells (47). Predicted proteins for both MYPE1550 and MYPE1570 contain numerous coiled-coil structures that are also found in HMW2 (48). MYPE1550 also contains Pfam domain spectrin repeats that are found in several proteins involved in cytoskeletal structure, so it might be a cytoskeletal component. FtsZ is speculated to interact with the cytoskeleton-like structure during cell division; the predicted size of ftsZ (MYPE8370) of M.penetrans is longer than that of other bacteria and both the N- and C-terminal sequences are M.penetrans specific. We also identified several candidates for adhesin including one CDS (MYPE1950), which has some similarity to M.pneumoniae’s adhesin P1 ortholog in Mycoplasma pirum (49). Mycoplasma penetrans also possesses CDSs encoding cytotoxic factors such as one similar to a predicted vacuolation-inducing cytotoxin-related molecule CagA of Helicobacter pylori (MYPE4730), hemolysin (MYPE1500), protease (MYPE700, MYPE8500) and endonuclease (MYPE1190). Induction of vacuolation by M.penetrans in HeLa cells was reported previously (50).
We also observed induction of vacuolated degeneration in epithelial cells of macaque trachea a few hours after infection with the HF-2 strain (unpublished data). The function of the CagA-like protein is interesting and remains to be analyzed. The production of hydrogen peroxide has been suggested to injure host cells by inhibiting catalase activity of the host cells during mycoplasma infection (51). As for antioxidants, M.penetrans possesses orthologs for thiol peroxidase (MYPE3980) and glutathione peroxidase (MYPE5120). These enzymes may protect M.penetrans from oxidative molecules of its own making and from those produced by host cells.
The M.penetrans HF-2 strain possesses a 1.3 Mb genome, which is the largest among the Mycoplasmataceae species thus far analyzed. We identified 1038 CDSs, among which 463 are M.penetrans specific. The relatively large M.penetrans genome may be accounted for by both its rich core proteome and the presence of paralog families, corresponding to 25.4% of all CDSs. The largest paralog family is the p35 gene family, consisting of 44 genes for surface antigenic lipoproteins, leading us to hypothesize that possession of a number of paralogs would enable this mycoplasma to express a large repertoire of different antigenic epitopes, presenting a sophisticated strategy for eluding the host immune system to maintain persistent infection. The genetic tree of the p35 gene family and dot-plot analysis of the chromosome suggested the occurrence of dynamic chromosomal rearrangement in paralogous gene cluster formation. Mycoplasma penetrans lacks the uridine kinase gene, suggesting the involvement of uracil phosphoribosyltransferase and an orotate-related pathway for UMP production in pyrimidine metabolism in this species. Several candidates for virulence factors such as a cytoskeletal component of the attachment organelle, adhesin and cytotoxic factors were also identified.
The data obtained in this study should be of great use for understanding the mechanism of M.penetrans infection of humans and will also provide new insights into the regulation of virulence factors in M.penetrans as well as other Mycoplasmataceae.
Supplementary Material is available at NAR Online.
The authors would like to express their appreciation to Dr A. Blanchard at INRA, Centre de Recherche de Bordeaux, France, and Dr K. Hashimoto and Dr Y. Arakawa at the National Institute of Infectious Diseases for their helpful suggestions. The authors are grateful to Mr S. Minns and Dr T. D. Taylor for careful reading of the manuscript. This work was supported in part by the Research for the Future Program from the Japan Society for the Promotion of Science (JSPS).