Streptococcus pneumoniae is among the most significant causes of bacterial disease in humans. Here we report the 2,038,615-bp genomic sequence of the gram-positive bacterium S. pneumoniae R6. Because the R6 strain is avirulent and, more importantly, because it is readily transformed with DNA from homologous species and many heterologous species, it is the principal platform for investigation of the biology of this important pathogen. It is also used as a primary vehicle for genomics-based development of antibiotics for gram-positive bacteria. In our analysis of the genome, we identified a large number of new uncharacterized genes predicted to encode proteins that either reside on the surface of the cell or are secreted. Among those proteins there may be new targets for vaccine and antibiotic development.
Worldwide, approximately 1.1 million deaths annually are attributed to Streptococcus pneumoniae infection (22), accounting for 9% of all deaths in underdeveloped countries (37). S. pneumoniae disease is not limited to the developing world. Despite the availability of a broad arsenal of antibiotics and a vaccine, S. pneumoniae remains one of the top 10 causes of death in the United States (22). Furthermore, nearly one-third of the S. pneumoniae isolates obtained from patients in the United States are resistant to penicillin (11, 45) and the incidence of strains resistant to multiple antibiotics is increasing, making infections caused by this organism more difficult to treat.
S. pneumoniae is a gram-positive coccus and a member of the lactic acid bacteria, so named for their primary metabolic byproduct. The lactic acid bacteria include the lactococci, a group important in food and dairy industries, and the genera Enterococcus and Streptococcus. Bacteria belonging to the genus Streptococcus live in association with animal hosts, as either pathogenic or commensal organisms. Human pathogens include the beta-hemolytic species, such as Streptococcus pyogenes (Lancefield group A) and Streptococcus agalactiae (group B), as well as the human cariogenic species Streptococcus mutans. A number of commensal species of streptococci can occasionally cause opportunistic infections. S. pneumoniae (also known as pneumococcus or Diplococcus pneumoniae) is the major cause of acute bacterial pneumonia and otitis media. S. pneumoniae is also a transient commensal, colonizing the throat and upper respiratory tract of 40% of humans. S. pneumoniae isolates vary in their polysaccharide capsule, and at least 90 different capsule types have been identified. Specific capsule types are associated with the capacity to cause severe disease.
To aid the search for new therapies, we determined the entire genomic DNA sequence of S. pneumoniae strain R6. S. pneumoniae R6 is a descendant of the type 2 capsule S (smooth or encapsulated) clinical isolate used by Avery and coworkers to demonstrate the genetic function of DNA (2), and it is used worldwide as a standard laboratory strain. The lack of a polysaccharide capsule in R6 renders it avirulent and a safe strain with which to work. The essential utility of the strain is its genetic malleability.
The S. pneumoniae R6 isolate was obtained from Alexander Tomasz (Rockefeller Institute, New York, N.Y.). The strain is hex+, not a hex mutant as had been reported previously (3). The parental S. pneumoniae strain for R6 is R36A, which is a nonencapsulated strain derived from the capsular type 2 clinical isolate strain D39. R36A has multiple interruptions in the type 2 capsular locus inherited from D39 (21). Rollin Hotchkiss assayed single R36A colonies for competence in transformation. S. pneumoniae R6 was selected based on a high capacity to be transformed to penicillin resistance by using DNA from a laboratory-constructed isolate of penicillin-resistant S. pneumoniae. The sequenced isolate of S. pneumoniae R6 is available from the American Type Culture Collection (ATCC BAA-255).
Genomic DNA was isolated from bacteria grown in brain heart infusion medium (Becton Dickinson, Franklin Lakes, N.J.). The purification process included multiple phenol extractions, ethanol precipitations, and spoolings. DNA was sheared, size fractionated, and used to create plasmid and fosmid libraries. Clones from those libraries were end sequenced using both dye-primer and dye-terminator DNA sequencing methods. In the random shotgun phase of the project, ≈44,000 sequences were obtained. Gaps were closed either by sequencing spanning PCR products or by directly sequencing from the ends of contigs using custom primers and genomic DNA as a template (18). DNA sequences were analyzed and assembled using PHRED, PHRAP, and CONSED (http://www.phrap.org/) (13, 17). Insertion sequences and rRNA operons present in multiple copies created sequence assembly problems because no single DNA sequence covered the entire repetitive element. We developed a high-scoring-pairs algorithm to correctly assemble contigs flanked by these large repetitive elements (unpublished data). The sequence assembly was confirmed by a combination of Southern blotting, PCR, and comparison of the electrophoretically measured insert sizes to map locations of the end sequences from fosmid and plasmid inserts.
One open reading frame (ORF), encoding a hypothetical surface protein (spr0075), was predicted by this assembly to contain five copies of a 456-bp nearly perfect repeat. Southern blot analysis of this region of the genome suggested that there were seven copies of this repeat present within this gene (data not shown). We predicted that this gene should be approximately 912 bp larger than indicated; however, the reported sequence for the complete genome did not include the additional predicted but unsequenced 912 bp.
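The size discrepancy follows directly from the two repeat counts. A minimal sketch of that arithmetic in Python; the constant and function names are illustrative and are not part of the original analysis:

```python
REPEAT_LENGTH_BP = 456    # length of one repeat unit in spr0075
COPIES_IN_ASSEMBLY = 5    # copies predicted by the sequence assembly
COPIES_BY_SOUTHERN = 7    # copies suggested by Southern blot analysis


def missing_repeat_bp(repeat_len, assembled_copies, observed_copies):
    """Return the number of base pairs absent from the assembled gene."""
    return (observed_copies - assembled_copies) * repeat_len


missing_bp = missing_repeat_bp(
    REPEAT_LENGTH_BP, COPIES_IN_ASSEMBLY, COPIES_BY_SOUTHERN
)
print(missing_bp)  # 912
```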
Annotation of the S. pneumoniae genome was performed by utilizing a combination of programs for gene prediction, similarity searching, and functional assignment. The information from these analyses was imported into a relational database based upon Microsoft SQL Server. The user interface for this database was a series of web pages that accessed the SQL Server database and allowed us to query available analysis data. Additional pages allowed us to directly hand-annotate the individual gene records, thus allowing us to refine start sites and add functional descriptions and notes. This web-based client interface was developed using Microsoft Active Server Pages technology to directly query and update the database records. Basic sequence analysis tools were provided by the Genetics Computer Group (GCG) package of programs (Wisconsin Package version 10.0; GCG, Madison, Wis.).
Determination of potential protein-encoding sequences utilized Glimmer (http://www.tigr.org/softlab/glimmer/glimmer.html) (9) to create organism-specific ORF models that could then be used to search the entire genome for ORFs matching the predictive models. The genome was first arranged such that the initial base of the ATG start codon of the putatively identified dnaA gene was base number 1 of the forward strand. Each predicted ORF was then assigned an S. pneumoniae identification number. ORF spr0001 was assigned to the putative dnaA gene, and each subsequent ORF was then numbered consecutively according to its left-most base (the start codon for ORFs on the forward strand, and the stop codon for genes on the reverse strand).
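The numbering rule described above can be sketched as follows. This is an illustration only, not the code used for the annotation: the Orf record, the helper names, and the coordinates in the example are all hypothetical.

```python
from typing import NamedTuple


class Orf(NamedTuple):
    start: int   # 1-based coordinate of the first base of the start codon
    stop: int    # 1-based coordinate of the last base of the stop codon
    strand: str  # "+" for the forward strand, "-" for the reverse strand


def leftmost_base(orf):
    # On the forward strand the start codon is left-most;
    # on the reverse strand the stop codon is left-most.
    return min(orf.start, orf.stop)


def assign_ids(orfs, prefix="spr"):
    """Number ORFs consecutively by left-most base, starting at 0001."""
    ordered = sorted(orfs, key=leftmost_base)
    return {f"{prefix}{i:04d}": orf for i, orf in enumerate(ordered, start=1)}


# Made-up coordinates; only the ordering logic matters here.
example_orfs = [
    Orf(3000, 3500, "+"),
    Orf(2900, 1500, "-"),  # reverse strand: stop codon is the left-most base
    Orf(1, 1362, "+"),     # ORF beginning at base 1, as for dnaA
]
ids = assign_ids(example_orfs)
```

In this toy input the ORF that begins at base 1 receives spr0001, and the reverse-strand ORF is placed by its stop codon coordinate (1500), so it receives spr0002.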
BLAST searches were performed on all predicted ORFs using a blastp search of amino acid similarities to sequences in the GenBank nonredundant protein database. The BLAST data were parsed using the blast modules of the BioPerl tool kit (http://www.bioperl.org) and then imported into SQL server tables for analysis. In addition to BLAST similarity searching, we also tentatively identified functional domains within the S. pneumoniae ORFs by searching for similarities to the Prosite motif library (20) and the Blocks database of protein families (19). Programs from the GCG package provided composition and hydrophobicity analyses along with scanning for potential signal peptide domains. The results of these additional analyses allowed us to refine the gene assignments initially made with BLAST. Furthermore, alignments with known proteins provided assistance with start-codon prediction.
The results of all of these searches were used to provide putative identification of each S. pneumoniae ORF when a significant hit between an S. pneumoniae sequence and GenBank sequence was found. A combination of computer-aided gene prediction along with human inspection of each gene record was then used to finalize gene assignments for each S. pneumoniae ORF.
To identify genomic sequences that code for tRNAs, the set of programs that encompass the software package tRNAscan-SE (24) was used. rRNAs were identified by their similarity to the corresponding genes in the Ribosome Database Project sequence database (25). The sequences for tmRNA (51), the 4.5S signal recognition particle (53), and RNase P (28) were also identified based upon sequence similarities with known representatives of these RNA genes.
The genomic sequence was assigned accession number AE007317 in the GenBank database. The annotated genome and supplementary data are available on the World Wide Web at http://www.lilly.com/s.pneumoniae.
The S. pneumoniae single circular chromosome of 2,038,615 bp (40% G+C content) contains 2,043 predicted protein coding regions and 73 noncoding RNA genes, which include four rRNA operons. The genomic origin of replication has not been experimentally identified in S. pneumoniae; however, based on the presence of clusters of DnaA boxes and other genomic features, we hypothesize that the S. pneumoniae origin of DNA replication is upstream of dnaA, which is gene spr0001 in our nomenclature system (16).
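For readers unfamiliar with the G+C statistic, the sketch below shows one common way to compute it from a nucleotide string. The function name and the toy sequence are assumptions made for illustration; the original analysis used the GCG package and other tools, not this code.

```python
def gc_fraction(sequence):
    """Return the fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    return gc / total if total else 0.0


# Toy sequence with 4 G/C bases out of 10, i.e. 40% G+C.
toy_fraction = gc_fraction("GCGCATATAT")
```

Ambiguous symbols such as N are ignored in both numerator and denominator, which is a design choice of this sketch rather than a universal convention.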
As anticipated from the work of Iannelli and colleagues (21), relative to strain D39, the encapsulated strain from which R6 was ultimately derived, we noted a 7,504-bp deletion within the ≈18-kbp region that encodes the capsule biosynthesis genes. This deletion results in the absence of seven complete genes as well as the 3′ end of cps2A and the 5′ end of cps2H.
Other than genes associated with capsule synthesis, the genes encoding several putative virulence functions are present in the R6 genome (Table 1). These include the genes for previously described S. pneumoniae surface proteins, secreted proteins and bacteriocins, and all previously reported two-component response regulator systems (13 potential histidine protein kinases and response regulator pairs plus an unpaired 14th response regulator) (46).
Drug resistance via efflux pumps is an important contributor to virulence in Staphylococcus aureus and other gram-positive pathogens. Although S. pneumoniae contains 14 genes that are possible antibiotic efflux pumps (Table 2), these efflux pump genes may not be significant contributors to S. pneumoniae virulence. Antibiotic extrusion is not as common a source of resistance in S. pneumoniae as it is in S. aureus (e.g., quinolones). S. pneumoniae is not intrinsically resistant to most classes of agents, and among the exceptions (e.g., aminoglycosides and quinolones) resistance is not the result of drug efflux pumps.
Surface proteins are of special interest because of their potential role in virulence and their possible utility in vaccine development and also because of their potential accessibility to antimicrobial agents. The R6 genome includes single copies of the previously described virulence-associated genes, including four that encode proteins that are under study as vaccine candidates (PspA, PsaA, CbpA, and pneumolysin) (5).
Based on sequence analyses, we predict that a large number of proteins either reside on the S. pneumoniae cell surface or are secreted from the cell. These proteins include 471 with predicted signal peptide sequences, 109 possessing lipoprotein lipid attachment sites, and 10 that are recognized by choline-binding domains, an unusual means of surface attachment found in S. pneumoniae (Fig. 1). We could predict no function for ≈23% of these potential surface-located or secreted proteins, which likely play roles in pneumococcal cell surface biology.
In S. aureus, a transpeptidase called sortase anchors exported proteins containing an LPXTG motif followed by a C-terminal hydrophobic domain and a charged tail to the cell wall peptidoglycan (29, 41). Although sortases are predicted to be present in all gram-positive bacteria, previously no sortase ortholog had been identified in S. pneumoniae, nor could we identify one using BLAST searches. Using a Smith and Waterman algorithm (43), we determined that a single S. pneumoniae R6 gene, spr1098, likely coded for a sortase. Furthermore, we identified 13 genes that coded for proteins containing the LPXTG motif and other sortase substrate features. Six of these proteins are known to be present on the cell surface, while seven are novel and currently categorized as hypothetical with no known function (Fig. 1). Our observations about sortase in strain R6 conflict with a report by Pallen and colleagues, who identified four sortase-like protein genes and “many” potential sortase substrates in the genome of a virulent S. pneumoniae strain with a type 4 capsule (34).
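A screen for the sortase substrate features named above (an LPXTG motif, then a hydrophobic domain, then a charged tail at the C terminus) might look like the sketch below. It is a heuristic written for illustration, not the procedure used in this study: the window size, tail length, hydrophobicity threshold, residue sets, and the toy sequence are all assumptions.

```python
import re

LPXTG = re.compile(r"LP.TG")    # X is any residue
HYDROPHOBIC = set("AILMFVWG")   # assumed residue set for the membrane anchor
POSITIVE = set("KR")            # assumed residues marking the charged tail


def looks_like_sortase_substrate(protein, window=50, tail_len=8,
                                 min_hydrophobic=0.6):
    """Heuristically test for LPXTG + hydrophobic domain + charged tail."""
    region = protein.upper()[-window:]       # inspect the C terminus only
    hits = list(LPXTG.finditer(region))
    if not hits:
        return False
    downstream = region[hits[-1].end():]     # residues after the last motif
    if len(downstream) <= tail_len:
        return False
    anchor, tail = downstream[:-tail_len], downstream[-tail_len:]
    hydrophobic_fraction = sum(aa in HYDROPHOBIC for aa in anchor) / len(anchor)
    has_charged_tail = any(aa in POSITIVE for aa in tail)
    return hydrophobic_fraction >= min_hydrophobic and has_charged_tail


# Invented sequence: filler, LPETG, a hydrophobic stretch, then a charged tail.
toy_protein = "MKKS" * 10 + "LPETG" + "AIVLLAGVAFLLVAIL" + "RRKKEDNK"
toy_result = looks_like_sortase_substrate(toy_protein)
```

A real screen would also check for an N-terminal signal peptide and would need thresholds tuned against known substrates.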
Lipoteichoic acid (LTA) is another cell surface component that contributes to the bacterium's interaction with the human host. Biochemical studies have not detected the presence of D-alanine, a key component of LTA in many organisms, in S. pneumoniae LTA (14). In conflict with the apparent absence of D-alanine in S. pneumoniae LTA, we found an apparently complete dltABCD operon that is homologous to those responsible for the addition of D-alanine to LTA in Bacillus subtilis and in Lactobacillus casei. Previously we suggested that this operon may be silent or defective or that these genes may be active under specific physiological conditions (3). Gene expression studies (data not shown) revealed expression of mRNA from each of these genes under normal laboratory growth conditions. The precise role of the dltABCD operon in the biology of S. pneumoniae remains unknown, although inactivation of this operon in S. mutans confers increased acid sensitivity (4).
S. pneumoniae competence, i.e., its natural capacity to take up DNA, has been studied in detail and a number of competence-specific operons have been identified (23). S. pneumoniae R6 contains all of the genes induced during competence as noted by Lee and Morrison (23), including two identical copies of comX (each adjacent to a ribosomal operon). Additional genes reported to be induced during competence, but whose role in this process remains unknown, are also present (36, 40). These 49 putative competence genes are grouped in 30 apparent operons and are found mostly on the leading strands extending away from the putative origin of replication (as are ≈80% of all S. pneumoniae genes).
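Whether a gene lies on the leading strand depends on its strand and on which half of the chromosome it occupies relative to the origin. The sketch below illustrates that test under two assumptions that are not established in the text: the origin is taken to be at base 1 (upstream of dnaA), and the replication terminus is placed at the midpoint of the chromosome. The function name is hypothetical.

```python
GENOME_LENGTH = 2_038_615  # bp, from the reported R6 sequence


def is_codirectional(gene_midpoint, strand, genome_length=GENOME_LENGTH,
                     terminus=None):
    """Return True if a gene is transcribed in the direction of fork movement.

    Assumes the origin is at base 1 and, unless given, that the terminus
    lies at the midpoint of the circular chromosome.
    """
    if terminus is None:
        terminus = genome_length / 2
    # Forks move rightward from the origin to the terminus in the first half
    # and leftward from the origin (around the circle) in the second half.
    in_first_half = gene_midpoint <= terminus
    return (strand == "+") == in_first_half
```

Applied to every annotated gene, the fraction returning True would correspond to the proportion on the leading strands cited above, provided the assumed terminus is close to the true one.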
As might be predicted as a consequence of the capacity of S. pneumoniae to take up DNA, its genome is littered with genes that are apparently derived from other bacteria. Horizontal gene transfer is clearest for those genes that have been found only in gram-negative bacterial genomes. There are 40 ORFs that are similar to genes in gram-negative bacteria and that have not been found in other gram-positive genome sequences. This is not surprising, because S. pneumoniae occupies the same niche in the human respiratory system as several gram-negative species. Additionally, at least 2% of S. pneumoniae genes are significantly truncated relative to orthologous genes characterized in other bacteria (Table 3). Many of the deletions are at the 5′ ends of the ORFs, which suggests that the ORFs may be nonfunctional remnants of their parental genes. This incidence may also be a consequence of competence. Coding regions may be missing from these genes because only part of the ORFs were acquired during the assimilation of foreign DNA, or because the genes were not essential to the pneumococcus and mutations are of no consequence. Transporters are the most frequently truncated genes. Among that set are five ORFs that are similar to genes encoding drug efflux pumps.
In most respects, the S. pneumoniae gene complement is very similar to that of the prototypic gram-positive bacterium B. subtilis. More than 53% of the S. pneumoniae genes have highly similar counterparts in the B. subtilis genome (Fig. 2). Systems for cell division, DNA replication and repair, translation, cell wall biosynthesis, and some central catabolic and biosynthetic pathways are basically the same as in B. subtilis. Major cellular systems and features that are notably different include energy metabolism, transport, amino acid biosynthesis, transcription termination, intracellular proteases, and the presence of three large sets of S. pneumoniae-specific repetitive elements in the genome.
As is characteristic of the lactic acid bacteria, S. pneumoniae is a nutritionally fastidious facultative anaerobe requiring a complex medium for growth. This bacterium obtains energy strictly via fermentation and is incapable of respiratory metabolism, either aerobically or anaerobically, as is true of all streptococcal species (38). The only nutrients from which the streptococci can obtain sufficient energy to support growth and cell division are carbohydrates, which are oxidized to pyruvate via glycolysis (with the exception of a few species that can ferment arginine). We identified a large set of genes that encode enzymes necessary for transport of at least 12 different carbohydrates into the cell and for their subsequent conversion to an intermediate in glycolysis.
S. pneumoniae R6, as expected, encodes all genes necessary for the oxidation of carbohydrates to pyruvate via glycolysis and would be expected to reoxidize most, if not all, of the NADH so produced by reducing pyruvic acid to lactic acid. R6 contains genes for the synthesis of phosphotransacetylase, acetokinase, and NADH oxidase, which would allow it to convert pyruvate to acetate with concomitant production of an additional ATP, and the reoxidation of NADH (38). Although fermentation is the least energy efficient of oxidative processes, S. pneumoniae did not maximize this energy production by exclusively using the phosphoenolpyruvate-dependent phosphotransferase system to import carbohydrates. Five sugar species are imported by using energetically less-efficient ABC transporters. We found no genes that might encode cation antiporters of sugars, although we identified several amino acid/cation symport systems (Fig. 1). All genes necessary for synthesis of the major ATPase of lactic acid bacteria, the F0F1-ATPase, are present. This proton pump works at the expense of ATP, but it can also serve as an ATP synthase, as well as serving as the major regulator of intracellular pH among lactic acid bacteria. We did not find the genes required for a complete electron transport chain that might be associated with either aerobic or anaerobic respiration.
No lactic acid bacterial species encodes a complete tricarboxylic acid (TCA) cycle, and the S. pneumoniae R6 genome contains none of the 18 genes comprising this aerobic oxidative pathway. In other organisms, including those without complete TCA cycles, some of the TCA enzymes also have roles in the synthesis of certain amino acid precursors. As a result, S. pneumoniae R6 is incapable of synthesizing aspartate (and hence lysine, methionine, threonine, and isoleucine) from oxaloacetate, nor can it synthesize glutamate (and hence arginine) via α-ketoglutarate. A defined medium developed specifically for S. pneumoniae contains those amino acids, so the incomplete biosynthetic pathways were expected. We were unable to identify complete pathways for the synthesis of glycine, histidine, and leucine, all of which are included in the S. pneumoniae defined medium. Valine is also included in the S. pneumoniae defined medium, so identification of an apparently complete pathway for valine biosynthesis was unexpected (42).
The presence in S. pneumoniae of an ortholog to the Cercospora nicotianae pdx1 gene suggests S. pneumoniae may have a pyridoxal biosynthetic pathway (12). The biosynthesis pathways for other required cofactors (biotin, choline, pantothenate) are either incomplete or absent. Presence of partial pathways for the synthesis of many of these amino acids and cofactors in S. pneumoniae is not surprising. In many cases these enzymes make possible the conversion of molecules imported into the cell into other necessary metabolic components. Glutamine is an example of this type of metabolic conversion. Although S. pneumoniae cannot make the starting material for glutamate, α-ketoglutarate, it does encode the enzymes needed to utilize glutamine as a nitrogen source. In S. mutans, glutamine has been shown to be a principal source of nitrogen (8). There are 22 genes encoding the elements of 7 different ABC transporters predicted to transport glutamine. That represents 10% of the transport genes in S. pneumoniae. The allocation of the S. pneumoniae genome capacity to glutamine transport suggests that glutamine is also needed for more than its role as a component of proteins.
Hydrogen peroxide is produced by S. pneumoniae through the action of pyruvate oxidase (SpxB) under conditions of aerobic growth. This may be a mechanism by which the pneumococcus inhibits the growth of other common pathogens of the human upper respiratory tract such as Haemophilus influenzae, Moraxella catarrhalis, and Neisseria meningitidis, which are infrequently cocultured with S. pneumoniae from patient samples (35). Unlike those gram-negative species, S. pneumoniae has the capacity to resist oxidative stress caused by H2O2. In Escherichia coli, oxidative stress induces the expression of a set of ≈30 proteins under the transcriptional control of OxyR (6). Based on their similarity to OxyR, either spr0593 or spr0828, which are bacterial regulatory proteins of the LysR family, might possibly regulate the S. pneumoniae enzymes synthesized in response to H2O2. These include superoxide dismutase, glutathione reductase, glutaredoxin, DNA-binding stress protein, and two thioredoxin reductases. Although there are also genes for several peroxidases that can be used to ameliorate oxidative stress, S. pneumoniae does not encode catalase.
Bacterial energy-dependent intracellular proteases perform a variety of tasks, possibly including that of the proteasome, which degrades aberrant and nonfunctional proteins in eukaryotes and archaea (10). S. pneumoniae R6 possesses single copies of the genes encoding the ClpP and FtsH proteases, but it notably lacks the genes encoding HslV and the ubiquitous Lon protease. While some bacteria with relatively large genomes, such as E. coli and B. subtilis, encode all four of these energy-dependent proteases, most eubacteria encode only a subset (10). The S. pneumoniae energy-dependent protease gene set appears to be characteristic of the gram-positive genera Enterococcus, Streptococcus, and Staphylococcus, but not of the mycoplasmas.
DNA sequences from three classes of repetitive elements, BOX, RUP, and IS, comprise >3% of the S. pneumoniae genome. These kinds of repetitive elements make up more of the S. pneumoniae genome than of any other bacterial genome sequenced to date. Functions for some of these sequences are controversial. The BOX elements are predicted to form stable secondary structures that may serve as the binding site for a protein responsible for modulating the expression of downstream genes (27). Insertion of heterologous DNA into the BOX element upstream of the comA gene produces S. pneumoniae incapable of competence (27). Additionally, Weiser showed that insertion of a BOX element upstream of a locus apparently involved in phase variation increases expression of downstream genes encoding the opacity phenotype (50). The 107-bp RUP elements, predicted to form stable secondary structures, are proposed to be active insertion elements transactivated by the transposase of IS630-Spn1 (32).
Almost all BOX and RUP elements are entirely in intergenic spaces. We hypothesized that analysis of the locations of the BOX and RUP elements relative to the transcriptional orientation of the genes surrounding them might offer clues about their potential regulatory roles. In S. pneumoniae between adjacent genes in the same transcriptional orientation, the boundaries of transcriptional units are often unclear; accordingly, promoters and transcription termination signals are difficult to identify. Between pairs of adjacent genes oriented 5′ end to 5′ end on opposite strands of the chromosome, or at least in the vicinity of those genes, there are likely to be pairs of transcriptional promoters. Likewise, there must be a transcription termination signal or signals between pairs of adjacent genes oriented 3′ end to 3′ end on opposite strands of the chromosome (factor-independent transcription termination signals have been identified in streptococci that can function bidirectionally). Almost 3 times as many BOX elements and 1.5-fold more RUP elements are located between the genes oriented 3′ to 3′ than between genes oriented 5′ to 5′, and the insertion of IS elements flanking some of these elements may artificially deflate those ratios (Table 4). This suggests a role for the RUP and BOX elements in transcription termination. Another possible function is suggested by the fact that secondary structures that RUP and BOX elements would assume at the 3′ ends of mRNAs are more complex than those observed for other factor-independent transcription termination signals (27, 32), and S. pneumoniae does not encode rho factor. These elements might enhance gene expression by either stabilizing mRNAs or serving as binding sites for regulatory proteins.
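The orientation analysis described above amounts to classifying each intergenic space by the strands of its two flanking genes and then counting the repeat elements that fall within it. The following sketch shows one way to do this; the function names and coordinates are invented for illustration, and the chromosome is treated as linear (the gap that spans the origin is ignored).

```python
from collections import Counter


def pair_orientation(left_strand, right_strand):
    """Classify the space between two adjacent genes by their strands."""
    if left_strand == "-" and right_strand == "+":
        return "5'-5'"  # divergent: both 5' ends face the intergenic space
    if left_strand == "+" and right_strand == "-":
        return "3'-3'"  # convergent: both 3' ends face the intergenic space
    return "same"       # both genes are transcribed in the same direction


def count_elements_by_orientation(genes, elements):
    """Count repeat elements lying wholly between adjacent genes.

    genes: (start, end, strand) tuples sorted by start, with start < end.
    elements: (start, end) tuples for repeat elements such as BOX or RUP.
    """
    counts = Counter()
    for (_, l_end, l_strand), (r_start, _, r_strand) in zip(genes, genes[1:]):
        for e_start, e_end in elements:
            if l_end < e_start and e_end < r_start:
                counts[pair_orientation(l_strand, r_strand)] += 1
    return counts


# Invented coordinates for four genes and four repeat elements.
toy_genes = [(1, 100, "+"), (300, 400, "-"), (600, 700, "+"), (900, 1000, "+")]
toy_elements = [(110, 140), (150, 250), (420, 520), (750, 850)]
toy_counts = count_elements_by_orientation(toy_genes, toy_elements)
```

In the toy data two elements fall in the 3′-to-3′ space, one in the 5′-to-5′ space, and one between genes in the same orientation. The ratio of the first two counts is the kind of figure summarized in Table 4.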
Previous analyses indicate numerous IS elements were present in the DNA of various strains of S. pneumoniae. The genome of R6 contains at least 60 complete or partial copies of 10 different IS elements, representing the families ISL3, IS5, IS630, IS3, IS30, and IS605. We identified three novel IS elements. We did not find an IS1202, which was previously identified in a progenitor of the R6 strain, S. pneumoniae D39 (31). Most of the copies of the IS elements appear to be only remnants, as only seven possess the expected full-length sequences of putative transposase genes. The remaining copies all contain frameshifts, stop codons, or both within the ORF, and many have substantial amino acid substitutions, suggesting that they are no longer active. It is possible that these inactive insertion elements still play an important role in the evolution of this genome. For example, they may provide regions of homology that are sites for homologous recombination in the acquisition of genes from related organisms carrying these same insertion sequences but different flanking genes (7).
The capacity to identify all potential genes within this pathogen should greatly facilitate the identification of novel targets for antibiotic discovery as well as new candidates for vaccine development. This process will be significantly enhanced by the comparison of the S. pneumoniae R6 sequence to that of pathogenic strains of S. pneumoniae (www.tigr.org and http://genome.microbio.uab.edu/strep/). These comparisons, in concert with genetic and gene expression studies, should catalyze expansion of S. pneumoniae biology.
We thank Amy Hahn, Travis Bennett, Lara Braverman, Joanne Dyer, Bruce Glover, Ken Holstein, Dennis Howell, Ivan Jenkins, Tammy Jones, Rebecca Leonard, Melud Nabavi, Regine Porter, Patricia Solenberg, and Angie Wu for their technical assistance, Ron Swanson for advice on the sequencing project, and Janet Yother and Dalai Yan for their advice on the manuscript.
Principal authors JoAnn Hoskins and John Glass contributed equally to this project.
This work was supported by Eli Lilly and Company and the Incyte Genomics, Inc. Pathoseq database program.