We hypothesize that Haemophilus influenzae, as a species, possesses a much greater number of genes than that found in any single H. influenzae genome. This supragenome is distributed throughout naturally occurring infectious populations, and new strains arise through autocompetence and autotransformation systems. The effect is that H. influenzae populations can readily adapt to environmental stressors. The supragenome hypothesis predicts that significant differences exist between and among the genomes of individual infectious strains of nontypeable H. influenzae (NTHi). To test this prediction, we obtained 10 low-passage NTHi clinical isolates from the middle ear effusions of patients with chronic otitis media. DNA sequencing was performed with 771 clones chosen at random from a pooled genomic library. Homology searching demonstrated that ~10% of these clones were novel compared to the H. influenzae Rd KW20 genome, and most of them did not match any DNA sequence in GenBank. Amino acid homology searches using hypothetical translations of the open reading frames revealed homologies to a variety of proteins, including bacterial virulence factors not previously identified in the NTHi isolates. The distribution and expression of 53 of these genes among the 10 strains were determined by PCR- and reverse transcription PCR-based analyses. These unique genes were nonuniformly distributed among the 10 isolates, and transcription of these genes in planktonic cultures was detected in 50% (177 of 352) of the occurrences. All of the novel sequences were transcribed in one or more of the NTHi isolates. Seventeen percent (9 of 53) of the novel genes were identified in all 10 NTHi strains, with each of the remaining 44 being present in only a subset of the strains. These genic distribution analyses were more effective as a strain discrimination tool than either multilocus sequence typing or 23S ribosomal gene typing methods.
Haemophilus influenzae is a small, naturally transformable, gram-negative coccobacillus that is associated etiologically with acute and chronic infections of the upper and lower human respiratory tracts. H. influenzae strains can be divided into two groups, according to the presence or lack of a polysaccharide capsule. The encapsulated strains are typed as a to f, depending on the antigenicity of their capsule, while the unencapsulated strains are referred to as nontypeable (nontypeable H. influenzae; NTHi). H. influenzae enters the body through respiration and establishes either an asymptomatic colonization or a frank infectious process within the host respiratory mucosa. It is estimated that up to 80% of the population are chronic H. influenzae carriers. Infections may be local, contributing to diseases such as otitis media, sinusitis, bronchitis, pneumonia, and chronic obstructive pulmonary disease (COPD), or they can be systemic when bacteria spread through the bloodstream to distant organs, causing other infections, such as meningitis (46, 70). NTHi strains are associated with chronic infections of the respiratory tree, including otitis media with effusion (OME) and COPD, both of which have been hypothesized to be biofilm illnesses (17, 27); however, definitive findings regarding the phenotypic nature of the NTHi strains in natural infections are still forthcoming.
NTHi is the predominant colonizer of the nasopharyngeal mucosa in patients with otitis media (38, 61). Characterization of NTHi populations in the nasopharynx has suggested the possibility that children and adults sometimes harbor multiple distinct strains concurrently (27, 57-59). In other cases, it appears that children will carry a given strain for several weeks to months, lose the first strain, and then acquire a new strain (19, 34). The number of separate times a child is colonized with a strain of NTHi correlates directly with the frequency of H. influenzae-associated otitis media. After an episode of otitis media, immunity against a specific NTHi isolate does not prevent recurrence caused by a different strain (9). Smith-Vaughan et al. (57-59) described simultaneous carriage and horizontal gene transfer among multiple NTHi strains in Australian Aboriginals. They documented numerous cases of horizontal transfer among individual strains within infectious isolates of the gene encoding the major outer membrane protein P2. Similarly, Hiltke et al. demonstrated in situ transfer of the P2 gene between NTHi strains during cocolonization of the respiratory tract of a patient with COPD (27). In another series of studies, by Murphy et al. (49), it was reported that multiple strains of NTHi were present simultaneously in the sputum of 26.3% of adults with COPD. Moreover, the authors suggested that these numbers likely underestimated the true frequency of the presence of multiple strains of NTHi, as they had sampled an average of only 6.3 colonies per isolate. These data, along with numerous specific reports of genic differences among the NTHi strains, support the concept that there is significant heterogeneity among the NTHi strains (12, 39, 51, 54, 65), which we have postulated may be important for chronic persistence in the host (11, 15, 52, 55). Thus, a better understanding of the genomic engine that drives pathogenicity of NTHi is important.
A significant number of genomic sequences identified in clinical strains of NTHi are absent from the genome of H. influenzae Rd KW20 (Rd), a laboratory strain with reduced virulence (22, 23). These sequences include genes encoding virulence-associated autotransporters (14, 48), hemagglutinating pili (66), novel NTHi sequences expressed during interaction with human epithelial cell lines (67), genes that are upregulated in the middle ear in a chinchilla model of acute otitis media (42), and uncharacterized sequences predicted to encode hypothetical or phage-like proteins (47). Such sequences have come to be recognized as components of a set of contingency genes that are available to H. influenzae at a population level and are distributed among the clinical strains; however, no single strain has the full complement of genes present in the population-based supragenome (15, 21, 24).
The ability to take up DNA provides a means for the exchange of these novel sequences among H. influenzae strains. Competent H. influenzae preferentially take up DNA from highly related organisms via the recognition of an overrepresented uptake signal sequence (USS) (13, 56). The highly efficient exchange of genes among competent strains is important; e.g., the Rd genome contains 1,465 USSs (56). Evidence that bacteria take up DNA more efficiently while growing as a biofilm (45), accompanied by the preliminary observation that H. influenzae grows as a biofilm in an experimental model of OME (17), supports, but does not prove, the hypothesis that horizontal gene transfer is an important pathogenic process for persistence during the course of a chronic infection. Such an increase in gene transfer would theoretically contribute to additional genetic heterogeneity (allelic differences) and genomic plasticity (genic differences) among the strains in a polyclonal infection, thereby further increasing the fitness of the bacterial population as a whole through the continual generation of novel recombinant strains. These observations underscore the importance of elucidating the full repertoire of virulence genes available to NTHi at a population level.
Ten H. influenzae strains were obtained as first-plate isolates on chocolate agar from the middle ear effusions of 10 children undergoing myringotomy for the treatment of otitis media at Children's Hospital of Pittsburgh. Each of these strains was streaked for single colonies and then grown in broth for the preparation of aliquots for cryopreservation. The strains, designated PittAA through PittJJ, were all determined to be nontypeable (NTHi) by slide agglutination assay-based typing at the Pittsburgh Public Health Laboratory, with the exception of AA, which was initially reported to be weakly reactive when tested with polyclonal sera for all capsular types, a to f. This specimen was nonreactive with individual anti-A and anti-B sera. This specimen was sent for typing to the New York State Department of Health's Wadsworth Laboratory, where it was determined to be nontypeable. In addition, a PCR-based capsular typing method (20), which has previously been used to resolve discrepancies in slide agglutination serotyping results (36), demonstrated that all 10 strains were nontypeable.
All NTHi strains were cultured in brain heart infusion broth (Becton Dickinson, Sparks, MD) supplemented with 10 μg/ml hemin (Fisher Scientific, Pittsburgh, PA), 2 μg/ml NAD (Sigma, St. Louis, MO), and 20 μg/ml thiamine HCl (Sigma) and grown at 37°C in a humidified 5% CO2 atmosphere. Escherichia coli TOP10 cells were grown in LB broth (Becton Dickinson) supplemented with 50 μg/ml kanamycin, when necessary, at 37°C.
Bacterial genomic DNA was extracted using a modification of the method described by Ausubel et al. (2). The DNA was analyzed by UV spectrophotometry and agarose gel electrophoresis.
A pooled genomic DNA library was constructed (18) to maximize efficiency and to minimize strain-specific biases which could result from the construction of multiple independent libraries. Individual genomic DNAs from each of the 10 clinical strains were isolated and hydrodynamically sheared to an average fragment length of 1.5 kb using a HydroShear (GeneMachines, San Carlos, CA) according to the manufacturer's instructions. Ten-microgram aliquots of each of the sheared DNAs were pooled, end repaired using T4 and Klenow polymerases (Invitrogen), ligated into the plasmid pCR4Blunt-TOPO, and transformed into E. coli TOP10 cells according to the manufacturer's protocol (Invitrogen) (18). A Q-bot multitasking robot (Genetix Limited, United Kingdom) was used to construct an addressable array containing 76,800 transformants, which were replica plated and stored in 10% glycerol at −80°C. Clones in the library were chosen randomly for further analysis.
Plasmid DNA templates were prepared for sequencing using QIAprep Miniprep kits (QIAGEN, Inc., Valencia, CA) with a Beckman FX robot (Beckman Instruments). Prior to sequencing, plasmid preparations were digested with EcoRI (Invitrogen) and analyzed on ethidium bromide-stained 1% agarose gels in Tris-acetate-EDTA buffer. Only constructs containing inserts larger than 0.5 kb were used as templates. Dideoxy sequencing was performed according to standard protocols for both IR2 Gene ReadIR instruments (LiCor, Inc., Lincoln, Nebraska) and Beckman CEQ 2000 XL automated capillary electrophoresis sequencing instruments (Beckman Coulter, Inc., Fullerton, CA). Additional sequencing was performed using an ABI 3730 DNA analyzer in which 3-μl reactions were prepared, using a Parallab 350-nanoliter genomic workstation (Brooks Automation, Inc., Chelmsford, MA), consisting of (i) 1.4 μl of plasmid template (approximately 100 ng DNA), (ii) 0.5 μl of primer (10 pmol/μl), and (iii) 1.1 μl of a BigDye Terminator v.3.1 cycle sequencing kit (Applied Biosystems, Inc.) (63). Reaction aliquots of 500 nl were then thermal cycled and purified within the Nano-Pipetter of the Parallab. Cycling conditions were 35 cycles with a 0-s denaturation step at 96°C, a 0-s annealing step at 50°C, and a 60°C extension step for 45 sec. The purified samples were then run on the ABI 3730 Analyzer and analyzed using ABI analysis software v.5.0.
Sequences were analyzed and contigs formed using Sequencher (v. 4.1.4) (Gene Codes Corporation, Ann Arbor, MI). DNA sequence similarity searches using the BLASTn and BLASTx algorithms (1) were performed using the Center for Genomic Sciences high-speed BLAST cluster (G. Erdos, unpublished results) that is connected to the NCBI website (http://www.ncbi.nlm.nih.gov/). This system, including a custom-designed software package, automatically performs vector trimming, sequence quality checks, and BLAST homology searches, permitting fast and accurate analyses of thousands of clones daily. Details of these programs and the BLASTn and BLASTx searches will be described elsewhere (J. Gladitz et al., unpublished). The novel nucleotide sequences reported in this paper have been deposited with GenBank.
Primers for distribution and expression studies were designed based on the most likely open reading frame (ORF) in each clone or contig. The primer sequences are posted at http://www.centerforgenomicsciences.org (go to Public Documents, Supplementary Table 1). Primers specific for the H. influenzae glyceraldehyde-3-phosphate dehydrogenase (GAPDH) gene served as positive controls for all strains. PCR was performed using an Eppendorf MasterTaq kit (Brinkman Instruments, Inc., Westbury, NY) in a 25-μl reaction mixture containing 0.6 units of Taq DNA polymerase, 50 ng of template DNA, 20 pmol of each primer, 1.5 mM MgCl2, and 0.2 mM deoxynucleoside triphosphates. PCR was carried out by using Perkin Elmer 9600 thermal cyclers beginning with a 10-min denaturation step at 95°C and followed by 35 cycles of 30 s at 94°C, 1 min at 55°C, and 1 min at 72°C. Cycling was followed by a final extension step of 7 min at 72°C and then a 4°C hold. Reactions were analyzed on 1.7% ethidium bromide-stained agarose gels.
Gene expression assays were designed for each of the non-Rd clones solely to determine if the novel sequences were part of functional transcriptional units. Total RNA for each of the 10 NTHi isolates was extracted using the hot phenol method (Invitrogen) from bacterial cells grown in supplemented brain heart infusion broth that had reached mid-exponential phase. Samples were treated with Turbo DNase (Ambion, Inc., Austin, TX) and then used as templates for PCR and reverse transcription (RT)-PCR. RNA quality was checked on an Agilent 2100 Bioanalyzer using an RNA 6000 Nano Assay kit (Agilent Technologies, Palo Alto, CA). All RNA preparations were assayed in RT-PCR-based assays and in sham (−RT) PCR-based assays, the latter to test the efficacy of the DNase treatment. For RT-PCR, random hexamers were mixed with 4 μg of total RNA from each strain, heat denatured, chilled on ice, and then divided equally into two tubes. Two master mixes were prepared for the reverse transcription step, one containing all of the RT components, including Moloney murine leukemia virus reverse transcriptase (Invitrogen) (+RT), and the other lacking the RT enzyme (−RT). Each pair of RNA specimens received an aliquot of the +RT mixture in one tube and the −RT mixture in the other. Following the RT reaction, 2.5 μl of the first-strand cDNA from each reaction was used as a PCR template. For all strains, GAPDH mRNA was amplified as a positive control. Negative controls consisted of reaction mixtures prepared with no template nucleic acid.
High-molecular-weight bacterial genomic DNAs were isolated from each of the 10 NTHi strains and Rd as described previously (18), digested to completion with EcoRI, and electrophoresed into a 1% agarose gel. Each gel also contained as a positive control one lane with a mixture of the unique plasmid clones that we were probing for. The DNAs were then transferred to positively charged nylon membranes (Amersham Pharmacia Biotech, Buckinghamshire, United Kingdom) by capillary blotting according to the method of Southern (60a) using 0.4 M NaOH. Probe production was accomplished by PCR of the plasmid inserts corresponding to the unique genes under study, followed by purification using QIAquick PCR purification kits (QIAGEN). Labeling of the probe consisted of denaturing 25 ng of each amplimer at 95°C for 5 min; quenching on ice; the addition of random primers buffer mixture (Random Primers DNA labeling system; Invitrogen), 6.0 μl of a dATP-dGTP-dTTP mixture, 50 μCi of 32P dCTP, and 3 U of Klenow fragment; and incubation at 25°C for 1 h, after which the reaction was stopped with 5 μl stop buffer. Purification of the radiolabeled probe was performed using gel exclusion chromatography (G-50 Sephadex columns, Roche Diagnostics, Indianapolis, IN), and the specific activity was measured with a QC4000XER counter (Bioscan, Washington, DC). Membranes were prehybridized at 42°C for 30 min in hybridization tubes using formamide prehybridization solution (5× SSC [1× SSC is 0.15 M NaCl plus 0.015 M sodium citrate], 5× Denhardt solution, 50% (wt/vol) formamide, 1% (wt/vol) sodium dodecyl sulfate [SDS]), with the addition of heat-denatured sheared salmon sperm DNA (Sigma) immediately before incubation in the hybridization oven (Biometra, Goettingen, Germany). After prehybridization, 2 × 10⁷ dpm of heat-denatured probe was added, and hybridization was carried out at 42°C overnight.
The membranes were washed with 2× SSC-0.1% (wt/vol) SDS for 5 min at room temperature three times, 0.2× SSC-0.1% (wt/vol) SDS for 15 min at room temperature, 0.2× SSC-0.1% (wt/vol) SDS for 15 min at 42°C, and 0.1× SSC-0.1% SDS for 15 min at 68°C, followed by autoradiography.
DNA sequencing was performed as described previously (http://www.mlst.net) for 7 H. influenzae housekeeping genes, adk, atpG, frdB, fucK, mdh, pgi, and recA, from each of the 10 NTHi clinical strains to determine their multilocus sequence types (MLSTs) (44). PCRs for each gene were performed using the published primers in a 96-well microtiter plate format using Perkin Elmer 9600 thermal cyclers. The PCR conditions were as follows: initial denaturation at 95°C for 4 min; 30 cycles of 95°C for 30 sec, 55°C for 30 sec, and 72°C for 60 sec; and 72°C for 10 min followed by a 4°C hold. All long-term sample storage was at −20°C. The PCR products were purified using the QIAquick PCR purification kit (QIAGEN) according to the manufacturer's protocol. DNA fragments were sequenced bidirectionally on the Beckman capillary DNA analysis system and the sequences were analyzed as described above. The alignment of multiple sequences was made using Clustal X. The sequence of each locus in each strain was then compared with the MLST database to obtain an allelic profile. If a locus sequence had no identical allele in the database, the sequence trace files were sent to the MLST curators for a final determination as to whether it represented a new allele. All confirmed new alleles were assigned a unique allele number for that gene locus. The combined allelic profiles of the seven gene loci for each strain resulted in a sequence type (ST) that was then compared with the existing ST profiles in the MLST database. A new ST number was assigned to a strain with no match in the database. Phylogenetic trees were generated by the unweighted-pair group method with arithmetic mean using MEGA v.2.1 software (35). Trees based on the average distance method were generated using the on-line java script available at the MLST website (http://www.mlst.net).
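The allele and sequence type assignment described above amounts to two exact-match lookups. The following is a minimal sketch under stated assumptions: the dictionaries `allele_db` and `st_db` are hypothetical stand-ins for the MLST database, and the function names are illustrative only, not part of any MLST software.

```python
from typing import Dict, Optional, Tuple

# The seven housekeeping loci named in the text.
LOCI = ("adk", "atpG", "frdB", "fucK", "mdh", "pgi", "recA")

Profile = Tuple[Optional[int], ...]


def allelic_profile(strain_seqs: Dict[str, str],
                    allele_db: Dict[str, Dict[str, int]]) -> Profile:
    """Look up each locus sequence by exact identity.

    allele_db is a hypothetical stand-in for the MLST database:
    locus -> {sequence: allele number}. A locus with no identical
    allele yields None (a candidate new allele for the curators).
    """
    return tuple(allele_db[locus].get(strain_seqs[locus]) for locus in LOCI)


def sequence_type(profile: Profile,
                  st_db: Dict[Tuple[int, ...], int]) -> Optional[int]:
    """Return the ST for a complete profile, or None if there is no match."""
    if None in profile:
        return None  # at least one unassigned allele, so no ST yet
    return st_db.get(profile)
```

A return value of None from either function corresponds to the cases in which a new allele or a new ST number would have to be requested.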
Total bacterial RNA from each of the NTHi strains was analyzed for 23S fragmentation patterns using an RNA 6000 Nano Assay chip and an Agilent 2100 Bioanalyzer. The sizes of the bands were measured by the software with the aid of an RNA marker ladder. Primers for amplifying fragments containing intervening sequences (IVSs) were used to detect the existence of the IVS1 and IVS2, according to the method of Song et al. (60). The PCR conditions were as follows: 3 min of denaturation at 94°C, followed by 30 cycles of 1 min at 94°C, 1 min at 55°C and 1 min at 72°C, with a final extension at 72°C for 7 min. The PCR products were run on 1% agarose gels.
The novel nucleotide sequences reported in this paper have been deposited in GenBank under accession numbers AY599423 to AY599486.
DNA sequencing was performed for 771 clones from a pooled genomic DNA library prepared from 10 NTHi clinical isolates obtained from children with OME (18). All unique sequences (non-Rd) were evaluated for their potential to encode virulence factors and to determine their distribution and expression among the strains that make up the library. The average insert size was 1.5 kb, and the average read length from each end was 650 bases, providing ~87% coverage from the initial read. The remaining sequence for each clone was determined by primer walking.
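The ~87% figure follows from two 650-base end reads on a 1.5-kb insert. A minimal sketch of that arithmetic (the function name is illustrative, and the cap at full coverage for short inserts is an assumption):

```python
def end_read_coverage(insert_bp: int, read_bp: int) -> float:
    """Fraction of an insert covered by one read from each end.

    Overlapping reads on short inserts are capped at full coverage
    (an assumption; the text only reports the average case).
    """
    covered = min(2 * read_bp, insert_bp)
    return covered / insert_bp


# Average values reported in the text: 1.5-kb inserts, 650-base reads.
print(f"{end_read_coverage(insert_bp=1500, read_bp=650):.1%}")  # prints 86.7%
```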
Clones that revealed a minimum of 350 bp of contiguous homology (at least 75% nucleotide identity) to Rd at each end and that produced a DNA fragment size in accordance with that predicted by the Rd genome were classified as being Rd-like. This method identified 699 of 771 clones (90.7%) as being Rd-like. Many of these Rd-like clones were demonstrated to contain small insertions and deletions and numerous point mutations, but in general, they represent allelic variations of known genes.
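The Rd-like classification rule can be written as a small predicate. This is a sketch, not the authors' software: the `EndHit` record and function names are hypothetical, and the 10% size tolerance is an assumption, since the text says only that the fragment size must be "in accordance with" the Rd prediction.

```python
from dataclasses import dataclass
from typing import Optional

MIN_MATCH_BP = 350     # minimum contiguous homology per end (from the text)
MIN_IDENTITY = 0.75    # minimum nucleotide identity (from the text)
SIZE_TOLERANCE = 0.10  # ASSUMPTION: allowed relative size deviation


@dataclass
class EndHit:
    """Best contiguous match of one clone end to the Rd genome."""
    aligned_bp: int
    identity: float  # fraction between 0 and 1


def end_passes(hit: Optional[EndHit]) -> bool:
    """True if this end meets both the length and identity thresholds."""
    return (hit is not None
            and hit.aligned_bp >= MIN_MATCH_BP
            and hit.identity >= MIN_IDENTITY)


def is_rd_like(left: Optional[EndHit], right: Optional[EndHit],
               observed_bp: int, predicted_bp: int) -> bool:
    """Both ends must match Rd and the insert size must agree with Rd."""
    if not (end_passes(left) and end_passes(right)):
        return False
    return abs(observed_bp - predicted_bp) <= SIZE_TOLERANCE * predicted_bp
```

A clone with only one end matching Rd fails this test, which is how sequences anchored at a single end fall into the non-Rd set.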
BLASTn analyses of 72 of 771 clones (9.3%) indicated that they contained inserts that were unique with respect to the Rd genome. Each of these clones was then compared to all of the other novel clones to identify overlaps. A total of five contigs were constructed from 12 clones. These contigs, together with the 60 nonoverlapping clones, made up the 65 sequences that were analyzed. Twenty-one of the 65 novel sequences contained a contiguous Rd-like sequence at one end, indicating a chromosomal origin (Fig. 1). The novel sequences that could be anchored with known H. influenzae genes were demonstrated to be distributed evenly around the Rd genome. This distribution of novel sequences throughout the genome suggests that strain evolution in H. influenzae is mostly incremental and is not associated with the acquisition of large pathogenicity islands, such as has been documented for the enteric pathogens (43). It further suggests that the pooled genomic library was unbiased in its coverage. This was a concern because of reports that genes closer to the origin of replication tend to be overrepresented in libraries prepared from rapidly doubling bacteria (23). The remaining 44 sequences shared no nucleotide-level homology with the Rd genome, consistent with a chromosomal insertion size greater than that of the clone, or of episomal or phage origin.
BLASTx searches were carried out to identify proteins with similarity to the conceptual protein translations of these novel DNA sequences. These novel DNA sequences and their cognate translations have been deposited in GenBank (accession numbers AY599423 to AY599486). Eighteen of the 65 novel sequences are most likely restriction/modification genes (n = 7) or phage genes (n = 11) and are not considered further here. The results of the remaining protein similarity searches produced a number of candidate virulence genes (Table 1).
The H. influenzae USS is a conserved 9-bp core sequence within a 29-bp consensus sequence (5′-aAAGTGCGGTnRW₅n₆RW₅-3′, where boldface indicates invariants and lowercase indicates consensus sequences). DNA containing this sequence is preferentially taken up by competent H. influenzae (13, 25, 56). We identified 36 USSs distributed among 30 of the 65 novel sequences, suggesting that this subset of novel sequences has been in the NTHi gene pool for an extended time. The actual number of USSs among these novel loci is likely greater than 36, since the majority of the clones contain only partial ORFs. Table 1 includes the occurrences of USSs among the novel ORFs that are presented in more detail below.
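Counting USS occurrences in a clone can be illustrated by scanning both strands for the 9-bp core. This is a minimal sketch that assumes exact matching of the core AAGTGCGGT only; it does not score the flanking 29-bp consensus, and the function names are illustrative rather than the authors' tooling.

```python
from typing import List, Tuple

USS_CORE = "AAGTGCGGT"  # 9-bp core from the consensus in the text

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_uss_cores(seq: str) -> List[Tuple[int, str]]:
    """Return (0-based start, strand) for every exact core match.

    Only the 9-bp core is checked; the flanking consensus positions
    are ignored in this simplified sketch.
    """
    seq = seq.upper()
    hits = []
    for strand, motif in (("+", USS_CORE),
                          ("-", reverse_complement(USS_CORE))):
        start = seq.find(motif)
        while start != -1:
            hits.append((start, strand))
            start = seq.find(motif, start + 1)
    return sorted(hits)
```

Because most clones hold only partial ORFs, a scan like this run on clone inserts would undercount the USSs of the full loci, which is the caveat the text raises.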
Interspecies exchange of these genes may also occur, given that the H. influenzae USS is also overrepresented in the genomes of other members of the Pasteurellaceae, including Haemophilus somnus, Pasteurella multocida, and Actinobacillus actinomycetemcomitans (3, 68). Moreover, DNA transfer between the upper respiratory pathogens H. influenzae and Neisseria meningitidis has been previously documented (14, 33). Although USS-mediated DNA uptake was not implicated in the exchanges with Neisseria, transformation with heterologous DNA has been suggested to play an important role in establishing chromosomal mosaicism in these organisms (33). Based on these findings, it seems plausible that the novel H. influenzae sequences identified herein, particularly those that harbor a USS, could be readily transferred not only to other strains of H. influenzae but also to other genera of the Pasteurellaceae.
Twelve of 65 (18%) of the non-Rd sequences showed amino acid (aa) homology to proteins from the hif, hmw, tna, and lex2 operons that have been previously identified for other H. influenzae strains but are absent from Rd. Gene distribution and expression assays were performed among the clinical isolates for these four gene clusters (Table 2). Three clones (0004_E21, 0133_D06, and 0152_N02) contained genes from the hemagglutinating pilus (hif) gene cluster. We detected transcripts for these hif genes in 2 of the 10 isolates (Table 2) and a hicB transcript in 7 isolates, including the 2 isolates that expressed the hif genes. Three clones (0120_C11, 0036_E20, and 0170_J08) displayed aa homology to the hmw gene cluster that encodes the high-molecular-weight adhesins HMW1 and HMW2 (4, 5, 62). Two clones (0009_E14 and 0013_D09) contained genes from the tryptophanase operon (tnaABC), and 9 of 10 NTHi isolates tested positive for tnaB expression and indole production, which has been associated with virulence (32, 41). Finally, three clones (0083_M12, 0047_C18, and 0093_M17) contained genes from the lex2AB locus, which controls virulence through the variable expression of lipooligosaccharide epitopes (10, 30). These genes possessed tetranucleotide repeats associated with phase variation.
We identified three novel hemoglobin binding protein (HGBP) genes that displayed high degrees of homology over short distances to known HGBP genes that alternated with regions that contained no identifiable homology. This mosaic pattern of gene evolution is likely reflective of the fact that heme acquisition is vital for NTHi, but that heme acquisition proteins are in continual contact with the host immune response, which puts enormous selective pressure on them. Thus, novel genes arise through domain swapping, making it difficult to establish evolutionary relationships among them. One of these genes (GenBank accession number AY599483) contained a conserved TonB-dependent receptor domain and was most similar to an HGBP gene of A. actinomycetemcomitans (26).
GenBank accession number AY599439 contains a USS and is predicted to encode a class of autotransporter (Las) that mediates its own outer membrane secretion (6, 14, 37). It is interesting that the only homology between this clone and other members of this family resides in the C-terminal region of the protein (96% aa identity), which is responsible for secretion, whereas the upstream sequence was completely unique and had a lower G+C content (35%) than the Las-like domain, indicating chimerism. Davis et al. and Loveless and Sair also observed that Las homologs often contain variant passenger domains acquired by lateral gene transfer that are linked to the conserved β domain (14).
GenBank accession number AY599442 contains the 3′ end of the purD allele, and a unique gene with seven GTTT tandem repeats located 5′ to the novel ORF. The tetranucleotide repeats indicate that this gene is likely phase variable through slip strand mispairing during replication. A USS in both purD alleles was identified at the point where identity ends, suggestive of a recombination event. We identified a consensus −10 promoter sequence and a potential PUR box (5′-AgGgcAACGTTTaCGa-3′) 58 nucleotides upstream of the start codon for the novel ORF, suggesting that this gene is also under control of the PUR repressor. The most significant match (47% identity) was with the YhbX protein of N. meningitidis (64), which is a member of a family of proteins predicted to be membrane-associated metal-dependent hydrolases (43) that are associated in E. coli with the ability to penetrate the blood-brain barrier of newborn rats.
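Tandem repeat tracts such as the seven GTTT copies noted here can be located with a regular expression. The sketch below is illustrative only; the function name and the default threshold of three copies are assumptions, not values from the study.

```python
import re
from typing import List, Tuple


def tandem_repeats(seq: str, unit: str,
                   min_copies: int = 3) -> List[Tuple[int, int]]:
    """Return (0-based start, copy number) for each run of `unit`.

    min_copies is an illustrative threshold (an assumption); runs
    shorter than this are ignored.
    """
    unit = unit.upper()
    pattern = re.compile(f"(?:{re.escape(unit)}){{{min_copies},}}")
    return [(m.start(), (m.end() - m.start()) // len(unit))
            for m in pattern.finditer(seq.upper())]
```

For a toy sequence built as "AC" followed by seven GTTT units and "GA", the function reports a single tract starting at position 2 with 7 copies.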
Clone 0125_L02 contained two regions, each of ~75 bp, with 83% nucleotide homology to the Shigella resistance locus-pathogenicity island (SRL PAI). The SRL PAI carries genes for antibiotic resistance, iron uptake, and at least 22 prophage-related ORFs (40). The conceptual translation of our clone revealed three proteins that showed limited homologies to the proteins of ORFs 7, 8, and 9 of the SRL PAI, respectively, including a LysR-like transcriptional regulator, an aspartate racemase, and an anaerobic decarboxylate transporter (40). It is likely that this group of genes, which is present in only 50% of the NTHi clinical strains, was only recently transferred as a group into H. influenzae. The G+C content (33%) of these genes is much lower than that of either H. influenzae (38%) or Shigella (45%).
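G+C content is used several times in this section as evidence of foreign origin. A minimal sketch of the calculation follows; the function names are illustrative, ambiguous bases are skipped by assumption, and the 38% background is the H. influenzae value quoted in the text.

```python
H_INFLUENZAE_GC = 0.38  # genome-wide value quoted in the text


def gc_content(seq: str) -> float:
    """Fraction of G and C among unambiguous bases (A, C, G, T).

    Ambiguity codes such as N are skipped (an assumption of this sketch).
    """
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no unambiguous bases")
    return sum(b in "GC" for b in bases) / len(bases)


def gc_deviation(seq: str, background: float = H_INFLUENZAE_GC) -> float:
    """Signed difference between a sequence's G+C and the background."""
    return gc_content(seq) - background
```

A negative deviation, as for the 33% locus described above, or a strongly positive one is suggestive of lateral acquisition but is not proof on its own.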
Clone 0167_A16 contained a >2.5-kb insert with two short regions of nucleotide homology to Rd of 96% and 90% and three ORFs. The recently available unfinished sequence for NTHi strain R2846 (A. L. Erwin, A. Smith, M. Kibukawa, Y. Zhou, R. K. Kaul, and M. V. Olson, GenBank accession number NZ_AADO00000000) revealed the presence of a virtually identical locus. However, the start codon that we identified for ORF 2 is 465 nucleotides upstream of the start codon predicted for the corresponding R2846 gene and downstream of a consensus RBS. Our start site prediction was supported by RT-PCR-based analyses. The three ORFs include a putative 260-aa protein product predicted to be a member of the metallo-beta-lactamase superfamily and two others that both encode proteins containing tetratricopeptide repeat (TPR) domains of the SEL-1 family of proteins. TPRs occur in a wide array of organisms, including bacteria, fungi, plants, and humans, where they participate as scaffolds for the mediation of protein-protein interactions associated with numerous cellular processes, including cell-cycle control, transcription, chaperone assistance and protein transport, and DNA uptake and recombination (8). These three gene products appeared to be coordinately expressed, based on RT-PCR studies.
Clone 0179_D14 showed 88% nucleotide homology to an Azotobacter vinelandii gene and to a gene of the plant pathogen Ralstonia solanacearum. The deduced aa sequence was 91% identical to the A. vinelandii Flp pilus assembly protein CpaF and 92% identical to the R. solanacearum conjugal transfer protein TrbB. CpaF and TrbB belong to a superfamily of proteins that spans both the archaea and the bacteria and that are involved in the formation of surface-associated protein complexes that mediate diverse processes, such as pilus biosynthesis, DNA transport, and the secretion of virulence factors, including the type II and IV secretion system NTPases (28, 31, 53, 69). The high degree of homology between the 0179_D14 sequence and the Azotobacter and Ralstonia genes, combined with its very high G+C content (68%) suggests that this gene has entered the H. influenzae supragenome relatively recently from a genus outside the Pasteurellaceae. This gene contains multiple tandem pentanucleotide (CCCGG) repeats, which in the present clone interrupt the reading frame. We are confident that this is not due to a sequencing error and propose that this repeat region plays a role in regulation through slip strand mispairing at the translational level, because our RT-PCR results identified transcript sequences extending 451 bp 3′ of the premature stop codon. We speculate that this gene encodes part of a virulence secretion system responsible for the transport of macromolecules. It is noteworthy that NTHi strain PittGG was the only isolate to express both this sequence and the above-described autotransporter sequence, as PittGG has been demonstrated to be the most virulent of the 10 NTHi strains in a blinded in vivo pathogenicity study (G. D. Ehrlich et al., unpublished).
We identified an Hib-like sequence (0162_D23) that displayed 96% nucleotide homology to the genetic island (HiGI7) of the Eagan strain (7). The clone included the 3′ end of holA, suggesting that this genetic island was inserted into the tRNAIle gene found between holA and glyS, as in the Eagan strain. A notable difference between the Eagan sequence and ours is that ours contains seven repeats of an 11-bp sequence (GGAATTATTTG) that occurs only once in the Hib strain. Although HiGI7 was previously thought to be specific to type b strains, it is also present in strain R2846.
The distribution and expression patterns for 53 of the novel DNA sequences among the 10 NTHi clinical strains were studied using PCR- and RT-PCR-based assays (Table 2). The restriction/modification and phage clones were not evaluated. In all cases, genomic DNA and RNA from Rd were used as a negative control. The GAPDH gene served as a positive control for both the DNA and RNA assays. All assays were performed in triplicate, and a positive call required at least two out of three results to be positive. One possible source of error with regard to the PCR-based analyses for distribution of the novel sequences is that different strains may possess different alleles of the same gene, thus preventing amplification with the primers designed from the sequenced clone. Thus, we performed Southern blotting for 7 of the 53 clones to confirm the PCR results. Overall the results of the Southern blots corroborated the PCR results; however, there were a few minor differences (online Supplementary Table 2).
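The two-out-of-three calling rule described above can be written as a short Python sketch. The function name `call_positive` and the replicate values are hypothetical and serve only to illustrate the rule; they are not taken from the study's data.

```python
def call_positive(replicates, min_positive=2):
    """Return True when at least `min_positive` replicate results are positive."""
    return sum(bool(r) for r in replicates) >= min_positive


# Hypothetical triplicate results for one sequence in one strain.
print(call_positive([True, True, False]))
print(call_positive([True, False, False]))
```

With this rule, a single discordant replicate cannot change a call in either direction.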
Three hundred fifty-two of 530 (66%) DNA assays were positive, with at least two strains harboring each of the 53 novel sequences, indicating that all of the unique clones corresponded to sequences actually present in the NTHi strains under study. It is possible that a larger survey would have identified genes that are present in only a single strain, as our survey covered only ~1% of the library. Nine (17%) of the novel sequences were present in all 10 strains; the PittAA strain had the most novel sequences (50 of 53), and PittJJ had the fewest (24 of 53). The average number of novel sequences across the 10 strains was 35, and the majority of the sequences (35 of 53) were found in the majority (6 or more) of the strains.
In approximately 50% (177 of 352) of the genic occurrences, RNA transcripts were also detected; RNA expression was observed for at least one of the strains harboring the gene for each of the novel sequences under study, indicating that all of these unique genes are transcriptionally active. In contrast to the DNA results, only 10 of 53 (19%) of the genes were expressed in a majority of the isolates under in vitro planktonic conditions, and no single novel transcript was found to be expressed in all of the strains. Only five clones gave positive RNA results for each genic occurrence. As with the DNA results, the PittAA strain demonstrated the highest rate of expression of these novel sequences (26 of 50); the strain expressing the fewest of its novel genes was PittCC (12 of 33). Expression was limited to one of the strains for 11 of the genes (online Supplementary Table 3).
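The summary percentages quoted for the DNA and RNA assays follow directly from the reported counts. The short Python check below recomputes them; the variable names are ours, and only the counts themselves come from the text.

```python
def percent(part, whole):
    """Express `part` as a percentage of `whole`."""
    return 100.0 * part / whole


n_sequences, n_strains = 53, 10          # novel sequences and strains assayed
total_assays = n_sequences * n_strains   # one DNA assay per sequence/strain pair

dna_positive = 352    # positive DNA assays (genic occurrences)
rna_positive = 177    # genic occurrences with a detected transcript
in_all_strains = 9    # novel sequences present in all 10 strains

dna_pct = percent(dna_positive, total_assays)
rna_pct = percent(rna_positive, dna_positive)
core_pct = percent(in_all_strains, n_sequences)
mean_per_strain = dna_positive / n_strains

print(f"DNA positive: {dna_pct:.1f}% of {total_assays} assays")
print(f"Transcribed:  {rna_pct:.1f}% of {dna_positive} occurrences")
print(f"In all strains: {core_pct:.1f}% of {n_sequences} sequences")
print(f"Mean novel sequences per strain: {mean_per_strain:.1f}")
```

Note that the RNA percentage is taken over the 352 genic occurrences, not over all 530 sequence/strain pairs, since a transcript can only be detected where the gene is present.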
Pairwise comparisons of the PCR results were performed for the 53 novel DNA sequences among all 10 strains to evaluate the overall level of genomic diversity (online Supplementary Table 4). The greatest degree of difference was observed between strains PittFF and PittJJ, which varied on 36 sequences. The most closely related strains using this metric were PittAA and PittDD, which differed at only eight loci; interestingly, in an independent test of pathogenicity, these two strains were also found to be very similar (Ehrlich et al., unpublished). Overall, PittJJ was found to have the greatest number of differences (240) compared to all other strains, while PittEE had the fewest differences (152). A similar comparison was made using the RT-PCR results (online Supplementary Table 5). These distribution studies demonstrate that the novel genes were not uniformly distributed among the 10 clinical strains, nor were they universally expressed under in vitro planktonic growth conditions. The latter observation is not surprising, as many of these genes are predicted to encode virulence factors that would not necessarily be constitutively expressed under laboratory conditions. The fact that all sequences were found to be expressed in at least one of the strains argues that the novel sequences are functional genes and do not represent junk DNA.
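One way to picture the pairwise comparison is as a count of the loci at which two presence/absence profiles disagree (a Hamming distance), summed per strain to give its total number of differences from all other strains. The Python sketch below assumes that representation; the strain labels `S1` to `S3` and their five-locus profiles are invented toy data, not results from this study.

```python
from itertools import combinations


def pairwise_differences(profiles):
    """Count the loci at which each pair of strains differs (Hamming distance)."""
    diffs = {}
    for a, b in combinations(sorted(profiles), 2):
        diffs[(a, b)] = sum(x != y for x, y in zip(profiles[a], profiles[b]))
    return diffs


def total_differences(profiles):
    """Sum, for each strain, its differences from every other strain."""
    totals = {name: 0 for name in profiles}
    for (a, b), d in pairwise_differences(profiles).items():
        totals[a] += d
        totals[b] += d
    return totals


# Hypothetical presence (1) / absence (0) calls for five loci in three strains.
toy_profiles = {
    "S1": [1, 1, 0, 1, 0],
    "S2": [1, 0, 0, 1, 1],
    "S3": [0, 0, 1, 1, 1],
}

print(pairwise_differences(toy_profiles))
print(total_differences(toy_profiles))
```

The same functions could be applied to RT-PCR calls to compare expression profiles instead of gene content.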
Our investigations were designed to test the hypothesis that among natural infecting populations of H. influenzae, there exists a substantial number of genes that are not represented in the genome of the laboratory strain Rd and that no two clinical strains (obtained from different patients) would be found to be identical in terms of their genic content. Our finding that ~10% of the clones examined were novel with respect to Rd is in concordance with those of Davis et al. (2001). The most parsimonious explanation for the genomic plasticity observed among the 10 NTHi strains is horizontal gene transfer. The consequence of frequent horizontal gene transfers is that the set of genes (not alleles) that any given strain or isolate contains is unique with respect to all other strains. Our observations support the concept that a bacterial species is defined by a minimal gene set possessed by all organisms within that species (29, 50) and that each strain within that species has a unique distribution of contingency genes derived from a population-based supragenome, the latter of which is much larger than the genome of any single organism (11, 21, 55).
We applied the MLST method, which is used to classify strains of a bacterial species based on the degree of genetic similarity among a set of housekeeping genes, to our 10 NTHi strains. The sequences of seven housekeeping loci (adk, atpG, frdB, fucK, mdh, pgi, recA) were obtained and used as described previously (44). All sequences were compared with the MLST database (http://www.mlst.net), and new numbers were assigned to unique alleles and STs (online Supplementary Table 6). PittBB was given new allele numbers for adk (allele 40) and frdB (allele 38); PittHH and PittJJ received the same new fucK allele (allele 31); and PittII received a new fucK allele (allele 32). New STs were assigned to PittHH and PittJJ (ST112), PittBB (ST113), and PittII (ST187). Strains PittFF and PittGG were both ST43. Unweighted-pair group method with arithmetic mean-based phylogenetic trees were generated for each of the individual housekeeping genes and for all of the loci concatenated together. The seven trees based on the individual genes showed that except for the two strain pairs (PittFF/PittGG and PittHH/PittJJ), there was no common set of relationships among the strains; thus, a completely unique tree was constructed for each gene examined (online Supplementary Fig. 1). An average distance phylogenetic tree was generated using the online Java program at http://www.mlst.net. This tree shows the genetic distances, based on the MLST data, among all 187 known H. influenzae sequence types, with the locations of our strains indicated (Fig. 2). Our MLST study suggests that the STs are not, in fact, useful as NTHi haplotypes, because the phylogenetic trees built using the seven genes and our 10 clinical strains each produced a completely different branching pattern. This heterogeneity among the trees is supportive of the concept that the primary means of H. influenzae strain evolution is horizontal gene transfer.
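For readers unfamiliar with the unweighted-pair group method with arithmetic mean (UPGMA), the following Python sketch shows the clustering step on a toy example. It is an illustration under stated assumptions and not the procedure or software used in this study: distances are taken here as the number of mismatched loci between seven-locus allelic profiles, whereas the trees described above were built from the gene sequences, and the strain labels `A` to `D` and their allele numbers are invented.

```python
from itertools import combinations


def profile_distance(p, q):
    """Number of loci at which two allelic profiles carry different alleles."""
    return sum(a != b for a, b in zip(p, q))


def upgma(profiles):
    """Cluster strains by UPGMA and return the tree as nested tuples."""
    names = sorted(profiles)
    trees = {i: name for i, name in enumerate(names)}   # cluster id -> subtree
    sizes = {i: 1 for i in trees}                       # cluster id -> leaf count
    dist = {
        (i, j): float(profile_distance(profiles[names[i]], profiles[names[j]]))
        for i, j in combinations(range(len(names)), 2)
    }
    next_id = len(names)
    while len(trees) > 1:
        # Merge the closest pair; ties are broken by the smaller id pair.
        i, j = min(dist, key=lambda k: (dist[k], k))
        for k in trees:
            if k in (i, j):
                continue
            d_ik = dist[(min(i, k), max(i, k))]
            d_jk = dist[(min(j, k), max(j, k))]
            # Size-weighted mean, so every leaf contributes equally.
            dist[(k, next_id)] = (sizes[i] * d_ik + sizes[j] * d_jk) / (sizes[i] + sizes[j])
        dist = {key: v for key, v in dist.items() if i not in key and j not in key}
        trees[next_id] = (trees.pop(i), trees.pop(j))
        sizes[next_id] = sizes.pop(i) + sizes.pop(j)
        next_id += 1
    return next(iter(trees.values()))


# Hypothetical allele numbers at seven loci for four hypothetical strains.
toy_alleles = {
    "A": (1, 1, 1, 1, 1, 1, 1),
    "B": (1, 1, 1, 1, 1, 1, 2),
    "C": (5, 5, 5, 5, 1, 1, 1),
    "D": (5, 5, 5, 5, 1, 3, 3),
}

print(upgma(toy_alleles))
```

Running the same function separately on each locus, rather than on the combined profile, would correspond to comparing single-gene trees for shared branching patterns.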
We did not observe a correlation between MLST typing and our novel gene distribution data. For example, strains PittGG and PittFF are both ST43 (suggesting they are closely related); however, we observed 12 differences in the distribution patterns of the novel genes and 18 differences in their expression of the novel genes in vitro. Moreover, the pathogenic character of these two strains is vastly different as assessed in a series of blinded OME animal model experiments in which a quantitative trait approach was used to characterize virulence. Therefore, MLST would appear to be of limited value in estimating the phylogenetic relationships among strains and for estimating strain virulence.
The intact 2.9-kb 23S rRNA molecule was observed only in Rd; all 10 of the NTHi clinical strains demonstrated fragmented 23S rRNA patterns (online Supplementary Table 7). Seven of our strains demonstrated the presence of two intervening sequences (IVS1 and IVS2) in both copies of their 23S rDNA. The other strains showed heterogeneous cleavage patterns for the two 23S genes; strains PittAA and PittII had one 23S rRNA gene that contained both IVS1 and IVS2 and one that contained only IVS2. Strain PittEE had one 23S rRNA gene with both IVS1 and IVS2 and one with IVS1 only. It is clear that this methodology has insufficient discriminatory power to accurately type the NTHi strains.
The distributed genome hypothesis states that there exists a population-based supragenome for pathogenic bacteria, which is made up of a set of contingency genes from which each strain has a unique distribution compared with all other component strains of the species (15, 55). The virulence corollary of the distributed genome hypothesis holds that the autocompetence and autotransformation mechanisms of these pathogenic bacteria have been selected for in vivo to provide these pathogens with a rapid means of generating diversity as a way to persist in the face of myriad host defense mechanisms. In other words, the contingency genes, through reassortment during chronic infectious processes, provide for an increased number of genetic characters that enable the population as a whole to adapt rapidly to environmental factors, such as those experienced in the host. The recent understanding that many chronic bacterial infections are caused by biofilms (for reviews, see references 11, 15, 16, and 71) and the fact that bacteria in biofilms have been demonstrated to exchange DNA at rates several orders of magnitude greater than planktonic bacteria (45) provide us with a new rubric for understanding chronic bacterial pathogenesis. This pathogenesis model also helps to explain the difficulty in establishing good animal models of chronic infection, as nearly all such studies start with a single clonal isolate, and we would predict that anagenesis does not produce sufficient genetic heterogeneity for persistence in the face of a concerted host response. The data presented in this paper support the concept that H. influenzae possesses a population-based supragenome and that no two strains have the same complement of genes. Moreover, it would appear that the H. influenzae supragenome is necessarily substantially larger than the genomes of individual bacteria, but without far more extensive surveys than the one conducted for this study, it will not be possible to accurately estimate the genome space for H. influenzae on a worldwide pathogen basis.
This work was supported by Allegheny General Hospital, Allegheny-Singer Research Institute, and by National Institute on Deafness and Other Communication Disorders grants DC 02148 (G.D.E.) and DC 04173 (G.D.E.).
Editor: J. N. Weiser