Identification of heterogeneous regions within the H. influenzae genome.
To scan the genome of the Hib Eagan strain for unique sequences, we first used a set of primers originally synthesized for application of the GAMBIT approach to genome-scale analysis of essential genes in H. influenzae ). These GAMBIT primers amplify overlapping 5- and 10-kb sections of the H. influenzae genome and were previously found to generate products of the appropriate sizes (with a success rate of >99%) when used in PCR with H. influenzae Rd as the template (3
The primers in the GAMBIT set are arranged such that they amplify either 5- or 10-kb sections of the genome, depending on which partners are used. To detect heterogeneity in the Eagan strain compared with Rd, we utilized pairs that yield 5-kb products in Rd to facilitate detection of size variations. By this approach, we were able to consistently detect differences as small as 500 bp.
The 5-kb PCRs were carried out in a high-throughput, microtiter plate-based format, and these segments covered approximately 91% of the genome relative to Rd. The gaps between these sections, comprising the remaining 9% of the genome, were scanned by using primers predicted to yield 10-kb PCR products in Rd KW20 and a separate set of primers initially designed to amplify each ORF of the Rd KW20 genome (hereafter “ORF primers,” as described in Materials and Methods). These reactions, together with the 5-kb PCRs, covered the entire genome.
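The size comparison that drives this scan can be illustrated with a minimal sketch. It assumes each reaction is reduced to an expected product size (from the Rd sequence) and an observed size read from a gel, with `None` standing for no discrete product. The function name, the category labels, and the example readings are hypothetical; only the 500-bp detection limit comes from the text.

```python
from typing import Optional

SIZE_TOLERANCE_BP = 500  # smallest size difference the scan detected consistently


def classify_reaction(expected_bp: int, observed_bp: Optional[int],
                      tolerance_bp: int = SIZE_TOLERANCE_BP) -> str:
    """Return 'no_product', 'size_variant', or 'same_size' for one PCR."""
    if observed_bp is None:
        return "no_product"
    if abs(observed_bp - expected_bp) >= tolerance_bp:
        return "size_variant"
    return "same_size"


# Hypothetical readings: (expected size in Rd, observed size in Eagan)
reactions = {
    "pair_A": (5000, 5000),
    "pair_B": (5000, 5700),
    "pair_C": (5000, None),
}
calls = {name: classify_reaction(exp, obs) for name, (exp, obs) in reactions.items()}
```

Reactions classified as `size_variant` or `no_product` in such a scheme would correspond to the segments that were followed up in the second stage of the scan.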
A summary of the results from the initial scan of the Hib Eagan genome is shown in Table . Of the 554 reactions covering the chromosome, 428 (77.0%) amplified a product in Hib that was the same size as predicted based on the Rd genome sequence, suggesting that these segments of DNA were likely to be identical or highly similar in both strains. We tested this hypothesis by using 10 primer pairs arbitrarily distributed around the genome that amplified products of identical sizes by using the two different templates. The resulting PCR products were digested with a restriction enzyme (or enzymes) predicted to cleave each product into multiple fragments of distinct sizes, and the patterns from equivalent PCRs were compared (Fig. ). The restriction patterns from equivalent PCRs appeared to be identical in 8 of 10 cases, and similar (differing by the acquisition or loss of one or two sites) in the remaining 2. Assuming that a single-nucleotide difference is sufficient for loss (or acquisition) of a restriction site, these data suggest a single-nucleotide variation rate of roughly 2% between the two strains of H. influenzae, which is consistent with sequencing data from analysis of regions conserved in both Eagan and Rd (N. H. Bergman and B. J. Akerley, data not shown). These data also indicate that segments of the Eagan genome that amplify as predicted based on the Rd genomic sequence are very similar to their Rd counterparts and are not likely to contain genetic elements absent in Rd.
Summary of PCR screen results for heterogeneous sequences
FIG. 1. Restriction digests of identically sized, corresponding PCR products from the Rd and Eagan genomes. PCRs that amplified products with apparently identical electrophoretic mobilities from either genomic template were purified and digested with an enzyme ...
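The logic of comparing restriction patterns can be sketched as follows: a single-nucleotide change that destroys a recognition site merges two fragments into one, which changes the banding pattern even though the overall product size is unchanged. The sequences below are toy strings. EcoRI (G^AATTC) is used only as a familiar example, since the text does not name the enzymes, and the function name is hypothetical. The search covers one strand only, which is sufficient for a palindromic site.

```python
from typing import List


def fragment_sizes(seq: str, site: str, cut_offset: int = 0) -> List[int]:
    """Fragment sizes (largest first) from a complete digest of a linear molecule."""
    seq, site = seq.upper(), site.upper()
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)  # position of the cut within the molecule
        start = seq.find(site, start + 1)
    bounds = [0] + cuts + [len(seq)]
    return sorted((b - a for a, b in zip(bounds, bounds[1:]) if b > a), reverse=True)


# Toy products of identical length; the second copy of the site differs by one base.
rd_like = "AAA" + "GAATTC" + "AAAAA" + "GAATTC" + "AA"
eagan_like = "AAA" + "GAATTC" + "AAAAA" + "GAATTT" + "AA"

rd_pattern = fragment_sizes(rd_like, "GAATTC", cut_offset=1)
eagan_pattern = fragment_sizes(eagan_like, "GAATTC", cut_offset=1)
```

In this toy case the two products have the same total length but different patterns, which is the kind of difference scored as the acquisition or loss of a site.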
The remaining 23% of the initial PCRs either amplified a product that differed in size by at least 500 bp from that of the reference strain or did not amplify a discrete product. We anticipated that the failure of a given reaction to amplify a product might be due to several factors, including large-scale DNA rearrangements (including relative insertions or deletions) and sequence changes on a smaller scale, such as single-nucleotide differences that result in weakened primer binding. With these possibilities in mind, we sought to resolve the ambiguities in these regions.
The ORF primer set was used for the second stage in our search for heterogeneous elements in the Hib genome. Because the ORF primers are distributed more densely throughout the genome, this set allowed us to map the size differences more closely and to eliminate segments that may not have amplified because of primer-binding problems. ORF primer pairs were used to fill in the gaps left by the GAMBIT primer pairs to generate overlapping PCR products representing the entire genome. This approach increased our coverage of the Eagan genome (relative to Rd) to >96%, detected essentially all of the previously reported genomic differences between these two strains, and mapped and confirmed seven previously unidentified DNA segments in the Eagan genome that are absent in Rd (Table ). We also verified one ~35-kb segment present in Rd but not in Eagan that corresponds to a cryptic Mu-like phage. However, our primary analysis was focused on segments unique to the virulent Eagan strain. These unique sequences possess characteristic features of “genetic islands.” Specifically, they contain features of pathogenicity islands, but they may play roles either in pathogenesis or in other aspects of microbial evolution. Following the recently set precedent (11), we propose that they be designated H. influenzae genetic islands 2 to 8 (HiGI2 to HiGI8) (Table ).
Summary of genetic islands within the Hib Eagan genome
A small island between genes HI1261 and HI1262.
The smallest genetic island identified in our PCR scanning, HiGI2, was found in the Eagan genome between the two genes HI1261 and HI1262 relative to Rd. This sequence is 600 bp long and is flanked by two direct repeats of the 10-base sequence 5′-TTGTGAGTTC. There are no ORFs greater than 24 amino acids (aa) in length, and BLAST and translated BLAST searches detected no homology between this segment or peptides potentially encoded by this segment and any other known sequences. The functional or evolutionary significance of this island is thus unknown, and it may be the remnants of a phage or other mobile genetic element.
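A statement such as "no ORFs greater than 24 aa" can be checked with a six-frame scan. The sketch below is illustrative only: it assumes an ORF runs from an ATG codon to the first in-frame stop codon, which may differ from the definition used in the study, and the function names and the test sequence are hypothetical.

```python
from typing import List

STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.upper().translate(COMPLEMENT)[::-1]


def orf_lengths_aa(seq: str) -> List[int]:
    """Lengths in aa (stop codon excluded) of ATG-initiated ORFs in all six frames."""
    lengths = []
    for strand in (seq.upper(), reverse_complement(seq)):
        for frame in range(3):
            codons = [strand[i:i + 3] for i in range(frame, len(strand) - 2, 3)]
            start = None
            for idx, codon in enumerate(codons):
                if start is None and codon == "ATG":
                    start = idx  # first ATG opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    lengths.append(idx - start)  # codons from ATG up to the stop
                    start = None
    return lengths


# Toy sequence: ATG, 30 alanine codons, then a stop codon.
toy = "ATG" + "GCT" * 30 + "TAA"
lengths = orf_lengths_aa(toy)
has_orf_over_24_aa = any(n > 24 for n in lengths)
```

Applied to a segment like HiGI2, a scan of this kind would be expected to report no ORF above the chosen length cutoff.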
HiGI3 is flanked by 7-bp DNA repeats and encodes a putative phosphomannose isomerase.
A 662-bp island (HiGI3) was located in the Eagan genome between the two ORFs designated HI1710 and HI1711 in Rd. This segment is flanked by direct repeats of the 7-base sequence 5′-GTAAGTA and contains an ORF that extends through the repeat sequence in the direction of HI1710 and comprises 554 bp unique to Hib Eagan and 73 bp conserved in both strains (Fig. ). This ORF was designated Eag0001 (ORFs identified in this study were all given the type b Eagan-specific identifier EagXXXX), and homology searches showed that this peptide sequence is highly similar to those of a family of phosphomannose isomerases found in a wide variety of bacteria. The closest match (69% identical, 80% similar) was to the protein encoded by the pmi locus of Pasteurella multocida. This homology extends through the entire length of the newly identified ORF Eag0001, although we note that the Eag0001 sequence is shorter than the other enzymes (208 aa versus 380 to 410 aa in other species). Even so, most of the characteristic “PMI-type I” motif (34) that is found in both eukaryotic and prokaryotic members of this family is also found in the Eag0001 sequence, raising the possibility that this protein might be functional despite its shorter length. Members of this family of enzymes in bacteria catalyze the interconversion of fructose-6-phosphate and mannose-6-phosphate, providing precursors for exopolysaccharide synthesis in various bacterial species (21). If the newly identified ORF encodes a functional protein, it may play a role in utilizing host-derived mannose as an energy source or in determining the unique structure of the type b outer surface.
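The direct repeats reported at the boundaries of HiGI2 and HiGI3 can be located with a simple string search. The sketch below uses the 7-base repeat given for HiGI3 (5′-GTAAGTA). The surrounding sequence is invented for illustration and the function names are hypothetical.

```python
from typing import List, Tuple


def find_all(seq: str, motif: str) -> List[int]:
    """Start positions of every (possibly overlapping) occurrence of motif."""
    hits, start = [], seq.find(motif)
    while start != -1:
        hits.append(start)
        start = seq.find(motif, start + 1)
    return hits


def segments_between_repeats(seq: str, repeat: str) -> List[Tuple[int, int]]:
    """(start, end) of each segment lying between consecutive copies of a direct repeat."""
    hits = find_all(seq.upper(), repeat.upper())
    return [(a + len(repeat), b) for a, b in zip(hits, hits[1:])]


REPEAT = "GTAAGTA"  # 7-base direct repeat reported for HiGI3
toy = "CCCC" + REPEAT + "ACCTTGGACC" + REPEAT + "GGGG"  # invented flanks and insert

segments = segments_between_repeats(toy, REPEAT)
island = toy[segments[0][0]:segments[0][1]]
```

A real analysis would also need to confirm that the two copies sit at the island boundaries defined by comparison with Rd, which this sketch does not attempt.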
FIG. 2. HiGI3 to HiGI6. The dashed lines specify the junctions between sequences found in both strains and those found only in Eagan. (A) HiGI3 is shown between the Eagan loci corresponding to the genes HI1710 and HI1711 of Rd. The Eagan ORF (Eag0001) is shown, ...
A genetic island comprising the 5′ end of a contiguous ORF in Eagan containing sequences similar to those of the HI1266, HI1267, and HI1268 genes of Rd.
The genes HI1266 and HI1268 in the Rd KW20 genome are hypothetical genes of 387 and 180 bp, respectively, and each has an unassigned function. HI1267 is an 84-bp ORF that is present in Rd but has been removed from the current genome annotation. The Eagan genome has an island (HiGI4) that interrupts a sequence similar to that of Rd HI1266 between nucleotides (nt) 40 and 41 of its predicted coding sequence. This Eagan ORF, designated Eag0002, extends through the HI1267 sequence and encodes an additional 203-aa N-terminal extension relative to Rd (Fig. ). In addition, Eagan contains a UGG codon in place of the UGA stop codon of Rd HI1267, so that a contiguous ORF runs through a 9-bp intergenic region and the downstream sequences corresponding to Rd HI1268. Therefore, Eag0002 totals 294 aa, including coding sequences similar to those of both HI1267 and HI1268 as well as a large N-terminal sequence unique to Eagan.
BLAST searches revealed a high degree of similarity of Eag0002 to the ATP-binding subunits of a variety of ABC transporters. The highest degree of similarity (43% identical, 60% similar) was to a putative ATP-binding protein of an ABC transporter from Escherichia coli O157:H7. Although the protein with the highest similarity is annotated as an ATP-binding protein, the Eag0002 protein sequence shares the characteristic conserved domain (periplasmic binding protein 2, pfam01497.5) common to the periplasmic binding proteins of ABC transport systems, and its size (293 aa) is within the range previously reported for this family of proteins (266 to 413 aa for currently known examples). These findings suggest the possibility that this protein is part of an ABC transport apparatus. Members of this family of proteins are often components of critical nutrient uptake pathways, such as iron acquisition (43), an important process for pathogenic bacteria growing in the host, where iron is typically sequestered. Furthermore, we note that the E. coli protein to which Eag0002 shows the highest similarity is found in the pathogenic O157:H7 strain of E. coli ), but not in the avirulent K-12 strain (7), which is also consistent with a potential virulence-associated role for this protein.
A genetic island containing three genes not found in Rd, two of which are similar to phage structural genes.
The HI1403 gene in the Rd genome is a hypothetical ORF of 183 aa. In the Eagan genome, however, the sequence corresponding to HI1403 contains an additional 1,047 bp (HiGI5), extending it by 80 aa relative to the Rd gene. HiGI5 also contains two additional ORFs of 169 and 80 aa, respectively (Fig. ). Although the Rd version of HI1403 has not yet been assigned a function, the extended version of the HI1403 gene, designated Eag0003, shares extensive homology with several phage tail fiber proteins. The closest matches were to probable tail fiber proteins from the Haemophilus phages HP1 (58% identical, 72% similar) and HP2 (47% identical, 63% similar) and to a putative tail fiber protein from Neisseria meningitidis (50% identical, 66% similar). This homology is relatively high, although these proteins are all longer (>650 aa) than the Eag0003 sequence (262 aa), and the extent to which these proteins can be truncated before loss of activity occurs is not known.
The second of the two downstream ORFs (designated Eag0004) also shows homology to putative phage tail fiber proteins, although to a slightly different set. This peptide sequence does not show homology to either of the Haemophilus phages, but is quite similar (38% identity, 61% similarity) to the same N. meningitidis protein with which the extended Eag0003 peptide shares homology. The Eag0004 peptide sequence is relatively short, however (169 aa), so although the homology is reasonably high and extends over the entire Eag0004 sequence, it does not seem to be a functional homologue. One possibility is that a frameshift in a longer ORF may have occurred to generate the two ORFs. In support of this idea, we note that the extended Eag0003 peptide sequence shows homology to the N-terminal portion of the Neisseria tail fiber protein, while the Eag0004 peptide sequence shows homology to the C terminus.
The third and smallest ORF within this island, designated Eag0005, is only 80 aa long. This peptide sequence shows a high degree of homology (54% identity, 66% similarity) over its entire length to conserved hypothetical proteins from both fully sequenced N. meningitidis genomes. Weaker homology (32% identity, 53% similarity) was found between the Eag0005 protein sequence and several phage tail fiber assembly proteins from Shigella flexneri, although these proteins are both 167 aa in length and the 80-aa Eag0005 may lack essential functional domains relative to these potential homologues. We did not detect additional phage-related genes in this region in either Rd or Eagan, suggesting that this segment may represent an inactive, evolutionary remnant of an integrated phage. Alternatively, these proteins may have evolved functions in Eagan that are independent of a colinear phage genome.
A probable LPS-synthesis cluster inserted between HI0548 and HI0549.
A 1,929-bp island (HiGI6) was identified at the site in Eagan corresponding to the Rd genome between HI0548 (infA) and HI0549 (ksgA). A small segment (72 bp) is absent with respect to the Rd sequence, and a larger sequence is found in Eagan. This island was found to be identical to the sequence reported for the lic2B/orf3 cluster isolated from H. influenzae type b, strain RM7004 (20). The lic2B gene encodes a galactosyl transferase functionally similar to the paralogous lic2A gene. However, the 5′ coding region of lic2A contains repeats of the sequence 5′-CAAT that promote “on-off” phase variation, whereas lic2B lacks the repeats (19). The orf3 gene is less well understood and was postulated to be involved in lipopolysaccharide (LPS) biosynthesis. Interestingly, a recent study found this locus in a number of clinical nontypeable (NTHi) isolates, and found that the presence of the lic2B locus is linked to virulence (33
A genetic island inserted into the tRNAIle gene between HI0923 and HI0924.
Eagan contains 2,725 bp of additional DNA compared with Rd in the region between HI0923 (holA) and HI0924 (glyS), and this island has been designated “HiGI7.” In Rd, the region between these two genes is roughly 230 nt long and contains a tRNAIle gene. In Eagan, the 5 nt upstream of the tRNA gene and the first 5 nt of the tRNA gene (5′-AATGGTCCCC) are duplicated. The 2,725-bp HiGI7 sequence is located between these two 10-bp repeats (Fig. ). The nucleotide sequence of this region in Eagan matches very closely (97% identity) to the sequence of a DNA fragment previously discovered in the Brazilian purpuric fever-associated clone BPF3031 (48). This sequence was detected in a subtractive cDNA hybridization screen designed to isolate genes expressed specifically when the bacterium is in contact with host cells (48). Unlike the infections typically caused by NTHi strains, Brazilian purpuric fever is a fulminant systemic infection in which bacteria multiply in the bloodstream, leading to fever, purpura, vascular collapse, and death. Discovery of this locus in Eagan, a strain that readily infects the bloodstream, further supports the idea that this locus could contribute to the atypical ability of the nonencapsulated BPF3031 strain to cause vascular infections.
FIG. 3. HiGI7 is found at a position corresponding to the Rd HI0923 and HI0924 intergenic region. The island between HI0923 and -4 is expanded at the bottom, with the flanking direct repeats indicated. ORFs encoding predicted proteins longer than 25 aa are shown, ...
Although this sequence showed a high degree of identity to the Brazilian purpuric fever sequence, it showed no similarity at the nucleotide level to any other sequences in the GenBank, PDB, SwissProt, PIR, and PRF databases as of the submission date of this article. We detected 11 small ORFs >25 aa long within the region. Protein BLAST searches with the majority of these ORFs failed to detect any potential homologues. However, two adjacent ORFs near the middle of this island were found to be very similar (52% identical and 75% similar, and 60% identical and 96% similar, respectively) to the stbDE cluster identified on plasmid R485 from Morganella morganii ). These genes constitute a segregational stability system that has also been found in the chromosome of Vibrio cholerae ) and in the enterotoxigenic plasmid P307 (39) and that appears to be similar to previously reported toxin-antitoxin stability cassettes (17). The homology between the M. morganii protein sequences and the sequences located in the type b genome is quite strong and extends over the full lengths of the two ORFs. As previously reported, both stbD and stbE are short proteins (83 and 93 aa, respectively), and the lengths of the two ORFs located in the type b genome match well with those of the other reported sequences.
A third ORF (Eag0008) from within this island showed weaker homology (36% identical, 57% similar) to a small portion of several CP4-57 phage integrases. An example of this type of integrase was also found in HiGI1 (11), although in that case, a full-length protein (391 aa) was found, while Eag0008 is only 53 aa long and is presumably not functional. Even so, this finding seems to suggest acquisition of this locus by an integrating mobile genetic element, such as a phage.
A genetic island in Eagan located between sequences corresponding to HI1192 and HI1193 in Rd.
The features of the genes within HiGI8 suggest a potential role for this locus in H. influenzae-host interactions. This island, totaling 2,799 bp, is found in Eagan at a location corresponding to the position of a stem-loop structure in Rd composed of a tandem, inverted pair of Haemophilus uptake sequences (5′-AAGTGCGGT). This arrangement of uptake sequences is quite common in Haemophilus and other bacteria, and these stem-loop structures are often found at the 3′ ends of genes and may function as transcriptional terminators in addition to recognition sites for the DNA uptake pathway (42). In the Eagan genome, a segment of 34 bp (containing one of the paired uptake sequences) is absent relative to Rd, and the 2,799-bp island is found at this location (Fig. ). This arrangement is reminiscent of the tna cluster in Eagan located between sequences corresponding to the HI706 and HI707 genes of Rd, because that locus is also located between paired, inverted uptake sequences (26
FIG. 4. A genetic island between HI1192 and HI1193. The top sequence represents the Rd genome in the region of HI1192 and -3. The 3′ coding sequences of genes flanking the island are shown within arrows that indicate the direction of transcription for ...
The island itself contains three ORFs (designated Eag0009, Eag0010, and Eag0011) and four additional uptake sequences. The two largest ORFs (Eag0010 and Eag0011) appear to be part of the same transcriptional unit, while the third ORF (Eag0009) is located on the opposite strand, 126 bp upstream of Eag0010. All three of these protein sequences are previously uncharacterized, although they all have enough homology with known proteins to predict possible functions.
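Inverted pairs of uptake sequences, which can form the stem-loop structures described for this region, can be located by searching for the uptake sequence and its reverse complement in close succession. The sketch below uses the 9-base sequence given in the text (5′-AAGTGCGGT). The maximum spacer length, the toy sequence, and the function names are assumptions made for illustration.

```python
from typing import List, Tuple

USS = "AAGTGCGGT"  # Haemophilus uptake sequence as given in the text
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_all(seq: str, motif: str) -> List[int]:
    """Start positions of every occurrence of motif in seq."""
    hits, start = [], seq.find(motif)
    while start != -1:
        hits.append(start)
        start = seq.find(motif, start + 1)
    return hits


def inverted_pairs(seq: str, motif: str = USS, max_spacer: int = 20) -> List[Tuple[int, int]]:
    """(forward_start, inverted_start) pairs separated by at most max_spacer bases."""
    seq = seq.upper()
    forward = find_all(seq, motif)
    inverted = find_all(seq, reverse_complement(motif))
    pairs = []
    for f in forward:
        for r in inverted:
            spacer = r - (f + len(motif))  # bases between the two copies
            if 0 <= spacer <= max_spacer:
                pairs.append((f, r))
    return pairs


# Toy locus: one uptake sequence, a 4-base loop, then the inverted copy.
toy = "TTTT" + USS + "GATC" + reverse_complement(USS) + "TTTT"
pairs = inverted_pairs(toy)
```

Counting the entries returned by `find_all` for the motif and for its reverse complement would give the total number of uptake sequences in a segment, regardless of pairing.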
Eag0009 encodes a 130-aa protein containing a 55-aa canonical helix-turn-helix motif (HTH-XRE) at the N terminus. This protein is similar (47% identical, 71% similar) to that coded for by an H. influenzae Rd hypothetical gene (HI1458m), as well as to two conserved hypothetical proteins from N. meningitidis (42% identical, 65% similar). None of these genes have been assigned a function, and the H. influenzae Rd gene would require a frameshift with respect to the currently annotated sequence for expression of the full-length protein. Even so, all of these examples share an N-terminal HTH-XRE motif, and the homology they have with each other throughout this motif is quite high (Fig. ). For instance, Eag0009 is 63% identical and 88% similar to the HI1458m sequence throughout the helix-turn-helix motif and is most similar to the HTH domains of the N. meningitidis proteins as well. The putative C-terminal domains of each protein are approximately the same length (between 55 and 60 aa) and exhibit more variability between family members. These data suggest that Eag0009 may encode a transcriptional regulator related to the HI1458m and N. meningitidis proteins based on the putative DNA-binding domains, although variability in the C-terminal domain could signify different activation properties.
Alignment of Eag0009 with homologous proteins from H. influenzae and N. meningitidis. Shading indicates homology between Eag0009 and the other three proteins, and the HTH-XRE motif is boxed (residues 6 to 61).
Eag0010 encodes a 245-aa protein containing a domain (Lipoprotein_5) common to a large family of transferrin-binding proteins. This sequence showed relatively strong homology to members of this family, and the closest matches were to the H. influenzae transferrin-binding protein TfbA (38% identical, 58% similar) and to proteins from Actinobacillus pleuropneumoniae, Moraxella catarrhalis, and N. meningitidis. This homology only extends through the C-terminal 145 aa of Eag0010, whereas the N-terminal 100 residues are apparently unique to this protein.
Most of the members of the transferrin-binding protein (TbpB) family consist of ~700 aa arranged into two similar domains of ~350 aa each (27). Each domain contains six characteristic motifs, and it has been postulated that the two domains may cooperate in binding the bilobed transferrin substrate (36). It is not clear, however, that a complete set of two domains is necessary for transferrin-binding function, because several examples of smaller transferrin-binding proteins have been found and described. Similar to Eag0010, Exl2 and Exl3 from N. meningitidis are shorter (269 and 455 aa, respectively) and contain only one full set of the six motifs (22
Alignment of Eag0010 with Exl3 of N. meningitidis indicates that the homology extends throughout the six previously defined motifs (Fig. ). Furthermore, the homology within these motifs is stronger than the homology in the intermotif stretches (33% identical and 59% similar inside versus 13% identical and 31% similar outside of previously defined motifs). Although Eag0010 is slightly shorter than the previously reported members of this class (245 aa versus 269 aa for Exl2 and 455 aa for Exl3), the N termini of these proteins are quite variable in length and content, and Eag0010 may be a new member of the class of truncated transferrin-binding proteins.
FIG. 6. Alignment of Eag0010 with partial Exl3 protein sequence from N. meningitidis. Shading indicates homology between the two proteins, and the six motifs defined in reference 22 are boxed.
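The comparison of homology inside and outside the motifs amounts to computing percent identity over two sets of alignment columns. The sketch below shows the calculation on an invented pairwise alignment. The sequences, the motif coordinates, and the choice to skip gapped columns are all assumptions. Percent similarity would additionally require a substitution matrix and is not computed here.

```python
from typing import Iterable


def percent_identity(aln_a: str, aln_b: str, columns: Iterable[int]) -> float:
    """Percent identical residues over the given alignment columns, ignoring gapped columns."""
    compared = [(aln_a[i], aln_b[i]) for i in columns
                if aln_a[i] != "-" and aln_b[i] != "-"]
    if not compared:
        return 0.0
    matches = sum(a == b for a, b in compared)
    return 100.0 * matches / len(compared)


# Invented 10-column alignment; columns 0 to 4 stand in for a boxed motif.
aln_a = "MKTAYLAG-K"
aln_b = "MKSAYIAGQK"
motif_columns = range(0, 5)
other_columns = range(5, len(aln_a))

inside = percent_identity(aln_a, aln_b, motif_columns)
outside = percent_identity(aln_a, aln_b, other_columns)
```

With real data, the motif columns would be taken from the boxed regions of the alignment and pooled across all six motifs before the two percentages are compared.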
A third gene, Eag0011, is directly downstream of Eag0010 and encodes a 422-aa protein that does not appear to contain any recognizable motifs. BLAST searches showed that the peptide sequence has some homology to a range of outer membrane proteins from P. multocida, H. influenzae, and N. meningitidis. Although the closest match (31% identical, 53% similar) was to a hypothetical protein from P. multocida, PHI/PSI-BLAST searching suggested that Eag0011 belongs to a large class of outer membrane proteins, and that the nearest relative to the Eag0011 sequence might be OmpU from N. meningitidis, a putative outer membrane protein of unknown function.
Although the levels of homology to characterized proteins are not sufficient to directly infer functions for these proteins, the genetic organization of HiGI8 makes it interesting to speculate that Eag0011 could partner with Eag0010 to constitute a transferrin receptor, which is typically formed by the association of an integral membrane protein (Tbp1) with a transferrin-binding protein (Tbp2) (12). Multiple gene clusters encoding related forms of such proteins are often found in the genomes of bacterial pathogens, possibly for reasons of substrate specificity or antigenic variation. Moreover, since regulatory proteins are often located near the genes they control, these proteins represent likely regulatory targets of Eag0009.
The relatively low percent G+C content of HiGI8 (Table ) suggests that this locus may have been acquired by horizontal transfer to H. influenzae. Of note, the gene encoding Exl3 is known to be part of an exchangeable genetic island that contributes to the virulence and genetic diversity of N. meningitidis ), and Eag0010 could potentially play a similar role in H. influenzae