Bacteria exhibit extensive genetic heterogeneity within species. In many cases, these differences account for virulence properties unique to specific strains. Several such loci have been discovered in the genome of the type b serotype of Haemophilus influenzae, a human pathogen able to cause meningitis, pneumonia, and septicemia. Here we report application of a PCR-based scanning procedure to compare the genome of a virulent type b (Hib) strain with that of the laboratory-passaged Rd KW20 strain for which a complete genome sequence is available. We have identified seven DNA segments or H. influenzae genetic islands (HiGIs) present in the type b genome and absent from the Rd genome. These segments vary in size and content and show signs of horizontal gene transfer in that their percent G+C content differs from that of the rest of the H. influenzae genome, they contain genes similar to those found on phages or other mobile elements, or they are flanked by DNA repeats. Several of these loci represent potential pathogenicity islands, because they contain genes likely to mediate interactions with the host. These newly identified genetic islands provide areas of investigation into both the evolution and pathogenesis of H. influenzae. In addition, the genome scanning approach developed to identify these islands provides a rapid means to compare the genomes of phenotypically diverse bacterial strains once the genome sequence of one representative strain has been determined.
Haemophilus influenzae is a gram-negative bacterium that is able to cause a variety of diseases in humans. Encapsulated strains, and most frequently those expressing the type b capsule, cause invasive infections, including meningitis, pneumonia, and septicemia. Nontypeable H. influenzae (NTHi) strains are more commonly found as asymptomatic residents of the nasopharynx, but are also a primary cause of otitis media, conjunctivitis, and sinusitis (44). Occasionally, NTHi strains are also isolated in the course of invasive infections (29), and a subset of NTHi strains (of the aegyptius biogroup) has been implicated as the causative agent of Brazilian purpuric fever, a fulminant systemic infection in which bacteria multiply in the bloodstream, leading to fever, purpura, vascular collapse, and death (8, 9). Although the development of a vaccine specific to the type b capsule has dramatically decreased the incidence of H. influenzae type b (Hib)-associated diseases in developed countries (25), it remains prevalent in the developing world (31). Other serotypes and NTHi strains remain a significant cause of disease in the United States and worldwide (1, 13, 18, 30, 47).
Perhaps the best-studied pathogenic strain of H. influenzae is the Hib Eagan strain, an encapsulated strain shown to be virulent in an infant rat model (41). Conversely, H. influenzae strain Rd KW20 represents an important resource for H. influenzae genomics, because its complete genome sequence has been determined (14). Rd KW20 is an avirulent strain derived by in vitro passage and isolation of a capsule-deficient variant of H. influenzae type d (51). In addition to the loss of capsular genes, the Rd KW20 strain lacks several other virulence-associated loci present in the Eagan strain, including fimbrial genes, the tryptophanase gene cluster, and a haemocin gene (14, 26, 28, 49).
The possibility that other pathogenicity-associated loci are present in the type b genome has not been fully explored, and an estimate based on pulsed-field gel electrophoresis has suggested that the type b genome is approximately 270 kb larger than that of the Rd strain (10). Comparisons within other bacterial genera have typically revealed a large number of strain-specific genetic islands (7, 32, 38, 46). Overall, these findings support the utility of genomic comparison of virulent and avirulent strains of H. influenzae for identifying new virulence-associated loci and contributing to our understanding of H. influenzae biology.
Here, we report the use of a position-based approach to find genetic islands within the Hib Eagan genome. We scanned over 96% of the genome by a high-throughput PCR procedure and identified seven previously uncharacterized DNA segments that are present in Eagan and absent in the Rd KW20 genome. These segments vary in size, and the majority of them are located between direct repeats, within a tRNA, or within predicted stem-loop structures. In addition, several of these loci contain potential virulence genes, providing new areas of inquiry into the pathogenesis of H. influenzae.
H. influenzae strains type b Eagan (5, 35) and Rd KW20 (51) were grown in brain heart infusion (BHI) broth or on agar plates supplemented with 5% Levinthal's base or both hemin (10 μg/ml) and NAD (10 μg/ml).
Two genome-scale primer sets were used in this study. The set designated “GAMBIT primers” has been described previously (2, 3). A custom Perl script was kindly provided by Dana Boyd to design the second set, termed “ORF (open reading frame) primers.” ORF primers were designed in pairs to amplify each Rd KW20 ORF from the putative translational initiation codon to the codon immediately preceding the termination codon. The sequences of these sets of primers are available upon request.
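The ORF primer design rule described above can be illustrated with a minimal Python sketch. This is not the Perl script used in the study: the function names, the fixed primer length of 20 nt, and the absence of any melting-temperature or specificity checks are all assumptions made for illustration only.

```python
# Hypothetical sketch of the ORF primer rule: amplify from the start codon
# through the codon immediately preceding the stop codon.
# primer_len=20 is an assumed value; the study does not state primer lengths.

_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def orf_primer_pair(orf: str, primer_len: int = 20) -> tuple:
    """Return (forward, reverse) primers for an ORF given start to stop codon."""
    orf = orf.upper()
    if len(orf) % 3 != 0:
        raise ValueError("ORF length is not a multiple of 3")
    if orf[-3:] not in _STOP_CODONS:
        raise ValueError("ORF does not end in a stop codon")
    body = orf[:-3]  # drop the termination codon
    if len(body) < primer_len:
        raise ValueError("ORF is shorter than the requested primer length")
    forward = body[:primer_len]                        # begins at the start codon
    reverse = reverse_complement(body[-primer_len:])   # ends before the stop codon
    return forward, reverse


if __name__ == "__main__":
    # Toy ORF, not a real H. influenzae sequence.
    print(orf_primer_pair("ATGAAACCCGGGTTTTAA", primer_len=6))
```

A real design tool would also balance melting temperatures and screen for secondary priming sites, which this sketch deliberately omits.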
PCRs contained primers, H. influenzae genomic DNA, and a mixture of Taq and DeepVent polymerases at a ratio of 10:1 based on the unit definitions of the manufacturers Invitrogen (Carlsbad, Calif.) and NEB (Beverly, Mass.), respectively. Reactions were done in buffer containing 10 mM KCl, 10 mM (NH4)2SO4, 20 mM Tris-HCl (pH 8.8 at 25°C), 2 mM MgSO4, 0.1% Triton X-100, and 0.25 mM each of the four deoxynucleoside triphosphates (dNTPs). Reaction cycles with GAMBIT primers were as described previously (3). The following reaction cycles were used with ORF primers: an initial 2-min denaturation step at 95°C was followed by 30 cycles of amplification (95°C for 30 s, 52°C for 30 s, and 72°C for 4 min). The 72°C extension step was extended by an additional 10 s with each cycle. The reaction amounts were either 20 μl (when done in microtiter plates) or 40 μl (when done individually) in an Eppendorf Mastercycler thermal cycler (Eppendorf Scientific, Westbury, N.Y.). Amplified products were analyzed on 0.8% agarose gels, and sizes of DNA fragments were determined by comparison to an Invitrogen 1-kb Plus DNA ladder. Purification of PCR products, when necessary, was done with Edge gel filtration cartridges (Edge Biosystems, Gaithersburg, Md.). Restriction analysis was carried out with enzymes from NEB according to the manufacturer's directions.
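As a worked illustration of the ORF-primer cycling program, the per-cycle extension times can be tabulated as follows. The sketch assumes that the 10-s increment is first applied in the second cycle (so the first cycle uses the stated 4 min); the text does not specify this detail, and the function name is hypothetical.

```python
# Illustrative arithmetic only: 30 cycles, 4-min (240-s) extension,
# lengthened by 10 s per cycle. Assumes the first cycle is not incremented.

def extension_times(cycles: int = 30, base_s: int = 240, increment_s: int = 10) -> list:
    """Return the 72 degree C extension time (in seconds) for each cycle."""
    return [base_s + increment_s * i for i in range(cycles)]


if __name__ == "__main__":
    times = extension_times()
    print("first cycle:", times[0], "s")
    print("last cycle:", times[-1], "s")
    print("total extension time:", sum(times), "s")
```

Under these assumptions, the final cycle has an extension step of 530 s and the extension steps sum to 11,550 s (about 3.2 h) over the 30 cycles.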
PCR products were sequenced at the DNA Sequencing Core Facility, University of Michigan, Ann Arbor. Sequences were assembled with DNAStar (DNAStar, Inc., Madison, Wis.), and various programs within this suite were used for analysis of ORFs, codon usage, G+C content, and restriction sites. These sequences as well as predicted peptide sequences were used in searches of the GenBank, PDB, SwissProt, PIR, and PRF databases by using the BLAST 2.0 and PSI-BLAST programs (4). Putative genes were identified as the longest predicted protein with similarity to a previously identified known or putative protein. Putative gene assignments were not made in regions encoding diverse putative proteins lacking similarity to previously assigned proteins. All BLASTN, BLASTP, and TBLASTX searches were performed in accordance with the default parameters from the National Center for Biotechnology Information/National Library of Medicine BLAST website (http://www.ncbi.nlm.nih.gov/BLAST/).
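The percent G+C calculation referred to above was performed with DNAStar in this study; for readers without that suite, the quantity itself can be sketched in a few lines of Python. The function name and the decision to ignore ambiguous bases are assumptions of this sketch.

```python
# Minimal percent G+C calculation; ambiguous bases (e.g., N) are ignored.
# This is an illustration of the quantity, not the DNAStar implementation.

def gc_percent(seq: str) -> float:
    """Return the percent G+C content of a DNA sequence."""
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        raise ValueError("sequence contains no A/C/G/T bases")
    gc = sum(1 for base in counted if base in "GC")
    return 100.0 * gc / len(counted)


if __name__ == "__main__":
    # The 10-base direct repeat reported flanking HiGI2.
    print(gc_percent("TTGTGAGTTC"))
```

Comparing this value for a candidate island against the value for the surrounding genome is one simple way to flag sequence that may have been acquired horizontally.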
Sequences determined in this study have been submitted to the GenBank database under the following accession numbers: AF542611, AF542612, AF542613, AF542614, AF542615, and AF542616.
To scan the genome of the Hib Eagan strain for unique sequences, we first used a set of primers originally synthesized for application of the GAMBIT approach to genome-scale analysis of essential genes in H. influenzae Rd (2, 3). These GAMBIT primers amplify overlapping 5- and 10-kb sections of the H. influenzae genome and were previously found to generate products of the appropriate sizes (with a success rate of >99%) when used in PCR with H. influenzae Rd as the template (3).
The primers in the GAMBIT set are arranged such that they amplify either 5- or 10-kb sections of the genome, depending on which partners are used. To detect heterogeneity in the Eagan strain compared with Rd, we utilized pairs that yield 5-kb products in Rd to facilitate detection of size variations. By this approach, we were able to consistently detect differences as small as 500 bp.
The 5-kb PCRs were carried out in a high-throughput, microtiter plate-based format, and these segments covered approximately 91% of the genome relative to Rd. The gaps between these sections, comprising the remaining 9% of the genome, were scanned by using primers predicted to yield 10-kb PCR products in Rd KW20 and a separate set of primers initially designed to amplify each ORF of the Rd KW20 genome (hereafter “ORF primers,” as described in Materials and Methods). These reactions, together with the 5-kb PCRs, covered the entire genome.
A summary of the results from the initial scan of the Hib Eagan genome is shown in Table 1. Of the 554 reactions covering the chromosome, 428 (77.0%) amplified a product in Hib that was the same size as predicted based on the Rd genome sequence, suggesting that these segments of DNA were likely to be identical or highly similar in both strains. We tested this hypothesis by using 10 primer pairs arbitrarily distributed around the genome that amplified products of identical sizes by using the two different templates. The resulting PCR products were digested with a restriction enzyme (or enzymes) predicted to cleave each product into multiple fragments of distinct sizes, and the patterns from equivalent PCRs were compared (Fig. 1). The restriction patterns from equivalent PCRs appeared to be identical in 8 of 10 cases, and similar (differing by the acquisition or loss of one or two sites) in the remaining 2. Assuming that a single-nucleotide difference is sufficient for loss (or acquisition) of a restriction site, these data suggest a single-nucleotide variation rate of roughly 2% between the two strains of H. influenzae, which is consistent with sequencing data from analysis of regions conserved in both Eagan and Rd (N. H. Bergman and B. J. Akerley, data not shown). These data also indicate that segments of the Eagan genome that amplify as predicted based on the Rd genomic sequence are very similar to their Rd counterparts and are not likely to contain genetic elements absent in Rd.
The remaining 23% of the initial PCRs either amplified a product that differed in size by at least 500 bp from that of the reference strain or did not amplify a discrete product. We anticipated that the failure of a given reaction to amplify a product might be due to several factors, including large-scale DNA rearrangements (including relative insertions or deletions) and sequence changes on a smaller scale, such as single-nucleotide differences that result in weakened primer binding. With these possibilities in mind, we sought to resolve the ambiguities in these regions.
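The three outcomes of the initial scan (product of the predicted size, product differing by at least 500 bp, or no discrete product) can be expressed as a simple classification rule. The Python sketch below is illustrative only: the category labels and function name are hypothetical, and observed sizes are assumed to have been estimated from agarose gels.

```python
# Hypothetical classification of scan reactions using the 500-bp detection
# limit reported for the 5-kb PCRs. Labels are illustrative, not the authors'.
from typing import Optional


def classify_reaction(expected_bp: int,
                      observed_bp: Optional[int],
                      threshold_bp: int = 500) -> str:
    """Classify one PCR by comparing observed and Rd-predicted product sizes."""
    if observed_bp is None:            # no discrete product on the gel
        return "no_product"
    if abs(observed_bp - expected_bp) >= threshold_bp:
        return "size_variant"          # candidate insertion or deletion
    return "as_predicted"


if __name__ == "__main__":
    # Toy values, not data from the study.
    for observed in (5000, 5600, None):
        print(observed, classify_reaction(5000, observed))
```

Reactions in the last two categories are those carried forward to the second, ORF primer-based stage of mapping.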
The ORF primer set was used for the second stage in our search for heterogeneous elements in the Hib genome. Because the ORF primers are distributed more densely throughout the genome, this set allowed us to more closely map the size differences, and eliminate segments that may not have amplified because of primer-binding problems. ORF primer pairs were used to fill in the gaps left by the GAMBIT primer pairs to generate overlapping PCR products representing the entire genome. This approach increased our coverage of the Eagan genome (relative to Rd) to >96%, detected essentially all of the previously reported genomic differences between these two strains, and mapped and confirmed seven previously unidentified DNA segments in the Eagan genome that are absent in Rd (Table 1). We also verified one ~35-kb segment present in Rd but not in Eagan that corresponds to a cryptic Mu-like phage. However, our primary analysis was focused on segments unique to the virulent Eagan strain. These unique sequences possess characteristic features of “genetic islands.” Specifically, they contain features of pathogenicity islands, but they may play roles in either pathogenesis or in other aspects of microbial evolution. Following the recently set precedent (11), we propose that they be designated H. influenzae genetic islands 2 to 8 (HiGI2 to HiGI8) (Table 2).
The smallest genetic island identified in our PCR scanning, HiGI2, was found in the Eagan genome between the two genes HI1261 and HI1262 relative to Rd. This sequence is 600 bp long and is flanked by two direct repeats of the 10-base sequence 5′-TTGTGAGTTC. There are no ORFs greater than 24 amino acids (aa) in length, and BLAST and translated BLAST searches detected no homology between this segment or peptides potentially encoded by this segment and any other known sequences. The functional or evolutionary significance of this island is thus unknown, and it may be the remnants of a phage or other mobile genetic element.
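Flanking direct repeats of the kind described for HiGI2 can be checked computationally once the island boundaries are known. The Python sketch below assumes a simple "repeat, island, repeat" arrangement in which one copy ends exactly at the island start and the other begins exactly at the island end; the function name, the 0-based half-open coordinates, and the 20-nt search limit are all assumptions of this illustration.

```python
# Illustrative search for a direct repeat immediately flanking an island.
# Coordinates are 0-based, half-open: the island is seq[start:end].

def flanking_direct_repeat(seq: str, start: int, end: int, max_len: int = 20) -> str:
    """Return the longest exact repeat bordering both sides of the island.

    Compares the k bases just upstream of the island with the k bases just
    downstream, for k = 1..max_len, and returns the longest match ("" if none).
    """
    seq = seq.upper()
    best = ""
    for k in range(1, max_len + 1):
        if start - k < 0 or end + k > len(seq):
            break
        left = seq[start - k:start]
        right = seq[end:end + k]
        if left == right:
            best = left
    return best


if __name__ == "__main__":
    repeat = "TTGTGAGTTC"  # 10-base repeat reported for HiGI2
    # Toy construct; the island and flanking bases are placeholders.
    toy = "ACGA" + repeat + "GGGCCCAAAT" + repeat + "CATG"
    print(flanking_direct_repeat(toy, 14, 24))
```

Only exact matches are considered here; degenerate repeats would require an alignment-based comparison instead.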
A 662-bp island (HiGI3) was located in the Eagan genome between the two ORFs designated HI1710 and HI1711 in Rd. This segment is flanked by direct repeats of the 7-base sequence 5′-GTAAGTA and contains an ORF that extends through the repeat sequence in the direction of HI1710 and comprises 554 bp unique to Hib Eagan and 73 bp conserved in both strains (Fig. 2A). This ORF was designated Eag0001 (ORFs identified in this study were all given the type b Eagan-specific identifier EagXXXX), and homology searches showed that this peptide sequence is highly similar to those of a family of phosphomannose isomerases found in a wide variety of bacteria. The closest match (69% identical, 80% similar) was to the protein encoded by the pmi locus of Pasteurella multocida. This homology extends through the entire length of the newly identified ORF Eag0001, although we note that the Eag0001 sequence is shorter than the other enzymes (208 aa versus 380 to 410 aa in other species). Even so, most of the characteristic “PMI-type I” motif (34) that is found in both eukaryotic and prokaryotic members of this family is also found in the Eag0001 sequence, raising the possibility that this protein might be functional despite its shorter length. Members of this family of enzymes in bacteria catalyze the interconversion of fructose-6-phosphate and mannose-6-phosphate, providing precursors for exopolysaccharide synthesis in various bacterial species (21, 37). If the newly identified ORF encodes a functional protein, it may play a role in utilizing host-derived mannose as an energy source or in determining the unique structure of the type b outer surface.
The genes HI1266 and HI1268 in the Rd KW20 genome are hypothetical genes of 387 and 180 bp, respectively, and each has an unassigned function. HI1267 is an 84-bp ORF present in Rd, but removed from the current genome annotation. The Eagan genome has an island (HiGI4) that interrupts a sequence similar to that of Rd HI1266 between nucleotides (nt) 40 and 41 of its predicted coding sequence. This Eagan ORF, designated Eag0002, extends through the HI1267 sequence and encodes an additional 203-aa N-terminal extension relative to Rd (Fig. 2B). In addition, Eagan contains a UGG codon in place of the UGA stop codon of Rd HI1267, so that a contiguous ORF runs through a 9-bp intergenic region and the downstream sequences corresponding to Rd HI1268. Therefore, Eag0002 totals 294 aa, including coding sequences similar to those of both HI1267 and HI1268 as well as a large N-terminal sequence unique to Eagan.
BLAST searches revealed a high degree of similarity of Eag0002 to the ATP-binding subunits of a variety of ABC transporters. The highest degree of similarity (43% identical, 60% similar) was to a putative ATP-binding protein of an ABC transporter from Escherichia coli O157:H7. Although the protein with the highest similarity is annotated as an ATP-binding protein, the Eag0002 protein sequence shares the characteristic conserved domain (periplasmic binding protein 2, pfam01497.5) common to the periplasmic binding proteins of ABC transport systems, and its size (293 aa) is within the range previously reported for this family of proteins (266 to 413 aa for currently known examples). These findings suggest the possibility that this protein is part of an ABC transport apparatus. Members of this family of proteins are often components of critical nutrient uptake pathways, such as iron acquisition (43), an important process for pathogenic bacteria growing in the host, where iron is typically sequestered. Furthermore, we note that the E. coli protein to which Eag0002 shows the highest similarity is found in the pathogenic O157:H7 strain of E. coli (32), but not in the avirulent K-12 strain (7), which is also consistent with a potential virulence-associated role for this protein.
The HI1403 gene in the Rd genome is a hypothetical ORF of 183 aa. In the Eagan genome, however, the sequence corresponding to HI1403 contains an additional 1,047 bp (HiGI5), extending it by 80 aa relative to the Rd gene. HiGI5 also contains two additional ORFs of 169 and 80 aa, respectively (Fig. 2C). Although the Rd version of HI1403 has not yet been assigned a function, the extended version of the HI1403 gene, designated Eag0003, shares extensive homology with several phage tail fiber proteins. The closest matches were to probable tail fiber proteins from the Haemophilus phages HP1 (58% identical, 72% similar) and HP2 (47% identical, 63% similar) and to a putative tail fiber protein from Neisseria meningitidis (50% identical, 66% similar). This homology is relatively high, although these proteins are all longer (>650 aa) than the Eag0003 sequence (262 aa), and the extent to which these proteins can be truncated before loss of activity occurs is not known.
The second of the two downstream ORFs (designated Eag0004) also shows homology to putative phage tail fiber proteins, although to a slightly different set. This peptide sequence does not show homology to either of the Haemophilus phages, but is quite similar (38% identity, 61% similarity) to the same N. meningitidis protein with which the extended Eag0003 peptide shares homology. The Eag0004 peptide sequence is relatively short, however (169 aa), so although the homology is reasonably high and extends over the entire Eag0004 sequence, it does not seem to be a functional homologue. One possibility is that a frameshift in a longer ORF may have occurred to generate the two ORFs. In support of this idea, we note that the extended Eag0003 peptide sequence shows homology to the N-terminal portion of the Neisseria tail fiber protein, while the Eag0004 peptide sequence shows homology to the C terminus.
The third and smallest ORF within this island, designated Eag0005, is only 80 aa long. This peptide sequence shows a high degree of homology (54% identity, 66% similarity) over its entire length to conserved hypothetical proteins from both fully sequenced N. meningitidis genomes. Weaker homology (32% identity, 53% similarity) was found between the Eag0005 protein sequence and several phage tail fiber assembly proteins from S. flexneri, although these proteins are both 167 aa in length and the 80-aa Eag0005 may lack essential functional domains relative to these potential homologues. We did not detect additional phage-related genes in this region in either Rd or Eagan, suggesting that this segment may represent an inactive, evolutionary remnant of an integrated phage. Alternatively, these proteins may have evolved functions in Eagan that are independent of a colinear phage genome.
A 1,929-bp island (HiGI6) was identified at the site in Eagan corresponding to the Rd genome between HI0548 (infA) and HI0549 (ksgA). A small segment (72 bp) is absent with respect to the Rd sequence, and a larger sequence is found in Eagan. This island was found to be identical to the sequence reported for the lic2B/orf3 cluster isolated from H. influenzae type b, strain RM7004 (20). The lic2B gene encodes a galactosyl transferase functionally similar to the paralogous lic2A gene. However, the 5′ coding region of lic2A contains repeats of the sequence 5′-CAAT that promote “on-off” phase variation, whereas lic2B lacks the repeats (19, 20). The orf3 gene is less well understood and was postulated to be involved in lipopolysaccharide (LPS) biosynthesis. Interestingly, a recent study found this locus in a number of clinical nontypeable (NTHi) isolates, and found that the presence of the lic2B locus is linked to virulence (33).
Eagan contains 2,725 bp of additional DNA compared with Rd in the region between HI0923 (holA) and HI0924 (glyS), and this island has been designated “HiGI7.” In Rd, the region between these two genes is roughly 230 nt long and contains a tRNAIle gene. In Eagan, the 5 nt upstream of the tRNA gene and the first 5 nt of the tRNA gene (5′-AATGGTCCCC) are duplicated. The 2,725-bp HiGI7 sequence is located between these two 10-bp repeats (Fig. 3). The nucleotide sequence of this region in Eagan matches very closely (97% identity) to the sequence of a DNA fragment previously discovered in the Brazilian purpuric fever-associated clone BPF3031 (48). This sequence was detected in a subtractive cDNA hybridization screen designed to isolate genes expressed specifically when the bacterium is in contact with host cells (48). Unlike the infections typically caused by NTHi strains, Brazilian purpuric fever is a fulminant systemic infection in which bacteria multiply in the bloodstream, leading to fever, purpura, vascular collapse, and death. Discovery of this locus in Eagan, a strain that readily infects the bloodstream, further supports the idea that this locus could contribute to the atypical ability of the nonencapsulated BPF3031 strain to cause vascular infections.
Although this sequence showed a high degree of identity to the Brazilian purpuric fever sequence, it showed no similarity at the nucleotide level to any other sequences in the GenBank, PDB, SwissProt, PIR, and PRF databases as of the submission date of this article. We detected 11 small ORFs >25 aa long within the region. Protein BLAST searches with the majority of these ORFs failed to detect any potential homologues. However, two adjacent ORFs near the middle of this island were found to be very similar (52% identical and 75% similar and 60% identical and 96% similar, respectively) to the stbDE cluster identified on plasmid R485 from Morganella morganii (17). These genes constitute a segregational stability system that has also been found in the chromosome of Vibrio cholerae (6) and in the enterotoxigenic plasmid P307 (39, 40) and that appears to be similar to previously reported toxin-antitoxin stability cassettes (17). The homology between the M. morganii protein sequences and the sequences located in the type b genome is quite strong and extends over the full lengths of the two ORFs. As previously reported, both stbD and stbE are short proteins (83 and 93 aa, respectively), and the lengths of the two ORFs located in the type b genome match well with those of the other reported sequences.
A third ORF (Eag0008) from within this island showed weaker homology (36% identical, 57% similar) to a small portion of several CP4-57 phage integrases. An example of this type of integrase was also found in HiGI1 (11), although in that case, a full-length protein (391 aa) was found, while Eag0008 is only 53 aa long and is presumably not functional. Even so, this finding seems to suggest acquisition of this locus by an integrating mobile genetic element, such as a phage.
The features of the genes within HiGI8 suggest a potential role for this locus in H. influenzae-host interactions. This island, totaling 2,799 bp, is located in Eagan in a location corresponding to the position of a stem-loop structure in Rd composed of a tandem, inverted pair of Haemophilus uptake sequences (5′-AAGTGCGGT). This arrangement of uptake sequences is quite common in Haemophilus and other bacteria, and these stem-loop structures are often found at the 3′ ends of genes and may function as transcriptional terminators in addition to recognition sites for the DNA uptake pathway (42, 45). In the Eagan genome, a segment of 34 bp (containing one of the paired uptake sequences) is absent relative to Rd, and the 2,799-bp island is found at this location (Fig. 4). This arrangement is reminiscent of the tna cluster in Eagan located between sequences corresponding to the HI706 and HI707 genes of Rd, because that locus is also located between paired, inverted uptake sequences (26).
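Inverted pairs of uptake sequences such as the one described above can be located by searching for the 9-base core and its reverse complement in close proximity. The Python sketch below is a simplified illustration: it reports only pairs in which the forward copy precedes the reverse-complement copy, and the 20-nt maximum spacer and all function names are assumptions rather than parameters from this study.

```python
# Illustrative search for inverted pairs of the Haemophilus uptake sequence
# core (5'-AAGTGCGGT). Only exact matches are considered.

USS = "AAGTGCGGT"
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_all(seq: str, motif: str) -> list:
    """Return 0-based start positions of every (possibly overlapping) match."""
    hits = []
    i = seq.find(motif)
    while i != -1:
        hits.append(i)
        i = seq.find(motif, i + 1)
    return hits


def inverted_uptake_pairs(seq: str, max_spacer: int = 20) -> list:
    """Return (forward_start, reverse_start) pairs that could form a stem-loop.

    max_spacer is an assumed limit on the loop length between the two copies.
    """
    seq = seq.upper()
    forward_hits = find_all(seq, USS)
    reverse_hits = find_all(seq, reverse_complement(USS))
    pairs = []
    for f in forward_hits:
        for r in reverse_hits:
            spacer = r - (f + len(USS))
            if 0 <= spacer <= max_spacer:
                pairs.append((f, r))
    return pairs


if __name__ == "__main__":
    # Toy sequence, not taken from the Rd or Eagan genome.
    toy = "CC" + USS + "TTTT" + reverse_complement(USS) + "GG"
    print(inverted_uptake_pairs(toy))
```

A fuller analysis would also consider the opposite orientation and evaluate the predicted stability of each stem-loop.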
The island itself contains three ORFs (designated Eag0009, Eag0010, and Eag0011) and four additional uptake sequences. The two largest ORFs (Eag0010 and Eag0011) appear to be part of the same transcriptional unit, while the third ORF (Eag0009) is located on the opposite strand, 126 bp upstream of Eag0010. All three of these protein sequences are previously uncharacterized, although they all have enough homology with known proteins to predict possible functions.
Eag0009 encodes a 130-aa protein containing a 55-aa canonical helix-turn-helix motif (HTH-XRE) at the N terminus. This protein is similar (47% identical, 71% similar) to that coded for by an H. influenzae Rd hypothetical gene (HI1458m), as well as to two conserved hypothetical proteins from N. meningitidis (42% identical, 65% similar). None of these genes have been assigned a function, and the H. influenzae Rd gene would require a frameshift with respect to the currently annotated sequence for expression of the full-length protein. Even so, all of these examples share an N-terminal HTH-XRE motif, and the homology they have with each other throughout this motif is quite high (Fig. 5). For instance, Eag0009 is 63% identical and 88% similar to the HI1458m sequence throughout the helix-turn-helix motif and is most similar to the HTH domains of the N. meningitidis proteins as well. The putative C-terminal domains of each protein are approximately the same length (between 55 and 60 aa) and exhibit more variability between family members. These data suggest that Eag0009 may encode a transcriptional regulator related to the HI1458m and N. meningitidis proteins based on the putative DNA-binding domains, although variability in the C-terminal domain could signify different activation properties.
Eag0010 encodes a 245-aa protein containing a domain (Lipoprotein_5) common to a large family of transferrin-binding proteins. This sequence showed relatively strong homology to members of this family, and the closest matches were to the H. influenzae transferrin-binding protein TfbA (38% identical, 58% similar) and to proteins from Actinobacillus pleuropneumoniae, Moraxella catarrhalis, and N. meningitidis. This homology only extends through the C-terminal 145 aa of Eag0010, whereas the N-terminal 100 residues are apparently unique to this protein.
Most of the members of the transferrin-binding protein (TbpB) family consist of ~700 aa arranged into two similar domains of ~350 aa each (27). Each domain contains six characteristic motifs, and it has been postulated that the two domains may cooperate in binding the bilobed transferrin substrate (36). It is not clear, however, that a complete set of two domains is necessary for transferrin-binding function, because several examples of smaller transferrin-binding proteins have been found and described. Similar to Eag0010, Exl2 and Exl3 from N. meningitidis are shorter (269 and 455 aa, respectively) and contain only one full set of the six motifs (22).
Alignment of Eag0010 with Exl3 of N. meningitidis indicates that the homology extends throughout the six previously defined motifs (Fig. 6). Furthermore, the homology within these motifs is stronger than the homology in the intermotif stretches (33% identical and 59% similar inside versus 13% identical and 31% similar outside of previously defined motifs). Although Eag0010 is slightly shorter than the previously reported members of this class (245 aa versus 269 aa for Exl2 and 455 aa for Exl3), the N termini of these proteins are quite variable in length and content, and Eag0010 may be a new member of the class of truncated transferrin-binding proteins.
A third gene, Eag0011, is directly downstream of Eag0010 and encodes a 422-aa protein that does not appear to contain any recognizable motifs. BLAST searches showed that the peptide sequence has some homology to a range of outer membrane proteins from P. multocida, H. influenzae, and N. meningitidis. Although the closest match (31% identical, 53% similar) was to a hypothetical protein from P. multocida, PHI/PSI-BLAST searching suggested that Eag0011 belongs to a large class of outer membrane proteins, and that the nearest relative to the Eag0011 sequence might be OmpU from N. meningitidis, a putative outer membrane protein of unknown function.
Although the levels of homology to characterized proteins are not sufficient to directly infer functions for these proteins, the genetic organization of HiGI8 makes it interesting to speculate that Eag0011 could partner with Eag0010 to constitute a transferrin receptor, which is typically formed by the association of an integral membrane protein (Tbp1) with a transferrin-binding protein (Tbp2) (12). Multiple gene clusters encoding related forms of such proteins are often found in the genomes of bacterial pathogens, possibly for reasons of substrate specificity or antigenic variation. Moreover, since regulatory proteins are often located near the genes they control, these proteins represent likely regulatory targets of Eag0009.
The relatively low percent G+C content of HiGI8 (Table 2) suggests that this locus may have been acquired by horizontal transfer to H. influenzae. Of note, the gene encoding Exl3 is known to be part of an exchangeable genetic island that contributes to the virulence and genetic diversity of N. meningitidis (22, 24), and Eag0010 could potentially play a similar role in H. influenzae.
We have identified a set of DNA segments that are present in the genome of the virulent type b Eagan strain of H. influenzae and not in the genome of the laboratory strain Rd KW20. These elements range in size from 600 to 3,000 bp and contain a variety of ORFs that have homology to known proteins involved in transport, LPS biosynthesis, iron uptake, and other processes. These islands also highlight the likely role of horizontal gene transfer in H. influenzae evolution, because each of these segments possesses characteristics of mobile DNA elements. They either differ in percent G+C content from the rest of the H. influenzae genome; are located within a stem-loop structure, such as a tRNA gene or an inverted pair of uptake signal sequences; have their closest homologs within mobile DNA elements of other species; or are between direct repeat sequences, suggesting transfer by a phage or other mobile element. Several HiGIs contain genes likely to be useful for survival in the host. In addition, we have begun to examine the distribution of these DNA segments throughout a range of H. influenzae strains. We have found that at least two (HiGI7 and HiGI8) appear to be specific to type b strains (N. H. Bergman and B. J. Akerley, unpublished data). It will be of interest to determine whether their presence in the highly virulent Hib Eagan strain is a coincidence or whether these loci directly contribute to its pathogenic potential.
In this study, we report a PCR-based scan for previously uncharacterized genetic islands in the Hib Eagan genome. This scan covered over 96% of the genome based on the Rd KW20 genome sequence. The remaining 3.9% is contained in 12 locations that have been preliminarily mapped, but will require additional strategies for full characterization (Table 3). Several of these regions correspond to loci in the Rd genome containing genes likely to be heterogeneous among different strains, such as restriction-modification systems. Three gaps include one or more tRNA genes each, and two other gaps contain two of the six known rRNA operons within the Rd genome. The structure of tRNA and rRNA sequences might be expected to make these regions resistant to PCR. However, these regions are of particular interest for further characterization, because they are typical insertion points for genetic islands in diverse bacteria (15, 16). We expect that this map will facilitate characterization of additional genetic islands contributing to the estimated 270 kb of genome size difference between Eagan and Rd (10).
In addition to identifying a new set of genetic elements specific to a virulent strain of H. influenzae, we have shown that this rapid, PCR-scanning procedure is an efficient means of comparing genomes of different strains of the same species. The method also appears to have been relatively comprehensive in its detection of genetic islands. We detected and confirmed the location of every previously described genetic island specific to the type b genome, including the capsular gene cluster (23), the fimbrial gene cluster (14, 50), HiGI1 (11), and the tryptophanase cluster (26; data not shown). These findings suggest that we have detected the majority of the remaining genetic islands in the Eagan genome, at least at the level of identifying their positions relative to the Rd genome.
This procedure can be adapted to any species for which a genomic sequence is known and a set of primers spanning the genome has been synthesized. With increasingly efficient sequencing projects and availability of genome-scale oligonucleotide primer sets for many bacteria, this procedure is already applicable to many species. We expect that a PCR-based screen for genetic islands would work best with strains that are closely related, because increasing evolutionary distance would increase the frequency at which PCR primer-binding sites are disrupted. Even so, preliminary evidence indicates that this method works well with strains that are more divergent than the two compared in this study. A scan of roughly 400 kb of the H. influenzae genome was applied to two additional strains, NTHi 1947 and BPF3031. This experiment yielded a PCR success rate of slightly below 80% even in the more distant Brazilian purpuric fever and NTHi strains and revealed DNA segments unique to each strain (J. Quinn, R. Tyler, and B. J. Akerley, unpublished data). These findings show that this strategy provides a rapid assessment of genetic complexity within a species, even among distantly related strains.
We thank Jillian Quinn for technical assistance in performing high-throughput PCR, and we thank members of the Akerley laboratory for valuable discussion and comments.
This work was supported in part by grants from the NIH (AI49437), Philip Morris, Inc., and the American Heart Association (B.J.A.) and a postdoctoral research grant from the Pharmaceutical Research and Manufacturers of America Foundation (N.H.B.).
Editor: D. L. Burns