Simple sequence repeats (SSRs) of DNA are subject to high rates of mutation and are important mediators of adaptation in Haemophilus influenzae. Previous studies of the Rd KW20 genome identified the primacy of tetranucleotide SSRs in mediating phase variation (the rapid reversible switching of gene expression) of surface exposed structures such as lipopolysaccharide. The recent sequencing of the genomes of multiple strains of H. influenzae allowed the comparison of the SSRs (repeat units of one to nine nucleotides in length) in detail across four complete H. influenzae genomes and then comparison with a further 12 genomes when they became available. The SSR loci were broadly classified into three groups: (1) those that did not vary; (2) those for which some variation between strains was observed but this could not be linked to variation of gene expression; and (3) those that both varied and were located in regions consistent with mediating phase variable gene expression. Comparative analysis of 988 SSR associated loci confirmed that tetranucleotide repeats were the major mediators of phase variation and extended the repertoire of known tetranucleotide SSR loci by identifying ten previously uncharacterised tetranucleotide SSR loci with the potential to mediate phase variation which were unequally distributed across the H. influenzae pan-genome. Further, analysis of non-tetranucleotide SSRs in the 16 strains revealed a number of mononucleotide, dinucleotide, pentanucleotide, heptanucleotide, and octanucleotide SSRs whose features were consistent with a role in mediating phase variation. This study substantiates previous findings as to the important role that tetranucleotide SSRs play in H. influenzae biology. Two Brazilian isolates showed the most variation in their complement of SSRs, suggesting the possibility of geographic and phenotypic influences on SSR distribution.
Haemophilus influenzae (Hi), a common commensal bacterium of the upper respiratory tract of humans, is an important cause of diseases that include otitis media, pneumonia, meningitis, and septicaemia. The genome sequence of Hi strain Rd KW20, the first completed for a free-living organism, revealed a high prevalence of simple sequence repeats (SSRs) (Fleischmann et al., 1995; Hood et al., 1996b). SSRs are usually defined as direct, perfect DNA repeats consisting of repeat units (the smallest repeating DNA motif of the SSR) of between one and nine nucleotides in length. In many organisms, taking into account the nucleotide sequence composition of their respective genomes, SSRs are found less frequently than predicted (Mrázek et al., 2007). SSRs are hypermutable (e.g. tetranucleotide SSRs lose and gain units at a rate of 1 × 10⁻⁴ per generation (De Bolle et al., 2000) compared with a basal mutation rate of approximately 1 × 10⁻⁹) and, therefore, it has been suggested that their decreased prevalence reflects natural selection because the higher rates of mutation of these loci would be more often detrimental to fitness than beneficial. However, in some prokaryotes, predominantly host-adapted organisms, some SSRs are found in greater numbers than would be expected by chance (Mrázek et al., 2007). Analysis of SSRs in the Hi strain Rd KW20 genome revealed that long tracts of tetranucleotides were over-represented (Hood et al., 1996b). A striking feature of these tetranucleotide SSRs is their frequent association with genes whose functions are associated with microbial-host interactions relevant to commensal and virulence behaviour (Hood et al., 1996b).
SSRs can be located in promoter regions or within open reading frames and changes in their length can result in the random, high frequency, reversible loss, gain or modulation of gene expression (phase variation). Since these regions of localised hypermutation, often termed ‘contingency loci’, can each independently result in altered gene expression, a repertoire of phenotypic variants is generated (Moxon et al., 2006). Through selection of these variants, the adaptation of the bacterial population to changes in the host environment is facilitated. It has been suggested that this strategy has particular survival value when bacterial populations are subjected to periodic selection during transmission between genetically distinct hosts (Wolf et al., 2005).
The advent of the genomic sequencing of multiple strains of the same species has revealed that the genomic sequence of a particular strain may not reflect the diversity and variety of the entire species. The term ‘pan-genome’ has been used to describe the superset of genes of a species (Tettelin et al., 2005). The characterisation of a pan-genome describes the core (genes contained in all genomes of a species) and dispensable genes (those genes absent from one or more strains or unique to each strain) of a species. We suggest that the concept of a pan-genome should also include explicit recognition of differences in gene sequence, organisation and variation that may better describe the adaptive and evolutionary potential of the species (Caporale, 2006). In this study, we have sought to identify the potential repertoire of variation mediated by SSRs in the currently available Hi pan-genome.
Prior to this study, our understanding of SSRs in Hi has been predominantly based on analysis of the strain Rd KW20 genome sequence. Whilst selective studies of other Hi strains have provided some evidence to suggest variation in the number, location and nature of the SSRs compared to that seen in the Rd KW20 genome (Fox et al., 2005; van Belkum et al., 1997), the recent availability of a number of completely sequenced Hi genomes has provided us with the opportunity for a much more extensive analysis of SSRs in Hi.
We describe in detail 223 SSRs identified in the four complete genome sequences of strains Rd KW20, 86-028NP, R2846 and R2866 plus 765 SSRs identified in the complete or partial genome sequences of a further 12 Hi strains. Previous reports of SSRs in Hi have been predominantly of tetranucleotide repeats. From these 16 genomes we describe 199 tetranucleotide SSRs in 28 different loci including 10 which have not previously been described. However, we have also identified a number of mononucleotide, dinucleotide, pentanucleotide, heptanucleotide, and octanucleotide SSRs with a putative role in phase variable gene regulation. A preponderance of the novel SSRs identified occur in only two strains, F3031 and F3043 of the Hi biogroup aegyptius, suggesting that the distribution of SSRs across the Hi pan-genome may be linked with geographic and phenotypic profiles.
The four Hi genome sequences that were available at the commencement of this study formed the basis of the Hi four genome study. Details of these genomes are given in Table 1. A list of SSRs with repeat unit lengths of between one and nine nucleotides, and a number of repeat unit iterations above an empirically determined threshold value (see below), was compiled for each of these genomes using a Perl script that we developed and called HiSSRfinder. The results from HiSSRfinder were used to generate an annotated EMBL file for each of the genomes, which then allowed manual analysis and curation of the SSRs identified in each genome using the Artemis and ACT genome viewing, annotation and comparison programs (Rutherford et al., 2000). Each SSR was manually evaluated with regard to its position relative to open reading frames (ORFs), whether an equivalent SSR was present or not in the other three strains and whether there was any variation in the SSR between strains (see Supplementary Table 1).
The threshold value, i.e. the minimum number of repeat units that must be present in an uninterrupted tandem arrangement within a genome for that sequence to be counted as an SSR and included in further analysis, was determined for each repeat unit length from a comparison of the number of SSRs of different lengths and the frequency of polymorphisms between the four genomes (see Section 3 and Table 2). The thresholds determined in this way, given as repeat unit length followed by the threshold number of repeat units, were as follows: 1, >8; 2, >4; 3, >3; 4, >2; 5, >2; 6, >2; 7, >2; 8, >2; and 9, >2.
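The threshold rule described above can be illustrated with a short Python sketch. This is not the HiSSRfinder Perl script, whose internals are not given here; the names MIN_UNITS, is_primitive and find_ssrs are hypothetical, and the sketch assumes that a threshold such as ">8" means "at least nine uninterrupted units" and that only the bases A, C, G and T are considered.

```python
import re

# Minimum number of tandem units per repeat-unit length, read from the
# thresholds in the text (">8" is taken to mean at least 9 units, etc.).
MIN_UNITS = {1: 9, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3, 7: 3, 8: 3, 9: 3}


def is_primitive(unit):
    """Return True if the unit is not itself a repeat of a shorter motif."""
    n = len(unit)
    for k in range(1, n):
        if n % k == 0 and unit[:k] * (n // k) == unit:
            return False
    return True


def find_ssrs(seq, min_units=MIN_UNITS):
    """Yield (start, unit, n_units) for perfect tandem repeats in seq.

    start is a 0-based offset. Units that are repeats of a shorter motif
    (e.g. 'AGTCAGTC') are skipped, because the same tract is reported
    under the shorter unit length if it meets that length's threshold.
    """
    seq = seq.upper()
    for unit_len, threshold in sorted(min_units.items()):
        # Group 1 is the whole tract, group 2 is the repeat unit.
        pattern = re.compile(
            r"(([ACGT]{%d})\2{%d,})" % (unit_len, threshold - 1)
        )
        for match in pattern.finditer(seq):
            unit = match.group(2)
            if is_primitive(unit):
                yield match.start(), unit, len(match.group(1)) // unit_len


if __name__ == "__main__":
    example = "GG" + "AGTC" * 14 + "TT"
    for start, unit, n_units in find_ssrs(example):
        print(start, unit, n_units)
```

Because the regular expression scans from left to right, a tract may be reported in a rotated frame (for example as CAGT rather than AGTC) when the preceding base happens to extend the repeat; a production tool would need to settle on a convention for this.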
A database, named SSR_Hi_4G, was constructed to contain the nucleotide sequence of each of the tetranucleotide SSRs identified in the four genome study, together with 500 bp of upstream and downstream flanking sequence. This database was assembled using the formatdb program (SSR_Hi_4G is available at http://users.ox.ac.uk/~oxmicro/ssrblast.html; the formatdb program is available at http://www.ncbi.nlm.nih.gov/blast/download.shtml).
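As an illustration of how such database entries might be prepared, the following Python sketch cuts an SSR and its flanking sequence out of a genome string and formats it as a FASTA record. The function name flanked_record and the record naming are hypothetical. The sketch assumes 500 bp on each side of the tract and, for simplicity, treats the genome as linear, so flanks are truncated at the sequence ends rather than wrapped around a circular chromosome.

```python
def flanked_record(genome, start, tract_len, name, flank=500):
    """Return a FASTA record for an SSR plus its flanking sequence.

    genome    : genome sequence as a string
    start     : 0-based offset of the first base of the SSR tract
    tract_len : length of the SSR tract in nucleotides
    name      : identifier written on the FASTA header line
    flank     : number of flanking bases kept on each side (assumed 500)
    """
    # Clamp to the sequence ends; a circular genome is not wrapped here.
    left = max(0, start - flank)
    right = min(len(genome), start + tract_len + flank)
    return ">%s\n%s\n" % (name, genome[left:right])


if __name__ == "__main__":
    toy_genome = "A" * 600 + "AGTC" * 3 + "C" * 600
    print(flanked_record(toy_genome, 600, 12, "example_AGTC_locus"))
```

A file of such records is the kind of input from which a nucleotide BLAST database can be built.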
A second collection of Hi genomes (herein termed the further 12 genome study) was then examined using the information from the initial four genome survey to guide analysis. Details of these additional genomes are given in Table 1. The SSRs in these genomes were identified using the HiSSRfinder script as described above. In order to determine which of the tetranucleotide SSRs identified in the 12 further genome sequences were equivalent to the tetranucleotide SSRs previously identified in the four genome study, each was compared to the SSR_Hi_4G database using the BLASTN program. Data and boxplot analyses of the SSR data were performed using the R statistical package (http://www.r-project.org/) and Microsoft Excel. Transmembrane helices were predicted using the TMHMM web server v2.0 (http://www.cbs.dtu.dk/services/TMHMM; Moxon et al., 2006; Sonnhammer et al., 1998).
Previous studies on Hi have described the SSRs present within the genome of strain Rd KW20 (Hood et al., 1996b). Our aim was to extend the analysis of SSRs by comprehensively investigating the repertoire present in the four complete Hi genome sequences that were available for different strains of Hi at the commencement of this study (Four Genome Analysis; see Table 1). In this study, SSRs are defined as tandem repeats of a repeat unit that consists of between one and nine nucleotides. In order to attain maximum sensitivity for the detection of SSRs the threshold values (see Section 2) were set as low as practically possible. Our rationale for adopting threshold values is described in Table 2, which shows the number of SSRs identified in the genome of strain Rd KW20 at the threshold values adopted and also the number of SSRs that would have been included in the subsequent analysis if the threshold value had been set one unit higher or lower for each repeat unit length. It can be seen from Table 2 that, for all but the tetranucleotide SSRs, increasing the threshold value by one substantially decreased the number of SSRs detected and resulted in a total of only 19 SSRs being identified in the Rd KW20 genome. In contrast, decreasing the threshold by one resulted in a large number of SSRs being identified which would be impractical for manual analysis (3924 in strain Rd KW20). At the adopted threshold values, 60 SSRs were identified in strain Rd KW20. The thresholds used in this study for all repeat unit lengths included at least all of the statistically unexpected SSRs determined by hidden Markov model analysis of the Hi genomes (see Table 2; Paul Swift, Oxford, personal communication). This further substantiates the threshold values chosen as being permissive for having a high degree of sensitivity in identifying SSRs with potential roles in mediating phase variation. 
Additionally, if SSRs were found to be above threshold in at least one genome the corresponding regions in the other genomes were also characterised.
A total of 223 SSRs were identified in the four genome sequences when the threshold values described in Table 2 were applied and these 223 SSRs are summarised in Table 3. Comparison of the SSRs across the four Hi genomes for each of the repeat unit lengths, reveals that their numbers are not substantially different between the strains. Also, the total number of repeats found within any one strain is not substantially different from the others (total number of SSRs ranged between 53 and 60), despite the differences in the origin, associated disease and date of isolation of the four strains (see Table 1).
SSRs have previously been associated with hypermutation, as loss or gain of repeat units occurs at high frequency due to replication slippage (Moxon et al., 2006). Loss or gain of repeat units from an SSR located within an ORF may result in a frameshift mutation if the length of the repeat unit is not a multiple of three. The position of each of the 223 SSRs identified in this study was manually curated and the proportion of SSRs that were located within ORFs was recorded (see Table 3). Hi has a coding density of approximately 88% of the genome sequence and, of the repeats examined, only the trinucleotide (83%), tetranucleotide (100%) and hexanucleotide (87%) SSRs occur within ORFs at approximately this frequency, whilst SSRs with repeat unit lengths of one, two, five and seven nucleotides were all found to be located within ORFs with a frequency of less than 88%. SSRs with longer repeat unit lengths were not included in this analysis due to their low frequency in the genomes.
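The frameshift argument is simple modular arithmetic, which the following Python sketch makes explicit. The function name causes_frameshift is hypothetical, and the check applies only to tracts lying within an ORF.

```python
def causes_frameshift(unit_len, units_changed=1):
    """Return True if gaining or losing repeat units shifts the frame.

    unit_len      : length of the repeat unit in nucleotides
    units_changed : number of units gained or lost (a positive count)
    """
    # The frame is preserved only when the net change in length is a
    # multiple of three nucleotides (one codon).
    return (unit_len * units_changed) % 3 != 0


if __name__ == "__main__":
    for unit_len in range(1, 10):
        print(unit_len, causes_frameshift(unit_len))
```

For repeat units of three, six or nine nucleotides a change of any number of units preserves the frame. For the other unit lengths, a change of one or two units shifts the frame, whereas a change of three units (for example, 12 nucleotides for a tetranucleotide tract) restores it.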
This suggests that the selective pressure against trinucleotide and hexanucleotide SSRs occurring within an ORF may not be as high as that against SSRs of other repeat unit lengths whose expansion or contraction would result in inactivation of an ORF by frameshift mutation. It is noteworthy that tetranucleotide SSRs are found exclusively within ORFs, consistent with the known importance of this class of repeat in mediating phase variable expression at contingency loci in Hi.
Manual curation of each of the 223 SSRs allowed us to assess the likelihood of each SSR playing a role in modulating gene expression. Comparison of equivalent SSR loci (those located in the same relative genomic location) allowed the classification of each SSR into one of three categories: (1) SSRs that did not vary in length, sequence or position between the four genomes (invariable), (2) SSRs for which some variation between strains was observed but the variation was not considered likely to result in variation of gene expression (variable) and (3) SSRs that both varied in length and were located in regions consistent with mediating phase variation (potentially phase variable; see Table 4). Careful manual examination of each of the repeat associated loci was necessary to classify the SSRs into the above categories. Factors such as the location of the SSR within a gene, length of the SSR and replacement of a whole or partial tract of an SSR by another sequence contributed to the assessment of whether or not any observed variation in the SSR was likely to mediate phase variation. SSRs located outside ORFs were generally more difficult to assess as to their likely involvement in phase variable modulation of gene expression. SSRs have previously been shown to be mediators of phase variation through modulation of promoter activity and gene transcription (Dawid et al., 1999; Martin et al., 2005; van Ham et al., 1993), but promoter regions of individual genes often cannot be accurately defined. Thus, the influence of variation in SSRs located in non-coding regions on expression of adjacent genes is difficult to predict. The full assessment of the 223 SSRs identified within the four genomes can be found in Supplementary Table 1; a summary of the data is provided in Table 4.
The manual classification of the SSRs into the three categories indicated that despite the considerable variation seen between strains for many of the SSRs (especially the mononucleotide SSRs), the tetranucleotide, pentanucleotide and heptanucleotide repeat tracts were the only types of SSR considered to have a potential role in mediating phase variable gene expression in these four strains of Hi (Table 4). The potentially phase variable ORFs associated with each of these types of repeat are detailed below.
A previous analysis of the Rd KW20 genome sequence identified the primacy of tetranucleotide SSRs in mediating phase variation in Hi (Hood et al., 1996b). This study extends that work by comparing the tetranucleotide SSRs across four Hi genomes. We identified 18 different tetranucleotide SSR loci that are distributed fairly uniformly between genomes, with each genome containing from 12 to 14 tetranucleotide SSR loci (Table 5). Eight of the tetranucleotide SSR loci were found in all four of the strains, two of the loci were found in three of the strains, three of the loci in two of the strains and five of the loci were unique to one strain (two unique loci in each of the strains Rd KW20 and 86-028NP and one unique locus in R2846).
Two of the tetranucleotide SSR loci that we have identified in the four genome analysis have not previously been described as potential mediators of phase variation in Hi. The first of these novel loci contains 14 tandem 5′AGTC repeats and is unique to strain R2846 (starting at nucleotide 1505819; see Table 5). This SSR is found immediately downstream of the presumptive start codon of an ORF encoding a 294 aa protein with homology to the glycosyltransferase 2 family of proteins (PFAM PF000535). In Hi, phase variable glycosyltransferases are frequently involved in LPS biosynthesis (Hood et al., 1996a). In strain 86-028NP the same glycosyltransferase is replaced with a different gene (NTHi_1053) that has high homology (e value of 1 × 10⁻¹⁴¹, BLASTN) to the phosphoethanolamine transferase gene, lpt3, of Neisseria meningitidis (Mackinnon et al., 2002). This is the first report of a gene with significant homology to lpt3 in Hi. Both NTHi_1053 and the gene encoding the putative glycosyltransferase have an atypically low G + C content (<30%), suggesting that they have been acquired by horizontal transfer. The finding that multiple, distinct gene insertions have occurred in the same region of the bacterial genome in different strains may indicate that this is a hotspot for recombination.
The second novel tetranucleotide SSR locus contains a 5′CCAA tract associated with a putative glycosyltransferase (gene NTHi_1769 in strain 86-028NP). This SSR is present in all four genomes examined with between 8 and 16 repeat units and constitutes the first example of a 5′CCAA tract that is associated with a gene other than iron utilisation genes in Hi (Jin et al., 1996; Morton and Stull, 1999).
A total of eight pentanucleotide SSR loci were identified across the four Hi strains investigated; six of these were located within ORFs (see Supplementary Table 1). The length of the pentanucleotide SSRs ranged from three to twelve units but the majority were of the minimum threshold value of three units. Two of the pentanucleotide SSRs located within ORFs are of particular interest. The first of these pentanucleotide SSRs is associated with the type I modification enzyme, HsdM (the ORF in the Rd KW20 genome (HI1287) is truncated due to the repeat), and has previously been implicated in the phase variable expression of this type I restriction-modification gene (Zaleski et al., 2005). The SSRs identified in the four genome study are one, two or four units in length. van Belkum et al. (1997) and van Belkum (1999) described length variation in the region of this pentanucleotide repeat in a survey of 20 Hi strains. Zaleski et al. (2005) estimated the phase variation rates of the (5′GACGA)4 pentanucleotide repeat (four tandem repeats of the sequence 5′GACGA3′) at this locus from observations on the degree of bacterial lysis induced by exposure to phage HP1c1. The rates they recorded for a change from four to three pentanucleotide repeats in strain RM118 were high and equivalent to those previously measured for much longer tetranucleotide repeat tracts in the same strain (De Bolle et al., 2000).
The second coding pentanucleotide SSR of interest (5′TCAGC) was found in a gene of the hmg locus that encodes a high molecular weight glycoform of the LPS (Hood et al., 2004). The two repeat unit pentanucleotide SSRs present in Rd KW20 and R2846 (within the ORFs HI0867 and Hflu103000281, respectively) are consistent with the expression of a putative LPS flippase, whilst the three unit SSR in R2866 is inconsistent with expression of this gene. It is noteworthy that these two potential phase variation-mediating pentanucleotide SSRs relate to gene functions (restriction-modification and LPS modification) whose expression has previously been reported to be phase varied by tetranucleotide SSRs.
Four heptanucleotide SSRs were found in the survey of the four Hi genomes, three of which we have designated as potential mediators of phase variation. Two of these heptanucleotide SSRs are located approximately 100 bp upstream of the hmw1a and hmw2a genes and have previously been described by Dawid et al. (1999). They reported that these SSRs are within the promoters of the hmw1a and hmw2a genes and that alteration of the number of repeat units present in these SSRs results in a modulation of gene expression. The exact mechanism by which these SSRs influence transcription of these genes remains to be determined but may involve modulation of transcription from two alternative start sites (Dawid et al., 1999). Strain R2846 has (5′TGAAAGA)17 and (5′TGAAAGA)16 for hmw1a and hmw2a, respectively, and strain 86-028NP has (5′TGAAAGA)17 and (5′TGAAAGA)23 for hmw1a and hmw2a, respectively, but there are no equivalent loci or repeat tracts in the other two genomes.
The third heptanucleotide SSR with a potential to mediate phase variation is the (5′AACAACC)1-7 tract situated only 13 bp upstream of a gene encoding a member of the TonB-dependent receptor family (PF0593) that has similarity to Fe transport proteins. One unit of the repeat is found in the genomes of strains Rd KW20 and R2846, seven in R2866 and six in 86-028NP. Rd KW20 and R2866 appear to have full length ORFs but the 86-028NP ORF is disrupted by a frameshift unrelated to the SSR. The observed variation in the length of this SSR, together with its position so close to the start of the downstream ORF, led us to postulate that it may mediate phase variation in Hi.
Analysis of mononucleotide, dinucleotide, trinucleotide and hexanucleotide SSRs in the four genome study did not provide any evidence to suggest to us that these classes of repeat were associated with phase variable gene expression as detailed below.
Mononucleotide SSRs have previously been documented as important mediators of phase variation in species such as Neisseria meningitidis (Schoen et al., 2007), Bordetella pertussis (Gogol et al., 2007) and Campylobacter jejuni (Hofreuter et al., 2006; Pearson et al., 2007). Perhaps surprisingly, they have not been implicated in phase variation in Hi, although partial sequencing of the iga gene from some Hi biogroup aegyptius strains led the investigators to suggest that a G10 tract found in only one strain may have mediated phase variable expression of the gene (Kilian et al., 2002).
Our analysis of the mononucleotide SSR loci present in the four Hi genomes revealed a considerable degree of heterogeneity in this class of SSR between these strains. A total of 64 homopolymeric tracts were identified across the four genomes and Supplementary Table 1 summarises their characteristics. Of these, 28 (44%) were found within ORFs and, although variations were frequently observed between strains, they were not consistent with mediating phase variation (see Supplementary Table 1). The findings from the genome of strain Rd KW20 were representative of the distribution of mononucleotide SSRs found in the three other strains. All of the mononucleotide SSRs in this strain were A or T tracts (18/18) and most (16/18) were of the minimum threshold length of 9 units. Comparison across the four strains revealed that the variation observed in the equivalent mononucleotide SSRs of 8–10 units usually occurred by the substitution of one of the bases within the homopolymeric tract with a different base (e.g. an (A)9 tract was found as (A)7CA in some strains). All substitutions interrupting the A or T homopolymeric SSRs were found to be G or C nucleotides, suggesting an uneven pattern of mutation.
Examination of the other three genomes identified some anomalous mononucleotide SSRs. The first is an exceptionally long (A)34 tract identified in strain R2866. This SSR was located 120 bp upstream of the start of the ORF encoding the autotransporter adhesin Hia, which contains the YadA domain and is believed to bind vitronectin and aid survival in human serum (Cotter et al., 2005a; Hallström et al., 2006; Meng et al., 2006). This SSR is not obviously associated with a promoter region and its function, if any, remains unclear. The second and third are a (G)12 and a (C)11 repeat tract, both found in the genome of strain 86-028NP, which are noteworthy because mononucleotide SSRs of G or C residues are uncommon in Hi, reflecting the low G + C content of this organism (38%). The (G)12 SSR was within the 5′ end of ORF ntHI0694. This gene shows homology with genes encoding methyltransferases of the FkbM family, some of which are involved in the biosynthesis of methylated sugars in Rhizobium etli LPS (Duelli et al., 2001). This gene has not been identified in other Hi strains and its presence suggests that 86-028NP LPS may be O-methylated. The (C)11 SSR was located 230 bp upstream of the acpP gene (ntHI0243). Members of the AcpP family are short proteins which are involved in the transfer of acyl groups and are considered housekeeping proteins. In the three other genomes the tract at the same location contains five C residues.
Phase variation mediated by dinucleotide repeats has been documented previously in Hi. A (TA)9-11 tract, located in the promoter region of two divergently transcribed genes, hifA and hifB was shown to control fimbriae biogenesis in some strains (van Ham et al., 1993). The hif locus is present in only 20% of NTHi strains and, of the four strains analysed here, only R2866 contains the hifA and hifB genes. In this strain, however, the 5′TA tract was present as a 5′(TA)4ATTA sequence. The threshold value set for dinucleotide SSRs in this study was five, therefore this tract was not identified as an SSR; further discussion of this locus is found later in this paper.
Eight dinucleotide SSR loci were identified in this four genome analysis, all of which were found to be of the threshold value of five repeat units in length. Five were located within coding regions. In a similar fashion to the variation observed for many of the mononucleotide SSRs, seven of the dinucleotide SSR loci were found to have sequence variations that did not alter the overall length of the sequence between strains and so would not cause frameshifts consistent with phase variation. For example, a (CA)5 repeat conserved in the genomes of strains Rd KW20, R2846 and 86-028NP was found to be replaced with CACG(CA)3 in strain R2866.
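The distinction drawn here and for the mononucleotide SSRs, between interruptions that leave the tract length unchanged and genuine changes in length, can be expressed as a small Python sketch. The function name compare_tracts and the category labels are hypothetical, and the frame-based labels are meaningful only for tracts located within an ORF.

```python
def compare_tracts(reference, other):
    """Classify how two equivalent tracts differ between strains."""
    if reference == other:
        return "identical"
    if len(reference) == len(other):
        # Base substitution(s) interrupt the repeat, but the overall
        # length, and hence the downstream reading frame, is unchanged.
        return "substitution, length unchanged"
    if (len(reference) - len(other)) % 3 == 0:
        return "length change, frame preserved"
    return "length change, frameshift"


if __name__ == "__main__":
    # (A)9 versus (A)7CA, and (CA)5 versus CACG(CA)3, as in the text.
    print(compare_tracts("A" * 9, "A" * 7 + "CA"))
    print(compare_tracts("CA" * 5, "CACG" + "CA" * 3))
```

Both examples from the text fall into the "substitution, length unchanged" category, which is why they were not regarded as consistent with phase variation.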
Eighteen trinucleotide SSRs were identified in this study, the majority of which, (15/18), were located within coding regions. All of these 15 SSRs consisted of no more than four repeat units and where variation in the repeats was observed between strains, it either resulted in a reduction in length of the SSR or disruption of the sequence whilst maintaining the same length. The three trinucleotide SSRs that were found in non-coding regions showed greater variation in overall length but were not within identified promoter or other regulatory regions.
Ten hexanucleotide SSRs were identified within the four genomes, eight within ORFs and two in non-coding regions. Variation in coding hexanucleotide repeats can lead to altered amino acid sequence but not phase variable gene expression. The coding region hexanucleotide SSRs identified in this study were either conserved or, like the mononucleotide and dinucleotide variations discussed above, showed changes in sequence but not length and thus were inconsistent with modulating phase variable expression. Of the two non-coding region associated hexanucleotide repeats, one is conserved across all four strains and is present downstream of the closest ORFs, whilst the other 5′TTAAAA SSR is present as three repeat units in Rd KW20, two units in 86-028NP and as two units plus an interrupted third repeat unit in R2866 and R2846. This SSR is situated 19 bp from the start codon of HI0525 in strain Rd KW20, which encodes a phosphoglycerate kinase involved in central metabolism and the influence of this SSR on the expression of this ORF is unknown.
Of the limited number of SSRs with repeat unit lengths greater than seven nucleotides that were identified in the four genomes study, most were found in only one strain. These include a nonanucleotide SSR found in the genome of strain 86-028NP. This (5′GTTTTCTTA)19 SSR was found to be located 92 bp upstream of the hmw2C gene. As discussed above, variations in heptanucleotide SSR associated with the hmw2A loci are thought to modulate gene expression but the function of this nonanucleotide SSR is not known. An octanucleotide (5′ATTATTTG) SSR however, was found in multiple strains, varying in length between 1 and 6 repeat units. It was found to be located between the divergently transcribed cmkB and pdxS genes which encode a cytidylate kinase 2 and a pyridoxal biosynthesis lyase, respectively, (designated HI1646 and HI1647 in strain Rd KW20). They are both suggested to play roles in metabolism and so it is uncertain whether this SSR would actually be utilised in modulating their expression.
Whilst the SSR analysis of the four Hi genomes was ongoing, 12 further Hi genomes were sequenced and the resulting full or partial sequences made publicly available (listed in Table 1). These 12 additional genome sequences offered us the chance to confirm and extend our detailed SSR analysis of the four Hi genomes.
Using the same SSR search methods and threshold values described for the four genome study, 765 SSRs were identified in these 12 additional genomes (summarised in Table 6). From these data it was seen that mononucleotide SSRs are found in 10 out of 12 of the additional genomes at a higher frequency than was observed in the four genome study. It should be noted that the 454 sequencing technology used to generate the majority of the further genome sequences has a decreased fidelity for mononucleotide tracts, which may account, to some extent, for the higher number of mononucleotide SSRs detected in these strains. However, the F3031 and F3043 genomes, for which the highest numbers of mononucleotide SSRs were identified, were sequenced using ABI Sanger dideoxy sequencing technology.
In a high proportion of the additional genomes, tetranucleotide and hexanucleotide SSRs were also observed more frequently than in the four genome study.
The nine NTHi genomes, sequenced by the Center for Genomic Sciences (Hogg et al., 2007) (see Table 1), and the genome of strain 10810, contained a similar number of tetranucleotide SSRs to that previously observed in the four genome study (12-14 per genome) and only two novel tetranucleotide SSR loci were identified. Conversely, in the genomes of strains F3031 and F3043, 18 and 21 tetranucleotide SSR loci were identified, respectively, and eight of these tetranucleotide SSRs were not identified in any of the previously analysed genomes (see Table 5).
A total of ten novel tetranucleotide SSR loci were identified in the additional twelve genomes. One locus, licA2, is a duplication of the licA locus reported in the four genome study (Fox et al., 2008). Five of the novel tetranucleotide SSRs were associated with genes encoding members of the trimeric autotransporter protein family which commonly contain a C-terminal YadA domain (PFAM03895) (Cotter et al., 2005b; Koretke et al., 2006). All five of these paralogous loci were present in the two Hi biogroup aegyptius strains F3031 and F3043 and one of the loci was also present in the genome of the NTHi strain, PittHH. Previously described members of this family of proteins from Hi include the adhesins Hsf and Hia which have been implicated in virulence (Cotter et al., 2005a; Surana et al., 2004). It can be envisaged that the expression of adhesins may not be advantageous in all growth conditions as they are possible targets for the host immune system and are large proteins (up to 1016 aa) whose expression would require considerable resources. Indeed, the NadA protein from N. meningitidis, which is a member of this family of proteins, has previously been shown to be phase variably expressed (Capecchi et al., 2005; Martin et al., 2005). Hi biogroup aegyptius strains have been associated with atypical invasive disease and it is, therefore, tempting to speculate that the high number of putative phase variable adhesins identified in strains F3031 and F3043 may somehow contribute to the unusual clinical outcomes associated with these strains.
An additional four novel tetranucleotide SSRs were identified from strains F3031 and/or F3043. The first, a (5′ATTA)9 SSR is found 225 bp upstream of a gene encoding a putative DNA repair enzyme, formamidopyrimidine-DNA glycosylase MutM, in strain F3043. The equivalent position in other Hi strains contains 3 copies of the 5′ATTA repeat unit. The role of this repeat in expression of MutM is unknown but variations in the expression of mutM could potentially result in altered mutation rates in Hi (Horst et al., 1999).
The second novel tetranucleotide SSR identified from strains F3031 and F3043 is a 5′CAAT SSR contained within the 5′ region of an ORF that encodes a putative glycosyltransferase with homologies to glycosyltransferase family 8 (PFAM01501). Homologues of this gene are found in other strains of Hi (including HI0223 in strain Rd KW20) but without the associated SSR. The function of this gene is unknown but it may contribute to LPS expression in strains F3031 and F3043.
The third of the four additional novel tetranucleotide SSR loci contains (5′CAAT)21 and was found only in strain F3043. It is located within the 5′ end of an ORF that encodes a putative S-adenosylmethionine (SAM)-dependent methyltransferase and shows some homology to HI0096 in strain Rd KW20. SAM-dependent methyltransferases have been implicated in various cellular processes including protein trafficking and sorting, signal transduction, biosynthesis, metabolism, and gene expression.
The final novel tetranucleotide SSR identified in strain F3031 is a (5′CAAG)32 SSR located 58 bp upstream of a gene encoding an adenine specific methylase homologue (EcoRI) and 202 bp upstream of the divergently transcribed htpX (which encodes a putative protease protein, induced by heat shock in E. coli). HtpX has not been investigated in Hi but in E. coli it is part of the membrane-localised proteolytic system and may play a part in the degradation of unstable membrane proteins (Sakoh et al., 2005).
In total, 199 tetranucleotide SSRs associated with 28 different loci and consisting of nine different repeat unit sequences have been identified in the complete genome collection. The distribution of tetranucleotide SSR length, and the relationship between length of tetranucleotide SSR and strain are shown in Fig. 1. The length of an individual tetranucleotide SSR does not appear to be dependent on strain background, repeat unit sequence or locus (Fig. 1B), and a wide degree of variation and considerable overlap between groupings is observed. Fig. 1A shows that despite differences in the source, date of isolation and associated clinical symptoms of the different strains there is an approximately normal distribution of tetranucleotide SSR lengths. Fig. 1B shows that the two Hi biogroup aegyptius strains F3031 and F3043, which are associated with unusual clinical symptoms and have the highest number of tetranucleotide SSR, display a similar distribution of SSR lengths to all other strains.
As noted previously, the tetranucleotide SSRs identified in the four genome study of Hi are located within ORFs and, with only two exceptions, lie immediately adjacent to or just downstream of the translational start site. In this position, any frameshift due to variation in the length of the SSR would result in a peptide being made from the incorrect reading frame and a premature stop to translation. The location of tetranucleotide SSRs within the 5′ region of the ORFs limits the encoded tetrapeptide repeat to the N-terminus of the respective protein. The two exceptions to this pattern are the 5′GCAA tetranucleotide SSR located in the middle of the oafA gene that has been previously described (Fox et al., 2005), and a 5′GACA tetranucleotide SSR located in the 3′ region of a gene encoding a putative glycosyltransferase (pgt1) in the genomes of strains R2846, 86-028NP, 3655 and PittEE. These repeats may modulate the protein function rather than control ON/OFF switching of its expression.
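The frameshift argument above is simple modular arithmetic, and a minimal Python sketch may help to make it concrete. It assumes that a locus is translated in frame at some reference copy number (`in_frame_copies`, a hypothetical value chosen for illustration and not taken from any locus in this study) and that any out-of-frame copy number switches expression OFF. The function names are illustrative only.

```python
def frame_offset(unit_len: int, copies: int, in_frame_copies: int) -> int:
    """Shift of the downstream reading frame (0, 1 or 2 nt) relative to the
    copy number at which the ORF is assumed to be in frame."""
    return ((copies - in_frame_copies) * unit_len) % 3


def expression_state(unit_len: int, copies: int, in_frame_copies: int) -> str:
    """Simplified ON/OFF call: ON only when the downstream frame is preserved."""
    return "ON" if frame_offset(unit_len, copies, in_frame_copies) == 0 else "OFF"


if __name__ == "__main__":
    # Hypothetical tetranucleotide locus assumed to be in frame at 21 copies.
    for copies in range(19, 26):
        print(copies, expression_state(4, copies, in_frame_copies=21))
```

Under these assumptions a tetranucleotide tract returns to the ON frame only after a net gain or loss of three repeat units, whereas a tract whose unit length is a multiple of three never changes frame.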
Tetranucleotide SSRs may constitute a substantial proportion of the coding region of a gene and thus the repeat unit sequence will have a significant influence on the amino acid composition of the encoded protein. The constraints that this may impose, in terms of permissible tetranucleotide sequences, have not been well characterised although High et al. (1996) suggest that the peptides encoded by the repeat regions form structurally flexible regions that loop out of the protein structure and therefore do not interfere with tertiary structure.
An in silico analysis of the repeat sequences identified in the 16 Hi genomes showed that hydrophilic amino acids are over-represented in the SSR-encoded peptides compared with their frequency in the normal proteome. Of the eight tetranucleotide repeat sequences found within ORFs in Hi, five encode hydrophilic peptides with no net charge (5′CAAT, 5′GACA, 5′CCAA, 5′AGCC, and 5′AGTC), one encodes a hydrophobic peptide with no net charge (5′TTTA) and two encode hydrophilic peptides with a net positive charge (5′GCAA and 5′CAAA). The high proportion of hydrophilic peptides encoded by the tetranucleotide SSRs and their frequent N-terminal location suggest that they are likely to be surface exposed and have the opportunity to ‘loop out’ of the folded protein structure and thus be less likely to interfere with the tertiary structure of the protein and, therefore, its function. The exception is a 5′TTTA tetranucleotide SSR which encodes a hydrophobic peptide within a putative drug/metabolite exporter (HI0687 in strain Rd KW20). Transmembrane helix predictions (TMHMM server v2.0, Sonnhammer et al., 1998) suggest that the portion of this protein encoded by the SSR lies entirely within a transmembrane domain. Examination of homologues of the HI0687 gene indicates that the hydrophobic nature of such transmembrane helices is well conserved but often the primary sequence is not (data not shown). Finally, although SSRs of a particular tetranucleotide repeat unit sequence have previously been associated with genes of related function, e.g. 5′CCAA tracts with genes encoding iron utilisation proteins (Jin et al., 1996; Morton and Stull, 1999), we found no evidence in this study of a particular tetranucleotide repeat unit sequence being restricted to a particular class of gene.
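The peptide assignments above can be reproduced with a short Python sketch. It assumes that each repeat unit is written 5′ to 3′ on the coding strand and uses the standard codon assignments; the net charge calculation counts only Lys/Arg (+1) and Asp/Glu (−1) and ignores histidine, which is a simplification. The function names are illustrative and this is not the analysis pipeline used in the study.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def repeat_peptide(unit: str, frame: int = 0) -> str:
    """Translate one full period (three copies of the unit) of a tandem
    repeat, read in reading frame 0, 1 or 2."""
    period = len(unit) * 3
    tract = unit * 4  # one extra copy so that frames 1 and 2 span a full period
    window = tract[frame:frame + period]
    return "".join(CODON_TABLE[window[i:i + 3]] for i in range(0, period, 3))


def net_charge(peptide: str) -> int:
    """Crude net charge: +1 for K/R, -1 for D/E (histidine ignored)."""
    return sum(aa in "KR" for aa in peptide) - sum(aa in "DE" for aa in peptide)


if __name__ == "__main__":
    for unit in ["CAAT", "GACA", "CCAA", "AGCC", "AGTC", "TTTA", "GCAA", "CAAA"]:
        pep = repeat_peptide(unit)
        print(unit, pep, net_charge(pep))
```

Because three copies of a tetranucleotide unit span exactly four codons, the three reading frames give cyclic rotations of the same tetrapeptide, so the composition of the encoded repeat does not depend on the frame in which the tract is read.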
One feature of certain tetranucleotide SSR, noted during the course of this study was their interruption by an imperfect repeat unit. All of the genomes in this study were found to contain between one and four related, hemoglobin/hemoglobin–haptoglobin-binding (hgp) genes containing 5′CCAA SSR that show considerable variation in length. Hi lacks most of the genes of the heme biosynthetic pathway and requires hemoglobin/hemoglobin–haptoglobin-binding proteins to capture heme-containing compounds required for growth (Morton and Stull, 1999). Seven interrupted tetranucleotide SSRs were observed in total in this analysis of which six were found to be associated with hgp genes. We postulate that homologous recombination occurring between these paralogous loci may occasionally generate imperfect repeats and it will be of interest to ascertain whether similar events occur between other duplicated loci, e.g. paralogous adhesin genes (discussed below), partial or fully duplicated hifA loci and duplicated lic1A genes (strain PittGG).
In contrast to the four genome study, the analysis of the 12 additional Hi genomes has identified a number of mononucleotide SSRs with the potential to mediate phase variable gene expression. These mononucleotide SSRs were located within the 5′ coding regions of ORFs, associated with frameshift mutations, or located within potential promoter regions. The potential phase variable genes include those encoding virulence-related factors such as glycosyltransferases, type-I restriction modification systems, haemagglutinins, YadA domain containing proteins, pilin genes and a Fe–S cluster assembly scaffold protein (see Table 7). This study indicates that mononucleotide SSRs may mediate phase variation in Hi more widely than the single previous report suggests (Kilian et al., 2002).
Further support for the role of mononucleotide SSRs in mediating phase variation in Hi comes from the observation that some genes identified in the 12 genome study have previously been shown to be phase variable through other classes of SSRs. An example is the divergently transcribed pilin genes, hifA and hifB, which Geluk et al. (1998) demonstrated to be phase variable due to variation in the length of a 5′TA SSR located between them and 104–225 bp upstream of the hifA gene. Changes in the length of the dinucleotide SSR were proposed to alter the spacing between the −10 and −35 promoter sequences and therefore alter expression of the genes. In strain F3031, there are four hifA loci in total. Two of the loci have an arrangement similar to that described by Geluk et al. (1998), with the 5′TA SSR located between the divergently transcribed hifA and hifB genes, whilst the other two hifA loci have mononucleotide (A17 or A12) instead of dinucleotide SSRs located either 63 or 93 bp upstream of hifA (see Table 7). There is no hifB gene associated with these latter loci. Phase variation of pilin expression mediated by mononucleotide SSRs has not been previously reported in Hi. The exact location and extent of the promoter region of hifA have not been mapped in these strains, but the position of the mononucleotide SSRs makes them candidates to mediate phase variation.
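The promoter spacing mechanism described above can be illustrated with a minimal Python sketch. It assumes that the repeat tract lies entirely between the −35 and −10 elements, that a fixed number of non-repeat bases (`fixed_bp`, a hypothetical value) also separates them, and that the optimal spacing is 17 bp, the textbook value for E. coli σ70 promoters. None of these values has been measured for the hifA loci discussed here, whose promoters have not been mapped.

```python
def spacer_length(copies: int, unit_len: int, fixed_bp: int) -> int:
    """Distance in bp between the -35 and -10 elements for a given
    repeat copy number, under the assumptions stated in the text."""
    return fixed_bp + copies * unit_len


def offset_from_optimum(copies: int, unit_len: int, fixed_bp: int,
                        optimal_bp: int = 17) -> int:
    """Signed difference between the spacer length and the assumed optimum."""
    return spacer_length(copies, unit_len, fixed_bp) - optimal_bp


if __name__ == "__main__":
    # Hypothetical dinucleotide tract: each unit gained or lost moves the spacing by 2 bp.
    for copies in range(7, 11):
        print("TA", copies, offset_from_optimum(copies, unit_len=2, fixed_bp=1))
    # Hypothetical mononucleotide tract: spacing changes in 1 bp steps.
    for copies in range(15, 19):
        print("A", copies, offset_from_optimum(copies, unit_len=1, fixed_bp=1))
```

The sketch shows only that a dinucleotide tract changes promoter spacing in 2 bp steps and a mononucleotide tract in 1 bp steps; how such changes translate into expression levels would need to be determined experimentally.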
In the four genome study, all homopolymeric A or T tracts were less than 11 bp long with only one exception, an A34 tract found in the R2866 genome. In the additional twelve genomes a similar tract of between 20 and 49 bp was found in four strains in the same genomic location, approximately 120 bp upstream of the nearest ORF, which encodes a protein with homology to YadA-domain containing proteins such as Hsf. The function of this SSR is unknown but it is tempting to speculate that it may play a role in regulating the expression of the downstream Hsf-like encoded protein. Similarly, in the genome of strain F3031 the expression of a number of YadA domain containing proteins was suggested to be mediated by tetranucleotide SSRs (see Table 5). However, in one instance, the expression of a YadA domain containing protein in this strain is potentially mediated by a G13 SSR (at base 548152) located 10 bp within the ORF (see Table 7). The association of mononucleotide SSRs, in certain strains, with paralogs of genes which are phase variable by other SSRs offers strong circumstantial evidence that these mononucleotide SSRs may mediate phase variation in Hi.
The heptanucleotide SSRs associated with the hmw1a and hmw2a genes in the four genome study were also identified in the genomes of strains PittEE and R3655 in the 12 further Hi genomes analysed. In PittEE, 13 copies of the heptanucleotide repeat are present 69 bp upstream and 38 copies 104 bp upstream of the hmw1a and hmw2a genes, respectively, and in R3655, 16 copies of the repeat are present 106 bp upstream of hmw1a. In the further genome study, an additional novel heptanucleotide SSR associated with hmw1a was identified in the genome of strain PittAA. Interestingly, this SSR, consisting of (5′AATTTTG)14, was 3.5 kb within the 7.3 kb putative full length ORF rather than in the promoter region, and a frameshift had occurred, consistent with variation in the length of the SSR. In the further genome analysis, an octanucleotide SSR was also found associated with hmw loci. This SSR contained twelve to fifteen copies of a 5′GCATCATC repeat and was identified 200–213 nucleotides upstream of the hmw1a and hmw2a loci of strains F3043 and F3031.
A further novel heptanucleotide SSR with the potential to mediate phase variation was identified in strains 22.4.21 and R3655, within an ORF which is a homologue of the HI1369 gene (encoding a putative TonB dependent iron ligand gated channel). Thirteen units of the 5′AACAACC repeat are found in 22.4.21, and eight repeat units in strain R3655 which results in a truncated ORF due to a frameshift.
An octanucleotide SSR identified in the four genome study as containing one, four or six copies of a 5′ATTATTTG unit 12 bp upstream of a gene encoding a pyridoxine biosynthesis protein, was also identified in seven strains of the further twelve genome collection (four copies of the SSR in strains PittHH, PittAA and 22.1.21 and six copies in strains R3655, PittII, F3043 and Hib; see Table 7). However, the limited range of variation and relatively short length of this SSR are not what would be expected at a classically phase variable locus and so the significance of this SSR at this location remains uncertain.
The complete genome sequence of Hi, strain Rd KW20 (Fleischmann et al., 1995), provided for the first time the means to analyse the gene content, organisation and sequence structure of a free-living organism. One of the major findings in Hi strain Rd KW20 was the association of SSRs, especially tetranucleotide SSRs, with genes involved in host adaptation, commensalism and virulence (Hood et al., 1996b). SSRs are hypermutable and mediate a high frequency of reversible increases or decreases in the number of repeat units resulting in phase variable expression of the associated genes (Moxon et al., 2006).
As sequencing techniques have progressed, the ease with which sequencing data can be gathered has increased. As a result, the sequences of multiple strains of a single species have become available for comparison and the extent of genomic variation between strains has become evident. In this study, our aim was to extend our understanding of the role of SSRs in the biology and pathogenicity of Hi by an analysis of four complete genome sequences and a survey of available sequence data for a further twelve strains.
SSRs, consisting of repeat units of between one and nine nucleotides in length, were characterised. For this analysis to be practical, it was necessary to establish threshold values, above which tandem repeat units were designated as SSRs. Data pertaining to the genomic location, position relative to the nearest ORF and the types of polymorphism observed by comparison between genomes were compiled for each of 223 SSRs in the initial survey of the four complete Hi genome sequences from strains Rd KW20, R2846, R2866 and 86-028NP. These SSRs were broadly classified into three categories: invariant, variant and potentially phase variable. Invariant SSRs showed no variation in sequence, position or length between strains, whilst variant SSRs showed some variation between strains but not of a type that would mediate phase variation, i.e. they usually showed some variation in sequence but not overall length. Potentially phase variable SSRs showed variation between strains in the number of repeat units constituting the SSR and were in positions consistent with mediating phase variation, either within ORFs or promoter regions. The majority of SSRs examined fell into the first two classes. From the further 12 partial and complete genome sequences, 765 additional SSRs were identified.
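The threshold-based identification of SSRs described above can be sketched in Python as follows. The minimum copy numbers in `MIN_COPIES` are hypothetical placeholders, since the threshold values used in this study are not restated here, and the scan is a simple illustration rather than the software used for the analysis. Only perfect, direct repeats on the given strand are reported.

```python
# Hypothetical thresholds: minimum tandem copies required for each unit length (1-9 nt).
MIN_COPIES = {1: 9, 2: 6, 3: 5, 4: 4, 5: 3, 6: 3, 7: 3, 8: 3, 9: 3}


def is_primitive(unit: str) -> bool:
    """True if the unit is not itself a tandem repeat of a shorter motif
    (e.g. 'AA' and 'CACA' are rejected, 'CAAT' is accepted)."""
    n = len(unit)
    return not any(n % d == 0 and unit[:d] * (n // d) == unit for d in range(1, n))


def find_ssrs(seq: str, min_copies: dict = MIN_COPIES) -> list:
    """Return (start, unit, copies) for every perfect tandem repeat whose
    copy number meets the threshold for its unit length."""
    hits = []
    for k, threshold in min_copies.items():
        i = 0
        while i + k <= len(seq):
            unit = seq[i:i + k]
            copies = 1
            # Count consecutive perfect copies of the unit starting at position i.
            while seq[i + copies * k:i + (copies + 1) * k] == unit:
                copies += 1
            if copies >= threshold and is_primitive(unit):
                hits.append((i, unit, copies))
                i += copies * k  # continue after the tract
            else:
                i += 1
    return sorted(hits)


if __name__ == "__main__":
    demo = "GG" + "CAAT" * 5 + "GC" + "A" * 10 + "C"
    for start, unit, copies in find_ssrs(demo):
        print(start, unit, copies)
```

A tract is reported with the unit found at its first qualifying position, so the same SSR may appear as a rotation of the conventional unit (for example 5′TCAA rather than 5′CAAT) depending on the flanking sequence.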
These studies have confirmed that tetranucleotides are the predominant class of SSR to mediate phenotypic variation via phase variation in Hi. A total of 199 tetranucleotide SSRs were found distributed across the 16 strains, associated with 28 different loci (see Table 5). Of these, 10 were novel tetranucleotides, eight of which were identified in the genome sequences of only two strains, the Hi biogroup aegyptius strains F3031 and F3043. Tetranucleotide SSRs were found associated with a number of paralogous adhesin genes in these strains and, intriguingly, with a mutM locus that could potentially modulate mutation rates due to oxidative damage (Horst et al., 1999).
The Hi biogroup aegyptius strains, F3043 and F3031 isolated in Brazil, were associated with conjunctivitis and BPF, respectively. A relevant question is whether the increased number of tetranucleotide SSRs in these strains may contribute to their unusual virulence phenotype. A detailed analysis of the characteristics of the tetranucleotide SSRs across all strains showed that whilst the number of tetranucleotide SSRs was higher in the biogroup aegyptius strains (Table 6), the length or sequence of the SSRs was similar between all the strains (Fig. 1 and unpublished data). Indeed, no relationship was found between the sequence, length, genomic locus or protein function of tetranucleotide SSRs. Other tetranucleotide SSR loci identified included those encoding two glycosyltransferases, one of which contains a 5′CCAA repeat, the first occasion for Hi where this particular SSR unit sequence has been associated with genes encoding proteins of any function other than hemoglobin and hemoglobin–haptoglobin binding.
Although tetranucleotide SSRs are the most frequent mediators of phase variation in Hi, other SSRs may play a role in mediating phase variation, particularly in strains such as F3031 and F3043. This study has identified a number of novel mononucleotide, dinucleotide, pentanucleotide, heptanucleotide, and octanucleotide SSRs as potential mediators of phase variation. Mononucleotide SSRs have not previously been described as frequent mediators of phase variation in Hi, in contrast to other bacterial species such as N. meningitidis. There is only one report in the literature of mononucleotide SSRs potentially mediating phase variation in Hi; a G10 SSR is suspected to mediate phase variation of the iga gene, AF522258, in Hi biogroup aegyptius strain HK266 (Kilian et al., 2002). However, the distribution of the mononucleotide SSR loci identified in this study suggests that there may be some strain-dependent differences in the use of mononucleotide SSR to mediate phase variation. The potential mononucleotide SSR-mediated phase variable genes identified include those encoding factors associated with virulence such as glycosyltransferases, type-I restriction modification systems, haemagglutinins, YadA domain containing proteins (Cotter et al., 2005b; Koretke et al., 2006), pilin and a Fe–S cluster assembly scaffold protein (see Table 7). A number of the genes where phase variation is potentially mediated by homopolymeric tracts are phase variable by other mechanisms in other strains. For example, hifA and hifB expression is usually mediated by a dinucleotide SSR (van Ham et al., 1993). Similarly, the expression of YadA-domain containing proteins is potentially mediated by tetranucleotide SSRs in some loci identified in this study and by mononucleotide SSRs in other loci whilst the expression of the hmw1A and hmw2A genes is potentially mediated by upstream heptanucleotide, octanucleotide or nonanucleotide SSRs in different strains.
Differences in the classes of SSRs which mediate phase variation between species, or even between different strains of one species, may be determined by interspecies or interstrain differences in DNA metabolism. The efficiency with which different types of slippage intermediates are recognised and repaired depends on the complement of DNA repair mechanisms in the given strain or species. Investigation of the molecular basis of these differences will be aided by the availability of full genome sequences in conjunction with experimental assays.
The strains examined in this study were isolated in the United Kingdom (one strain), Brazil (two strains) and the United States of America (13 strains). A majority of the novel SSRs identified were in the F3031 and F3043 genome sequences (the Brazilian strains) and it remains unknown whether the population/geographical structure of Hi strains may be a significant factor in determining the complement of SSR within a strain: until the population structure of Hi is better understood it is difficult to predict the size of the SSR pan-genome and its potential role in mediating phase variation. In the strains studied, with the exception of F3031 and F3043, there were no associations between the ability to cause disease or commensal infection in the strains and the complement of potential phase variable mediating SSRs. For each strain, the contribution of the number and complement of phase variable genes to the probability of pathogenic potential remains unknown.
In conclusion, this study has reaffirmed the primacy of tetranucleotide SSRs as mediators of phase variation in Hi and has characterised and compared 28 tetranucleotide SSR loci (10 of them previously unreported) across 16 strains. Additionally, this study has identified a number of previously unrecognised mononucleotide, dinucleotide, pentanucleotide, heptanucleotide, and octanucleotide SSRs as potential mediators of phase variation that will be the focus of future research efforts. Thus, the utility of whole genome sequences in the investigation of the biology of pathogenic bacteria has been confirmed and, further, the analysis of multiple genomes has revealed non-intuitive subtleties in the population structure concerning the distribution of SSRs across the Hi pan-genome.
PMP was supported by a Beit Memorial Medical Research Fellowship. ERM, DWH, WAS, GAK, MJW, NJG were funded by grants awarded by the MRC and Wellcome Trust. The strain F3031, F3043 and 10810 genome sequence data were produced by the Pathogen Sequencing Group at the Sanger Institute and can be obtained from ftp://ftp.sanger.ac.uk/pub/pathogens/hib/.