3.1. Determination of threshold values used to identify SSRs in this study
Previous studies on Hi have described the SSRs present within the genome of strain Rd KW20 (Hood et al., 1996b). Our aim was to extend the analysis of SSRs by comprehensively investigating the repertoire present in the four complete Hi genome sequences that were available for different strains of Hi at the commencement of this study (Four Genome Analysis; see ). In this study, SSRs are defined as tandem repeats of a repeat unit that consists of between one and nine nucleotides. In order to attain maximum sensitivity for the detection of SSRs, the threshold values (see Section 2) were set as low as practically possible. Our rationale for adopting threshold values is described in , which shows the number of SSRs identified in the genome of strain Rd KW20 at the threshold values adopted and also the number of SSRs that would have been included in the subsequent analysis if the threshold value had been set one unit higher or lower for each repeat unit length. It can be seen from that, for all but the tetranucleotide SSRs, increasing the threshold value by one substantially decreased the number of SSRs detected and resulted in a total of only 19 SSRs being identified in the Rd KW20 genome. In contrast, decreasing the threshold by one resulted in a large number of SSRs being identified (3924 in strain Rd KW20), which would be impractical for manual analysis. At the adopted threshold values, 60 SSRs were identified in strain Rd KW20. The thresholds used in this study for all repeat unit lengths included at least all of the statistically unexpected SSRs determined by hidden Markov model analysis of the Hi genomes (see ; Paul Swift, Oxford, personal communication). This further supports the chosen threshold values as being sufficiently permissive to give a high degree of sensitivity in identifying SSRs with potential roles in mediating phase variation. Additionally, if an SSR was found to be above threshold in at least one genome, the corresponding regions in the other genomes were also characterised.
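The threshold-based definition of an SSR given above can be illustrated with a minimal sketch. This is not the search tool used in this study; the function names are hypothetical, only one strand is scanned, and the thresholds passed in the usage example are limited to the values stated elsewhere in the text (nine units for mononucleotide and five units for dinucleotide SSRs).

```python
def is_primitive(unit):
    """True if the unit is not itself a tandem repeat of a shorter unit."""
    n = len(unit)
    return not any(n % k == 0 and unit == unit[:k] * (n // k) for k in range(1, n))


def find_ssrs(seq, thresholds):
    """Return (start, unit, n_units) for perfect tandem repeats in seq.

    thresholds maps repeat-unit length to the minimum number of tandem
    units required for a tract to be reported (hypothetical interface).
    """
    hits = []
    for unit_len, min_units in sorted(thresholds.items()):
        i = 0
        while i + unit_len <= len(seq):
            unit = seq[i:i + unit_len]
            n = 1
            # Count how many consecutive copies of the unit follow position i.
            while seq[i + n * unit_len:i + (n + 1) * unit_len] == unit:
                n += 1
            if n >= min_units and is_primitive(unit):
                hits.append((i, unit, n))
                i += n * unit_len  # skip past the reported tract
            else:
                i += 1
    return hits


# Toy sequence (not genomic data): an (A)9 tract and a (TA)5 tract.
toy = "GG" + "A" * 9 + "CC" + "TA" * 5 + "G"
print(find_ssrs(toy, {1: 9, 2: 5}))
```

Raising or lowering a value in the thresholds mapping by one unit reproduces, in miniature, the sensitivity comparison described above.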
A total of 223 SSRs were identified in the four genome sequences when the threshold values described in were applied and these 223 SSRs are summarised in . Comparison of the SSRs across the four Hi genomes for each of the repeat unit lengths reveals that their numbers are not substantially different between the strains. Also, the total number of SSRs found within any one strain is not substantially different from the others (the total ranged between 53 and 60), despite the differences in the origin, associated disease and date of isolation of the four strains (see ).
The frequency and location of SSRs in the genome sequences of four H. influenzae strains.
SSRs have previously been associated with hypermutation, as loss or gain of repeat units occurs at high frequency due to replication slippage (Moxon et al., 2006). Loss or gain of repeat units from an SSR located within an ORF may result in a frameshift mutation if the length of the repeat unit is not a multiple of three. The position of each of the 223 SSRs identified in this study was manually curated and the proportion of SSRs that were located within ORFs was recorded (see ). Hi has a coding density of approximately 88% of the genome sequence and, of the repeats examined, only the trinucleotide (83%), tetranucleotide (100%) and hexanucleotide (87%) SSRs occur within ORFs at approximately this frequency, whilst SSRs with repeat unit lengths of one, two, five and seven nucleotides were all found to be located within ORFs with a frequency of less than 88%. SSRs with longer repeat unit lengths were not included in this analysis due to their low frequency in the genomes.
This suggests that the selective pressure against trinucleotide and hexanucleotide SSRs occurring within an ORF may not be as high as that against SSRs of other repeat unit lengths whose expansion or contraction would result in inactivation of an ORF by frameshift mutation. It is noteworthy that tetranucleotide SSRs are found exclusively within ORFs, consistent with the known importance of this class of repeat in mediating phase variable expression at contingency loci in Hi.
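The reading-frame argument above reduces to simple arithmetic: a change of d repeat units of length k alters the ORF length by k × d nucleotides, and the frame is preserved only when that product is a multiple of three. A minimal sketch follows, in which the function name is hypothetical.

```python
def causes_frameshift(unit_len, delta_units):
    """True if gaining (+) or losing (-) delta_units repeat units of
    length unit_len changes the reading frame of the surrounding ORF."""
    return (unit_len * delta_units) % 3 != 0


# Gain of one repeat unit for each repeat-unit length from one to nine.
for k in range(1, 10):
    print(k, causes_frameshift(k, 1))
```

Under this rule, single-unit changes in trinucleotide and hexanucleotide SSRs never shift the frame, whereas a tetranucleotide SSR returns to the original frame only after a net change of three units.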
3.2. Identification of repeat unit lengths likely to be associated with phase variable gene expression
Manual curation of each of the 223 SSRs allowed us to assess the likelihood of each SSR playing a role in modulating gene expression. Comparison of equivalent SSR loci (those located in the same relative genomic location) allowed the classification of each SSR into one of three categories: (1) SSRs that did not vary in length, sequence or position between the four genomes (invariable), (2) SSRs for which some variation between strains was observed but the variation was not considered likely to result in variation of gene expression (variable) and (3) SSRs that both varied in length and were located in regions consistent with mediating phase variation (potentially phase variable; see ). Careful manual examination of each of the repeat associated loci was necessary to classify the SSRs into the above categories. Factors such as the location of the SSR within a gene, length of the SSR and replacement of a whole or partial tract of an SSR by another sequence contributed to the assessment of whether or not any observed variation in the SSR was likely to mediate phase variation. SSRs located outside ORFs were generally more difficult to assess as to their likely involvement in phase variable modulation of gene expression. SSRs have previously been shown to be mediators of phase variation through modulation of promoter activity and gene transcription (Dawid et al., 1999; Martin et al., 2005; van Ham et al., 1993), but promoter regions of individual genes often cannot be accurately defined. Thus, the influence of variation in SSRs located in non-coding regions on expression of adjacent genes is difficult to predict. The full assessment of the 223 SSRs identified within the four genomes can be found in Supplementary Table 1; a summary of the data is provided in .
Assessment of SSR variability and potential to mediate phase variation in the four genome study.
The manual classification of the SSRs into the three categories indicated that, despite the considerable variation seen between strains for many of the SSRs (especially the mononucleotide SSRs), the tetranucleotide, pentanucleotide and heptanucleotide repeat tracts were the only types of SSR considered to have a potential role in mediating phase variable gene expression in these four strains of Hi. The potentially phase variable ORFs associated with each of these types of repeat are detailed below.
3.3. Tetranucleotide SSRs identified in the Hi four genome study
A previous analysis of the Rd KW20 genome sequence identified the primacy of tetranucleotide SSRs in mediating phase variation in Hi (Hood et al., 1996b). This study extends that work by comparing the tetranucleotide SSRs across four Hi genomes. We identified 18 different tetranucleotide SSR loci that are distributed fairly uniformly between genomes, with each genome containing from 12 to 14 tetranucleotide SSR loci. Eight of the tetranucleotide SSR loci were found in all four of the strains, two of the loci were found in three of the strains, three of the loci in two of the strains and five of the loci were unique to one strain (two unique loci in each of the strains Rd KW20 and 86-028NP and one unique locus in R2846).
The sequence and number of repeat units that comprise each of the 199 tetranucleotide SSRs identified in 16 H. influenzae genomes.
Two of the tetranucleotide SSR loci that we have identified in the four genome analysis have not previously been described as potential mediators of phase variation in Hi. The first of these novel loci contains 14 tandem 5′AGTC repeats and is unique to strain R2846 (starting at nucleotide 1505819; see ). This SSR is found immediately downstream of the presumptive start codon of an ORF encoding a 294 aa protein with homology to the glycosyltransferase 2 family of proteins (PFAM PF000535). In Hi, phase variable glycosyltransferases are frequently involved in LPS biosynthesis (Hood et al., 1996a). In strain 86-028NP the same glycosyltransferase is replaced with a different gene (NTHi_1053) that has high homology (e value of 1 × 10^−141, BLASTN) to the phosphoethanolamine transferase gene, lpt3, of Neisseria meningitidis (Mackinnon et al., 2002). This is the first report of a gene with significant homology to lpt3. Both NTHi_1053 and the gene encoding the putative glycosyltransferase have an atypically low G + C content (<30%), suggesting that they have been acquired by horizontal transfer. The finding that multiple, distinct gene insertions have occurred in the same region of the bacterial genome in different strains may indicate that this is a hotspot for recombination.
The second novel tetranucleotide SSR locus contains a 5′CCAA tract associated with a putative glycosyltransferase (gene NTHi_1769 in strain 86-028NP). This SSR is present in all four genomes examined with between 8 and 16 repeat units and constitutes the first example of a 5′CCAA tract that is associated with a gene other than iron utilisation genes in Hi (Jin et al., 1996; Morton and Stull, 1999).
3.4. Pentanucleotide SSRs identified as potential mediators of phase variation in Hi
A total of eight pentanucleotide SSR loci were identified across the four Hi strains investigated; six of these were located within ORFs (see Supplementary Table 1). The length of the pentanucleotide SSRs ranged from three to twelve units but the majority were of the minimum threshold value of three units. Two of the pentanucleotide SSRs located within ORFs are of particular interest. The first of these pentanucleotide SSRs is associated with the type I modification enzyme, HsdM (the ORF in the Rd KW20 genome (HI1287) is truncated due to the repeat), and has previously been implicated in the phase variable expression of this type I restriction-modification gene (Zaleski et al., 2005). The SSRs identified in the four genome study are one, two or four units in length. van Belkum et al. (1997) and van Belkum (1999) described length variation in the region of this pentanucleotide repeat in a survey of 20 Hi strains. Zaleski et al. (2005) estimated the phase variation rates of the (5′GACGA)4 (4 tandem repeats of the sequence 5′GACGA3′) pentanucleotide repeat at this locus from observations on the degree of bacterial lysis induced by exposure to phage HP1c1. The rates they recorded for a change from four to three pentanucleotide repeats in strain RM118 were high and equivalent to those previously measured for much longer tetranucleotide repeat tracts (De Bolle et al., 2000) in the same strain.
The second coding pentanucleotide SSR of interest (5′TCAGC) was found in a gene of the hmg locus that encodes a high molecular weight glycoform of the LPS (Hood et al., 2004). The two repeat unit pentanucleotide SSRs present in Rd KW20 and R2846 (within the ORFs HI0867 , respectively) are consistent with the expression of a putative LPS flippase, whilst the three unit SSR in R2866 is inconsistent with expression of this gene. It is noteworthy that these two potential phase variation-mediating pentanucleotide SSRs relate to gene functions (restriction-modification and LPS modification) whose expression has previously been reported to be phase varied by tetranucleotide SSRs.
3.5. Heptanucleotide SSRs as mediators of phase variation in Hi
Four heptanucleotide SSRs were found in the survey of the four Hi genomes, three of which we have designated as potential mediators of phase variation. Two of these heptanucleotide SSRs are located approximately 100 bp upstream of the hmw1a genes and have previously been described by Dawid et al. (1999). They reported that these SSRs are within the promoters of the hmw1a genes and that alteration of the number of repeat units present in these SSRs results in a modulation of gene expression. The exact mechanism by which these SSRs influence transcription of these genes remains to be determined but may involve modulation of transcription from two alternative start sites (Dawid et al., 1999). Strain R2846 has (5′TGAAAGA)17 and (5′TGAAAGA)16 for hmw1a , respectively, and strain 86-028NP has (5′TGAAAGA)17 and (5′TGAAAGA)23 units for hmw1a , respectively, but there are no equivalent loci or repeat tracts in the other two genomes.
The third heptanucleotide SSR with a potential to mediate phase variation is the (5′AACAACC)1-7 tract situated only 13 bp upstream of a gene encoding a member of the TonB-dependent receptor family (PF0593) that has similarity to Fe transport proteins. One unit of the repeat is found in the genomes of strains Rd KW20 and R2846, seven in R2866 and six in 86-028NP. Rd KW20 and R2866 appear to have full length ORFs but the 86-028NP ORF is disrupted by a frameshift unrelated to the SSR. The observed variation in the length of this SSR, together with its position so close to the start of the downstream ORF, led us to postulate that it may mediate phase variation in Hi.
3.6. Other types of SSRs identified in the Hi four genomes study are not considered to mediate phase variable gene expression
Analysis of mononucleotide, dinucleotide, trinucleotide and hexanucleotide SSRs in the four genome study did not provide any evidence to suggest to us that these classes of repeat were associated with phase variable gene expression as detailed below.
3.7. Mononucleotide SSRs in the four Hi genomes are predominantly short A or T tracts
Mononucleotide SSRs have previously been documented as important mediators of phase variation in species such as Neisseria meningitidis (Schoen et al., 2007), Bordetella pertussis (Gogol et al., 2007) and Campylobacter jejuni (Hofreuter et al., 2006; Pearson et al., 2007). Perhaps surprisingly, they have not been implicated in phase variation in Hi, although partial sequencing of the iga gene from some Hi biogroup aegyptius strains led the investigators to suggest that a G10 tract found in only one strain may have mediated phase variable expression of the gene (Kilian et al., 2002).
Our analysis of the mononucleotide SSR loci present in the four Hi genomes revealed a considerable degree of heterogeneity in this class of SSR between these strains. Sixty-four homopolymeric tracts were identified across the four genomes and Supplementary Table 1 summarises their characteristics. Of these, 28/64 (44%) were found within ORFs and, although variations were frequently observed between strains, they were not consistent with mediating phase variation (see Supplementary Table 1). The findings from the genome of strain Rd KW20 were representative of the distribution of mononucleotide SSRs found in the three other strains. All of the mononucleotide SSRs in this strain were A or T tracts (18/18) and most were of the minimum threshold length of 9 units (16/18). Comparison across the four strains revealed that the variation observed in the equivalent mononucleotide SSRs of 8–10 units usually occurred by the substitution of one of the bases within the homopolymeric tract with a different base (e.g. an (A)9 tract was found as (A)7CA in some strains). All substitutions interrupting the A or T homopolymeric SSRs were found to be G or C nucleotides, suggesting an uneven pattern of mutation.
Examination of the further three genomes identified some anomalous mononucleotide SSRs. The first is an exceptionally long (A)34 tract identified in strain R2866. This SSR was located 120 bp upstream of the start of the ORF encoding the adhesin Hia, an autotransporter protein that contains the YadA domain and is believed to bind vitronectin and aid survival in human serum (Cotter et al., 2005a; Hallström et al., 2006; Meng et al., 2006). This SSR is not obviously associated with a promoter region and its function, if any, remains unclear. The second and third are a (G)12 and a (C)11 repeat tract, both found in the genome of strain 86-028NP, which are noteworthy because mononucleotide SSRs of G or C residues are uncommon in Hi, reflecting the low G + C content of this organism (38%). The (G)12 SSR was within the 5′ end of ORF ntHI0694. This gene shows homology with genes encoding methyltransferases of the FkbM family, some of which are involved in the biosynthesis of methylated sugars in Rhizobium etli LPS (Duelli et al., 2001). This gene has not been identified in other Hi strains, and its presence suggests that 86-028NP LPS may be O-methylated. The (C)11 SSR was located 230 bp upstream of the acpP ). Members of the AcpP family are short proteins which are involved in the transfer of acyl groups and are considered housekeeping proteins. In the three other genomes the tract at the same location contains five C residues.
3.8. Dinucleotide SSRs in the Hi four genome study
Phase variation mediated by dinucleotide repeats has been documented previously in Hi. A (TA)9-11 tract, located in the promoter region of two divergently transcribed genes, hifA was shown to control fimbriae biogenesis in some strains (van Ham et al., 1993). The hif locus is present in only 20% of NTHi strains and, of the four strains analysed here, only R2866 contains the hifA genes. In this strain, however, the 5′TA tract was present as a 5′(TA)4ATTA sequence. The threshold value set for dinucleotide SSRs in this study was five; therefore, this tract was not identified as an SSR. Further discussion of this locus is found later in this paper.
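The reason the interrupted tract in strain R2866 falls below threshold can be checked with a small sketch. The helper name is hypothetical and the input is just the 12-nucleotide motif quoted above, not the surrounding genomic sequence.

```python
def longest_tandem_run(seq, unit):
    """Largest number of consecutive perfect copies of unit found in seq."""
    k = len(unit)
    best = 0
    for i in range(len(seq) - k + 1):
        n = 0
        # Count perfect copies of the unit starting at position i.
        while seq[i + n * k:i + (n + 1) * k] == unit:
            n += 1
        best = max(best, n)
    return best


DINUCLEOTIDE_THRESHOLD = 5  # threshold stated in the text for dinucleotide SSRs

interrupted = "TA" * 4 + "ATTA"  # the 5'(TA)4ATTA sequence described above
run = longest_tandem_run(interrupted, "TA")
print(run, run >= DINUCLEOTIDE_THRESHOLD)
```

The longest perfect run is four units, one short of the threshold, so a perfect-repeat search of this kind does not report the locus even though it is equivalent to a known phase variable tract.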
Eight dinucleotide SSR loci were identified in this four genome analysis, all of which were found to be of the threshold value of five repeat units in length. Five were located within coding regions. In a similar fashion to the variation observed for many of the mononucleotide SSRs, seven of the dinucleotide SSR loci were found to have sequence variations that did not alter the overall length of the sequence between strains and so would not cause frameshifts consistent with phase variation. For example, a (CA)5 repeat conserved in the genomes of strains Rd KW20, R2846 and 86-028NP was found to be replaced with CACG(CA)3 in strain R2866.
3.9. Trinucleotide SSRs were predominantly found to be located within ORFs
Eighteen trinucleotide SSRs were identified in this study, the majority of which (15/18) were located within coding regions. All of these 15 SSRs consisted of no more than four repeat units and, where variation in the repeats was observed between strains, it resulted either in a reduction in length of the SSR or in disruption of the sequence whilst maintaining the same length. The three trinucleotide SSRs that were found in non-coding regions showed greater variation in overall length but were not within identified promoter or other regulatory regions.
3.10. Hexanucleotide SSRs identified in the four genome study
Ten hexanucleotide SSRs were identified within the four genomes, eight within ORFs and two in non-coding regions. Variation in coding hexanucleotide repeats can lead to altered amino acid sequence but not phase variable gene expression. The coding region hexanucleotide SSRs identified in this study were either conserved or, like the mononucleotide and dinucleotide variations discussed above, showed changes in sequence but not length and thus were inconsistent with modulating phase variable expression. Of the two non-coding region associated hexanucleotide repeats, one is conserved across all four strains and is present downstream of the closest ORFs, whilst the other 5′TTAAAA SSR is present as three repeat units in Rd KW20, two units in 86-028NP and as two units plus an interrupted third repeat unit in R2866 and R2846. This SSR is situated 19 bp from the start codon of HI0525 in strain Rd KW20, which encodes a phosphoglycerate kinase involved in central metabolism and the influence of this SSR on the expression of this ORF is unknown.
3.11. SSRs with repeat units greater than 7 nucleotides are not found at high frequency in the four genomes
Of the limited number of SSRs with repeat unit lengths greater than seven nucleotides that were identified in the four genome study, most were found in only one strain. These include a nonanucleotide SSR found in the genome of strain 86-028NP. This (5′GTTTTCTTA)19 SSR was found to be located 92 bp upstream of the hmw2C gene. As discussed above, variations in heptanucleotide SSRs associated with the hmw2A loci are thought to modulate gene expression, but the function of this nonanucleotide SSR is not known. An octanucleotide (5′ATTATTTG) SSR, however, was found in multiple strains, varying in length between 1 and 6 repeat units. It was found to be located between the divergently transcribed cmkB and pdxS genes, which encode a cytidylate kinase 2 and a pyridoxal biosynthesis lyase, respectively (designated HI1646 and HI1647 in strain Rd KW20). Both are suggested to play roles in metabolism, so it is uncertain whether this SSR would actually be utilised in modulating their expression.
3.12. Analysis of the SSRs of a further 12 genomes
Whilst the SSR analysis of the four Hi genomes was ongoing, 12 further Hi genomes were sequenced and the resulting full or partial sequences made publicly available (listed in ). These 12 additional genome sequences offered us the chance to confirm and extend our detailed SSR analysis of the four Hi genomes.
Using the same SSR search methods and threshold values described for the four genome study, 765 SSRs were identified in these 12 additional genomes (summarised in ). From these data it was seen that mononucleotide SSRs are found in 10 out of 12 of the additional genomes at a higher frequency than was observed in the four genome study. It should be noted that the 454 sequencing technology used to generate the majority of the further genome sequences has a decreased fidelity for mononucleotide tracts, which may account, to some extent, for the higher number of mononucleotide SSRs detected in these strains. However, the F3031 and F3043 genomes, for which the highest number of mononucleotide SSRs were identified, were sequenced using ABI Sanger dideoxy sequencing technology.
The number of SSRs, of each repeat unit length, in each genome.
In a high proportion of the additional genomes, tetranucleotide and hexanucleotide SSRs were also observed more frequently than in the four genome study.
3.13. Tetranucleotide SSRs in the complete genome collection
The nine NTHi genomes sequenced by the Center for Genomic Sciences (Hogg et al., 2007) (see ), and the genome of strain 10810, contained a similar number of tetranucleotide SSRs to that previously observed in the four genome study (12-14 per genome) and only two novel tetranucleotide SSR loci were identified. Conversely, in the genomes of strains F3031 and F3043, 18 and 21 tetranucleotide SSR loci were identified, respectively, and eight of these tetranucleotide SSRs were not identified in any of the previously analysed genomes (see ).
3.14. Ten novel tetranucleotide SSRs
A total of ten novel tetranucleotide SSR loci were identified in the additional twelve genomes. One locus, licA2 is a duplication of the licA locus reported in the four genome study (Fox et al., 2008). Five of the novel tetranucleotide SSRs were associated with genes encoding members of the trimeric autotransporter protein family, which commonly contain a C-terminal YadA domain (PFAM03895) (Cotter et al., 2005b; Koretke et al., 2006). All five of these paralogous loci were present in the two Hi biogroup aegyptius strains F3031 and F3043 and one of the loci was also present in the genome of the NTHi strain, PittHH. Previously described members of this family of proteins from Hi include the adhesins Hsf and Hia, which have been implicated in virulence (Cotter et al., 2005a; Surana et al., 2004). It can be envisaged that the expression of adhesins may not be advantageous in all growth conditions as they are possible targets for the host immune system and are large proteins (up to 1016 aa) whose expression would require considerable resources. Indeed, the NadA protein from N. meningitidis, which is a member of this family of proteins, has previously been shown to be phase variably expressed (Capecchi et al., 2005; Martin et al., 2005). Hi biogroup aegyptius strains have been associated with atypical invasive disease and it is, therefore, tempting to speculate that the high number of putative phase variable adhesins identified in strains F3031 and F3043 may somehow contribute to the unusual clinical outcomes associated with these strains.
An additional four novel tetranucleotide SSRs were identified from strains F3031 and/or F3043. The first, a (5′ATTA)9 SSR, is found 225 bp upstream of a gene encoding a putative DNA repair enzyme, formamidopyrimidine-DNA glycosylase MutM, in strain F3043. The equivalent position in other Hi strains contains 3 copies of the 5′ATTA repeat unit. The role of this repeat in expression of MutM is unknown but variations in the expression of mutM could potentially result in altered mutation rates in Hi (Horst et al., 1999).
The second novel tetranucleotide SSR identified from strains F3031 and F3043 is a 5′CAAT SSR contained within the 5′ region of an ORF that encodes a putative glycosyltransferase with homologies to glycosyltransferase family 8 (PFAM01501). Homologues of this gene are found in other strains of Hi (including HI0223 in strain Rd KW20) but without the associated SSR. The function of this gene is unknown but it may contribute to LPS expression in strains F3031 and F3043.
The third of the four additional novel tetranucleotide SSR loci contains (5′CAAT)21 and was found only in strain F3043. It is located within the 5′ end of an ORF that encodes a putative S-adenosylmethionine (SAM)-dependent methyltransferase and shows some homology to HI0096 in strain Rd KW20. SAM-dependent methyltransferases have been implicated in various cellular processes including protein trafficking and sorting, signal transduction, biosynthesis, metabolism, and gene expression.
The final novel tetranucleotide SSR identified in strain F3031 is a (5′CAAG)32 SSR located 58 bp upstream of a gene encoding an adenine specific methylase homologue (EcoRI) and 202 bp upstream of the divergently transcribed htpX (which encodes a putative protease protein, induced by heat shock in E. coli). HtpX has not been investigated in Hi but in E. coli it is part of the membrane-localised proteolytic system and may play a part in the degradation of unstable membrane proteins (Sakoh et al., 2005).
3.15. Consideration of characteristics of tetranucleotide repeats from the complete genome collection
In total, 199 tetranucleotide SSRs associated with 28 different loci and consisting of nine different repeat unit sequences have been identified in the complete genome collection. The distribution of tetranucleotide SSR length, and the relationship between length of tetranucleotide SSR and strain, are shown in Fig. 1. The length of an individual tetranucleotide SSR does not appear to be dependent on strain background, repeat unit sequence or locus (Fig. 1B), and a wide degree of variation and considerable overlap between groupings is observed. Fig. 1A shows that, despite differences in the source, date of isolation and associated clinical symptoms of the different strains, there is an approximately normal distribution of tetranucleotide SSR lengths. Fig. 1B shows that the two Hi biogroup aegyptius strains F3031 and F3043, which are associated with unusual clinical symptoms and have the highest number of tetranucleotide SSRs, display a similar distribution of SSR lengths to all other strains.
Fig. 1 Histogram and boxplot representations of the length distribution, sequence, strain and loci associations of tetranucleotide SSRs. (A) Frequency histogram of tetranucleotide SSR length distribution in the complete genome study. (B) Boxplot analysis of ...
3.16. Consequences of the sequence and location of tetranucleotide SSR
As noted previously, the tetranucleotide SSRs identified in the four genome study of Hi are located within ORFs and, with only two exceptions, are located immediately adjacent to or just downstream of the translational start site. In this position, any frameshift due to variation in length of the SSR would result in a peptide being made from the incorrect reading frame and a premature stop to translation. The location of tetranucleotide SSRs within the 5′ region of the ORFs limits the encoded tetrapeptide repeat to the N-terminus of the respective protein. The two exceptions to this pattern are the 5′GCAA tetranucleotide SSR located in the middle of the oafA gene that has been previously described (Fox et al., 2005), and a 5′GACA tetranucleotide SSR located in the 3′ region of a gene encoding a putative glycosyltransferase (pgt1) in the genomes of strains R2846, 86-028NP, 3655 and PittEE. These repeats may modulate the protein function rather than control ON/OFF switching of its expression.
Tetranucleotide SSRs may constitute a substantial proportion of the coding region of a gene and thus the repeat unit sequence will have a significant influence on the amino acid composition of the encoded protein. The constraints that this may impose, in terms of permissible tetranucleotide sequences, have not been well characterised, although High et al. (1996) suggest that the peptides encoded by the repeat regions form structurally flexible regions that loop out of the protein structure and therefore do not interfere with tertiary structure.
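How a tetranucleotide tract gives rise to a tetrapeptide repeat can be shown with a minimal translation sketch. It assumes the standard genetic code, uses hypothetical names, and takes a toy (5′CAAT)6 tract as input; the reading frame in which any particular gene reads its repeat is not implied by this example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, frame=0):
    """Translate dna from the given frame offset, ignoring trailing bases."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)


tract = "CAAT" * 6  # toy tract: six tandem copies of 5'CAAT
for frame in range(3):
    print(frame, translate(tract, frame))
```

Because four and three share no common factor, each 12-nucleotide stretch of the tract encodes one tetrapeptide, and all three frames yield the same four residues in cyclically shifted order. A frameshift therefore leaves the repeat-encoded peptide largely unchanged while altering everything translated downstream of the tract.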
An in silico analysis of the repeat sequences identified in the 16 Hi genomes analysed was performed and showed that hydrophilic amino acids are over-represented in the SSR encoded peptides, compared with their frequency in the normal proteome. Of the eight tetranucleotide repeat sequences found within ORFs in Hi, five encode hydrophilic peptides with no net charge (5′CAAT, 5′GACA, 5′CCAA, 5′AGCC, and 5′AGTC), one encodes a hydrophobic peptide with no net charge (5′TTTA) and two encode hydrophilic peptides with a net positive charge (5′GCAA and 5′CAAA). The high proportion of hydrophilic peptides encoded by the tetranucleotide SSRs and their frequent N-terminal location suggest that they are likely to be surface exposed and have the opportunity to ‘loop out’ of the folded protein structure and thus be less likely to interfere with the tertiary structure of the protein and, therefore, its function. The exception is a 5′TTTA tetranucleotide SSR which encodes a hydrophobic peptide within a putative drug/metabolite exporter (HI0687 in strain Rd KW20). Transmembrane helix predictions (TMHMM server v2.0, Sonnhammer et al., 1998) suggest that the portion of this protein encoded by the SSR lies entirely within a transmembrane domain. Examination of homologues of the HI0687 gene indicates that the hydrophobic nature of such transmembrane helices is well conserved but often the primary sequence is not (data not shown). Another observation of this study was that, although SSRs of a particular tetranucleotide repeat unit sequence have previously been associated with genes of related function, e.g. 5′CCAA tracts with genes encoding iron utilisation proteins (Jin et al., 1996; Morton and Stull, 1999), we have found no evidence of a particular tetranucleotide repeat unit sequence being restricted to a particular class of gene.
3.17. Interrupted tetranucleotide SSR may be an indication of intra-genome recombination between paralogous loci
One feature of certain tetranucleotide SSR, noted during the course of this study was their interruption by an imperfect repeat unit. All of the genomes in this study were found to contain between one and four related, hemoglobin/hemoglobin–haptoglobin-binding (hgp
) genes containing 5′CCAA SSR that show considerable variation in length. Hi
lacks most of the genes of the heme biosynthetic pathway and requires hemoglobin/hemoglobin–haptoglobin-binding proteins to capture heme-containing compounds required for growth (Morton and Stull, 1999
). In total, seven interrupted tetranucleotide SSRs were observed in this analysis, of which six were associated with hgp
genes. We postulate that homologous recombination between these paralogous loci may occasionally generate imperfect repeats. It will be of interest to ascertain whether similar events occur between other duplicated loci, e.g. paralogous adhesin genes (discussed below), partially or fully duplicated hifA
loci and duplicated lic1A
genes (strain PittGG).
3.18. Mononucleotide SSRs identified as potential mediators of phase variation in the analysis of the 12 further genomes
In contrast to the four genome study, the analysis of the 12 additional Hi genomes has identified a number of mononucleotide SSRs with the potential to mediate phase variable gene expression. These mononucleotide SSRs were located within the 5′ coding regions of ORFs, associated with frameshift mutations, or located within potential promoter regions. The potential phase variable genes include those encoding virulence-related factors such as glycosyltransferases, type-I restriction modification systems, haemagglutinins, YadA domain containing proteins, pilin genes and a Fe–S cluster assembly scaffold protein (see ). This study offers the first indication that mononucleotide SSRs may mediate phase variation in Hi.
Notable non-tetranucleotide SSRs identified in complete H. influenzae genome collection.
Further support for a role of the mononucleotide SSRs in mediating phase variation in Hi
comes from the observation that some genes identified in the 12 genome study have previously been shown to be phase variable, but through other classes of SSRs. An example is the divergently transcribed pilin genes, hifA
which Geluk et al. (1998)
demonstrated to be phase variable due to variation in the length of a 5′TA SSR located between them and 104–225 bp upstream of the hifA
gene. Changes in the length of the dinucleotide SSR were proposed to alter the spacing between the −10 and −35 promoter sequences and therefore alter expression of the genes. In strain F3031, there are four hifA
loci in total. Two of the loci have an arrangement similar to that described by Geluk et al. (1998)
with the 5′TA SSR located between the divergently transcribed hifA
genes whilst the other two hifA
loci have mononucleotide (A17 or A12) instead of dinucleotide SSRs located either 63 or 93 bp upstream of hifA.
There is no hifB
gene associated with these latter loci. Phase variation of pilin expression mediated by mononucleotide SSRs has not been previously reported in Hi
. The exact location and extent of the promoter region of hifA
has not been mapped in these strains, but the position of the mononucleotide SSRs makes them candidates to mediate phase variation.
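The spacing model proposed for the dinucleotide SSR can be expressed as simple arithmetic. The sketch below is illustrative only and makes several assumptions that are not taken from this study: the repeat tract is placed between the −35 and −10 elements, `FIXED_BP` (the non-repeat portion of the spacer) is a hypothetical value, and the 17 ± 1 bp window is the general figure for σ70-type promoters rather than a value measured for hifA.

```python
def spacer_length(n_units, unit_len, fixed_bp):
    """Spacer length (bp) between the -35 and -10 elements for a given repeat count."""
    return fixed_bp + n_units * unit_len


def within_window(spacer, optimum=17, tolerance=1):
    """True if the spacer falls inside the assumed optimal window (17 +/- 1 bp)."""
    return abs(spacer - optimum) <= tolerance


FIXED_BP = 0  # hypothetical: spacer assumed to consist of the repeat tract only

if __name__ == "__main__":
    # Dinucleotide tract: each unit gained or lost changes the spacer by 2 bp.
    for n in range(7, 12):
        s = spacer_length(n, unit_len=2, fixed_bp=FIXED_BP)
        print(f"(TA){n}: spacer {s} bp, within window: {within_window(s)}")
```

Under this model a mononucleotide tract would change the spacer in 1 bp steps, which is the only point of contrast with the dinucleotide case that the sketch is meant to convey.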
In the four genome study, all homopolymeric A or T tracts were less than 11 bp long with only one exception, an A34 tract found in the R2866 genome. In the additional twelve genomes, a similar tract of between 20 and 49 bp was found in four strains in the same genomic location, approximately 120 bp upstream of the nearest ORF, which encodes a protein with homology to YadA-domain containing proteins such as Hsf. The function of this SSR is unknown, but it is tempting to speculate that it may play a role in regulating the expression of the downstream Hsf-like protein. Similarly, in the genome of strain F3031 the expression of a number of YadA domain containing proteins was suggested to be mediated by tetranucleotide SSRs (see ). However, in one instance, the expression of a YadA domain containing protein in this strain is potentially mediated by a G13 SSR (at base 548152) located 10 bp within the ORF (see ). The association of mononucleotide SSRs, in certain strains, with paralogs of genes that are phase variable through other SSRs offers strong circumstantial evidence that these mononucleotide SSRs may mediate phase variation in Hi.
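Homopolymeric tracts of the kind described above can be located with a simple pattern search. The following is a minimal sketch and not the software used in this study (the thresholds actually applied are described in Section 2); the function name, the toy sequence and the cut-off of 11 bp used in the example are illustrative assumptions.

```python
import re


def find_homopolymers(seq, min_len):
    """Return (start, base, length) for every homopolymeric run of at least min_len bp.

    Start positions are 0-based. Only A, C, G and T are considered.
    """
    # One base followed by at least (min_len - 1) further copies of the same base.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [
        (m.start(), m.group(1), m.end() - m.start())
        for m in pattern.finditer(seq.upper())
    ]


if __name__ == "__main__":
    # Toy sequence containing an A34 tract, a short T tract and a G13 tract.
    toy = "GC" + "A" * 34 + "CGT" + "T" * 5 + "G" * 13 + "C"
    for start, base, length in find_homopolymers(toy, min_len=11):
        print(f"{base}{length} at position {start}")
```

Running the same function on the reverse-complement strand, or treating A and T tracts as equivalent, would be a natural extension when the orientation of the neighbouring ORF matters.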
3.19. Other potentially phase variable SSRs in the complete genome collection
The heptanucleotide SSRs associated with the hmw1a and hmw2a genes in the four genome study were also identified in the genomes of strains PittEE and R3655 among the 12 further Hi genomes analysed. In PittEE, 13 copies of the heptanucleotide repeat are present 69 bp upstream of hmw1a and 38 copies 104 bp upstream of hmw2a, and in R3655, 16 copies of the repeat are present 106 bp upstream of hmw1a. In the further genome study, an additional novel heptanucleotide SSR associated with hmw1a was identified in the genome of strain PittAA. Interestingly, this SSR, consisting of (5′AATTTTG)14, was located 3.5 kb within the 7.3 kb putative full-length ORF rather than in the promoter region, and the ORF contained a frameshift consistent with variation in the length of the SSR. The further genome analysis also identified an octanucleotide SSR associated with hmw loci. This SSR contained twelve to fifteen copies of a 5′GCATCATC repeat and was located 200–213 nucleotides upstream of the hmw1a and hmw2a loci of strains F3043 and F3031.
A further novel heptanucleotide SSR with the potential to mediate phase variation was identified in strains 22.4.21 and R3655, within an ORF which is a homologue of the HI1369 gene (encoding a putative TonB dependent iron ligand gated channel). Thirteen units of the 5′AACAACC repeat are found in strain 22.4.21 and eight in strain R3655; in the latter, the ORF is truncated by a frameshift.
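The link between repeat number and reading frame for an intragenic SSR follows from modular arithmetic: a change of n units of length k moves the downstream reading frame by (n × k) mod 3. The sketch below illustrates this for the repeat counts quoted above. The function names are hypothetical, and the calculation shows only that two alleles lie in different frames relative to each other, not which allele is in frame.

```python
def frame_offset(unit_len, n_units_a, n_units_b):
    """Relative reading-frame offset (0, 1 or 2) between two alleles of an intragenic SSR."""
    return (unit_len * (n_units_a - n_units_b)) % 3


def single_unit_change_shifts_frame(unit_len):
    """True if gaining or losing one repeat unit alters the downstream reading frame."""
    return unit_len % 3 != 0


if __name__ == "__main__":
    # 5'AACAACC heptanucleotide repeat: 13 units versus 8 units.
    print("offset between 13 and 8 heptanucleotide units:", frame_offset(7, 13, 8))

    # Repeat unit lengths of one to nine nucleotides, as defined for SSRs in this study.
    for k in range(1, 10):
        print(f"unit length {k}: one-unit change shifts frame = "
              f"{single_unit_change_shifts_frame(k)}")
```

For unit lengths that are multiples of three (3, 6 and 9), a change in repeat number adds or removes whole codons and so cannot switch translation on or off by frameshifting.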
An octanucleotide SSR identified in the four genome study as containing one, four or six copies of a 5′ATTATTTG unit 12 bp upstream of a gene encoding a pyridoxine biosynthesis protein, was also identified in seven strains of the further twelve genome collection (four copies of the SSR in strains PittHH, PittAA and 22.1.21 and six copies in strains R3655, PittII, F3043 and Hib; see ). However, the limited range of variation and relatively short length of this SSR are not what would be expected at a classically phase variable locus and so the significance of this SSR at this location remains uncertain.