We sequenced and annotated the genome of N. meningitidis serogroup C strain FAM18 by standard methods and protocols. The addition of the sequence data present in the FAM18 genome sequence to that from the two previously sequenced N. meningitidis genomes (serogroup A strain Z2491 and serogroup B strain MC58) enabled a three-way whole genome comparison between representatives of three disease-associated lineages within this species. shows the general features of these strains and their genomes. For convenience we will subsequently refer to the three strains simply as Z2491, MC58, and FAM18.
The three sequenced N. meningitidis genomes are largely colinear with only three apparent reciprocal inversions around the origin of replication (). Each genome appears to represent one of these inversions relative to the other two, suggesting that these inversions may have occurred once in each lineage. However, the foci of the inversions show a high degree of variability relative to the strong synteny across the chromosome, suggesting that they may be subject to frequent additional inter- and intra-genomic recombination.
Rearrangements between the Meningococcal Genomes
The inversion event closest to the origin of replication (IE1; 3′-adjacent to NMA0220/NMB0050/NMC0034) seems to be due to recombination between repeat arrays in Z2491. Interestingly, one of the arrays flanks a pilin gene, pilC2, while the other is adjacent to genes involved in pilus retraction (pilTU). Thus, in MC58 and FAM18, pilC2 is adjacent to pilTU, while in Z2491 they are distant. Neisserial PilC proteins are important components of the type IV pilus machinery involved in adhesion to host cells, promotion of piliation, and transformation competence [27]. The PilT protein is essential for pilus retraction [29], and it has been shown that PilC1 regulates PilT-mediated pilus fibre retraction [28]. Although there is no direct published evidence for coregulated transcription of pilTU and pilC2, it seems plausible that this rearrangement may have an effect on pilus phenotype. The FAM18 pilus regulon also differs from those of Z2491 and MC58 because of the deletion of much of the pilE/S locus and the insertion of the class II pilin-encoding pilE2 (see below). Further variability at IE1 can be seen with the insertion of a copy of a meningococcal disease associated (MDA) island in the repeat array (see below) directly upstream of pilC2 in FAM18. The MDA island encodes a filamentous bacteriophage that is secreted via the type IV pilus and is specifically associated with strains that have the potential to cause disease [30].
The second inversion, IE2 (5′-adjacent to NMA2200/NMB0287/NMC0293), is probably due to recombination between copies of IS1106 in FAM18 and is also associated with the insertion of a locus encoding a putative restriction-modification system in FAM18. Restriction-modification systems coordinate the recognition and destruction of “non-self” DNA from sources lacking the same system and, in N. meningitidis, have been associated with specific lineages [31].
The third, IE3, is the most complex of the three inversion events and seems to be due to recombination between loci encoding a large repetitive surface protein and its associated secretion system (NMA0688, NMB0497, NMB1779, and NMC0444; see below). These loci appear to encode two-partner, or type V, secretion systems [32] and are similar in sequence and genetic arrangement to those of Bordetella species, where fhaC and fhaB, respectively, encode a secretion accessory protein and a filamentous haemagglutinin important in virulence [33]. Z2491 and FAM18 have a single copy of this locus, while MC58 has two copies that are approximately equidistant from the origin of replication and are the foci of the rearrangement. Prior to duplication the locus also acquired a novel set of two-partner secretion protein genes and an MDA-related prophage. Duplication of the whole locus and subsequent recombination involving another MDA island may have led to the current genomic arrangement, which would appear to have benefited MC58 with a greater potential variety of surface protein expression.
All three of the reciprocal inversion foci that we have described here seem to affect genetic loci with the potential to modulate interaction with the host and/or other strains of N. meningitidis. Despite frequent inter-strain recombination, N. meningitidis genomes maintain a high level of colinearity, so it may be the case that the rearrangements observed in this three-genome comparison have added significance.
Three-Way Coding Sequence Comparison
The predicted amino acid sequences of the coding sequences (CDS) from each of the genome annotations were compared by three-way reciprocal Fasta analysis to assess the numbers of orthologous and unique CDS. The latter were defined as CDS for which a reciprocal match was not detected in either of the other two translated genome sequences. Visualisation and manual curation of the results of this analysis using the Artemis Comparison Tool revealed limitations of the test. The analysis did not take into account the relative chromosomal position of the genes, so the best matches between genes of the different genomes could be those that are in different chromosomal contexts and, therefore, likely to be paralogues (genes of similar sequence in the same genome) rather than true orthologues. Examples of characteristic features in N. meningitidis that confound the reciprocal match test include CDS within loci encoding variable surface proteins such as adhesins or haemagglutinins. In some cases multiple paralogous loci exist within each genome and may be exchanging DNA by intra- and/or inter-genomic recombination. The result is that syntenic loci (those in the same position) are as diverged from one another as they are from nonsyntenic loci (see below). For convenience we have designated such genes as “variable” to distinguish them from simple orthologues. Paralogous CDS at nonsyntenic loci are also designated as variable.
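The reciprocal match test described above can be illustrated with a minimal sketch. It assumes that best hits between each pair of genomes have already been computed by a protein search and stored in a dictionary; the function names, data layout, and toy identifiers are hypothetical and are not taken from the analysis pipeline used here. Like the test itself, the sketch ignores chromosomal position, so it cannot separate true orthologues from paralogues.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

# Hypothetical input: best_hits[(src, dst)] maps each CDS in genome `src`
# to its top-scoring CDS in genome `dst` (e.g. parsed from search output).
BestHits = Dict[Tuple[str, str], Dict[str, str]]


def reciprocal_partner(cds: str, src: str, dst: str,
                       best_hits: BestHits) -> Optional[str]:
    """Return the CDS in `dst` that is a reciprocal best hit of `cds`, else None."""
    hit = best_hits.get((src, dst), {}).get(cds)
    if hit is not None and best_hits.get((dst, src), {}).get(hit) == cds:
        return hit
    return None


def unique_cds(genome: str, cds_ids: Iterable[str], genomes: Iterable[str],
               best_hits: BestHits) -> Set[str]:
    """CDS of `genome` with no reciprocal match in either of the other genomes."""
    others = [g for g in genomes if g != genome]
    return {
        c for c in cds_ids
        if all(reciprocal_partner(c, genome, o, best_hits) is None for o in others)
    }


# Toy identifiers for illustration only (not real locus tags).
GENOMES = ["Z2491", "MC58", "FAM18"]
TOY_HITS: BestHits = {
    ("FAM18", "MC58"): {"f1": "m1", "f2": "m2", "f3": "m2"},
    ("MC58", "FAM18"): {"m1": "f1", "m2": "f2"},
    ("FAM18", "Z2491"): {"f1": "z1", "f3": "z1"},
    ("Z2491", "FAM18"): {"z1": "f1"},
}

print(unique_cds("FAM18", ["f1", "f2", "f3"], GENOMES, TOY_HITS))  # {'f3'}
```

In the toy data, f3 has best hits in both other genomes, but neither hit points back to f3, so it is reported as unique.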
Variable genes tend to occur in clusters, and there is a clear correlation between these gene clusters and regions of low % G + C content. Viewed on a whole genome scale, eight of the nine most prominent GC troughs across the genome of FAM18 coincide with variable loci, the one exception being the ribosomal protein operon (NMC0129–NMC0159) (Figure S1). It was observed in Helicobacter pylori that genes that are not universally present across a number of strains, and are therefore likely to be laterally acquired, tend to have a lower than average GC content [34], and a similar bias has been seen in related enteric genomes [35]. It has been suggested that accessory genes (those variably present in different strains within a species) may be subject to different selective pressures from the core genes, and that low % G + C content is one of the results of this difference [36]. It is therefore possible that the low % G + C nature of the variable genes in N. meningitidis may be a consequence of selection for exchange within the species.
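As an illustration of how GC troughs of this kind can be located, the following is a minimal sliding-window sketch. The window size, step, and threshold are arbitrary assumptions chosen for the toy example, not the parameters used to produce Figure S1.

```python
from typing import List, Tuple


def gc_percent(seq: str) -> float:
    """Percentage of G and C bases in `seq` (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def gc_windows(seq: str, window: int, step: int) -> List[Tuple[int, float]]:
    """Return (start, %GC) for every full window along the sequence."""
    return [
        (start, gc_percent(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


def gc_troughs(seq: str, window: int, step: int, threshold: float) -> List[int]:
    """Start positions of windows whose %GC falls below `threshold`."""
    return [start for start, gc in gc_windows(seq, window, step) if gc < threshold]


# Toy sequence: a GC-rich region, an AT-rich "trough", then GC-rich again.
toy = "GC" * 10 + "AT" * 10 + "GC" * 10
print(gc_troughs(toy, window=20, step=20, threshold=50.0))  # [20]
```

On a real chromosome the window would typically span hundreds to thousands of bases, and trough positions would then be compared with the coordinates of the variable loci.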
In addition, the three meningococcal genome sequences were compared using ACEDB, as described previously [37], to identify unique coding sequences, regardless of their annotation in their respective genomes. The results of these two analysis methodologies were combined and, following manual curation, 240 unique genes were identified: 83 (4.1%) in Z2491, 97 (4.8%) in MC58, and 60 (3.0%) in FAM18. summarizes the types and numbers of regions containing unique genes and Table S1 details the individual CDS functional annotations. The majority encode hypothetical proteins of unknown function. This is to be expected, because strain-specific genes are generally poorly studied and largely do not form part of the common and core metabolic functions that have been most studied and are most readily identifiable through comparison with other well-studied species and biochemical pathways. There are some unique restriction-modification systems, and these would be expected to have an impact on the uptake of DNA from N. meningitidis strains in the same niche; such systems have previously been shown to be associated with different lineages [5].
Unique Regions in the Meningococcal Genome Sequences
The majority (39 of 56) of the unique gene clusters contain three or fewer consecutive genes, and 30 of these (68 genes) correspond to known or candidate Minimal Mobile Elements [40] with alternative unique loci present at syntenic locations across the three genomes. With the exception of dam in FAM18, all of the unique restriction-modification system genes are within MMEpheST.
Larger unique regions are often associated with insertion sequence elements (nine of 56 clusters; 53 genes; Table S1) and with a Mu-like prophage (pnm2) present at the same location in all three genomes (between NMA1280 and NMA1323, NMB1077 and NMB1112, and NMC1041 and NMC1056). The IS-associated unique CDS are often small and lie in low % G + C troughs. They are also mostly annotated as “hypothetical proteins” with little information available to allow prediction of the effect of their differential presence. MC58 carries the largest version of prophage pnm2, which includes CDS encoding cell surface antigens able to induce bactericidal antibodies in mice [41]. The presence of unique genes in each genome within these prophage could be due to independent phage insertions, differential gene loss from a larger prophage inserted in a common ancestor, or intergenomic recombination between prophage. Z2491 contains a large unique region of 63 annotated genes (NMA1821–NMA1885), which constitutes a Mu-like prophage (pnm1) shown to be conserved among epidemic serogroup A strains [42], though an association with virulence has not been demonstrated. FAM18 contains a region (NMC0852–NMC0895, IHT-E) that includes genes homologous to lambdoid bacteriophage genes and a transposon carrying a type I secretion system [26].
Repeat Arrays and Flanking Genes
As with other members of the species, the N. meningitidis FAM18 genome contains many hundreds of repetitive sequence elements, ranging from simple sequence repeats associated with phase variable genes (see below) to complete gene cluster duplications (). DNA uptake sequences (5′-GCCGTCTGAA-3′) are the most abundant repeats and are distributed throughout the genome [43]. Concordant with their G + C-rich sequence, they are less frequent in low % G + C regions, which often coincide with important genetic loci including those for ribosomal proteins, capsule biosynthesis, pilus biosynthesis, Maf adhesins, prophage, Iga protease, cytolysin transport, and RTX-family exoproteins.
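A minimal sketch of how DNA uptake sequences could be located on both strands of a sequence is given below. The 10-bp motif is the one quoted above; the helper names are hypothetical, and the approach (exact string matching with no mismatches allowed) is an assumption rather than the method used for the counts reported here.

```python
from typing import List

DUS = "GCCGTCTGAA"  # DNA uptake sequence quoted in the text, 5'->3'
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_all(seq: str, motif: str) -> List[int]:
    """All start positions of `motif` in `seq`, including overlapping ones."""
    positions = []
    i = seq.find(motif)
    while i != -1:
        positions.append(i)
        i = seq.find(motif, i + 1)
    return positions


def dus_positions(seq: str) -> List[int]:
    """Start positions of the DUS on either strand, in forward coordinates."""
    seq = seq.upper()
    forward = find_all(seq, DUS)
    reverse = find_all(seq, reverse_complement(DUS))
    return sorted(forward + reverse)


toy = "AAA" + DUS + "TTT" + reverse_complement(DUS) + "CC"
print(dus_positions(toy))  # [3, 16]
```

Dividing the number of hits per window by the window length would give a DUS density that can be compared directly with a % G + C profile.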
Noncoding Repeat Families Present in FAM18, Z2491, and MC58
The next most abundant repeat types are the “neisserial intergenic mosaic elements” (NIMEs), which consist of 20-bp inverted repeats (ATTCCCNNNNNNNNGGGAAT, dRS3 elements) flanking over 100 families of ~50–150-bp repeat sequences (RS elements) [17]. Also frequent are the “Correia repeat enclosed elements” (known as CREE or Correia elements), which comprise a conserved repeat sequence (156 bp full length or 51 bp internal deletant) bounded by a 51-bp inverted repeat. CREEs are often located upstream of genes [44], have been shown to affect gene expression [45], and may be transposable or mobilisable [47].
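The dRS3 consensus given above lends itself to a simple pattern search. The sketch below is an illustration only: it assumes exact matches to the conserved ends, treats N as any of A, C, G, or T, and takes the sequence between consecutive dRS3 hits as a candidate RS element. Function names are hypothetical. Because the conserved ends (ATTCCC and GGGAAT) are reverse complements of each other, a search of one strand covers both.

```python
import re
from typing import List, Tuple

# ATTCCC + 8 variable bases + GGGAAT, per the consensus quoted in the text.
DRS3_PATTERN = re.compile(r"ATTCCC[ACGT]{8}GGGAAT")


def find_drs3(seq: str) -> List[Tuple[int, int]]:
    """(start, end) coordinates of every non-overlapping dRS3 match."""
    return [(m.start(), m.end()) for m in DRS3_PATTERN.finditer(seq.upper())]


def rs_elements(seq: str) -> List[str]:
    """Sequences lying between consecutive dRS3 matches (candidate RS elements)."""
    seq = seq.upper()
    hits = find_drs3(seq)
    return [seq[prev_end:next_start]
            for (_, prev_end), (next_start, _) in zip(hits, hits[1:])]


# Toy dRS3-RS-dRS3 unit with an invented 12-bp spacer.
drs3 = "ATTCCC" + "ACGTACGT" + "GGGAAT"
toy = "CC" + drs3 + "TTTTAAAATTTT" + drs3 + "GG"
print(find_drs3(toy))    # [(2, 22), (34, 54)]
print(rs_elements(toy))  # ['TTTTAAAATTTT']
```

A real analysis would also need a maximum spacer length, so that distant, unrelated dRS3 copies are not paired into a single unit.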
The numbers of each major repeat type are comparable in the three complete N. meningitidis genomes (). Comparison of repeat elements between the three genomes revealed no repeat types unique to one genome though it did identify RS element diversity. For example, repeat sequence clustering analysis for Z2491 and FAM18 showed that of the 611 RS elements in FAM18, there are 80 FAM18-specific versions that group into 27 subfamilies, suggesting novel repeat development, possibly generated by recombination.
NIMEs are often clustered into long arrays of multiple dRS3s separated by different RS elements. These arrays may also contain other repeats such as CREEs and insertion sequence elements, which may be opportunistic insertions. We have previously suggested that these NIME arrays may encourage sequence variation in neighbouring genes by increasing the frequency of recombination with exogenous DNA, and thus exchange of adjacent sequences, either by acting as substrates for homologous recombination, or as targets for a specific recombinase [17]. The chromosomal position of these repeat arrays is generally consistent between the three genomes, suggesting that they were initially introduced in a common ancestor. However, comparison of syntenic repeat arrays reveals considerable differences in repeat number and array length, indicating that the arrays themselves are dynamic (), as would be expected if they were substrates for recurring recombination.
Syntenic Repeat Arrays Showing Variation in Repeat Number and Array Length and tbpAB Divergence
To study the correlation between repeat arrays and coding sequence divergence, the three pairwise genome comparisons were combined to measure amino acid identities between orthologous CDS. This showed that the average percentage identity between orthologous CDS flanking repeat arrays is significantly (p-value = 5.2 × 10⁻⁶ using a single-tailed t-test) lower than the average percentage identity of orthologues not flanking repeat arrays, supporting the hypothesis that the arrays are associated with increased diversity in flanking genes. Despite this strong association, other measures of the diversifying effect of repeat arrays are less clear cut. The relationship between array length and flanking gene diversity is displayed in A and shows that for positionally orthologous genes immediately adjacent to repeat arrays there is a tendency for sequence identity to decrease as array length increases and that the same pattern is seen in all three pairwise comparisons. The low combined R² value indicates that the association with array length is not strong. However, a strong correlation with array length may be obscured since array size is dynamic () and may itself change rapidly.
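For readers unfamiliar with the statistic, the coefficient of determination for a simple linear fit of identity against array length can be computed as in the following sketch. The function is a generic least-squares R² and the input values are invented for illustration; they are not data from this study.

```python
from typing import Sequence


def r_squared(xs: Sequence[float], ys: Sequence[float]) -> float:
    """R^2 of a simple least-squares line through (xs, ys).

    For a one-predictor linear fit this equals the squared Pearson correlation.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("R^2 is undefined when either variable is constant")
    return (sxy * sxy) / (sxx * syy)


# Invented values: array length (bp) against flanking-gene identity (%).
toy_lengths = [100, 200, 300, 400]
toy_identity = [99.0, 97.0, 98.0, 95.0]
print(round(r_squared(toy_lengths, toy_identity), 3))
```

A value near 1 would indicate that array length explains most of the variation in flanking-gene identity; a low value, as reported here, indicates that it explains little even where a downward trend is visible.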
Sequence Divergence in Orthologues Flanking Repeat Arrays
We further analysed the orthologue sequence identities to test whether the diversifying effect of the repeat arrays could be detected beyond the immediately adjacent CDS. B plots the orthologue identities for the CDS within ten genes of an array and shows a weak association between orthologue diversity and distance from the array. C shows that this effect is almost entirely due to the diversity seen in the CDS immediately adjacent to arrays. The distance to which the diversifying effect of the arrays extends should be limited by the length of DNA fragment that can be recombined through a double crossover event where one crossover is within the repeat array. A study focused on allele replacement of the tbpB gene () encoding an immunogenic cell surface protein in N. meningitidis found that the recombining DNA fragments ranged from 1.5 to 9.9 kb with a median size of 5.1 kb [48]. This size range would suggest that recombination fragments could extend several CDS away from a repeat array, although the effect of this is not seen in our comparisons. Interestingly, of the 19 fragments mapped to tbpB allele replacements, 13 had at least one end point within or very close to a flanking NIME repeat array [48], so while immune selection may be driving the retention of variants, the majority of the required recombination events are associated with NIME repeat arrays.
Based on the above findings, we hypothesise that the relative positions of CDS and repeat arrays are under selective pressure such that genes where increased variation is beneficial are more likely to be associated with arrays. The repeat arrays serve to promote recombination with exogenously acquired DNA, increasing the rate of gene exchange at the adjacent loci. Although this does not directly cause increased variation in these genes, it should enhance the exchange of variants, and therefore increase the apparent rate of variation. This correlates with the pattern seen in A, where some genes are highly variable in some comparisons but not in others: exchange is a stochastic process, so sometimes a highly variant gene will have been acquired, and sometimes a gene more similar to the comparators used here. Clearly any exchange will only be fixed if it is selected for, and this may explain why variation from the array is not retained. In support of this hypothesis, there is a clear association between repeat arrays and CDS encoding cell surface associated proteins where increased sequence variation may be an advantage in host interactions (; Table S2). At any given repeat array size range, the majority of flanking genes code for cell surface or exported proteins. Moreover, the proportion of this class of genes increases as array size increases. A clear example of the diversifying effect of repeat arrays is the tbpAB locus (), which lies at a syntenic position in all three genomes flanked by large repeat arrays. The dynamic nature of the repeat arrays is reflected in a wide variation in array size and content. The tbpA and tbpB DNA sequence identities (74.1%–94.9% and 50.8%–77.8%, respectively) are far below the typical genome figure of >90% and the flanking gene identities of 96.1%–98.5%. Clearly, sequence divergence is focused on tbpAB, leaving the surrounding genes largely unaffected, further implying that selection for advantageous variation is operating. Analogous arrangements can be seen for the lbpAB locus, encoding a lactoferrin-binding protein; the porA locus, encoding a major outer membrane protein; and the pilC1/pilC2 loci.
Pie Charts Showing Association between Repeat Arrays and Surface Proteins in FAM18
Another consideration is that there may be a reciprocal relationship between the genes undergoing repeated recombination to generate antigenically variable mosaics and the flanking repeats. Since there is little or no selective pressure for accurate recombination within the flanking repeat regions, the recombination within these regions is likely to be more “error prone” than that within the coding regions. So, the recombination of these genes may serve to create variation and growth of the flanking repeats, which in turn may favour further recombination within the adjacent coding regions.
NIME Array Structure
The NIME arrays themselves display a striking and regular wave profile for % G + C content with troughs corresponding to RS elements and peaks corresponding to dRS3. shows one example but the profile is apparent for all large regular NIME arrays in the genome. The profile is sometimes less obvious in smaller arrays or less regular arrays containing other repeat types such as CREEs, but the % G + C profile is often detectable even for isolated dRS3-RS-dRS3 units. The profile does not appear to be simply due to the high % G + C nature of the dRS3s, but is rather a combination with the lower % G + C RS elements. Average % G + C values for FAM18 are 55.78% for dRS3s, 45.18% for RS elements, and 51.62% for the whole chromosome. Closer analysis shows considerable internal variation for the NIME subunits with dRS3 % G + C ranging from 35% to 70% and RS elements ranging from 28% to 76%. The large overlap between these ranges implies that to maintain the wave profile the dRS3 and RS element sequences are somehow interdependent. Since dRS3s are only 20 bp long and have a conserved 6-bp terminal inverted repeat, their % G + C variation is limited to the central eight bases. RS elements are essentially defined as regions flanked by dRS3s, so may not necessarily be functional units and could be largely inert, or have a structural role. Under this definition, RS elements vary in size from 19 to 214 bp and form multiple sequence families, some of which tend to have short (5-bp) terminal inverted repeats. Within RS elements % G + C content is often “patterned” with either a central low % G + C trough or a side-to-side slope. Maintenance of the % G + C wave profile in NIME arrays suggests that the structure may be related to function, potentially participating in, or promoting, recombinational events.
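The constraint that the conserved ends place on dRS3 base composition can be checked with simple arithmetic, sketched below under the assumption that every dRS3 matches the consensus ATTCCCNNNNNNNNGGGAAT exactly. The function name is hypothetical.

```python
from typing import Tuple

LEFT_END = "ATTCCC"   # conserved 6-bp terminus from the consensus
RIGHT_END = "GGGAAT"  # its inverted-repeat partner
CENTRAL_LENGTH = 8    # variable central bases


def drs3_gc_bounds() -> Tuple[float, float]:
    """Minimum and maximum %GC possible for a consensus-matching dRS3."""
    fixed_gc = sum(base in "GC" for base in LEFT_END + RIGHT_END)  # 3 + 3 = 6
    total = len(LEFT_END) + CENTRAL_LENGTH + len(RIGHT_END)       # 20 bp
    lowest = 100.0 * fixed_gc / total                     # centre all A/T
    highest = 100.0 * (fixed_gc + CENTRAL_LENGTH) / total  # centre all G/C
    return lowest, highest


print(drs3_gc_bounds())  # (30.0, 70.0)
```

Under this assumption the possible range is 30% to 70% in steps of 5%, which is consistent with the observed dRS3 range of 35% to 70% reported above.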
NIME Array Percentage G + C Content Profile for an Array Located at 1283600–1284640 bp in the FAM18 Genome
We hypothesise that dRS3 sequences within NIME arrays are binding sites for a site-specific recombinase that enhances recombination between these sequences and exogenously acquired DNA containing other dRS3 elements, thereby promoting variation at a number of genes associated with NIME arrays. If the dRS3-mediated recombination formed an initial cross-over event, then the insertion of linear DNA could be completed by RecA-mediated homologous recombination in the adjacent sequences, ensuring replacement with similar genes. Alternatively, pairs of arrays surrounding genes could both participate in dRS3-mediated recombination, or array-flanked genes on acquired DNA could be inserted into chromosomal arrays. Continued recombination between chromosomal and acquired dRS3 elements, with the functionally selected consequence of exchanging adjacent genes, could have the effect of building up repeat arrays containing “spacer” regions (the RS elements) with specific physical or conformational properties.
Bille et al. [30] and Kawai et al. [49] have recently described a type of neisserial filamentous prophage whose presence in meningococcal genomes is associated with the ability to invade host tissues. They have also shown that these bacteriophage integrate into dRS3 repeats by the action of a phage-encoded transposase/recombinase. This protein is therefore a plausible candidate for the specific recombinase predicted by our hypothesis. This phage is a member of a larger family of neisserial phage, and it is therefore reasonable to suppose that this recombinase has been present in the neisserial genome for some time.
Silent Gene Cassette–Mediated Variation
Meningococcal genomes contain several loci where transcriptionally silent gene cassettes can be used as sources of variation for expressed surface structures and proteins. Comparison of such variable loci from different strains reveals detail of different genetic arrangements and may be useful for understanding the mechanisms for generation of variants. The best-described example is the pilin-encoding pilE/S system, where the expressed pilin (PilE) can be altered by incorporation into the pilE CDS of DNA from 5′-adjacent promoter-less pilS cassettes. Much of the pilE/S locus has been deleted in FAM18, which is associated with the previously recognized insertion, and conversion to the sole expression, of a class II pilin-encoding gene (pilE2) elsewhere on the chromosome (which is not present in Z2491 and MC58) [51]. Variation of the pilE gene using pilS sequences has been extensively studied [50], the efficiency of which has been shown in N. gonorrhoeae to involve a short DNA sequence (the Sma/Cla repeat) located downstream of the pilE gene. In N. meningitidis the silent pilS loci are embedded within NIME arrays, and it is possible that the specific dRS3-mediated recombination postulated above may contribute to generating silent variation within pilS.
A different mechanism of variation appears to exist for several loci encoding putative haemagglutinins (fhaB) and adhesins (mafB). Downstream of these genes are what appear to be silent cassettes encoding alternative C termini for the encoded proteins. These cassettes contain short repeats that are identical to sequences only present within the upstream genes. We have previously suggested that these repeats could be the substrates for direct recombination, replacing the 3′ end of the gene [17]. The three-way comparison provides more evidence in support of this view, including examples where the C terminus of one of the expressed genes in one genome is identical to a silent cassette in the same locus in another genome.
The maf loci are generally comprised of tandem mafA and mafB genes, both of which are thought to encode adhesins, followed by a number of putative silent cassettes, and many genes of unknown function (A). Each of the three N. meningitidis genomes has three maf loci designated here as maf1, maf2, and maf3 (based on their order relative to the origin of replication in the FAM18 genome sequence). The loci are at syntenic positions but occur in a different order within each chromosome due to the chromosomal inversion events described above (). For all three genomes the maf2 and maf3 loci start with a mafA gene, which is absent from maf1. However, maf1 from MC58 (maf1_MC58) and FAM18 (maf1_FAM18) have a truncated mafA mid way through the locus, and maf1_Z2491 has been largely deleted. There is also a possible mafA remnant in the middle of maf3_FAM18. Clearly there is considerable variation encoded within the maf loci, which would be expected to manifest as variations in adhesin structure at the cell surface.
Silent Gene Cassette–Mediated Variation
Although all three maf loci have a similar structure and appear to be encoding a similar product, at the sequence level maf1 and maf2 are more similar to each other, with maf3 having only localised similarity. DNA identities between maf1 and maf2 mafA genes are high (>97%), but their identities to maf3 mafA genes are much lower (~65%). Identities between maf3 mafA genes are greater than 98%. An analogous situation exists for the mafB sequences. The encoded maf3 MafAs have an N-terminal extension relative to the others but all appear to have an intact signal sequence and are likely to be exported.
At 41.2%, the average % G + C content of maf loci is markedly lower than the genome average of 51.5%. Furthermore, there is a distinct profile of % G + C content across maf loci. Generally mafA and mafB CDS, including the downstream alternative CDS, correspond to % G + C peaks with some mafA CDS as high as 59% GC. Although the intervening % G + C troughs have been annotated with potential CDS, they have no similarity to genes in the database, and their role is unclear. It is notable that they do not contain any of the repeats associated with repeat arrays.
FAM18 and Z2491 have single syntenic fha loci, while MC58 has two that are associated with a genome inversion event (IE3) as described above (). These loci are analogous to the fha loci in Bordetella pertussis with an upstream gene (fhaC) encoding a two-partner secretion system outer membrane transporter [54], though the B. pertussis loci do not have downstream silent cassettes. The N. meningitidis fha loci have a similar structure to the maf loci, with silent cassettes directly downstream from a gene encoding a large low complexity surface protein and a distinct % G + C profile (B). The size of the repeats shared by the silent cassettes and the functional gene is comparable for both fha and maf loci. DNA identities for N. meningitidis fhaC and the 5′ portion of fhaB are greater than 98%. The degree of similarity between fhaB genes is variable, with the two copies in MC58 having a high level of identity over almost the full length, while for Z2491 and FAM18 gene identity is largely confined to the 5′ end with the remainder either unique or having similarity in the downstream “silent” region.
The maf and fha loci show considerable potential for generation of multiple versions of the expressed coding sequence and, together with surface structures such as pilus, capsule, and other surface proteins, are likely to be major contributors to cell surface diversity. The presence of multiple syntenic maf loci is striking and suggests an important role.
Phase Variable Genes
Previously, a number of potential phase variable genes have been identified based on the presence of potentially slippage-prone short repetitive sequences, and these lists have been progressively refined through analysis of neisserial genome sequences, first in N. meningitidis strains MC58 [18] and Z2491 [17], then in comparative studies using both of these and N. gonorrhoeae strain FA1090 [37], and subsequently in a study of the commonly used experimental N. gonorrhoeae and a partial study of N. meningitidis.
Based upon these studies, and those published by others on specific genes, and a four-way comparison using the N. meningitidis FAM18 genome sequence, a revised and updated phase variable gene list is presented here (Table S3). There are now 24 known phase variable genes in the Neisseria spp. and a further 25 strong candidates, counting members of established gene families such as Opa proteins only once. Over half of these encode surface proteins, enzymes that modify surface proteins, or are LPS biosynthesis proteins. This mechanism therefore has a vast capacity to vary the surface-exposed structures and epitopes of N. meningitidis.
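A first-pass screen for slippage-prone tracts of the kind used to nominate candidate phase variable genes can be sketched as a search for homopolymeric runs. The minimum tract length used here is an arbitrary assumption for illustration, not the criterion applied in compiling Table S3, and the sketch covers only single-base repeats, not longer repeat units.

```python
import re
from typing import List, Tuple


def homopolymer_tracts(seq: str, min_len: int = 7) -> List[Tuple[int, str, int]]:
    """Return (start, base, length) for each run of one base of at least min_len.

    `min_len` is an assumed threshold; it must be 2 or greater.
    """
    if min_len < 2:
        raise ValueError("min_len must be at least 2")
    # One base followed by at least (min_len - 1) further copies of itself.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.group(1), m.end() - m.start())
            for m in pattern.finditer(seq.upper())]


# Toy sequence: an 8-base G tract and a 5-base T tract.
toy = "AT" + "G" * 8 + "CA" + "T" * 5 + "A"
print(homopolymer_tracts(toy))             # [(2, 'G', 8)]
print(homopolymer_tracts(toy, min_len=5))  # [(2, 'G', 8), (12, 'T', 5)]
```

Candidates found this way would still need to be checked for their position relative to a coding sequence or promoter, and ideally for length variation between strains, before being treated as phase variable.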
Based upon the comparisons of the three meningococcal genomes, we further characterized a number of known and putative mechanisms for the generation of diversity within and between strains of this highly adaptable and variable species. Many of these mechanisms involve random variation, which is locally increased due to the presence of repeats, either generating local instability or serving as substrates for homologous recombination, resulting in altered expression of specific genes (phase variation) or generating allelic diversity within particular surface proteins (pilE, mafB/fhaB families, NIME array-associated genes).
While phase variation through homopolymeric tracts has been noted in several other genera, the NIME arrays and cassette-mediated variation described here seem to be specific to Neisseria and may be important characterising features of the genus. NIME arrays are found at syntenic positions in the genomes of N. gonorrhoeae and N. lactamica but have not been observed in non-neisserial genomes. Although two-partner secretion systems analogous to the fha locus are found in other bacterial genomes [33], they only include the two essential components and lack the silent cassettes that enhance variability in N. meningitidis. Notably, the N. gonorrhoeae and N. lactamica genomes both have three maf loci syntenic with those in N. meningitidis; N. gonorrhoeae has two extra maf loci; and neither has fha loci. These genomic differences will affect the cell surface and may relate to niche differences and interactions with the host.
Variation of the bacterial cell surface is a common theme in host–pathogen interactions and appears to be important for colonisation of new niches and avoidance of the immune system. Such variation may be even more important for commensal organisms, such as Neisseria, that remain associated with their hosts for long periods. The genome of FAM18, and its comparison with MC58 and Z2491, highlights N. meningitidis as a paradigm of genomic variability linking a combination of DNA uptake and recombination, minimal mobile elements, intergenic repeat arrays, and phase variation to generate and maintain phenotypic diversity focused at the cell surface.