The scale and efficiency of high-throughput ‘next generation’ parallel sequencing methods, combined with their reduced costs, provide the unprecedented prospect of large numbers of whole genome sequences becoming available for the meningococcus in the near future. At least in the short term these will not be fully annotated, closed genome sequences, but rather sets of contiguous sequence assemblies generated from short-read sequencing technologies, such as the Illumina HiSeq 2000 instrument. With sequence data becoming available more rapidly than conventional analyses can be performed, novel strategies for data analysis, storage, and output are required if full advantage is to be taken of these data.
The initial studies using this type of data concentrated on bacteria with relatively limited genetic variation, either relatively ‘young’ single-clone pathogens or closely related members of more diverse species, often initially chosen on the basis of MLST data [92]. For such analyses, it was possible to borrow analysis techniques from studies of human populations, which are not very diverse, for the identification of single nucleotide polymorphisms (SNPs) by mapping against a complete closed reference genome. This approach, however, is not appropriate for more diverse bacteria such as the meningococcus, where there is extensive sequence polymorphism rather than isolated SNPs, especially when this diversity is reassorted by recombination. Another limitation of these studies was the requirement for a reference genome: a high-quality, finished genome, closely related to the genomes being analysed, against which SNPs can be called. These approaches do not scale to a situation where many hundreds or thousands of partial genome sequences are being generated for diverse members of a bacterial species.
We have recently developed an alternative paradigm for the analysis of whole genome sequence data, which places allelic diversity among bacterial isolates at the centre of multiple genome analysis. This approach is effectively an expansion of the analysis concept that has been central to the success of MLST [23] and has the advantage that multiple, highly diverse genomes can be analysed simultaneously. While most MLST schemes index variation at 1-10 ‘housekeeping loci’, usually fragments of genes under stabilising selection [46], there is in principle no limit to the number or type of loci that can be analysed in this way, up to all the loci present in a genome. For the purposes of characterising and cataloguing genetic variation, a locus can be defined as any sequence string, nucleotide or peptide, which can be identified by its sequence, genomic context or a combination of the two. As in MLST, each novel variant of each locus is assigned an allele number, and the sequence and the number are stored in a curated data table, providing a comprehensive catalogue of the variation observed to date for that locus. These tables are easily expanded as new variants are detected. The advantage of this approach is that the sequence variation at a given locus can be summarised as a single number, and a genome can be rapidly identified as having either a known variant at a particular locus or a novel variant, which can then be added to the curated data set for that locus.
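The allele numbering described above can be illustrated with a minimal Python sketch. The class name `LocusTable`, its method `designate`, and the toy sequences are hypothetical and are not taken from the BIGSdb software; in a curated database a novel variant would also be checked by a curator before a number is assigned.

```python
from dataclasses import dataclass, field


@dataclass
class LocusTable:
    """Hypothetical per-locus table mapping each distinct sequence to an allele number."""
    name: str
    alleles: dict = field(default_factory=dict)  # sequence -> allele number

    def designate(self, sequence: str):
        """Return (allele_number, is_novel) for a sequence observed at this locus."""
        seq = sequence.upper()
        if seq in self.alleles:
            # Known variant: report the number already on record.
            return self.alleles[seq], False
        # Novel variant: assign the next unused number and store it.
        number = len(self.alleles) + 1
        self.alleles[seq] = number
        return number, True


# Toy sequences, not real alleles of any locus.
table = LocusTable("abcZ")
print(table.designate("ATGGCA"))  # first variant seen at this locus
print(table.designate("ATGGCT"))  # a second, different variant
print(table.designate("atggca"))  # same as the first, so the known number is returned
```

Because every distinct sequence maps to one integer, comparing two isolates at a locus reduces to comparing two numbers.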
Figure 3. The BIGSdb integrated online database (http://pubmlst.org/neisseria/), which links provenance, phenotype, bibliographic and sequence data for use in applications for epidemiological, evolutionary, and functional studies. Figure is adapted from SK Sheppard, ...
For organisms in which recombination is common [49], this approach has the advantage that analyses based on the alleles present, rather than on their sequences or individual nucleotides, inherently correct for the bias that horizontal genetic exchange introduces into nucleotide-based phylogenetic or genealogical analyses; indeed, this was why the approach was used for MLST of such organisms [46]. In organisms where recombination is absent or rare, the sequences can be used directly, as the sequence variation conforms to tree-like evolutionary models [94]. A further level of efficiency in cataloguing variation is achieved by grouping alleles, as is done in MLST, to generate sequence types (STs), which effectively summarise variation data at seven loci with a single number. There is no limit to the number of such schemes that can be defined, each of which groups genetic variation in a different way, for example by membership of the same macromolecular structure, such as the ribosome [43], or by contribution to a particular phenotype, such as antibiotic resistance. Once the genetic variation of a given isolate is catalogued in this way it can readily be associated with provenance and phenotype data for the isolate, often referred to as metadata.
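Grouping allele numbers into a scheme amounts to a lookup from an allelic profile to a single identifier. The sketch below assumes the seven loci of the Neisseria MLST scheme as the scheme members; the profile numbers, the ST values, and the function name `sequence_type` are invented for illustration only.

```python
from typing import Optional

# Loci that make up the scheme, in a fixed order.
MLST_LOCI = ("abcZ", "adk", "aroE", "fumC", "gdh", "pdhC", "pgm")

# Hypothetical profile table: one allele number per locus -> ST.
# These values are made up and do not correspond to real STs.
PROFILES = {
    (1, 1, 2, 1, 3, 2, 1): 1,
    (1, 1, 2, 1, 3, 2, 4): 2,
}


def sequence_type(designations, loci=MLST_LOCI, profiles=PROFILES) -> Optional[int]:
    """Return the ST for an isolate's allele designations.

    Returns None if a locus has no designation or the profile is not in the table.
    """
    try:
        profile = tuple(designations[locus] for locus in loci)
    except KeyError:
        return None
    return profiles.get(profile)


isolate = dict(zip(MLST_LOCI, (1, 1, 2, 1, 3, 2, 4)))
print(sequence_type(isolate))
```

Any other scheme, such as one covering ribosomal loci, would use the same lookup with a different tuple of loci and its own profile table.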
This approach has been implemented with the BIGSdb [52] software on the PubMLST.org/neisseria website. There are three fundamental levels of interrelated information within the website: (i) a table of isolate provenance and phenotype data; (ii) a ‘sequence bin’ associated with each isolate; and (iii) tables of reference sequences. The isolate information table contains data on the provenance and phenotype of the isolate, together with any alternative names that have been used and links to relevant records in external databases, such as PubMed. The sequence bin can contain any type of compiled sequence data, including: (i) single locus sequence data, such as MLST data, a 16S rRNA gene or an antimicrobial resistance determinant; (ii) assembled draft genomes, such as those generated from ‘next generation’ short read sequences; or (iii) complete closed genome sequences. All of these types of data can reside in a single sequence bin, each associated with an experiment type, which indicates the likely quality of the data and can be used to analyse the data preferentially, such that complete closed genome data, where available, will be used in preference to draft genome data.
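As a rough illustration of how an experiment type can drive preferential use of data, the following Python sketch models an isolate record with its sequence bin. The class names, field names, and experiment-type labels are assumptions made for this example and do not reflect the actual BIGSdb schema; only the rule that closed genome data are preferred over draft genome data comes from the text.

```python
from dataclasses import dataclass, field

# Assumed ranking: a lower value means a more preferred experiment type.
QUALITY_RANK = {"closed_genome": 0, "draft_genome": 1}


@dataclass
class BinEntry:
    experiment: str  # hypothetical label, e.g. "closed_genome" or "draft_genome"
    sequence: str


@dataclass
class Isolate:
    name: str
    metadata: dict = field(default_factory=dict)      # provenance and phenotype
    sequence_bin: list = field(default_factory=list)  # BinEntry objects

    def entries_by_preference(self):
        """Return bin entries ordered from most to least preferred experiment type.

        Experiment types missing from QUALITY_RANK sort last.
        """
        unknown = len(QUALITY_RANK)
        return sorted(self.sequence_bin,
                      key=lambda e: QUALITY_RANK.get(e.experiment, unknown))


iso = Isolate("example-isolate", {"country": "UK"})
iso.sequence_bin.append(BinEntry("draft_genome", "ATGC"))
iso.sequence_bin.append(BinEntry("closed_genome", "ATGCATGC"))
print([e.experiment for e in iso.entries_by_preference()])
```

An analysis would then take the first entry that contains the locus of interest, so closed genome data are consulted before draft data.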
The tables of reference sequences store any number of reference sequences from any number of loci, grouped into any number of schemes, enabling rapid and flexible annotation of the sequences within the sequence bin using well established search algorithms such as BLAST. Once identified, the loci are tagged within the sequence bin for easy future identification, and the allelic designations are reported back to the isolate information table. This scanning process is automated, so that as new genomes are added they are automatically annotated against the loci defined in the reference tables. The reference tables themselves not only define variants and associate them with allele numbers, but can also contain links to other data, including publications, which indicate the function of the loci. These tables also give alternative locus names. To facilitate unambiguous labelling, each locus is assigned a unique identifier of the form NEIS0001. Loci are curated, and multiple curators are possible, so that individuals expert in a particular locus can have responsibility for that locus. All features of the database are accessible through a web interface, so that the system provides a flexible and backwards-compatible means of cataloguing and analysing genome-wide information. The PubMLST database employing the BIGSdb software provides a platform upon which population genomic analyses of the Neisseria can be efficiently performed. It contains sequence data generated by single locus, multilocus and whole genome approaches that can be analysed together. A number of data analysis, summary and export tools are built into the system, and it is possible to conduct phylogenetic and genealogical analysis on the same data set that is being used for functional studies, enabling the effective fusion of these two powerful but not always well integrated functions.
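The scan-and-tag step can be sketched in Python as follows. This is a deliberately simplified stand-in: it uses exact, forward-strand substring matching instead of a BLAST search, reports only the first occurrence in each contig, and the function names, toy sequences, and the four-digit zero padding inferred from the example identifier NEIS0001 are all assumptions.

```python
def locus_id(number: int) -> str:
    """Format an identifier of the form NEIS0001 (four-digit padding is assumed)."""
    return f"NEIS{number:04d}"


def scan(contigs, reference):
    """Tag exact occurrences of known alleles in an isolate's contigs.

    contigs:   dict mapping contig name -> sequence
    reference: dict mapping locus id -> {allele number: allele sequence}
    Returns a list of (locus, allele, contig, start, end) tuples with
    1-based inclusive coordinates.
    """
    tags = []
    for locus, alleles in reference.items():
        for number, allele_seq in alleles.items():
            for contig, seq in contigs.items():
                start = seq.find(allele_seq)  # exact match, forward strand only
                if start != -1:
                    tags.append((locus, number, contig,
                                 start + 1, start + len(allele_seq)))
    return tags


# Toy data, not real Neisseria sequences.
reference = {locus_id(1): {1: "ATGCCC", 2: "ATGAAA"}}
contigs = {"contig1": "GGGATGCCCTAA"}
print(scan(contigs, reference))
# -> [('NEIS0001', 1, 'contig1', 4, 9)]
```

Each tag records where a locus lies in the sequence bin, and the allele number is what would be reported back to the isolate record.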
Combining population and functional analyses in this way is especially important in bacteria such as the meningococcus, where population structure has to be taken into account when performing association studies to determine which genes or genetic variants are associated with particular phenotypes. The pathogenic phenotype in the meningococcus is polygenic and complex, but the data accumulated to date, which have demonstrated the existence of defined genotypes and their association with particular phenotypes, suggest that unravelling the genetic elements associated with these traits will be achievable in the foreseeable future.