Whole genome assemblies were generated for all 127 isolates, and the protein sequences of 388 shared orthologous genes were extracted and aligned to create the phylogenetic tree. Within the tree there are several clusters of similar sequence, which can be correlated with other characterizing metrics such as serotype and drug resistance profile. Notably, the whole genome data distinguish every isolate, even very closely related isolates within the same cluster. Multi-locus sequence type (MLST) was derived from the sequence data to allow comparison of the genetic diversity of this sample with other samples in the literature and with the entire public MLST database.
Phylogenetic tree of Malawian strains.
Serotype Distribution and Pneumococcal Conjugate Vaccine Coverage
The serotype of each isolate was determined by mapping Illumina reads against a reference of concatenated capsular biosynthesis loci (cps) for all known serotypes. There were 39 circulating serotypes amongst the isolates: 22 were found in invasive isolates only, 1 in carriage only (15A) and 14 in both invasive disease and carriage. Six serotypes (3, 5, 7F, 18C, 19A, 23F) were associated only with invasive disease and had population frequencies of greater than 3, but all are included in PCV13. Serotypes 13, 21 and 35B were more frequent in carriage than in invasive disease. There were no non-typeable isolates amongst the carriage group. There were, however, clusters of serotypes 1, 3, 5, 6A/B, 7A/F, 10B, 12A/B, 14 and 23F correlating with previously reported genotypes.
details the cumulative distribution of serotypes detected directly from the genome in this study. The most common invasive serotypes in our selection were serotypes 1 (20.5%), 5 (6.3%), 6A (5.5%), 14 (5.5%), 12B (4.7%), 23F (3.9%) and 19F (3.1%), together accounting for 49.5% of the serotypes detected within the population. Our data suggest that the 13-valent pneumococcal conjugate vaccine (PCV13) would provide 62.9% (95% CI 54.5 - 71.3%) coverage against all IPD in Malawi. However, the vaccine is scheduled for children under 5 years old. Of the invasive isolates, 43.4% (46/106) were from this age group, in which PCV13 coverage would be 78.3% (36/46).
Bar chart showing rank order and cumulative serotype distribution (%).
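The coverage figures above are simple proportions, and the reported interval can be checked with a short calculation. The sketch below assumes a normal-approximation (Wald) interval and a denominator of 127 isolates; neither is stated in the text, and both were chosen only because they reproduce the reported bounds. The function name `wald_interval` is hypothetical.

```python
import math


def wald_interval(p, n, z=1.959964):
    """Normal-approximation (Wald) confidence interval for a proportion.

    p: observed proportion, n: sample size, z: normal quantile (95% by default).
    """
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


# Reported overall PCV13 coverage; the denominator of 127 is an assumption.
low, high = wald_interval(0.629, 127)
print(f"95% CI: {100 * low:.1f}% to {100 * high:.1f}%")

# Coverage among invasive isolates from children under 5 (counts from the text).
under5_coverage = 36 / 46
print(f"Under-5 coverage: {100 * under5_coverage:.1f}%")
```

With small denominators such as 46, a Wilson or exact interval would usually be preferred over the Wald form; the Wald form is used here only because it matches the interval as reported.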
If the PCV13 serotypes are removed from the analysis, 22.9% (8/35) of non-vaccine serotypes would be classified as multidrug resistant (MDR), defined as resistance to at least 3 classes of antibiotics (File S1), and 25.7% (9/35) would be resistant to at least 2 classes of antibiotic. Drug resistance is clustered within the 10B, 12B and 16F carried and invasive groups on the tree.
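The MDR definition used above (resistance to at least 3 antibiotic classes) amounts to a threshold on the number of distinct classes per isolate. The following is a minimal sketch of that counting rule; the isolate identifiers, the antibiotic class names and the function name are hypothetical placeholders, not data from File S1.

```python
def count_resistant_to_at_least(profiles, min_classes):
    """Count isolates resistant to at least `min_classes` distinct antibiotic classes."""
    return sum(
        1 for classes in profiles.values() if len(set(classes)) >= min_classes
    )


# Hypothetical resistance profiles: isolate id -> antibiotic classes with resistance.
profiles = {
    "isolate_A": ["beta-lactam", "tetracycline", "macrolide"],
    "isolate_B": ["tetracycline", "folate-pathway inhibitor"],
    "isolate_C": ["tetracycline"],
    "isolate_D": [],
}

mdr = count_resistant_to_at_least(profiles, 3)       # MDR threshold from the text
two_plus = count_resistant_to_at_least(profiles, 2)  # at least 2 classes
print(f"MDR: {mdr}/{len(profiles)}, >=2 classes: {two_plus}/{len(profiles)}")
```

Because the "at least 2" category includes every MDR isolate, its count can never be smaller than the MDR count, which matches the relationship between the two proportions reported above (9/35 versus 8/35).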
The population diversity, as judged by ST, comprised 20 distinct clonal groups. Novel STs accounted for 29.9% (38/127) of the population and included novel STs derived for serotypes 6B, 7F, 13, 18A and 19A.
Description of Circulating Clones.
The Pneumococcal Molecular Epidemiology Network (PMEN) 27 clone (Sequence Type [ST] 217) was exclusively associated with serotype 1 and PMEN 19 (ST289) with serotype 5. The 23F cluster corresponded to ST802; for two isolates within that cluster we could not assemble complete MLST genes, but the whole genome data allow us to confidently place them within the ST802 group. Similarly, one 12B isolate sits within the ST989 group and a single 16F isolate sits within the ST705 group. Two serotype 12B isolates with novel STs cluster together, but not with the larger ST989 grouping. The serotype 14 cluster was predominantly ST63 (75%), with a single ST2678 and a single novel ST. ST63 represents the PMEN 25 clone, which is known to express serotype 15A. However, only serotypes 15B and 15C were identified within the serotype 15 cluster. The cluster of serotype 10B isolates did not correspond to a known MLST, but three of this cluster were from invasive serotypes and accounted for 2.4% of the population. No members of the pandemic 23F clone first identified in Spain (PMEN1, 23F-Spain, ST81) were detected.
provides an overall summary of the genomic diversity associated with the serotypes targeted by PCV13. Serotypes 1, 5 and 23F were exclusively detected from single lineages (distance between isolates within 0–0.002 amino acid substitutions per site), while some serotypes were detected from multiple lineages (distance >0.002 amino acid substitutions per site). The vaccine will target more than one lineage for 3 serotypes (6A, 14 and 19F), and these lineages may differ in their propensity to cause disease or in their association with antibiotic resistance. Lineage-serotype associations allow for the potential detection of serotype switch events. Within this sample we found two isolates with the same ST (ST172) but unrelated serotypes (19F and 15C), indicating a likely serotype change within this lineage due to exchange of the cps locus via recombination. However, the genomic distance indicated by the branch lengths shows that these isolates are not as closely related as the common MLST type might suggest, implying that the serotype switch is not a recent event and that there are likely to be many further genetic differences between these isolates. The serogroup 7 cluster appears to indicate a more recent switch event from serotype 7F to 7A. The difference between these serotypes is subtle: a single base insertion/deletion causes a frameshift in a glycosyl transferase gene, altering the sugar composition of the repeating unit (PMID: 17766424). Our current assemblies of this specific location are not conclusive, so further confirmation is required. Nevertheless, the subtlety of the variation suggests that 7A/7F switches will be relatively frequent in nature.
Genomic diversity of serotypes targeted by PCV13.
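The two rules used in the lineage analysis above can be sketched as follows: isolates within 0.002 amino acid substitutions per site are treated as a single lineage, and an ST that carries more than one serotype is flagged as a candidate serotype switch. The isolate identifiers and function names are hypothetical; only the threshold, the ST172 pairing with serotypes 19F and 15C, and the ST217 association with serotype 1 come from the text.

```python
from collections import defaultdict

# Threshold from the text: amino acid substitutions per site.
LINEAGE_THRESHOLD = 0.002


def same_lineage(distance):
    """True if a pairwise distance falls within the single-lineage range."""
    return distance <= LINEAGE_THRESHOLD


def candidate_switches(isolates):
    """Return STs associated with more than one serotype.

    isolates: iterable of (isolate_id, sequence_type, serotype) tuples.
    """
    serotypes_by_st = defaultdict(set)
    for _isolate_id, st, serotype in isolates:
        serotypes_by_st[st].add(serotype)
    return {
        st: sorted(serotypes)
        for st, serotypes in serotypes_by_st.items()
        if len(serotypes) > 1
    }


# Hypothetical isolate records illustrating the ST172 case described above.
isolates = [
    ("iso1", "ST172", "19F"),
    ("iso2", "ST172", "15C"),
    ("iso3", "ST217", "1"),
    ("iso4", "ST217", "1"),
]
switches = candidate_switches(isolates)
print(switches)
```

A shared ST only flags a candidate. As the ST172 example shows, the whole-genome distance between the flagged isolates still has to be inspected to judge how recent the switch is.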
There were 17 serotypes identified within the carriage group, and only 3 serotypes (6A, 7F and 19F), from 4 isolates (4/21, 19.1%), were included in PCV13. Seven STs were identified, 7 were designated novel, and 7 were incomplete and could not be determined. A single PMEN 25 (ST63) clone expressing serotype 14 was identified in this group. Carriage isolates were evenly distributed throughout the tree with no apparent clustering.