1.  Cryptic ecology among host generalist Campylobacter jejuni in domestic animals 
Molecular Ecology  2014;23(10):2442-2451.
Homologous recombination between bacterial strains is theoretically capable of preventing the separation of daughter clusters, and producing cohesive clouds of genotypes in sequence space. However, numerous barriers to recombination are known. Barriers may be essential such as adaptive incompatibility, or ecological, which is associated with the opportunities for recombination in the natural habitat. Campylobacter jejuni is a gut colonizer of numerous animal species and a major human enteric pathogen. We demonstrate that the two major generalist lineages of C. jejuni do not show evidence of recombination with each other in nature, despite having a high degree of host niche overlap and recombining extensively with specialist lineages. However, transformation experiments show that the generalist lineages readily recombine with one another in vitro. This suggests ecological rather than essential barriers to recombination, caused by a cryptic niche structure within the hosts.
PMCID: PMC4237157  PMID: 24689900
adaptation; Campylobacter; genomics; recombination barriers
2.  Two-phase importance sampling for inference about transmission trees 
There has been growing interest in the statistics community to develop methods for inferring transmission pathways of infectious pathogens from molecular sequence data. For many datasets, the computational challenge lies in the huge dimension of the missing data. Here, we introduce an importance sampling scheme in which the transmission trees and phylogenies of pathogens are both sampled from reasonable importance distributions, alleviating the inference. Using this approach, arbitrary models of transmission could be considered, contrary to many earlier proposed methods. We illustrate the scheme by analysing transmissions of Streptococcus pneumoniae from household to household within a refugee camp, using data in which only a fraction of hosts is observed, but which is still rich enough to unravel the within-household transmission dynamics and pairs of households between whom transmission is plausible. We observe that while probability of direct transmission is low even for the most prominent cases of transmission, still those pairs of households are geographically much closer to each other than expected under random proximity.
PMCID: PMC4211445  PMID: 25253455
transmission tree; molecular epidemiology; Streptococcus pneumoniae
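As a generic illustration of the importance sampling idea named in the abstract above, and not of the paper's two-phase scheme for transmission trees, the following minimal sketch estimates a small tail probability under a standard normal target by drawing from a shifted normal proposal and reweighting each draw. The target, the proposal, the threshold and the function name are all assumptions chosen for the example.

```python
import math
import random

def tail_probability_is(threshold=3.0, n_draws=50000, seed=1):
    """Estimate P(X > threshold) for X ~ N(0, 1) by importance sampling.

    Draws come from the proposal N(threshold, 1), which places most of its
    mass in the region of interest; each draw is reweighted by the density
    ratio target / proposal. All settings here are illustrative assumptions.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        x = rng.gauss(threshold, 1.0)              # sample from the proposal
        if x > threshold:
            # log of N(0,1) density minus log of N(threshold,1) density at x
            log_weight = -0.5 * x * x + 0.5 * (x - threshold) ** 2
            total += math.exp(log_weight)
    return total / n_draws

estimate = tail_probability_is()
exact = 0.5 * math.erfc(3.0 / math.sqrt(2.0))      # closed form for comparison
```

Sampling directly from N(0, 1) would hit the region X > 3 in only about one draw per thousand, so the same number of draws would give a much noisier estimate; concentrating the proposal where the integrand matters is the design choice that importance sampling relies on.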
3.  Evolution and transmission of drug resistant tuberculosis in a Russian population 
Nature genetics  2014;46(3):279-286.
The molecular mechanisms determining transmissibility and prevalence of drug-resistant tuberculosis in a population were investigated through whole genome sequencing of 1,000 prospectively-obtained patient isolates from Russia. Two-thirds belonged to the Beijing lineage, which was dominated by two homogeneous clades. MDR genotypes were found in 48% of isolates overall and 87% of the major clades. The most common rifampicin-resistance rpoB mutation was associated with fitness-compensatory mutations in rpoA or rpoC, and a novel intragenic compensatory substitution was identified. The proportion of MDR cases with XDR-tuberculosis was 16% overall with 65% of MDR isolates harboring eis mutations, selected by kanamycin therapy, which may drive the expansion of strains with enhanced virulence. The combination of drug resistance and compensatory mutations displayed by the major clades confers clinical resistance without compromising fitness and transmissibility, revealing a biological contribution to the tuberculosis program weaknesses driving the persistence and spread of M/XDR-tuberculosis in Russia and beyond.
PMCID: PMC3939361  PMID: 24464101
4.  Dense genomic sampling identifies highways of pneumococcal recombination 
Nature genetics  2014;46(3):305-309.
Evasion of clinical interventions by Streptococcus pneumoniae occurs through selection of non-susceptible genomic variants. Here we use genome sequencing of 3,085 pneumococcal carriage isolates from a 2.4 km2 refugee camp to enable unprecedented resolution of the process of recombination, and highlight its impact on population evolution. Genomic recombination hotspots show remarkable consistency between lineages, indicating common selective pressures acting at certain loci, particularly those associated with antibiotic resistance. Temporal changes in antibiotic consumption are reflected in changes in recombination trends demonstrating rapid spread of resistance when selective pressure is high. The highest frequencies of receipt and donation of recombined DNA fragments were observed in non-encapsulated lineages, implying that this largely overlooked pneumococcal group, which is beyond the reach of current vaccines, may play a major role in genetic exchange and adaptation of the species as a whole. These findings advance our understanding of pneumococcal population dynamics and provide important information for the design of future intervention strategies.
PMCID: PMC3970364  PMID: 24509479
5.  Transcriptome Analysis Reveals Signature of Adaptation to Landscape Fragmentation 
PLoS ONE  2014;9(7):e101467.
We characterize allelic and gene expression variation between populations of the Glanville fritillary butterfly (Melitaea cinxia) from two fragmented and two continuous landscapes in northern Europe. The populations exhibit significant differences in their life history traits, e.g. butterflies from fragmented landscapes have higher flight metabolic rate and dispersal rate in the field, and higher larval growth rate, than butterflies from continuous landscapes. In fragmented landscapes, local populations are small and have a high risk of local extinction, and hence the long-term persistence at the landscape level is based on frequent re-colonization of vacant habitat patches, which is predicted to select for increased dispersal rate. Using RNA-seq data and a common garden experiment, we found that a large number of genes (1,841) were differentially expressed between the landscape types. Hexamerin genes, the expression of which has previously been shown to have high heritability and which correlate strongly with larval development time in the Glanville fritillary, had higher expression in fragmented than continuous landscapes. Genes that were more highly expressed in butterflies from newly-established than old local populations within a fragmented landscape were also more highly expressed, at the landscape level, in fragmented than continuous landscapes. This result suggests that recurrent extinctions and re-colonizations in fragmented landscapes select for a specific expression profile. Genes that were significantly up-regulated following an experimental flight treatment had higher basal expression in fragmented landscapes, indicating that these butterflies are genetically primed for frequent flight. Active flight causes oxidative stress, but butterflies from fragmented landscapes were more tolerant of hypoxia. We conclude that differences in gene expression between the landscape types reflect genomic adaptations to landscape fragmentation.
PMCID: PMC4079591  PMID: 24988207
6.  Inference on Population Histories by Approximating Infinite Alleles Diffusion 
Molecular Biology and Evolution  2012;30(2):457-468.
Reconstruction of the past is an important task of evolutionary biology. It takes place at different points in a hierarchy of molecular variation, including genes, individuals, populations, and species. Statistical inference about population histories has recently received considerable attention, following the development of computational tools to provide tractable approaches to this very challenging problem. Here, we introduce a likelihood-based approach which generalizes a recently developed model for random fluctuations in allele frequencies based on an approximation to the neutral Wright–Fisher diffusion. Our new framework approximates the infinite alleles Wright–Fisher model and uses an implementation with an adaptive Markov chain Monte Carlo algorithm. The method is especially well suited to data sets harboring large population samples and relatively few loci for which other likelihood-based models are currently computationally intractable. Using our model, we reconstruct the global population history of a major human pathogen, Streptococcus pneumoniae. The results illustrate the potential to reach important biological insights to an evolutionary process by a population genetics approach, which can appropriately accommodate very large population samples.
PMCID: PMC3548313  PMID: 22993237
population history; genetic drift; infinite alleles Wright–Fisher model
7.  Emergence of Epidemic Multidrug-Resistant Enterococcus faecium from Animal and Commensal Strains 
mBio  2013;4(4):e00534-13.
Enterococcus faecium, natively a gut commensal organism, emerged as a leading cause of multidrug-resistant hospital-acquired infection in the 1980s. As the living record of its adaptation to changes in habitat, we sequenced the genomes of 51 strains, isolated from various ecological environments, to understand how E. faecium emerged as a leading hospital pathogen. Because of the scale and diversity of the sampled strains, we were able to resolve the lineage responsible for epidemic, multidrug-resistant human infection from other strains and to measure the evolutionary distances between groups. We found that the epidemic hospital-adapted lineage is rapidly evolving and emerged approximately 75 years ago, concomitant with the introduction of antibiotics, from a population that included the majority of animal strains, and not from human commensal lines. We further found that the lineage that included most strains of animal origin diverged from the main human commensal line approximately 3,000 years ago, a time that corresponds to increasing urbanization of humans, development of hygienic practices, and domestication of animals, which we speculate contributed to their ecological separation. Each bifurcation was accompanied by the acquisition of new metabolic capabilities and colonization traits on mobile elements and the loss of function and genome remodeling associated with mobile element insertion and movement. As a result, diversity within the species, in terms of sequence divergence as well as gene content, spans a range usually associated with speciation.
Enterococci, in particular vancomycin-resistant Enterococcus faecium, recently emerged as a leading cause of hospital-acquired infection worldwide. In this study, we examined genome sequence data to understand the bacterial adaptations that accompanied this transformation from microbes that existed for eons as members of host microbiota. We observed changes in the genomes that paralleled changes in human behavior. An initial bifurcation within the species appears to have occurred at a time that corresponds to the urbanization of humans and domestication of animals, and a more recent bifurcation parallels the introduction of antibiotics in medicine and agriculture. In response to the opportunity to fill niches associated with changes in human activity, a rapidly evolving lineage emerged, a lineage responsible for the vast majority of multidrug-resistant E. faecium infections.
PMCID: PMC3747589  PMID: 23963180
8.  Recent Recombination Events in the Core Genome Are Associated with Adaptive Evolution in Enterococcus faecium 
Genome Biology and Evolution  2013;5(8):1524-1535.
Reasons for the rising clinical impact of the bacterium Enterococcus faecium include the species’ rapid acquisition of adaptive genetic elements. Here, we focused on the impact of recombination on the evolution of E. faecium. We used the recently developed BratNextGen algorithm to detect recombinant regions in the core genome of 34 E. faecium strains, including three newly sequenced clinical strains. Recombination was found to have a significant impact on the E. faecium genome: of the original 1.2 million positions in the core genome, 0.5 million were predicted to have been affected by recombination in at least one strain. Importantly, strains in one of the two major E. faecium clades (clade B), which contains most of the E. faecium human gut commensals, formed the most important reservoir for donating foreign DNA to the second major E. faecium clade (clade A), which contains most of the clinical isolates. Also, several genomic regions were found to mainly recombine in specific hospital-associated E. faecium strains. One of these regions (the epa-like locus) likely encodes the biosynthesis of cell wall polysaccharides. These findings suggest a crucial role for recombination in the emergence of E. faecium as a successful hospital-associated pathogen.
PMCID: PMC3762198  PMID: 23882129
BratNextGen; comparative genomics; phylogenomics; whole-genome sequencing; nosocomial pathogen; antibiotic resistance
9.  Historical Zoonoses and Other Changes in Host Tropism of Staphylococcus aureus, Identified by Phylogenetic Analysis of a Population Dataset 
PLoS ONE  2013;8(5):e62369.
Staphylococcus aureus exhibits tropisms to many distinct animal hosts. While spillover events can occur wherever there is an interface between host species, changes in host tropism only occur with the establishment of sustained transmission in the new host species, leading to clonal expansion. Although the genomic variation underpinning adaptation in S. aureus genotypes infecting bovids and poultry has been well characterized, the frequency of switches from one host to another remains obscure. We sought to identify sustained switches in host tropism in the S. aureus population, both anthroponotic and zoonotic, and their distribution over the species phylogeny.
We have used a sample of 3042 isolates, representing 696 distinct MLST genotypes, from a well-established database. Using an empirical parsimony approach (AdaptML) we have investigated the distribution of switches in host association between both human and non-human (henceforth referred to as animal) hosts. We reconstructed a credible description of past events in the form of a phylogenetic tree; the nodes and leaves of which are statistically associated with either human or animal habitats, estimated from extant host-association and the degree of sequence divergence between genotypes. We identified 15 likely historical switching events; 13 anthroponoses and two zoonoses. Importantly, we identified two human-associated clade candidates (CC25 and CC59) that have arisen from animal-associated ancestors; this demonstrates that a human-specific lineage can emerge from an animal host. We also highlight novel rabbit-associated genotypes arising from a human ancestor.
S. aureus is an organism with the capacity to switch into and adapt to novel hosts, even after long periods of isolation in a single host species. Based on this evidence, animal-adapted S. aureus lineages exhibiting resistance to antibiotics must be considered a major threat to public health, as they can adapt to the human population.
PMCID: PMC3647051  PMID: 23667472
10.  Hierarchical and Spatially Explicit Clustering of DNA Sequences with BAPS Software 
Molecular Biology and Evolution  2013;30(5):1224-1228.
Phylogeographical analyses have become commonplace for a myriad of organisms with the advent of cheap DNA sequencing technologies. Bayesian model-based clustering is a powerful tool for detecting important patterns in such data and can be used to decipher even quite subtle signals of systematic differences in molecular variation. Here, we introduce two upgrades to the Bayesian Analysis of Population Structure (BAPS) software, which enable 1) spatially explicit modeling of variation in DNA sequences and 2) hierarchical clustering of DNA sequence data to reveal nested genetic population structures. We provide a direct interface to map the results from spatial clustering with Google Maps using the portal and illustrate this approach using sequence data from Borrelia burgdorferi. The usefulness of hierarchical clustering is demonstrated through an analysis of the metapopulation structure within a bacterial population experiencing a high level of local horizontal gene transfer. The tools that are introduced are freely available at
PMCID: PMC3670731  PMID: 23408797
genetic population structure; phylogeographics; Bayesian inference; evolutionary epidemiology
11.  Approximate Bayesian Computation 
PLoS Computational Biology  2013;9(1):e1002803.
Approximate Bayesian computation (ABC) constitutes a class of computational methods rooted in Bayesian statistics. In all model-based statistical inference, the likelihood function is of central importance, since it expresses the probability of the observed data under a particular statistical model, and thus quantifies the support data lend to particular values of parameters and to choices among different models. For simple models, an analytical formula for the likelihood function can typically be derived. However, for more complex models, an analytical formula might be elusive or the likelihood function might be computationally very costly to evaluate. ABC methods bypass the evaluation of the likelihood function. In this way, ABC methods widen the realm of models for which statistical inference can be considered. ABC methods are mathematically well-founded, but they inevitably make assumptions and approximations whose impact needs to be carefully assessed. Furthermore, the wider application domain of ABC exacerbates the challenges of parameter estimation and model selection. ABC has rapidly gained popularity over the last years and in particular for the analysis of complex problems arising in biological sciences (e.g., in population genetics, ecology, epidemiology, and systems biology).
PMCID: PMC3547661  PMID: 23341757
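To make the likelihood-free idea in the abstract above concrete, here is a minimal sketch of the simplest ABC variant, rejection sampling: draw a parameter from the prior, simulate data under it, and keep the draw only if a summary statistic of the simulated data lies within a tolerance of the observed one. The toy model (a normal mean with known unit variance), the uniform prior, the sample mean as summary statistic, the tolerance and the data values are all assumptions made for illustration; none of them come from the article.

```python
import random
import statistics

def abc_rejection(observed, n_draws=20000, tolerance=0.1, seed=0):
    """Approximate posterior draws for a normal mean by ABC rejection.

    The likelihood is never evaluated: parameters are accepted purely on
    how closely simulated data resemble the observed data.
    """
    rng = random.Random(seed)
    obs_summary = statistics.fmean(observed)       # summary statistic of the data
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(-5.0, 5.0)             # draw from the assumed uniform prior
        simulated = [rng.gauss(theta, 1.0) for _ in range(len(observed))]
        distance = abs(statistics.fmean(simulated) - obs_summary)
        if distance <= tolerance:                  # keep draws that reproduce the data
            accepted.append(theta)
    return accepted

# Hypothetical observations, used only to exercise the sketch.
observed = [1.8, 2.3, 2.1, 1.6, 2.4, 2.0, 1.9, 2.2]
posterior_sample = abc_rejection(observed)
posterior_mean = statistics.fmean(posterior_sample)
```

The tolerance and the choice of summary statistic are the approximations whose impact the abstract says must be assessed: a smaller tolerance brings the accepted draws closer to the true posterior but lowers the acceptance rate.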
12.  Phylogeographic variation in recombination rates within a global clone of methicillin-resistant Staphylococcus aureus 
Genome Biology  2012;13(12):R126.
Next-generation sequencing (NGS) is a powerful tool for understanding both patterns of descent over time and space (phylogeography) and the molecular processes underpinning genome divergence in pathogenic bacteria. Here, we describe a synthesis between these perspectives by employing a recently developed Bayesian approach, BRATNextGen, for detecting recombination on an expanded NGS dataset of the globally disseminated methicillin-resistant Staphylococcus aureus (MRSA) clone ST239.
The data confirm strong geographical clustering at continental, national and city scales and demonstrate that the rate of recombination varies significantly between phylogeographic sub-groups representing independent introductions from Europe. These differences are most striking when mobile non-core genes are included, but remain apparent even when only considering the stable core genome. The monophyletic ST239 sub-group corresponding to isolates from South America shows heightened recombination, the sub-group predominantly from Asia shows an intermediate level, and a very low level of recombination is noted in a third sub-group representing a large collection from Turkey.
We show that the rapid global dissemination of a single pathogenic bacterial clone results in local variation in measured recombination rates. Possible explanatory variables include the size and time since emergence of each defined sub-population (as determined by the sampling frame), variation in transmission dynamics due to host movement, and changes in the bacterial genome affecting the propensity for recombination.
PMCID: PMC3803117  PMID: 23270620
13.  Population subdivision and the detection of recombination in non-typable Haemophilus influenzae 
Microbiology  2012;158(Pt 12):2958-2964.
The disparity in diversity between unencapsulated (non-typable; NT) and encapsulated, serotypable Haemophilus influenzae (Hi) has been recognized for some time. It has previously been suggested that the wider diversity evidenced within NTHi compared with typable lineages may be due to different rates of recombination within the encapsulated and NT populations. To examine whether there is evidence for different levels of recombination within typable and NT lineages of Hi, we performed a statistical genetic analysis of 819 distinct genotypes of Hi to explore the congruence of serotype with population genetic clustering, and to identify patterns of recombination within the Hi population. We find that a significantly larger proportion of NT isolates show evidence of recombination, compared with typable isolates, and also that when admixture is present, the total amount of recombination per strain is greater within NT isolates, compared with the typable population. Furthermore, we demonstrate significant heterogeneity in the number of admixed individuals between NT lineages themselves, while such variation was not observed in typable lineages. This variability suggests that factors other than the presence of capsule are important determinants of recombination rate in the Hi population.
PMCID: PMC4083659  PMID: 23038806
14.  Probabilistic Prediction of Contacts in Protein-Ligand Complexes 
PLoS ONE  2012;7(11):e49216.
We introduce a statistical method for evaluating atomic level 3D interaction patterns of protein-ligand contacts. Such patterns can be used for fast separation of likely ligand and ligand binding site combinations out of all those that are geometrically possible. The practical purpose of this probabilistic method is for molecular docking and scoring, as an essential part of a scoring function. Probabilities of interaction patterns are calculated conditional on structural x-ray data and predefined chemical classification of molecular fragment types. Spatial coordinates of atoms are modeled using a Bayesian statistical framework with parametric 3D probability densities. The parameters are given distributions a priori, which provides the possibility to update the densities of model parameters with new structural data and use the parameter estimates to create a contact hierarchy. The contact preferences can be defined for any spatial area around a specified type of fragment. We compared calculated contact point hierarchies with the number of contact atoms found near the contact point in a reference set of x-ray data, and found that these were in general in close agreement. Additionally, using the substrate binding site in catechol-O-methyltransferase and 27 small potential binder molecules, it was demonstrated that these probabilities together with auxiliary parameters separate ligands from decoys well (true positive rate 0.75, false positive rate 0). A particularly useful feature of the proposed Bayesian framework is that it also characterizes predictive uncertainty in terms of probabilities, which have an intuitive interpretation from the applied perspective.
PMCID: PMC3498326  PMID: 23155467
15.  Clinical isolates of Yersinia enterocolitica Biotype 1A represent two phylogenetic lineages with differing pathogenicity-related properties 
BMC Microbiology  2012;12:208.
Y. enterocolitica biotype (BT) 1A strains are often isolated from human clinical samples but their contribution to disease has remained a controversial topic. Variation and the population structure among the clinical Y. enterocolitica BT 1A isolates have been poorly characterized. We used multi-locus sequence typing (MLST), 16S rRNA gene sequencing, PCR for ystA and ystB, lipopolysaccharide analysis, phage typing, human serum complement killing assay and analysis of the symptoms of the patients to characterize 298 clinical Y. enterocolitica BT 1A isolates in order to evaluate their relatedness and pathogenic potential.
A subset of 71 BT 1A strains, selected based on their varying LPS patterns, were subjected to detailed genetic analyses. The MLST on seven house-keeping genes (adk, argA, aroA, glnA, gyrB, thrA, trpE) conducted on 43 of the strains discriminated them into 39 MLST-types. By Bayesian analysis of the population structure (BAPS) the strains clustered conclusively into two distinct lineages, i.e. Genetic groups 1 and 2. The strains of Genetic group 1 were more closely related (97% similarity) to the pathogenic bio/serotype 4/O:3 strains than Genetic group 2 strains (95% similarity). Further comparison of the 16S rRNA genes of the BT 1A strains indicated that altogether 17 of the 71 strains belong to Genetic group 2. On the 16S rRNA analysis, these 17 strains were only 98% similar to the previously identified subspecies of Y. enterocolitica. The strains of Genetic group 2 were uniform in their pathogenicity-related properties: they lacked the ystB gene, belonged to the same LPS subtype or were of rough type, were all resistant to the five tested yersiniophages, were largely resistant to serum complement and did not ferment fucose. The 54 strains in Genetic group 1 showed much more variation in these properties. The most commonly detected LPS types were similar to the LPS types of reference strains with serotypes O:6,30 and O:6,31 (37%), O:7,8 (19%) and O:5 (15%).
The results of the present study strengthen the assertion that strains classified as Y. enterocolitica BT 1A represent more than one subspecies. Especially the BT 1A strains in our Genetic group 2 commonly showed resistance to human serum complement killing, which may indicate pathogenic potential for these strains. However, their virulence mechanisms remain unknown.
PMCID: PMC3512526  PMID: 22985268
Yersinia enterocolitica biotype 1A; MLST; 16S rRNA gene; yst genes; LPS; Phage typing; Human serum complement killing; Bayesian analysis of population structure; Pathogenicity
16.  Restricted Gene Flow among Hospital Subpopulations of Enterococcus faecium 
mBio  2012;3(4):e00151-12.
Enterococcus faecium has recently emerged as an important multiresistant nosocomial pathogen. Defining population structure in this species is required to provide insight into the existence, distribution, and dynamics of specific multiresistant or pathogenic lineages in particular environments, like the hospital. Here, we probe the population structure of E. faecium using Bayesian-based population genetic modeling implemented in Bayesian Analysis of Population Structure (BAPS) software. The analysis involved 1,720 isolates belonging to 519 sequence types (STs) (491 for E. faecium and 28 for Enterococcus faecalis). E. faecium isolates grouped into 13 BAPS (sub)groups, but the large majority (80%) of nosocomial isolates clustered in two subgroups (2-1 and 3-3). Phylogenetic and eBURST analysis of BAPS groups 2 and 3 confirmed the existence of three separate hospital lineages (17, 18, and 78), highlighting different evolutionary trajectories for BAPS 2-1 (lineage 78) and 3-3 (lineage 17 and lineage 18) isolates. Phylogenomic analysis of 29 E. faecium isolates showed agreement between BAPS assignment of STs and their relative positions in the phylogenetic tree. Odds ratio calculation confirmed the significant association between hospital isolates with BAPS 3-3 and lineages 17, 18, and 78. Admixture analysis showed a scarce number of recombination events between the different BAPS groups. For the E. faecium hospital population, we propose an evolutionary model in which strains with a high propensity to colonize and infect hospitalized patients arise through horizontal gene transfer. Once adapted to the distinct hospital niche, this subpopulation becomes isolated, and recombination with other populations declines.
Multiresistant Enterococcus faecium has become one of the most important nosocomial pathogens, causing increasing numbers of nosocomial infections worldwide. Here, we used Bayesian population genetic analysis to identify groups of related E. faecium strains and show a significant association of hospital and farm animal isolates to different genetic groups. We also found that hospital isolates could be divided into three lineages originating from sequence types (STs) 17, 18, and 78. We propose that, driven by the selective pressure in hospitals, the three hospital lineages have arisen through horizontal gene transfer, but once adapted to the distinct pathogenic niche, this population has become isolated and recombination with other populations declines. Elucidation of the population structure is a prerequisite for effective control of multiresistant E. faecium since it provides insight into the processes that have led to the progressive change of E. faecium from an innocent commensal to a multiresistant hospital-adapted pathogen.
PMCID: PMC3413404  PMID: 22807567
17.  Bayesian estimation of bacterial community composition from 454 sequencing data 
Nucleic Acids Research  2012;40(12):5240-5249.
Estimating bacterial community composition from a mixed sample in different applied contexts is an important task for many microbiologists. The bacterial community composition is commonly estimated by clustering polymerase chain reaction amplified 16S rRNA gene sequences. Current taxonomy-independent clustering methods for analyzing these sequences, such as UCLUST, ESPRIT-Tree and CROP, have two limitations: (i) expert knowledge is needed, i.e. a difference cutoff between species needs to be specified; (ii) closely related species cannot be separated. The first limitation imposes a burden on the user, since considerable effort is needed to select appropriate parameters, whereas the second limitation leads to an inaccurate description of the underlying bacterial community composition. We propose a probabilistic model-based method to estimate bacterial community composition which tackles these limitations. Our method requires very little expert knowledge, where only the possible maximum number of clusters needs to be specified. Also our method demonstrates its ability to separate closely related species in two experiments, in spite of sequencing errors and individual variations.
PMCID: PMC3384343  PMID: 22406836
18.  Population structure in the Neisseria, and the biological significance of fuzzy species 
Phenotypic and genetic variation in bacteria can take bewilderingly complex forms even within a single genus. One of the most intriguing examples of this is the genus Neisseria, which comprises both pathogens and commensals colonizing a variety of body sites and host species, and causing a range of disease. Complex relatedness among both named species and previously identified lineages of Neisseria makes it challenging to study their evolution. Using the largest publicly available collection of bacterial sequence data in combination with a population genetic analysis and experiment, we probe the contribution of inter-species recombination to neisserial population structure, and specifically whether it is more common in some strains than others. We identify hybrid groups of strains containing sequences typical of more than one species. These groups of strains, typical of a fuzzy species, appear to have experienced elevated rates of inter-species recombination estimated by population genetic analysis and further supported by transformation experiments. In particular, strains of the pathogen Neisseria meningitidis in the fuzzy species boundary appear to follow a different lifestyle, which may have considerable biological implications concerning distribution of novel resistance elements and meningococcal vaccine development. Despite the strong evidence for negligible geographical barriers to gene flow within the population, exchange of genetic material still shows directionality among named species in a non-uniform manner.
PMCID: PMC3350722  PMID: 22072450
fuzzy species; recombination; Neisseria
19.  Detection of recombination events in bacterial genomes from large population samples 
Nucleic Acids Research  2011;40(1):e6.
Analysis of important human pathogen populations is currently under transition toward whole-genome sequencing of growing numbers of samples collected on a global scale. Since recombination in bacteria is often an important factor shaping their evolution by enabling resistance elements and virulence traits to rapidly transfer from one evolutionary lineage to another, it is highly beneficial to have access to tools that can detect recombination events. Multiple advanced statistical methods exist for such purposes; however, they are typically limited either to only a few samples or to data from relatively short regions of a total genome. By harnessing the power of recent advances in Bayesian modeling techniques, we introduce here a method for detecting homologous recombination events from whole-genome sequence data for bacterial population samples on a large scale. Our statistical approach can efficiently handle hundreds of whole genome sequenced population samples and identify separate origins of the recombinant sequence, offering an enhanced insight into the diversification of bacterial clones at the level of the whole genome. A data set of 241 whole genome sequences from an important pandemic lineage of Streptococcus pneumoniae is used together with multiple simulated data sets to demonstrate the potential of our approach.
PMCID: PMC3245952  PMID: 22064866
20.  Bayesian semi-supervised classification of bacterial samples using MLST databases 
BMC Bioinformatics  2011;12:302.
Worldwide effort on sampling and characterization of molecular variation within a large number of human and animal pathogens has led to the emergence of multi-locus sequence typing (MLST) databases as an important tool for studying the epidemiology and evolution of pathogens. Many of these databases currently harbor several thousand multi-locus DNA sequence types (STs) enriched with metadata on traits of the isolates such as serotype, antibiotic resistance and host organism. Curators of the databases thus have the possibility of dividing the pathogen populations into subsets representing different evolutionary lineages, geographically associated groups, or other subpopulations, which are defined in terms of molecular similarities and dissimilarities residing within a database. When combined with the existing metadata, such subsets may provide invaluable information for assessing the position of a new set of isolates in relation to the whole pathogen population.
To enable users of MLST schemes to query the databases with sets of new bacterial isolates and to automatically analyze their relation to existing curated sequences, we introduce here a Bayesian model-based method for semi-supervised classification of MLST data. Our method can use an MLST database as a training set and simultaneously assign any set of query sequences to the previously discovered lineages/populations, while also allowing some or all of these sequences to form previously undiscovered genetically distinct groups. This tool provides probabilistic quantification of the classification uncertainty and is highly efficient computationally, thus enabling rapid analyses of large databases and sets of query sequences. The latter feature is a necessary prerequisite for automated access through the MLST web interface. We demonstrate the versatility of our approach by analyzing both real and synthesized data from MLST databases. The introduced method for semi-supervised classification of sets of query STs is freely available for Windows, Mac OS X and Linux operating systems in the downloadable BAPS 5.4 software. The query functionality is also directly available for the Staphylococcus aureus database and will shortly be available for other species databases hosted at the same web portal.
We have introduced a model-based tool for automated semi-supervised classification of new pathogen samples that can be integrated into the web interface of the MLST databases. In particular, when combined with the existing metadata, the semi-supervised labeling may provide invaluable information for assessing the position of a new set of query strains in relation to the particular pathogen population represented by the curated database.
Such information will be useful both for clinical and basic research purposes.
PMCID: PMC3155509  PMID: 21791094
21.  Efficient Bayesian approach for multilocus association mapping including gene-gene interactions 
BMC Bioinformatics  2010;11:443.
Since the introduction of large-scale genotyping methods that can be utilized in genome-wide association (GWA) studies for deciphering complex diseases, statistical genetics has faced the tremendous challenge of how to analyze such data most appropriately. A plethora of advanced model-based methods for genetic mapping of traits has been available for more than 10 years in animal and plant breeding. However, most such methods are computationally intractable in the context of genome-wide studies. Therefore, it is hardly surprising that GWA analyses have in practice been dominated by simple statistical tests concerned with a single marker locus at a time, while the more advanced approaches have appeared only relatively recently in the biomedical and statistical literature.
We introduce a novel Bayesian modeling framework for association mapping which enables the detection of multiple loci and their interactions that influence a dichotomous phenotype of interest. The method is shown to perform well in a simulation study when compared to widely used standard alternatives and its computational complexity is typically considerably smaller than that of a maximum likelihood based approach. We also discuss in detail the sensitivity of the Bayesian inferences with respect to the choice of prior distributions in the GWA context.
Our results show that the Bayesian model averaging approach which explicitly considers gene-gene interactions may improve the detection of disease associated genetic markers in two respects: first, by providing better estimates of the locations of the causal loci; second, by reducing the number of false positives. The benefits are most apparent when the interacting genes exhibit no main effects. However, our findings also illustrate that such an approach is somewhat sensitive to the prior distribution assigned on the model structure.
PMCID: PMC2942856  PMID: 20809988
22.  Multilocus sequence types of Finnish bovine Campylobacter jejuni isolates and their attribution to human infections 
BMC Microbiology  2010;10:200.
Campylobacter jejuni is the most common bacterial cause of human gastroenteritis worldwide. Due to the sporadic nature of infection, sources often remain unknown. Multilocus sequence typing (MLST) has been successfully applied to population genetics of Campylobacter jejuni and mathematical modelling can be applied to the sequence data. Here, we analysed the population structure of a total of 250 Finnish C. jejuni isolates from bovines, poultry meat and humans collected in 2003 using a combination of Bayesian clustering (BAPS software) and phylogenetic analysis.
In the first phase we analysed sequence types (STs) of 102 Finnish bovine C. jejuni isolates by MLST and found a high diversity totalling 50 STs of which nearly half were novel. In the second phase we included MLST data from domestic human isolates as well as poultry C. jejuni isolates from the same time period. Between the human and bovine isolates we found an overlap of 72.2%, while 69% of the human isolates were overlapping with the chicken isolates. In the BAPS analysis 44.3% of the human isolates were found in bovine-associated BAPS clusters and 45.4% of the human isolates were found in the poultry-associated BAPS cluster. BAPS reflected the phylogeny of our data very well.
These findings suggest that bovines and poultry were equally important as reservoirs for human C. jejuni infections in Finland in 2003. Our results differ from those obtained in other countries where poultry has been identified as the most important source for human infections. The low prevalence of C. jejuni in poultry flocks in Finland could explain the lower attribution of human infection to poultry. Of the human isolates 10.3% were found in clusters not associated with any host which warrants further investigation, with particular focus on waterborne transmission routes and companion animals.
PMCID: PMC2914712  PMID: 20659332
23.  Full Likelihood Analysis of Genetic Risk with Variable Age at Onset Disease—Combining Population-Based Registry Data and Demographic Information 
PLoS ONE  2009;4(8):e6836.
In genetic studies of rare complex diseases it is common to ascertain familial data from population based registries through all incident cases diagnosed during a pre-defined enrollment period. Such an ascertainment procedure is typically taken into account in the statistical analysis of the familial data by constructing either a retrospective or prospective likelihood expression, which conditions on the ascertainment event. Both of these approaches lead to a substantial loss of valuable data.
Methodology and Findings
Here we consider instead the possibilities provided by a Bayesian approach to risk analysis, which also incorporates the ascertainment procedure and reference information concerning the genetic composition of the target population into the considered statistical model. Furthermore, the proposed Bayesian hierarchical survival model does not require the considered genotype or haplotype effects to be expressed as functions of corresponding allelic effects. Our modeling strategy is illustrated by a risk analysis of type 1 diabetes mellitus (T1D) in the Finnish population, based on the HLA-A, HLA-B and DRB1 human leucocyte antigen (HLA) information available for both ascertained sibships and a large number of unrelated individuals from the Finnish bone marrow donor registry. The heterozygous genotype DR3/DR4 at the DRB1 locus was associated with the lowest predictive probability of T1D free survival to the age of 15, the estimate being 0.936 (95% credible interval 0.926 to 0.945) compared to the average population T1D free survival probability of 0.995.
The proposed statistical method can be adapted to other population-based family data ascertained from a disease registry, provided that the ascertainment process is well documented and that external information concerning the sizes of birth cohorts and a suitable reference sample are available. We confirm the earlier findings from the same data concerning the HLA-DR3/4 related risks for T1D, and also provide here estimated predictive probabilities of disease free survival as a function of age.
PMCID: PMC2730012  PMID: 19718441
24.  Identifying Currents in the Gene Pool for Bacterial Populations Using an Integrative Approach 
PLoS Computational Biology  2009;5(8):e1000455.
The evolution of bacterial populations has recently become considerably better understood due to large-scale sequencing of population samples. It has become clear that DNA sequences from a multitude of genes, as well as a broad sample coverage of a target population, are needed to obtain a relatively unbiased view of its genetic structure and the patterns of ancestry connected to the strains. However, the traditional statistical methods for evolutionary inference, such as phylogenetic analysis, are associated with several difficulties under such an extensive sampling scenario, in particular when a considerable amount of recombination is anticipated to have taken place. To meet the needs of large-scale analyses of population structure for bacteria, we introduce here several statistical tools for the detection and representation of recombination between populations. Also, we introduce a model-based description of the shape of a population in sequence space, in terms of its molecular variability and affinity towards other populations. Extensive real data from the genus Neisseria are utilized to demonstrate the potential of an approach where these population genetic tools are combined with a phylogenetic analysis. The statistical tools introduced here are freely available in the downloadable BAPS 5.2 software.
Author Summary
The study of bacterial population biology is complicated by the fact that, although bacteria are largely asexual, they can also exchange genetic material through homologous recombination. Unlike in eukaryotes, recombination in bacteria is not an obligatory process. Furthermore, the recombination mechanisms are subject to many biological and ecological factors that can vary even within different populations of the same species. Although increasing evidence for homologous recombination has been found in many bacterial species, determining the frequency of recombination and understanding the influence that it exerts upon the evolution of bacterial populations remains challenging. In this article, we provide a dynamic picture of recombination within and between closely related bacterial species. Through an integration of several Bayesian statistical models, our method highlights the importance of a quantitative estimation of recombination. Our analyses of a challenging multi-locus sequence typing (MLST) database demonstrate that combined analyses using traditional phylogenetic methods, explorative MLST tools and Bayesian population genetic models can together yield interesting biological insights that cannot easily be reached by any of the approaches alone.
PMCID: PMC2713424  PMID: 19662158
25.  Sequence analysis of percent G+C fraction libraries of human faecal bacterial DNA reveals a high number of Actinobacteria 
BMC Microbiology  2009;9:68.
The human gastrointestinal (GI) tract microbiota is characterised by an abundance of uncultured bacteria most often assigned to the phyla Firmicutes and Bacteroidetes. The diversity of this microbiota, even though approached with culture-independent techniques in several studies, still requires more elucidation. The main purpose of this work was to study whether genomic percent guanine and cytosine (%G+C)-based profiling and fractioning prior to 16S rRNA gene sequence analysis reveals higher microbiota diversity, especially among the high G+C bacteria suggested to be underrepresented in previous studies.
A phylogenetic analysis of the composition of the human GI microbiota of 23 healthy adult subjects was performed from a pooled faecal bacterial DNA sample by combining genomic %G+C -based profiling and fractioning with 16S rRNA gene cloning and sequencing. A total of 3199 partial 16S rRNA genes were sequenced. For comparison, 459 clones were sequenced from a comparable unfractioned sample. The most important finding was that the proportional amount of sequences affiliating with the phylum Actinobacteria was 26.6% in the %G+C fractioned sample but only 3.5% in the unfractioned sample. The orders Coriobacteriales, Bifidobacteriales and Actinomycetales constituted the 65 actinobacterial phylotypes in the fractioned sample, accounting for 50%, 47% and 3% of sequences within the phylum, respectively.
This study shows that %G+C profiling and fractioning prior to cloning and sequencing can reveal a significantly larger proportion of high G+C content bacteria within the clones recovered from the human GI tract, compared with the unfractioned sample. In particular, the order Coriobacteriales within the phylum Actinobacteria was found to be more abundant than previously estimated with conventional sequencing studies.
PMCID: PMC2679024  PMID: 19351420
