1.  Inference on Population Histories by Approximating Infinite Alleles Diffusion 
Molecular Biology and Evolution  2012;30(2):457-468.
Reconstruction of the past is an important task of evolutionary biology. It takes place at different points in a hierarchy of molecular variation, including genes, individuals, populations, and species. Statistical inference about population histories has recently received considerable attention, following the development of computational tools to provide tractable approaches to this very challenging problem. Here, we introduce a likelihood-based approach which generalizes a recently developed model for random fluctuations in allele frequencies based on an approximation to the neutral Wright–Fisher diffusion. Our new framework approximates the infinite alleles Wright–Fisher model and uses an implementation with an adaptive Markov chain Monte Carlo algorithm. The method is especially well suited to data sets harboring large population samples and relatively few loci for which other likelihood-based models are currently computationally intractable. Using our model, we reconstruct the global population history of a major human pathogen, Streptococcus pneumoniae. The results illustrate the potential to reach important biological insights to an evolutionary process by a population genetics approach, which can appropriately accommodate very large population samples.
PMCID: PMC3548313  PMID: 22993237
population history; genetic drift; infinite alleles Wright–Fisher model
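The random fluctuations in allele frequencies that entry 1 models can be illustrated with a much simpler simulation. The sketch below is not the authors' method: it assumes a haploid population of constant size with only two alleles, whereas the paper approximates the infinite alleles Wright–Fisher diffusion. The function name and parameters are hypothetical.

```python
import random


def wright_fisher(n_individuals, p0, generations, seed=0):
    """Simulate neutral drift of one allele's frequency (two-allele, haploid sketch)."""
    rng = random.Random(seed)
    p = p0
    trajectory = [p]
    for _ in range(generations):
        # Each individual in the next generation copies a random parent,
        # so the allele count is a binomial draw with success probability p.
        count = sum(1 for _ in range(n_individuals) if rng.random() < p)
        p = count / n_individuals
        trajectory.append(p)
    return trajectory


trajectory = wright_fisher(n_individuals=100, p0=0.5, generations=50)
print(trajectory[-1])
```

Smaller values of `n_individuals` make the frequency wander faster, which is the drift signal that likelihood-based methods of this kind exploit.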
2.  Emergence of Epidemic Multidrug-Resistant Enterococcus faecium from Animal and Commensal Strains 
mBio  2013;4(4):e00534-13.
Enterococcus faecium, natively a gut commensal organism, emerged as a leading cause of multidrug-resistant hospital-acquired infection in the 1980s. As the living record of its adaptation to changes in habitat, we sequenced the genomes of 51 strains, isolated from various ecological environments, to understand how E. faecium emerged as a leading hospital pathogen. Because of the scale and diversity of the sampled strains, we were able to resolve the lineage responsible for epidemic, multidrug-resistant human infection from other strains and to measure the evolutionary distances between groups. We found that the epidemic hospital-adapted lineage is rapidly evolving and emerged approximately 75 years ago, concomitant with the introduction of antibiotics, from a population that included the majority of animal strains, and not from human commensal lines. We further found that the lineage that included most strains of animal origin diverged from the main human commensal line approximately 3,000 years ago, a time that corresponds to increasing urbanization of humans, development of hygienic practices, and domestication of animals, which we speculate contributed to their ecological separation. Each bifurcation was accompanied by the acquisition of new metabolic capabilities and colonization traits on mobile elements and the loss of function and genome remodeling associated with mobile element insertion and movement. As a result, diversity within the species, in terms of sequence divergence as well as gene content, spans a range usually associated with speciation.
Enterococci, in particular vancomycin-resistant Enterococcus faecium, recently emerged as a leading cause of hospital-acquired infection worldwide. In this study, we examined genome sequence data to understand the bacterial adaptations that accompanied this transformation from microbes that existed for eons as members of host microbiota. We observed changes in the genomes that paralleled changes in human behavior. An initial bifurcation within the species appears to have occurred at a time that corresponds to the urbanization of humans and domestication of animals, and a more recent bifurcation parallels the introduction of antibiotics in medicine and agriculture. In response to the opportunity to fill niches associated with changes in human activity, a rapidly evolving lineage emerged, a lineage responsible for the vast majority of multidrug-resistant E. faecium infections.
PMCID: PMC3747589  PMID: 23963180
3.  Recent Recombination Events in the Core Genome Are Associated with Adaptive Evolution in Enterococcus faecium 
Genome Biology and Evolution  2013;5(8):1524-1535.
Reasons for the rising clinical impact of the bacterium Enterococcus faecium include the species’ rapid acquisition of adaptive genetic elements. Here, we focused on the impact of recombination on the evolution of E. faecium. We used the recently developed BratNextGen algorithm to detect recombinant regions in the core genome of 34 E. faecium strains, including three newly sequenced clinical strains. Recombination was found to have a significant impact on the E. faecium genome: of the original 1.2 million positions in the core genome, 0.5 million were predicted to have been affected by recombination in at least one strain. Importantly, strains in one of the two major E. faecium clades (clade B), which contains most of the E. faecium human gut commensals, formed the most important reservoir for donating foreign DNA to the second major E. faecium clade (clade A), which contains most of the clinical isolates. Also, several genomic regions were found to mainly recombine in specific hospital-associated E. faecium strains. One of these regions (the epa-like locus) likely encodes the biosynthesis of cell wall polysaccharides. These findings suggest a crucial role for recombination in the emergence of E. faecium as a successful hospital-associated pathogen.
PMCID: PMC3762198  PMID: 23882129
BratNextGen; comparative genomics; phylogenomics; whole-genome sequencing; nosocomial pathogen; antibiotic resistance
4.  Historical Zoonoses and Other Changes in Host Tropism of Staphylococcus aureus, Identified by Phylogenetic Analysis of a Population Dataset 
PLoS ONE  2013;8(5):e62369.
Staphylococcus aureus exhibits tropisms to many distinct animal hosts. While spillover events can occur wherever there is an interface between host species, changes in host tropism only occur with the establishment of sustained transmission in the new host species, leading to clonal expansion. Although the genomic variation underpinning adaptation in S. aureus genotypes infecting bovids and poultry has been well characterized, the frequency of switches from one host to another remains obscure. We sought to identify sustained switches in host tropism in the S. aureus population, both anthroponotic and zoonotic, and their distribution over the species phylogeny.
We have used a sample of 3042 isolates, representing 696 distinct MLST genotypes, from a well-established database. Using an empirical parsimony approach (AdaptML), we have investigated the distribution of switches in host association between both human and non-human (henceforth referred to as animal) hosts. We reconstructed a credible description of past events in the form of a phylogenetic tree, the nodes and leaves of which are statistically associated with either human or animal habitats, estimated from extant host-association and the degree of sequence divergence between genotypes. We identified 15 likely historical switching events: 13 anthroponoses and two zoonoses. Importantly, we identified two human-associated clade candidates (CC25 and CC59) that have arisen from animal-associated ancestors; this demonstrates that a human-specific lineage can emerge from an animal host. We also highlight novel rabbit-associated genotypes arising from a human ancestor.
S. aureus is an organism with the capacity to switch into and adapt to novel hosts, even after long periods of isolation in a single host species. Based on this evidence, animal-adapted S. aureus lineages exhibiting resistance to antibiotics must be considered a major threat to public health, as they can adapt to the human population.
PMCID: PMC3647051  PMID: 23667472
5.  Hierarchical and Spatially Explicit Clustering of DNA Sequences with BAPS Software 
Molecular Biology and Evolution  2013;30(5):1224-1228.
Phylogeographical analyses have become commonplace for a myriad of organisms with the advent of cheap DNA sequencing technologies. Bayesian model-based clustering is a powerful tool for detecting important patterns in such data and can be used to decipher even quite subtle signals of systematic differences in molecular variation. Here, we introduce two upgrades to the Bayesian Analysis of Population Structure (BAPS) software, which enable 1) spatially explicit modeling of variation in DNA sequences and 2) hierarchical clustering of DNA sequence data to reveal nested genetic population structures. We provide a direct interface to map the results from spatial clustering with Google Maps using the portal and illustrate this approach using sequence data from Borrelia burgdorferi. The usefulness of hierarchical clustering is demonstrated through an analysis of the metapopulation structure within a bacterial population experiencing a high level of local horizontal gene transfer. The tools that are introduced are freely available at
PMCID: PMC3670731  PMID: 23408797
genetic population structure; phylogeographics; Bayesian inference; evolutionary epidemiology
6.  Approximate Bayesian Computation 
PLoS Computational Biology  2013;9(1):e1002803.
Approximate Bayesian computation (ABC) constitutes a class of computational methods rooted in Bayesian statistics. In all model-based statistical inference, the likelihood function is of central importance, since it expresses the probability of the observed data under a particular statistical model, and thus quantifies the support data lend to particular values of parameters and to choices among different models. For simple models, an analytical formula for the likelihood function can typically be derived. However, for more complex models, an analytical formula might be elusive or the likelihood function might be computationally very costly to evaluate. ABC methods bypass the evaluation of the likelihood function. In this way, ABC methods widen the realm of models for which statistical inference can be considered. ABC methods are mathematically well-founded, but they inevitably make assumptions and approximations whose impact needs to be carefully assessed. Furthermore, the wider application domain of ABC exacerbates the challenges of parameter estimation and model selection. ABC has rapidly gained popularity over the last years and in particular for the analysis of complex problems arising in biological sciences (e.g., in population genetics, ecology, epidemiology, and systems biology).
PMCID: PMC3547661  PMID: 23341757
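The idea in entry 6 of bypassing the likelihood can be sketched with the simplest ABC variant, rejection sampling. The example is a toy under stated assumptions: the model (coin flips with a uniform prior on the heads probability), the function name, and the tolerance are all illustrative choices, not taken from the article.

```python
import random


def abc_rejection(observed_heads, n_flips, n_draws=5000, tolerance=0, seed=1):
    """Approximate the posterior of a coin's heads probability without a likelihood."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        theta = rng.random()  # draw a candidate parameter from the uniform prior
        # Simulate a data set under the candidate parameter.
        simulated = sum(1 for _ in range(n_flips) if rng.random() < theta)
        # Keep the candidate only if the simulated data are close to the observed data.
        if abs(simulated - observed_heads) <= tolerance:
            accepted.append(theta)
    return accepted


posterior = abc_rejection(observed_heads=7, n_flips=10)
posterior_mean = sum(posterior) / len(posterior)
print(len(posterior), round(posterior_mean, 3))
```

With `tolerance=0` the accepted draws follow the exact posterior, here Beta(8, 4) with mean 2/3. Raising the tolerance accepts more draws at the cost of an approximation error, which is the trade-off the abstract says must be carefully assessed.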
7.  Phylogeographic variation in recombination rates within a global clone of methicillin-resistant Staphylococcus aureus 
Genome Biology  2012;13(12):R126.
Next-generation sequencing (NGS) is a powerful tool for understanding both patterns of descent over time and space (phylogeography) and the molecular processes underpinning genome divergence in pathogenic bacteria. Here, we describe a synthesis between these perspectives by employing a recently developed Bayesian approach, BRATNextGen, for detecting recombination on an expanded NGS dataset of the globally disseminated methicillin-resistant Staphylococcus aureus (MRSA) clone ST239.
The data confirm strong geographical clustering at continental, national and city scales and demonstrate that the rate of recombination varies significantly between phylogeographic sub-groups representing independent introductions from Europe. These differences are most striking when mobile non-core genes are included, but remain apparent even when only considering the stable core genome. The monophyletic ST239 sub-group corresponding to isolates from South America shows heightened recombination, the sub-group predominantly from Asia shows an intermediate level, and a very low level of recombination is noted in a third sub-group representing a large collection from Turkey.
We show that the rapid global dissemination of a single pathogenic bacterial clone results in local variation in measured recombination rates. Possible explanatory variables include the size and time since emergence of each defined sub-population (as determined by the sampling frame), variation in transmission dynamics due to host movement, and changes in the bacterial genome affecting the propensity for recombination.
PMCID: PMC3803117  PMID: 23270620
8.  Probabilistic Prediction of Contacts in Protein-Ligand Complexes 
PLoS ONE  2012;7(11):e49216.
We introduce a statistical method for evaluating atomic level 3D interaction patterns of protein-ligand contacts. Such patterns can be used for fast separation of likely ligand and ligand binding site combinations out of all those that are geometrically possible. The practical purpose of this probabilistic method is for molecular docking and scoring, as an essential part of a scoring function. Probabilities of interaction patterns are calculated conditional on structural x-ray data and predefined chemical classification of molecular fragment types. Spatial coordinates of atoms are modeled using a Bayesian statistical framework with parametric 3D probability densities. The parameters are given distributions a priori, which provides the possibility to update the densities of model parameters with new structural data and use the parameter estimates to create a contact hierarchy. The contact preferences can be defined for any spatial area around a specified type of fragment. We compared calculated contact point hierarchies with the number of contact atoms found near the contact point in a reference set of x-ray data, and found that these were in general in close agreement. Additionally, using the substrate binding site in catechol-O-methyltransferase and 27 small potential binder molecules, it was demonstrated that these probabilities together with auxiliary parameters separate ligands from decoys well (true positive rate 0.75, false positive rate 0). A particularly useful feature of the proposed Bayesian framework is that it also characterizes predictive uncertainty in terms of probabilities, which have an intuitive interpretation from the applied perspective.
PMCID: PMC3498326  PMID: 23155467
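The true and false positive rates quoted in entry 8 are ratios of confusion-matrix counts. The sketch below shows how such rates are computed; the labels and predictions are hypothetical toy values chosen for illustration and are not the 27 molecules from the study.

```python
def classification_rates(labels, predictions):
    """Return (true positive rate, false positive rate) for boolean labels and predictions."""
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    tn = sum(1 for y, p in zip(labels, predictions) if not y and not p)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


# Hypothetical toy data: True marks a real ligand, False marks a decoy.
labels = [True, True, True, True, False, False, False]
predictions = [True, True, True, False, False, False, False]
tpr, fpr = classification_rates(labels, predictions)
print(tpr, fpr)
```

A false positive rate of 0 means no decoy was scored as a ligand, while a true positive rate of 0.75 means one in four real ligands was missed.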
9.  Clinical isolates of Yersinia enterocolitica Biotype 1A represent two phylogenetic lineages with differing pathogenicity-related properties 
BMC Microbiology  2012;12:208.
Y. enterocolitica biotype (BT) 1A strains are often isolated from human clinical samples but their contribution to disease has remained a controversial topic. Variation and the population structure among the clinical Y. enterocolitica BT 1A isolates have been poorly characterized. We used multi-locus sequence typing (MLST), 16S rRNA gene sequencing, PCR for ystA and ystB, lipopolysaccharide analysis, phage typing, human serum complement killing assay and analysis of the symptoms of the patients to characterize 298 clinical Y. enterocolitica BT 1A isolates in order to evaluate their relatedness and pathogenic potential.
A subset of 71 BT 1A strains, selected based on their varying LPS patterns, were subjected to detailed genetic analyses. The MLST on seven house-keeping genes (adk, argA, aroA, glnA, gyrB, thrA, trpE) conducted on 43 of the strains discriminated them into 39 MLST-types. By Bayesian analysis of the population structure (BAPS) the strains clustered conclusively into two distinct lineages, i.e. Genetic groups 1 and 2. The strains of Genetic group 1 were more closely related (97% similarity) to the pathogenic bio/serotype 4/O:3 strains than Genetic group 2 strains (95% similarity). Further comparison of the 16S rRNA genes of the BT 1A strains indicated that altogether 17 of the 71 strains belong to Genetic group 2. On the 16S rRNA analysis, these 17 strains were only 98% similar to the previously identified subspecies of Y. enterocolitica. The strains of Genetic group 2 were uniform in their pathogenicity-related properties: they lacked the ystB gene, belonged to the same LPS subtype or were of rough type, were all resistant to the five tested yersiniophages, were largely resistant to serum complement and did not ferment fucose. The 54 strains in Genetic group 1 showed much more variation in these properties. The most commonly detected LPS types were similar to the LPS types of reference strains with serotypes O:6,30 and O:6,31 (37%), O:7,8 (19%) and O:5 (15%).
The results of the present study strengthen the assertion that strains classified as Y. enterocolitica BT 1A represent more than one subspecies. Especially the BT 1A strains in our Genetic group 2 commonly showed resistance to human serum complement killing, which may indicate pathogenic potential for these strains. However, their virulence mechanisms remain unknown.
PMCID: PMC3512526  PMID: 22985268
Yersinia enterocolitica biotype 1A; MLST; 16S rRNA gene; yst genes; LPS; Phage typing; Human serum complement killing; Bayesian analysis of population structure; Pathogenicity
10.  Restricted Gene Flow among Hospital Subpopulations of Enterococcus faecium 
mBio  2012;3(4):e00151-12.
Enterococcus faecium has recently emerged as an important multiresistant nosocomial pathogen. Defining population structure in this species is required to provide insight into the existence, distribution, and dynamics of specific multiresistant or pathogenic lineages in particular environments, like the hospital. Here, we probe the population structure of E. faecium using Bayesian-based population genetic modeling implemented in Bayesian Analysis of Population Structure (BAPS) software. The analysis involved 1,720 isolates belonging to 519 sequence types (STs) (491 for E. faecium and 28 for Enterococcus faecalis). E. faecium isolates grouped into 13 BAPS (sub)groups, but the large majority (80%) of nosocomial isolates clustered in two subgroups (2-1 and 3-3). Phylogenetic and eBURST analysis of BAPS groups 2 and 3 confirmed the existence of three separate hospital lineages (17, 18, and 78), highlighting different evolutionary trajectories for BAPS 2-1 (lineage 78) and 3-3 (lineage 17 and lineage 18) isolates. Phylogenomic analysis of 29 E. faecium isolates showed agreement between BAPS assignment of STs and their relative positions in the phylogenetic tree. Odds ratio calculation confirmed the significant association between hospital isolates with BAPS 3-3 and lineages 17, 18, and 78. Admixture analysis showed a scarce number of recombination events between the different BAPS groups. For the E. faecium hospital population, we propose an evolutionary model in which strains with a high propensity to colonize and infect hospitalized patients arise through horizontal gene transfer. Once adapted to the distinct hospital niche, this subpopulation becomes isolated, and recombination with other populations declines.
Multiresistant Enterococcus faecium has become one of the most important nosocomial pathogens, causing increasing numbers of nosocomial infections worldwide. Here, we used Bayesian population genetic analysis to identify groups of related E. faecium strains and show a significant association of hospital and farm animal isolates to different genetic groups. We also found that hospital isolates could be divided into three lineages originating from sequence types (STs) 17, 18, and 78. We propose that, driven by the selective pressure in hospitals, the three hospital lineages have arisen through horizontal gene transfer, but once adapted to the distinct pathogenic niche, this population has become isolated and recombination with other populations declines. Elucidation of the population structure is a prerequisite for effective control of multiresistant E. faecium since it provides insight into the processes that have led to the progressive change of E. faecium from an innocent commensal to a multiresistant hospital-adapted pathogen.
PMCID: PMC3413404  PMID: 22807567
11.  Bayesian estimation of bacterial community composition from 454 sequencing data 
Nucleic Acids Research  2012;40(12):5240-5249.
Estimating bacterial community composition from a mixed sample in different applied contexts is an important task for many microbiologists. The bacterial community composition is commonly estimated by clustering polymerase chain reaction amplified 16S rRNA gene sequences. Current taxonomy-independent clustering methods for analyzing these sequences, such as UCLUST, ESPRIT-Tree and CROP, have two limitations: (i) expert knowledge is needed, i.e. a difference cutoff between species needs to be specified; (ii) closely related species cannot be separated. The first limitation imposes a burden on the user, since considerable effort is needed to select appropriate parameters, whereas the second limitation leads to an inaccurate description of the underlying bacterial community composition. We propose a probabilistic model-based method to estimate bacterial community composition which tackles these limitations. Our method requires very little expert knowledge, where only the possible maximum number of clusters needs to be specified. Also our method demonstrates its ability to separate closely related species in two experiments, in spite of sequencing errors and individual variations.
PMCID: PMC3384343  PMID: 22406836
12.  Population structure in the Neisseria, and the biological significance of fuzzy species 
Phenotypic and genetic variation in bacteria can take bewilderingly complex forms even within a single genus. One of the most intriguing examples of this is the genus Neisseria, which comprises both pathogens and commensals colonizing a variety of body sites and host species, and causing a range of disease. Complex relatedness among both named species and previously identified lineages of Neisseria makes it challenging to study their evolution. Using the largest publicly available collection of bacterial sequence data in combination with a population genetic analysis and experiment, we probe the contribution of inter-species recombination to neisserial population structure, and specifically whether it is more common in some strains than others. We identify hybrid groups of strains containing sequences typical of more than one species. These groups of strains, typical of a fuzzy species, appear to have experienced elevated rates of inter-species recombination estimated by population genetic analysis and further supported by transformation experiments. In particular, strains of the pathogen Neisseria meningitidis in the fuzzy species boundary appear to follow a different lifestyle, which may have considerable biological implications concerning distribution of novel resistance elements and meningococcal vaccine development. Despite the strong evidence for negligible geographical barriers to gene flow within the population, exchange of genetic material still shows directionality among named species in a non-uniform manner.
PMCID: PMC3350722  PMID: 22072450
fuzzy species; recombination; Neisseria
13.  Detection of recombination events in bacterial genomes from large population samples 
Nucleic Acids Research  2011;40(1):e6.
Analysis of important human pathogen populations is currently under transition toward whole-genome sequencing of growing numbers of samples collected on a global scale. Since recombination in bacteria is often an important factor shaping their evolution by enabling resistance elements and virulence traits to rapidly transfer from one evolutionary lineage to another, it is highly beneficial to have access to tools that can detect recombination events. Multiple advanced statistical methods exist for such purposes; however, they are typically limited either to only a few samples or to data from relatively short regions of a total genome. By harnessing the power of recent advances in Bayesian modeling techniques, we introduce here a method for detecting homologous recombination events from whole-genome sequence data for bacterial population samples on a large scale. Our statistical approach can efficiently handle hundreds of whole genome sequenced population samples and identify separate origins of the recombinant sequence, offering an enhanced insight into the diversification of bacterial clones at the level of the whole genome. A data set of 241 whole genome sequences from an important pandemic lineage of Streptococcus pneumoniae is used together with multiple simulated data sets to demonstrate the potential of our approach.
PMCID: PMC3245952  PMID: 22064866
14.  Bayesian semi-supervised classification of bacterial samples using MLST databases 
BMC Bioinformatics  2011;12:302.
Worldwide effort on sampling and characterization of molecular variation within a large number of human and animal pathogens has led to the emergence of multi-locus sequence typing (MLST) databases as an important tool for studying the epidemiology and evolution of pathogens. Many of these databases currently harbor several thousand multi-locus DNA sequence types (STs) enriched with metadata on traits of the isolates such as serotype, antibiotic resistance, and host organism. Curators of the databases thus have the possibility of dividing the pathogen populations into subsets representing different evolutionary lineages, geographically associated groups, or other subpopulations, which are defined in terms of molecular similarities and dissimilarities residing within a database. When combined with the existing metadata, such subsets may provide invaluable information for assessing the position of a new set of isolates in relation to the whole pathogen population.
To enable users of MLST schemes to query the databases with sets of new bacterial isolates and to automatically analyze their relation to existing curated sequences, we introduce here a Bayesian model-based method for semi-supervised classification of MLST data. Our method can use an MLST database as a training set and simultaneously assign any set of query sequences to the earlier discovered lineages/populations, while also allowing some or all of these sequences to form previously undiscovered genetically distinct groups. This tool provides probabilistic quantification of the classification uncertainty and is highly efficient computationally, thus enabling rapid analyses of large databases and sets of query sequences. The latter feature is a necessary prerequisite for automated access through the MLST web interface. We demonstrate the versatility of our approach by analyzing both real and synthesized data from MLST databases. The introduced method for semi-supervised classification of sets of query STs is freely available for Windows, Mac OS X and Linux operating systems in the downloadable BAPS 5.4 software. The query functionality is also directly available through the web portal hosting the Staphylococcus aureus database, and will shortly be available for other species databases hosted at this web portal.
We have introduced a model-based tool for automated semi-supervised classification of new pathogen samples that can be integrated into the web interface of the MLST databases. In particular, when combined with the existing metadata, the semi-supervised labeling may provide invaluable information for assessing the position of a new set of query strains in relation to the particular pathogen population represented by the curated database.
Such information will be useful both for clinical and basic research purposes.
PMCID: PMC3155509  PMID: 21791094
15.  Efficient Bayesian approach for multilocus association mapping including gene-gene interactions 
BMC Bioinformatics  2010;11:443.
Since the introduction of large-scale genotyping methods that can be utilized in genome-wide association (GWA) studies for deciphering complex diseases, statistical genetics has been posed with a tremendous challenge of how to most appropriately analyze such data. A plethora of advanced model-based methods for genetic mapping of traits has been available for more than 10 years in animal and plant breeding. However, most such methods are computationally intractable in the context of genome-wide studies. Therefore, it is hardly surprising that GWA analyses have in practice been dominated by simple statistical tests concerned with a single marker locus at a time, while the more advanced approaches have appeared only relatively recently in the biomedical and statistical literature.
We introduce a novel Bayesian modeling framework for association mapping which enables the detection of multiple loci and their interactions that influence a dichotomous phenotype of interest. The method is shown to perform well in a simulation study when compared to widely used standard alternatives and its computational complexity is typically considerably smaller than that of a maximum likelihood based approach. We also discuss in detail the sensitivity of the Bayesian inferences with respect to the choice of prior distributions in the GWA context.
Our results show that the Bayesian model averaging approach which explicitly considers gene-gene interactions may improve the detection of disease associated genetic markers in two respects: first, by providing better estimates of the locations of the causal loci; second, by reducing the number of false positives. The benefits are most apparent when the interacting genes exhibit no main effects. However, our findings also illustrate that such an approach is somewhat sensitive to the prior distribution assigned on the model structure.
PMCID: PMC2942856  PMID: 20809988
16.  Multilocus sequence types of Finnish bovine Campylobacter jejuni isolates and their attribution to human infections 
BMC Microbiology  2010;10:200.
Campylobacter jejuni is the most common bacterial cause of human gastroenteritis worldwide. Due to the sporadic nature of infection, sources often remain unknown. Multilocus sequence typing (MLST) has been successfully applied to population genetics of Campylobacter jejuni and mathematical modelling can be applied to the sequence data. Here, we analysed the population structure of a total of 250 Finnish C. jejuni isolates from bovines, poultry meat and humans collected in 2003 using a combination of Bayesian clustering (BAPS software) and phylogenetic analysis.
In the first phase we analysed sequence types (STs) of 102 Finnish bovine C. jejuni isolates by MLST and found a high diversity totalling 50 STs of which nearly half were novel. In the second phase we included MLST data from domestic human isolates as well as poultry C. jejuni isolates from the same time period. Between the human and bovine isolates we found an overlap of 72.2%, while 69% of the human isolates were overlapping with the chicken isolates. In the BAPS analysis 44.3% of the human isolates were found in bovine-associated BAPS clusters and 45.4% of the human isolates were found in the poultry-associated BAPS cluster. BAPS reflected the phylogeny of our data very well.
These findings suggest that bovines and poultry were equally important as reservoirs for human C. jejuni infections in Finland in 2003. Our results differ from those obtained in other countries where poultry has been identified as the most important source for human infections. The low prevalence of C. jejuni in poultry flocks in Finland could explain the lower attribution of human infection to poultry. Of the human isolates 10.3% were found in clusters not associated with any host which warrants further investigation, with particular focus on waterborne transmission routes and companion animals.
PMCID: PMC2914712  PMID: 20659332
17.  Full Likelihood Analysis of Genetic Risk with Variable Age at Onset Disease—Combining Population-Based Registry Data and Demographic Information 
PLoS ONE  2009;4(8):e6836.
In genetic studies of rare complex diseases it is common to ascertain familial data from population based registries through all incident cases diagnosed during a pre-defined enrollment period. Such an ascertainment procedure is typically taken into account in the statistical analysis of the familial data by constructing either a retrospective or prospective likelihood expression, which conditions on the ascertainment event. Both of these approaches lead to a substantial loss of valuable data.
Methodology and Findings
Here we consider instead the possibilities provided by a Bayesian approach to risk analysis, which also incorporates the ascertainment procedure and reference information concerning the genetic composition of the target population into the considered statistical model. Furthermore, the proposed Bayesian hierarchical survival model does not require the considered genotype or haplotype effects to be expressed as functions of corresponding allelic effects. Our modeling strategy is illustrated by a risk analysis of type 1 diabetes mellitus (T1D) in the Finnish population based on the HLA-A, HLA-B and DRB1 human leucocyte antigen (HLA) information available for both ascertained sibships and a large number of unrelated individuals from the Finnish bone marrow donor registry. The heterozygous genotype DR3/DR4 at the DRB1 locus was associated with the lowest predictive probability of T1D-free survival to the age of 15, the estimate being 0.936 (95% credible interval 0.926 to 0.945), compared to the average population T1D-free survival probability of 0.995.
The proposed statistical method can be adapted to other population-based family data ascertained from a disease registry, provided that the ascertainment process is well documented and that external information concerning the sizes of birth cohorts and a suitable reference sample are available. We confirm the earlier findings from the same data concerning the HLA-DR3/4 related risks for T1D, and also provide here estimated predictive probabilities of disease-free survival as a function of age.
PMCID: PMC2730012  PMID: 19718441
18.  Identifying Currents in the Gene Pool for Bacterial Populations Using an Integrative Approach 
PLoS Computational Biology  2009;5(8):e1000455.
The evolution of bacterial populations has recently become considerably better understood due to large-scale sequencing of population samples. It has become clear that DNA sequences from a multitude of genes, as well as a broad sample coverage of a target population, are needed to obtain a relatively unbiased view of its genetic structure and the patterns of ancestry connected to the strains. However, the traditional statistical methods for evolutionary inference, such as phylogenetic analysis, are associated with several difficulties under such an extensive sampling scenario, in particular when a considerable amount of recombination is anticipated to have taken place. To meet the needs of large-scale analyses of population structure for bacteria, we introduce here several statistical tools for the detection and representation of recombination between populations. Also, we introduce a model-based description of the shape of a population in sequence space, in terms of its molecular variability and affinity towards other populations. Extensive real data from the genus Neisseria are utilized to demonstrate the potential of an approach where these population genetic tools are combined with a phylogenetic analysis. The statistical tools introduced here are freely available in BAPS 5.2 software, which can be downloaded from
Author Summary
The study of bacterial population biology is complicated by the fact that, although bacteria are largely asexual, they can also exchange genetic material through homologous recombination. Unlike in eukaryotes, recombination in bacteria is not an obligatory process. Furthermore, the recombination mechanisms are subject to many biological and ecological factors that can vary even within different populations of the same species. Although increasing evidence for homologous recombination has been found in many bacterial species, determining the frequency of recombination and understanding the influence that it exerts upon the evolution of bacterial populations remains challenging. In this article, we provide a dynamic picture of recombination within and between closely related bacterial species. Through an integration of several Bayesian statistical models, our method highlights the importance of a quantitative estimation of recombination. Our analyses of a challenging multi-locus sequence typing (MLST) database demonstrate that combined analyses using traditional phylogenetic methods, explorative MLST tools and Bayesian population genetic models can together yield interesting biological insights that cannot easily be reached by any of the approaches alone.
PMCID: PMC2713424  PMID: 19662158
19.  Sequence analysis of percent G+C fraction libraries of human faecal bacterial DNA reveals a high number of Actinobacteria 
BMC Microbiology  2009;9:68.
The human gastrointestinal (GI) tract microbiota is characterised by an abundance of uncultured bacteria most often assigned to the phyla Firmicutes and Bacteroidetes. Diversity of this microbiota, even though approached with culture-independent techniques in several studies, still requires more elucidation. The main purpose of this work was to study whether genomic percent guanine and cytosine (%G+C)-based profiling and fractioning prior to 16S rRNA gene sequence analysis reveals higher microbiota diversity, especially among high G+C bacteria suggested to be underrepresented in previous studies.
A phylogenetic analysis of the composition of the human GI microbiota of 23 healthy adult subjects was performed from a pooled faecal bacterial DNA sample by combining genomic %G+C -based profiling and fractioning with 16S rRNA gene cloning and sequencing. A total of 3199 partial 16S rRNA genes were sequenced. For comparison, 459 clones were sequenced from a comparable unfractioned sample. The most important finding was that the proportional amount of sequences affiliating with the phylum Actinobacteria was 26.6% in the %G+C fractioned sample but only 3.5% in the unfractioned sample. The orders Coriobacteriales, Bifidobacteriales and Actinomycetales constituted the 65 actinobacterial phylotypes in the fractioned sample, accounting for 50%, 47% and 3% of sequences within the phylum, respectively.
This study shows that %G+C profiling and fractioning prior to cloning and sequencing can reveal a significantly larger proportion of high G+C content bacteria within the clones recovered from the human GI tract, compared with the unfractioned sample. In particular, the order Coriobacteriales within the phylum Actinobacteria was found to be more abundant than previously estimated with conventional sequencing studies.
PMCID: PMC2679024  PMID: 19351420
20.  Bayesian clustering and feature selection for cancer tissue samples 
BMC Bioinformatics  2009;10:90.
The versatility of DNA copy number amplifications for profiling and categorization of various tissue samples has been widely acknowledged in the biomedical literature. For instance, this type of measurement technique provides possibilities for exploring sets of cancerous tissues to identify novel subtypes. The previously utilized statistical approaches to various kinds of analyses include traditional algorithmic techniques for clustering and dimension reduction, such as independent and principal component analyses, hierarchical clustering, as well as model-based clustering using maximum likelihood estimation for latent class models.
While purely algorithmic methods are usually easily applicable, their suboptimal performance and limitations in making formal inference have been thoroughly discussed in the statistical literature. Here we introduce a Bayesian model-based approach to simultaneous identification of underlying tissue groups and the informative amplifications. The model-based approach provides the possibility of using formal inference to determine the number of groups from the data, in contrast to the ad hoc methods often exploited for similar purposes. The model also automatically recognizes the chromosomal areas that are relevant for the clustering.
Validatory analyses of simulated data and a large database of DNA copy number amplifications in human neoplasms are used to illustrate the potential of our approach. Our software implementation BASTA for performing Bayesian statistical tissue profiling is freely available for academic purposes at
PMCID: PMC2679022  PMID: 19296858
21.  Enhanced Bayesian modelling in BAPS software for learning genetic structures of populations 
BMC Bioinformatics  2008;9:539.
During the most recent decade many Bayesian statistical models and software for answering questions related to the genetic structure underlying population samples have appeared in the scientific literature. Most of these methods utilize molecular markers for the inferences, while some are also capable of handling DNA sequence data. In a number of earlier works, we have introduced an array of statistical methods for population genetic inference that are implemented in the software BAPS. However, the complexity of biological problems related to genetic structure analysis keeps increasing such that in many cases the current methods may provide either inappropriate or insufficient solutions.
We discuss the necessity of enhancing the statistical approaches to face the challenges posed by the ever-increasing amounts of molecular data generated by scientists over a wide range of research areas and introduce an array of new statistical tools implemented in the most recent version of BAPS. With these methods it is possible, e.g., to fit genetic mixture models using user-specified numbers of clusters and to estimate levels of admixture under a genetic linkage model. Also, alleles representing a different ancestry compared to the average observed genomic positions can be tracked for the sampled individuals, and a priori specified hypotheses about genetic population structure can be directly compared using Bayes' theorem. In general, we have improved further the computational characteristics of the algorithms behind the methods implemented in BAPS facilitating the analyses of large and complex datasets. In particular, analysis of a single dataset can now be spread over multiple computers using a script interface to the software.
The Bayesian modelling methods introduced in this article represent an array of enhanced tools for learning the genetic structure of populations. Their implementations in the BAPS software are designed to meet the increasing need for analyzing large-scale population genetics data. The software is freely downloadable for Windows, Linux and Mac OS X systems at .
PMCID: PMC2629778  PMID: 19087322
22.  Bayesian modeling of recombination events in bacterial populations 
BMC Bioinformatics  2008;9:421.
We consider the discovery of recombinant segments jointly with their origins within multilocus DNA sequences from bacteria representing heterogeneous populations of fairly closely related species. The currently available methods for recombination detection capable of probabilistic characterization of uncertainty have a limited applicability in practice as the number of strains in a data set increases.
We introduce a Bayesian spatial structural model representing the continuum of origins over sites within the observed sequences, including a probabilistic characterization of uncertainty related to the origin of any particular site. To enable a statistically accurate and practically feasible approach to the analysis of large-scale data sets representing a single genus, we have developed a novel software tool (BRAT, Bayesian Recombination Tracker) implementing the model and the corresponding learning algorithm, which is capable of identifying the posterior optimal structure and of estimating the marginal posterior probabilities of putative origins over the sites.
A multitude of challenging simulation scenarios and an analysis of real data from seven housekeeping genes of 120 strains of genus Burkholderia are used to illustrate the possibilities offered by our approach. The software is freely available for download at URL .
PMCID: PMC2579306  PMID: 18840286
23.  Genealogical lineage sorting leads to significant, but incorrect Bayesian multilocus inference of population structure 
Molecular Ecology  2011;20(6):1108-1121.
Over the past decades, the use of molecular markers has revolutionized biology and led to the foundation of a new research discipline: phylogeography. Of particular interest has been the inference of population structure and biogeography. While initial studies focused on mtDNA as a molecular marker, it has become apparent that selection and genealogical lineage sorting could lead to erroneous inferences. As it is not clear to what extent these forces affect a given marker, it has become common practice to use the combined evidence from a set of molecular markers as an attempt to recover the signals that approximate the true underlying demography. Typically, the number of markers used is determined either by budget constraints or by the statistical power required to recognize significant population differentiation. Using microsatellite markers from Drosophila and humans, we show that even large numbers of loci (>50) can frequently result in statistically well-supported, but incorrect inference of population structure using the software BAPS. Most importantly, genomic features, such as chromosomal location, variability of the markers, or recombination rate, cannot explain this observation. Instead, it can be attributed to sampling variation among loci with different realizations of the stochastic lineage sorting. This phenomenon is particularly pronounced for low levels of population differentiation. Our results have important implications for ongoing studies of population differentiation, as we unambiguously demonstrate that statistical significance of population structure inferred from a random set of genetic markers cannot necessarily be taken as evidence for a reliable demographic inference.
PMCID: PMC3084510  PMID: 21244537
confidence of inference; Drosophila melanogaster; microsatellites; population structure
