Photon trajectories from single molecule experiments can report on biomolecule structural changes and motions. Hidden Markov models (HMM) facilitate extraction of the sequence of hidden states from noisy data through construction of probabilistic models. Typically, the true number of states is determined by the Bayesian information criteria (BIC), however, constraints resulting from short data sets and Poisson-distributed photons in radiative processes like fluorescence can limit successful application of goodness-of-fit statistics. For single molecule intensity trajectories, additional information criteria such as peak localization error (LE) and chi-square probabilities can incorporate theoretical constraints on experimental data while modifying normal HMM. Chi-square minimization also serves as a stopping point of the iteration in which the system parameters are trained. Peak LE enables exclusion of overfitted and overlapped states. These constraints and criteria are tested against BIC on simulated single molecule trajectories to best identify the true number of emissive levels in any sequence.
Our understanding of the dynamics of neuronal activity in the human brain remains limited, due in part to a lack of adequate methods for reconstructing neuronal activity from noninvasive electrophysiological data. Here, we present a novel adaptive time-varying approach to source reconstruction that can be applied to magnetoencephalography (MEG) and electroencephalography (EEG) data. The method is underpinned by a Hidden Markov Model (HMM), which infers the points in time when particular states re-occur in the sensor space data. HMM inference finds short-lived states on the scale of 100 ms. Intriguingly, this is on the same timescale as EEG microstates. The resulting state time courses can be used to intelligently pool data over these distinct and short-lived periods in time. This is used to compute time-varying data covariance matrices for use in beamforming, resulting in a source reconstruction approach that can tune its spatial filtering properties to those required at different points in time. Proof of principle is demonstrated with simulated data, and we demonstrate improvements when the method is applied to MEG.
•Novel adaptive time-varying approach to MEG source reconstruction•HMM infers reoccurring short-lived states on the same scale as EEG microstates•Tunes source reconstruction properties to those needed at different points in time•Improves spatial localisation of high temporal resolution information from MEG data
Magnetoencephalography; MEG; EEG; Source reconstruction; Microstates; Connectivity; Hidden Markov Model
Imaging plays an important role in characterizing tumors. Knowledge inferred from imaging data has the potential to improve disease management dramatically, but physicians lack a tool to easily interpret and manipulate the data. A probabilistic disease model, such as a Bayesian belief network, may be used to quantitatively model relationships found in the data. In this paper, a framework is presented that enables visual querying of an underlying disease model via a query-by-example paradigm. Users draw graphical metaphors to visually represent features in their query. The structure and parameters specified within the model guide the user through query formulation by determining when a user may draw a particular metaphor. Spatial and geometrical features are automatically extracted from the query diagram and used to instantiate the probabilistic model in order to answer the query. An implementation is described in the context of managing patients with brain tumors.
Hidden Markov Models (HMMs) have proven very useful in computational biology for such applications as sequence pattern matching, gene-finding, and structure prediction. Thus far, however, they have been confined to representing 1D sequence (or the aspects of structure that could be represented by character strings).
We develop an HMM formalism that explicitly uses 3D coordinates in its match states. The match states are modeled by 3D Gaussian distributions centered on the mean coordinate position of each alpha carbon in a large structural alignment. The transition probabilities depend on the spread of the neighboring match states and on the number of gaps found in the structural alignment. We also develop methods for aligning query structures against 3D HMMs and scoring the result probabilistically. For 1D HMMs these tasks are accomplished by the Viterbi and forward algorithms. However, these will not work in unmodified form for the 3D problem, due to non-local quality of structural alignment, so we develop extensions of these algorithms for the 3D case. Several applications of 3D HMMs for protein structure classification are reported. A good separation of scores for different fold families suggests that the described construct is quite useful for protein structure analysis.
We have created a rigorous 3D HMM representation for protein structures and implemented a complete set of routines for building 3D HMMs in C and Perl. The code is freely available from , and at this site we also have a simple prototype server to demonstrate the features of the described approach.
Hidden Markov Models (HMMs) have been extensively used in computational molecular biology, for modelling protein and nucleic acid sequences. In many applications, such as transmembrane protein topology prediction, the incorporation of limited amount of information regarding the topology, arising from biochemical experiments, has been proved a very useful strategy that increased remarkably the performance of even the top-scoring methods. However, no clear and formal explanation of the algorithms that retains the probabilistic interpretation of the models has been presented so far in the literature.
We present here, a simple method that allows incorporation of prior topological information concerning the sequences at hand, while at the same time the HMMs retain their full probabilistic interpretation in terms of conditional probabilities. We present modifications to the standard Forward and Backward algorithms of HMMs and we also show explicitly, how reliable predictions may arise by these modifications, using all the algorithms currently available for decoding HMMs. A similar procedure may be used in the training procedure, aiming at optimizing the labels of the HMM's classes, especially in cases such as transmembrane proteins where the labels of the membrane-spanning segments are inherently misplaced. We present an application of this approach developing a method to predict the transmembrane regions of alpha-helical membrane proteins, trained on crystallographically solved data. We show that this method compares well against already established algorithms presented in the literature, and it is extremely useful in practical applications.
The algorithms presented here, are easily implemented in any kind of a Hidden Markov Model, whereas the prediction method (HMM-TM) is freely available for academic users at , offering the most advanced decoding options currently available.
Motivation: Classification of tissues using static gene-expression data has received considerable attention. Recently, a growing number of expression datasets are measured as a time series. Methods that are specifically designed for this temporal data can both utilize its unique features (temporal evolution of profiles) and address its unique challenges (different response rates of patients in the same class).
Results: We present a method that utilizes hidden Markov models (HMMs) for the classification task. We use HMMs with less states than time points leading to an alignment of the different patient response rates. To focus on the differences between the two classes we develop a discriminative HMM classifier. Unlike the traditional generative HMM, discriminative HMM can use examples from both classes when learning the model for a specific class. We have tested our method on both simulated and real time series expression data. As we show, our method improves upon prior methods and can suggest markers for specific disease and response stages that are not found when using traditional classifiers.
Availability: Matlab implementation is available from http://www.cs.cmu.edu/~thlin/tram/
A query learning algorithm based on hidden Markov models (HMMs) isdeveloped to design experiments for string analysis and prediction of MHCclass I binding peptides. Query learning is introduced to aim at reducingthe number of peptide binding data for training of HMMs. A multiple numberof HMMs, which will collectively serve as a committee, are trained withbinding data and used for prediction in real-number values. The universeof peptides is randomly sampled and subjected to judgement by the HMMs.Peptides whose prediction is least consistent among committee HMMs aretested by experiment. By iterating the feedback cycle of computationalanalysis and experiment the most wanted information is effectivelyextracted. After 7 rounds of active learning with 181 peptides in all,predictive performance of the algorithm surpassed the so far bestperforming matrix based prediction. Moreover, by combining the bothmethods binder peptides (log Kd < -6) could be predicted with84% accuracy. Parameter distribution of the HMMs that can be inspectedvisually after training further offers a glimpse of dynamic specificity ofthe MHC molecules.
algorithm; binding; experimental design; MHC class I molecules; peptides; prediction; query learning; specificity; string analysis
Often protein (or gene) time-course data are collected for multiple replicates. Each replicate generally has sparse data with the number of time points being less than the number of proteins. Usually each replicate is modeled separately. However, here all the information in each of the replicates is used to make a composite inference about signal networks. The composite inference comes from combining well structured Bayesian probabilistic modeling with a multi-faceted Markov Chain Monte Carlo algorithm. Based on simulations which investigate many different types of network interactions and experimental variabilities, the composite examination uncovers many important relationships within the networks. In particular, when the edge's partial correlation between two proteins is at least moderate, then the composite's posterior probability is large.
The Celera Discovery System™ (CDS) is a web-accessible research workbench for mining genomic and related biological information. Users have access to the human and mouse genome sequences with annotation presented in summary form in BioMolecule Reports for genes, transcripts and proteins. Over 40 additional databases are available, including sequence, mapping, mutation, genetic variation, mRNA expression, protein structure, motif and classification data. Data are accessible by browsing reports, through a variety of interactive graphical viewers, and by advanced query capability provided by the LION SRS™ search engine. A growing number of sequence analysis tools are available, including sequence similarity, pattern searching, multiple sequence alignment and Hidden Markov Model search. A user workspace keeps track of queries and analyses. CDS is widely used by the academic research community and requires a subscription for access. The system and academic pricing information are available at http://cds.celera.com.
Summary: InterMine is an open-source data warehouse system that facilitates the building of databases with complex data integration requirements and a need for a fast customizable query facility. Using InterMine, large biological databases can be created from a range of heterogeneous data sources, and the extensible data model allows for easy integration of new data types. The analysis tools include a flexible query builder, genomic region search and a library of ‘widgets’ performing various statistical analyses. The results can be exported in many commonly used formats. InterMine is a fully extensible framework where developers can add new tools and functionality. Additionally, there is a comprehensive set of web services, for which client libraries are provided in five commonly used programming languages.
Availability: Freely available from http://www.intermine.org under the LGPL license.
Supplementary data are available at Bioinformatics online.
Hidden Markov models (HMMs) and their variants are widely used in Bioinformatics applications that analyze and compare biological sequences. Designing a novel application requires the insight of a human expert to define the model's architecture. The implementation of prediction algorithms and algorithms to train the model's parameters, however, can be a time-consuming and error-prone task. We here present HMMConverter, a software package for setting up probabilistic HMMs, pair-HMMs as well as generalized HMMs and pair-HMMs. The user defines the model itself and the algorithms to be used via an XML file which is then directly translated into efficient C++ code. The software package provides linear-memory prediction algorithms, such as the Hirschberg algorithm, banding and the integration of prior probabilities and is the first to present computationally efficient linear-memory algorithms for automatic parameter training. Users of HMMConverter can thus set up complex applications with a minimum of effort and also perform parameter training and data analyses for large data sets.
The identification of effective connectivity from time-series data such as electroencephalogram (EEG) or time-resolved function magnetic resonance imaging (fMRI) recordings is an important problem in brain imaging. One commonly used approach to inference effective connectivity is based on vector autoregressive models and the concept of Granger causality. However, this probabilistic concept of causality can lead to spurious causalities in the presence of latent variables. Recently, graphical models have been used to discuss problems of causal inference for multivariate data. In this paper, we extend these concepts to the case of time-series and present a graphical approach for discussing Granger-causal relationships among multiple time-series. In particular, we propose a new graphical representation that allows the characterization of spurious causality and, thus, can be used to investigate spurious causality. The method is demonstrated with concurrent EEG and fMRI recordings which are used to investigate the interrelations between the alpha rhythm in the EEG and blood oxygenation level dependent (BOLD) responses in the fMRI. The results confirm previous findings on the location of the source of the EEG alpha rhythm.
graphical models; multivariate time-series; Granger causality; causal inference; spurious causality
Methods of determining whether or not any particular HIV-1 sequence stems - completely or in part - from some unknown HIV-1 subtype are important for the design of vaccines and molecular detection systems, as well as for epidemiological monitoring. Nevertheless, a single algorithm only, the Branching Index (BI), has been developed for this task so far. Moving along the genome of a query sequence in a sliding window, the BI computes a ratio quantifying how closely the query sequence clusters with a subtype clade. In its current version, however, the BI does not provide predicted boundaries of unknown fragments.
We have developed Unknown Subtype Finder (USF), an algorithm based on a probabilistic model, which automatically determines which parts of an input sequence originate from a subtype yet unknown. The underlying model is based on a simple profile hidden Markov model (pHMM) for each known subtype and an additional pHMM for an unknown subtype. The emission probabilities of the latter are estimated using the emission frequencies of the known subtypes by means of a (position-wise) probabilistic model for the emergence of new subtypes. We have applied USF to SIV and HIV-1 sequences formerly classified as having emerged from an unknown subtype. Moreover, we have evaluated its performance on artificial HIV-1 recombinants and non-recombinant HIV-1 sequences. The results have been compared with the corresponding results of the BI.
Our results demonstrate that USF is suitable for detecting segments in HIV-1 sequences stemming from yet unknown subtypes. Comparing USF with the BI shows that our algorithm performs as good as the BI or better.
Discovering the genetic basis of common genetic diseases in the human genome represents a public health issue. However, the dimensionality of the genetic data (up to 1 million genetic markers) and its complexity make the statistical analysis a challenging task.
We present an accurate modeling of dependences between genetic markers, based on a forest of hierarchical latent class models which is a particular class of probabilistic graphical models. This model offers an adapted framework to deal with the fuzzy nature of linkage disequilibrium blocks. In addition, the data dimensionality can be reduced through the latent variables of the model which synthesize the information borne by genetic markers. In order to tackle the learning of both forest structure and probability distributions, a generic algorithm has been proposed. A first implementation of our algorithm has been shown to be tractable on benchmarks describing 105 variables for 2000 individuals.
The forest of hierarchical latent class models offers several advantages for genome-wide association studies: accurate modeling of linkage disequilibrium, flexible data dimensionality reduction and biological meaning borne by latent variables.
Networks are widely recognized as key determinants of structure and function in systems that span the biological, physical, and social sciences. They are static pictures of the interactions among the components of complex systems. Often, much effort is required to identify networks as part of particular patterns as well as to visualize and interpret them.
From a pure dynamical perspective, simulation represents a relevant way-out. Many simulator tools capitalized on the "noisy" behavior of some systems and used formal models to represent cellular activities as temporal trajectories. Statistical methods have been applied to a fairly large number of replicated trajectories in order to infer knowledge.
A tool which both graphically manipulates reactive models and deals with sets of simulation time-course data by aggregation, interpretation and statistical analysis is missing and could add value to simulators.
We designed and implemented Snazer, the simulations and networks analyzer. Its goal is to aid the processes of visualizing and manipulating reactive models, as well as to share and interpret time-course data produced by stochastic simulators or by any other means.
Snazer is a solid prototype that integrates biological network and simulation time-course data analysis techniques.
Motivation: Although widely accepted that high-throughput biological data are typically highly noisy, the effects that this uncertainty has upon the conclusions we draw from these data are often overlooked. However, in order to assign any degree of confidence to our conclusions, we must quantify these effects. Bootstrap resampling is one method by which this may be achieved. Here, we present a parametric bootstrapping approach for time-course data, in which Gaussian process regression (GPR) is used to fit a probabilistic model from which replicates may then be drawn. This approach implicitly allows the time dependence of the data to be taken into account, and is applicable to a wide range of problems.
Results: We apply GPR bootstrapping to two datasets from the literature. In the first example, we show how the approach may be used to investigate the effects of data uncertainty upon the estimation of parameters in an ordinary differential equations (ODE) model of a cell signalling pathway. Although we find that the parameter estimates inferred from the original dataset are relatively robust to data uncertainty, we also identify a distinct second set of estimates. In the second example, we use our method to show that the topology of networks constructed from time-course gene expression data appears to be sensitive to data uncertainty, although there may be individual edges in the network that are robust in light of present data.
Availability: Matlab code for performing GPR bootstrapping is available from our web site: http://www3.imperial.ac.uk/theoreticalsystemsbiology/data-software/
Contact: firstname.lastname@example.org, email@example.com
Supplementary information:Supplementary data are available at Bioinformatics online.
Record linkage refers to the process of joining records that relate to the same entity or event in one or more data collections. In the absence of a shared, unique key, record linkage involves the comparison of ensembles of partially-identifying, non-unique data items between pairs of records. Data items with variable formats, such as names and addresses, need to be transformed and normalised in order to validly carry out these comparisons. Traditionally, deterministic rule-based data processing systems have been used to carry out this pre-processing, which is commonly referred to as "standardisation". This paper describes an alternative approach to standardisation, using a combination of lexicon-based tokenisation and probabilistic hidden Markov models (HMMs).
HMMs were trained to standardise typical Australian name and address data drawn from a range of health data collections. The accuracy of the results was compared to that produced by rule-based systems.
Training of HMMs was found to be quick and did not require any specialised skills. For addresses, HMMs produced equal or better standardisation accuracy than a widely-used rule-based system. However, acccuracy was worse when used with simpler name data. Possible reasons for this poorer performance are discussed.
Lexicon-based tokenisation and HMMs provide a viable and effort-effective alternative to rule-based systems for pre-processing more complex variably formatted data such as addresses. Further work is required to improve the performance of this approach with simpler data such as names. Software which implements the methods described in this paper is freely available under an open source license for other researchers to use and improve.
The Baum-Welch learning procedure for Hidden Markov Models (HMMs) provides a powerful tool for tailoring HMM topologies to data for use in knowledge discovery and clustering. A linear memory procedure recently proposed by Miklós, I. and Meyer, I.M. describes a memory sparse version of the Baum-Welch algorithm with modifications to the original probabilistic table topologies to make memory use independent of sequence length (and linearly dependent on state number). The original description of the technique has some errors that we amend. We then compare the corrected implementation on a variety of data sets with conventional and checkpointing implementations.
We provide a correct recurrence relation for the emission parameter estimate and extend it to parameter estimates of the Normal distribution. To accelerate estimation of the prior state probabilities, and decrease memory use, we reverse the originally proposed forward sweep. We describe different scaling strategies necessary in all real implementations of the algorithm to prevent underflow. In this paper we also describe our approach to a linear memory implementation of the Viterbi decoding algorithm (with linearity in the sequence length, while memory use is approximately independent of state number). We demonstrate the use of the linear memory implementation on an extended Duration Hidden Markov Model (DHMM) and on an HMM with a spike detection topology. Comparing the various implementations of the Baum-Welch procedure we find that the checkpointing algorithm produces the best overall tradeoff between memory use and speed. In cases where sequence length is very large (for Baum-Welch), or state number is very large (for Viterbi), the linear memory methods outlined may offer some utility.
Our performance-optimized Java implementations of Baum-Welch algorithm are available at . The described method and implementations will aid sequence alignment, gene structure prediction, HMM profile training, nanopore ionic flow blockades analysis and many other domains that require efficient HMM training with EM.
Time-course microarray experiments produce vector gene expression profiles across a series of time points. Clustering genes based on these profiles is important in discovering functional related and co-regulated genes. Early developed clustering algorithms do not take advantage of the ordering in a time-course study, explicit use of which should allow more sensitive detection of genes that display a consistent pattern over time. Peddada et al.  proposed a clustering algorithm that can incorporate the temporal ordering using order-restricted statistical inference. This algorithm is, however, very time-consuming and hence inapplicable to most microarray experiments that contain a large number of genes. Its computational burden also imposes difficulty to assess the clustering reliability, which is a very important measure when clustering noisy microarray data.
We propose a computationally efficient information criterion-based clustering algorithm, called ORICC, that also takes account of the ordering in time-course microarray experiments by embedding the order-restricted inference into a model selection framework. Genes are assigned to the profile which they best match determined by a newly proposed information criterion for order-restricted inference. In addition, we also developed a bootstrap procedure to assess ORICC's clustering reliability for every gene. Simulation studies show that the ORICC method is robust, always gives better clustering accuracy than Peddada's method and saves hundreds of times computational time. Under some scenarios, its accuracy is also better than some other existing clustering methods for short time-course microarray data, such as STEM  and Wang et al. . It is also computationally much faster than Wang et al. .
Our ORICC algorithm, which takes advantage of the temporal ordering in time-course microarray experiments, provides good clustering accuracy and is meanwhile much faster than Peddada's method. Moreover, the clustering reliability for each gene can also be assessed, which is unavailable in Peddada's method. In a real data example, the ORICC algorithm identifies new and interesting genes that previous analyses failed to reveal.
Mining bacterial genomes for bacteriocins is a challenging task due to the substantial structure and sequence diversity, and generally small sizes, of these antimicrobial peptides. Major progress in the research of antimicrobial peptides and the ever-increasing quantities of genomic data, varying from (un)finished genomes to meta-genomic data, led us to develop the significantly improved genome mining software BAGEL2, as a follow-up of our previous BAGEL software. BAGEL2 identifies putative bacteriocins on the basis of conserved domains, physical properties and the presence of biosynthesis, transport and immunity genes in their genomic context. The software supports parameter-free, class-specific mining and has high-throughput capabilities. Besides building an expert validated bacteriocin database, we describe the development of novel Hidden Markov Models (HMMs) and the interpretation of combinations of HMMs via simple decision rules for prediction of bacteriocin (sub-)classes. Furthermore, the genetic context is automatically annotated based on (combinations of) PFAM domains and databases of known context genes. The scoring system was fine-tuned using expert knowledge on data derived from screening all bacterial genomes currently available at the NCBI. BAGEL2 is freely accessible at http://bagel2.molgenrug.nl.
This article discusses the compositional structure of hand movements by analyzing and modeling neural and behavioral data obtained from experiments where a monkey (Macaca fascicularis) performed scribbling movements induced by a search task. Using geometrically based approaches to movement segmentation, it is shown that the hand trajectories are composed of elementary segments that are primarily parabolic in shape. The segments could be categorized into a small number of classes on the basis of decreasing intra-class variance over the course of training. A separate classification of the neural data employing a hidden Markov model showed a coincidence of the neural states with the behavioral categories. An additional analysis of both types of data by a data mining method provided evidence that the neural activity patterns underlying the behavioral primitives were formed by sets of specific and precise spike patterns. A geometric description of the movement trajectories, together with precise neural timing data indicates a compositional variant of a realistic synfire chain model. This model reproduces the typical shapes and temporal properties of the trajectories; hence the structure and composition of the primitives may reflect meaningful behavior.
voluntary-movements; scribbling; compositionality; hand-motion-model; synfire chains; motion-primitives
The recent explosion of experimental techniques in single molecule biophysics has generated a variety of novel time series data requiring equally novel computational tools for analysis and inference. This article describes in general terms how graphical modeling may be used to learn from biophysical time series data using the variational Bayesian expectation maximization algorithm (VBEM). The discussion is illustrated by the example of single-molecule fluorescence resonance energy transfer (smFRET) versus time data, where the smFRET time series is modeled as a hidden Markov model (HMM) with Gaussian observables. A detailed description of smFRET is provided as well.
The VBEM algorithm returns the model’s evidence and an approximating posterior parameter distribution given the data. The former provides a metric for model selection via maximum evidence (ME), and the latter a description of the model’s parameters learned from the data. ME/VBEM provide several advantages over the more commonly used approach of maximum likelihood (ML) optimized by the expectation maximization (EM) algorithm, the most important being a natural form of model selection and a well-posed (non-divergent) optimization problem.
The results demonstrate the utility of graphical modeling for inference of dynamic processes in single molecule biophysics.
Detecting recombinations in the genome sequence of human immunodeficiency virus (HIV-1) is crucial for epidemiological studies and for vaccine development. Herein, we present a web server for subtyping and localization of phylogenetic breakpoints in HIV-1. Our software is based on a jumping profile Hidden Markov Model (jpHMM), a probabilistic generalization of the jumping-alignment approach proposed by Spang et al. The input data for our server is a partial or complete genome sequence from HIV-1; our tool assigns regions of the input sequence to known subtypes of HIV-1 and predicts phylogenetic breakpoints. jpHMM is available online at .
Inferring regulatory networks from experimental data via probabilistic graphical models is a popular framework to gain insights into biological systems. However, the inherent noise in experimental data coupled with a limited sample size reduces the performance of network reverse engineering. Prior knowledge from existing sources of biological information can address this low signal to noise problem by biasing the network inference towards biologically plausible network structures. Although integrating various sources of information is desirable, their heterogeneous nature makes this task challenging. We propose two computational methods to incorporate various information sources into a probabilistic consensus structure prior to be used in graphical model inference. Our first model, called Latent Factor Model (LFM), assumes a high degree of correlation among external information sources and reconstructs a hidden variable as a common source in a Bayesian manner. The second model, a Noisy-OR, picks up the strongest support for an interaction among information sources in a probabilistic fashion. Our extensive computational studies on KEGG signaling pathways as well as on gene expression data from breast cancer and yeast heat shock response reveal that both approaches can significantly enhance the reconstruction accuracy of Bayesian Networks compared to other competing methods as well as to the situation without any prior. Our framework allows for using diverse information sources, like pathway databases, GO terms and protein domain data, etc. and is flexible enough to integrate new sources, if available.
We present a hidden Markov model (HMM) for inferring gradual isolation between two populations during speciation, modelled as a time interval with restricted gene flow. The HMM describes the history of adjacent nucleotides in two genomic sequences, such that the nucleotides can be separated by recombination, can migrate between populations, or can coalesce at variable time points, all dependent on the parameters of the model, which are the effective population sizes, splitting times, recombination rate, and migration rate. We show by extensive simulations that the HMM can accurately infer all parameters except the recombination rate, which is biased downwards. Inference is robust to variation in the mutation rate and the recombination rate over the sequence and also robust to unknown phase of genomes unless they are very closely related. We provide a test for whether divergence is gradual or instantaneous, and we apply the model to three key divergence processes in great apes: (a) the bonobo and common chimpanzee, (b) the eastern and western gorilla, and (c) the Sumatran and Bornean orang-utan. We find that the bonobo and chimpanzee appear to have undergone a clear split, whereas the divergence processes of the gorilla and orang-utan species occurred over several hundred thousands years with gene flow stopping quite recently. We also apply the model to the Homo/Pan speciation event and find that the most likely scenario involves an extended period of gene flow during speciation.
Next-generation sequencing technology has enabled the generation of whole-genome data for many closely related species. For population genetic inference we have sequenced many loci, but only in a few individuals. We present a new method that allows inference of the divergence process based on two closely related genomes, modelled as gradual isolation in an isolation with migration model. This allows estimation of the initial time of restricted gene flow, the cessation of gene flow, as well as the population sizes, migration rates, and recombination rates. We show by simulations that the parameter estimation is accurate with genome-wide data and use the model to disentangle the divergence processes among three sets of closely related great ape species: bonobo/chimpanzee, eastern/western gorillas, and Sumatran/Bornean orang-utans. We find allopatric speciation for bonobo and chimpanzee and non-allopatric speciation for the gorillas and orang-utans. We also consider the split between humans and chimpanzees/bonobos and find evidence for non-allopatric speciation, similar to that within gorillas and orang-utans.