Single-molecule techniques allow biophysicists to probe the dynamics of proteins, nucleic acids, and other biological macromolecules with unprecedented resolution [
1-
3]. It is now possible to observe viruses pack DNA into capsids [
4], helicases unzip nucleic acids [
5], motor proteins walk on biopolymers [
6], and ribosome domains undergo structural rearrangements during translation [
7]. These data are acquired by recording the fluorescent output or forces generated from, for example, biomolecules tethered onto microscope slides [
8]; walking on biopolymers [
9]; diffusing in hydrodynamic flow cells [
10]; or pulled by optical [
11] or magnetic [
12] tweezers. Often the molecules studied move through a series of locally stable molecular conformations or positions (generically termed states) and give rise to data of the type shown in Fig. . From these data, the experimentalist wishes to learn a model describing the number of states occupied by the molecule and the transition rates between states. Although the myriad experimental techniques available have much in common, the data they generate often differ enough to require unique models.
For example, some of these models will involve conversion of chemical to mechanical energy, or motion associated with diffusion, or motion associated with transitions between distinct configurational states. Modeling the data, then, typically involves introducing several variables — some of which are observed, others of which are latent or “hidden”; some of which are real-valued coordinates, others of which are discrete states — and specifying algebraically how they are related. Such algebraic relations among a few variables are typical in physical modeling (e.g., the stochastic motion of a random walker, or the assumption of additive, independent, normally distributed errors typical in regression); models involving multiple conditionally-dependent observations or hidden variables with more structured noise behavior are less common. Implicitly, each equation of motion or of constraint specifies which variables are conditionally dependent and which are conditionally independent. Graphical modeling, which begins with charting these dependencies among a set of nodes, with edges corresponding to the conditional probabilities which must be algebraically specified (i.e., the typical elements of a physical model) organizes this process and facilitates basing inference on such models [
13-
15].
Here we explore the application of a specific subset of GMs to biophysical time series data using a specific algorithmic approach for inference: the directed GM and the variational Bayesian expectation maximization algorithm (VBEM). After discussing the theoretical basis and practical advantages of this general approach to modeling biophysical time series data, we apply the method to the problem of inference given single molecule fluorescence resonance energy transfer (smFRET) time series data. We emphasize the process and caveats of modeling smFRET data with a GM and demonstrate the most helpful features of VBEM for this type of time series inference.
Graphical models
GMs are a flexible inference framework based on factorizing a (high-dimensional) multivariate joint distribution into (lower-dimensional) conditionals and marginals [
13-
15]. In a GM, the nodes of the graph represent either observable variables (data, denoted by filled circles), latent variables (hidden states, denoted by open circles), or fixed parameters (denoted by dots). Directed edges between nodes represent conditional probabilities. For example, the three-node graphical model
X →
Y →
Z implies that the joint distribution
p(
Z,Y,
X)
p(
Z|Y,
X)
p(
Y|
X)
p(
X) can be further factorized as
p(
Z|
Y)
p(
Y|
X)
p(
X). Data with a temporal component are modeled by connecting arrows from variables at earlier time steps to variables at later time steps. In many graphical models, the number of observed and latent variables grows with the size of the data set under consideration. To avoid clutter, these variables are written once and placed in a box, often called a “plate”, labeled with the number of times the variables are repeated [
15]. This manuscript will denote hidden variables by
z and observed data by
d. Parameters which are vectors will be denoted as such by overhead arrows.
As an example of a simple GM, imagine trying to learn the number of boys and girls in an elementary school class of
N students from a table of their heights and weights. Here the hidden variable is gender and the observed variable, (height, weight), is a random variable conditionally dependent on the hidden variable. The resulting GM is shown in Fig. , with the parameters of
p(gender) denoted by

and the parameters
p(
height, weight|gender) denoted by

and

. The expression for the probability of the observed data ({
d1,…,
dN} =
D) and latent genders ({
z1,…,
zN} =
Z) is uniquely specified by the graph and the factorization it implies:
In such a simple case it is straighforward to arrive at the expression in Eq. 1 without the use of a GM, but such a chart makes this factorization far more obvious and interpretable.
Inference of GMs
In some contexts, one wishes to learn the probability of the hidden states given the observed data,

where

denotes the parameters of the model and
K denotes the number of allowed values of the latent variables (i.e. number of hidden states). If

is known then efficient inference of

can be performed on any loop-free graph with discrete latent states using the
sum-product algorithm [
16], or, if only the most probable values of
Z are needed, using the closely related
max-sum algorithm [
17]. A loop in a graph occurs when multiple pathways connect two variables, which is unlikely in a graph modeling time series data. Inference for models with continuous latent variables is discussed in [
18,
19]. For most time series inference problems in biophysics, both
Z and

are unknown. In these cases, a criterion for choosing a best estimate of

and an optimization algorithm to find this estimate are needed.
Inference via maximum likelihood
Estimating

is most commonly accomplished using the
maximum likelihood (ML) method, which estimates

as
The probability

is known as the
likelihood. The expectation maximization (EM) algorithm can be used to estimate

[
20]. In EM, an initial guess for

is used to calculate

. The

learned is then used to calculate a new guess for

. The algorithm iterates until convergence, and is guaranteed to converge to a local optimum. The EM algorithm should be run with multiple initializations of

, often called “random restarts”, to increase the probability of finding the globally optimal

.
ML solved via EM is a generally effective method to perform inference however, it has two prominent shortcomings [
14,
15]:
Model selection: The first limitation of ML is that it has no form of model selection: the likelihood monotonically increases with the addition of more model parameters. This problem of fitting too many states to the data (overfitting) is highly undesirable for biophysical time series data, where learning the correct K for the data is often an experimental objective.
Ill-posedness The second problem with ML occurs only in the case of a model with multiple hidden states and a continuous observable (a case which includes the majority of biophysical time series data, including the smFRET data we will consider here). If the mean of one hidden state approaches the position of a data point and the variance of that state approaches zero, the contribution of that datum to the likelihood will diverge. When this happens, the likelihood will be infinite regardless of how poorly the rest of the data are modeled, provided the other states in the model have non-zero probabilities for the rest of the data. For such models, the ML method is ill-posed; poor parameters can still result in infinite likelihood.
In practical contexts, the second problem (divergent likelihood) can be avoided either by performing MAP estimation (maximizing the likelihood times a prior which penalizes small variance) or by ignoring solutions for which likelihood is diverging. That is, one does not actually maximize the likelihood. Model selection can then be argued for based on cross-validation or by penalizing likelihood with a term which monotonically increases with model complexity [
15,
21,
22]. We consider, instead, an alternative optimization criterion which naturally avoids these problems.
Inference via maximum evidence
A Bayesian alternative to ML is to perform inference using the maximum evidence (ME) method. ME can be thought of as an extension of ML to the problem of model selection. Where ML asks which parameters maximize the probability of the data for a given model, ME asks which model, including nested models which differ only in K, makes the data most probable. According to ME, the model of correct complexity (K*) is
The quantity

is called the evidence. Sometimes it is also referred to as the marginal likelihood, since unknown parameters are assigned probability distributions and marginalized (or summed out) over all possible settings. The evidence penalizes both models which underfit and models which overfit. The second expression in Eq. 3 follows readily from the sum rule of probability provided we are willing to model the parameters themselves as random variables. That is, we are willing to specify a distribution over parameters,

. This distribution is called the “prior”, since it can be thought of as the probability of the parameters prior to seeing any data. The parameters for the distributions of the prior

are called
hyperparameters. In addition to providing a method for model selection, by integrating over parameters to calculate the evidence rather than using a “best” point estimate of the parameters, ME avoids the ill-posedness problem associated with ML.
Although ME provides an approach to model selection, calculation of the evidence does not, on its own, provide an estimate for

The VBEM approach to estimating evidence does, however, provide a mechanism to estimate

. VBEM can be thought of as an extension of EM for ME. Both the VBEM algorithm and considerations for choosing priors are discussed in Methods.
smFRET
Before building a GM describing smFRET data, it is helpful to review briefly the experimental method. The experimental technique is based on the spectroscopic phenomenon that, if the emission spectrum of a polar chromophore (donor) overlaps with the absorption spectrum of another polar chromophore (acceptor), electromagnetic excitation of the donor can induce a transfer of energy to the acceptor via a non-radiative, dipole-dipole coupling process termed florescence resonance energy transfer (FRET) [
23]. The transfer efficiency between donor and acceptor scales with the distance between molecules (r) as 1/
r6, with FRET efficiencies most sensitive to r in the range of 1 − 10nm. Because of this extraordinary sensitivity to distance, FRET efficiency can serve as a molecular ruler, allowing an experimentalist to measure the separation between donor and acceptor by stimulating the donor with light and measuring emission intensities of both the donor (
ID) and acceptor (
IA) [
24]. Usually a summary statistic called the “FRET ratio” is used to report on molecular distance rather than the “raw”, 2-channel {
IA,
ID} data, although inference of the raw 2-channel data is possible as well [
25]. The FRET ratio is given by
When the donor and acceptor are attached to an individual protein, nucleic acid, or other molecular complex, the FRET signal can be used to report on the dynamics of the molecule to which the donor and acceptor are attached (see Fig. ). When the experiment is crafted to monitor individual molecules rather than ensembles of molecules, the process is termed single molecule FRET (smFRET). For many biological studies, such as the identification and characterization of the structural dynamics of a biomolecule, smFRET must be used rather than FRET; the majority of molecular dynamics cannot be observed from ensemble averages. Often the molecule of interest adopts a series of locally stable conformations during a smFRET time series. From these data, the experimentalist would like to learn (1) the number of locally stable conformations in the data (i.e. states) and (2) the transition rates between states. Although it is theoretically possible use the FRET signal to quantify the distance between parts of a molecule during a time series, there are usually too many variables affecting FRET efficiency for this to be practical [
26]. Consequently, smFRET is usually used to extract quantitative information about kinetics (i.e. rate constants) but only qualitative information about distances.
The photophysics of FRET have been studied for over half a century, but the first smFRET experiments were only carried out about fifteen years ago [
27]. The field has been growing exponentially since, and hundreds of smFRET papers are published annually [
1]. Diverse topics such as protein folding [
28], RNA structural dynamics [
29], and DNA-protein interactions [
30] have been investigated via smFRET. The size and complexity of smFRET experiments has grown substantially since the original smFRET publication. A modern smFRET experiment can generate thousands of time series to be analyzed [
7].