PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of nihpaAbout Author manuscriptsSubmit a manuscriptHHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
 
Methods Mol Biol. Author manuscript; available in PMC 2012 August 2.
Published in final edited form as:
PMCID: PMC3410445
NIHMSID: NIHMS385225

A Maximum Likelihood Method for Reconstruction of the Evolution of Eukaryotic Gene Structure

Abstract

Spliceosomal introns are one of the principal distinctive features of eukaryotes. Nevertheless, different large-scale studies disagree about even the most basic features of their evolution. In order to come up with a more reliable reconstruction of intron evolution, we developed a model that is far more comprehensive than previous ones. This model is rich in parameters, and estimating them accurately is infeasible by straightforward likelihood maximization. Thus, we have developed an expectation-maximization algorithm that allows for efficient maximization. Here, we outline the model and describe the expectation-maximization algorithm in detail. Since the method works with intron presence–absence maps, it is expected to be instrumental for the analysis of the evolution of other binary characters as well.

Keywords: Maximum likelihood, expectation-maximization, intron evolution, ancestral reconstruction, eukaryotic gene structure

1. Introduction

In eukaryotes, many protein-coding genes have their coding sequence broken into pieces – the exons – separated by the non-coding spliceosomal introns. These introns are removed from the nascent pre-mRNA and the exons are spliced together to form the intronless mRNA by the spliceosome, a large and elaborate macromolecular complex comprising several small RNA molecules and numerous proteins. No spliceosomal introns have ever been found in prokaryotes, and there are no eukaryotes with a completely sequenced genomes, not even the very basal ones, which would not possess introns (13) and the accompanying splicing machinery (4).

Despite the introns being such a remarkable idiosyncrasy of eukaryotic genomes, their origin and evolution are not thoroughly understood (5, 6). It is generally accepted that introns can be regarded as units of evolution and that their presence/absence pattern is a result of stochastic processes of loss and gain. However, the nature of these processes is vigorously debated. Recent large-scale attempts to study these processes using extant eukaryotic genomes led to incongruent conclusions.

In a study on reconstruction of intron evolution, Rogozin et al. (7) analyzed ~700 sets of intron-bearing orthologous genes from eight eukaryotic species. The multiple alignment of the orthologs within each set was computed, and the intron positions were projected on the alignments to form presence/absence maps. Using Dollo parsimony to infer ancestral states, these authors observed a diverse repertoire of behaviors. Some lineages endured extensive losses, while others experienced mostly gain events. Early forbearers, such as the last common ancestor of multicellular life, were shown to be relatively intron-rich. This work suggested that both gain and loss of introns played significant roles in shaping the modern eukaryotic gene structure. However, as these inferences rely upon the Dollo parsimony reconstruction, the number of gains in terminal branches (leaves of the phylogenetic tree) is overestimated, resulting in underestimation (potentially, significant) of the number of introns in ancient lineages.

The same data set was analyzed by Roy and Gilbert (8, 9) using a different methodology. They adopted a simple evolutionary model, according to which different lineages are associated with different loss and gain probabilities. Using a variation on maximum likelihood estimation, they obtained considerably higher estimates for the number of introns in early eukaryotes and a correspondingly lower level of gains in all lineages, i.e., a clear dominance of loss events in the evolution of eukaryotic genes. Roy and Gilbert have substantially simplified the mathematics involved in the estimation procedure, at the expense of introducing into the computation considerations of parsimony, which yielded an inference technique that is a hybrid between parsimony and maximum likelihood. This hybrid, however, excludes from consideration different evolutionary scenarios, resulting in inflated estimates of the number of introns in early eukaryotes (10).

The model of Roy and Gilbert is branch-specific, i.e., it assumes that the gain and loss rates depend only on the branch, thus tacitly presuming that all genes behave identically with respect to intron gain and loss. Exactly the inverse approach was adopted by Qiu et al. (11). These authors developed a gene-specific model, whereby different gene families are characterized by different rates of intron gain and loss, but for a particular gene these rates are constant across the entire phylogenetic tree. They used a different data set combined with a Bayesian estimation technique and concluded that almost all extant introns were gained during the eukaryotic evolution. This suggests evolution is dominated by intron gain events with few losses. However, the validity of a gene-specific model is disputable as it is hard to reconcile with the accumulating evidence on large differences between lineages (1215).

Recently, two maximum likelihood estimation techniques have been developed for essentially the same branch-specific evolutionary model as the one of Roy and Gilbert. Csuros (10) used a direct approach, while Nguyen et al. (16) developed an expectation-maximization algorithm. Both methods encountered the same problem of estimating the number of unobserved intronless sites. Each employed a technically different but conceptually similar method to evaluate this number. Both techniques were applied to the eight-species data of Rogozin et al. (7), yielding very close estimates. As expected, these methods predict intron occupancy level of ancient lineages higher than those predicted by Dollo parsimony and lower than those predicted by the hybrid technique of Roy and Gilbert. Notably, these estimates are generally closer to those obtained using Dollo parsimony, and they imply an evolutionary landscape comprising both losses and gains, with some excess of gains.

While the Dollo parsimony (7) and the hybrid technique of Roy and Gilbert (8, 9) showed some methodological biases, the other analyses of intron evolution (10, 11, 16) used well-established estimation techniques. Nevertheless, these studies kept yielding widely diverging inferences. The reason seems to be the differences in the underlying evolutionary models, neither being sufficient to describe the complex reality of intron evolution. The branch-specific model fails to account for important differences between genes, whereas the gene-specific model ignores the sharp differences between lineages. Additionally, rate variability between sites, known to be an important factor in other fields of molecular evolution (17, 18), should be taken into account also in the evolution of gene structure. This is particularly important for intron gain in light of the accumulating evidence in favor of the proto-splice model, according to which new introns are preferentially inserted inside certain sequence motifs (1921). This means that sites could dramatically differ in their gain rate depending on their position relative to a proto-splice site.

Here we describe a model of evolution that takes into consideration all of the above factors. In order to efficiently estimate the model parameters by maximum likelihood, we have developed an expectation-maximization algorithm. We also compiled a data set that is considerably larger than previously used ones, consisting of 400 sets of orthologous genes from 19 eukaryotic species. Applying our algorithm to this data set, we obtained high-precision estimates, revealing a fascinating evolutionary history of gene structure, where both losses and gains played significant roles albeit the contribution of losses was somewhat greater. Moreover, we identified novel properties of intron evolution: (i) all eukaryotic lineages share a common, universal, mode of intron evolution, whereby the loss and gain processes are positively correlated. This suggests that the mechanisms of intron gain and loss share common mechanistic components. In some lineages, additional forces come into play, resulting either in elevated loss rate or in elevated gain rate. Lineages exhibiting an increased loss rate are dispersed throughout the entire phylogenetic tree. In contrast, lineages with excessive gains are much rarer, and all of them are ancient. (ii) Intron loss rates of individual genes show no correlation with any other genomic property. By contrast, intron gain rate of individual genes show several remarkable relationships, not always easily explained. In brief, intron gain rate is positively correlated with expression level, negatively correlated with sequence evolution rate, and negatively correlated with the gene length. Moreover, genes of apparent bacterial origin have significantly lower rates of intron gain than genes of archaeal origin. (iii) We showed that the remarkable conservation of intron positions is, mainly (~90%), due to shared ancestry, and only in a minority of the cases (~10%), due to parallel gain at the same location. (iv) We determined that the density of potential intron insertion sites is about 1 site per 7 nucleotides.

2. Materials

The algorithm learns the parameters of the model by comparing the structure of orthologous genes in extant species. To carry out this comparison, it requires two sets of input data, to be described in this section. The first is a phylogenetic tree, defining topological relationships between a set of eukaryotic species. The second is a collection of genes, for which one can identify orthologs in at least a subset of the species above.

2.1. Multiple Alignments

Suppose that we have G sets of aligned orthologous genes from S species. To represent the gene structure, we transform these alignments into intron presence–absence maps by substituting for each nucleotide (or amino acid) 0 or 1, depending on whether an intron is present or absent in the respective position. We allow for missing data by using a third symbol (*), and consequently a gene might be included in the input data even if it is missing in part of the species. Every site in an alignment, called pattern, is a vector of length S over the alphabet (0,1,*). Let Ω be the total number of unique patterns in the entire set of G alignments, denoted ω1, . . . , ωΩ, and let ngp count the number of times pattern ωp is found in the multiple alignment of gene g. Assuming that the sites evolve independently, the set Mg = (ng1, . . . , ngΩ) fully characterizes the multiple alignment of the gth gene. Thus, all the relevant information about the multiple alignments is captured by the list of unique patterns ω1, . . . , ωΩ, and the list of vectors M1, . . . , MG.

2.2. Phylogenetic Tree

Let T be a rooted bifurcating phylogenetic tree with S leaves (terminal nodes) corresponding to the S species above. The total number of nodes in T is N = 2S – 1, and we index them by t = 0, 1, . . . , N – 1, with the convention that zero is the root node. The state of node t is described by the random variable qt, which can take the values 0 and 1 (and * in leaves). We use Vt for the set of all leaves such that node t is among their ancestors. The entire collection of leaves is, obviously, V0. The parent node of t is denoted P(t). We use the special notations qtP and VtP for qp(t) and VP(t), respectively. Analogously, the two direct descendents of node t are denoted L(i) and R(i), and we use the special notations qtL,qtR,VqL, and VqR for qL(t), qR(t), VL(t), and VR(t), respectively. We index branches by the node into which they are leading, and use Δt to denote the length (in time units) of the tth branch. We assume that the tree topology, as well as all the branch lengths Δ1, . . . , ΔN–1 are known.

3. Methods

3.1. The Probabilistic Model

A graphical model is a mathematical graph whose nodes symbolize random variables, and whose branches describe dependence relationships between them (22). A bifurcating phylogenetic tree, when viewed as a graphical model, depicts the probabilistic model

Pr(q0)[product]t=1N1Pr(qt[mid ]qtP).
[1]

We use the notation πi = Pr(q0 = i) to describe the prior probability of the root, and Aij(g,t)=Pr(qt=j[mid ]qtP=i,g) to describe the transition probability for gene g along branch t. In our model, we assume that the transition probability depends on both the gene and the branch, and that it takes the explicit form

A(g,t)=(1ξt(1eηgΔt)ξt(1eηgΔt)1(1ϕt)eθgΔt(1ϕt)eθgΔt).
[2]

Here, ηg and θg are nonnegative parameters, determining the intron gain and loss rates, respectively, of gene g. Complementarily, ξt and [var phi]t determine the intron gain and loss coefficients of branch t, respectively, and are bound to the range 0 ≤ ξt, [var phi]t ≤ 1. The probability of an intron present in gene g at the beginning of branch t to be retained along the branch is (1ϕt)eθgΔt, that is, it is retained only if the branch does not lose it (with probability 1 – [var phi]t), and also the gene does not lose it (with probability eθgΔt). This comes to reflect a reality where strong forces to strip a gene off its introns will be practically unaffected by the particular lineage, and, oppositely, strong forces to strip a lineage off its introns will be practically unaffected by the particular gene. In the same spirit, the probability of an intron to be gained in gene g along branch t is ξt(1eηgΔt), that is, it is gained only if both the branch “approves” it (with probability ξt) and the gene “approves” it (with probability 1eηgΔt).

In other fields of molecular evolution, it was long realized that analysis precision improves if one allows for rate variability across sites (17, 18). Typically, such rate variability is modeled by introducing a rate variable, r, which scales, for each site, the time units of the phylogenetic tree, Δtr · Δt. This rate variable is a random variable, distributed according to a distribution function with nonnegative domain and unit mean, typically the unit-mean gamma distribution. The rate variability reflects the idea that sites differ in their rate of evolution. Specifically, there are fast-evolving sites (r > > 1), as well as slow-evolving ones (r < < 1). In our model of intron evolution we extend this idea by assuming that the gain and loss processes are subject to rate variability, independently of each other. Hence, a site can have any combination of gain and loss rates. To accommodate this idea, we use two independent rate variables, rη and rθ, that are used to scale, for each site, the gene-specific gain rate, ηgrη[center dot]ηg, and the gene-specific loss rate, θgrθ[center dot]θg. We further assume that the distributions of these rate variables are independent of the genes, and are explicitly given by

rη~νδ(η)+(1ν)Γ(η;λη)rθ~Γ(θ;λθ).
[3]

Here, Γ(x, λ) is the unit-mean gamma distribution of variable x with shape parameter λ, δ(x) is the Dirac delta-function, and η is the fraction of sites that are assumed to have zero gain rate. These latter sites, denoted invariant sites, reflect these sites that are not a proto-splice site (1921). Intron loss does not have an invariant counterpart, as the assumption is that once an intron is gained, it can always be lost. Therefore, the loss rate variable is assumed to be distributed according to a gamma distribution, which is by far the most popular in describing rate variability (17, 18, 23).

In practice, the rate distributions in Eq. [3] are rendered discrete (24). We assume that the gain rate variable can take Kη discrete values r1η=0,r2η,,rKηη with probabilities f1η=ν,f2η,fKηη such that k=1Kηfkη=1. Analogously, we assume that the loss rate variable can take Kθ discrete values r1θ,,rKθθ with probabilities f1θ,,fKθθ such that k=1Kθfkθ=1. For a particular gain rate value rkη, we denote the actual gain rate rkη[center dot]ηg by ηkg. Similarly, for a particular loss rate value rkθ, we denote the actual loss rate rkθ[center dot]θg by θkg.

For notational clarity, we aggregate the model parameters into a small number of sets. To this end, let Ξt = {ξt, [var phi]t} be the set of parameters that are specific for branch t, and let Ξ = (Ξ1, . . . , ΞN – 1) be the set of all branch-specific parameters. Similarly, let Ψg=(ηg,θg) be the set of parameters that are specific for gene g, and let Ψ = (Ψ1 . . . ,ΨG) be the set of all gene-specific parameters. Additionally, we denote by Λ = (ν, λη, λθ) the parameters that determine the rate variability. When the distinction between the different sets of parameters is irrelevant, we shall use Θ = (Ξ, Ψ, Λ) as the set of all the model's parameters. We achieve further succinctness in notations by denoting the actual gene-specific rate values for particular values rkη and rkθ of the rate variables as Ψkkg=(ηkg,θkg).

3.2. The EM Algorithm

For each site, the S leaves form a set of observed random variables, their states being described by the corresponding pattern ωp. The state of all the internal nodes, denoted σ, form a set of hidden random variables, that is, random variables whose state is not observed. In order to account for rate variability across sites, we associate with each pattern two hidden random variables, ρpη and ρpη, that determine the value of the rate variables in that site. To sum up, the observed random variables are ωp, and the hidden random variables are (σ,ρpη,ρpθ).

We assume that sites within a gene, as well as the genes themselves, evolve independently. Therefore, the total likelihood can be decomposed as

L(M1,,MG[mid ]ϴ)=[product]g=1GL(Mg[mid ]Ξ,Ψg,Λ)=[product]g=1G[product]p=1ΩL(ωp[mid ]Ξ,Ψg,Λ)ngp.

and so

logL(M1,,MG[mid ]ϴ)=g=1Gp=1ΩngplogL(ωp[mid ]Ξ,Ψg,Λ).
[4]

According to the well-known EM paradigm (25) log L(M1, . . . , MG|Θ) is guaranteed to increase as long as we maximize the auxiliary function

Q(ϴ,ϴ0)=g=1Gp=1ΩngpQgp(Ξ,Ψg,Λ,Ξ0,Ψg0,Λ0),
[5]

where

Qgp(Ξ,Ψg,Λ,Ξ0,Ψg0,Λ0)=σ,ρpη,ρpθPr(σ,ρpη,ρpθ[mid ]ωp,Ξ0,Ψg0,Λ0)logPr(ωp,σ,ρpη,ρpθ[mid ]Ξ,Ψg,Λ).
[6]

Using some manipulations (see Note 1), this can be written as

Qgp(Ξ,Ψg,Λ,Ξ0,Ψg0,Λ0)=k=1Kηk=1Kθ[Pr(ρpη=k,ρpθ=k[mid ]ωp,Ξ0,Ψg0,Λ0)].[σPr(σ[mid ]ωp,Ξ0,Ψgkk0)[center dot]{logfkη+logfkθ+logPr(ωp,σ[mid ]Ξ,Ψgkk)}].

Denoting by wgpkk and Qgpkk the first and second square brackets, respectively, this expression becomes

Qgp(Ξ,Ψg,Λ,Ξ0,Ψg0,Λ0)=k=1Kηk=1KθwgpkkQgpkk,
[7]

and consequently

Q(ϴ,ϴ0)=g=1Gp=1Ωk=1Kηk=1KθngpwgpkkQgpkk.
[8]

3.2.1. The E-Step

In this step we compute the function Q(ϴ,ϴ0), or, equivalently, the set of coefficients wgpkk and Qgpkk. We accomplish this with the aid of an inward–outward recursion on the tree.

3.2.1.1. The Inward (γ) Recursion

Here we propose a variation on the well-known Felsenstein's pruning algorithm (26). Let us associate with each node t (except for the root) a vector γigpkk(t)=Pr(Vt[mid ]qtP=i,Ξ0,Ψgkk0). In words, γigpkk (t) is the probability of observing the nodes Vt (which are a subset of the pattern ωp) for a gene g, when the gain and loss rate variables are rkη and rkθ, respectively, and when the parent node of t is known to be in state i. By definition, this function is initialized at all leaves (t [set membership] V0) by

γ(t[set membership]V0)={(1ξt(1eηgkΔt)1(1ϕt)eθgkΔt)qt=0(ξt(1eηgkΔt)(1ϕt)eθgkΔt)qt=1.}
[9]

Here, and in the derivations to follow, we omit the superscript from γ. For all internal nodes (except for the root), γ is computed using the recursion

γi(t)=j=01Aij(g,t)γ~j(t),
[10]

where γ~j(t) is defined as γj[L(t)]γj[R(t)] (see Note 2).

The γ-recursion allows for computing the likelihood of any observed pattern ωp, given the values of the rate variables:

Pr(ωp[mid ]Ξ0,Ψgkk0)=Pr(V0[mid ]Ξ0,Ψgkk0)=Pr(V0L,V0R[mid ]Ξ0,Ψgkk0)==i=01Pr(V0L,V0R,q0=i[mid ]Ξ0,Ψgkk0)==i=01Pr(q0=i[mid ]Ξ0,Ψgkk0)[center dot]Pr(V0L[mid ]q0=i,Ξ0,Ψgkk0)[center dot]Pr(V0R[mid ]V0L,q0=i,Ξ0,Ψgkk0).

Given q0, V0R is independent of V0L, and so

Pr(V0R[mid ]V0L,q0=i,Ξ0,Ψgkk0)=Pr(V0R[mid ]q0=i,Ξ0,Ψgkk0),

and

Pr(ωp[mid ]Ξ0,Ψgkk0)=i=01πiγ~i(0).
[11]

This γ-recursion can be easily modified to incorporate missing data (see Note 3).

3.2.1.2. The Outward (α) Recursion

Once the γ-recursion is computed, we can use it to compute a second, complementary, recursion. To this end, let us associate with each node t (except for the root node) a matrix αijgpkk(t)=Pr(qt=j,qtP=i[mid ]ωp,Ξ0,Ψgkk0). It is beneficial to define for each node t (except for the root node) a vector βjgpkk(t)=i=01αijgpkk(t)=Pr(qt=j[mid ]ωp,Ξ0,Ψgkk0). Upon the computation of α, β is readily computed too. Again, omitting the superscripts, α can be initialized from its definition on the two direct descendants of the root,

α(D(0))=1Pr(ωp[mid ]Ξ0,Ψgkk0){(π0γ0(D(0))A00(g,D(0))0π1γ1(D(0))A10(g,D(0))0)D(0)[set membership]V0,q0D=0(0π0γ0(D(0))A01(g,D(0))0π1γ1(D(0))A11(g,D(0)))D(0)[set membership]V0,q0D=1(π0γ0(D(0))γ~0(D(0))A00(g,D(0))π0γ0(D(0))γ~1(D(0))A01(D(0))π1γ1(D(0))γ~0(D(0))A10(D(0))π1γ1(D(0))γ~1(D(0))A11(D(0)))D(0)[negated set membership]V0.}
[12]

Here, D(0) stands for any one of the direct descendants of the root, and D(0) is its sibling. For any other internal node, α is computed using the outward-recursion

α(t)=(β0(P(t))γ~0(t)A00(g,t)/γ0(t)β0(P(t))γ~1(t)A01(g,t)/γ0(t)β1(P(t))γ~0(t)A10(g,t)/γ1(t)β1(P(t))γ~1(t)A11(g,t)/γ1(t))
[13]

(see Note 4).

Finally, for each leaf that is not a descendant of the root,

α(t)={(β0(P(t))0β1(P(t))0)qt=0(0β0(P(t))0β1(P(t)))qt=1.}t[set membership]V0,P(t)0
[14]

Again, this recursion can be straightforwardly modified when missing data are present (see Note 5).

These inward–outward recursions are the phylogenetic equivalent of the backward–forward recursions known from hidden Markov models, and other versions of it have already been developed (27, 28). The version that we developed here can be shown to be the realization of the junction tree algorithm (29) on rooted bifurcating trees (see Note 6).

3.2.1.3. Computing the Coefficients wgpkk′

Here we show that the γ-recursion is sufficient to compute the coefficients wgpkk′. From the definition, wgpkk=Pr(ρpη=k,ρpθ=k[mid ]ωp,Ξ0,Ψg0,Λ0). Using the Bayes formula Pr(x, y|z) = Pr(x, y, z)/Σx,y Pr(x, y, z), we can rewrite it as

wgpkk=Pr(ρpη,=k,ρpθ=k,ωp[mid ]Ξ0,Ψg0,Λ0)h,hPr(ρpη=h,ρpθ=h,ωp[mid ]Ξ0,Ψg0,Λ0)==Pr(ρpη=k[mid ]Ξ0,Ψg0,Λ0)[center dot]Pr(ρpθ=k[mid ]Ξ0,Ψg0,Λ0)[center dot]Pr(ωp[mid ]Ξ0,Ψgkk0)h,hPr(ρpη=h[mid ]Ξ0,Ψg0,Λ0)[center dot]Pr(ρpθ=h[mid ]Ξ0,Ψg0,Λ0)[center dot]Pr(ωp[mid ]Ξ0,Ψghh0).

But Pr(ρpη=k[mid ]Ξ0,Ψg0,Λ0) is just the current estimate of the probability of the gain rate variable to have the value rkη, namely (fkη)0. Similarly, Pr(ρpθ=k[mid ]Ξ0,Ψg0,Λ0) is just (fkθ)0. Therefore, the expression for the coefficients wgpkk is reduced to

wgpkk=(fkη)0(fkθ)0Pr(ωp[mid ]Ξ0,Ψgkk0)h,h(fhη)0(fhθ)0Pr(ωp[mid ]Ξ0,Ψghh0).
[15]

The function Pr(ωp[mid ]Ξ0,Ψgkk0) is the likelihood of observing pattern ωp for gain and loss rate variables rkη and rkθ, respectively. This is readily computed upon completion of the γ-recursion, using Eq. [11].

3.2.1.4. Computing the Coefficients Q gpkk'

Here we show that these coefficients require the α, β-recursion. By definition,

Qgpkk=σPr(σ[mid ]ωp,Ξ0,Ψgkk0)[center dot][logfkη+logfkθ+logPr(ωp,σ[mid ]Ξ,Ψgkk)].

The probability Pr(ωp,σ[mid ]Ξ,Ψgkk) is just the likelihood of a particular realization of the tree, thus from Eq. [1]

logPr(ωp,σ[mid ]Ξ,Ψgkk)=i=01δ(q0,i)[center dot]logπi+i,j=01t=1N1δ(qt,j)δ(qtP,i)[center dot]logAij(g,t).
[16]

Here, δ(a, b) is the Kronecker delta function, which is 1 for a = b and 0 otherwise. Denote the expectation over Pr(σ[mid ]ωp,Ξ0,Ψgkk0) by Eσ. Applying it to Eq. [16], we get

Eσ[logPr(ωp,σ[mid ]Ξ,Ψgkk)]=i=01logπi[center dot]Eσ[δ(q0,i)]+i,j=01t=1N1logAij(g,t)[center dot]Eσ[δ(qt,j)δ(qtP,i)].

But Eσ[δ(q0,i)]=Pr(q0=i[mid ]ωp,Ξ0,Ψgkk0)=βi(0), and similarly Eσ[δ(qt,j)δ(qtP,i)]=αij(t). Hence, Qgpkk is given by

Qgpkk=σPr(σ[mid ]ωp,Ξ0,Ψgkk0)[logfkη+logfkθ+logPr(ωp,σ[mid ]Ξ,Ψgkk)]=logfkη+logfkθ+i=01βi(0)logπi+i,j=01t=1N1αij(t)logAij(g,t).
[17]

3.2.2. The M-Step

Substituting Eq. [17] in Eq. [8], we obtain an explicit form of the function whose maximization guarantees stepping up-hill in the likelihood landscape,

Q=g=1Gp=1Ωk=1Kηk=1Kθngpwgpkk(logfkη+logfkθ)++g=1Gp=1Ωk=1Kηk=1Kθngpwgpkk[β0gpkk(0)logπ0+β1gpkk(0)logπ1]++g=1Gp=1Ωk=1Kηk=1Kθt=1N1ngpwgpkkα00gpkk(t)log[1ξt(1eηgkΔt)]++g=1Gp=1Ωk=1Kηk=1Kθt=1N1ngpwgpkkα01gpkk(t)[logξt+log(1eηgkΔt)]++g=1Gp=1Ωk=1Kηk=1Kθt=1N1ngpwgpkkα10gpkk(t)log[1(1ϕt)eθgkΔt]++g=1Gp=1Ωk=1Kηk=1Kθt=1N1ngpwgpkkα11gpkk(t)[log(1ϕt)θgkΔt].
[18]

Actually, any increase in Q is sufficient to guarantee an increase in the likelihood, suggesting that a precise maximization of Q is not very important. Therefore, we speed computations by performing low-tolerance maximization with respect to each of the parameters individually. Except for the parameters λη and λθ, it is easy to differentiate Q twice with respect to any parameter. This lends itself into using simple zero-finding algorithms; we chose the Newton-Raphson algorithm (30). Maximizing Q with respect to the shape parameters λη and λθ is more involved, as Q depends on these parameters only through the discrete approximation of the rate variability distributions, Eq. [3] (see Note 7).

Footnotes

1If we replace the formal summing over all states of ρpη and ρpη in Eq. [6] by a direct sum, we get

Qgp(Ξ,Ψg,Λ,Ξ0,Ψg0,Λ0)=k=1Kηk=1KθσPr(σ,ρpη=k,ρpθ=k[mid ]ωp,Ξ0,Ψg0,Λ0)logPr(ωp,σ,ρpη=k,ρpθ=k[mid ]Ξ,Ψg,Λ).
[19]

Using our notational conventions, we can write the first term in Eq. [19] as

Pr(σ,ρpη=k,ρpθ=k[mid ]ωp,Ξ0,Ψg0,Λ0)=Pr(ρpη=k,ρpθ=k[mid ]ωp,Ξ0,Ψg0,Λ0)[center dot]Pr(σ[mid ]ωp,Ξ0,Ψgkk0),
[20]

and the second term as

logPr(ωp,σ,ρpη=k,ρpθ=k[mid ]Ξ,Ψg,Λ)=logPr(ρpη=k[mid ]Ξ,Ψg,Λ)++logPr(ρpθ=k[mid ]Ξ,Ψg,Λ)+logPr(ωp,σ[mid ]Ξ,Ψgkk)=logfkη+logfkθ+logPr(ωp,σ[mid ]Ξ,Ψgkk).
[21]

Substituting Eqs. [20] and [21] back in Eq. [19] gives the desired result.

2We expand

γi(t)=Pr(Vt[mid ]qtP=i)=Pr(VtL,VtR[mid ]qtP=i)=j=01Pr(VtL,VtR,qt=j[mid ]qtP=i)==j=01Pr(qt=j[mid ]qtP=i)[center dot]Pr(VtL[mid ]qt=j,qtP=i)[center dot]Pr(VtR[mid ]VtL,qt=j,qtP=i).
[22]

The first term is simply the definition of Aij(g,t). Given qt, VtL is independent on qtP, thus the second term is just Pr(VtL[mid ]qt=j)=γj(tL). By similar arguments the third term is just Pr(VtR[mid ]qt=j)=γj(tR). By substituting those results in Eq. [22], we recover the recursion formula, Eq. [10].

3One of the appealing features of this recursion is that it allows to treat missing data fairly easily. Only a single option has to be added to the initialization phase Eq. [9],

γ(t[set membership]V0)=(11)qt=*.

4To prove this recursion, let us start with the definition of α,

αij(t)=Pr(qt=j,qtP=i[mid ]ωp)=Pr(qt=j,qtP=i[mid ]V0)=Pr(qtP=i[mid ]V0)[center dot]Pr(qt=j[mid ]qtP=i,V0)=βi(P(t))[center dot]Pr(qt=j[mid ]qtP=i,V0).
[23]

Let us make the decomposition V0 = Vt + Vt, with Vt being the set of all leaves such that node t is not among their ancestors. But, given qtP, the state of node t is independent on Vt, and therefore Eq. [23] becomes

αij(t)=βi(P(t))[center dot]Pr(qt=j[mid ]qtP=i,Vt).
[24]

From Bayes formula,

Pr(qt=j[mid ]qtP=i,Vt)=Pr(qt=j,Vt[mid ]qtP=i)Pr(Vt[mid ]qtP=i)=Pr(qt=j[mid ]qtP=i)[center dot]Pr(Vt[mid ]qt=j,qtP=i)γi(t)=Aij(g,t)γi(t)[center dot]Pr(Vt[mid ]qt=j,qtP=i).
[25]

But given qt, Vt is independent of P(t) and therefore

Pr(Vt[mid ]qt=j,qtP=i)=Pr(Vt[mid ]qt=j)=γ~j(t).
[26]

Combining Eqs. [25] and [26] in Eq. [24], we get

αij(t)=γ~j(t)βi(P(t))γi(t)Aij(g,p),

which is just another form of writing Eq. [13].

5When missing data are present, two simple modifications are required. First, we have to add to the initialization phase Eq. [12] an option

α(D(0))=1Pr(ωp[mid ]Ξ0,Ψgkk0){π0γ0[D(0)]A00[g,D(0)]π0γ0D(0)]A01[D(0)]π1γ1[D(0)]A10[D(0)]π1γ1(D(0)]A11[D(0)]}D(0)[set membership]V0,q0D=*

Second, we have to add to the finalization phase Eq. [14] an option

α(t)={β0[P(t)]A00(g,t)β0[P(t)]A01(g,t)β1[P(t)]A10(g,t)β1[P(t)]A11(g,t)}qt=*.

6The junction tree algorithm is a scheme to compute marginal probabilities of maximal cliques on graphs by means of belief propagation on a modified junction tree. Indeed, the matrix α computes marginal probabilities of pairs (t, P(t)), but such pairs are nothing but maximal cliques on rooted bifurcating trees.

7In our implementation, we used Yang's quantile method (24) to compute the discrete levels of the gamma distributions such that each level has equal probability. Formally, f1η=ν,fkη=(1ν)/(Kη1) for k = 2, . . . , Kη, and fkθ=1/Kθ for k = 1, . . . , Kθ. To perform the maximization in this case, we used Brent's maximization algorithm that does not require derivatives (30).

References

1. Nixon JE, Wang A, Morrison HG, McArthur AG, Sogin ML, Loftus BJ, Samuelson J. A spliceosomal intron in Giardia lamblia. Proc Natl Acad Sci U S A. 2002;99:3359–3361. [PubMed]
2. Vanacova S, Yan W, Carlton JM, Johnson PJ. Spliceosomal introns in the deep-branching eukaryote Trichomonas vaginalis. Proc Natl Acad Sci U S A. 2005;102:4430–4435. [PubMed]
3. Simpson AG, MacQuarrie EK, Roger AJ. Early origin of canonical introns. Nature. 2002;419:270. [PubMed]
4. Collins L, Penny D. Complex spliceosomal organization ancestral to extant eukaryotes. Mol Biol Evol. 2005;22:1053–1066. [PubMed]
5. Lynch M, Richardson AO. The evolution of spliceosomal introns. Curr Opin Genet Dev. 2002;12:701–710. [PubMed]
6. Roy SW, Gilbert W. The evolution of spliceosomal introns: patterns, puzzles and progress. Nat Rev Genet. 2006;7:211–221. [PubMed]
7. Rogozin IB, Wolf YI, Sorokin AV, Mirkin BG, Koonin EV. Remarkable interkingdom conservation of intron positions and massive, lineage-specific intron loss and gain in eukaryotic evolution. Curr Biol. 2003;13:1512–1517. [PubMed]
8. Roy SW, Gilbert W. Complex early genes. Proc Natl Acad Sci U S A. 2005;102:1986–1991. [PubMed]
9. Roy SW, Gilbert W. Rates of intron loss and gain: implications for early eukaryotic evolution. Proc Natl Acad Sci U S A. 2005;102:5773–5778. [PubMed]
10. Csuros M. McLysaght A, Huson D, editors. Likely scenarios of intron evolution, Lecture Notes in Bioinformatics. Proc. RECOMB 2005 Comparative Genomics International Workshop (RCG 2005) 2005;3678:47–60.
11. Qiu WG, Schisler N, Stoltzfus A. The evolutionary gain of spliceosomal introns: sequence and phase preferences. Mol Biol Evol. 2004;21:1252–1263. [PubMed]
12. Fedorov A, Roy SW, Fedorova L, Gilbert W. Mystery of intron gain. Genome Res. 2003;13:2236–2241. [PubMed]
13. Cho S, Jin SW, Cohen A, Ellis RE. A phylogeny of caenorhabditis reveals frequent loss of introns during nematode evolution. Genome Res. 2004;14:1207–1220. [PubMed]
14. Roy SW, Hartl DL. Very little intron loss/gain in Plasmodium: intron loss/gain mutation rates and intron number. Genome Res. 2006;16:750–756. [PubMed]
15. Jeffares DC, Mourier T, Penny D. The biology of intron gain and loss. Trends Genet. 2006;22:16–22. [PubMed]
16. Nguyen HD, Yoshihama M, Kenmochi N. New maximum likelihood estimators for eukaryotic intron evolution. PLoS Comput Biol. 2005;1:e79. [PMC free article] [PubMed]
17. Nei M, Chakraborty R, Fuerst PA. Infinite allele model with varying mutation rate. Proc Natl Acad Sci U S A. 1976;73:4164–4168. [PubMed]
18. Uzzell T, Corbin KW. Fitting discrete probability distributions to evolutionary events. Science. 1971;172:1089–1096. [PubMed]
19. Dibb NJ. Proto-splice site model of intron origin. J Theor Biol. 1991;151:405–416. [PubMed]
20. Dibb NJ, Newman AJ. Evidence that introns arose at proto-splice sites. Embo J. 1989;8:2015–2021. [PubMed]
21. Sverdlov AV, Rogozin IB, Babenko VN, Koonin EV. Reconstruction of ancestral protosplice sites. Curr Biol. 2004;14:1505–1508. [PubMed]
22. Jordan IM, editor. Learning in Graphical Models. Kluwer Academic Publishers; Boston, MA: 1998.
23. Jin L, Nei M. Limitations of the evolutionary parsimony method of phylogenetic analysis. Mol Biol Evol. 1990;7:82–102. [PubMed]
24. Yang Z. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J Mol Evol. 1994;39:306–314. [PubMed]
25. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. J R Statist Soc B. 1977;39:1–38.
26. Felsenstein J. Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol. 1981;17:368–376. [PubMed]
27. Friedman N, Ninio M, Pe'er I, Pupko T. A structural EM algorithm for phylogenetic inference. J Comput Biol. 2002;9:331–353. [PubMed]
28. Siepel A, Haussler D. Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. Mol Biol Evol. 2004;21:468–488. [PubMed]
29. Castillo E, Gutierrez JM, Hadi AS. Expert systems and probabilistic network models (Monographs in Computer Science) Springer; New York: 1996.
30. Press WH, Flannery BP, Teukolsky SA, Vetterling WT. Numerical recipes in C: The art of scientific computing. 2nd ed. Cambridge University Press; New York: 1992.