For the sake of concreteness, we present the model in terms of gene architecture, so the binary characters designate the presence or absence of an intron in a particular locus. However, with trivial notational changes, the model is equally applicable for describing evolution of other types of binary characters.
A length ng multiple alignment of a gene across S species can be described by a matrix of size S × ng. For analysis of binary characters, each such matrix is defined on the alphabet {0,1, *}, with a star (*) indicating a missing value. Each column represents the presence or absence of the binary character at this site (). For G genes aligned for the same set of species, we have a collection of G matrices. Any column in one of the multiple alignments is a vector of length S and is called a pattern. Let T be the phylogenetic tree whose tips match these S species, and let us index its nodes by 0,1,…, 2S − 2, with the convention that 0 is the index of the root. We use the same indexing for the branches, with the convention that branch i leads into node i (). We assume that the topology of T, as well as all branch lengths, Δ1,…, Δ2S−2, is known.
Let qt denote the state of node t (i.e., 0 or 1), and let qtP be the state of its parent node. We denote by T(g, t) the transition matrix, such that Tij(g, t) = Pr(qt = j | qtP = i, g) is the probability of finding node t in state j given that its parent node is in state i, and given that we are looking at gene g. The evolutionary model assumes
Here,
ξt and
ϕt are the intron gain and loss coefficients of branch
t, and
ηg and
θg are the intron gain and loss rates of gene
g. Clearly, the overall probability of gain, loss, or retention of an intron in gene
g along branch
t depends on both the particular gene and the particular branch. For instance, the probability of an intron to be gained in gene
g along branch
t includes a contribution of the branch (
ξt), and a contribution of the gene (1 −
e−ηgΔt). To complete the probabilistic model on the phylogenetic tree, we define
π0 as the probability of the root to be in state 0 (i.e., to be lacking an intron) at a particular site.
Note that, in the absence of the branch-specific coefficients (
ξt = 1 and
ϕt = 0), the transition matrix defines a two-state continuous-time Markovian process. Such a model is popular in evolutionary studies and is implemented in widely used software, such as PAML [
2], PHYLIP [
3], and PAUP* [
4]; a number of expectation-maximization algorithms have been designed to analyze similar Markov processes on phylogenetic trees [
5–
7]. However, these models cannot take into account the combined influences of the branches and the genes and therefore are not applicable under our model.
In sequence evolution, rate variability among sites is customarily accommodated by introducing a random variable
r with unit mean, known as the
rate coefficient. This coefficient is used to scale the time units of the phylogenetic tree for each particular site, reflecting a distinction between fast-evolving sites and slow-evolving ones [
8,
9]. We use a similar idea here, but in order to keep the gain and loss processes independent, two independent rate coefficients are introduced,
rη and
rθ, to scale the gene-specific gain and loss rate parameters,
ηg ←
rη ·
ηg and
θg ←
rθ ·
θg. The rate coefficients are assumed to be distributed according to the following distributions:
Here, Γ(
x;
λ) is the unit-mean gamma distribution of variable
x with shape parameter
λ,
δ(
x) is the Dirac delta function, and
ν is the fraction of sites that are incapable of gaining the character and are denoted
zero sites.
We have developed an expectation-maximization algorithm to find the maximum likelihood estimators of the model parameters [
10,
11]. The full model is described by 2
G + 4
S parameters. This number might become exceedingly high and results in high variance to the estimates. To circumvent this problem we have developed a two-phase analysis procedure [
12,
13]. In the first,
homogeneous phase, the genes are assumed to have the same gain and loss rates, thus they are all treated as one concatenated supergene. This sets
G = 1, reducing the number of parameters to 2 + 4
S. In the second,
heterogeneous phase, all the parameters estimated in the homogeneous phase are frozen, and only the gene-specific parameters are estimated. Extensive simulations had proved the effectiveness of this procedure (see below).
As indicated above, we used this model to study the evolution of the intron-exon gene architecture in eukaryotes. In this case, a coding nucleotide position is marked with 1 if an intron got inserted into the sequence right after it. We generated a set of
G = 391 orthologous genes upon which we marked the presence or absence of the introns. We ran the model upon this data and computed intron gain and loss coefficients for the different lineages [
12], and intron gain and loss rates of the different genes [
13].