Twenty-eight years have passed since the discovery of spliceosomal introns [1
], but their evolution remains poorly understood. In the ongoing debate on intron evolution, the central issues include: introns-early versus introns-late, the mini-gene hypothesis, the proto-splice site hypothesis, rates of intron gain and loss, and ratio of parallel gain. Reconstruction of intron evolution from observed data is an important step toward resolution of these issues.
Although introns have high mutation rates, making it difficult to trace lineages through sequence homology, their positions are well conserved [2
]. Therefore, it is possible to reconstruct intron evolution by comparing intron positions among sets of orthologous genes from different species. There are five basic steps involved in this reconstruction: step 1, sequence genomes and annotate the genes; step 2, select sets of orthologous genes among species; step 3, align orthologous genes to produce an intron presence/absence matrix; step 4, reconstruct a phylogenetic tree of the species studied; and step 5, make inferences of intron evolution based on the intron matrix and the phylogenetic tree. Among these, step 5 is perhaps the most critical, since erroneous inferences will lead to misunderstandings of intron evolution. However, to date this step remains the least investigated.
Two main approaches, maximum parsimony (MP) and maximum likelihood (ML), have been applied to determine intron evolution from the pattern of intron position conservation. The MP approach, based on the assumption that intron gain and loss events occur rarely in evolution, infers the most parsimonious scenario (measured as a function of gains and losses). Conversely, the ML approach infers a scenario with the highest probability of producing the observed data, for a given model of intron evolution.
In a large-scale study of intron evolution, Rogozin et al. [4
] applied an MP method to infer intron gains and losses along each branch of a phylogenetic tree of eight eukaryotes using 684 gene orthologs. Their results demonstrated that intron density at the crown (plant–animal) ancestor is relatively high (nearly one-third of the density found in humans). Gains outweighed losses in some lineages; in others, the opposite trend was observed. However, overall there were two gains per loss, suggesting a more important role for intron gain than loss during the course of eukaryotic evolution. In another large-scale study of 2,073 sets of orthologs from four fungal species, Nielsen et al. [5
] used an MP method to infer intron gains and losses, then used a probabilistic model to correct the numbers. They found that intron gains are on a par with intron losses in the four species investigated.
Roy and Gilbert [6
] applied an ML method to the above mentioned dataset of eight eukaryotes (excluding one species). Their results were somewhat surprising: the genes of the crown ancestor were rich in introns (about two-thirds the density found in humans), and many lineages exhibited a notable excess of intron losses over gains. The ratio between gains and losses in this study was 0.7, suggesting a general trend toward decreased intron density. In another study, Qiu et al. [8
] applied a Bayesian network method, also based on the ML principle, to infer the evolution of introns in ten gene families containing a total of 677 sequences. Their results suggest that many of the intron positions shared across various species are the result of independent gains, and are not due to conservation of intron position.
The results of the two ML methods clearly contradict each other and also differ significantly from results of the MP methods. This may be because of the different assumptions they made about the number of target sites; a target site is defined here as a possible position for intron insertion in the multiple-sequence alignment of all species. The number of target sites thereby includes all observed sites, plus possible sites that had introns in the past or may have introns in the future. The method of Roy and Gilbert [6
] assumes that parallel gains do not occur at all, which will happen only when the number of target sites is many orders of magnitude larger than the number of observed sites. In contrast, the method of Qui et al. [8
] assumes that the number of target sites is the number of observed sites.
In this paper, we propose a new ML method, the underlying basis of which is the treatment of target site number as a parameter requiring optimization. A log-likelihood function was formulated to allow optimization of the number of target sites, and an expectation-maximization (EM) algorithm was developed to optimize the log-likelihood function. Our results paint a different picture from those provided by previous ML methods.