The recent availability of complete genomic sequences of a diverse group of living organisms allows one to quantify basic mechanisms of molecular evolution on an unprecedented scale. The part of the genome consisting of all protein-coding genes (the full repertoire of its proteome) is at the heart of all processes taking place in a given organism. It is therefore important to understand and quantify the rates and other parameters of the basic evolutionary processes shaping the proteome thus defined. The most important of these processes are:
• Gene duplications that give rise to new protein-coding regions in the genome. The two initially identical proteins encoded by a pair of duplicated genes subsequently diverge from each other in both their sequences and functions.
• Gene deletions in which genes that are no longer required for the functioning of the organism are either explicitly deleted from the genome or stop being transcribed and become pseudogenes whose homology to the existing functional genes is rapidly obliterated by mutations.
• Changes in amino-acid sequences of proteins encoded by already existing genes. These cover a broad spectrum of processes, including point substitutions, insertions and deletions (indels), and transfers of whole domains either from other genes in the same genome or even from genomes of other species.
The BLAST (blastp) algorithm [1] allows one to quickly obtain the list of pairs of paralogous proteins encoded in a given genome whose amino-acid sequences have not diverged beyond recognition. The set of their percentage identities (PIDs) is a dynamic entity that changes due to gene duplications, deletions, and local changes of sequences. Duplication events constantly create new pairs of paralogous proteins with PID = 100%, while subsequent substitutions, insertions and deletions cause their PID to drift down towards lower values. A paralogous pair disappears from this dataset if one of its constituent genes is deleted from the genome or becomes a pseudogene, or if the PID of the pair becomes too low for it to pass the E-value cutoff of the algorithm. Thus the PID histogram contains valuable, if indirect, information about past duplications, deletions, and sequence divergence events that took place in the genome. In what follows we propose a mathematical framework allowing one to extract some of this information and quantify the average rates and other parameters of the basic evolutionary processes shaping the protein-coding content of a genome.
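As a rough illustration of how such a PID histogram could be assembled, the sketch below reads all-to-all blastp hits in the standard 12-column tabular format (query, subject, percent identity, ..., E-value, bit score), drops self-hits, keeps one entry per unordered pair, and bins the PIDs. The function name `pid_histogram`, the E-value cutoff of 1e-10, the 10% bin width, and the choice to keep the highest PID per pair are assumptions made for this example, not values taken from the text.

```python
from collections import Counter

def pid_histogram(blast_lines, evalue_cutoff=1e-10, bin_width=10):
    """Bin percent identities of unordered paralogous pairs.

    blast_lines: iterable of tab-separated BLAST tabular rows
    (qseqid, sseqid, pident, ..., evalue, bitscore).
    The cutoff and bin width are illustrative assumptions.
    """
    best = {}  # unordered pair -> highest PID seen among its hits
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        pid, evalue = float(fields[2]), float(fields[10])
        if query == subject or evalue > evalue_cutoff:
            continue  # drop self-hits and hits failing the E-value cutoff
        pair = tuple(sorted((query, subject)))  # (A, B) and (B, A) coincide
        best[pair] = max(pid, best.get(pair, 0.0))
    hist = Counter()
    for pid in best.values():
        # Lower edge of the bin; PID = 100 is folded into the top bin.
        lower = min(int(pid // bin_width) * bin_width, 100 - bin_width)
        hist[lower] += 1
    return dict(sorted(hist.items()))
```

Each key of the returned dictionary is the lower edge of a PID bin and each value is the number of paralogous pairs falling in that bin.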
The list of all paralogous pairs generated by the all-to-all alignment of protein sequences encoded in a given genome is generally much larger than the list of pairs of sibling proteins created by individual duplication events. For example, a family consisting of F paralogous proteins contributes up to F(F - 1)/2 pairs to the all-to-all BLAST output, while not more than F - 1 of these pairs connect the actual siblings to each other. The identification of the most likely candidates for these "true" duplicates is in general a rather complicated task which involves reconstructing the actual phylogenetic tree for every family in a genome. This goes beyond the scope of this study, where we employ a much simpler (yet less precise) Minimum Spanning Tree algorithm to extract a putative non-redundant subset of true duplicated (sibling) pairs.
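The reduction from up to F(F - 1)/2 pairs to F - 1 putative sibling pairs can be sketched with Kruskal's algorithm and a union-find structure. This is a minimal sketch, assuming that the edge weight is 100 - PID, so that the most similar pairs are joined first; the text does not specify the weighting, and the name `sibling_pairs` is hypothetical.

```python
def sibling_pairs(pairs):
    """Return spanning-tree edges of the paralogy graph.

    pairs: iterable of (protein_a, protein_b, pid) tuples.
    Assumption: edges are ranked by distance 100 - pid, i.e. by
    descending pid, so the tree keeps the most similar pairs.
    """
    parent = {}

    def find(x):
        # Union-find root lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for a, b, pid in sorted(pairs, key=lambda p: -p[2]):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:      # edge joins two separate components
            parent[root_a] = root_b
            tree.append((a, b, pid))
    return tree
```

For a single connected family of F proteins this returns exactly F - 1 pairs; for a genome with several families it returns a spanning forest, one tree per family.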
The idea of quantifying evolutionary parameters using the histogram of some measure of sequence similarity of duplicated genes is not in itself new. It was already discussed by Gillespie (see [2] and references therein) and later applied [3] to measure the deletion rate of recent duplicates. There are two important differences between our methods and those of Ref. [3]:
• We use relatively slow changes in amino-acid sequences of proteins as opposed to the much faster silent substitutions of nucleotides used in Ref. [3]. This allows us to dramatically extend the range of evolutionary times amenable to this type of analysis.
• In addition to the PID distribution in the non-redundant set of true duplicated pairs used in Ref. [3], we also study that in the highly redundant set of all paralogous pairs detected by BLAST. It turns out that both of these distributions contain important and often complementary information about the quantitative dynamics of the underlying evolutionary process. The shape of the latter (all-to-all) histogram is to a first approximation independent of duplication and deletion rates, and thus it allows us to concentrate on fine properties of amino-acid substitution.
The central results of our analysis are:
• The middle part of the PID histogram of all paralogous pairs detected by BLAST is well described by a power-law functional form with a nearly universal value of the exponent γ ≈ -4 observed in a broad variety of genomes. Our mathematical model relates this exponent to parameters of intra-protein variability of sequence divergence rates.
• The upper part of the PID histogram corresponding to recently duplicated pairs (PID > 90%) deviates from this power-law form. It is exactly this subset of paralogous pairs that was extensively analyzed in Ref. [3]. This feature is consistent with the picture of frequent deletion of recent duplicates proposed in Ref. [3].
• The analysis of various features of the PID histogram of all paralogous pairs and that of a subset consisting of true duplicated (sibling) pairs allows us to quantify both the long-term average duplication and deletion rates in a given genome as well as a dramatic increase in those rates for recently duplicated genes.
• Abnormally flat PID histograms observed for yeast and human are consistent with the lineages leading to these organisms having undergone one or more Whole-Genome Duplications (WGD). This interpretation is corroborated by the genome of Paramecium tetraurelia, where the PID^{-4} profile of the sequence identity histogram is gradually restored by the successive removal of paralogs generated in its four known WGD events.
• Applying the same methods to large individual families of paralogous proteins allows one to study the variability of evolutionary parameters within a given genome. It is shown that larger or slower evolving families are characterized by higher inter-protein variability of amino-acid substitution rates.
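The power-law exponent mentioned in the results above could be estimated, for instance, by a straight-line fit of log(count) against log(PID) over the middle part of the histogram. The sketch below is one simple way to do this, not necessarily the fitting procedure used in this study; the function name `fit_power_law` and the default fit window of 30% to 90% are assumptions (the text only states that pairs with PID > 90% deviate from the power law).

```python
import math

def fit_power_law(bin_centers, counts, lo=30.0, hi=90.0):
    """Least-squares slope of log(count) versus log(PID).

    Only non-empty bins with lo <= PID <= hi enter the fit.
    The window limits are illustrative assumptions.
    """
    points = [(math.log(x), math.log(n))
              for x, n in zip(bin_centers, counts)
              if lo <= x <= hi and n > 0]
    if len(points) < 2:
        raise ValueError("need at least two non-empty bins in the fit window")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx  # estimated exponent gamma
```

On a histogram that follows N(PID) ∝ PID^γ exactly, the returned slope equals γ; on real data the estimate depends on the binning and on the chosen window, so it is worth checking its stability against both.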