Linear motifs (LM) are short amino acid sequences (3–10 residues) involved in numerous interactions, including the modification-based regulation of protein function [1]. In particular, LM allow the formation of dynamic modular protein complexes due to the transient and low-energy nature of the interactions they mediate [2]. Furthermore, LM are involved in targeting proteins to the appropriate cellular compartment [3]. Therefore, even if LM alone do not determine the complete molecular function of a protein, they give valuable information about the protein's role and/or position in the cellular function networks [4,5]. The experimental discovery of LM is time-consuming and laborious; hence, considerable research interest has recently focused on computational predictive approaches.
LM prediction focuses either on the discovery of new LM patterns or on finding new instances of already known patterns. From the algorithmic point of view, these two approaches represent different challenges: the identification of significantly over-represented sequence patterns in the former, and the distinction between true and false occurrences of a given pattern in the latter. The short length of LM creates difficulties in both cases. The significance assessment of new patterns against the background probability distribution of LM is not straightforward due to their shortness. For the same reason, prediction of new LM instances by pattern matching is prone to produce a high proportion of false positives.
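The false positive problem can be made concrete with a small calculation. The following is a minimal sketch, assuming a purely illustrative four-residue pattern (`[ST]P.[KR]`) and uniform amino acid frequencies; the function names and the uniform background are assumptions made for illustration and are not taken from any of the resources cited here.

```python
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Illustrative pattern only: each element lists the residues allowed at one position.
PATTERN_POSITIONS = ["ST", "P", AMINO_ACIDS, "KR"]


def to_regex(positions):
    """Build a regular expression string from per-position residue sets."""
    parts = []
    for allowed in positions:
        if len(allowed) == len(AMINO_ACIDS):
            parts.append(".")            # any residue
        elif len(allowed) == 1:
            parts.append(allowed)        # fixed residue
        else:
            parts.append("[" + allowed + "]")
    return "".join(parts)


def find_matches(positions, sequence):
    """Return 0-based start offsets of all matches, including overlapping ones."""
    lookahead = re.compile("(?=" + to_regex(positions) + ")")
    return [m.start() for m in lookahead.finditer(sequence)]


def expected_chance_matches(positions, sequence_length):
    """Expected number of matches in a random sequence.

    Assumes independent positions and uniform residue frequencies,
    which is a simplification of any real background distribution.
    """
    p = 1.0
    for allowed in positions:
        p *= len(set(allowed)) / len(AMINO_ACIDS)
    windows = max(sequence_length - len(positions) + 1, 0)
    return p * windows
```

Under these assumptions, a random 500-residue sequence is expected to contain about 0.25 matches to this pattern, so scanning a few thousand proteins would yield hundreds of hits by chance alone. More degenerate patterns give correspondingly more.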
Methods for LM prediction take into account the biological context of those short sequences to evaluate the reliability of the newly predicted patterns or instances. Simple keyword association may sometimes be used to find a significant connection between motifs and a specific function. That is the case for the EH1 motif, which occurs mainly in proteins containing domains with a transcription factor function [6]. The use of protein interactions has proven to be a fruitful approach to discover new LM patterns [7-11]. Currently DILIMOT [7], SLiMDisc [8] and, more recently, SLiMFinder [9] are the main tools for de novo LM discovery. The first one finds over-represented motifs in sets of proteins with a common functional attribute. The other two look for convergently evolved LM using evolutionary information derived from unrelated proteins that share a functional characteristic.
Resources for finding new instances of known motifs have begun to proliferate. Prosite is a large database of protein functional signatures. It initially included LM represented as regular expressions [12,13]. Currently, it is mainly devoted to domain profiles [14]. Scansite is a profile-based search engine that predicts LM instances using the amino acid frequency information gathered from experimentally determined sites [15]. The ELM resource uses manually curated information about known eukaryotic LM to predict new instances, filtering out false positive matches with information about the structure, cellular compartment and species of the submitted sequence [16]. A similar approach has subsequently been implemented in other resources such as the Minimotif Miner [17].
The use of evolutionary conservation has proved useful in the field of LM prediction. It improves the identification of truly functional instances of already known motifs [17-19] and allows the strength of a new LM pattern to be assessed [7,8]. The main assumption of this "phylogenetic footprinting" is that instances are conserved when they have a functional value and therefore conserved instances are less likely to be false positive occurrences of a motif.
When examining the conservation of LM, several specific problems arise. In contrast to domains, which can be predicted using hidden Markov models [20], LM cannot be easily detected from a set of homologous sequences, since their conservation signal is not significant due to their short length. That is why, for motif prediction, it is not enough to find a pattern inside a multiple alignment; it is also crucial to consider the evolutionary relationships among the set of sequences. Moreover, LM tend to localise in structurally disordered segments of proteins, which are difficult to align [21]. This implies that the accuracy of the conservation scoring scheme also depends on the quality of the alignment.
An additional difficulty is the fact that LM have a non-linear pattern of conservation [22]. They are far more ephemeral than globular domains and their signature can appear or disappear as a result of single mutations. Ancestrality is not always necessary. This means that LM can appear de novo during protein sequence evolution because, in contrast to globular domains, they do not have to fulfil structural stability constraints. LM losses are also possible in closely related sequences, e.g. alternatively spliced forms.
Repetitive LM involved in the interaction with modular proteins, e.g. the DPW epsin motifs that mediate interaction with the adaptor protein AP-2, tend to be present in a variable number of copies in homologous proteins.
Finally, it is important to keep in mind that not all the amino acids forming an LM are equally informative. There are key positions, like the S/T/Y in a phosphorylation motif, that result in the loss of function if changed. Other positions accept more than one amino acid of the same physico-chemical group (e.g. acidic, hydrophobic), while some positions can be occupied by any amino acid. These differences have an impact on the definition of LM conservation.
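The difference between key, restricted and unconstrained positions can be read directly from a motif's regular expression. The sketch below, which assumes an illustrative pattern (`[ST]P.[KR]`) and a hypothetical helper name, substitutes every amino acid at each position of a matching instance and records which substitutions still satisfy the pattern.

```python
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Illustrative pattern only, not a curated motif definition.
PATTERN = re.compile(r"[ST]P.[KR]")


def tolerated_substitutions(pattern, instance):
    """For each position of a matching instance, return the residues
    that can be placed there while the whole instance still matches."""
    tolerated = []
    for i in range(len(instance)):
        allowed = [
            aa
            for aa in AMINO_ACIDS
            if pattern.fullmatch(instance[:i] + aa + instance[i + 1:])
        ]
        tolerated.append("".join(allowed))
    return tolerated
```

For the instance `SPAK`, the first position tolerates only `S` or `T`, the second only `P`, the third any of the twenty residues, and the fourth only `K` or `R`. A mutation at the third position therefore says nothing about loss of the motif, whereas a mutation at the second position always destroys the match.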
This article presents a new scoring scheme that uses information about the degree of conservation to determine the reliability of a motif match or instance. The method has been developed within the context of the ELM resource [16] in order to improve its predictive power without excessively degrading real-time server performance for users. It is a three-stage algorithm that efficiently distinguishes between true and randomly generated instances, keeping both the false negative and false positive rates low. First, a set of homologous sequences is defined. This set is then used to reconstruct the evolution of the predicted instance in terms of the conservation of the corresponding regular expression. Subsequently, weights are assigned to the observed evolutionary events. The final conservation score (CS) is computed using all the gathered information.
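To give a feel for the overall shape of such a pipeline, the following toy sketch scores an instance as the weighted fraction of homologues in which the regular expression still matches. It is not the scoring scheme of this article: the function names, the per-homologue weights and the weighted-fraction formula are all assumptions chosen for illustration, and the actual weighting of evolutionary events is defined later in the text.

```python
import re


def motif_conserved(pattern, aligned_segment):
    """Check whether the pattern matches the homologous segment.

    Alignment gaps ('-') are simply removed, which is a simplification:
    a gap inside the motif is then not treated as a loss.
    """
    ungapped = aligned_segment.replace("-", "")
    return pattern.search(ungapped) is not None


def toy_conservation_score(pattern, homolog_segments, weights):
    """Weighted fraction of homologues that retain the motif.

    homolog_segments: dict mapping homologue name to the aligned segment
                      around the predicted instance (hypothetical input).
    weights:          dict mapping homologue name to a non-negative weight
                      (hypothetical; e.g. reflecting evolutionary distance).
    """
    total = sum(weights[name] for name in homolog_segments)
    if total == 0:
        return 0.0
    conserved = sum(
        weights[name]
        for name, segment in homolog_segments.items()
        if motif_conserved(pattern, segment)
    )
    return conserved / total
```

With the illustrative pattern `[ST]P.[KR]`, segments `ASPAKL`, `ATP-LRL` and `AAPAKL`, and weights 1.0, 0.5 and 0.5, the first two segments retain the motif and the toy score is 0.75.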