Many eukaryotic proteins have highly modular architectures. Multidomain proteins are common among transmembrane receptors, signalling proteins, cytoskeletal proteins, chromatin proteins, transcription factors and so forth. As a consequence, many programs have been developed for the detection and alignment of protein domains. Online resources can now provide a good overview of the globular domain architecture of a polypeptide sequence and the functions these domains are likely to perform, e.g. Pfam [1], SMART [2], InterPro [3]. However, less research has been directed towards the analysis of the large segments of multidomain proteins that are non-globular, intrinsically lacking the capability to fold into a defined tertiary structure [4]. Sometimes such regions may simply act as linkers connecting globular domains and, in this case, the sequence of amino acids is not critical to function. Very often, however, these unstructured regions contain important functional sites such as protein interaction sites, cell compartment targeting signals, post-translational modification sites or cleavage sites. Large parts of many proteins, such as the insulin receptor substrates, or sometimes even the entire protein, such as the Alzheimer's protein Tau [6], are natively unstructured. The functional sites within these unstructured regions can often be defined as short, linear motifs (LMs), linear in the sense that only the local peptide sequence is relevant to function. To avoid confusion, in this paper we use the term 'sequence' to refer to the full-length protein, while a specific region of a protein sequence is referred to as a 'segment' or a 'motif'. The Eukaryotic Linear Motif resource (ELM) has entries describing ~130 varieties of linear motif [7], but it is not fully comprehensive with respect to current literature and it has been estimated that hundreds more have yet to be discovered [8]. When eubacterial, archaebacterial and viral motifs are also considered, the true number of unknown functionally important LMs is likely to be huge. Given the fundamental roles these motifs play in cell regulation and signalling, identifying them will be of crucial importance in many biological disciplines.
Until recently, and in stark contrast to protein domain discovery, the bioinformatics field has had a negligible impact on LM discovery: motif discovery is generally performed by low-throughput experimental delineation of protein interaction segments. The central problem confounding computational methods has been the lack of significance of motif matches when searching sequence databases, which makes it impossible to confidently identify all the motifs present in a given protein sequence using simple sequence analysis tools. The majority of LMs are between 3 and 10 amino acids in length and most have one or more ambiguous (variable) or wildcard (totally variable) positions. Their short and degenerate nature makes real LMs difficult to distinguish from the background distribution of randomly occurring false positive matches. Nevertheless, efforts are now underway to develop bioinformatics tools that will contribute to solving the linear motif discovery problem. As a first step, it is necessary to catalogue linear motifs and particular instances that are known to be functional. Such data collections include the Eukaryotic Linear Motif (ELM) resource [7] and ScanSite [9]. A number of tools have been developed, e.g. ELM [7], QuasiMotiFinder [10], MiniMotif [11] and the ACS method [12], that employ techniques such as domain masking and evolutionary filtering to discover new occurrences of previously known motifs. Other methods, such as the LMD method [8] (implemented in the web server DILIMOT [13]), SLiMDisc [14], SLiMFinder [15] and Miner [16], explicitly attempt novel LM discovery using large-scale interaction datasets and/or motif conservation.
One of the major limitations in predicting short linear motifs is the evaluation of the many potential motifs found in each protein, to distinguish between true functional sites and incorrect occurrences of a given pattern. In the worst case, there are motifs which have such low support and low information content as to be almost indistinguishable from random noise in most datasets, e.g. the PCSK cleavage site K/RR [17], which plays a role in proteolytic processing of neuropeptide and peptide hormone precursors, or the peroxisomal targeting motif WxxxY/F (where x represents any arbitrary amino acid) [18]. It is vitally important, therefore, to develop novel scoring methods or to consider other information, such as contextual information (e.g. loop region, N/C-terminus or cellular localisation, if such data are available) or evolutionary information, since motifs conserved during evolution are more likely to be functional. Conservation has been shown to be an essential factor in the prediction of functional motifs. For example, many motif discovery systems, such as LMD, QuasiMotiFinder, Miner, MiniMotif and the ACS method, use a combination of traditional motif scores and evolutionary conservation to rank potential motifs. It is worth pointing out, though, that while LMD explicitly utilises conservation, the method used is alignment-free and, as such, would not be affected by the developments described in this article. SLiMFinder and SLiMDisc make use of automated multiple alignments and conservation scores to help visualise and interpret results. We have also recently developed a rapid automated conservation scoring pipeline suitable for real-time operation in the ELM resource [19].
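The low information content of the two patterns mentioned above can be illustrated with a minimal Python sketch. It writes K/RR and WxxxY/F as regular expressions, scans a sequence for matches, and estimates how many matches would be expected by chance. The regular expressions, the example peptide and the assumption of uniform, independent amino acid frequencies (1/20 per residue) are illustrative choices made here and are not taken from any of the cited methods.

```python
import re
from math import prod

# Hypothetical encoding of the two motifs: (regular expression,
# number of allowed residues at each position; 20 means wildcard).
MOTIFS = {
    "PCSK cleavage site (K/RR)": ("[KR]R", [2, 1]),
    "Peroxisomal targeting (WxxxY/F)": ("W...[YF]", [1, 20, 20, 20, 2]),
}


def find_matches(pattern, sequence):
    """Return 0-based start positions of all matches, including overlapping ones."""
    # A lookahead is used so that overlapping occurrences are all reported.
    return [m.start() for m in re.finditer(f"(?=({pattern}))", sequence)]


def expected_chance_matches(allowed_per_position, seq_len, alphabet=20):
    """Expected number of random matches in a sequence of length seq_len.

    Assumes (unrealistically) uniform and independent residue frequencies.
    """
    k = len(allowed_per_position)
    if seq_len < k:
        return 0.0
    p_match = prod(n / alphabet for n in allowed_per_position)
    return (seq_len - k + 1) * p_match


if __name__ == "__main__":
    peptide = "MKRRAWSTLYKR"  # made-up example sequence
    for name, (regex, allowed) in MOTIFS.items():
        hits = find_matches(regex, peptide)
        chance = expected_chance_matches(allowed, 500)
        print(f"{name}: hits at {hits}, ~{chance:.2f} chance matches per 500 residues")
```

Under these simplifying assumptions, each pattern would be expected to match about 2.5 times by chance in a 500-residue protein, which is why a pattern match alone carries little evidence of function.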
It follows that, in order to exploit evolutionary information optimally, we need to construct multiple sequence alignments of the highest quality. LMs that occur in several different phyla should appear as short patches of conservation in such an alignment. However, a large majority of LMs are found in natively disordered regions [20] that are difficult to align using classical multiple sequence alignment programs, which are better adapted to protein domain alignments. The biological relevance of the alignments produced by these programs is usually assessed by systematic comparison with established benchmark sets, e.g. BAliBASE [21], Prefab [22] or Sabmark [23], based on 3D structure superpositions of globular domains. The introduction of these objective benchmarks has had a considerable effect on the evolution of alignment algorithms and has led to a significant improvement in overall multiple alignment quality [24]. However, there is also a risk that alignment software optimised on structure superpositions has been overfitted to globular domains and may not adequately account for awkward features of full-length protein sequences, such as N- and C-terminal extensions and motif-rich non-globular sequence segments. Therefore, to evaluate the ability of multiple alignment methods to identify and align LMs, new test sets are now needed. Benchmarks have already been developed for motif discovery in genomic DNA sequences, such as transcription factor binding sites, e.g. [25], but these benchmarks are not generally organised into evolutionarily related sets that might be used to evaluate multiple sequence alignment programs. Another reference database, IRMBASE [26], consists of simulated conserved motifs implanted into non-related artificial protein sequences. However, this benchmark does not reflect the problems associated with identifying and aligning the short linear motifs that are essential for the function of real multimodular proteins.
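The idea that a correctly aligned LM should appear as a short patch of conservation can be sketched as follows. This Python example scores each alignment column by the fraction of sequences sharing the most frequent residue and reports runs of consecutive high-scoring columns. The scoring scheme, the threshold, the minimum patch length and the toy alignment are all assumptions made for illustration; they are not the conservation scores used by the tools cited in this paper.

```python
from collections import Counter


def column_conservation(alignment, gap="-"):
    """Per-column fraction of sequences sharing the most common residue.

    Gaps are not counted as residues, so gapped columns score lower.
    """
    n_seqs = len(alignment)
    scores = []
    for column in zip(*alignment):
        residues = [c for c in column if c != gap]
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        scores.append(top / n_seqs)
    return scores


def conserved_patches(scores, threshold=1.0, min_len=2):
    """Return (start, end) column ranges, end exclusive, of conserved runs.

    Assumes threshold > 0 so that the trailing sentinel closes the last run.
    """
    patches, start = [], None
    for i, score in enumerate(scores + [0.0]):
        if score >= threshold and start is None:
            start = i
        elif score < threshold and start is not None:
            if i - start >= min_len:
                patches.append((start, i))
            start = None
    return patches


if __name__ == "__main__":
    # Toy alignment: only the two arginine columns are fully conserved.
    toy_alignment = ["MK-RRAL", "TSGRRPQ", "-EDRRGH"]
    print(conserved_patches(column_conservation(toy_alignment)))
```

In this toy case the only conserved patch spans columns 3 and 4. If an alignment program shifted the motif in one sequence by a single column, the patch would disappear, which is the failure mode that motivates a motif-oriented benchmark.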
The main objective of the work presented here is to provide a standard way to assess the ability of a multiple alignment program to correctly align the linear motifs occurring in a set of related sequences. If the multiple alignment is to be used in a subsequent motif discovery system, however, it is important that (i) the sequences containing the motif are accurately aligned and (ii) the sequences that do not contain the motif are not aligned in the corresponding region. To address these issues, we have developed a new Reference Set that has been incorporated in the BAliBASE benchmark suite [21]. The benchmark includes example multiple alignments for most of the motifs annotated in the ELM resource [7]. For each LM, a representative set of homologous sequences has been selected and a multiple alignment of the complete sequences (MACS) has been constructed and manually refined. A number of different test subsets are provided, representing typical scenarios and problems that occur when trying to align the motifs in the context of a global multiple alignment.
Using the new BAliBASE Reference Set, we then evaluated the accuracy of the motif alignments obtained from a number of widely used or recently developed multiple alignment programs. The performance of the different programs was assessed by comparing the alignments constructed by each program with the reference alignments. We show that none of the programs currently available is capable of reliably aligning LMs in distantly related sequences and we highlight a number of specific problems. We hope that these results will generate interest in developing new algorithms and will provide program developers with guidelines for future enhancements that improve the quality of motif alignments.
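One simple way to compare a test alignment with a reference for a single annotated motif is sketched below in Python. For every pair of motif-containing sequences, it checks whether the motif residues occupy the same columns in the test alignment and reports the fraction of pairs that agree. This is a simplified, hypothetical sum-of-pairs style measure restricted to the motif; it is not the scoring scheme used in this study, the function and variable names are invented here, and it addresses only the alignment of motif-containing sequences, not the requirement that motif-free sequences remain unaligned in that region.

```python
from itertools import combinations


def residue_columns(aligned_seq, gap="-"):
    """Map each ungapped residue index to its column in the alignment."""
    return [col for col, ch in enumerate(aligned_seq) if ch != gap]


def motif_pair_score(alignment, motif_starts, motif_len):
    """Fraction of sequence pairs whose motif residues share the same columns.

    alignment    : dict of sequence name -> gapped sequence (test alignment)
    motif_starts : dict of sequence name -> 0-based motif start in the
                   ungapped sequence (taken from the reference annotation)
    motif_len    : motif length in residues
    """
    motif_cols = {}
    for name, start in motif_starts.items():
        columns = residue_columns(alignment[name])
        motif_cols[name] = tuple(columns[start:start + motif_len])
    pairs = list(combinations(sorted(motif_cols), 2))
    if not pairs:
        return 0.0
    agreeing = sum(motif_cols[a] == motif_cols[b] for a, b in pairs)
    return agreeing / len(pairs)


if __name__ == "__main__":
    # Made-up test alignment in which sequence "c" has its RR motif misplaced.
    test_alignment = {"a": "MK-RRAL", "b": "MKGRRAL", "c": "RR--MAL"}
    motif_starts = {"a": 2, "b": 3, "c": 0}
    print(motif_pair_score(test_alignment, motif_starts, motif_len=2))
```

In the toy example only one of the three sequence pairs has its motif in matching columns, so the score is 1/3. A pairwise measure of this kind degrades gradually as more sequences are misaligned, whereas an all-or-nothing column score would drop to zero after a single error.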