The first crystal structure of a protein to be solved, myoglobin, revealed a compact globular structure with regular α-helical elements linked by short irregular loops (1). Because single domain globular proteins are often, though not always, easy to crystallise, for a long time they dominated perception of typical protein structure (although fibrous proteins like collagen were of course well known). Gradually, as protein sequences have accumulated, the monodomain view of protein structure has been replaced by the realisation that most proteins are multidomain, at least in higher eukaryotes. The current champion in size is the giant muscle protein titin at >38 000 residues encompassing some 320 autonomously folded domains (2). Multidomain architectures are usual for transmembrane receptors, signalling proteins, cytoskeletal proteins, chromatin proteins, transcription factors and so forth. There are now several globular protein domain databases accessible on the web, including Pfam (3), SMART (4), PROSITE (5), INTERPRO (6), PRODOM (7) and BLOCKS (8). Using these tools, a user can often get a good overview of the domain architecture of a polypeptide sequence and the functions these domains are likely to perform.
However, there remain protein sequence segments that are difficult to analyse productively. For example, there are often large segments of multidomain proteins that are non-globular, intrinsically lacking the capability to fold into a defined tertiary structure (9). Sometimes such regions may simply act as linkers connecting globular domains, in which case the sequence of amino acids is not important. The structure of yeast RNA polymerase II (12) illustrates this point. Very often, however, these unstructured regions contain functional sites such as protein interaction sites, cell compartment targeting signals, post-translational modification sites or cleavage sites. These sites are usually short and often reveal themselves in multiple sequence alignments as short patches of conservation, leading to their definition as short sequence motifs. In addition to occurring outside globular domains, some sites, for example phosphorylation sites, are often found in exposed, flexible loops protruding from within globular domains. These short peptide functional sites are analogous to the linear epitopes of immunology. Considering the abundance of targeting signals and post-translational modification sites, it is reasonable to assume that there are more functional sites than globular domains in a higher eukaryotic proteome.
The PROSITE database has collected a number of linear protein motifs, representing them as regular expression patterns (5). PROSITE patterns have been very useful but suffer from severe overprediction problems, and more recently the database has emphasised globular domain annotation at the expense of linear motifs. However, the number of known categories of functional sites has burgeoned dramatically in the last few years and it is clear that there are more to be discovered. One only has to think of the huge current research activity into specific methylation and acetylation of histones and chromatin proteins, which erupted after decades of more indirect analyses (13). There has been a growing gap in the bioinformatics resources available to researchers for dealing with small functional sites. Indeed, it is impossible for a researcher to find a list of currently known motifs, while going through the literature to retrieve them is impractical without foreknowledge in more areas than any one person will have.
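To make the regular expression representation and its overprediction problem concrete, the following minimal sketch converts a PROSITE-style pattern into a Python regular expression and scans a sequence with it. It assumes only the simplest pattern syntax (single residues, `x`, `[..]`, `{..}` and repeat counts); the function names `prosite_to_regex` and `find_motif` and the toy sequence are illustrative assumptions and are not part of PROSITE or ELM. The pattern used, N-{P}-[ST]-{P}, is the PROSITE N-glycosylation pattern.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a simple PROSITE-style pattern to a Python regex.

    Supported elements (an assumed subset of the full syntax):
    a single residue, 'x' (any residue), '[ABC]' (allowed set),
    '{ABC}' (forbidden set), each with an optional (n) or (n,m) repeat.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Separate the element from an optional repeat count, e.g. x(2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        if m is None:
            raise ValueError(f"cannot parse pattern element: {element!r}")
        core, low, high = m.groups()
        if core == "x":
            core = "."                      # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # forbidden residues
        # '[ABC]' and single letters are already valid regex fragments.
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


def find_motif(pattern: str, sequence: str) -> list:
    """Return (0-based start, matched text) for every match, overlaps included."""
    regex = prosite_to_regex(pattern)
    # A lookahead lets the scanner report overlapping matches as well.
    scanner = re.compile(f"(?=({regex}))")
    return [(m.start(1), m.group(1)) for m in scanner.finditer(sequence)]


# Toy sequence, made up for illustration only.
print(find_motif("N-{P}-[ST]-{P}", "MNGTANPSNLSA"))
```

Under the simplifying assumption of uniform amino acid composition, this pattern matches a random position with probability (1/20)(19/20)(2/20)(19/20) ≈ 0.0045, or roughly one chance hit every 220 residues. This is why a pattern match alone is weak evidence for a functional site.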
The Eukaryotic Linear Motif (ELM) consortium has established a project to provide a hitherto missing bioinformatics resource for linear motifs. Our aim is to cover the set of functional sites that can be defined by the local peptide sequence, operating essentially independently of protein tertiary structure. The resource suffers from the overprediction problem inherent to small protein motifs, but we are developing context filters, such as cell compartment, taxonomy and globular domain clash, that can partly reduce the severity of the problem. In this resource, we use the term ELM to denote our bioinformatic representation of a functional site, including the sequence motif and its context. ELM is an ongoing project but already provides a working server with >80 motif patterns and access to basic annotation. This manuscript provides an overview of the current status of the ELM resource and an indication of the future directions we hope to take.
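The idea of a context filter can be sketched as follows. This is a hypothetical illustration and not the ELM server's implementation: the `MotifHit` class, the `filter_hits` function, the motif names and the coordinates are all assumptions made for the example. It covers only two of the filters named above (globular domain clash and cell compartment) and leaves taxonomy out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MotifHit:
    """A candidate motif match (hypothetical record, not an ELM data type)."""
    name: str
    start: int               # 0-based, inclusive
    end: int                 # 0-based, exclusive
    compartments: frozenset  # compartments in which the motif is known to act


def overlaps(hit: MotifHit, domain: tuple) -> bool:
    """True if the hit shares at least one residue with a (start, end) domain."""
    d_start, d_end = domain
    return hit.start < d_end and d_start < hit.end


def filter_hits(hits, domains, protein_compartments):
    """Keep hits that pass the domain clash and compartment filters."""
    kept = []
    for hit in hits:
        # Globular domain clash: discard hits lying inside annotated domains.
        if any(overlaps(hit, d) for d in domains):
            continue
        # Cell compartment: discard hits whose motif acts only elsewhere.
        if hit.compartments and not (hit.compartments & protein_compartments):
            continue
        kept.append(hit)
    return kept


# Made-up example data.
hits = [
    MotifHit("motif_a", 5, 9, frozenset({"nucleus"})),
    MotifHit("motif_b", 40, 44, frozenset({"nucleus"})),
    MotifHit("motif_c", 60, 64, frozenset({"extracellular"})),
]
domains = [(30, 50)]
kept = filter_hits(hits, domains, frozenset({"nucleus", "cytosol"}))
print([h.name for h in kept])
```

A hard domain clash filter of this kind is deliberately crude: some genuine sites, such as phosphorylation sites in exposed loops, do lie within globular domains, so a filter like this can reduce overprediction only at the cost of some false negatives.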