For proteins in all sets, we identified domains using SMART [46], including domains from Pfam-A [47]. We also removed regions showing similarity between members in a set of sequences using BLAST (E ≤ 0.001) [48], which removes redundant measurements. We used TEIRESIAS [9] to detect all non-overlapping motifs of three to eight residues, requiring at least two identical positions. The method essentially detects all motifs of variable length (i.e., three to eight residues) in which each position is either specified as a particular amino acid or represented by a wildcard (i.e., “x”). We did not allow for conservative substitutions (e.g., D/E), and ignored any motif that occurred in fewer than three sequences in the set.
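The pattern representation and filters described above can be illustrated with a minimal Python sketch. It is not TEIRESIAS and does not discover motifs; it only shows, under simplifying assumptions, how a motif with wildcards can be matched against sequences and how the length, fixed-position, and support criteria might be checked. All function names and parameter names are hypothetical.

```python
import re


def motif_to_regex(motif):
    """Compile a motif such as 'PxVPLR'; a lowercase 'x' matches any residue."""
    return re.compile("".join("." if c == "x" else re.escape(c) for c in motif))


def fixed_positions(motif):
    """Number of positions specified as a particular amino acid."""
    return sum(1 for c in motif if c != "x")


def sequences_with_motif(motif, sequences):
    """Number of sequences containing at least one match of the motif."""
    pattern = motif_to_regex(motif)
    return sum(1 for seq in sequences if pattern.search(seq))


def passes_filters(motif, sequences, min_len=3, max_len=8,
                   min_fixed=2, min_support=3):
    """Apply the criteria from the text: three to eight residues, at least
    two specified positions, and presence in at least three sequences."""
    if not (min_len <= len(motif) <= max_len):
        return False
    if fixed_positions(motif) < min_fixed:
        return False
    return sequences_with_motif(motif, sequences) >= min_support
```

Because no conservative substitutions are allowed, each specified position is matched as a literal residue rather than as a character class.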
We assessed the significance of a particular motif occurring a certain number of times within a set of sequences (interaction set) using the binomial distribution:
\[ P = \binom{M}{n} \, p^{n} \, (1 - p)^{M - n} \]
where p is the probability of seeing the motif in a background database, n is how often the motif was seen in the set of proteins, and M is the size of the set.
The probability (p) was computed as the frequency of the motif in a background database of 15,000 randomly selected proteins. These proteins were taken from SWISSPROT [49] and were subjected to the same filtering procedure as the test protein sets.
Values agree well with intuition: Motifs that are complex and thus rare need only be observed a few times to be significant; for example, the motif PxVPLR occurring in four out of 21 proteins gives a probability of 10⁻¹¹. More common motifs must be seen more often to reach the same significance; for example, VxxR (a subset of the first motif) must be seen in 19 out of 21 proteins to reach a similar probability.
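The behaviour described above can be reproduced with a short Python sketch. It assumes that the significance is the binomial point probability of seeing the motif exactly n times in M proteins, and it uses two hypothetical background frequencies (3 hits in 15,000 proteins for a rare motif, and 0.2 for a common one) chosen only for illustration; these are not values reported in the text.

```python
from math import comb


def background_frequency(n_hits, n_background=15000):
    """Fraction of background proteins that contain the motif."""
    return n_hits / n_background


def binomial_probability(n, M, p):
    """Point probability of seeing the motif in exactly n of M proteins."""
    return comb(M, n) * p ** n * (1 - p) ** (M - n)


def min_occurrences_for(threshold, M, p):
    """Smallest n above the expected count M*p whose binomial probability
    is at or below the threshold; None if no such n exists."""
    for n in range(int(M * p) + 1, M + 1):
        if binomial_probability(n, M, p) <= threshold:
            return n
    return None


# Hypothetical background frequencies, for illustration only.
rare = background_frequency(3)   # 3 / 15000 = 0.0002
common = 0.2
```

With these assumed frequencies and M = 21, `min_occurrences_for(1e-11, 21, rare)` and `min_occurrences_for(1e-11, 21, common)` differ widely, which is the effect the paragraph describes: a rare motif reaches a given probability after a handful of occurrences, whereas a common one must be present in nearly every protein of the set.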
True instances of linear motifs are typically conserved across closely related species [42]. It is thus an advantage to use the information from the same (i.e., orthologous) protein in multiple genomes. Information from orthologues can be readily combined into a single value (Scons), which is the product of all binomial probabilities from the genomes considered:
\[ S_{\mathrm{cons}} = \prod_{g} P_{g} \]
where P_g is the binomial probability of the motif in genome g.
This procedure will decrease the final value (and thus increase the significance) for all conserved motifs, but will have no effect if the motifs (or indeed the orthologues) are missing. The combined value is no longer a true probability, because the motifs from related species are not independent; rather, it is a measure of the likelihood that a conserved motif occurs at random in the set. To estimate significance we thus compare the values to those generated from random sets of proteins. These combined values greatly improve the sensitivity and specificity of the procedure: More known motifs are recovered and fewer clearly false predictions are made.
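A minimal Python sketch of this combination is given below. It assumes that a genome in which the motif or the orthologue is missing simply contributes a factor of 1 (which is one way to obtain "no effect"), and it works in log10 space to avoid floating-point underflow when many small probabilities are multiplied. The function name and the use of `None` for missing genomes are hypothetical conventions.

```python
import math


def log10_scons(per_genome_probabilities):
    """Return log10 of the product of per-genome binomial probabilities.

    Entries that are None (motif or orthologue missing in that genome)
    are skipped, i.e. they contribute a factor of 1 to the product.
    """
    total = 0.0
    for p in per_genome_probabilities:
        if p is None:
            continue
        if not 0.0 < p <= 1.0:
            raise ValueError("probabilities must lie in (0, 1]")
        total += math.log10(p)
    return total
```

Under this convention, a motif with probability 10⁻¹¹ in the reference genome and 10⁻⁵ in one related genome gives a combined log10 value near −16, while the same motif with no conserved orthologues stays near −11.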
To get confidence thresholds for Scons, we created 50 random sets of sequences of the same number and length as seen in the interaction sets for each organism using the complete proteomes. We then ran the complete procedure for each random set and computed the distribution of Scons, which gave thresholds (p-value < 0.001) for each dataset: 3.0 × 10⁻¹⁷ for yeast, 7.5 × 10⁻¹⁴ for nematode, 8.0 × 10⁻¹⁵ for fly, and 7.0 × 10⁻³⁸ for human. The differences between the thresholds are due largely to differences in the number and similarity of closely related species with complete genomes available: Four substantially similar genomes were available for human but only one for the fly and nematode.
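One way such a threshold could be read off the random-set distribution is sketched below in Python. This is an assumption about the procedure, not a description of the LMD program: it pools the scores of all motifs found in the random sets (here as log10 values) and picks the value below which at most a fraction alpha of the random scores fall. The function names are hypothetical.

```python
import math


def empirical_threshold(random_scores, alpha=0.001):
    """Return a cutoff t such that at most a fraction alpha of the scores
    obtained from random sets are strictly smaller than t."""
    if not random_scores:
        raise ValueError("need at least one score from the random sets")
    ordered = sorted(random_scores)
    # Index of the first score allowed to sit at the cutoff itself.
    k = min(int(math.floor(alpha * len(ordered))), len(ordered) - 1)
    return ordered[k]


def is_significant(score, threshold):
    """A motif is called significant if its score is below the cutoff."""
    return score < threshold
```

Because smaller values indicate higher significance, the cutoff is taken from the lower tail of the random distribution; a dataset with more conserved genomes available produces smaller random values and hence a stricter cutoff.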
We extracted orthologues from the STRING database [50] and aligned them using MUSCLE [51] with default parameters. We considered only closely related species because known instances of linear motifs are rarely conserved outside of them. We considered orthologues in the four other completely sequenced yeast genomes (Kluyveromyces lactis, Ashbya gossypii, Debaryomyces hansenii, and Candida glabrata) for yeast (S. cerevisiae) motifs, D. pseudoobscura for fly (D. melanogaster), C. briggsae for nematode (C. elegans), and Mus musculus, Rattus norvegicus, Gallus gallus, and Fugu rubripes for motifs found in human (H. sapiens).
The Linear Motif Discovery (LMD) program and all data related to this paper are available online (http://lmd.embl.de).