For researchers who want to analyze their target protein sequences for a potential PTS1 signal, a putative GPI lipid anchor attachment site or a myristoylation site, this text provides application and output interpretation guidelines for the WWW servers big-Π, NMT and PTS1. The methodology behind those tools and their validation are described in great detail elsewhere (Table ) except for the new big-Π plant predictor (B. Eisenhaber, M. Wildpaner, C.J. Schultz, G.H.H. Borner, P. Dupree and F. Eisenhaber, manuscript submitted). In the following, we summarize aspects that are important from the user's point of view.
Big-Π, NMT, PTS1: web URL, taxonomic range and prediction accuracy
A number of sequence motifs at the termini of proteins encode signals for targeting to cellular compartments and for posttranslational modifications. The N-terminal signal peptide responsible for export into the ER is the best known; the mitochondrial and chloroplast signals are also N-terminally located. In contrast, the peroxisomal targeting signal PTS1 is C-terminal. Many posttranslational modifications are attached N-terminally (N-myristoylation) or C-terminally (GPI lipid anchors, farnesylation, geranylgeranylation), to name just a few (1).
Despite the functional importance of these sequence signals, theoretical methods for their prediction from the sequence of query proteins have received less general attention than those for studying globular domains. With the concept of homology, that is, the assumption that a family of similar sequences originated from a common ancestor in an evolutionary process involving gene duplications and mutations, function can be assigned to globular domains (having a typical length of 100–150 amino acids) by annotation transfer from experimentally studied sequence family members (2). Unfortunately, the signals for subcellular targeting and posttranslational modification are located in relatively short (<40 amino acids), non-globular regions with typical amino acid compositional bias and interpositional correlations. Therefore, the measures for quantifying remote sequence similarity cannot be directly applied for family classification of these signals.
Even in the absence of knowledge of the active complex responsible for translocation or modification of the substrate protein, the sequence requirements for productive binding with the active protein complex can be derived from the sequence variability of experimentally verified substrate proteins. If the learning set is large, procedures of unsupervised, automated learning successfully extract complex sequence patterns [for example, in the case of SIGNALP (3), the current standard for signal peptide prediction]. The same methodology is considerably less powerful, especially for rejecting false-positive predictions, if the learning set is an order of magnitude smaller and less reliable, as is the case for the mitochondrial or chloroplast targeting signals (4).
If the sequence motif in the substrate protein is considered from the viewpoint of productive binding with the active complex, simple physical conditions for the rejection of non-permissive query sequences can be formulated (6). Typically, a core of the sequence motif with several positions of amino acid type conservation is necessary for binding in the active site of the modifying enzyme or the recognition site of the translocator. Conformational flexibility in the motif region is required to adapt to the catalytic cleft. The sequence environment of the core has to provide accessibility of the sequence signal, mechanical linkage to the remainder of the substrate protein and appropriate interaction with the aqueous or membrane surroundings of the active complex. A combined score function with profile terms (for evaluating amino acid type preferences) and physical property terms (with only non-positive scores for rejecting unsuitable queries) can successfully discriminate queries even in cases of single-residue mutations that affect modification efficiency (1