Many posttranslational modifications (N-myristoylation or glycosylphosphatidylinositol (GPI) lipid anchoring) and localization signals (the peroxisomal targeting signal PTS1) are encoded in short, partly compositionally biased regions at the N- or C-terminus of the protein sequence. These sequence signals are not well defined in terms of amino acid type preferences but they have significant interpositional correlations. Although the number of verified protein examples is small, the quantification of several physical conditions necessary for productive protein binding with the enzyme complexes executing the respective transformations can lead to predictors that recognize the signals from the amino acid sequence of queries alone. Taxon-specific prediction functions are required due to the divergent evolution of the active complexes. The big-Π tool for the prediction of the C-terminal signal for GPI lipid anchor attachment is available for metazoan, protozoan and plant sequences. The myristoyl transferase (NMT) predictor recognizes glycine N-myristoylation sites (at the N-terminus and for fragments after processing) of higher eukaryotes (including their viruses) and fungi. The PTS1 signal predictor finds proteins with a C-terminus appropriate for peroxisomal import (for metazoa and fungi). Guidelines for application of the three WWW-based predictors (http://mendel.imp.univie.ac.at/) and for the interpretation of their output are described.
For researchers who want to analyze the occurrence of a potential PTS1 signal, of a putative GPI lipid anchor attachment or myristoylation sites in their target protein sequences, this text provides application and output interpretation guidelines for the WWW-servers big-Π, NMT and PTS1. The methodology behind those tools and their validation are described in great detail elsewhere (Table 1) except for the new big-Π plant predictor (B. Eisenhaber, M. Wildpaner, C.J. Schultz, G.H.H. Borner, P. Dupree and F. Eisenhaber, manuscript submitted). In the following, we summarize aspects that are important from the user's point of view.
A number of sequence motifs at the termini of proteins encode signals for targeting to cellular compartments and for posttranslational modifications. The N-terminal signal peptide responsible for export into the ER is the best known; the mitochondrial and the chloroplast signals are also N-terminally located. In contrast, the peroxisomal targeting signal PTS1 is C-terminal. Many posttranslational modifications are attached N-terminally (N-myristoylation) or C-terminally (GPI lipid anchors, farnesylation, geranylgeranylation), to name just a few (1).
Despite the functional importance of these sequence signals, the theoretical methods for their prediction from the sequence of query proteins have received less general attention than those for studying globular domains. With the concept of homology (the assumption of a common ancestor giving rise to a family of similar sequences in an evolutionary process involving gene duplications and mutations), function can be assigned to globular domains (having a typical length of 100–150 amino acids) by annotation transfer from experimentally studied sequence family members (2). Unfortunately, the signals for subcellular targeting and posttranslational modification are located in relatively short (<40 amino acids), non-globular regions with typical amino acid compositional bias and interpositional correlations. Therefore, the measures for quantifying remote sequence similarity cannot be directly applied for family classification of these signals.
Even in the absence of knowledge of the active complex responsible for translocation or modification of the substrate protein, the sequence requirements for productive binding with the active protein complex can be derived from the variability of sequences of experimentally verified substrate proteins. If the learning set is large, procedures of unsupervised, automated learning successfully extract complex sequence patterns [for example, in the case of SIGNALP (3), the current standard for signal peptide prediction]. The same methodology is considerably less powerful if the learning set is an order of magnitude smaller and less reliable, as for the mitochondrial or chloroplast targeting signals (4,5), especially for rejecting false-positive predictions.
If the sequence motif in the substrate protein is considered from the view point of productive binding with the active complex, simple physical conditions for the rejection of non-permissive query sequences can be formulated (6,7). Typically, a core of the sequence motif with several positions of amino acid type conservation is necessary for binding in the active site of the modifying enzyme or the recognition site of the translocator. Conformational flexibility in the motif region is required to adapt to the catalytic cleft. The sequence environment of the core has to provide accessibility of the sequence signal, mechanical linkage to the remainder of the substrate protein and appropriate interaction with the aqueous or membrane surrounding of the active complex. A combined score function with profile terms (for evaluating amino acid type preferences) and physical property terms (with only non-positive scores for rejecting unsuitable queries) can successfully discriminate queries even in cases of single-residue mutations that affect modification efficiency (1,8,9).
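The structure of such a combined score can be illustrated with a minimal Python sketch. It is not the implementation used by the servers: the profile layout, the default score for unlisted residues and the penalty callables are all hypothetical, chosen only to show how a profile term is added to physical property terms that are clamped to non-positive values.

```python
from typing import Callable, Mapping, Sequence

# Hypothetical profile: one residue -> score table per motif position.
Profile = Sequence[Mapping[str, float]]
# Hypothetical physical property term: maps a segment to a raw score.
Penalty = Callable[[str], float]


def profile_score(segment: str, profile: Profile, default: float = -1.0) -> float:
    """Sum position-specific amino acid type preferences.

    Residues missing from a position's table receive `default`
    (an assumed value, not taken from the original work).
    """
    return sum(column.get(res, default) for res, column in zip(segment, profile))


def physical_score(segment: str, penalties: Sequence[Penalty]) -> float:
    """Sum physical property terms, each clamped so it can only reject."""
    return sum(min(0.0, penalty(segment)) for penalty in penalties)


def combined_score(segment: str, profile: Profile, penalties: Sequence[Penalty]) -> float:
    """Profile term plus non-positive physical property terms."""
    return profile_score(segment, profile) + physical_score(segment, penalties)
```

Because every physical property term is clamped at zero, a query can never gain score from these terms; they only lower the total for segments that violate a physical condition.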
Posttranslational modification with a GPI lipid anchor consists of two reactions executed by the transamidase complex in the endoplasmic reticulum, the attachment of the GPI moiety to the carboxyl terminus (ω-site) of the polypeptide after proteolytic cleavage of a C-terminal propeptide. Typically, a GPI lipid anchored protein is finally moved to the extracellular side of the cytomembrane via vesicular transport. The classical sequence pattern consists of four regions defined by the preferred pattern of physical properties of amino acid side chains (6,10). (i) The region ω−11…ω−1 is a flexible, polar linker. This stretch has been hypothesized to occupy a channel in the transamidase complex. In the structural model of the transamidase (11), access to the active site cleft of the cysteine protease PIG-K/gpi8 is regulated by the endoplasmic lumenal domain of PIG-T, a β-propeller structure with a central hole. (ii) The region ω−1…ω+2 has volume constraints and is occupied preferentially by small residues. (iii) The spacer region ω+3…ω+9 is composed of moderately polar residues. (iv) The typical hydrophobic tail begins with ω+9 or ω+10 and extends up to the C-terminal end.
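To make the position bookkeeping of the four regions concrete, the following Python sketch cuts a sequence into the regions described above for a given putative ω-site. The function name and dictionary keys are hypothetical, and the sketch assumes that the hydrophobic tail starts at ω+10 (the text allows ω+9 or ω+10). It only slices the sequence; it does not evaluate any physical property.

```python
def gpi_regions(sequence: str, omega: int) -> dict[str, str]:
    """Split `sequence` into the four GPI motif regions.

    `omega` is the 1-based position of the putative omega-site residue.
    Regions (i) and (ii) share position omega-1, as in the motif description.
    """
    i = omega - 1  # 0-based index of the omega-site
    if not 0 <= i < len(sequence):
        raise ValueError("omega must be a position within the sequence")
    return {
        "linker": sequence[max(0, i - 11):i],     # omega-11 .. omega-1
        "small": sequence[max(0, i - 1):i + 3],   # omega-1 .. omega+2
        "spacer": sequence[i + 3:i + 10],         # omega+3 .. omega+9
        "tail": sequence[i + 10:],                # omega+10 .. C-terminus (assumed start)
    }
```

For a candidate site too close to either terminus, some regions come back shorter than the motif requires or empty, which is itself a hint that the site is not plausible.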
The big-Π tool (Table 1) evaluates the concordance of a query with this sequence motif. In the output, the primary and, if available, the secondary ω-sites are reported. Together with their sequence position, the prediction quality [strong prediction or twilight zone (8)], the score and the probability of false-positive prediction are presented. For sequences without a GPI lipid anchor motif, the best-scoring site is nevertheless listed. In either case, a detailed description of score components is shown that allows the user to evaluate the agreement with the amino acid type profile and with the physical property pattern and, especially, to analyze reasons for negative predictions. Therefore, the big-Π predictor is well suited for designing mutations aimed at abolishing GPI lipid anchoring capacity. For example, modified query sequences where the putative site is substituted by a residue with a large side chain or with a more immobile backbone can be tested prior to the experiment.
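Modified queries of this kind can be prepared with a few lines of Python before they are submitted to the server. The sketch below is only a convenience helper under stated assumptions: the function name is hypothetical, and the default replacement set (W, F, Y as residues with large side chains, P as a residue with a rigid backbone) is our illustrative choice, not a recommendation taken from the predictor.

```python
def omega_site_mutants(sequence: str, omega: int, replacements: str = "WFYP") -> dict[str, str]:
    """Return point mutants of the putative omega-site.

    `omega` is the 1-based position of the site. Keys use the usual
    mutation notation (for example S4W); values are the mutated sequences.
    """
    i = omega - 1  # 0-based index of the omega-site
    if not 0 <= i < len(sequence):
        raise ValueError("omega must be a position within the sequence")
    original = sequence[i]
    return {
        f"{original}{omega}{new}": sequence[:i] + new + sequence[i + 1:]
        for new in replacements
        if new != original  # skip silent substitutions
    }
```

Each returned sequence can then be scored separately, and the score components compared with those of the unmodified query.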
A positive prediction by big-Π does not necessarily mean capacity for GPI lipid anchoring in vivo. Big-Π assesses only the concordance of the C-terminus with the GPI lipid anchor modification motif. In the evaluation of the prediction outcome, the issue of ER export signal should receive special independent attention. One can routinely check for signal leaders (3) but alternative export signals [see, for example (12)] should also be taken into account.
Further, the function parameterization relies on the small set of known GPI lipid anchor modified proteins. Thus, a large negative physical property term (‘profile independent score’) can be considered a sure sign of the absence of the GPI anchor motif because only a handful of very stably derived parameters enter this term (8). In contrast, a small profile score can also be a result of the still limited learning set with biased amino acid type preferences and, consequently, an insufficiently general profile matrix.
N-terminal N-myristoylation is the attachment of a myristoyl anchor to an N-terminal glycine by a myristoyltransferase (NMT) for modulation of interaction of the modified protein with intracellular membranes or with other proteins. At least the N-terminal 17 residues of the substrate protein experience amino acid type variability restrictions for N-myristoylation (7). Positions 1–6 with glycine in the leading position fit the binding pocket of the NMT, positions 7–10 interact non-specifically with the NMT's surface at the mouth of the catalytic cavity, and positions 11–17 form a hydrophilic linker. Thus, in addition to the segment physically interacting with the NMT, 10–11 more residues in a linker region experience weaker sequence variability restrictions and contribute to the recognition motif.
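The three segments of the 17-residue motif can be extracted with a short Python sketch. The function name and dictionary keys are hypothetical. The sketch assumes that the caller passes the N-terminus as it is presented to the NMT, with the glycine at position 1; removal of any preceding residues is left to the caller.

```python
def nmt_regions(nterm: str) -> dict[str, str]:
    """Split an N-terminus into the three N-myristoylation motif segments.

    Position 1 (index 0) must be the glycine to be myristoylated, and at
    least 17 residues are needed to cover the whole motif.
    """
    if len(nterm) < 17:
        raise ValueError("at least 17 N-terminal residues are required")
    if nterm[0] != "G":
        raise ValueError("position 1 must be glycine")
    return {
        "pocket": nterm[0:6],    # positions 1-6, binding pocket of the NMT
        "surface": nterm[6:10],  # positions 7-10, mouth of the catalytic cavity
        "linker": nterm[10:17],  # positions 11-17, hydrophilic linker
    }
```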
The NMT predictor (Table 1) scores the agreement of a query N-terminus with the N-myristoylation pattern and returns the corresponding probability of false-positive prediction (Fig. 1). We distinguish reliably predicted targets (score≥0), twilight zone predictions (0>score≥−2), and proteins that are predicted not to be NMT targets. It should also be noted that, for example in the case of viral polyproteins, internal glycines become N-terminal after protein processing and are myristoylated. Optionally, possible myristoylation at internal glycines (in typical processing patterns) may be analyzed, too (9).
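When many NMT results are processed in a script, the three score classes can be assigned with a small helper. The sketch below uses only the thresholds stated above; the function name and the returned labels are hypothetical.

```python
def classify_nmt_score(score: float) -> str:
    """Map an NMT predictor score to its prediction class."""
    if score >= 0:
        return "reliable"        # score >= 0
    if score >= -2:
        return "twilight zone"   # 0 > score >= -2
    return "not predicted"       # score < -2
```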
The N-myristoylation signal is commonly applied to target proteins to membranes. The NMT predictor can be used for testing protein constructs with an engineered N-terminal N-myristoylation motif prior to the experiment. With the complete output of score components, the agreement with the physical property pattern can be checked in detail (for example, the suitability of the linker region) and it becomes easy to examine the effects of changes in the construct.
To date, two different signals that can trigger peroxisomal import have been characterized, termed PTS1 and PTS2. PTS1, the major targeting signal, consists of the three C-terminal amino acids (mainly the canonical tripeptide S/A/C-K/R/H-L, but not only) that bind to the inner cavity of the receptor molecule Pex5 in addition to several residues further upstream (13) that either interact with the surface of Pex5 or serve as a short conformationally unrestricted linker to the remainder of the protein.
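As a quick pre-filter, the canonical tripeptide can be tested with a regular expression. This Python sketch checks only the pattern S/A/C-K/R/H-L at the extreme C-terminus; it is not the algorithm of the PTS1 server, which also considers upstream residues, and the function name is hypothetical. Because the canonical tripeptide is not the only permitted one, a negative result does not exclude a functional PTS1 signal.

```python
import re

# Canonical PTS1 tripeptide S/A/C - K/R/H - L anchored at the C-terminus.
CANONICAL_PTS1 = re.compile(r"[SAC][KRH]L$")


def has_canonical_pts1_tripeptide(sequence: str) -> bool:
    """Return True if the last three residues match the canonical tripeptide."""
    return bool(CANONICAL_PTS1.search(sequence.upper()))
```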
The concordance with this motif is searched for using an algorithm implemented in the PTS1 signal server (Table 1). Reliably predicted targets should have a non-negative total score; queries with a score larger than −10 are considered as twilight zone hits. In all other cases, the protein is predicted not to have a PTS1 signal. We must emphasize that the server analyzes exclusively the concordance of the query's C-terminus with the generalized PTS1 motif as described above. The PTS1 signal competes with other signals if contained in the sequence. Proteins with dual localizations, including for example a peroxisomal and a mitochondrial fraction (14), are known; proteins with a strong signal peptide are most probably co-translationally exported to the ER.
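The score classes of the PTS1 server can be assigned in the same way when results are collected in a script. The sketch below uses only the thresholds stated above; the function name and the returned labels are hypothetical. Note that the twilight zone boundary is exclusive: a score of exactly −10 counts as a negative prediction.

```python
def classify_pts1_score(score: float) -> str:
    """Map a PTS1 predictor total score to its prediction class."""
    if score >= 0:
        return "predicted"       # non-negative total score
    if score > -10:
        return "twilight zone"   # 0 > score > -10
    return "not predicted"       # score <= -10
```

As emphasized above, even the class "predicted" refers only to the C-terminal motif; competing signals elsewhere in the sequence have to be checked separately.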
The authors are grateful for generous support from Boehringer Ingelheim. This project has been partly funded by the Austrian National Bank (Österreichische Nationalbank), by the Fonds zur Förderung der wissenschaftlichen Forschung Österreichs (FWF P15037) and by the Austrian Gen-AU bioinformatics integration network sponsored by BM-BWK.