State-of-the-art prediction algorithms need to address the prediction of weak, non-canonical PTS1s. The accuracy of a prediction tool is determined by two parameters, sensitivity and specificity. The sensitivity in detecting plant PTS1 proteins depends mainly on the ability to identify all functional PTS1 tripeptides of Spermatophyta and, hence, to predict novel “unseen” PTS1 tripeptides that are absent from the training datasets of positive example sequences. Most previously developed prediction tools for fungi and animals were not designed to infer novel PTS1 tripeptides or to predict low-abundance proteins because they employed tripeptide-based selection filters [29-32]. By contrast, our newly developed prediction tools for plants could infer novel PTS1 tripeptides, many of which were validated as correct predictions by experimental in vivo analyses [16]. By demonstrating in this study that three additional tripeptides are novel non-canonical PTS1 tripeptides, we show that novel non-canonical PTS1 tripeptides are correctly predicted even if the corresponding sequences score close to the prediction threshold. Thereby, this study increases the number of known plant PTS1s from 51 to 54. With this knowledge, more low-abundance plant peroxisomal PTS1 proteins carrying non-canonical PTS1 tripeptides such as QRL>, SQM>, or SDL> can now be identified.
On top of the 32 plant PTS1 tripeptide residues experimentally validated previously [16], the PWM model predicted that ten additional residues might be allowed in plant PTS1 tripeptides ([HKQR][IAVW][QR]>, see Supplemental Dataset 2 online in [16]). One of these residues, Q (pos. -3), was validated in the present study. Moreover, D (pos. -2) was validated as an allowed PTS1 tripeptide residue, even though the corresponding Arabidopsis sequence scored slightly below the prediction threshold (Table). Because sequences with non-canonical PTS1 tripeptides are underrepresented in the underlying dataset of positive example sequences, the correct prediction of non-canonical sequences remains challenging. As a result, a few false positive (i.e., non-peroxisomal) sequences score above the prediction threshold (see below), and a few true positive (PTS1 protein) sequences fall below the threshold in a prediction grey zone reaching roughly down to PTS1 protein score position 1100 [16].
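The trade-off described above, with some non-peroxisomal sequences above the threshold and some PTS1 proteins below it, can be illustrated with a minimal Python sketch. The function name, the scores, and the threshold are hypothetical placeholders chosen for illustration only; they are not values from the PWM model or from [16].

```python
def sensitivity_specificity(scored_examples, threshold):
    """Return (sensitivity, specificity) for (score, is_pts1_protein) pairs.

    A sequence is predicted as peroxisomal if its score is >= threshold.
    """
    tp = fn = tn = fp = 0
    for score, is_pts1_protein in scored_examples:
        predicted_peroxisomal = score >= threshold
        if is_pts1_protein and predicted_peroxisomal:
            tp += 1   # PTS1 protein above threshold
        elif is_pts1_protein:
            fn += 1   # PTS1 protein lost in the grey zone below threshold
        elif predicted_peroxisomal:
            fp += 1   # non-peroxisomal sequence above threshold
        else:
            tn += 1   # non-peroxisomal sequence below threshold
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative example")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy scores (NOT real prediction scores).
TOY_EXAMPLES = [
    (0.90, True), (0.60, True), (0.35, True),      # PTS1 proteins
    (0.45, False), (0.20, False), (0.10, False),   # non-peroxisomal sequences
]

if __name__ == "__main__":
    print(sensitivity_specificity(TOY_EXAMPLES, threshold=0.40))
```

Lowering the threshold in this toy example recovers the PTS1 protein in the grey zone, but only at the cost of admitting further non-peroxisomal sequences.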
The new experimentally verified PTS1 tripeptides add another two residues, Gln (pos. -3) and Asp (pos. -2), to yield in total 34 experimentally validated position-specific residues for the previously reported plant PTS1 motif ([SAPCFVGTLKIQ][RKNMSLHGETFPQCYD][LMIVYF]>), corresponding to twelve (pos. -3), 16 (pos. -2), and six (pos. -1) allowed aa residues in plant PTS1 tripeptides (Figure). Hence, the tolerated plant PTS1 motif variation is much higher than previously thought. The former “basic” pos. -2, which was previously considered to be the most conserved position, emerges as the most flexible, with 16 of 20 possible residues allowed (80%), even including both acidic residues, Glu and Asp (Figure). Notably, only specific combinations of the residues of the plant PTS1 tripeptide motif yield functional plant PTS1 tripeptides. All experimentally verified plant PTS1 tripeptides identified to date follow the pattern that at least two high-abundance residues of presumably strong targeting strength ([SA][KR][LMI]>) need to be combined with one low-abundance PTS1 residue to yield functional plant PTS1 tripeptides (x[KR][LMI]>, [SA]y[LMI]>, [SA][KR]z>, Figure).
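The motif and the combination rule stated above can be expressed as a short Python sketch. The residue sets are copied from the motif in the text; the function names are hypothetical, and the check is only a pattern filter, not the PWM prediction model.

```python
import re

# Position-specific residues copied from the motif quoted in the text.
ALLOWED = ("SAPCFVGTLKIQ", "RKNMSLHGETFPQCYD", "LMIVYF")  # pos. -3, -2, -1
# High-abundance residues of presumably strong targeting strength.
HIGH_ABUNDANCE = ("SA", "KR", "LMI")                       # pos. -3, -2, -1

# Anchor the pattern at the C-terminus (the ">" in the paper's notation).
MOTIF = re.compile("".join(f"[{residues}]" for residues in ALLOWED) + "$")


def matches_motif(sequence):
    """True if the C-terminal tripeptide fits the 34-residue plant PTS1 motif."""
    return MOTIF.search(sequence.upper()) is not None


def high_abundance_count(sequence):
    """Number of C-terminal tripeptide positions carrying a high-abundance residue."""
    tripeptide = sequence.upper()[-3:]
    return sum(res in strong for res, strong in zip(tripeptide, HIGH_ABUNDANCE))


def follows_combination_rule(sequence):
    """Motif match plus at least two high-abundance residues."""
    return matches_motif(sequence) and high_abundance_count(sequence) >= 2


if __name__ == "__main__":
    print([len(residues) for residues in ALLOWED])   # residues allowed per position
    for tripeptide in ("QRL", "SQM", "SDL", "HKL", "RKM"):
        print(tripeptide, follows_combination_rule(tripeptide))
```

Under this filter the three newly validated tripeptides (QRL>, SQM>, SDL>) pass, whereas HKL> and RKM> fail because His and Arg are not among the validated pos. -3 residues.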
In the present study, three Arabidopsis proteins that had previously not been associated with peroxisomes were shown to carry functional PTS1 domains. The QRL> decapeptide validated as a functional PTS1 domain is derived from the second alternative splice variant of a DNAJ homolog (Figure, Table). No DNAJ homolog has previously been shown to be targeted to Arabidopsis peroxisomes. An HSP70 and a DNAJ homolog are reported to be associated with the glyoxysomal membrane in cucumber, and the latter was shown to interact specifically with a cytosolic Hsp70 [33,34]. A watermelon Hsp70 was shown to be dually targeted to both chloroplasts and peroxisomes, regulated by alternative translation [35]. The fact that the other three variants of the DNAJ homolog do not carry potential PTS1 domains indicates that the protein is dually targeted to both the cytosol and peroxisomes, regulated by alternative splicing. More detailed studies need to address under which conditions this second splice variant is expressed and the full-length protein is targeted to peroxisomes. To date, only a few plant proteins are reported to be dually targeted to peroxisomes and a second cell compartment by alternative splicing. The most prominent example is Arabidopsis transthyretin-like protein, a bifunctional enzyme involved in purine catabolism [17,27,36].
The functional PTS1 domain terminating with the newly identified PTS1, SQM>, belongs to RHD3H2 (At5g45160), a GTP-binding protein and paralog of RHD3 (At3g13870; 67% sequence identity and 82% similarity at the aa level [37], Table). Loss-of-function mutants of RHD3 are suppressed in epidermal cell file rotation, root skewing, and waving on hard-agar surfaces. RHD3 is involved in the control of vesicle trafficking between the ER and the Golgi compartments [37-40]. Future research needs to address whether the full-length RHD3H2 protein is indeed located in peroxisomes.
The functional PTS1 domain terminating with SDL> belongs to the cytosolic Ser/Thr protein kinase CONSTITUTIVE TRIPLE RESPONSE 1 (CTR1, At5g03730), which is an important negative regulator of the ethylene signal transduction pathway regulating plant growth and development [41] (for review see [42]). Dark-grown seedlings of “triple response” mutants show an altered response to ethylene. The kinase activity of CTR1 is reported to be regulated by multiple reversible phosphorylation events, leading to significant conformational rearrangements [41]. This mode of post-translational regulation offers the possibility that differential surface exposure of the C-terminal PTS1 domain might cause peroxisome targeting, for instance to transiently eliminate CTR1 from the cytosol.
On the other hand, two predicted non-canonical PTS1 tripeptides could not be validated as functional PTS1 tripeptides for the chosen Arabidopsis sequences, namely those terminating with HKL> and RKM>. There are several possible reasons, ranging from insufficient sensitivity in detecting weak peroxisome targeting, through the omission of targeting-enhancing elements located upstream of the C-terminal decapeptide in the native protein, to incorrect predictions.
When the full-length cDNA of HSP70T-2 (RKM>) was cloned to the C-terminal end of the reporter protein, the reporter fusion remained cytosolic as well (data not shown). Alternative expression systems, including stable Arabidopsis lines, might be needed to conclusively investigate whether the two predicted proteins are cytosolic in vivo. PWM predictions of plant proteins with novel non-canonical tripeptides that have not yet been confirmed as functional tripeptides for other sequences should be considered with greater caution than predictions of Arabidopsis proteins carrying confirmed PTS1 tripeptides. Notably, R (pos. -3) was one of the few residues that could also not be confirmed for one positive example sequence [16]. It is important to mention that the PWM prediction algorithms do not consider the similarity of the biophysical properties of residues and deduce predictions solely from discriminative position-specific aa abundance. Due to the high abundance of SKL> sequences in the underlying dataset and the close codon similarity between Ser (AG[TC]) and Arg (AG[GA]), the two RKL> positive example sequences could have been created by sequencing errors in ESTs and could have caused the false prediction of RKL> and RKM> sequences as peroxisomal.
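The codon argument can be checked with a few lines of Python. This is a minimal sketch that considers only the codons named in the text, Ser (AGT, AGC) and Arg (AGA, AGG); the helper name is hypothetical.

```python
from itertools import product

# Only the AG-prefixed codons named in the text are considered here.
SER_CODONS = ("AGT", "AGC")
ARG_CODONS = ("AGA", "AGG")


def hamming_distance(codon_a, codon_b):
    """Number of nucleotide positions at which two codons differ."""
    if len(codon_a) != len(codon_b):
        raise ValueError("codons must have equal length")
    return sum(a != b for a, b in zip(codon_a, codon_b))


# Distance for every Ser/Arg codon pair.
PAIR_DISTANCES = {
    (ser, arg): hamming_distance(ser, arg)
    for ser, arg in product(SER_CODONS, ARG_CODONS)
}

if __name__ == "__main__":
    for (ser, arg), distance in PAIR_DISTANCES.items():
        print(f"Ser {ser} -> Arg {arg}: {distance} mismatch(es)")
```

Every pair differs only at the third codon position, so a single miscalled base in an EST would suffice to turn an SKL> read into an apparent RKL> sequence.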
Our PWM algorithm combines the C-terminal PTS1 tripeptide and the upstream region (up to 12 aa residues) into a single prediction model. Peroxisome targeting by weak non-canonical PTS1s essentially depends on the presence of targeting-enhancing elements in the upstream region. These elements, however, had only been vaguely defined until now. It has been reported for a few sequences that basic residues enhance peroxisome targeting, primarily if located at pos. -4 [26]. Only the SDL> sequence carried a basic residue directly in front of the non-canonical PTS1; the other two sequences did not, and the SQM> sequence even contained two acidic residues, which are generally very rare in PTS1 domains [14]. It is therefore of interest to identify specific aa residues in the upstream region of a given PTS1 domain that enhance and are essential for peroxisome targeting. To this end, we established in this study a so-called position-specific permutation analysis for non-canonical PTS1 sequences. For each of the newly identified Arabidopsis PTS1 domains carrying novel non-canonical PTS1 tripeptides, we calculated to what extent single aa exchanges in the upstream domain affected the prediction score for peroxisome targeting. In all three sequences, four to five aa residues of the Arabidopsis proteins were identified as close-to-optimal residues in terms of peroxisome targeting prediction. These data strongly suggest that these residues function as targeting enhancer elements for peroxisome targeting. The exact positioning of these predicted enhancer elements appears relatively flexible between pos. -4 and -12. Most interestingly, not only basic residues and proline, but also hydroxylated (Ser, Thr), hydrophobic (Ala, Val), and even acidic residues are predicted to be able to enhance peroxisome targeting. These predictions are challenging to validate experimentally due to the moderate (SQM>) to low (SDL>) peroxisome targeting efficiency of the original Arabidopsis decapeptides, which makes it difficult to investigate further reductions semi-quantitatively. Future studies should address whether such experimental analyses are feasible, for instance, in the case of the QRL> sequence.
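The idea behind the position-specific permutation analysis can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation used in this study: the PWM is assumed to be purely additive, its values are random placeholders produced by a hypothetical `toy_pwm` helper, and the decapeptide is an invented placeholder rather than one of the Arabidopsis sequences.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_pwm(length, seed=0):
    """Hypothetical PWM: one {residue: score} column per position.

    The values are random placeholders, NOT trained weights.
    """
    rng = random.Random(seed)
    return [{aa: rng.uniform(-1.0, 1.0) for aa in AMINO_ACIDS}
            for _ in range(length)]


def pwm_score(sequence, pwm):
    """Additive score (an assumption): sum of position-specific residue weights."""
    return sum(column[aa] for aa, column in zip(sequence, pwm))


def permutation_analysis(sequence, pwm, tripeptide_len=3):
    """Exchange each upstream residue for all 20 aa and record score changes.

    Returns one record per upstream position with the score change of every
    substitution and the gap between the native residue and the best residue.
    A small gap marks the native residue as close to optimal.
    """
    base = pwm_score(sequence, pwm)
    records = []
    for i in range(len(sequence) - tripeptide_len):
        deltas = {}
        for aa in AMINO_ACIDS:
            mutant = sequence[:i] + aa + sequence[i + 1:]
            deltas[aa] = pwm_score(mutant, pwm) - base
        records.append({
            "pos": i - len(sequence),        # e.g. -10 ... -4 for a decapeptide
            "native": sequence[i],
            "deltas": deltas,
            "gap_to_best": max(deltas.values()),
        })
    return records


if __name__ == "__main__":
    decapeptide = "ADEGHKPSKL"               # invented placeholder sequence
    for rec in permutation_analysis(decapeptide, toy_pwm(len(decapeptide))):
        print(rec["pos"], rec["native"], round(rec["gap_to_best"], 3))
```

With a trained model in place of the toy matrix, upstream positions whose native residue shows a gap close to zero would correspond to the predicted targeting enhancer elements discussed above.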