All descriptors gave good prediction results with mean AUCs ranging from 0.9535 to 0.9934 (). In contrast to the classification results obtained from the ANN classification in our earlier study,17
these RF classifications are highly stable with sds of the AUC distributions of about 0.0004 (cv = 0.0004).
Boxplot of descriptor performances (mean AUC). On the y-axis the AUC values for all descriptors analyzed are shown as a boxplot.
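The stability figures quoted above (sd and cv of the AUC distribution) can be reproduced from a set of repeated runs. The following is a minimal sketch, assuming cv is defined as sd divided by the mean; the AUC values and the function name `summarize_auc` are hypothetical and are not data from this study.

```python
from statistics import mean, stdev

def summarize_auc(aucs):
    """Return (mean, sd, cv) for AUC values from repeated runs."""
    m = mean(aucs)
    sd = stdev(aucs)      # sample standard deviation
    return m, sd, sd / m  # cv assumed to be sd / mean

# Hypothetical AUCs from five repeated RF runs (illustration only).
runs = [0.9930, 0.9934, 0.9938, 0.9932, 0.9936]
m, sd, cv = summarize_auc(runs)
print(f"mean={m:.4f} sd={sd:.4f} cv={cv:.4f}")
```

Because the mean AUC is close to 1, sd and cv are nearly identical, which is consistent with both being reported as 0.0004.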
The best results for the prediction of small GTPases were obtained with RFs using “Normalized positional residue frequency at helix termini N1”26
as a descriptor ().
Normalized positional residue frequency at helix termini N1. The descriptor values for each amino acid (single letter code) are shown
In , the interval for the descriptor for each position and each family is shown. Rab proteins seem to have very compact descriptor values within the sequences, whereas all other families exhibit high diversity in descriptor values.
Structural plot of descriptor for all families.
The sensitivity, specificity and accuracy are comparable to those published in our earlier study,17
although the RFs use only one descriptor instead of two (hydrophobicity and secondary structure). The sensitivity is 94% with a corresponding specificity of 98.94%, and thus, an accuracy of 96.49%.
The OOB-error for the best RF is 3.51%. The best cutoff is found at 0.4568, which is slightly lower than the cutoff obtained from the classification with artificial neural networks in our earlier study.17
The ROC curve is shown in . The mean AUC of the RF is 0.9934 with a standard deviation of 0.0004 (cv = 0.0004). The most important normalized sequence positions for the classification process are shown in .
ROC curve of the best performing random forest. (1-specificity) against sensitivity, ranging from 0 to 1 on both axes.
Importance plot of the GTPases classification. The x-axis represents the normalized sequence positions, whereas the y-axis denotes the percentage increase in misclassification rate.
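To make the relation between cutoff, sensitivity, specificity, accuracy and AUC concrete, the following minimal sketch computes them from class labels and classifier scores. It assumes that scores at or above the cutoff are called positive, and it computes the AUC as the probability that a random positive scores higher than a random negative. The labels, scores and function names are hypothetical; only the cutoff value 0.4568 is taken from the text.

```python
def confusion_metrics(labels, scores, cutoff):
    """Sensitivity, specificity and accuracy; scores >= cutoff count as positive."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= cutoff)
    fn = sum(1 for y, s in pairs if y == 1 and s < cutoff)
    tn = sum(1 for y, s in pairs if y == 0 and s < cutoff)
    fp = sum(1 for y, s in pairs if y == 0 and s >= cutoff)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / len(pairs)
    return sensitivity, specificity, accuracy

def auc(labels, scores):
    """AUC as P(score of positive > score of negative); ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: 1 = small GTPase, 0 = other protein.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.60, 0.40, 0.50, 0.30, 0.20, 0.05]
sens, spec, acc = confusion_metrics(labels, scores, cutoff=0.4568)
area = auc(labels, scores)
print(sens, spec, acc, area)
```

Sweeping the cutoff from 0 to 1 and plotting (1-specificity) against sensitivity yields the ROC curve; the AUC summarizes that curve independently of any single cutoff.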
By applying a retransformation to the real sequence lengths, the 30 most important positions for discriminating between small GTPases and other proteins (>4% increase in misclassification when left out) can be reassigned to the sequence. The most important positions for the definition of a small GTPase (>15%) are displayed using the structure of Rab6A as a representative small GTPase27
(). The C-terminal region is unstructured, and thus, is not shown within the Figure. The most important positions near the N-terminus are positions 20 and 21 (G and E). These amino acids are located within the highly conserved switch I region, a common structural feature of all small GTPases, and are involved in nucleotide binding.
Figure 5 Most important positions for the identification of small GTPases. The most important regions (>4%) for deciding whether a protein belongs to the class of small GTPases are highlighted in red within the Rab6A structure.27
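The retransformation from normalized sequence positions to real sequence positions can be sketched as follows. This is only an illustration under the assumption that the normalization was a linear rescaling of the sequence axis; the number of normalized positions (100), the sequence length (200) and the function name `normalized_to_real` are hypothetical and not taken from this study.

```python
def normalized_to_real(index, n_normalized, seq_length):
    """Map a 1-based normalized position to the nearest 1-based residue position.

    Assumes a linear rescaling: normalized position 1 maps to residue 1 and
    position n_normalized maps to the last residue.
    """
    if n_normalized < 2 or not 1 <= index <= n_normalized:
        raise ValueError("index outside the normalized range")
    fraction = (index - 1) / (n_normalized - 1)
    return 1 + round(fraction * (seq_length - 1))

# Hypothetical example: 100 normalized positions, 200-residue protein.
for idx in (1, 50, 100):
    print(idx, "->", normalized_to_real(idx, 100, 200))
```

Because several residues can fall onto the same normalized position when the sequence is longer than the normalized length, the reassigned residue is approximate, which matches the caveat on exact reassignment after interpolation raised for the Rho family.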
The results for the assignment of the small GTPases to the specific families are shown in .
Family classification. The mean AUC values, standard deviations (sd) and coefficient of variation (cv) are shown for each family of small GTPases
The RFs identified the most important positions (>4% increase in misclassification) within the protein families as follows:
- Rab family (Rab6A): 11 (L), 31 (T), 186–188 (D,M,I)
- Rho family (Rho6 = RND1): 6 (A), 9 (P), 12 (A), 16 (L), 18 (L), 20 (G), 32 (Q), 75 (N), 162 (E), 194 (L)
- Arf/Sar family (Arf1): 1 (M), 2 (G), 16 (K), 68 (V)
- Ran family (Ran): 212–214 (E,D,D)
In the alignment in , the respective positions in representative human proteins are shown for all families. For Rab, Rho and Arf/Sar families, representative structures are shown in and the important positions are highlighted.
Most important positions for the classification of small GTPase families.
The important residues for Rab family assignment are concentrated within the so-called Rab subfamily (RabSF) regions. These RabSF regions were defined by Pereira-Leal and Seabra and represent conserved sequence motifs within the Rab family that allow a specific subclassification of the family.28
The relevant amino acids for the assignment of the Arf/Sar family are located mainly within the N-terminal region. This is notable because, in contrast to the Rab, Rho and Ras families, which are attached to their target membrane via C-terminal geranylgeranylation or farnesylation, Arf is N-terminally associated with membranes. To this end, the N-terminus of Arf is myristoylated and forms an amphipathic alpha helix.7
The respective residues for Ran classification are concentrated close to the C-terminus of the protein. Ran is not membrane bound and, in contrast to Rab, Rho and Ras family members, does not exhibit a cysteine-containing lipid modification motif at its C-terminus.6
For the Rho family, the residues identified by the RF are located mainly within the N-terminal region, with some amino acids spread over the sequence. Remarkably, those are all found within or adjacent to critical structural elements, as can be seen in the Rho6/Rnd1 sequence in . Nobes et al29
report Rho6/Rnd1 to exhibit only a weak intrinsic GTPase activity and propose that it might thus be constitutively GTP bound. This might explain the unusual dispersal of important positions within the Rho6/Rnd1 sequence in structural elements that usually play a role in GTPase function. Hence, Rho6/Rnd1 might not be a good representative of the family and it might be useful to map the important residues to other Rho family members. Furthermore, Rho GTPases show differences in their primary sequence in comparison to the other families, for example the “Rho insert” (which can also be seen in the alignment in between beta5 and alpha4 within the G domain), which might interfere with a correct reassignment of the exact positions after interpolation.10
GTPasePred (see additional file 1) can be used to predict novel potential small GTPases. It uses the aforementioned “Normalized positional residue frequency at helix termini N1”26
descriptor to predict whether a protein sequence belongs to the superfamily of small GTPases, and subsequently, to which family it belongs. GTPasePred is implemented in Java [http://java.sun.com] and R [http://www.R-project.org], and thus, needs the Java JRE 1.6 and R (with the random forest package) installed.
In order to predict one or more novel potential small GTPases, simply copy the protein sequences into the file sequences and, on a Linux/Unix system, start the classification process by typing ./start in the terminal. The results are stored in the file Results.txt. On a Windows system, use start.bat to encode the sequences, start R in the current directory and type source("program"). The results of the classification process are shown on the screen.
First, each protein sequence is classified as a small GTPase or not; in the case of a positive classification, it is subsequently classified by the family RFs. The RF with the highest probability output (positive classification) is selected (). If this RF has an output ≥0.5, the protein sequence is assigned to the corresponding family; otherwise, it is assigned as a GTPase in general and classified as “Ras or not further specified small GTPase”.
Figure 8 Classification processing flow. A sequence is only forwarded as an input sequence to the subfamily RFs if it was identified as a GTPase by the GTPase-RF.
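The two-stage decision described above can be sketched as a small function. This is an illustration only and not the GTPasePred source code: the function name `assign_family`, the dictionary of family scores and the cutoff used for the first stage (`gtpase_cutoff`, set to 0.5 here) are assumptions; only the family threshold of ≥0.5 and the fallback label are taken from the text.

```python
def assign_family(gtpase_score, family_scores, gtpase_cutoff=0.5, family_cutoff=0.5):
    """Two-stage assignment: superfamily RF first, then the family RFs.

    gtpase_score  : output of the GTPase-RF for one sequence
    family_scores : mapping from family name to the output of its family RF
    gtpase_cutoff : assumed threshold for the first stage (hypothetical)
    """
    if gtpase_score < gtpase_cutoff:
        return "not a small GTPase"
    # Select the family RF with the highest output.
    best_family = max(family_scores, key=family_scores.get)
    if family_scores[best_family] >= family_cutoff:
        return best_family
    return "Ras or not further specified small GTPase"

# Hypothetical RF outputs for three sequences.
print(assign_family(0.97, {"Rab": 0.91, "Rho": 0.10, "Arf/Sar": 0.03, "Ran": 0.01}))
print(assign_family(0.88, {"Rab": 0.30, "Rho": 0.25, "Arf/Sar": 0.20, "Ran": 0.05}))
print(assign_family(0.12, {"Rab": 0.60, "Rho": 0.10, "Arf/Sar": 0.05, "Ran": 0.02}))
```

Note that the third sequence is rejected at the first stage even though one family score exceeds 0.5, because the family RFs are only consulted after a positive superfamily classification.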
Application to newly sequenced genomes
The algorithm can also be applied when newly sequenced genomes become available.
The work flow is as follows:
- Identify the correct open reading frames (ORFs), e.g., with the ORF Finder (http://www.ncbi.nlm.nih.gov/projects/gorf/), incrementally for all genes within the newly sequenced genome.
- The translated protein sequences have to be saved in the file sequences, which subsequently can be used as the input for GTPasePred.
- All proteins will be encoded using the aforementioned descriptor and classified as small GTPases or not. The results of the classification process will be saved in Results.txt.
Thus, ORF Finder combined with GTPasePred can be used to identify potential novel GTPases in newly sequenced genomes. An example of our application to newly sequenced genomes can be found in additional file 2.
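For readers who want to see what the ORF identification and translation step involves, the following minimal sketch scans all six reading frames for ATG-to-stop ORFs and translates them with the standard genetic code. It is a simplified stand-in and not the NCBI ORF Finder; the function names, the `min_length` parameter and the toy DNA string are hypothetical, and the format expected by the file sequences is not covered here.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_orfs(dna, min_length=50):
    """Return translated ORFs (ATG ... stop codon) from all six reading frames.

    Only ORFs closed by a stop codon and with at least min_length residues
    are reported; codons containing unknown bases are translated as 'X'.
    """
    dna = dna.upper()
    proteins = []
    for strand in (dna, reverse_complement(dna)):
        for frame in range(3):
            protein = None  # None while outside an ORF
            for i in range(frame, len(strand) - 2, 3):
                aa = CODON_TABLE.get(strand[i:i + 3], "X")
                if protein is None:
                    if aa == "M":
                        protein = ["M"]
                elif aa == "*":
                    if len(protein) >= min_length:
                        proteins.append("".join(protein))
                    protein = None
                else:
                    protein.append(aa)
    return proteins

# Toy example with a very short ORF (ATG GCT TGG TAA).
print(find_orfs("CCATGGCTTGGTAAGG", min_length=3))
```

The resulting protein sequences would then be written to the input file and passed to GTPasePred as described in the work flow above. A dedicated tool is preferable for real genomes, since this sketch ignores introns, alternative start codons and alternative genetic codes.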