In this study we used a Random Forest-based approach for an assignment of small guanosine triphosphate proteins (GTPases) to specific subgroups. Small GTPases represent an important functional group of proteins that serve as molecular switches in a wide range of fundamental cellular processes, including intracellular transport, movement and signaling events. These proteins have further gained a special emphasis in cancer research, because within the last decades a huge variety of small GTPases from different subgroups could be related to the development of all types of tumors. Using a random forest approach, we were able to identify the most important amino acid positions for the classification process within the small GTPases superfamily and its subgroups. These positions are in line with the results of earlier studies and have been shown to be the essential elements for the different functionalities of the GTPase families. Furthermore, we provide an accurate and reliable software tool (GTPasePred) to identify potential novel GTPases and demonstrate its application to genome sequences.
The assignment of proteins to functional classes is an important principle in the understanding of complex cellular processes. The function of a protein is defined by its three dimensional structure, which in turn is determined by its amino acid sequence. However, different amino acid sequences can fold into similar or nearly identical three dimensional structures that fulfill analogous functions. After the detection of a novel amino acid sequence, the corresponding protein has to be assigned to existing functional classes by either homology search of protein sequences or functional classification using descriptors. For functional classification, different machine learning approaches exist, such as artificial neural networks (ANNs),1 support vector machines (SVMs),2 Random Forests (RFs) or hidden Markov models (HMMs). Additionally, different descriptors can be used, ranging from elementary descriptors such as physicochemical attributes to very complex and computationally expensive properties. The classification accuracy depends heavily on the selected descriptor sets, and thus, the composition of the descriptor set is the most critical part in classifier development.3,4
The main objective of the work presented here is to analyze and classify protein sequences from the superfamily of small guanosine triphosphate proteins (GTPases).
The small GTPases, also termed the “Ras” (rat sarcoma) superfamily of GTPases, consist of small monomeric proteins that can act as “molecular switches”. The basis for this switch function is their ability to bind and hydrolyze GTP: when GTP is bound, the switch is turned “on” and downstream effectors are activated; hydrolysis of GTP to guanosine diphosphate (GDP) converts the protein into its inactive conformation, and the switch is turned “off”.5
The Ras superfamily of small GTPases is typically divided into five families: Ras related in brain (Rab), Rho, Ras related nuclear protein (Ran), adenosine diphosphate (ADP) ribosylation factors (Arf)/secretion associated and Ras related (Sar), and the eponymous Ras proteins.6 These families share a common core structure, the G domain, which consists of five alpha helices and a six-stranded beta sheet. Here, binding of GTP and the cofactor magnesium takes place. A conserved structural feature within the G domain of all small GTPases is the pair of switch I and switch II regions, where the major conformational changes upon GTP binding and hydrolysis take place.5
One important feature of most small GTPases is their posttranslationally attached lipid modifications, which facilitate the specific targeting and attachment of the GTPase to intracellular membranes. Ras, Rho and Rab carry farnesyl or geranylgeranyl isoprenoids that are attached to specific cysteine-containing recognition motifs at the C-terminus. Arf/Sar proteins are modified at their N-terminus by myristoylation, whereas Ran is not lipid modified at all and thus not membrane bound.6,7
Due to differences in structure, posttranslational modifications and subcellular localization, the small GTPase families fulfill different functions within the cell. The Ras family proteins are major regulators in signal transduction events and have been shown to play important roles in the development of a variety of human carcinomas.6,8 Rho GTPases are involved in processes linked to the cytoskeleton like cell morphology and mobility.9–11 The small GTPase Ran facilitates transport into and out of the nucleus.12 Members of the Arf/Sar family regulate different steps in intracellular membrane transport.13 Proteins from the Rab family, the largest family of small GTPases, are important factors in membrane trafficking events and in the definition of organelle identity.14,15
The involvement of a variety of Ras superfamily proteins in human tumorigenesis makes these proteins interesting subjects in cancer research, and hence, the identification and functional characterization of novel GTPases is an important topic in molecular cell biology.8,10,16
In a recent study we developed a neural network cluster (NNC) for the identification and classification of small GTPases.17 Using this NNC we were able to distinguish between small GTPases and nonGTPases from primary sequence data, and to assign the small GTPase sequences to one of the specific families. In this new study, we use another type of machine learning algorithm, namely random forests (RFs),18 for this task.
ANNs, as used in our earlier study,17 are universal approximators that can be used to solve nonlinear classification problems, but are prone to overtraining.19,20 RFs are likewise excellent nonlinear models, but in contrast to ANNs they are highly stable, and because they belong to the classifier ensembles they generally perform better than single decision trees (DTs).21 They are less easily interpretable than DTs, but provide variable importance measures.18
From this importance analysis we were able to identify the most important positions within the protein sequences for the classification process, and thus, get more detailed insights into the molecular differences of those proteins belonging to the family of small GTPases.
The data set of this study was taken from Heider et al.17 It consists of 399 Rab GTPases, 134 Rho GTPases, 78 Arf/Sar GTPases, 52 Ran GTPases and 772 protein sequences not belonging to the superfamily of small GTPases. These sequences represent a wide range of different organisms.
The 772 nonGTPases have a sequence length similar to that of the small GTPases and are used as negative samples in the classification process. First, GTPases are distinguished from nonGTPases. Then, a protein that has been assigned as a small GTPase is classified by four independent random forests, each trained on one of the Rab, Rho, Arf/Sar or Ran families as positive samples and all other families as negative samples. Proteins that are identified as small GTPases but cannot be classified further by any of the RFs are grouped into “Ras or not further specified small GTPase”.
We analyzed 544 descriptors derived from the amino acid (AA) index database22 for our study. Thirteen descriptors were incomplete, and thus, not investigated further.
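To illustrate how a single amino acid index turns a protein sequence into a numerical vector, the following minimal sketch encodes a sequence residue by residue and checks whether an index is complete. The Kyte-Doolittle hydropathy scale is used here purely as a stand-in descriptor, and the function names are hypothetical; neither is taken from GTPasePred.

```python
# Kyte-Doolittle hydropathy values, used only as an illustrative descriptor.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

STANDARD_AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def is_complete(descriptor):
    """Return True if the descriptor has a value for all 20 standard residues."""
    return all(descriptor.get(aa) is not None for aa in STANDARD_AMINO_ACIDS)


def encode_sequence(sequence, descriptor):
    """Replace every residue of the sequence by its descriptor value."""
    values = []
    for residue in sequence.upper():
        if descriptor.get(residue) is None:
            raise ValueError(f"no descriptor value for residue {residue!r}")
        values.append(descriptor[residue])
    return values
```

An index that fails `is_complete` would correspond to one of the incomplete descriptors that were excluded from further analysis.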
Because protein sequences differ in length, all sequences were normalized to a length of 300 positions. For this normalization procedure we used linear interpolation as previously described.17
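The length normalization can be sketched as follows. This is a minimal example that assumes the first and last residues are mapped onto the first and last normalized positions and that intermediate values are interpolated linearly between neighboring residues; the exact scheme of the earlier study17 may differ in detail, and the function name is hypothetical.

```python
def normalize_length(values, target_length=300):
    """Resample a list of descriptor values to target_length positions
    by linear interpolation between neighboring residues."""
    n = len(values)
    if n == 0:
        raise ValueError("cannot normalize an empty sequence")
    if n == 1 or target_length == 1:
        # Nothing to interpolate between: repeat the first value.
        return [float(values[0])] * target_length
    step = (n - 1) / (target_length - 1)
    result = []
    for i in range(target_length):
        x = i * step                   # fractional index into the original values
        left = min(int(x), n - 2)      # left neighbor, clamped so left + 1 is valid
        frac = x - left                # weight of the right neighbor
        result.append(values[left] * (1 - frac) + values[left + 1] * frac)
    return result
```

For example, `normalize_length([0.0, 10.0], 5)` yields five evenly spaced values from 0 to 10, and a sequence of any length is returned as a vector with 300 entries by default.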
We trained random forests (RFs)18 for the identification of small GTPases and their assignment to specific GTPase families, using the implementation in the randomForest package of R [http://www.R-project.org]. In our application each RF consisted of 2000 randomly grown decision trees. The decision is made by majority vote: a sequence is assigned to the specified class if at least 50% of the trees vote for it.
The importance of each variable, ie, each normalized sequence position, for the correct classification can be assessed by determining the increase in misclassification rate when this variable is left out.18
For our study we performed a 30-fold leave-one-out validation procedure in order to assess the ability of each classifier to generalize to unseen sequences. We calculated the mean sensitivity (SN), specificity (SP) and accuracy (AC) as follows:
\[ \mathrm{SN} = \frac{TP}{TP + FN}, \qquad \mathrm{SP} = \frac{TN}{TN + FP}, \qquad \mathrm{AC} = \frac{TP + TN}{TP + FP + FN + TN} \]
with TP: true positives, FP: false positives, FN: false negatives and TN: true negatives.
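As a minimal sketch, the three measures can be computed from the four counts as shown below, assuming the standard definitions of sensitivity, specificity and accuracy. The function name and the example counts are hypothetical and are not results of this study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)               # fraction of positives that are recovered
    specificity = tn / (tn + fp)               # fraction of negatives that are rejected
    accuracy = (tp + tn) / (tp + fp + fn + tn)  # fraction of all samples classified correctly
    return sensitivity, specificity, accuracy
```

With 90 true positives, 5 false positives, 10 false negatives and 95 true negatives, this gives a sensitivity of 0.9, a specificity of 0.95 and an accuracy of 0.925.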
Moreover, we used Receiver Operating Characteristic (ROC) curves (Fawcett, 2006) to visualize classifier performance, and the mean area under the curve (AUC), its standard deviation (sd) and coefficient of variation (cv) to compare the classifiers. Furthermore, we report the out-of-bag (OOB) error for the best random forest.18
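The comparison statistics can be sketched as follows. The AUC is computed here through its rank interpretation, ie, the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, and the cv is taken as sd divided by the mean. Using the sample standard deviation is an assumption, as are the function names.

```python
from statistics import mean, stdev


def auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


def summarize_aucs(aucs):
    """Return (mean, sd, cv) of the AUC values from repeated validation runs."""
    m = mean(aucs)
    s = stdev(aucs)   # sample standard deviation (assumption)
    return m, s, s / m
```

The pairwise loop is quadratic in the number of samples, which is adequate for a data set of this size; a sort-based implementation would be preferable for much larger sets.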
All descriptors gave good prediction results with mean AUCs ranging from 0.9535 to 0.9934 (Figure 1). In contrast to the classification results obtained from the ANN classification in our earlier study,17 these RF classifications are highly stable with sds of the AUC distributions of about 0.0004 (cv = 0.0004).
In Figure 2, the interval for the descriptor for each position and each family is shown. Rab proteins seem to have very compact descriptor values within the sequences, whereas all other families exhibit high diversity in descriptor values.
The sensitivity, specificity and accuracy are comparable to those published in our earlier study,17 although the RFs use only one descriptor instead of two (hydrophobicity and secondary structure). The sensitivity is 94% with a corresponding specificity of 98.94%, resulting in an accuracy of 96.49%.
The OOB-error for the best RF is 3.51%. The best cutoff is found at 0.4568, which is slightly smaller compared to the results obtained from the classification with artificial neural networks in our earlier study.17 The ROC curve is shown in Figure 3. The mean AUC of the RF is 0.9934 with a standard deviation of 0.0004 (cv = 0.0004). The most important normalized sequence positions for the classification process are shown in Figure 4.
By applying a retransformation to the real sequence lengths, the 30 most important positions for discriminating between small GTPases and other proteins (>4% increase in misclassification when left out) can be reassigned to the sequence. The most important positions for the definition of a small GTPase (>15%) are displayed using the structure of Rab6A as a representative small GTPase27 (Figure 5). The C-terminal region is unstructured and is therefore not shown in the figure. The most important positions near the N-terminus are positions 20 and 21 (G and E). These amino acids are located within the highly conserved switch I region, a common structural feature of all small GTPases, and are involved in nucleotide binding.
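One way to picture the retransformation is to invert the linear mapping used for length normalization. The sketch below assumes that normalized position 1 corresponds to residue 1 and normalized position 300 to the last residue, and it rounds to the nearest residue; this is an illustration under those assumptions, not the exact procedure used to produce Figure 5, and the function name is hypothetical.

```python
def to_sequence_position(normalized_position, sequence_length, normalized_length=300):
    """Map a 1-based normalized position back to a 1-based residue position."""
    if not 1 <= normalized_position <= normalized_length:
        raise ValueError("normalized position out of range")
    if normalized_length == 1 or sequence_length == 1:
        return 1
    # Inverse of the linear mapping that stretches the sequence to normalized_length.
    scaled = (normalized_position - 1) * (sequence_length - 1) / (normalized_length - 1)
    return int(round(scaled)) + 1
```

Because several normalized positions can map to the same residue when the protein is shorter than 300 amino acids, neighboring important positions may collapse onto one residue after retransformation.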
The results for the assignment of the small GTPases to the specific families are shown in Table 2.
The RFs identified the most important positions (>4% increase in misclassification) within the protein families as follows:
In the alignment in Figure 6, the respective positions in representative human proteins are shown for all families. For Rab, Rho and Arf/Sar families, representative structures are shown in Figure 7 and the important positions are highlighted.
The important residues for Rab family assignment are concentrated within the so called Rab subfamily (RabSF) regions. These RabSF regions were defined by Pereira-Leal and Seabra and represent conserved sequence motifs within the Rab family that allow a specific subclassification of the family.28
The relevant amino acids for the assignment of the Arf/Sar family are located mainly within the N-terminal region. This is notable because, in contrast to the Rab, Rho and Ras families, which are attached to their target membrane via C-terminal geranylgeranylation or farnesylation, Arf associates with membranes via its N-terminus: the N-terminus of Arf is myristoylated and forms an amphipathic alpha helix.7
The respective residues for Ran classification are concentrated close to the C-terminus of the protein. Ran is not membrane bound and, in contrast to Rab, Rho and Ras family members, does not exhibit a cysteine-containing lipid modification motif at its C-terminus.6
For the Rho family, the residues identified by the RF are located mainly within the N-terminal region, with some amino acids spread over the sequence. Remarkably, those are all found within or adjacent to critical structural elements, as can be seen in the Rho6/Rnd1 sequence in Figure 6. Nobes et al29 report Rho6/Rnd1 to exhibit only a weak intrinsic GTPase activity and propose that it might thus be constitutively GTP bound. This might be an explanation for an unusual dispersal of important positions within the Rho6/Rnd1 sequence in structural elements that usually play a role in GTPase function. Hence, Rho6/Rnd1 might not be a good representative of the family and it might be useful to map the important residues to other Rho family members. Furthermore, Rho GTPases show differences in their primary sequence in comparison to the other families, for example the “Rho insert” (which can also be seen in the alignment in Figure 6 between beta5 and alpha4 within the G domain), which might interfere with a correct reassignment of the exact positions after interpolation.10
GTPasePred (see additional file 1) can be used to predict novel potential small GTPases. It uses the aforementioned “Normalized positional residue frequency at helix termini N1”26 descriptor to predict whether a protein sequence belongs to the superfamily of small GTPases, and subsequently, to which family it belongs. GTPasePred is implemented in Java [http://java.sun.com] and R [http://www.R-project.org], and thus requires the Java JRE 1.6 and R (with the randomForest package) to be installed.
In order to predict one or more novel potential small GTPases, simply copy the protein sequences into the file sequences and, on a Linux/Unix system, start the classification process by typing ./start in the terminal. The results are stored in the file Results.txt. On a Windows system, use start.bat to encode the sequences, start R in the current directory and type source("program"). The results of the classification process are shown on the screen.
First, each protein sequence is classified as a small GTPase or not; in the case of a positive classification, it is subsequently classified by the family RFs. The family RF with the highest probability output (positive classification) is selected (Figure 8). If this RF has an output ≥0.5, the protein sequence is assigned to the corresponding family; otherwise it is assigned as a GTPase in general and classified as “Ras or not further specified small GTPase”.
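The two-stage decision described above can be sketched as follows. The scores are assumed to be the fraction of trees voting for the positive class in each RF, and the first stage is assumed to use the same 0.5 majority threshold; the function name and the label for negative sequences are hypothetical and not taken from the GTPasePred source.

```python
FALLBACK = "Ras or not further specified small GTPase"


def classify(gtpase_score, family_scores, threshold=0.5):
    """Assign a label from the superfamily RF output and the family RF outputs.

    gtpase_score:  output of the GTPase vs nonGTPase forest (0..1)
    family_scores: dict mapping family name to the output of its forest (0..1)
    """
    # Stage 1: is the sequence a small GTPase at all?
    if gtpase_score < threshold:
        return "non-GTPase"
    # Stage 2: pick the family forest with the highest output.
    best_family = max(family_scores, key=family_scores.get)
    if family_scores[best_family] >= threshold:
        return best_family
    # No family forest is confident enough.
    return FALLBACK
```

For example, a sequence with a superfamily score of 0.9 and family scores of 0.8 (Rab), 0.1 (Rho), 0.05 (Arf/Sar) and 0.02 (Ran) would be labeled Rab, whereas the same sequence with all family scores below 0.5 would fall into the fallback group.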
The algorithm can also be applied when newly sequenced genomes are available.
The work flow is as follows:
Thus, ORF Finder can be combined with GTPasePred to identify potential novel GTPases in newly sequenced genomes. An example of our application to newly sequenced genomes can be found in additional file 2.
Taken together, the important amino acid positions for Rab, Arf/Sar and Ran family assignment that we identified using RFs represent motifs that have been described as unique features of the respective family. Hence, we take these results as evidence for the reliability of our RF based classification approach. In this paper we developed and provide a useful and reliable tool (GTPasePred) for the identification of small GTPases and for their assignment to the specific families. Furthermore, we demonstrated the application of GTPasePred to genome sequences to identify potential novel GTPases (additional file 2).
Additional file 1: GTPasePred. Available from http://www.uni-due.de/~hy0546/GTPasePred/
Additional file 2: An example of our application to newly sequenced genomes
Example: We used yeast chromosome XIII (http://www.yeastgenome.org) to identify potential small GTPases. To this end, we downloaded chromosome XIII in FASTA format and uploaded it to the ORF Finder webpage (http://www.ncbi.nlm.nih.gov/projects/gorf/). After starting the ORF search, we received the potential ORFs. We selected all ORFs with a length similar to that of small GTPases (here we selected only ORFs in the range of 600 to 630 nucleotides for demonstration purposes). We selected ten protein sequences and copied them to the file sequences. We then used GTPasePred to analyze the sequences. GTPasePred identified one Rab protein (sequence 8); all other sequences were classified as non-GTPases. We subsequently used BLAST32 in order to identify sequence 8 as Ypt7.33 Ypt7 belongs to the family of small GTPases and is a homolog of mammalian Rab7.33
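The length-based preselection of ORFs in this example can be sketched as follows, assuming the ORFs have been saved as nucleotide sequences in FASTA format. In the example above the selection was made on the ORF Finder result page; the function names here are hypothetical and only illustrate the filtering step.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping header to sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line   # sequences may span several lines
    return records


def select_by_length(records, min_length=600, max_length=630):
    """Keep only the ORFs whose nucleotide length lies within the given range."""
    return {
        header: sequence
        for header, sequence in records.items()
        if min_length <= len(sequence) <= max_length
    }
```

The selected ORFs would then be translated into protein sequences and copied into the file sequences for classification with GTPasePred.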
The authors report no conflict of interest in this work.