|Home | About | Journals | Submit | Contact Us | Français|
Cytochrome b561 (Cyt-b561) proteins are important for plant growth, development, and prevention of damage to plants from excess light under drought condition. Because of their low sequence similarity, thorough mining of Cyt-b561 proteins from plants are not easy. Currently there is only one Cyt-b561 gene in the maize, none in the soybean and twenty two in the Arabidopsis thaliana genomes, respectively. We used alignment-free protein classifiers based on partial least squares and support vector machines to identify Cyt-b561 from plant genomes. These classifiers performed better than profile hidden Markov models and PSI-BLAST in identifying Cyt-b561 related proteins from plant genomes.
Cytochrome b561 (Cyt-b561) proteins are important in plants because they transfer electrons from a soluble cytoplasmic donor (ascorbate) across a membrane bilayer to a soluble intravesicular acceptor (semidehydroascorbate). In plants, ascorbate plays a role in antioxidative defense reactions (Asard et al., 2001) , and the regeneration of ascorbate can be made possible by the transportation of electrons by Cyt-b561. Verelst and Asard (2001)  also reported that dopamine-β-hydroxylase Cyt-b561 is involved in catecholamine action in plants, which is important for plant growth and development. Furthermore, Cyt-b561 proteins prevent plants from damage due to excess light under drought condition (Nanasato et al., 2005) .
Cyt-b561 is a protein family with four well-conserved histidine (His) residues and six transmembrane helices. The four conserved His residues possibly coordinate two heme molecules. Furthermore, substrate-binding sites for ascorbate and monodehydro-ascorbate (MDHA) have been predicted (Asard et al., 2001) . The average length of Cyt-b561 proteins is 237 amino acids (aa). In addition to these single-domain Cyt-b561s (those without any additional functional domains), some multiple-domain proteins have sequence regions similar to Cyt-b561 (Cyt-b561 domains) linked to other domains such as dopamine-hydroxylase (DOH; Asard et al., 2001) , cellulose dehydrogenase, carbohydrate binding protein, protein of unknown function (DUF569), and cobalamin domains. These different domains are considered to add functions by cooperatively working with the Cyt-b561 domain. The length of multiple-domain Cyt-b561 proteins varies and their average length is 511 aa. The sequence identities among these Cyt-b561 protein/domain sequences are as low as 30% (Verelst and Asard, 2003)  reflecting their functional diversity.
Because of their low sequence similarity, these protein sequences are not easy to be aligned, hence making them more difficult to be identified from diverse plant genomes. In the current release of InterPro database (17.0; Mulder et al., 2005) , 193 Cyt-b561 sequences (including both single and multiple-domain proteins) are identified from 22 plant species. The majority of plants are represented by very small numbers of Cyt-b561 proteins: e.g., one single-domain Cyt-b561 for maize (Zea mays) and none for soybean (Glycine max). On the other hand, twenty two (seven single-domain and 15 multiple-domain) Cyt-b561 protein sequences are known from the A. thaliana genome and twenty three (eight single-domain and 15 multiple-domain) are identified from the grapevine (Vitis vinifera) genome. While there is a possibility that those plants have fewer Cyt-b561 related proteins compared to Arabidopsis and grapevine, it is more likely that currently often used methods are not capable of efficiently identifying these proteins from newly sequenced plant genomes because of low sequence similarities among these protein sequences.
The majority of currently used methods for protein classification require building alignments. Using multivariate analysis methods on various amino acid properties avoids this alignment-building process. Such alignment-free methods can be sensitive for remotely similar protein identification such as for Cyt-b561 related proteins where reliable alignments are difficult. Another advantage of using multivariate analysis methods is that they use both positive (protein of interest) and negative (unrelated protein) samples in their training. Alignment-based methods like PSI-BLAST and profile hidden Markov models (profile HMMs; used in, e.g., Pfam database; Bateman et al., 2004) , on the other hand, are trained using only positive samples since unrelated proteins cannot be included in the model alignments.
Several studies have previously compared the performance of profile HMMs with alignment free methods for protein classifications. Karchin et al., (2002)  used support vector machines (SVMs) with the Fisher kernel for classifying G-protein coupled receptor (GPCR) subfamilies and reported better performance than profile HMMs. Liao and Noble (2003)  used SVMs with descriptors based on pairwise similarity scores, which also performed better than profile HMMs for discriminating SCOP protein families (Andreeva et al., 2004) . Better performance than profile HMMs was further shown by Kim et al. (2000)  and Moriyama and Kim (2005)  with GPCR classifiers based on parametric and non-parametric discriminant functions. Various alignment-free descriptors and kernels have been successfully used with superior performance especially for identifying remotely similar protein families. Examples include: mismatch string kernel used by Leslie et al., (2003) ; dipeptide composition used with SVMs by Bhasin and Raghava, (2005) ; n-gram frequencies used with decision tree and nave Bayes classifiers by Cheng et al. (2005) ; and self-organizing maps used by Otaki et al., (2006) .
In Opiyo and Moriyama (2007) , we tested partial least squares regression (PLS) methods using phyisco-chemical properties of amino acids for identifying Cyt-b561 proteins from short expressed sequence tags (ESTs) derived from the A. thaliana genome. The PLS methods successfully identified Cyt-b561 EST sequences that profile HMMs and PSI-BLAST could not identify. In our more recent study (Opiyo and Moriyama, 2009) , we used PLS classifiers for identification of single-domain and multiple-domain cyclophilins from the Arabidopsis and rice genomes. PLS classifiers again performed better than profile HMMs and PSI-BLAST. In this study, our objective is to further examine PLS multivariate classifiers for identifying Cyt-b561 related proteins and to mine these proteins from diverse plant genomes (A. thaliana, grapevine, soybean, and maize).
Three hundred Cyt-b561 sequences (140 from plants, 130 from animals, 30 from fungi) were downloaded from InterPro (Release 17.0; Mulder et al., 2005 ). Table 1 shows the number of sequences from plants, animals, and fungi included in our Cyt-b561 datasets. Two hundred sequences were single-domain Cyt-b561 (70 from plants, 110 from animals, and 20 from fungi), and one hundred sequences were multiple-domain Cyt-b561 (70 from plants, 20 from animals, and 10 from fungi). Plant sequence data included all Cyt-b561 proteins from the A. thaliana and grapevine genomes. Other plant species included maize, rice, watermelon, banana, sugar beet, etc.
In order to search sequences similar to Cyt-b561 both from single and multiple domain proteins, we used all 300 Cyt-b561 protein sequences in the positive dataset including both of single-domain Cyt-b561 proteins as well as Cyt-b561-domain regions from multiple-domain proteins. Three hundred Non-Cyt-b561 proteins longer than 100 amino acids were randomly sampled from Swiss-prot database (Boeckmann et al., 2003)  and used as the negative samples.
Cyt-b561 related proteins were mined from four plant genomes. Their entire protein sequences were downloaded from the following sources:
For alignment-based classifiers (profile HMMs and PSI-BLAST), only positive (Cyt-b561) samples were included in the training datasets. For alignment-free classifiers, both of positive and negative samples were used for training. For mining the plant genomes, the entire 600 sequences were used for training classifiers. As mentioned above, only 300 positive sequences were used for training alignment-based classifiers.
From each protein sequence, frequencies of 20 amino acids were calculated. Strope and Moriyama (2007)  applied SVMs with amino acid composition for GPCR classification problems, and showed that such classifiers outperformed profile HMMs and decision trees classifiers for discriminating GPCRs from non-GPCRs. In this study, amino acid composition was used as descriptors for an SVM classifier (SVM-AA).
Dipeptide composition represents all 400 frequencies of consecutive amino acid pairs in a protein sequence and corresponds to a 400 (20 × 20) feature vector. It can encapsulate information on composition of amino acids as well as their local order. We used dipeptide composition as descriptors for an SVM classifier (SVM-DIP).
In Opiyo and Moriyama (2007) , we developed five descriptors (PC1-PC5) using the principal component analysis (PCA) of 12 physico-chemical properties of amino acids (mass, volume, surface area, hydrophilicity, hydrophobicity, isoelectric point, transfer of energy solvent to water, refractivity, non-polar surface area, and frequencies of alpha-helix, beta-sheet, and reverse turn). The five principal components (PCs) selected explained 93.2% of the total variance. The first principal component (PC1) covered 40.4% of the total variance of the original 12 physico-chemical properties. It represented all properties except isoelectric point and non-polar surface. PC2 (28% of the total variance) had negative relationships with the hydrophobicity properties. PC3 and PC5 had 15% of the combined total variance, and represented secondary structure properties. PC4 (8.9% of the total variance) represented isoelectric point and volume. We used the same five descriptors in (Opiyo and Moriyama, 2009)  for classifying cyclophilin proteins. The same descriptors were used in this study with PLS and SVM classifiers.
Auto/cross covariance (ACC) transformation method discussed in (Opiyo and Moriyama, 2007)  was used to transform each amino acid sequence using the five physicochemical property based descriptor set (PC1-PC5). ACC with the maximum lag of 30 residues yielded 775 descriptors for each sequence. The calculation of ACC was performed using the R package (version 2.60; R Development Core Team, 2007 ).
In Opiyo and Moriyama (2007) , we reported that the PLS classifier using descriptors transformed by ACC had high false positive rates. Our subsequent study showed that reducing the number of descriptors (690 for single-domain and 647 for multiple-domain cyclophilins) by selecting significant descriptors by the t-test decreased the number of false positives (Opiyo and Moriyama, 2009) . In this study, similarly, the t-test was used to select descriptors that showed significant difference between Cytb-561 and Non-Cyt-b561 in training datasets at the alpha level of 0.01. From the 775 ACC descriptors, 459 were selected. These descriptors are available in Supplementary Table 1 at: http://bioinfolab.unl.edu/emlab/Cyt-b561/
Partial least squares (PLS; Geladi and Kowalski, 1986)  is a projection method similar to PCA where the independent variables, represented as the matrix X, are projected onto a low dimensional space. PLS uses both independent variables X (sequence descriptors such as amino acid composition) and dependent variables Y (positive or negative label). PLS using descriptors transformed by ACC (PLS-ACC) was used in (Opiyo and Moriyama, 2007 ) for GPCR classification and in (Opiyo and Moriyama, 2009)  for cyclophilin classification. As in (Opiyo and Moriyama, 2009) , we also included PLS with ACC descriptors selected by the t-test (PLS-T_ ACC). The cut-off point for PLS-T_ACC classification was chosen as 0.562 based on the minimum error point (MEP; Karchin et al., 2002 ). PLS analysis was performed using an R implementation, the PLS package developed by Wehrens and Mevik version 1.2.1 (Wehrens and Mevik, 2006 ).
Support vector machines (SVMs; Vapnik, 1999 ) learn to separate a set of labeled training data by remapping them in a high-dimensional space and by discovering a hyperplane that separates the two classes in this space. The hyperplane is optimized in such a way that the distance, called margin, between the hyperplane and the closest training example is maximized. Support vectors are those data points that define the margin. Once the hyperplane is found, predicting the label of a new, unlabeled data point involves determining on which side of the hyperplane that point lies. SVMs use kernel functions to represent data. The kernel function defines similarities between remapped data points. We used the radial basis kernel represented by the equation (1) (Cristianini and Shawe-Taylor, 2000 ):
where x and y are input vectors, and γ is a parameter. In this study, SVMs were used with the following descriptors: five physico-chemical property based descriptors selected by t-test after ACC transformation (SVM-T_ACC), amino acid compositions (SVM-AA), and dipeptide compositions (SVM-DIP). One advantage of SVMs is that it can be used to classify linearly separable data as well as nonlinearly separable data. This is because SVMs employ kernel functions. One disadvantage of SVMs, on the other hand, is that their performance is closely tied to the choice of optimal kernel function parameters. We performed the grid-search of the optimal set of parameters for the radial basis kernel function with 10-fold cross-validation analysis using the training dataset (including 300 Cyt-b561 and 300 Non-Cyt-b561). Table 2 shows the results of the selected optimal parameters for the radial basis kernel functions used with SVM-T_ACC, SVM-AA, and SVM-DIP. SVM-T_ACC had the lowest error rate (3.8%). All SVM analyses were performed using R implementation (version 2.60: http//www.R-project.org).
In the regular use of PSI-BLAST (Altschul et al., 1997 ), position-specific scoring matrices (PSSMs) are built from multiple alignments of significantly similar sequences obtained by similarity search. In this study, multiple alignments of positive (Cyt-b561) sequences were generated using Mafft version 6.5 (Kazutaka and Hiroyuki, 2008 ) with the default parameters. This multiple alignments were used to build the PSSMs and for the search. The cut-off E-value of 1.12 was obtained using MEP. For the E-value calculations, regardless of the actual dataset size, a constant sample size of 137,000 (based on the number of predicted proteins in the maize genome) was used for performance comparisons.
Profile HMMs are full probabilistic representation of sequence profiles (Durbin et al., 1998 ). As mentioned before, profile HMMs are built using only positive samples as is the case with PSI-BLAST. In this study, profiles HMMs were built using the w0.5 script of the Sequence Alignment and Modeling Software System (SAM version 3.5; Hughey and Krogh, 1996 ). The cut-off E-value of 1.80 was obtained based on the MEP. The E-values were also calculated using the constant sample size of 137,000.
Ten-fold cross-validation analysis was done for examining classifier performance. The 300 positive (Cyt-b561) sequences were randomly partitioned into 10 subsamples. The same procedure was applied to 300 negative (Non-Cyt-b561) sequences. Combining positive and negative data, each subsample dataset included 60 sequences. Of the 10 subsamples, a single subsample was retained as the validation dataset for testing the classifiers, and the remaining 9 subsamples were used for training. The cross-validation process was then repeated, with each of the 10 subsamples being used exactly once as the validation data. The ten testing results were combined to calculate performance statistics. The advantage of this method is that all samples were used for both training and validation.
Predictions were grouped as follows:
Performance statistics were calculated as follows:
Table 3 shows the classification test results of Cyt-b561 proteins from ten-fold cross validation analysis. PLS-T_ACC and SVM-T_ACC outperformed other classifiers as indicated by their higher accuracy rates and higher MCC values. The false positive rate of PLS-T_ACC was the lowest among the four alignment-free classifiers. While SAM and PSI-BLAST had lower false positive rates than alignment-free classifiers, these two classifiers showed significantly higher false negative rates. Tables 4 and and55 show the classification test results for single-domain and multiple-domain Cyt-b561 proteins separately. The results show that the classifiers performed consistently for both single-and multiple-domain Cyt-b561 classifications. Difference in performance is more pronounced for multiple-domain Cyt-b561 classification with significantly higher false negative rates with SAM and PSI-BLAST, likely due to higher sequence divergence among multiple-domain Cyt-b561 sequences.
Figure 1 shows the numbers of true positives identified by the two best alignment-free classifiers (PLS-T_ACC and SVM-T_ACC) and the two alignment-based classifiers (PSI-BLAST and SAM). The majority of the positive sequences (260 of 300) were correctly identified by all four classifiers. Most of these 260 sequences were animal Cyt-b561 proteins. Thirty other Cyt-b561 proteins, mostly from plants, were identified only by PLS-T_ACC and/or SVM-T_ACC. PLS-T_ACC and SVM-T_ACC had higher false positive rates than SAM and PSI-BLAST. As shown in Figure 2, four Non-Cyt-b561 sequences were commonly identified by all four classifiers as false positives. Three of these four sequences in fact included DOH (or DOMON) domains. It would be interesting to see if these sequences contain unannotated weakly similar to Cyt-b561 sequences.
SVM-T_ACC had the highest number of false positives that were not misidentified by any other classifiers. As shown in Figure 3, PLS-T_ACC and SVM-T_ACC missed the same ten Cyt-b561 sequences (Six from plants, two from animals and two from fungi). All four classifiers misidentified five of these ten Cyt-b561 proteins. Four of these five were multiple-domain Cyt-b561 (three from plants and one from fungi), and one was a plant single-domain Cyt-b561. PSI-BLAST and SAM missed commonly 24 other Cyt-b561 sequences. Most of them were multiple-domain Cyt-b561 from plants. Selection of significant and reduced numbers of descriptors after ACC transformation appeared to have contributed to higher accuracy and higher MCC values observed in PLS-T_ACC and SVM-T_ACC classifiers. It did not affect the sensitivity of the two classifiers as shown by low % false negative in classifying Cyt-b561 proteins. SVMs trained using amino acid and dipeptide compositions missed some of the Cyt-b561 proteins as indicated by higher false negative rates compared to PLS-T_ACC and SVM-T_ACC. These results indicate that reduced ACC descriptors are better for classifying Cyt-b561 proteins compared to amino acid and dipeptide compositions.
All the classifiers performed better in identifying Cyt-b561 sequences from animals. SAM and PSI-BLAST missed most of the multiple-domain Cyt-b561 proteins from plants. This can be explained by higher divergence among plant Cyt-b561 proteins especially among multiple-domain Cyt-b561 proteins. Sequence identity of Cty-b561 proteins among plants compared to that among animals is lower (22% among plant proteins compared to 35% among animal proteins). This observation is consistent with Cyt-b561 proteins in plants being mostly in multiple-domain forms while animal Cyt-b561 being single-domain proteins. For example, in the mouse genome, there are eleven single-domain Cyt-b561 proteins and a single multiple-domain protein, and in the human genome, there are six single-domain Cyt-b561 proteins and a single multiple-domain Cyt-b561 (based on InterPro release 17.0). On the other hand, there are seven single-domain and fifteen multiple-domain Cyt-b561s in Arabidopsis, and eight single-domain and fifteen multiple-domain Cyt-b561s found in the grapevine genome. It indicates that the alignment-based classifiers, SAM and PSI-BLAST, would miss many Cyt-b561 proteins from plants.
We used the two best alignment-free classifiers (PLS-T_ACC and SVM-T_ACC) as well as the two alignment-based methods (SAM and PSI-BLAST) to mine Cyt-b561 proteins from four plant genomes. Table 6 summarizes the results. From the A. thaliana genome, PLS-T_ACC and SVM-T_ACC predicted 311 and 659 proteins from A. thaliana genome as Cyt-b561 proteins, respectively, while SAM and PSI-BLAST predicted 37 and 29 sequences, respectively, as Cyt-b561 proteins. Twenty four sequences were predicted as Cyt-b561 proteins from the A. thaliana genome by all the four classifiers. It includes all 22 Cyt-b561 proteins identified in InterPro and the five Cyt-b561 proteins annotated in the Arabidopsis genome project. All the proteins identified by the four classifiers from the Arabidopsis genome are presented in Supplementary Table 2 (available at: http://bioinfolab.unl.edu/emlab/Cyt-b561/).
From the grapevine, maize, and soybean genomes, the numbers of Cyt-b561 protein candidates identified were comparable to those from the Arabidopsis genome for all four classifiers (Table 6). This implies that even though very few number of Cyt-b561 proteins are currently known from many plants, more Cyt-b561 protein candidates are yet to be identified. Supplementary Tables 3, 4, and 5 present all Cyt-5561 protein candidates identified by the four classifiers from grapevine, maize and soybean genomes, respectively (available at: http://bioinfolab.unl.edu/emlab/Cyt-b561/).
As expected, SAM and PSI-BLAST identified fewer Cyt-b561 protein candidates from all genomes. PLS-T_ACC identified fewer Cyt-b561 protein candidates from all genomes compared to SVM-T_ACC. Based on our performance test results; it is likely that some of the Cyt-b561 protein candidates identified by SVM-T_ACC are false positives. Although PLS-T_ACC prediction would also include false positives, this classifier as well as SVM-T_ACC should miss much fewer true positives compared to SAM and PSI-BLAST.
In this study, we showed that alignment-based methods, SAM and PSI-BLAST, are too conservative when they are used to search highly divergent proteins as Cyt-b561. We also showed that reduced ACC descriptors are better than amino acid composition as well as dipeptide composition for identifying Cyt-b561 proteins. In conclusion, PLS-T_ACC classifier will be useful for identifying new Cyt-b561 candidates from diverse plant genomes as they become available.
This work was in part supported by Grant Number R01LM009219 from the National Library of Medicine to ENM.
Stephen O. Opiyo is a postdoctoral fellow in the School of Biological Sciences, University of Nebraska-Lincoln.
Etsuko N. Moriyama, PhD., is an Associate Professor at the School of Biological Sciences and Center for Plant Science Innovation, University of Nebraska-Lincoln.
S. O. Opiyo, School of Biological Sciences, University of Nebraska-Lincoln, NE 68588.
E. N. Moriyama, School of Biological Sciences and Center for Plant Science Innovation, University of Nebraska-Lincoln, 403 Manter Hall, Lincoln, NE 68588-0118, USA.