7TMR protein mining from the Arabidopsis thaliana genome
7TMR proteins form the largest receptor superfamily in vertebrates and other metazoans (e.g., ~800 in human, ~1,000 in Caenorhabditis elegans) [29]. However, few 7TMR candidates have been reported in plants and fungi. Only 22 candidate Arabidopsis 7TMRs have been described to date [55] (a more recent review is given by Moriyama and Opiyo, in press [65]). We explored the possibility of finding more divergent groups of 7TMR candidates from the A. thaliana genome using both alignment-free and alignment-based methods [14]. For the 7TMRmine server, we updated all classifiers using a larger training dataset and added new classifiers (SAM1, SAM2, GPCRHMM, and Phobius). The server also includes a newer release of the A. thaliana genome (TAIR8; 32,690 proteins excluding those shorter than 35 amino acids; 27,066 proteins further excluding predicted alternative-splicing products).
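The two protein counts above correspond to two filtering steps: a length cutoff and the removal of alternative-splicing products. The following Python sketch illustrates one way such a filter could work. It assumes TAIR-style identifiers such as AT4G27630.1, where the suffix after the dot is the gene-model number, and it assumes that the lowest-numbered gene model represents each locus. That choice of representative is an assumption made for illustration; the text does not say how alternative-splicing products were excluded. The function names are hypothetical and are not part of 7TMRmine.

```python
MIN_LENGTH = 35  # length cutoff stated in the text


def read_fasta(lines):
    """Yield (identifier, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # The identifier is the first whitespace-delimited token.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def filter_proteins(records, min_length=MIN_LENGTH):
    """Return (all_kept, one_per_locus) dictionaries of id -> sequence.

    all_kept: every protein with at least min_length residues.
    one_per_locus: ASSUMPTION - the lowest-numbered gene model of each
    locus stands in for "excluding alternative-splicing products".
    """
    all_kept = {}
    best_model = {}  # locus -> (model number, identifier)
    for name, seq in records:
        if len(seq) < min_length:
            continue
        all_kept[name] = seq
        locus, _, model = name.partition(".")
        number = int(model) if model.isdigit() else 0
        if locus not in best_model or number < best_model[locus][0]:
            best_model[locus] = (number, name)
    one_per_locus = {name: all_kept[name] for _, name in best_model.values()}
    return all_kept, one_per_locus
```

Applied to a genome release, the sizes of the two returned dictionaries would play the roles of the two counts quoted above (with and without predicted alternative-splicing products).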
Table 1 summarizes the results obtained from the classifiers based on profile HMMs and TM-prediction methods. GPCRHMM predicted 39 proteins (46 including predicted alternative-splicing products) as 7TMR candidates. In
A. thaliana, currently 22 (27 including predicted alternative-splicing products) are known to be 7TMRs: 15 MLOs (19 including predicted alternative-splicing products), G-protein-coupled receptor 1 (GCR1),
Arabidopsis thaliana regulator of G-protein signaling 1 (AtRGS1), and five heptahelical transmembrane proteins (HHPs; 6 including predicted alternative-splicing products). GCR1 and AtRGS1 are known to directly interact with the plant Gα subunit GPA1 [
56]. AtRGS1 is a putative membrane receptor for D-glucose and also functions as a GTPase-activating protein for AtGPA1 [
57]. Two proteins, GTG1 and GTG2 (four proteins including predicted alternative-splicing products; [
58]), were claimed to be plant GPCRs based on co-immunoprecipitation of AtGPA1 with these membrane proteins. However, GTG1/GTG2 are treated separately here as their animal homologues are reported to be likely channel proteins with no topological similarity to GPCRs [
59]. Of the 22 known 7TMR proteins in
A. thaliana, GPCRHMM recognized only GCR1 as a candidate. The AtRGS1 protein contains an RGS domain (120 amino acids) attached to the 7-TM region. As also described by Gookin et al. [15], GPCRHMM does not recognize AtRGS1 as a 7TMR protein unless the C-terminal RGS domain is removed. As expected, none of the MLOs or HHPs was identified by GPCRHMM because, as mentioned before, the training dataset used for GPCRHMM excluded such extremely diverged proteins [27]. The SAM classifiers, on the other hand, were trained on a dataset that included a wider range of 7TMR proteins. Thus both SAM1 and SAM2 correctly identified all 15 MLOs (19 including alternative-splicing products) as well as GCR1. However, even after the RGS domain sequence was removed, the SAM classifiers did not identify AtRGS1 as positive; only GCR1 was identified as positive by both SAM2 and GPCRHMM.
Table 1. Number of 7TMR candidates predicted from 27,066 A. thaliana proteins.
By using either Phobius or HMMTOP, ~200 of 27,066
A. thaliana proteins (or ~250 of 32,690 including alternative-splicing products) were predicted to have exactly seven TM-regions. 103 proteins (134 including alternative-splicing products) were predicted to be 7-TM proteins by both methods. The 22 (or 27 including alternative-splicing products) known
A. thaliana 7TMR proteins were predicted to have between six and eight TM-regions by Phobius and between seven and ten by HMMTOP. Only 11 of the 22 proteins (or 13 of 27 including alternative-splicing products) are predicted to have exactly seven TM-regions by both methods. Note that GTG1 and GTG2 are predicted to have eight or nine TM-regions (one of the two GTG2 alternative-splicing products, AT4G27630.1, is predicted to have only five TM-regions by both methods). Of the 27,066 A. thaliana proteins, 969 have between five and ten TM-regions by both methods. The range "5–10TMs" (by HMMTOP) was also used by Moriyama et al. [14] because it gave the best coverage of the entire GPCR dataset in the hierarchical classification.
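The counts in this paragraph come from comparing, for each protein, the number of TM-regions predicted by the two methods. A minimal Python sketch of that comparison follows. It assumes the per-protein TM counts have already been parsed into dictionaries; the function names, the toy identifiers, and the toy counts are illustrative only and do not come from 7TMRmine, Phobius, or HMMTOP.

```python
def tm_in_range(tm_counts, low, high):
    """Return the proteins whose predicted TM count lies in [low, high]."""
    return {pid for pid, n in tm_counts.items() if low <= n <= high}


def agree_in_range(phobius, hmmtop, low, high):
    """Proteins placed in the TM range by both predictors (intersection)."""
    return tm_in_range(phobius, low, high) & tm_in_range(hmmtop, low, high)


def either_in_range(phobius, hmmtop, low, high):
    """Proteins placed in the TM range by at least one predictor (union)."""
    return tm_in_range(phobius, low, high) | tm_in_range(hmmtop, low, high)


# Toy, made-up TM counts; real values would be parsed from predictor output.
phobius = {"protA": 7, "protB": 6, "protC": 0, "protD": 8}
hmmtop = {"protA": 7, "protB": 7, "protC": 1, "protD": 10}

exactly_seven_both = agree_in_range(phobius, hmmtop, 7, 7)
exactly_seven_either = either_in_range(phobius, hmmtop, 7, 7)
five_to_ten_both = agree_in_range(phobius, hmmtop, 5, 10)
```

In this toy input, only protA has exactly seven TM-regions by both methods, whereas protA, protB, and protD all fall within the 5–10 range by both methods, which mirrors why the wider range retains more candidates.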
Figure shows an example of hierarchical classification of the
A. thaliana genome. Four hierarchical levels were generated (Figure ). The first level included six alignment-free classifiers chosen in our previous study [14] ("6 class" in Additional file 1). Taking the intersection of all these classifier results ('AND' logic), 952 proteins were identified as 7TMR candidates (positives). At the second level, both TM-prediction methods were chosen with the 5–10 TM option (with no N-terminal preference). Of the 952 proteins identified at the first level, 562 remained positive. Applying a stricter option at the third level, exactly seven TMs by both methods, yielded 100 7TMR candidates. When the SAM2 and GPCRHMM options were used at the final level, only 10 proteins were identified as positives by each of these methods. As shown in Table , as few as 50% of currently known A. thaliana 7TMRs are predicted to have exactly seven TM-regions, so this requirement seems excessively strict. When it was removed (Figure ), SAM2 identified 20 positives, which included all known MLOs and GCR1. GPCRHMM, on the other hand, identified 37 positives, including only one known 7TMR (GCR1). The positive set predicted by either SAM2 or GPCRHMM (the union set) included 56 proteins (Figure ). One can easily change the level-2 options to restrict the TM range. For example, using 6–10 TMs gave 487 positives with no effect on the SAM2 and GPCRHMM results (20 and 37 positives, respectively). With 7–8 TMs, 156 proteins (132 after excluding alternative transcripts) were identified (see Additional file 3 for the list). This list included all 16 high-ranking 7TMR candidates reported by Gookin et al. [15] as well as 15 of the 22 known 7TMRs (or 18 of 27 including predicted splicing alternatives). Seven known 7TMRs (six MLOs and one HHP; nine including predicted splicing alternatives) were excluded from this list because their numbers of TM regions did not fall within the chosen range. Neither GTG1 nor GTG2 (including all four predicted splicing alternatives) was included in this list, since one or both TM-prediction methods predicted eight or nine TM-regions in GTG1 and GTG2 (one splice form of GTG2 has only five TM-regions). However, GTG1 was positively identified by all six classifiers and can be identified as a 7TMR candidate if the TM-number requirement is relaxed to between 7 and 9.
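The hierarchical classification described above amounts to combining sets of positives within a level ('AND' as intersection, 'OR' as union) and then narrowing the surviving set level by level. The Python sketch below shows this logic under stated assumptions: each classifier result is represented as a set of protein identifiers, and the function names and toy data are hypothetical. It is not the 7TMRmine implementation.

```python
def combine(positive_sets, logic="AND"):
    """Combine classifier results within one level.

    'AND' keeps proteins positive in every classifier (intersection);
    'OR' keeps proteins positive in at least one classifier (union).
    """
    sets = [set(s) for s in positive_sets]
    if not sets:
        raise ValueError("at least one classifier result is required")
    if logic == "AND":
        return set.intersection(*sets)
    if logic == "OR":
        return set.union(*sets)
    raise ValueError(f"unknown logic: {logic!r}")


def hierarchical_filter(levels):
    """Apply levels in order; each level is (logic, [positive sets]).

    Returns a list with the surviving set after each level.
    """
    survivors = None
    history = []
    for logic, positive_sets in levels:
        level_positives = combine(positive_sets, logic)
        if survivors is None:
            survivors = level_positives
        else:
            survivors = survivors & level_positives
        history.append(survivors)
    return history


# Toy, made-up results for three levels.
alignment_free = [{"p1", "p2", "p3", "p4"}, {"p1", "p2", "p3", "p5"}]
tm_5_to_10 = [{"p1", "p2", "p3"}, {"p1", "p2", "p3", "p4"}]
sam2, gpcrhmm = {"p1", "p6"}, {"p2"}

history = hierarchical_filter([
    ("AND", alignment_free),      # level 1: all alignment-free classifiers
    ("AND", tm_5_to_10),          # level 2: both TM-prediction methods
    ("OR", [sam2, gpcrhmm]),      # level 3: either profile-HMM classifier
])
sizes = [len(s) for s in history]
```

Placing the permissive criteria first and the strict ones last, as in the example in the text, makes the per-level sizes show how quickly the candidate set shrinks.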
As shown in this example, users can choose classifiers in any combination and in any number of levels (currently up to six) to create their own hierarchical filtering system. By applying less strict methods at earlier levels and stricter methods at later levels, the 7TMRmine Web server facilitates prioritization of the 7TMR candidate set and generation of a protein set of manageable size for further investigation. The union and intersection of positive or negative sets can be easily obtained, as shown in Figure . Figure shows an example of the list of all classifier prediction results. Protein sequences as well as the classification results can be downloaded from this page for further analysis. For example, protein sequences can be submitted to GPCR classification tools such as GPCRsIdentifier [60], GPCRsclass and GPCRpred [31, 61], and GPCRTree [62] for further family classification.
Distribution of transmembrane proteins among eukaryotic genomes
Using 7TMRmine, we examined the distribution of transmembrane proteins among various eukaryotes. The server currently has classification results from 68 organisms across the major eukaryotic phyla: 10 land plants (including 1 moss and 1 fern), 8 green algae, 2 diatoms, 14 fungi, 6 vertebrates, 1 urochordate, 1 cephalochordate, 1 echinoderm, 7 arthropods, 1 nematode, 2 annelids, 1 mollusc, 1 cnidarian, 1 placozoan, and 11 protists (including 1 red alga, 1 choanoflagellate, and 2 Dictyostelium species). From each genome, proteins shorter than 35 amino acids and proteins in which unidentified residues (letters other than the 20 standard amino-acid codes, most often 'X') make up more than 30% of the length are excluded. The summary statistics are shown on the "TM/7TMR Mining Summary Statistics" page (Figure ). As mentioned in the earlier section, Phobius predicts fewer TM proteins than HMMTOP. The proportion of TM proteins among all proteins encoded by a genome was uniform across organisms: 20–25% by Phobius and ~40% by HMMTOP. On the "Transmembrane Protein Prediction Statistics" page (Figure ), one can compare the numbers of proteins predicted to have certain numbers of TM regions among different organismal groups. When we compared the TM-prediction results of Phobius with those of HMMTOP, the majority of differences were found in the numbers of 1TM proteins (Figure , red) and 2–4TM proteins (Figure , orange). In all organisms, these two groups of TM proteins were predicted about twice as often by HMMTOP as by Phobius, which results in a reduced number of non-TM (0TM) proteins in the HMMTOP prediction (Figure , light blue). A more detailed comparison for each species is presented in histograms (clicking anywhere on the pie charts on the Web page brings the user to the detailed statistics page for the corresponding organism; Figure also shows the histograms for the Phobius prediction only).
In comparing the histograms of TM numbers predicted by Phobius and HMMTOP, one finds that 2-, 3-, and 4-TM proteins are all over-represented in the HMMTOP predictions, contributing to the increased number of 2–4TM proteins predicted by HMMTOP in Figure (shown in orange). Proteins with higher numbers of TMs also show consistent but much smaller differences between Phobius and HMMTOP. Further examination showed that among the 7,175 A. thaliana proteins predicted as non-TM by Phobius and TM by HMMTOP (0, >0), 2,847 (39.7%) were predicted to have signal peptides by Phobius. Among the 18,221 proteins predicted to be non-TM by both methods (0, 0), only 1,177 (6.5%) were predicted to have signal peptides by Phobius. This observation indicates that Phobius takes advantage of signal-peptide prediction to avoid misidentifying signal-peptide regions as TM regions. Proteins predicted to have no TM by both methods (0, 0) constitute about 60% of the proteins in each eukaryotic genome examined; they are most likely truly non-TM proteins. The maximum proportion of non-TM proteins could be ~80% (Figure , light blue).
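The signal-peptide comparison above is a cross-tabulation: proteins are grouped by whether each method predicts any TM-region, and the fraction with a predicted signal peptide is computed per group. The Python sketch below illustrates the calculation and reproduces the two percentages from the counts reported in the text. The record layout (Phobius TM count, HMMTOP TM count, signal-peptide flag) and the function names are assumptions made for illustration.

```python
from collections import defaultdict


def signal_peptide_fraction(records):
    """Group proteins by (Phobius has TM, HMMTOP has TM) and return the
    fraction of each group with a predicted signal peptide.

    records: iterable of (phobius_tm, hmmtop_tm, has_signal_peptide).
    """
    totals = defaultdict(int)
    with_sp = defaultdict(int)
    for phobius_tm, hmmtop_tm, has_sp in records:
        key = (phobius_tm > 0, hmmtop_tm > 0)
        totals[key] += 1
        with_sp[key] += 1 if has_sp else 0
    return {key: with_sp[key] / totals[key] for key in totals}


def percent(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Counts reported in the text for A. thaliana.
disagree_sp = percent(2847, 7175)    # non-TM by Phobius, TM by HMMTOP
agree_sp = percent(1177, 18221)      # non-TM by both methods
```

With the reported counts, `disagree_sp` evaluates to 39.7 and `agree_sp` to 6.5, matching the percentages quoted above.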
Distributions of TM proteins among four representative organismal groups are compared in Figure . While the six vertebrates have a greater representation of 7TM proteins among those with multiple TM regions, the urochordate (Ciona intestinalis) and the cephalochordate (Branchiostoma floridae) have much smaller numbers of 7TM proteins than the vertebrates (Figure ). This is consistent with 7TMRs forming the largest receptor superfamily in many vertebrates. Among the other metazoa including protostomes (six insects, Daphnia pulex, C. elegans, two annelids, and one mollusc, as well as Nematostella vectensis and Trichoplax adhaerens), C. elegans shows a markedly higher number of 7TM proteins, the largest among the 68 organisms, accounting for almost 7% of its genome (Figure ). The majority of these C. elegans 7TM proteins are chemoreceptors [3, 63]. It is also interesting to note that two basal metazoa, N. vectensis (cnidaria) and T. adhaerens (placozoa), have greater representation of 7TM proteins than protostomes. On the other hand, plants and protists show no such over-representation of 7TM proteins. Among fungi, there appears to be species-specific over-representation of 7TM proteins in Encephalitozoon cuniculi, an animal pathogen with the smallest genome among eukaryotes [64]. Of its 1,996 proteins, 91 (more than 4% of the genome) are predicted to have seven TM-regions by either Phobius or HMMTOP. Considering that other fungal genomes have fewer than 2% predicted 7TM proteins (e.g., 126 out of 9,838 Neurospora crassa proteins) and that E. cuniculi has a reduced gene set adapted to its parasitic lifestyle, this over-representation of 7TM proteins is notable.
Distribution of 7TMR proteins among eukaryotic genomes
The "TM/7TMR Mining Summary Statistics" page also summarizes the distribution of 7TMR protein candidates among eukaryotes (Figure ). Clearly, 7TMR proteins are under-represented in plants, fungi, and protists. For each organismal group, classification results are summarized using Venn diagrams (Figure ; Venn diagrams for all species are presented on the website). The positives obtained by SAM2 and GPCRHMM overlap very little for plant, fungal, and protist proteins (with the exception of D. discoideum). This result indicates that using only GPCRHMM, which is not trained for the largest plant 7TMR family (MLO), would omit many 7TMR candidates from these organisms. In contrast, but as expected, the predictions by these two classifiers overlap substantially for deuterostomes. As described earlier, GPCRHMM is trained to identify canonical GPCRs obtained from these organisms. C. elegans of the "protostome" group and D. discoideum of the "protist" group show prediction patterns similar to those of deuterostomes. This is because chemoreceptors from C. elegans and cyclic AMP receptors from D. discoideum, while divergent, are more closely related to vertebrate types of 7TMRs, and the GPCRHMM training set included these sequences. On the other hand, insect odorant receptors (ORs) are not included in the training set of GPCRHMM. Therefore, it is not surprising that GPCRHMM does not find the 60 ORs found in D. melanogaster. Drosophila ORs are included in the 139 proteins recognized by both the six classifiers and SAM2 but not by GPCRHMM (Figure ). Gustatory receptors of D. melanogaster, which are similarly divergent insect chemoreceptors, are also included in this protein set.
7TMR candidates in the A. thaliana, rice, and poplar genomes
As described earlier, from the A. thaliana genome, the 16 high-ranking proteins identified by Gookin et al. [15] as well as 15 of the 22 known 7TMRs are found among the 132 proteins (156 including predicted alternative-splice forms) obtained from the intersection of the "6 classifiers" AND "7–8 TM" predictions (see Venn diagrams for A. thaliana in Figure ). Of the remaining seven known 7TMRs, all six MLOs are included in the 49 proteins (57 including predicted alternative-splice forms) obtained from the intersection of "5–10 TM" AND "SAM2+GPCRHMM" (Venn diagrams including "5–10 TM" are available on the website). The remaining HHP5 as well as GTG1 are predicted as positives by both "5–10 TM" and "6 classifiers" but by neither GPCRHMM nor SAM2. GTG2 is not predicted by "6 classifiers" because PLS-ACC does not identify it as positive. Based on these results, we consider the 162 proteins (excluding predicted alternative-splicing forms; obtained by combining the 132 proteins identified by both "6 classifiers" AND "7–8 TM" with the 49 proteins identified by both "SAM2+GPCRHMM" AND "5–10 TM") to be the most likely 7TMR candidates from the A. thaliana genome (see Additional file 3). Similar lists generated for Oryza sativa (rice) and Populus trichocarpa (California poplar) include 84 and 153 candidates, respectively (see Additional files 4 and 5). The high-ranking protein sets identified by Gookin et al. [15] included 13 rice and 20 poplar proteins. Of their rice GPCR candidates, six proteins are included in our intersection set of "7–8 TM" AND "6 classifiers", and two are included in the intersection set of "5–10 TM" AND "SAM2+GPCRHMM". Two of the remaining five proteins are included in the intersection set of "5–10 TM" AND "6 classifiers". The other three are not identified by any of these criteria because of negative predictions by SVM-AA (all three proteins) and SVM-di (one protein). Among the 20 poplar GPCR candidates reported by Gookin et al. [15], 17 are included in our intersection set of "7–8 TM" AND "6 classifiers". Of the three proteins not included in our list, two are predicted to be negatives by SVM-AA.
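The size of the combined A. thaliana candidate list can be checked with inclusion-exclusion. The Python sketch below derives the overlap implied by the reported counts (132, 49, and 162). The overlap itself is not stated in the text, so it is an inference that assumes all three counts refer to the same non-redundant protein set. The sketch also tallies the 13 rice candidates across the categories described above. The function name is hypothetical.

```python
def implied_overlap(size_a, size_b, size_union):
    """Overlap implied by |A| + |B| - |A or B| (inclusion-exclusion)."""
    overlap = size_a + size_b - size_union
    if overlap < 0 or overlap > min(size_a, size_b):
        raise ValueError("the three sizes are mutually inconsistent")
    return overlap


# A. thaliana: "6 classifiers" AND "7-8 TM" (132) combined with
# "SAM2+GPCRHMM" AND "5-10 TM" (49) gives 162 candidates in total.
shared = implied_overlap(132, 49, 162)

# Rice candidates of Gookin et al. by category, as reported in the text:
# 6 (7-8 TM AND 6 classifiers), 2 (5-10 TM AND SAM2+GPCRHMM),
# 2 (5-10 TM AND 6 classifiers), 3 (not identified by any criterion).
rice_total = 6 + 2 + 2 + 3
```

Under the stated assumption, the two A. thaliana sets would share 19 proteins, and the rice categories sum to the 13 candidates reported.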