Most proteins from higher organisms are known to be multi-domain proteins and contain substantial numbers of intrinsically disordered (ID) regions. To analyse such protein sequences, those from human for instance, we developed a special protein-structure-prediction pipeline and accumulated the products in the Structure Atlas of Human Genome (SAHG) database at http://bird.cbrc.jp/sahg. With the pipeline, human proteins were examined by local alignment methods (BLAST, PSI-BLAST and Smith–Waterman profile–profile alignment), global–local alignment methods (FORTE) and prediction tools for ID regions (POODLE-S) and homology modeling (MODELLER). Conformational changes of protein models upon ligand-binding were predicted by simultaneous modeling using templates of apo and holo forms. When there were no suitable templates for holo forms and the apo models were accurate, we prepared holo models using prediction methods for ligand-binding (eF-seek) and conformational change (the elastic network model and the linear response theory). Models are displayed as animated images. As of July 2010, SAHG contains 42581 protein-domain models in approximately 24900 unique human protein sequences from the RefSeq database. Annotation of models with functional information and links to other databases such as EzCatDB, InterPro or HPRD are also provided to facilitate understanding the protein structure-function relationships.
Genome sequencing projects are producing complete genome sequences at an extremely high rate (1,2), and with the rise of next-generation sequencers (3–5) this trend will undoubtedly continue. Consequently, the number of known protein sequences (6) grows more rapidly than the number of experimentally determined protein structures (7). However, to make full use of genome sequences, the proteins encoded in genomes should be analysed, and for this purpose protein three-dimensional (3D) structures provide much information (8,9). Computational methods for protein 3D structure prediction are anticipated to bridge the gap between the number of known protein sequences and the number of known protein structures. According to assessments of the accuracy of those methods, e.g. recent Critical Assessment of Techniques for Protein Structure Prediction (CASP) experiments (10,11), template-based protein structure prediction often produces 3D models accurate enough for functional annotation, modification of protein functions or even structure-based drug design (12,13). In addition, in the CASP7 and CASP8 experiments, fully automated structure prediction methods reached a level comparable to the best performance of methods with human intervention (14).
In the CASP experiments, the target protein sequences are ones whose 3D structures will be determined. This means that such proteins are expected to consist of a single domain or a few domains and to be suitable for experimental structure determination. Therefore, protein sequences are sometimes truncated from their full-length forms. On the other hand, most protein sequences coded in genomes from higher organisms are known to be long multi-domain proteins (15) that contain a significant portion of intrinsically disordered (ID) regions (16–19). Clearly, these proteins are unsuitable for experimental structure determination in the full-length form and are distinct from the target protein sequences of CASP. To analyse such proteins, we have developed a special protein-structure-prediction pipeline by integrating and arranging various computational tools, either developed by us or widely used as global standards. This pipeline was applied to all proteins coded in the human genome. The resulting 3D models, as well as other annotations for protein functions, were accumulated in the Structure Atlas of Human Genome (SAHG) database and presented through the web interface at http://bird.cbrc.jp/sahg.
There are other databases of protein structure models, e.g. SWISS-MODEL Repository (20) or ModBase (21). Both databases contain annotated protein structure models generated by original automated modeling pipelines. They also allow the users to build models on demand. Compared with them, the SAHG database is distinct mainly in the following points: (i) The 3D models in SAHG were generated by an original pipeline, specific for multi-domain proteins with substantial ID regions; (ii) Conformational changes of proteins upon ligand-binding are predicted by simultaneous modeling using templates of the ligand-bound state (holo form) and the unbound state (apo form) and displayed as animated images; and (iii) Functional annotations for protein interactions, e.g. ligand-binding and protein–protein interactions, are available. All these features are suitable for analysing eukaryotic proteins toward a deep understanding of their functions and interactions.
Schematically, two types of prediction systems were used to analyse protein sequences [RefSeq sequences (22)] automatically. One is the ‘Structure prediction pipeline’ (right pink regions in Figure 1), which sequentially combines several homology search and protein structure prediction tools that conduct sequence–sequence, sequence–profile and profile–profile alignments; it processes protein sequences, assigns 3D templates to them and finally produces 3D models. If available, 3D models of both apo and holo forms were generated. The other is the set of ‘Other structure and function predictors’ (bottom light blue regions in Figure 1), an ensemble of independent prediction tools that analyse protein sequences. All the results from these systems were accumulated in SAHG in XML format.
Protein structure prediction consists of the following procedures: template searches and selection, alignment of target sequence and template, building 3D models and evaluation of model quality.
Template search and assignment to a target protein follow a ‘step-wise-multi-methods’ approach. In the first step, a BLAST (23) search against all the latest Protein Data Bank (PDB) (7) and Structural Classification of Proteins (SCOP) (24,25) sequences was performed with an E-value cut-off of 10⁻⁵. We selected templates for which at least 90% of the template sequence could be aligned with the target, to ensure that the 3D models corresponded to stable domains or proteins. The resulting target sequence-template alignments were ranked based on their E-values. The best combination of templates for each domain was determined using an original algorithm to maximize the coverage of the target sequence (label I in Figure 1). In the second step, a PSI-BLAST (23) search with the same parameters was conducted for the remaining regions of the target sequence, where no models had been assigned, and the best templates were assigned onto the target sequence (II in Figure 1). Protein sequence profiles were prepared using the latest NCBI-nr database. In the third step, a Smith–Waterman profile–profile alignment method (SWPPA) (26) was applied to the remaining regions against restricted templates (SCOP and PDB subsets with less than 40% sequence identity) with a cut-off of Z-score > 10, a threshold comparable to E-value < 10⁻⁵ in PSI-BLAST (III in Figure 1). Finally, the FORTE (27) search, a profile–profile comparison method, was performed for the remaining regions, with a strict cut-off of Z-score > 20, to detect distantly related templates (V in Figure 1). FORTE is based on the global–local alignment method and was adjusted to perform best (28) when the target proteins were almost the same length as the PDB entries (around 400 amino acids) (29). However, more than half of human proteins (53%) are longer than 400 amino acids, and even the remaining regions are sometimes over 2000 amino acids long.
Thus, prior to the FORTE search, potential domains were carved out from the remaining regions using an algorithm based on the prediction of ID regions (IV in Figure 1) and fed into FORTE (see ‘Prediction of potential domains’ section for details).
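The step-wise logic described above, in which each method is applied only to the regions left unmodeled by the previous ones, can be sketched as follows. This is a minimal illustration and not the pipeline's actual code: the names `remaining_regions` and `run_cascade` are hypothetical, each search method is represented by a placeholder callable, and the score cut-offs (E-value or Z-score) are assumed to be applied inside those callables.

```python
from typing import Callable, List, Tuple

Region = Tuple[int, int]  # half-open residue interval [start, end), 0-based


def remaining_regions(length: int, modeled: List[Region]) -> List[Region]:
    """Return the stretches of a target of the given length not yet covered by a model."""
    gaps = []
    cursor = 0
    for start, end in sorted(modeled):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < length:
        gaps.append((cursor, length))
    return gaps


def run_cascade(
    length: int,
    searchers: List[Tuple[str, Callable[[Region], List[Region]]]],
) -> List[Tuple[Region, str]]:
    """Apply each search method in order, restricted to regions still without a model.

    Each searcher is a hypothetical stand-in for BLAST, PSI-BLAST, SWPPA or FORTE;
    it receives one unmodeled region and returns the sub-regions it could model.
    """
    modeled: List[Tuple[Region, str]] = []
    for name, search in searchers:
        gaps = remaining_regions(length, [region for region, _ in modeled])
        for gap in gaps:
            for hit in search(gap):
                modeled.append((hit, name))
    return sorted(modeled)
```

With toy searchers, a 500-residue target whose first 200 residues are found by the first method leaves the interval 200 to 500 for the later, more sensitive methods.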
Once the target sequence-template alignments were obtained, all templates were checked against an ‘apo and holo form table’ that we prepared (see ‘Apo and holo form table’ section in Supplementary data). For a template in apo form, the corresponding template (>90% sequence identity) in holo form was selected from the table, and vice versa. For both templates, alignments to the target sequences were prepared (VI in Figure 1). In the model building and quality assessment step, 10 models were constructed using the MODELLER (30) software. The quality of the models was evaluated using the Stability score (31) and the best 3D model for each alignment was chosen (VII in Figure 1).
As of July 2010, 24878 RefSeq sequences [(22), 14012591 residues] encoded in the human genome were processed by the pipeline. In total, 42581 structure models were constructed, for which 18228, 14577, 9163 and 613 templates were detected by BLAST, PSI-BLAST, SWPPA and FORTE, respectively. For 4083 models (9% of all models), both the apo and holo forms were assigned. In total, 35275 residues were predicted to form long ID regions and were removed from the target sequences in advance of the FORTE search. In total, 295309 residues were eliminated because they were fragmented into small pieces (<26 residues). Multiple models were generated for 9057 RefSeq sequences, while only one model was generated for 12310 RefSeq sequences. In total, 3511 RefSeq sequences remain without any predicted model. Note that one model does not necessarily correspond to one domain (sometimes it corresponds to a whole protein chain), but more than one-third of human proteins were estimated to be multi-domain proteins. In some cases, we assessed predictions by comparing models with recently determined protein structures. Even when the sequence identities of the alignments were quite low (<20%), more than half of the predictions detected the correct folds (Supplementary Table S1), indicating that our prediction pipeline worked well.
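The counts in this paragraph can be cross-checked with a few lines of arithmetic. The snippet below only re-adds the figures quoted in the text; the variable names are illustrative.

```python
# Models per detection method, as quoted in the text
models_by_method = {"BLAST": 18228, "PSI-BLAST": 14577, "SWPPA": 9163, "FORTE": 613}
total_models = sum(models_by_method.values())

# RefSeq sequences by number of models obtained, as quoted in the text
sequences_by_outcome = {"multiple models": 9057, "one model": 12310, "no model": 3511}
total_sequences = sum(sequences_by_outcome.values())

# Derived quantities
apo_holo_fraction = 4083 / total_models            # share of models with both forms
multi_model_fraction = 9057 / total_sequences      # share of sequences with several models
mean_length = 14012591 / total_sequences           # average residues per RefSeq sequence

print(total_models, total_sequences)
print(f"{apo_holo_fraction:.3f} {multi_model_fraction:.3f} {mean_length:.0f}")
```

The per-method counts add up to the 42581 models and the per-outcome counts to the 24878 sequences. The share of sequences with multiple models is about 36%, consistent with the statement that more than one-third of human proteins are multi-domain.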
Many human proteins are composed of multiple domains and contain a significant fraction of ID regions, as was described above. These factors often prevent predicting protein structures in their full-length forms. As a result, SAHG principally exhibits protein structure as an array of domains. However, when multi-domain structures are available in the templates, the prediction pipeline implicitly prioritizes them to take advantage of the relative domain orientations. The pool of templates consists of SCOP (24,25) domains and whole PDB (7) structures, some of which are not deposited in SCOP. At the template assignment step (I, II, III, V in Figure 1), a set of templates was chosen to maximize the length of modeled regions. This approach is effective in accepting PDB structures spanning multiple domains, as the templates.
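The text does not spell out the original algorithm used to choose the template set, so the following is only a stand-in that shows why maximizing the modeled length favours multi-domain templates. It uses classic weighted interval scheduling over non-overlapping hits; the function name `max_coverage` and the tuple layout are assumptions made for this sketch.

```python
from bisect import bisect_right
from typing import List, Tuple

Hit = Tuple[int, int, str]  # (start, end, template_id), half-open interval on the target


def max_coverage(hits: List[Hit]) -> List[Hit]:
    """Pick non-overlapping hits whose summed length is maximal.

    Stand-in technique (weighted interval scheduling), not the SAHG algorithm.
    """
    hits = sorted(hits, key=lambda h: h[1])        # order by end position
    ends = [h[1] for h in hits]
    best = [0] * (len(hits) + 1)                   # best[i]: max length using the first i hits
    take = [False] * len(hits)
    prev = [0] * len(hits)
    for i, (start, end, _) in enumerate(hits):
        # number of earlier hits that end at or before this hit starts
        p = bisect_right(ends, start, 0, i)
        prev[i] = p
        with_hit = best[p] + (end - start)
        if with_hit > best[i]:
            best[i + 1] = with_hit
            take[i] = True
        else:
            best[i + 1] = best[i]
    # walk back through the table to recover the chosen hits
    chosen = []
    i = len(hits)
    while i > 0:
        if take[i - 1]:
            chosen.append(hits[i - 1])
            i = prev[i - 1]
        else:
            i -= 1
    return chosen[::-1]


example_hits = [
    (0, 100, "domain_A"),
    (120, 250, "domain_B"),
    (0, 240, "two_domain_chain"),
    (260, 300, "domain_C"),
]
selected = max_coverage(example_hits)
```

In this toy input the single two-domain chain covers 240 residues, more than the 230 covered by the two separate domain templates, so it is selected together with the downstream domain.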
ID regions were predicted using the POODLE-S (18) software, which calculates for each residue the probability of being in an ID region (XIII in Figure 1). As ID regions are considered to play fundamental roles in biological activities (17), their detection is important. On the other hand, it is necessary to remove long ID regions from the target sequences and assign potential domain regions to assure better performance in structure prediction (FORTE search, V in Figure 1). For this purpose, we evaluated an existing method to predict domain boundaries [Domcut (32)] and found that it was likely to overcut potential domain regions into segments. The same tendency was reported for other methods (33–35). We considered that this over-prediction was disadvantageous for arranging the input sequences for FORTE and developed a new method whose prediction was more ‘moderate’ (containing fewer false positives but more false negatives) based on the results of ID region prediction (IV in Figure 1), since ID regions act as linkers of structural domains (36). First, the POODLE-S results for a target sequence were converted into a binary sequence in which 0 (P<0.5) represents a residue in a structured region and 1 represents a residue in an ID region. Next, to detect regions where 0s were continuously abundant, we employed a simple two-state Hidden Markov Model. In this model, one state, ‘a mostly structured region’ (STR), emits 0 more frequently than 1 and the other state, ‘a mostly ID region’ (IDR), emits 1 more frequently than 0. The transition probability between STR and IDR and all the emission probabilities were empirically adjusted to eliminate over-prediction by referring to known domain data in PDB. Finally, the STR regions were estimated from the input binary sequence by calculating a Viterbi path.
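The binarization and two-state Viterbi decoding described here can be sketched as below. The transition and emission probabilities used in SAHG were tuned empirically and are not given in the text, so the values `p_stay=0.95` and `p_emit_match=0.8`, the uniform start probabilities and all function names are assumptions chosen purely for illustration.

```python
import math
from typing import List, Tuple


def binarize(probabilities: List[float]) -> List[int]:
    """Convert per-residue disorder probabilities to 0 (P < 0.5) or 1."""
    return [0 if p < 0.5 else 1 for p in probabilities]


def viterbi_two_state(bits: List[int], p_stay: float = 0.95,
                      p_emit_match: float = 0.8) -> List[str]:
    """Most probable STR/IDR labelling of a binary sequence (illustrative parameters)."""
    if not bits:
        return []
    states = ("STR", "IDR")
    # STR emits 0 more often than 1; IDR emits 1 more often than 0
    log_emit = {
        "STR": (math.log(p_emit_match), math.log(1 - p_emit_match)),
        "IDR": (math.log(1 - p_emit_match), math.log(p_emit_match)),
    }
    log_stay, log_switch = math.log(p_stay), math.log(1 - p_stay)
    score = {s: math.log(0.5) + log_emit[s][bits[0]] for s in states}
    back = []
    for b in bits[1:]:
        new_score, pointers = {}, {}
        for s in states:
            other = "IDR" if s == "STR" else "STR"
            stay, switch = score[s] + log_stay, score[other] + log_switch
            if stay >= switch:
                new_score[s], pointers[s] = stay + log_emit[s][b], s
            else:
                new_score[s], pointers[s] = switch + log_emit[s][b], other
        score = new_score
        back.append(pointers)
    # trace back from the best final state
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


def str_segments(path: List[str]) -> List[Tuple[int, int]]:
    """Return half-open intervals of consecutive STR labels (potential domains)."""
    segments, start = [], None
    for i, label in enumerate(path + ["END"]):
        if label == "STR" and start is None:
            start = i
        elif label != "STR" and start is not None:
            segments.append((start, i))
            start = None
    return segments
```

With a high self-transition probability, an isolated disordered residue inside a structured stretch does not split the domain, whereas a long run of disordered residues does. This is the ‘moderate’ behaviour the text aims for.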
When templates for both the ligand-bound state (holo form) and the unbound state (apo form) were detected using the ‘apo and holo form table’, two types of models were constructed and the structural changes upon ligand-binding were visualized by means of a morphing technique (the MORPH2 program in Martz-Authored PDB Tools; see http://www.umass.edu/microbio/rasmol/pdbtools.htm) (X in Figure 1). The animation of the conformational change provides significant information on protein function when it is shown together with functional residues and ligands.
When only a template for the apo form was available and, accordingly, only the apo model was constructed, its putative ligand and binding sites were predicted by the eF-seek software (37) (VIII in Figure 1). eF-seek finds potential ligand-binding sites in the apo model if similar structures are deposited in eF-site, the database of representative ligand-binding sites (38). eF-seek employs a clique search algorithm. As this method is sensitive to the input 3D coordinates, its application was limited to cases in which highly accurate structure models were available, i.e. the templates were detected by BLAST search with more than 90% sequence identity to the target sequences. The structural changes upon the predicted ligand-binding were then deduced using the elastic network model (39) and linear response theory to construct a model of the holo form (40) (IX in Figure 1).
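For orientation, the way an elastic network model and linear response theory combine can be written in a generic textbook form. The notation below is an assumption of this note and is not taken from the pipeline itself: \(\mathbf{f}_j\) is the force the ligand exerts on atom \(j\), \(\langle\cdot\rangle_0\) is an average over fluctuations of the ligand-free (apo) state, and \(\mathbf{H}^{+}\) is the pseudo-inverse of the elastic network Hessian.

```latex
% Linear response: shift of the mean position of atom i under ligand forces f_j
\Delta \langle \mathbf{r}_i \rangle \;\simeq\; \beta \sum_{j}
  \langle \Delta \mathbf{r}_i \, \Delta \mathbf{r}_j^{\mathsf{T}} \rangle_{0} \, \mathbf{f}_j ,
\qquad \beta = \frac{1}{k_{\mathrm{B}} T}

% Harmonic (elastic network) approximation of the apo-state covariance
\langle \Delta \mathbf{r} \, \Delta \mathbf{r}^{\mathsf{T}} \rangle_{0}
  \;=\; k_{\mathrm{B}} T \, \mathbf{H}^{+}
\quad\Longrightarrow\quad
\Delta \langle \mathbf{r} \rangle \;\simeq\; \mathbf{H}^{+} \, \mathbf{f}
```

Under the harmonic approximation the temperature cancels, so the predicted displacement depends only on the network Hessian and on where the ligand pushes. This is consistent with the need for accurate apo coordinates and binding-site predictions noted above.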
Note that this approach and its presentation are key features of the SAHG database. Animated views of the conformational change of the domains upon ligand-binding can offer deep insight into the protein structure and function relationship (X in Figure 1). As of July 2010, conformational changes upon ligand-binding were predicted for 4083 modeled domains among 42581 3D models.
In total, 33687 protein complex structures were gathered from the PQS database (41). If all the subunits from two complexes were paired with more than 95% sequence identity, the complexes were clustered together in a single-linkage manner. The complex structure with the highest resolution was selected from each cluster, yielding a non-redundant set of 12730 template complexes. If a target sequence was related to a given subunit of a template complex with >80% sequence identity by the BLAST search and all the other subunits were related to any target sequences, the complex model was constructed by MODELLER. In total, 8667 complex models were prepared for 3650 target sequences (XI in Figure 1).
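The clustering and representative selection described above can be sketched with a union-find structure. The sketch assumes the pairwise test (all subunits paired at more than 95% identity) has already been evaluated and is supplied as a list of linked pairs. The function names and the toy identifiers are hypothetical, and the highest resolution is taken to mean the smallest value in Å.

```python
from typing import Dict, Iterable, List, Tuple


def cluster_single_linkage(ids: Iterable[str],
                           links: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Group items so that any chain of pairwise links ends up in one cluster."""
    parent: Dict[str, str] = {i: i for i in ids}

    def find(x: str) -> str:
        # follow parents to the root, shortening the path on the way
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)

    clusters: Dict[str, List[str]] = {}
    for i in parent:
        clusters.setdefault(find(i), []).append(i)
    return sorted(sorted(members) for members in clusters.values())


def representatives(clusters: List[List[str]],
                    resolution: Dict[str, float]) -> List[str]:
    """Pick the best-resolved member (smallest value in angstroms) of each cluster."""
    return sorted(min(members, key=lambda i: resolution[i]) for members in clusters)


# Toy data: A-B and B-C are linked, so A, B and C form one cluster even if A and C are not
toy_clusters = cluster_single_linkage(["A", "B", "C", "D"], [("A", "B"), ("B", "C")])
toy_representatives = representatives(
    toy_clusters, {"A": 2.5, "B": 1.8, "C": 3.0, "D": 2.0}
)
```

Single linkage merges complexes transitively, which is why A and C share a cluster in the toy data without being directly linked.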
The ligands and their binding sites were retrieved from the constructed models. The ligands were mainly small molecules, such as peptides, nucleotides and metal ions; some trivial chemicals from buffers or precipitants were excluded. Binding sites were defined as residues lying within 5 Å of any ligand atom.
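A distance-based definition of this kind can be sketched as follows. The text does not say which protein atoms are used to measure the distance, so this sketch assumes that a residue counts as a binding residue when any of its atoms lies within the cut-off of any ligand atom. The function name and input layout are hypothetical, and coordinates are assumed to have been parsed from the model file beforehand.

```python
from typing import List, Tuple

Coord = Tuple[float, float, float]


def binding_residues(protein_atoms: List[Tuple[int, Coord]],
                     ligand_atoms: List[Coord],
                     cutoff: float = 5.0) -> List[int]:
    """Return residue numbers with at least one atom within `cutoff` angstroms of a ligand atom.

    protein_atoms: (residue_number, (x, y, z)) for every protein atom.
    ligand_atoms:  (x, y, z) for every ligand atom.
    """
    cutoff_sq = cutoff * cutoff  # compare squared distances to avoid square roots
    hits = set()
    for resnum, (x, y, z) in protein_atoms:
        if resnum in hits:
            continue  # residue already accepted through another atom
        for lx, ly, lz in ligand_atoms:
            if (x - lx) ** 2 + (y - ly) ** 2 + (z - lz) ** 2 <= cutoff_sq:
                hits.add(resnum)
                break
    return sorted(hits)


toy_protein = [
    (10, (0.0, 0.0, 0.0)),
    (10, (1.5, 0.0, 0.0)),
    (11, (3.0, 0.0, 0.0)),
    (12, (12.0, 0.0, 0.0)),
]
toy_ligand = [(0.0, 3.0, 0.0)]
toy_sites = binding_residues(toy_protein, toy_ligand)
```

This brute-force loop is quadratic in the number of atoms, which is acceptable for single domains; a spatial grid or k-d tree would be the usual choice for larger complexes.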
For the target sequences of enzymes, catalytic residues were predicted using the EzCatDB database (42) (XII in Figure 1). The EzCatDB database provides annotations on catalytic residues with PDB structure data. The catalytic residues and their positions were already denoted for sequences in the UniProt database (6), as mapped from the catalytic residues on the PDB sequence data, by BLAST search with an E-value cut-off of 10⁻¹⁰ and POA ver. 2.0 (43). From the human proteins in the UniProt database, target sequences were detected and catalytic residues were assigned in the same manner. Only chemically consistent residues were regarded as catalytic residues. The annotated ‘ACT_SITE’ residues for the human proteins in the UniProt database were also mapped on the target sequences using BLAST search.
SAHG provides its graphical web interface at http://bird.cbrc.jp/sahg. By clicking a chromosome's image, all proteins coded in the chromosome are listed with the predicted models. By choosing an image of a domain, detailed information of the target protein is shown. More practically, detailed information of specific proteins can be accessed by querying with Gene ID, RefSeq ID, annotation keywords or their combinations or by sequence homology search (BLAST), from an ‘Advanced search page’. In the detailed information page (Figure 2A), all contents for a given protein are shown. The ‘Protein information’ panel provides the information of the protein's RefSeq ID (I in Figure 2A). The sequence in FASTA format is displayed by clicking a ‘Sequence’ button. Predicted protein complexes are shown via a ‘Complex’ button if available (II in Figure 2A). An example of a ‘complex information’ page is shown in Figure 2B. Links to EC number, EzCatDB (42), HPRD (45), Swiss-Prot (6) and InterPro (46) are provided if available. A bar indicator is convenient for seeing the position of the predicted models in the full-length protein (III in Figure 2A). It also shows the annotation of ligand-binding residues (retrieved from the holo models), protein–protein interface residues (from protein complexes), catalytic residues (from EzCatDB), ID regions (by POODLE-S) and transmembrane regions (by TMHMM). By pointing at the colored pins on the bar indicator with a mouse, precise locations (residue numbers) of ligand-binding residues (green pins), protein–protein interface residues (blue) or catalytic residues (red) are shown (see IV in Figure 2A, an example of a catalytic residue). When a modeled region in the bar indicator (blocks on the bar) is selected by clicking, the predicted 3D model appears in the Jmol window (an open-source Java viewer for chemical structures in 3D; see http://www.jmol.org/Jmol) (V in Figure 2A).
When models of both apo and holo forms are available (green block on the bar), their structural changes upon ligand-binding are visualized by the morphing technique (the MORPH2 program in Martz-Authored PDB Tools; see http://www.umass.edu/microbio/rasmol/pdbtools.htm) and displayed as an animated image including the ligand molecules in this window. By clicking the bar indicator of ligand-binding or catalytic residues, the corresponding residues are highlighted in ‘CPK spacefill’ scheme in the Jmol window. The ‘Domain Information’ panel shows structural and functional information about a selected model (VI in Figure 2A). The target sequence-template alignments are displayed by an ‘Alignment button’. The predicted model can be downloaded in a pdb format via ‘model PDB’ button. Ligand-binding residues, protein–protein interface residues and catalytic residues are also listed as ‘Functional Residues’ in the same color of the bar indicator. (In Figure 2A, the ‘Domain information’ panel should be scrolled up).
To improve the accuracy of structure prediction, we are implementing a probabilistic profile–profile alignment method in our prediction pipeline. The method is an enhanced version of the probabilistic sequence–sequence alignment method (47), which has been shown to perform better than PSI-BLAST, in particular for orphan proteins. New versions of structure models provided by the new pipeline will appear in the fall of 2010. The results of predictions are being examined to clarify the functions and interactions of human proteins. For some proteins, predicted ligands are being verified experimentally. The structure model set in SAHG will be downloadable in bulk in the future.
Supplementary Data are available at NAR Online.
Japan Science and Technology Agency (JST) – Institute for Bioinformatics Research and Development (BIRD). Funding for open access charge: National Institute of Advanced Industrial Science and Technology (AIST).
Conflict of interest statement. None declared.
The authors are grateful to Takatsugu Hirokawa and Kiyoshi Asai for their support of the project, to Martin Frith for his critical reading of the article and to Mari Saito for her contribution to website design.