|Home | About | Journals | Submit | Contact Us | Français|
Serine proteases account for over a third of all known proteolytic enzymes; they are involved in a variety of physiological processes and are classified into clans sharing structural homology. The PA clan of endopeptidases is the most abundant and over two thirds of this clan is comprised of the S1 family of serine proteases, which bear the archetypal trypsin fold and have a catalytic triad in the order Histidine, Aspartate, Serine. These proteases have been studied in depth and many three dimensional structures have been experimentally determined. However, these structures mostly consist of bacterial and animal proteases, with a small number of plant and fungal proteases and as yet no structures have been determined for protozoa or archaea. The core structure and active site geometry of these proteases is of interest for many applications. This study investigated the structural properties of different S1 family serine proteases from a diverse range of taxa using molecular modeling techniques.
Our predicted models from protozoa, archaea, fungi and plants were combined with the experimentally determined structures of 16 S1 family members and used for analysis of the catalytic core. Amino acid sequences were submitted to SWISS-MODEL for homology-based structure prediction or the LOOPP server for threading-based structure prediction. Predicted models were refined using INSIGHT II and SCRWL and validated against experimental structures. Investigation of secondary structures and electrostatic surface potential was performed using MOLMOL. The structural geometry of the catalytic core shows clear deviations between taxa, but the relative positions of the catalytic triad residues were conserved. Some highly conserved residues potentially contributing to the stability of the structural core were identified. Evolutionary divergence was also exhibited by large variation in secondary structure features outside the core, differences in overall amino acid distribution, and unique surface electrostatic potential patterns between species.
Encompassing a wide range of taxa, our structural analysis provides an evolutionary perspective on S1 family serine proteases. Focusing on the common core containing the catalytic site of the enzyme, this analysis is beneficial for future molecular modeling strategies and structural analysis of serine protease models.
Serine proteases represent over a third of all known proteolytic enzymes and are implicit in a wide range of physiological processes including digestion, immunity, blood clotting, fibrinolysis, reproduction and protein folding . The proteolytic mechanism of these proteases involves nucleophilic attack of the carbonyl atom of the substrate peptide bond by a catalytic serine (Ser) residue in the active site of the enzyme. In addition to the nucleophilic Ser residue, this reaction is dependent on two other amino acids in the catalytic site, Histidine (His) and an Aspartate (Asp) that together form what is referred to as the catalytic triad (or a dyad in some cases) . The presence of this catalytic triad in at least four distinct protein folds indicates evolutionary success in four different contexts .
The MEROPS classification system (http://merops.sanger.ac.uk/) has grouped proteases into clans that typically have structural homology and/or the same linear order of catalytic triad residues . Of all serine proteases, the PA clan of endopeptidases is the most abundant and has been studied the most in-depth. Although most members of this clan utilize a nucleophilic Ser residue (S sub-clan), there are several viral PA proteases that alternatively use a nucleophilic cysteine (Cys) residue (C sub-clan) . However, this study focuses solely on the PA clan serine proteases and more specifically members of the S1 family that bear the archetypal trypsin fold. Although extensively distributed in nature, clan PA proteases are highly represented in eukaryotes – vertebrates in particular have a vast array of proteases that are involved in a variety of extracellular processes . Most clan PA proteases have trypsin-like substrate specificity, cleaving the polypeptide substrate on the carboxyl side of an arginine (Arg) or lysine (Lys) amino acid . Nucleophilic attack by the Ser195 (standard chymotrypsin numbering) hydroxyl group on the carbonyl of the peptide substrate initiates the proteolytic mechanism. This reaction is catalyzed by the His57 acting as a general base, which itself is supported by a hydrogen bond to Asp102. The resulting tetrahedral intermediate is stabilized by Gly193 and Ser195, which contribute to a positively charged pocket known as the oxyanion hole. This tetrahedral intermediate breaks down to an acylenzyme intermediate, followed by the formation of a second tetrahedral intermediate. With the protonation of Ser195 by His57, the second tetrahedral intermediate breaks down and the carboxyl terminus of the substrate is released .
The S1 proteases are comprised of 2 β-barrels that align asymmetrically in a classical Greek key formation, bringing the catalytic residues together at their interface. The His57 and Asp102 reside in the N-terminal β-barrel with the nucleophilic Ser195 and oxyanion hole generated by the C-terminal β-barrel . Many of the trypsin-like proteases are produced as an inactive zymogen precursor protein . Cleavage of the proprotein precursor from the N terminus and subsequent conformational change of the tertiary structure is required for enzyme activation. In the case of trypsin, this regulatory mode of activation prevents autodegradation of the pancreas where it is produced, but allows efficient activity in the small intestine where it is activated by enteropeptidase and further trypsin molecules are activated by autocatalysis . In blood coagulation and complement activation, serine protease zymogens are sequentially activated in a cascade pathway, which eventually generates effector molecules by limited proteolysis. High specificity of their catalytic domains, interactions among the regulatory regions, and efficient removal of active serine proteases by irreversible protease inhibitors ensure local, transient reactions to physiological or pathological cues [11,12]. The S1 proteases have numerous functions including intestinal digestion (eg. trypsins, chymotrypsins, elastases), blood coagulation (eg. thrombin, coagulation factors), immunity (eg. complement factors, tryptases in secretory granules of mast cells, granzymes of cytotoxic cells) and homeostatic regulation (eg. kallikreins) .
This study investigates the structural properties of different S1 family serine proteases from a diverse range of taxa using molecular modeling techniques. Although the catalytic core geometry shows evolutionary divergence between taxa, the relative positions of the catalytic triad residues were conserved, as were other highly conserved residues that possibly provide stabilization. There was also large variation in secondary structure features outside the core, the overall amino acid distribution, and surface electrostatic potential patterns between species.
Structural data for 3 bacterial, 1 fungal, and 12 animal PA clan serine protease structures (Table (Table1)1) were obtained from the Protein Data Bank (PDB, http://www.rcsb.org/pdb). Our in-house modeling software package MODELYN  was developed to perform customized molecular editing and in silico structural analysis. It has a set of powerful menus for batch processing commands leading to automated implementation of complicated tasks, including complete model building based on sequence homology and batch processing of replacement mutations. ANALYN  is an ancillary protein sequence analysis program that assists MODELYN by analyzing homologous sequences and formulating the strategy for model building. In addition to the experimental structures, amino acid sequences of PA serine proteases (Table (Table1)1) for 1 protozoan (Plasmodium falciparum), 1 archaeon (Pyrococcus furiosus), 1 fungus (Neurospora crassa) and 1 plant (Arabidopsis thaliana) were obtained from the MEROPS protease database (http://merops.sanger.ac.uk) in FASTA format . Sequences were initially submitted to SWISS-MODEL for homology-based structure prediction . If this analysis was unsuccessful (due to less than 35% sequence similarity with known experimental structures), these sequences were submitted to the LOOPP server  for threading based structure prediction as previously described . This analysis reported a ranked list of possible structure predictions for each of the protease sequences, including match scores, sequence identity (%) and the extent of sequence coverage (%). Predicted structures were superposed with respect to a selected set of Cα atoms and a suitable starting scaffold was determined. Root mean square deviation (RMSD) values helped to identify the common segments, corresponding to the structurally conserved regions. The starting structures were refined using the DISCOVER and ANALYSIS modules within the software package Insight II  through energy minimization and molecular dynamics. The side chains were regenerated using SCRWL  and the overall structure was energy minimized. The SCWRL software package is used for prediction of protein side-chains of a fixed backbone, using graph theory to solve the combinatorial problem. PROCHECK was used to check the distribution of -ψ dihedral angles and identify Ramachandran outliers . The CHARMm module within InsightII was used to apply dihedral constraints in these segments. MOLPROBITY  and MODELYN were used to validate the structural models against experimental structure data. MOLPROBITY provides all-atom contact analysis and gives quantitative information on the steric interactions (H-bond and van der Waals contacts) at the interfaces between components. This program is widely used for quality validation of three-dimensional (3D) protein structures by measuring deviations of bond lengths, bond angles from standard values, overall atom clashscores and rotamer outliers. MODELYN was used to analyze other structural parameters, including the distance between Cα atoms of the catalytic triad. Verify3D , ProSA  and ERRAT  were also used to further assess the quality of the protease models. Verify3D analyzes the compatibility of the model against its own amino acid sequence. The Verify3D score (the sum of scores for individual residues using a 21-residue sliding window) is normalized to the length of the sequence: log2(Verify3D score/L2) . ProSA calculates an overall quality score (Z score) of a model in comparison to a range of characteristics expected for native protein structures. ERRAT analyzes the statistics of non-bonded interactions between different atom types (9-residue sliding window) and provides an overall quality factor that is expressed as the percentage of the protein for which the calculated error value falls below the 95% threshold. The ribbon structure and electrostatic potential surface of the structures were determined by MOLMOL . To determine sequence conservation between species, CLUSTALW  was used for multiple sequence alignment. For each sequence, PEPSTATS  was used to determine the molar percentage of each amino acid physico-chemical class.
The protozoan protease from P. falciparum was the only sequence that had significant homology with proteases of known experimental structure for successful structure prediction using SWISS-MODEL. The homology model was essentially built on the structures 1L1J (a heat shock protease from the hyperthermophilic bacterium Thermotoga maritime) and 2AS9 (a splC protease from the bacterium Staphylococcus aureus), with sequence identity ranging from 29 to 38% (Table (Table2).2). Homology-based structure prediction for the P. furiosus, N. crassa and A. thaliana proteases was unsuccessful due to insufficient sequence similarity with known experimental structures. The sequences of these proteases were then submitted to the LOOPP server for threading-based structure prediction, which yielded a list of 5 different experimental structures that matched to each of the sequences. The best matched structures for each showed high confidence scores ranging from 3.1 to 6.4 and sequence identity ranging from 24 to 44%, with best length coverage between 92 and 95%. For P. furiosus (Table (Table3),3), the matched structures were superposed with respect to a selected set of Cα atoms (43% superposition), with the structure 1GBI (an α-lytic protease from the proteobacterium Lysobacter enzymogenes) having the best score of 3.41 (RMSD values were between 0.357 and 0.563 Å, which helped to identify common segments corresponding to structurally conserved regions). From these superposed structures, the variable loop regions were identified on the starting scaffold derived from 1GBI. For N. crassa (Table (Table3),3), structures were superposed with respect to selected Cα atoms (39%) with the structure 1VCW (a degS protease from the bacterium Escherichia coli) having the highest score of 3.08 (RMSD values between 0.439 and 0.724 Å). For A. thaliana (Table (Table3),3), structures were superposed with respect to selected Cα atoms (48%), with the structure 1L1J having the best score of 6.4 (RMSD values were between 0.392 and 0.537 Å). Structural refinement using Insight II and SCRWL is provided in detail as additional information, including the refined energy status for each structural model (see Additional file 1: Table S1, Table S2, Table S3 and Table S4). PROCHECK was used to measure the overall backbone conformations of the predicted structures and identify Ramachandran outliers. The CHARMm module of Insight II was used to apply dihedral constraints in these segments (Table (Table4;4; see Additional file 1: Figure S1, Figure S2, Figure S3 and Figure S4). The general structural parameters of the refined model, such as deviations of bond lengths, bond angles from standard values, overall atom clashscores (overlaps >0.4 Å) and rotamer outliers (first two χ angles >20° from its nearest associated rotamer) were compared to experimental structure data using MOLPROBITY and MODELYN. This analysis indicated that the general structural parameters of experimental and predicted structures were comparable (Table (Table5).5). Further validation using Verify3D and ProSA gave good scores for overall model quality (Table (Table5).5). However, the ERRAT validation of the P. falciparum and N. crassa protease models indicated regions where the calculated errors were higher than expected, which decreased the overall quality score of these models (Table (Table5).5). In both cases, the low quality regions in the P. falciparum (Leu377-Asp387) and N. crassa (Ala168-Arg178) models were possibly due to steric clashes created by Phe379 (P. falciparum), Arg173 (N. crassa) and others. Significantly, these regions were not within close proximity (< 6 Å) of the catalytic site.
Superposition of the P. falciparumP. furiosusN. crassa and A. thaliana PA proteases on the representative 1SGI protease structure found that 13 to 20% of the Cα atoms superposed with a RMSD below 2Å (Table (Table6).6). In comparison, the animal proteases had 41 to 46% of the Cα atoms superposed with a RMSD below 0.8Å and the bacterial proteases of this clan had 10 to 19% of the Cα atoms superposed with a RMSD below 1Å. The superposed structures have a common core structure with large variation in loops outside the core (Figure (Figure1).1). The Cα atom distances of Asp to His, His to Ser and Asp to Ser averaged over the experimentally determined structures were 6.4 ± 0.01, 8.4 ± 0.03 and 9.8 ± 0.02 Å, respectively (Table (Table6).6). The small standard deviations (SDs) indicated that the structural environment around the catalytic triad was highly conserved. Averaged over the predicted structures, the Cα atom distances between the catalytic triad residues were 6.5 ± 0.06, 8.9 ± 0.19 and 10.1 ± 0.14 Å respectively, in good agreement with the values averaged over the experimental structures. Multiple sequence alignment (Figure (Figure2)2) confirmed sequence conservation of the catalytic triad residues at His57, Asp102, and Ser195 (chymotrypsin numbering). Other highly conserved amino acids have been described, including Thr54, Ala56 and Ser214, which stabilize the catalytic triad through a network of additional H-bonds . These residues were highly conserved showing the occupancy percentage of 76%, 71% and 71%, respectively, among the sequences analyzed. In conjunction with the catalytic Ser195, the Gly193 residue (which was conserved in 81% of the sequences analyzed) is known to generate a positively charged pocket within the active site known as the oxyanion hole. Through intramolecular electrostatic interactions, Asp194 (71% conservation) is known to stabilize both the oxyanion hole and the substrate binding site . In addition, other highly conserved amino acids such as Ala55 (81%), Cys58 (71%), Gly196 (100%), Gly197 (86%), and Pro198 (90%) were in close proximity to the catalytic residues.As confirmed in other serine proteases [28,29], such residues may confer stabilization of the catalytic site via a hydrogen-bonding interaction or via a disulfide bond in the case of the Cys residue (see Additional file 1: Figure S5, Tables S5 and Table S6). This analysis incorporates an evolutionarily diverse range of PA serine proteases and it indicates that although the core structures deviated considerably during evolution, the relative positions of the catalytic triad Cα atoms maintained very close relative distances and were stabilized by other highly conserved residues.
The S1 family of PA proteases is typically comprised of 2 β-barrels that align asymmetrically in a classical Greek key formation, bringing the catalytic residues together at their interface . Figure Figure33 is a representative X-ray structure of a S1 family bacterial protease (1SGC, protease A from Streptomyces griseus), comprising 13 β-sheets and 4 α-helices. The protease model from P. falciparum had 9 β-sheets, with His328 situated in a turn, Asp359 in a coil and Ser438 in a turn (Figure (Figure3C).3C). The surface electrostatic potentials around the catalytic site were similar to those of the 1SGC X-ray structure, showing mostly electroneutral regions with some patches of electronegative potential (Figure (Figure3D).3D). In comparison with the other species analyzed (see Table S7), the P. falciparum protease had a higher proportion (> SD of the mean) of polar residues (55%, molar percentage) and less (< SD of the mean) smaller amino acids (43%), which indicates it could favor a more hydrophilic environment. According to UniProt annotation (Q687H5), this protease is thought to be an ortholog of the E. coli degP protease, which is possibly involved in protein folding and is essential for growth at high temperatures .
The protease model from P. furiosus had 7 β-sheets with His286 situated in a turn, Asp320 in a coil and Ser389 in a turn (Figure (Figure3E).3E). The pattern of surface electrostatic potential was very different from others analyzed, with the surface containing mostly electronegative regions around the catalytic site (Figure (Figure3F).3F). In comparison with the other species analyzed (see Table S7), the P. furiosus protease had a slightly higher proportion (> SD of the mean) of aromatic residues (12%) and less (< SD of the mean) smaller amino acids (45%). These distinctive features, which have also been observed in another P. furiosus protease , may be associated with increased stabilization and hyperthermophilic adaptation. Closely packed aromatic interactions have been proposed to increase the ΔG of unfolding, thereby increasing thermal stability [31,32]. Further investigation of these properties could be utilized for protein engineering strategies.
The protease model from N. crassa had 6 β-sheets and 2 α-helical segments, with His120 situated in a short α-helix and the Asp151 and Ser234 residues in separate coil regions (Figure (Figure3G).3G). The surface electrostatic potential pattern shows the catalytic Ser residue is in an electroneutral zone whereas the His and Asp residues are in a mostly electronegative region (Figure (Figure3H).3H). In general, the N. crassa protease had a higher proportion (> SD of the mean) of acidic residues (13%) compared to the other species analyzed (see Table S7). This protease is an ortholog of the S. cereviseae Nma111p nuclear serine protease, which mediates apoptosis and promotes survival under heat stress . Mutational analysis of the N. crassa protease would be useful to explore these features in this highly studied model organism.
The A. thaliana PA protease model had 7 β-sheets and 1 α-helix, with His99 situated in the α-helix and Asp130 and Ser208 in separate turn structures (Figure (Figure3I).3I). The electrostatic potentials around the His and Ser catalytic residues were mostly electroneutral with the Asp residue of the catalytic triad in a very electronegative region (Figure (Figure3J).3J). The A. thaliana protease had a higher proportion (> SD of the mean) of aromatic residues (14%) compared to other species (see Table S7). According to UniProt annotation (Q9C691), this protease is thought to be an ortholog of degP6 and like the modeled protease from P. falciparum it is potentially involved in protein folding and promotes growth at high temperatures . A. thaliana is a highly studied model organism and like the N. crassa protease, mutational analysis of this protease would be useful to explore these features.
The pronounced differences in electrostatic surface features between the protease catalytic sites possibly have functional significance. In general, the catalytic sites were mostly electroneutral with regions that were electronegative. The P. falciparumA. thaliana and N. crassa proteases are orthologs of the oligomeric HtrA (or HtrA-like) family of serine proteases, which have a critical role in protein quality control [34,35]. Using a hold-and-cut mechanism, the PDZ domain of most HrtA complexes selectively binds small hydrophobic residues at the C-terminus of a misfolded protein substrate, which is then successively degraded in the proteolytic site . It is not surprising given the variety of functions in a wide range of different organisms that most HrtA enzymes have selective substrate specificity, although often for a number of substrates [34,35]. The electronegative patches in the catalytic sites of the modeled PA proteases could facilitate this specificity by favoring positively charged C-terminal amino acid side chains at specific sites within the binding pocket. Likewise, the largely electronegative catalytic site of the P. furiosus protease suggests it favors a positively charged substrate. The largely electroneutral regions possibly relax the stringency of the substrate binding, allowing for a number of different protein substrates. Further investigation of substrate specificity and other properties contributing to it would be needed for functional analysis of these proteases, particularly for the P. falciparum protease as it could be a potential target for rational anti-malarial drug design.
The following predicted structures are available in the Protein Model Database (PMDB) (http://mi.caspur.it/PMDB/):
1. PA serine protease from Plasmodium falciparum (PMDB ID: PM0075793)
2. PA serine protease from Pyrococcus furiosus (PMDB ID: PM0075794)
3. PA serine protease from Neurospora crassa (PMDB ID: PM0075795)
4. PA serine protease from Arabidopsis thaliana (PMDB ID: PM0075796)
In conjunction with 16 experimentally determined 3D protein structures, our analysis of predicted structures from a protozoan, an archaeaon, a plant and a fungus encompassed an evolutionarily diverse range of PA clan proteases. The structural geometry of the catalytic core clearly deviated considerably during evolution, but the relative positions of the catalytic triad residues were conserved and other highly conserved residues possibly provide stabilization of the core. Evolutionary divergence was also exhibited by large variation in secondary structure features outside the core, differences in overall amino acid distribution, and unique surface electrostatic potential patterns between species. These features are probably associated with environmental adaptation, subcellular localization, and the diverse functions of the different protease orthologs. Interestingly, each of the modeled proteases appear to be orthologs of heat shock proteases that are involved in protein folding and promote cell growth at high temperatures. Indeed, some of the proteases’ features are known to confer structural stability, such as a higher proportion of aromatic residues  or negatively charged residues around the catalytic site . Further investigation of these features would be useful for protein engineering strategies and to elucidate their functional significance in each of the modeled proteases.
AL participated in the design of the study and carried out the modeling, structural analysis and sequence alignment. AC contributed to MODELYN and CLUSTALW analysis. EJR drafted and revised the manuscript, with contributions by AC and AL. CM participated in the design and coordination of the study. All authors read and approved the final manuscript.
The authors have no competing interests to declare.
Figure S1. Ramachandran plot of -ψ dihedral angles of a modeled PA serine protease structure fromPlasmodium falciparumbefore and after backbone refinement. PROCHECK was used to check the distribution of -ψ dihedral angles and eliminate Ramachandran outliers in the modeled protease structure (A, before; B, after refinement). Residues whose -ψ pairs fell outside the most favourable (red) and additional allowed (yellow) zones are annotated in red. Figure S2. Ramachandran plot of -ψ dihedral angles of a modeled PA serine protease structure fromPyrococcus furiosus before and after backbone refinement. PROCHECK was used to check the distribution of -ψ dihedral angles and eliminate Ramachandran outliers in the modeled protease structure (A, before; B, after refinement). Residues whose -ψ pairs fell outside the most favourable (red) and additional allowed (yellow) zones are annotated in red. Figure S3. Ramachandran plot of -ψ dihedral angles of a modeled PA serine protease structure fromNeurospora crassabefore and after backbone refinement. PROCHECK was used to check the distribution of -ψ dihedral angles and eliminate Ramachandran outliers in the modeled protease structure (A, before; B, after refinement). Residues whose -ψ pairs fell outside the most favourable (red) and additional allowed (yellow) zones are annotated in red. Figure S4. Ramachandran plot of -ψ dihedral angles of a modeled PA serine protease structure fromArabidopsis thalianabefore and after backbone refinement. PROCHECK was used to check the distribution of -ψ dihedral angles and eliminate Ramachandran outliers in the modeled protease structure (A, before; B, after refinement). Residues whose -ψ pairs fell outside the most favourable (red) and additional allowed (yellow) zones are annotated in red. Figure S5. Predicted disulfide bond in Modeled PA protease structure ofPyrococcus furiosus(PMDB ID: PM0075794). The ribbon model shows secondary structures (β-sheets with arrow directed to C-terminus, α-helices and turn/loops) in alternating colors and cysteine residues Cys 267 (blue) and Cys287 (red) forming a predicted disulfide bond (2.04 Å). Table S1. Energy parameters of modeled PA protease structure from Plasmodium falciparum.Table S2. Energy parameters of modeled PA protease structure from Pyrococcus furiosus.Table S3. Energy parameters of modeled PA protease structure from Neurospora crassa.Table S4. Energy parameters of modeled PA protease structure from Arabidopsis thaliana.Table S5. Predicted hydrogen bonds in modeled PA protease structures. Table S6. Disulfide bonds in close proximity to catalytic histidine residue of experimental structures and modeled structures of PA serine proteases. Table S7. Relative comparison of PA serine protease amino acid composition based on physico-chemical properties.
AL and CM are grateful for the funding and infrastructural support provided by the Indian Institute of Chemical Biology, Kolkata, West Bengal, India. EJR and AC gratefully acknowledge the support provided by Professor Ian Morison and the Department of Pathology, University of Otago, Dunedin, the Health Research Council (EJR), and the National Research Centre for Growth and Development (AC), New Zealand. These entities did not have a role in: study design; collection, analysis, or interpretation of data; writing the manuscript or decision to submit manuscript for publication. The authors are also grateful for Hester Roberts’ helpful comments regarding the manuscript.