High-resolution structures of proteins remain the most valuable source for understanding their function in the cell and provide leads for drug design. Since the availability of sufficient protein structures to tackle complex problems such as modeling backbone moves or docking remains a problem, alternative approaches using small, recurrent protein fragments have been employed. Here we present two databases that provide a vast resource for implementing such fragment-based strategies. The BriX database contains fragments from over 7000 non-homologous proteins from the Astral collection, segmented in lengths from 4 to 14 residues and clustered according to structural similarity, summing up to a content of 2 million fragments per length. To overcome the lack of loops classified in BriX, we constructed the Loop BriX database of non-regular structure elements, clustered according to end-to-end distance between the regular residues flanking the loop. Both databases are available online (http://brix.crg.es) and can be accessed through a user-friendly web-interface. For high-throughput queries a web-based API is provided, as well as full database downloads. In addition, two exciting applications are provided as online services: (i) user-submitted structures can be covered on the fly with BriX classes, representing putative structural variation throughout the protein and (ii) gaps or low-confidence regions in these structures can be bridged with matching fragments.
Proteins are by far the most versatile and complex molecules in the cell. It is commonly accepted that protein function directly relates to three-dimensional (3D) structure. Yet detailed structural information is available for just over a quarter of all single-domain protein families (1), a number that can be extended through threading and homology modeling (2). Due to experimental constraints of X-ray crystallography or NMR, new structures are determined considerably more slowly than new sequence data are generated by next-generation sequencing methods.
In order to understand the structural protein universe, proteins have been classified on the architecture of the fold and evolutionary relationships in databases such as SCOP (3) or CATH (4). However, proteins often perform their functions using just a limited number of residues, making it worthwhile to find structural similarities at the level of protein fragments. In search of a ‘parts list’ of proteins, with α-helices and β-sheets as prime examples of common parts, researchers have constructed fragment libraries based on the similarity of the polypeptide backbone (5,6). These protein fragment libraries have been widely used for a range of applications such as structural comparison of protein folds through a simplified representation with fragments (7), homology modeling at the level of fragments (8,9), investigating sequence-to-structure relationships (10), approximating tertiary structure of proteins using fragments (11–14), loop prediction (15–17) or even novel fold prediction (18,19).
Unfortunately, many of the available fragment libraries are either limited in fragment classes or ‘states’ (6,20) or not publicly accessible (13). Moreover, existing databases are often biased towards short stretches of residues, typically three to nine residues long, or contain an extensive parts list but are not clustered based on backbone similarity, thereby complicating comparative studies (21). Although limited alphabets have been shown to successfully reconstruct existing proteins to global fits of 0.5Å root mean square distance (RMSD) or serve successfully as templates to efficiently sample the protein space, they are too limited to describe protein structure at sub-ångström resolution, especially in the case of loops (22). To overcome these limitations we have constructed BriX, a database of protein fragments from 4 to 14 residues, hierarchically clustered on backbone similarities (22).
Here we describe how we updated the BriX database, which previously contained fragments from 1259 structures, to incorporate over 7000 structures from the ASTRAL40 set (a curated set of proteins with <40% sequence homology) (23). Furthermore, we enriched the database with all loops from over 14000 structures in the ASTRAL95 set (sharing <95% sequence homology) and clustered these loops in their own respect. We also provide a user-friendly web interface to explore both BriX and Loop BriX (http://brix.crg.es). Finally, to illustrate the potential of our database we allow users to upload their own PDB structure and ‘cover’ parts or ‘bridge’ gaps with BriX or Loop BriX fragments. The new release of BriX is expected to be helpful to the scientific community by facilitating the use of fragments in structural biology, protein modeling and design.
The first version of the BriX database (22) was constructed from the Whatif set of 1259 non-redundant proteins (24). Using a sliding-window technique, we segmented all proteins into fragments of 4 to 14 residues long and clustered them on their backbone similarity with a hierarchical clustering algorithm. The similarity between two fragments is defined as the average RMSD between the backbone atoms (N, Cα, C, O) of each corresponding residue.
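The segmentation and similarity measure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the RMSD is taken over all corresponding backbone atoms as one plausible reading of the definition, and the fragments are assumed to be superposed already (no optimal superposition is performed).

```python
import math

def sliding_fragments(residues, min_len=4, max_len=14):
    """Yield (start, length, fragment) for every window of min_len..max_len residues."""
    for length in range(min_len, max_len + 1):
        for start in range(len(residues) - length + 1):
            yield start, length, residues[start:start + length]

def backbone_rmsd(frag_a, frag_b):
    """RMSD over corresponding backbone atoms of two equal-length fragments.

    Each fragment is a list of residues; each residue is a list of (x, y, z)
    tuples for its N, CA, C and O atoms. Assumption: the two fragments are
    already superposed, so no rotation or translation is applied here.
    """
    if len(frag_a) != len(frag_b):
        raise ValueError("fragments must have the same length")
    sq_sum, n_atoms = 0.0, 0
    for res_a, res_b in zip(frag_a, frag_b):
        for atom_a, atom_b in zip(res_a, res_b):
            sq_sum += sum((p - q) ** 2 for p, q in zip(atom_a, atom_b))
            n_atoms += 1
    return math.sqrt(sq_sum / n_atoms)
```

For a chain of 20 residues, the sliding window yields 17 fragments of length 4 and 132 fragments over all lengths from 4 to 14.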
The updated version of the BriX database is enriched with the much larger ASTRAL40 set of 7290 proteins sharing <40% of sequence homology. The ASTRAL40 set is a complete representation of the variety present in structural databases such as SCOP (Supplementary Figure S1). Once more, we fragmented all proteins and assigned each fragment to the closest class represented by its centroid. As it turns out, we were able to fit most of the ASTRAL40 fragments into existing BriX classes, showing the completeness of our structural alphabet in the updated version of BriX, while increasing its content 7-fold (Figure 1).
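Assigning a new fragment to the closest existing class can be sketched as a nearest-centroid lookup. The function name, the dictionary layout and the choice to leave a fragment unclassified when no centroid lies within the threshold are assumptions made for illustration; the distance function is passed in so that any fragment similarity measure can be used.

```python
def assign_to_class(fragment, centroids, threshold, distance):
    """Return the id of the closest class centroid, or None if none is close enough.

    centroids: dict mapping a class id to its centroid fragment.
    distance:  callable(fragment, centroid) -> float, e.g. a backbone RMSD.
    threshold: maximum distance at which a fragment still joins a class.
    """
    best_id, best_dist = None, float("inf")
    for class_id, centroid in centroids.items():
        d = distance(fragment, centroid)
        if d < best_dist:
            best_id, best_dist = class_id, d
    return best_id if best_dist <= threshold else None
```

Fragments that return None under this scheme correspond to the unclassified fraction discussed in the text.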
As expected, the number of classes varies with the length of the clustered fragments: even for a short fragment length (n=4) and a strict threshold (≤0.4Å RMSD) a large number of classes (2000) were observed. The largest number of structural classes is detected when applying a clustering threshold of 0.5Å to fragments of length 7: 3613 classes can be distinguished. Beyond this length the number of classes steadily decreases, reaching 1500 classes at length 14 (Figure 1A). As expected, the number of classes per length decreases with increasing classification thresholds (Supplementary Figure S2) as more different fragments are classified into a single class. Also, the percentage of classified fragments decreases steadily with increasing fragment length. To compensate for this, increasing the covering thresholds for a specific length improves the classification rates (Supplementary Figure S3).
Furthermore, we analyzed the secondary structure content in classes derived for different fragment lengths and thresholds. Not surprisingly, α-helical and β-strand fragments remain well represented in structural classes of higher length (Supplementary Figure S4), while loop fragments are under-represented in classes of all lengths, indicating that they are harder to classify. Clearly the majority of unclassified fragments are composed of loop structures (Supplementary Figure S5). This indicates that a separate classification scheme, more suited to the particularities of loop structures, could significantly enrich the BriX database.
The Loop BriX database was built using 14525 protein structures derived from the ASTRAL95 set containing protein structures sharing <95% sequence identity (23). A loop fragment starts and ends with a single residue belonging to a regular secondary structure such as a helix or a strand and contains any number of irregular residues in between. As shown by different studies, the structural loop space can be partitioned by four combinations of flanking regular elements: α-α, α-β, β-α and β-β (25–27) (Supplementary Figure S6).
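The loop definition above can be illustrated on a secondary-structure string. The sketch below assumes a reduced alphabet in which H marks helix residues, E marks strand residues and any other character marks an irregular residue; this reduction, the function name and the min_irregular parameter are illustrative assumptions rather than the procedure used to build Loop BriX.

```python
import re

# One regular anchor, a run of irregular residues, one regular anchor.
# The lookahead lets the closing anchor of one loop open the next loop.
LOOP_RE = re.compile(r"(?=([HE])([^HE]+)([HE]))")
FLANK = {"H": "alpha", "E": "beta"}

def find_loops(ss, min_irregular=1):
    """Return (start, end, flank_type) for each loop in a secondary-structure string.

    start and end are the indices of the two regular anchor residues;
    flank_type is one of alpha-alpha, alpha-beta, beta-alpha, beta-beta.
    """
    loops = []
    for match in LOOP_RE.finditer(ss):
        irregular = match.group(2)
        if len(irregular) < min_irregular:
            continue
        start = match.start()
        end = start + len(irregular) + 1  # index of the closing anchor
        flank_type = FLANK[match.group(1)] + "-" + FLANK[match.group(3)]
        loops.append((start, end, flank_type))
    return loops
```

For the string "HHHH---EEEE--HHH" this returns an alpha-beta loop anchored at positions 3 and 7 and a beta-alpha loop anchored at positions 10 and 13.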
We have introduced a novel way to compare the similarity between two loop fragments based on (i) the distance between their end points (‘end-to-end distance’) rather than the overall structure similarity used in BriX and (ii) the superposition of two regular anchor residues at each side of the loop with an RMSD <1Å. First, loops in each of the four loop classes described above were clustered on end-to-end distance using the same hierarchical clustering algorithm. These ‘super classes’ contain loops of varying length and thus show a considerable amount of variation in the part between the end points (Figure 2A). Secondly, super classes were clustered in ‘sub classes’, grouping loops of the same length and similar structure.
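Grouping loops on end-to-end distance can be sketched as below. The text does not specify the linkage criterion or the cut-off of the hierarchical clustering, so this example substitutes plain single-linkage clustering, which for one-dimensional values reduces to splitting the sorted distances wherever two neighbours differ by more than a chosen gap. The function names and the max_gap parameter are hypothetical.

```python
import math

def end_to_end_distance(anchor_start, anchor_end):
    """Euclidean distance between the two anchor positions, given as (x, y, z)."""
    return math.dist(anchor_start, anchor_end)

def cluster_by_distance(distances, max_gap):
    """Single-linkage grouping of 1-D values.

    Returns a list of clusters, each a list of indices into `distances`,
    ordered by increasing distance. Two loops share a cluster when they are
    connected by a chain of neighbours that differ by at most max_gap.
    """
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    clusters = []
    for i in order:
        if clusters and distances[i] - distances[clusters[-1][-1]] <= max_gap:
            clusters[-1].append(i)
        else:
            clusters.append([i])
    return clusters
```

Each resulting group plays the role of a super class; a second pass that groups members by loop length and backbone similarity would produce the sub classes.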
In contrast to the relatively limited conformational space of regular structure elements, loop structures are much more variable. In Loop BriX, loop fragments are between 4 and 117 irregular residues long and classes are generally less populated (Figure 2B). Intriguingly, we observe a clear distinction between classes of loops connecting different secondary structure: the number of super-classes having more than 100 fragments is much lower for α-α (8) than β-β classes (20), showing less regularity for α-α classes than for β-β classes (Supplementary Figure S7). This is explained by the fact that α-helices, being cylindrical, show much more variation at their end points, while β-strands have more regular end-to-end distances.
We then examined the results of our loop classification scheme, looking at the percentage of loops we were able to classify. At the super class level our approach classified almost 90% of 6-residue loops and 45% of 14-residue loops while the success of sub-clustering in equally sized groups decreased more rapidly (Supplementary Figure S8A). We found that the sub-classification was successful up to fragments of length 16, after which no regular loop patterns could be identified (Supplementary Figure S8B).
The first version of the BriX database already inspired many applications in the fields of structural biology and protein design. Baeten et al. showed that proteins from the widely used Park & Levitt set could be reconstructed using BriX fragments to a global 0.48Å RMSD accuracy, improving existing results using more limited structural alphabets (22).
Demon et al. used BriX database fragments in combination with the FoldX protein design algorithm to construct a model of murine caspase 3 and 7 in complex with substrate peptides. These models were subsequently used to explain experimentally observed differences in substrate specificity between caspase 3 and 7 (28,29).
In other recent work, we have shown that the structural space of protein–peptide interactions can be approximated using fragments from the BriX database (30). The interfaces of over 300 protein–peptide complexes from the PepX database (31) were reconstructed to within 1Å RMSD, using observed fragment interactions to reconstruct the binding modes. The sheer size of the database allowed us to extract structural knowledge on protein–peptide interactions.
Until now, all of these services have been limited to internal use of the database. With the updated version of the BriX and Loop BriX databases, the website and the addition of the covering and bridging algorithms (see below), we open up the possibilities to use the BriX database to the scientific community at large.
A user-friendly browsing interface is available on the website (http://brix.crg.es, Figure 3A). BriX contains two levels: the class level and the fragment level (Figure 3C). Classes can be sorted and filtered on (i) class size, (ii) fragment length (from 4 to 14 residues), (iii) clustering threshold describing the compactness of the classes, (iv) minimum and maximum percentage of helix, loop, sheet and turn content and (v) regular expressions of the amino acid sequence and secondary structure as determined by DSSP (32) (Figure 3B). For each BriX class, we generated images of the superposed fragments using Chimera (33) and logos of the sequence and structure distributions using Weblogo (34). Subsequently, the fragments of each class can be filtered on PDB ID (35), sequence or secondary structure.
Loop BriX contains three levels: (i) the superclass level with fragments of similar end-to-end distance and matching end residues, (ii) the subclass level with fragments of similar backbone patterns and length and finally, (iii) the fragment level (Figure 3D). The Loop BriX superclasses and subclasses can be queried with the same parameters as the BriX database plus end-to-end distance.
To explore the vast size of our database we provide two algorithms to query BriX and Loop BriX with a user-submitted structure: ‘covering’ and ‘bridging’. The covering algorithm covers backbone coordinates of the input structure with similar BriX classes. The bridging algorithm spans the distance between any pair of anchoring residues regardless of backbone coordinates in between them. This is extremely useful to derive plausible loop conformations where backbone coordinates are not present or poorly defined.
In Figure 4A, we show the application of the covering algorithm to a PDZ domain (PDB ID 2WL7), covering a part of the β-strand with classes from the BriX database. Residues 112–116 are selected for covering. The algorithm matches the selected region to the BriX classes by calculating the distance to each class centroid. Here, the user can select the class threshold, which defines the compactness of the classes (0.6Å in this example). Fragments are returned for every class having a centroid close enough to the query fragment. The user can also select the maximum number of fragments per class and the total minimum and maximum number of fragments (between 1 and 1000); superposition thresholds are adapted accordingly. In the case of the β-strand of the PDZ, over 3000 fragments superposing within 0.6Å are matched, of which 1000 are returned to the user as a set of downloadable fragment PDB files. Moreover, the service provides a snapshot of these fragments superposed on the query PDB as well as logos depicting sequence and structure propensities of the matched fragments, useful to derive sequence or structure relationships. Finally, the set of matching classes and fragments can be further inspected online using the previously described search interface.
The bridging algorithm works in a similar fashion. To illustrate this, we removed a loop of the same PDZ domain from the input structure (Figure 4B), which is involved in binding the peptide ligand of this domain. This loop is anchored by residue 104 on the left and residue 112 on the right, spanning a gap of 12.7Å end-to-end distance. The algorithm reconstructs a backbone with fragments from the Loop BriX database between the two anchor residues. As one might expect, the results contain loops from other PDZ domains (e.g. PDB ID 1WIF), but also loops derived from proteins with unrelated SCOP classes.
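The first step of bridging, selecting loops whose end-to-end distance matches the gap between the two anchor residues, can be sketched as a simple filter. This is an illustration under stated assumptions: the function name, the loop identifiers and the 1Å tolerance are hypothetical, and the subsequent superposition of the anchor residues is not shown.

```python
def bridge_candidates(gap, loops, tolerance=1.0):
    """Return ids of loops whose end-to-end distance lies within tolerance of the gap.

    loops: iterable of (loop_id, end_to_end_distance) pairs, distances in ångström.
    The result is ordered from the closest match to the most distant one.
    """
    hits = []
    for loop_id, distance in loops:
        difference = abs(distance - gap)
        if difference <= tolerance:
            hits.append((difference, loop_id))
    return [loop_id for _, loop_id in sorted(hits)]
```

For the 12.7Å gap of the PDZ example, a loop spanning 12.8Å would rank ahead of one spanning 12.5Å, and a loop spanning 15.0Å would be rejected.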
Given the vastness of our database, calculations can be demanding. We allocated a dedicated cluster (40 nodes) that runs the algorithms independent from the web server.
The BriX and Loop BriX databases are accessible through a web portal at http://brix.crg.es. The portal is built on the open-source Drupal Content Management System for full flexibility. The entire database with annotations is available for download in SQL format, describing the relations between classes and fragments. As an additional service for automated high-throughput querying, all information contained within the BriX and Loop BriX databases can be downloaded as CSV (comma-separated values) lists. For example, requesting the URL http://brix.crg.es/classes?Length=10&Structure=HHHHHHHHHH returns a CSV file containing BriX classes of length 10 with an α-helical structure. Finally, BriX will be updated automatically when new versions of the ASTRAL sets become available.
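A high-throughput client for the CSV interface could assemble query URLs and parse the returned lists along these lines. Only the endpoint and the Length and Structure parameters come from the example URL in the text; the helper names are hypothetical, and because the column names of the returned CSV are not documented here, the parser simply reads whatever header row the file contains.

```python
import csv
import io
from urllib.parse import urlencode

BASE_URL = "http://brix.crg.es/classes"  # endpoint taken from the example URL

def class_query_url(length, structure):
    """Build a class query URL for a fragment length and a secondary-structure string."""
    return BASE_URL + "?" + urlencode({"Length": length, "Structure": structure})

def parse_classes_csv(text):
    """Parse CSV text into a list of dicts keyed by the file's own header row."""
    return list(csv.DictReader(io.StringIO(text)))
```

Calling class_query_url(10, "H" * 10) reproduces the example URL above; the text of the response, fetched for instance with urllib.request.urlopen, can then be passed to parse_classes_csv.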
Supplementary Data are available at NAR Online.
PhD scholarship from the Institute for Science and Innovation Flanders (IWT) (to P.V. and L.B.); Long-term exchange fellowship from the Research Foundation Flanders (FWO) (to P.V.); PhD scholarship from the EU grant Penelope FW6 (to E.V.). Funding from the EU grants 3D repertoire and the Spanish grant Centrosome3D. Funding for open access charge: EU grants 3D repertoire and the Spanish grant Centrosome3D.
Conflict of interest statement. None declared.
The authors thank Almer M. van der Sloot and Joke Reumers for critical reading of the manuscript.