In this, the 200th anniversary of Charles Darwin’s birth and the 150th anniversary of the publication of the Origin of Species, it is fitting to revisit the classification of protein structures from an evolutionary perspective. Existing classifications use homologous sequence relationships, but knowing that structure is much more conserved than sequence creates an iterative loop from which structures can be further classified beyond the level of the domain, thereby teasing out distant evolutionary relationships. The desired classification scheme is then one in which a fold is merely semantics and structure can be classified as either ancestral or derived.
In 1980 the Protein Data Bank (PDB) contained fewer than 100 structures, and structural biologists had studied and could name most if not all of them. Today the PDB contains approximately 55,000 macromolecular structures of proteins, DNA, RNA, and complexes thereof, often combined with a variety of small molecules. No human can assimilate such a breadth of information, and so it is only natural, as has happened in so many areas of science with positive consequences, that we attempt an act of reductionism. Thus, the classification of protein structures is an attempt at reductionism from which biological function can be better interpreted. In its purest form reductionism would imply that the application of a simple theory could take a subset of structures, the unique set, and generate all others from it. Clearly this cannot be done completely, since Nature is far too tricky, but the notion of generating all structures from a parts list has persisted. Two parts are considered the same if they can be superimposed in three dimensions. This raises at least three issues: what constitutes a part, what metric defines two parts as the same, and, most importantly, does that sameness convey any biological meaning? Stating the problem a different way, the parts list approach could be considered a bottom-up approach, whereas a consideration of the biological context is a top-down approach. The issues then become how well the two approaches mesh in the middle and what constitutes the biological context.
Already we have introduced a very significant set of issues, yet enormous scientific progress has been made through existing classification schemes. Let us briefly consider some of these schemes in the context of the bottom-up versus top-down approaches. This will serve as an introduction to why we believe the future calls for a more detailed classification which only makes sense in an evolutionary framework.
A large variety of protein structure comparison algorithms have been developed over the past 20 years (see  for a review). While they use different methods of protein representation, different algorithms for comparison and different scoring functions, in the majority of cases the end result is a geometric comparison that yields a superposition of the structures, described by a root-mean-square deviation (RMSD), a length of alignment, a number of gaps, and a score of statistical significance. As was shown a number of years ago and again more recently, there is rarely a unique answer, and at a fine level of detail (the devil is often in the details) a purely geometric comparison certainly leads to misalignments by failing to capture the biological relevance. Nevertheless, these methods lead to a reductionism which provides a non-redundant structural set, as originally exemplified by Dali and the FSSP database, with a number of other databases of classified protein structures following. In the majority of cases the comparison is between protein domains and beyond that has little biological context.
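As a minimal illustration of the RMSD term mentioned above, the following Python sketch computes the deviation over paired atoms. It assumes the two structures have already been aligned (equivalent atoms paired) and superimposed; finding that optimal superposition is the hard part that the comparison algorithms solve and is not attempted here. The function name `rmsd` and the coordinate format are illustrative assumptions, not taken from any of the cited methods.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between paired 3D coordinates.

    Both arguments are equal-length lists of (x, y, z) tuples.
    Assumption: pair i in coords_a is equivalent to pair i in coords_b,
    and the two sets are already superimposed.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared Euclidean distance between the two equivalent atoms.
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


if __name__ == "__main__":
    # Hypothetical coordinates, for illustration only.
    a = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
    b = [(0.0, 0.1, 0.0), (1.5, -0.1, 0.0), (3.0, 0.2, 0.0)]
    print(f"RMSD = {rmsd(a, b):.3f}")
```

A single RMSD value says nothing about which residues were paired, which is one reason the text notes that geometric similarity alone can miss the biologically relevant alignment.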
Top-down approaches are exemplified by CATH and SCOP, today’s gold standards for protein structure classification. While the sheer volume of data to classify requires automation (CATH more than SCOP), human expertise is still used since difficult cases require manual inspection. Much has been written about CATH and SCOP, and comparisons have been made between these classification schemes, so there is no need to go into further detail here. Both methods involve a consideration of protein domains and incorporate the biological context primarily through detecting homologous sequence relationships. This latter point implies that evolution is already a consideration in structure classification; here we suggest that this needs to be taken further. How extant proteins emerged from smaller building blocks, the role of gene duplication, convergence versus divergence, and co-evolution in a functional context are examples of evolutionary considerations that need to be incorporated into future protein structure classification schemes, as we shall see subsequently. In this context we would argue that the end goal of protein classification is to describe the evolutionary pathways between all protein structures.
Protein domains, as independent folding units, are the modular building blocks of proteins, and most current protein structure classification schemes, whether top-down or bottom-up, are based on domains. Protein domain definition from 3D structure is not a fully solved problem [13,14], which explains some of the differences between existing classification schemes. Since many proteins are multi-domain proteins, and multi-domain proteins are more common in eukaryotes than prokaryotes, we already have a hint of the role evolution can play in an extended protein structure classification scheme. Some domains have high sequence similarity and are evolutionarily related; others are distantly related, sharing obvious structure similarity but not sequence similarity; others have similar topologies, but not to the point where there is clear evidence of common ancestry. Taking SCOP as an example, the first two groups are further classified into the family and superfamily levels, forming a hierarchical scheme. Therein lies a fundamental problem: a domain can be thought of as both an evolutionary and a non-evolutionary unit. Difficulties with current schemes are further compounded by the notion of folds (all or part of a domain), which are considered discrete components in current top-down classification schemes. Folds are not considered from an evolutionary perspective, but they may be related. Folds do change during evolution to give rise to new folds [15,16]. Grishin proposed that it is possible for an all-alpha fold to evolve into an all-beta fold by sequential secondary structure flip-over. Similarly, recent work attempted to create two short peptides with high sequence similarity but distinct folds. The authors achieved this goal with two 50 amino acid peptides with 88% sequence identity, but totally different structure and function. Finally, another case that is difficult for the current classification schemes to embrace is that of chameleon sequences, which can adopt multiple folds.
If one accepts the notion of gradual structural variation at the fold level, how can protein structures be classified this way? One notion is the use of smaller fragments, but as we shall propose subsequently, this too only makes sense in the light of evolution. In summary, whether or not two proteins are in the same fold is really semantics, whereas describing which is ancestral and which is derived truly captures their relationship. Unfortunately this is a harder problem than simply clustering similar structures. In part it is harder because first one needs to identify that protein within extant species, and second one needs to know the relationship between those species and their ancestors. Ironically, the first problem is addressed well using existing classification schemes.
The recent accumulation of genomic and structural data as well as improvements in homology detection algorithms has led to the reliable prediction of the protein domain content of all completed genomes using both SCOP and CATH domain definitions [21,22]. These protein domain distributions are the starting point for the investigation of protein domain evolution in the genomic era [23–27].
The work of assessing the distribution of domain content across the tree of life began shortly after the completion of the first genomes from each of the three superkingdoms. As the number of structures and the number of genomes accumulated, a power law distribution of domains and domain combinations emerged. Several models have been proposed to explain this distribution [31,32]. To illustrate this point, according to SCOP 1.73, which contains 1087 folds, 692 folds contain only one family (and hence one superfamily). Therefore, the majority of folds correspond to one homologous family that covers a very tiny portion of sequence space. Conversely, the ferredoxin-like fold (SCOP d.58) is found in 55 superfamilies, comprising 123 families. This imbalance is undoubtedly the result of evolution, as can be seen by considering the power law relationship with respect to the complexity of the organism.
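The skew described above can be made concrete with a short Python sketch. It tallies how many distinct families fall under each fold and then how many folds have exactly k families, which is the kind of distribution in which a power law would be looked for. The helper names and the toy assignments are hypothetical; only the two SCOP 1.73 totals (1087 folds, 692 single-family folds) come from the text.

```python
from collections import Counter

# Totals quoted in the text for SCOP 1.73.
TOTAL_FOLDS = 1087
SINGLE_FAMILY_FOLDS = 692


def families_per_fold(assignments):
    """Count distinct families under each fold.

    assignments: iterable of (fold_id, family_id) pairs; duplicates are ignored.
    """
    unique_pairs = set(assignments)
    return Counter(fold for fold, _family in unique_pairs)


def size_distribution(counts):
    """Map k -> number of folds that contain exactly k families."""
    return Counter(counts.values())


if __name__ == "__main__":
    # Hypothetical toy assignments, not real SCOP content.
    toy = [
        ("fold_A", "fam_1"), ("fold_A", "fam_2"), ("fold_A", "fam_3"),
        ("fold_B", "fam_4"),
        ("fold_C", "fam_5"),
    ]
    print(size_distribution(families_per_fold(toy)))
    print(f"single-family folds: {SINGLE_FAMILY_FOLDS / TOTAL_FOLDS:.1%}")
```

With the quoted totals, roughly 64% of folds hold a single family, while a handful of folds such as the ferredoxin-like fold hold a large number.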
Two independent groups compared domain abundance to features representing complexity, namely genome size and numbers of cell types. Ranea et al. clustered domain families into three categories in terms of their relationship to genome size: unrelated (mainly translation and biosynthesis), linearly related (mainly metabolism) and non-linearly related (mainly involved in gene regulation). Vogel et al. compared domain family abundance with cell type numbers in different eukaryote species. About 10% of domain families have a strong correlation with complexity. Half of these superfamilies are involved in extracellular processes and regulation. Such results imply subtle structure-function relationships of protein domains during evolution, leading to the current protein structure repertoire.
An important evolutionary consideration is not just the abundance of domains, but their organization. Over 70% of proteins in eukaryotes and over 50% of proteins in prokaryotes contain more than one domain. These multi-domain proteins are represented by linear combinations of domains: the domain architecture. Domain architectures arise through domain shuffling, domain duplication, and domain insertion and deletion (see [36,37] for a review), leading to new functions. Basu et al. defined “promiscuous” domains as those that occur in diverse domain architectures. The authors provided a measurement of the promiscuity of domains based on the frequency of their coexistence with different domain partners. A systematic comparative genomic analysis in 28 eukaryotes resulted in 215 strongly promiscuous domains. It is not surprising that most are involved in protein-protein interactions, especially in signal transduction pathways. Vogel et al. observed an over-representation of some two-domain or three-domain combinations in complete genomes and termed them “supra-domains.” Those supra-domains (described here as macrodomains) have stable internal domain architectures that are conserved over long evolutionary distances, acting like a single domain in combination with other domain partners. About 1400 macrodomains have been identified with diverse functions, indicating that the preferred association of certain domains is universal and evolutionarily advantageous. These two examples show that domain combinations are determined by functional constraints and evolutionary selection, not just random processes. As such, domain combinations are an important aspect of any protein classification scheme.
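To give a feel for what a partner-based promiscuity measure looks like, the Python sketch below counts, for each domain, the number of distinct domains found immediately adjacent to it across a set of architectures. This is a deliberately simplified stand-in and not the published measure, which is based on the frequency of coexistence with different partners; the function name and the toy architectures are hypothetical.

```python
def distinct_partner_counts(architectures):
    """Count distinct adjacent partners for every domain.

    architectures: iterable of tuples of domain names in N- to C-terminal order.
    Returns a dict mapping each domain to its number of distinct neighbours.
    """
    partners = {}
    for arch in architectures:
        # Register every domain so single-domain proteins score zero.
        for domain in arch:
            partners.setdefault(domain, set())
        # Walk over adjacent pairs in the linear architecture.
        for left, right in zip(arch, arch[1:]):
            if left != right:  # a tandem repeat is not a new partner
                partners[left].add(right)
                partners[right].add(left)
    return {domain: len(found) for domain, found in partners.items()}


if __name__ == "__main__":
    # Hypothetical architectures, for illustration only.
    toy = [
        ("SH3", "SH2", "Kinase"),
        ("SH3", "PDZ"),
        ("PH", "SH3"),
        ("Kinase",),
    ]
    ranked = sorted(distinct_partner_counts(toy).items(),
                    key=lambda item: (-item[1], item[0]))
    for domain, n_partners in ranked:
        print(domain, n_partners)
```

A raw partner count of this kind favours abundant domains, which is one reason a frequency-based measure is preferable in practice.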
A logical extension of these findings is to map domain combinations to presumed phylogenetic relationships derived by other means, as Snell et al. have exemplified. Kummerfeld et al. counted the distribution of various types of single-domain and multi-domain proteins across the tree of life and predicted that fusion is four times more common than fission in domain combinations. Fong et al. viewed the domain architecture in multi-domain proteins as the rearrangement of existing architectures, acquisition of new domains or deletion of old domains, and proposed a parsimony model to derive the evolutionary pathways by which extant domain architectures may have evolved. Guided by the evolutionary information in phylogenetic trees, Ekman et al. studied the rate of multi-domain architecture formation across different branches of the phylogenetic tree and found that there are elevated rates of domain rearrangement in Metazoa, whereas creation of domains was more frequent in early evolution. Similarly, Itoh et al. observed a large number of group-specific domain combinations in animals and investigated the difference in domain combinations among different phylogenetic groups. Yang et al. aimed to derive the entire evolutionary history of each domain and domain combination throughout the tree of life by mapping current domain content onto the species trees. This approach reveals the origin of each protein domain as well as evolutionary processes such as horizontal gene transfer among more distant species.
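The idea of mapping present-day domain content onto a species tree can be sketched with a generic parsimony pass. The Python example below uses Fitch parsimony on presence/absence of a single domain; this is a textbook algorithm substituted here for illustration and is not the specific model or method of any of the studies cited above. The tree shape, taxon names and presence values are hypothetical.

```python
def fitch(tree, leaf_states):
    """Bottom-up Fitch pass on a rooted binary tree.

    tree: a leaf name (str) or a 2-tuple of subtrees.
    leaf_states: dict mapping leaf name -> state (here 1 = domain present, 0 = absent).
    Returns (candidate_states_at_this_node, minimum_number_of_state_changes).
    """
    if isinstance(tree, str):
        return {leaf_states[tree]}, 0
    left_states, left_changes = fitch(tree[0], leaf_states)
    right_states, right_changes = fitch(tree[1], leaf_states)
    shared = left_states & right_states
    if shared:
        # Children agree on at least one state: no extra change needed.
        return shared, left_changes + right_changes
    # Children disagree: keep both options and count one change.
    return left_states | right_states, left_changes + right_changes + 1


if __name__ == "__main__":
    # Hypothetical species tree and domain occurrence.
    species_tree = ((("taxon_A", "taxon_B"), "taxon_C"), "taxon_D")
    presence = {"taxon_A": 1, "taxon_B": 1, "taxon_C": 0, "taxon_D": 0}
    root_states, changes = fitch(species_tree, presence)
    print("root candidates:", root_states, "changes:", changes)
```

In this toy case the most parsimonious reading is a single gain on the branch leading to taxon_A and taxon_B. Plain Fitch parsimony treats gain and loss as equally likely and cannot by itself distinguish vertical inheritance from horizontal transfer, so it should be read as a starting hypothesis only.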
The discussion thus far has focused on the protein domain as the best single level for classifying protein structure, but it is by no means the only one. Just as Ford Doolittle has argued the shortcomings of tree representations to illustrate the relationship between species, calling for a pluralistic approach where no one tree maps all species, we propose a pluralistic approach to protein structure classification incorporating domains, subdomains, macrodomains, and both convergent and divergent evolution.
Subdomain Features
There are currently several available tools for comparing proteins at the subdomain level. Fragnostic is a database that defines relationships in the PDB based on shared fragments between structures. These fragments share both structural and sequence similarity and vary in size from 5 to 20 residues. Each of these relationships can be viewed as an edge between two structures that is ambiguous (not defined as divergent or convergent evolution) and directionless. However, combining this information with other sources of information could polarize and test some of these edges as hypotheses for structural evolution.
Another subdomain unit is the closed loop. Most protein structures are composed of loops that come back around on themselves every 25–30 residues. Domain Hierarchy and closed Loops (DHcL) is a web server that decomposes protein structures into domains and closed loops based on van der Waals energies. The protein modules that are the most conserved since the last universal common ancestor (LUCA) correspond to closed loops. Recently all prokaryotic proteins were decomposed into 20-residue fragments (possible closed loops) and clustered based on an identity threshold. The authors found that fragments that corresponded to closed loops were more likely to form large clusters. It is possible to walk between clusters because some have small connections. The authors propose that this description is superior to a domain-based one because it represents a finer view of protein function. Closed loops of a common origin in different superfamilies could be evidence for a common ancestor between those superfamilies.
Functional sites are another subdomain feature that could be used for classification. Many distinct superfamilies bind the same ligand. It is possible that these superfamilies share a common ancestor that bound that ligand, but diverged in global structure while the site that binds the ligand was conserved. SMAP finds such binding pockets with both sequence and structural conservation, so these are probably the result of divergent evolution. However, it is also possible that two superfamilies could converge on the same ligand. The PROCOGNATE database defines which superfamilies bind which ligand using structural information from the PDB. A combination of these approaches could create a ligand-based classification for domains that encompasses both convergent and divergent evolutionary events.
A protein-protein interaction site is an example of a macro feature conserved from an evolutionary perspective. The interface is conserved while the composite proteins form new superfamilies. A comparison between all protein-protein interfaces in the PDB revealed several examples of highly similar interfaces between different pairs of superfamilies. MAPPIS is a tool for aligning protein-protein binding sites. This level of classification is best done using quaternary structure. 3D complex is a database that classifies protein structures by their quaternary structure. Homomeric complexes evolve in a stepwise fashion from monomers to structures with cyclic symmetry and then to structures with dihedral symmetry. This information can be used to establish evolutionary relationships between homomers. As an example, consider the SCOP family N-acylglucosamine (NAG) epimerase (48222). SCOP 1.73 has two structures in this family: N-acyl-D-glucosamine 2-epimerase (1fp3) and NAG isomerase (2afa). N-acyl-D-glucosamine 2-epimerase is a dimer with cyclic symmetry (C2) and NAG isomerase is a hexamer with dihedral symmetry (D3) according to the 3D complex database. Under the stepwise model, this implies that NAG isomerase must be derived from N-acyl-D-glucosamine 2-epimerase, which in turn evolved from one of the many monomeric structures found in this superfamily. It should be noted that there may be structural intermediates that have not yet been solved.
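The polarity argument in the example above can be written down as a small rule. The Python sketch below ranks symmetry labels in the order monomer, cyclic, dihedral and, for two homomers from the same family, reports which one the stepwise model would treat as ancestral and which as derived. The function names are hypothetical, the label format (for example "C2", "D3") is an assumption about how symmetry is recorded, and the rule ignores unsolved intermediates.

```python
def symmetry_rank(label):
    """Rank a symmetry label: 0 = monomer, 1 = cyclic (Cn), 2 = dihedral (Dn)."""
    text = label.strip().upper()
    if text in ("MONOMER", "C1"):
        return 0
    if text.startswith("C") and text[1:].isdigit():
        return 1
    if text.startswith("D") and text[1:].isdigit():
        return 2
    raise ValueError(f"unrecognised symmetry label: {label!r}")


def polarize(name_a, symmetry_a, name_b, symmetry_b):
    """Return (ancestral, derived) under the stepwise model, or None if undecided."""
    rank_a, rank_b = symmetry_rank(symmetry_a), symmetry_rank(symmetry_b)
    if rank_a == rank_b:
        return None  # same symmetry class: this rule cannot polarize the pair
    return (name_a, name_b) if rank_a < rank_b else (name_b, name_a)


if __name__ == "__main__":
    # The two structures and symmetries quoted in the text.
    print(polarize("1fp3", "C2", "2afa", "D3"))
```

Applied to the two structures above, the rule returns 1fp3 as ancestral and 2afa as derived, matching the reasoning in the text; it yields a hypothesis to be tested against other evidence, not a proof of descent.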
Quaternary structure can also define the evolution of some heteromeric complexes. The simplest case is when a heteromer is composed of the same chains as a homomer. The heteromer is almost certainly derived via gene duplication. There are many examples in SCOP where proteins in the same family or superfamily have different quaternary structures. We propose that this information must be incorporated in a classification scheme. A domain based scheme would simply say these proteins share a common ancestor, while a system that includes quaternary structure defines them more explicitly. In summary a domain based classification implies common ancestry, but a macrodomain and subdomain analysis implies an evolutionary hypothesis.
We are proposing a pluralistic (some would say fuzzy) approach to protein structure classification that depends much more on evolution than simply defining homologous relationships between sequences as used in current top-down approaches. Yet these existing schemes form the basis from which pluralism is possible. Pluralism still proposes the domain as a fundamental evolutionary unit, yet encompasses the notion of subdomains and macrodomains.
The scheme needs to be dynamic since many phylogenetic relationships upon which the classification is based will change. For example, there are currently several proposed branching orders for the major taxonomic groups [59,60]. In the Cavalier-Smith scheme, archaea and eukaryotes are both derived superkingdoms, so if there is a link between a protein in bacteria and another found only in archaea, the archaeal protein must be derived. The tree of life implies polarity in the evolution of proteins, but the classification of proteins can also polarize the tree of life. Ideally the two would eventually converge to a solution that captures the history of species as well as proteins. Difficulties arise with our pluralistic scheme since convergence of structure reflects independent evolutionary invention of similar structural folds. Although convergent evolution of structure is rare, it does occur; can we then really know that promiscuous folds, such as the TIM beta/alpha barrel fold, did not emerge several times independently in evolution? How many cases are there like this?
In our pluralistic scheme any relationship can be defined as divergent, convergent, or ambiguous. What would the map of protein classification/evolution look like when it is complete? It would likely consist of a series of views at different levels of structural granularity, where each feature in a given structure could be mapped to equivalent features in other structures and mapped to its presence or absence in extant organisms and, by inference, common ancestors. The ancestry of modern proteins would reveal the history of their domains and domain combinations as well as similar and dissimilar micro and macro features. The architecture of the classification scheme would depend on the level at which it was being explored. Domains would exist as part of a directed acyclic graph if their ancestry was established, or as undirected graphs for convergent or ambiguous events.
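One way to picture the mixed representation described above is a small data structure that keeps directed ancestral-to-derived edges acyclic and stores convergent or ambiguous relationships as undirected edges. The Python class below is only a sketch under that reading; the class name, method names and node labels are hypothetical and do not correspond to any existing resource.

```python
class RelationshipGraph:
    """Directed acyclic edges for established ancestry, undirected edges otherwise."""

    def __init__(self):
        self.derived_from = {}  # ancestor -> set of directly derived nodes
        self.undirected = {}    # frozenset({a, b}) -> "convergent" or "ambiguous"

    def _reachable(self, start, target):
        """True if target can be reached from start along directed edges."""
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == target:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(self.derived_from.get(node, ()))
        return False

    def add_divergent(self, ancestor, derived):
        """Record that `derived` descends from `ancestor`, refusing cycles."""
        if ancestor == derived or self._reachable(derived, ancestor):
            raise ValueError("edge would make the ancestry graph cyclic")
        self.derived_from.setdefault(ancestor, set()).add(derived)

    def add_undirected(self, node_a, node_b, kind):
        """Record a convergent or ambiguous relationship without direction."""
        if kind not in ("convergent", "ambiguous"):
            raise ValueError("kind must be 'convergent' or 'ambiguous'")
        self.undirected[frozenset((node_a, node_b))] = kind

    def ancestors(self, node):
        """All nodes from which `node` is reachable along directed edges."""
        return {other for other in self.derived_from
                if other != node and self._reachable(other, node)}


if __name__ == "__main__":
    graph = RelationshipGraph()
    # Hypothetical labels, for illustration only.
    graph.add_divergent("monomer_X", "dimer_Y")
    graph.add_divergent("dimer_Y", "hexamer_Z")
    graph.add_undirected("hexamer_Z", "barrel_W", "ambiguous")
    print(sorted(graph.ancestors("hexamer_Z")))
```

An ambiguous edge can later be promoted to a directed one when evidence polarizes it, which fits the requirement that the scheme remain dynamic.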
If such an integrated scheme were in place, and it is a big if, we could contemplate protein evolution in and before LUCA. LUCA has been estimated to contain over 140 different superfamilies, although we argue this is an underestimate (in preparation). It has also been proposed that the oldest fold is the P-loop containing nucleoside triphosphate hydrolase. But how did this fold arise? If we are to root a classification based on evolution we need to explain how to get from that fold to 140 different superfamilies. This is not possible by simply comparing sequences or even structures of whole domains. Protein evolution probably began with structures smaller than what we would consider a domain. It has been proposed that the earliest proteins were created by trans-splicing RNAs that code for protein modules and that the origin of genes is much later and independent in archaea and bacteria. Understanding the relationship between the modules that composed LUCA is essential to testing this idea and other hypotheses about LUCA. This will only be possible by classifying protein structures based on an evolutionary scheme at all levels of protein structure.
A pluralistic scheme of protein structure classification is only possible by virtue of the foresight and hard work that has gone into creating our existing bottom-up and top-down approaches. Notwithstanding, if improvements in important areas such as functional annotation and structure prediction are to be made, new insights are needed. Further use of what evolution can teach us would seem to be required. In so doing Nature’s reductionism will become the reductionism that helps science advance.
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.