To understand the meaning of a family, we compared the groupings of domains in SCOP with automatically generated groupings based independently on each of the three aspects we wished to investigate: sequence, structure and function. Since we begin without a preconceived idea of the granularity (size or depth) of the groupings, it is necessary to generate the automatic groupings at every possible level. These are represented by a tree resulting from hierarchical clustering of the domains based on one of the three sources of information: sequence similarity, structural similarity, or functional labels (in the form of Gene Ontology terms and Enzyme Commission numbers). The level of agreement between one type of information and the grouping of a SCOP family can be assessed by asking whether each edge in the tree divides domains into family groups, or instead splits a family and groups together domains from different families.
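The edge test described above can be sketched in a few lines of Python. This is only an illustration under a stated assumption: an edge is counted as a disagreement when the bipartition it induces is incompatible with some family, i.e. both members and non-members of that family occur on both sides of the edge. The exact rule used in the analysis is given in the methods and may differ; the function name and the toy domain identifiers are hypothetical.

```python
from typing import Dict, Set


def edge_conflicts(side: Set[str], all_domains: Set[str],
                   family_of: Dict[str, str]) -> bool:
    """Return True if the bipartition induced by a tree edge is
    incompatible with at least one family (assumed conflict rule)."""
    other = all_domains - side
    for fam in {family_of[d] for d in all_domains}:
        members = {d for d in all_domains if family_of[d] == fam}
        non_members = all_domains - members
        # The family is compatible with the edge unless its members AND
        # its non-members both occur on both sides of the edge.
        if (members & side and members & other
                and non_members & side and non_members & other):
            return True
    return False


# Toy superfamily with two families, A and B (hypothetical identifiers).
family_of = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B"}
domains = set(family_of)

inside_family = edge_conflicts({"a1", "a2"}, domains, family_of)
mixes_families = edge_conflicts({"a1", "b1"}, domains, family_of)
print(inside_family, mixes_families)
```

An edge that only subdivides one family is not penalised under this rule, which matches the idea that the tree may resolve structure below the family level without contradicting the classification.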
The ROC curve in Figure 1 shows the number of disagreements and agreements of the trees produced from sequence, structure and functional data with the SCOP family classification for varying confidence values. For sequence, confidence is ranked by bootstrap percentages; for structural data, confidence is based on the structural distance scores; and for function, confidence is based on the total number of terms which suggest a particular clade in the trees. See Materials and Methods for details of a web resource providing all data and trees.
Figure 1. The number of superfamily agreements/disagreements with SCOP for varying confidence values. A ROC curve showing the number of superfamilies containing agreements against the number containing disagreements of trees with SCOP's groupings, for confidence ...
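As a rough illustration of how such a curve can be traced, the sketch below sorts tree edges by confidence and accumulates agreements and disagreements as the threshold is lowered. It is a simplification: it counts individual edges, whereas the figure counts superfamilies containing agreements or disagreements. The function name and the input values are hypothetical.

```python
from typing import List, Tuple


def agreement_curve(edges: List[Tuple[float, bool]]) -> List[Tuple[int, int]]:
    """Cumulative (disagreements, agreements) as the confidence
    threshold is lowered from the most to the least confident edge."""
    points = []
    agree = disagree = 0
    for _confidence, agrees in sorted(edges, key=lambda e: e[0], reverse=True):
        if agrees:
            agree += 1
        else:
            disagree += 1
        points.append((disagree, agree))
    return points


# Hypothetical edges: (bootstrap percentage, agrees with SCOP family?)
edges = [(95.0, True), (88.0, True), (72.0, False), (60.0, True)]
curve = agreement_curve(edges)
print(curve)  # [(0, 1), (0, 2), (1, 2), (1, 3)]
```

Plotting the first element of each point against the second gives one line of the curve; a line that rises steeply before moving right indicates that high-confidence edges mostly agree with the classification.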
Within the literature there is variation in the suggested minimum informative bootstrap confidence, with most sources suggesting that about 70-80% is required. Across 2046 families in 428 superfamilies, we found that 99.6% of the phylogenetic trees agree with the SCOP groupings for bootstrap values above 80%. We also found that, although less reliable, useful information can still be obtained from the trees for bootstrap values down to 60%. These results show that, to the extent to which sequence information can reliably determine evolutionary relationships, SCOP family groupings are evolutionarily consistent. Classical sequence phylogenetics is quite reliable for high bootstrap values, but is limited in the evolutionary distance over which it can resolve relationships. There are plenty of SCOP family groupings which sequence-based phylogenetics alone is unable to determine with high confidence: these are the low-confidence parts of the tree. Although classical phylogenetic analysis cannot inform us directly about the evolutionary consistency of many family groupings, the strong agreement with those groupings it can resolve strongly suggests that the others (classified independently of this information) are also likely to be evolutionarily consistent.
The top 13 sequence-tree edges that conflicted with SCOP were examined. These are shown in a table in Figure 2, along with an example of each type of disagreement. The most frequent disagreement came from families which were classified not long after the creation of SCOP, at a time when PFAM sequence data was not available and therefore could not provide evidence in the curation of SCOP families. Sequence information from PFAM now contributes to the data used to guide the classification. An example is shown in Example 1. We also find cases such as that shown in Example 2, where a family has been decided in SCOP based on function. Trees based on both sequence and structure place the single-domain Pancreatic carboxypeptidases family between domains from a different family, causing a disagreement of the trees with the SCOP families. In this case the classification of a domain into a new family of its own was likely based on a functional signal; however, the tree based on function places the domain in a similar way to those based on structure and sequence, suggesting that the domain should probably belong to the surrounding family. Our method classes 'nested families', whereby one family grows from another in the tree, as inconsistent with evolution (shown in Example 3). In some sense this is more a reflection of the limited number of levels in the hierarchy, suggesting that some families actually represent a 'sub' family of another. We also find a small number of other artefacts, where a family classification is based on the source species. This can happen with proteins found in viruses. We also see cases such as duplications of domains grouped within the same family; an illustration of this is shown in Example 4.
Figure 2. Examples of disagreements with SCOP. Examples of SCOP superfamilies which contain a disagreement found with trees based on sequence information, supported by high confidence values. Four of the common reasons for disagreement are explained. Images produced ...
A further factor which may contribute to the disagreements seen in trees calculated from sequence data, compared to those from the other data sources, is also worth noting. Diverse superfamilies with very low sequence identity between member domains may yield an unreliable multiple sequence alignment, thereby producing a tree with limited accuracy. Anomalies introduced by this effect are more likely to be seen in very large superfamilies with a great deal of structural variation.
The trees built from automatically generated structural distances largely agree with, but are not always consistent with, SCOP’s hand-annotated groupings. The hand classification of structures in SCOP at the superfamily and fold levels is often referred to as the gold standard in the field, and clearly surpasses any fully automatic method. Since detectable structural similarity remains long after sequences have diverged beyond the point of recognition, the structurally derived trees are able to resolve deeper edges of the tree with higher confidence than the sequence-based ones (the intersection of the red and blue lines in Figure 1). That the trees are largely in agreement with the family classification indicates that SCOP is also evolutionarily consistent at greater divergence distances. The differences we see could either be cases where SCOP has grouped domains based on some criterion other than evolution (e.g. common function), or may be due to geometric structural distance being in some cases a poor measure of divergence. For some proteins, changes to the structure of a binding site may be the best indication of evolutionary divergence, but these changes make a relatively small contribution to the automatic superposition of the whole body. Conversely, movements of secondary structures relative to each other, e.g. a change of angle between beta-sheets, can cause dramatic changes in superposable structural distance which mask the true relationships. In this way structural geometric distance does not always equate to evolutionary distance.
Most high-ranking disagreements between the SCOP family classification and the structural trees can be explained by the above; one exception, however, is shown in Example 2 of Figure 2. This example shows a sequence tree, but we see the same disagreement in the structural tree, which in this case suggests the possibility of a misclassification.
The lines for EC numbers and GO terms shown in Figure 1 are smaller and less smooth than the others. This is because confidence values are generated using the total number of independent features that support a particular edge of the tree. There are not very many GO features per tree and barely any for EC numbers. This is partly due to a lack of richness in the ontological hierarchy, but also due to the incomplete annotation of the domains with terms. Trees derived from both GO and EC functional data are less consistent with the family level than trees derived from structure or sequence, though the majority still agree with the classification. This may be due to the low quality of the derived functional dataset, most commonly the lack of functional annotation for a particular domain. Functions are also assigned to the protein chain rather than to individual domains, so terms may be uninformative for two domains found within the same protein. The fact that the correlation with function is so much weaker than with sequence and structure suggests that although function may guide the choice of granularity or level of grouping of families in SCOP (see section on Distribution of GO terms), it is not a primary source of information for determining relationships.
In SCOP all domains must belong to a family, so a superfamily with a single member must also have a single family. As more structures are added to a superfamily over time, some new additions may have enough in common to be grouped apart from the rest, and a second family is created to hold them. If this happens successively, the result is that some families contain domains with something in common, while any leftovers lacking common features with each other may remain in the original family that contained the first member of the superfamily. These non-specific families are referred to here as 'dustbin families'. The 'dustbin families' line in Figure 1 is derived from the same trees as the standard domain sequences line, but the rules by which edges are defined as conflicting are adjusted so as not to penalise the presence of a single dustbin family in each superfamily. Contrary to expectations, the results show that dustbin families are not a major feature of the SCOP classification.
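One way to picture the adjusted rule is sketched below. It rests on two assumptions that are not taken from the text: a conflict is defined as split incompatibility between an edge and a family, and the dustbin family is taken to be whichever single family, when exempted, leaves the fewest conflicting edges. The function names and toy data are hypothetical.

```python
from typing import Dict, List, Optional, Set


def conflicting_edges(splits: List[Set[str]], family_of: Dict[str, str],
                      exempt: Optional[str] = None) -> int:
    """Count edges (given as one side of each bipartition) that are
    incompatible with some family other than the exempt one."""
    domains = set(family_of)
    count = 0
    for side in splits:
        other = domains - side
        for fam in set(family_of.values()):
            if fam == exempt:
                continue  # the dustbin family is not penalised
            members = {d for d in domains if family_of[d] == fam}
            outsiders = domains - members
            if (members & side and members & other
                    and outsiders & side and outsiders & other):
                count += 1
                break
    return count


def conflicts_allowing_dustbin(splits: List[Set[str]],
                               family_of: Dict[str, str]) -> int:
    """Fewest conflicts achievable by exempting one family (assumed rule)."""
    families = set(family_of.values())
    return min((conflicting_edges(splits, family_of, exempt=f)
                for f in families), default=0)


# Toy superfamily: D is a leftover family scattered between A and B.
family_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "d1": "D", "d2": "D"}
splits = [{"a1", "a2"}, {"b1", "b2"}, {"d2", "b1", "b2"}]

strict = conflicting_edges(splits, family_of)
relaxed = conflicts_allowing_dustbin(splits, family_of)
print(strict, relaxed)
```

In the toy tree the only conflict is caused by the scattered family D, so it disappears once a single dustbin family is tolerated.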
Figure 3 shows the maximum sequence divergence between any two members of a family or superfamily, i.e. a measure of the divergence within that family or superfamily. The analysis of sequence distances shows that the maximum sequence diversity for domains grouped within a family is on average 22%, with the majority of families having a maximum sequence distance of 10-30%. Superfamilies, on the other hand, have a sequence diversity spread of 8% and below, with the average being close to zero. While it is well known that remote homology detection at the superfamily level is a difficult problem, the data show that about half (169) of the 341 families (the most divergent family within each of the 341 superfamilies in the analysis) contain members with no less than 20% sequence identity.
Figure 3. Sequence divergence in families and superfamilies. Graph shows the maximum sequence diversity between two members of the same superfamily (or family) in SCOP. Domains which continue to diverge beyond detectable sequence identity have their distribution ...
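The per-group quantity plotted here is simply the largest pairwise distance among the members of a family or superfamily. A minimal sketch follows, assuming a precomputed table of pairwise distances; the distance values, domain identifiers and function name are hypothetical, and the same function applies equally to sequence or structural distances.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List


def max_divergence(members: List[str],
                   distance: Dict[FrozenSet[str], float]) -> float:
    """Largest pairwise distance among the members of a group;
    0.0 for a group with fewer than two members."""
    return max((distance[frozenset(pair)]
                for pair in combinations(members, 2)), default=0.0)


# Hypothetical pairwise distances between three domains.
distance = {
    frozenset({"d1", "d2"}): 0.35,
    frozenset({"d1", "d3"}): 0.80,
    frozenset({"d2", "d3"}): 0.78,
}

family_spread = max_divergence(["d1", "d2"], distance)
superfamily_spread = max_divergence(["d1", "d2", "d3"], distance)
print(family_spread, superfamily_spread)
```

Because a superfamily contains all members of its families, its maximum divergence can never be smaller than that of any family within it.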
Figure 4 shows the maximum structural distance found between two members of the same superfamily or family. The distribution shows that the maximum structural distances are greater between two members of the same superfamily than between two domains grouped in the same family.
Figure 4. Structural divergence in families and superfamilies. Graph shows the maximum structural diversity between two members of the same superfamily (or family) in SCOP. Structural distances used are the scores produced by Structal for the alignment of two domains. ...
It is clear from the distribution in the graph in Figure 3 that SCOP families are not selected by simply choosing an arbitrary sequence identity cutoff, and that the process of curation is much more elaborate.
Distribution of GO terms
Figure 5 shows the distribution of GO terms annotated to single domains across SCOP. We see that approximately 1/3 of GO and EC annotation applies directly to one family, another 1/3 to a subset of a family, and the remaining 1/3 is scattered across multiple superfamilies, with strikingly few terms that apply at the superfamily level. One would expect the terms in the sub-family segment to be lower down the GO hierarchy and those spanning multiple superfamilies to be broader terms found higher up the hierarchy, but the distribution across the GO hierarchy is quite similar in each of the three major segments of the pie chart shown in Figure 5. This distribution does not change significantly when looking at each of the three ontologies of GO (molecular function, cellular component, biological process) separately. A more detailed view is shown in Table S1 in the additional files.
Figure 5. Level in SCOP of all single domain proteins associated with a specific GO term. The figure shows the level in SCOP at which all single domains associated with a particular GO term are found, i.e. whether the group represents a family or superfamily. These are ...
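The assignment of a term to a level can be illustrated as follows. This is a sketch under assumptions: a term is said to apply at the family (or superfamily) level when the set of domains carrying it is exactly the set of members of that family (or superfamily) in the mapping supplied, and the category labels, function name and toy data are hypothetical. The set of annotated domains is assumed to be non-empty.

```python
from typing import Dict, Set, Tuple


def term_level(term_domains: Set[str],
               scop: Dict[str, Tuple[str, str]]) -> str:
    """Classify the SCOP level spanned by the domains annotated with one
    term. `scop` maps each domain to its (superfamily, family) pair."""
    superfamilies = {scop[d][0] for d in term_domains}
    families = {scop[d] for d in term_domains}
    if len(superfamilies) > 1:
        return "multiple superfamilies"
    if len(families) > 1:
        sf = next(iter(superfamilies))
        sf_members = {d for d, (s, _) in scop.items() if s == sf}
        return "superfamily" if term_domains == sf_members else "multiple families"
    fam = next(iter(families))
    fam_members = {d for d, sf_fam in scop.items() if sf_fam == fam}
    return "family" if term_domains == fam_members else "sub-family"


# Toy classification: two superfamilies, three families.
scop = {
    "d1": ("sf1", "f1"), "d2": ("sf1", "f1"),
    "d3": ("sf1", "f2"),
    "d4": ("sf2", "f3"),
}

levels = [term_level(s, scop) for s in
          ({"d1"}, {"d1", "d2"}, {"d1", "d3"}, {"d1", "d2", "d3"}, {"d1", "d4"})]
print(levels)
```

Tallying these labels over all terms would give the segments of a pie chart of the kind described in the text.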
Despite the weak link between the SCOP family classification and the edges of trees representing functional data, we see a very large proportion of functional terms corresponding to exactly one family, and almost none close to the superfamily level. This suggests that the relationships between members of a superfamily, and their distances apart, are evolutionary, having been based on evidence from structure and sequence (not function), but that the granularity at which to divide the members of a superfamily is decided by function. That is, domains are not grouped based on their function, but the number of groups relates to the number of functions.