


Bioinformation. 2010; 4(8): 347–351.

Published online 2010 February 28.

PMCID: PMC2951670

Department of General Chemistry, Pavia University, viale Taramelli 12, I-27100 Pavia, Italy and Department of Biomolecular Structural Chemistry, MFPL ‐ Vienna University, Campus Vienna Biocenter 5, A-1030 Vienna, Austria

Received 2009 June 20; Accepted 2009 July 23.

Copyright © 2010 Biomedical Informatics Publishing Group

This is an open-access article, which permits unrestricted use, distribution, and reproduction in any medium,
for non-commercial purposes, provided the original author and source are credited.

Several non-redundant ensembles of protein three-dimensional structures were analyzed in order to estimate their natural clustering tendency by means of the Cox-Lewis coefficient. It was observed that, although proteins tend to aggregate into different and well separated groups, some overlap between different clusters occurs. This suggests that classifications based only on structural data cannot allow a systematic classification of proteins. In particular, additional information is needed in order to account fully for the complex evolutionary relationships between proteins.

During the last few years, the common paradigm that protein folds tend to be mutually exclusive and to cluster into well separated groups has come under criticism. The term “gregariousness” was used to indicate the number of close neighbors of each fold, and this concept was applied to examine whether the fold space is a continuum, where existing motifs are used to enlarge old folds and create new types of structures [1]. High levels of gregariousness were observed when different folds contain the same motif [1]. The protein structure space was also analyzed by Kim and co-workers, using several representative data sets and several computational approaches [2,3]. It was observed that protein structures can be discriminated essentially by three features: the prevalence of residues that adopt an α secondary structure, the prevalence of residues that adopt a β secondary structure, and the presence of α-β-α motifs (two flanked parallel β strands separated by an antiparallel α helix) [2,3]. However, significant overlaps were observed amongst different structural classes. Folds in the overlap regions contain features of both classes, and this was considered the main reason why structure-based function predictions are not very efficient [2]. A detailed review on the nature of the protein fold space showed both advantages and disadvantages of considering it a continuous and multidimensional object rather than an ensemble of discrete categories [4].

In the present paper, a robust statistical approach is adapted to the problem of estimating the degree of clustering within the fold space. The term “clustering tendency” refers to the problem of deciding whether the subjects have an intrinsic predisposition to cluster into distinct groups or are randomly arranged. This is also referred to as the spatial randomness problem: while intrinsically aggregated subjects are characterized by mutual attraction, randomly arranged subjects show mutual repulsion [5]. The clustering tendency was estimated with the Cox-Lewis coefficient [5,6] on different data sets, and it was observed that protein folds are partially overlapping.

Four types of data sets were selected. Care was taken to avoid redundancies between the data, since a purely random selection of a set of protein structures might produce completely biased results. The clustering tendency measured in a data set that contains several proteins nearly identical to each other would be considerably overestimated, since many experimental points would be extremely close to each other.

A representative example of each protein fold was taken from the Scop database of protein domain structures [7]. Only the four most populated classes (α, β, α/β, and α+β) were considered. Entries containing “unobserved” residues were disregarded. A total of 624 files were retained.

Ten subsets of the *Scop_fold* data set, each containing 62 structures, were randomly built (X = 1 to 10). They do not overlap with each other.

Protein chains were taken from the Pisces database [8]. Their crystal structures were determined at a resolution not worse than 2 Å and the maximal sequence identity between any two of them is 25%. Entries containing residues “unobserved” in the electron density maps were removed. A total of 2237 structures were retained.

Ten non-overlapping subsets of the *Pisces* data set, each containing 223 structures, were randomly built (X = 1 to 10).
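The random construction of non-overlapping subsets can be sketched as follows. This is only a minimal illustration, assuming a shuffle-and-slice strategy; the function name `disjoint_subsets`, the seed, and the placeholder identifiers are hypothetical and not taken from the original work.

```python
import random


def disjoint_subsets(items, n_subsets, subset_size, seed=0):
    """Randomly draw n_subsets non-overlapping subsets of subset_size items.

    Assumption: a single shuffle followed by slicing; the original
    procedure is not described in more detail.
    """
    if n_subsets * subset_size > len(items):
        raise ValueError("not enough items for the requested subsets")
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    # Consecutive slices of a shuffled list cannot share any element.
    return [
        shuffled[i * subset_size:(i + 1) * subset_size]
        for i in range(n_subsets)
    ]


# Placeholder identifiers standing in for the 624 retained Scop entries.
scop_ids = [f"scop_{i}" for i in range(624)]
subsets = disjoint_subsets(scop_ids, n_subsets=10, subset_size=62)
```

With 624 entries and ten subsets of 62 structures, four entries are left unused; the same call with 2237 entries and a subset size of 223 leaves seven unused.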

A very wide variety of techniques have been used to compare pairs of protein structures [9–11]. In the present manuscript, we used a technique that allows one to represent a structure as a geometrical point in an n-dimensional space and to select a random (geometrical) point in the fold space (this is necessary to evaluate the clustering tendency; see below). This task cannot be accomplished, in general, by using protein structure comparison techniques, since the similarity scores are almost never metrics in the mathematical sense. An exception is the program GI, in which the protein topology is always described by 30 numbers (its number of residues and its Gauss integrals), independently of the protein dimension [12]. Therefore, each protein, either large or small, is associated with a point in a space defined by 30 variables. The distance between two protein structures can be measured as the Euclidean distance between two points in this space. A further advantage of this method of protein structure comparison is its speed (thousands of comparisons can be made in a few minutes).
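As a minimal sketch of the distance described above, the following computes the Euclidean distance between two 30-component descriptor vectors. The vectors are placeholders, not real GI output, and the function name `structure_distance` is hypothetical.

```python
import math


def structure_distance(a, b):
    """Euclidean distance between two descriptor vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("descriptor vectors must have the same length")
    return math.dist(a, b)


# Placeholder 30-dimensional descriptors (not real GI values).
protein_a = [0.1 * i for i in range(30)]
protein_b = [0.1 * i + 0.5 for i in range(30)]
d_ab = structure_distance(protein_a, protein_b)
```

Because the Euclidean distance is a true metric, it is symmetric and obeys the triangle inequality, which is what makes the selection of random points in the same space meaningful.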

The Cox-Lewis coefficient is defined in the following way [5,6]. Given *m* proteins, each characterized by 30 variables, *k* < *m* geometrical points are randomly selected in the 30-dimensional space. The smallest distance *u_i* between the i

Computations were performed as schematized in Figure 2. The output file of program **GI**, *f_gi.out*, contains the 30 variables necessary to describe each protein. If there are *np* proteins, *f_gi.out* is a table of *np* lines and 30 columns. The program **p_quartili** determines, for each of the 30 variables, the first and the last quartile, which are written to the file *f_quartili.out*. Together with the file *f_random.seed*, which contains a randomly selected integer number, the file *f_quartili.out* is read by the program **p_random**, which generates *k* random numbers ranging from the first to the last quartile of each of the 30 variables that represent a protein structure. The value of *k* was defined as *np*/10, where *np* is the number of proteins described in the file *f_gi.out*. It is essential to select random numbers within the two quartiles in order to avoid insidious problems at the boundaries of the protein fold space. Finally, the files *f_gi.out* and *f_random.out* are read by the program **p_cox_lewis**, which computes the Cox-Lewis coefficient of clustering tendency. For each data set, 500 different values of the seed (contained in the file *f_random.seed*) were randomly generated, and the resulting 500 values of the Cox-Lewis coefficient were averaged.
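The workflow above can be sketched in a few functions. This is not the authors' code: the function names are hypothetical, the quartile bounds are taken to be the first and third quartiles, the seeds are simply 0, 1, 2, ... rather than randomly drawn, and, since the exact formula is not reproduced in this text, the coefficient is assumed to be the mean ratio between the distance from each random point to its nearest protein and the distance from that protein to its own nearest neighbour.

```python
import math
import random
import statistics


def quartile_bounds(table):
    """First and third quartile of each column (the role of p_quartili)."""
    bounds = []
    for column in zip(*table):
        q1, _, q3 = statistics.quantiles(column, n=4)
        bounds.append((q1, q3))
    return bounds


def random_points(bounds, k, rng):
    """k random points drawn between the quartile bounds (role of p_random)."""
    return [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(k)]


def cox_lewis(proteins, origins):
    """Assumed form of the coefficient: mean of u_i / w_i.

    u_i: distance from random origin i to its nearest protein.
    w_i: distance from that protein to its nearest other protein.
    Assumes no two proteins are identical (otherwise w_i would be zero).
    """
    ratios = []
    for y in origins:
        j = min(range(len(proteins)), key=lambda i: math.dist(y, proteins[i]))
        u = math.dist(y, proteins[j])
        w = min(
            math.dist(proteins[j], p)
            for i, p in enumerate(proteins)
            if i != j
        )
        ratios.append(u / w)
    return sum(ratios) / len(ratios)


def mean_cox_lewis(proteins, n_seeds=500):
    """Average the coefficient over n_seeds random draws, with k = np / 10."""
    bounds = quartile_bounds(proteins)
    k = max(1, len(proteins) // 10)
    values = []
    for seed in range(n_seeds):  # assumption: seeds 0..n_seeds-1
        rng = random.Random(seed)
        values.append(cox_lewis(proteins, random_points(bounds, k, rng)))
    return statistics.mean(values)


if __name__ == "__main__":
    # Placeholder data: 100 random 30-dimensional points, not real GI output.
    demo_rng = random.Random(42)
    demo = [[demo_rng.random() for _ in range(30)] for _ in range(100)]
    print(mean_cox_lewis(demo, n_seeds=20))
```

Under this assumed definition, values well above one arise when random points tend to lie far from the data while the data points themselves lie close to each other, which is the signature of clustering discussed in the text.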

Table 1 (see supplementary material) shows the minimal, maximal, and average values (and the standard deviation of the mean) of the Cox-Lewis coefficient computed on 22 different data sets. The average values vary among data sets. However, they are always significantly larger than one, the value that would indicate that protein fold structures are uniformly distributed. The first conclusion is therefore that protein structures tend to cluster into separate groups. Notably, similar results were also obtained by using the Hopkins coefficient, another measure of clustering tendency.

A second observation is interesting. The Cox-Lewis coefficients were computed on two types of data sets. Eleven of them (Scop and Scop/X) were based on the Scop database of protein domain structures, where the redundancy was reduced essentially on the basis of structural features [7]. In the other eleven (Pisces and Pisces/X), the redundancy was reduced only on the basis of the amino acid sequences [8]. Although these approaches differ (structure- and sequence-based redundancy reduction), the Cox-Lewis coefficients are nearly the same. The fact that smaller values tend to be observed for the Pisces data sets is likely due to the fact that these data sets contain entire protein chains, which sometimes comprise more than a single domain and sometimes participate in permanent oligomeric assemblies. Consequently, and this is the second conclusion, coefficients close to 1.3-1.4 are likely to be rather reliable estimates of the clustering tendency of protein structures.

Are these values really high? To answer this question, one would need to “see” the 30-dimensional fold space defined by the GI approach [12]. This is impossible, since human perception is limited to two or three dimensions. In principle, reductions of dimensionality are possible, for example by using principal component analysis [13]. However, this is not useful in our case, since the first three principal components, out of the original 30 variables, account for an insufficient fraction of the overall variance (less than 70%). A visualization in a reduced 3-dimensional space would be useless, since about one third of the overall variability amongst structures would be ignored.

The only way to obtain a visual assessment is thus through simulations, the simplest of which are in two dimensions. Four sets of two-dimensional data, each containing 1000 points, were generated by using a pseudo-random number generator. The first contained 1000 points within a distance *R* of the point (-1, 0) (arbitrary units, a.u.); the second was centred on the point (1, 0); the third on the point (0, -1); and the fourth on the point (0, 1). By increasing *R*, it is possible to reduce arbitrarily the clustering tendency of the entire data set of 4,000 points. It is thus possible to see, with a simple plot, how the clustering tendency changes and to relate it to the Cox-Lewis coefficient. This is shown in Figure 3. For small *R* values, the data are clearly segregated into four clusters and the Cox-Lewis coefficient is high (6.93). For large values of *R*, the four clusters are completely superimposed on each other and the Cox-Lewis coefficient decreases to 1.04, approaching the value of one expected for uniformly distributed data. If the Cox-Lewis coefficient is close to 1.3-1.5 (values close to those shown in Table 1, supplementary material), the four clusters are partially superimposed: about 6-7% of the members of a cluster tend to invade a neighbouring group.
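The two-dimensional simulation can be sketched as below. The text only states that each set lies within a distance *R* of its centre, so sampling uniformly over the disk area is an assumption, as are the function names. The overlap measure shown here (the fraction of points that lie nearer to another centre than to their own) is one hypothetical way to quantify the invasion of a neighbouring group and is not necessarily the measure behind the 6-7% figure.

```python
import math
import random

# Cluster centres given in the text (arbitrary units).
CENTERS = [(-1.0, 0.0), (1.0, 0.0), (0.0, -1.0), (0.0, 1.0)]


def disk_points(center, radius, n, rng):
    """n points uniformly distributed (in area) within a disk."""
    cx, cy = center
    points = []
    for _ in range(n):
        # The square root keeps the density uniform over the disk area.
        r = radius * math.sqrt(rng.random())
        theta = 2.0 * math.pi * rng.random()
        points.append((cx + r * math.cos(theta), cy + r * math.sin(theta)))
    return points


def four_clusters(radius, n_per_cluster=1000, seed=0):
    """One list of points per centre, all generated with the same radius."""
    rng = random.Random(seed)
    return [disk_points(c, radius, n_per_cluster, rng) for c in CENTERS]


def fraction_nearer_other_center(clusters):
    """Fraction of points lying nearer to another centre than to their own."""
    stray = 0
    total = 0
    for own, points in enumerate(clusters):
        for p in points:
            nearest = min(
                range(len(CENTERS)), key=lambda i: math.dist(p, CENTERS[i])
            )
            stray += nearest != own
            total += 1
    return stray / total


if __name__ == "__main__":
    for radius in (0.5, 1.0, 2.0, 4.0):
        clusters = four_clusters(radius)
        print(radius, fraction_nearer_other_center(clusters))
```

Adjacent centres are separated by √2, so no point can be nearer to a foreign centre while *R* stays below about 0.71; beyond that radius the disks start to interpenetrate and the fraction grows with *R*.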

This implies that protein fold structures have a natural tendency to aggregate into different groups and that, as a consequence, a structure of a certain type is infrequently observed in a cluster that groups other structure types. However, some overlap between different clusters is possible, though seldom observed, with the consequence that false positives or negatives cannot be completely avoided in the various structure-based prediction methods that have been designed. As expected and as already observed [2,3], the superposition between different types of protein 3D structure clusters tends to occur for the cases that are known to be relatively similar. Typically, this mixing is observed for structures that are essentially α or β, on the one hand, and α+β, on the other, according to the Scop classification. Obviously, this does not mean that the classification adopted in the Scop database is useless or inappropriate. It only means that a description of the fold space based only on structural features cannot produce well isolated islands.

This work was partially funded by the BIN-II and BIN-III programs of the Austrian GEN-AU.

**Citation:** **Carugo** *et al*, Bioinformation 4(8): 347-351 (2010)

1. Harrison A, et al. J Mol Biol. 2002;323:909.

2. Hou J, et al. Proc Natl Acad Sci U S A. 2005;102:3651.

3. Hou J, et al. Proc Natl Acad Sci U S A. 2003;100:2386.

4. Kolodny R, et al. Curr Opin Struct Biol. 2006;16:393.

5. Theodoridis S, Koutroumbas K. Pattern Recognition. Second edn. San Diego, U.S.A: Academic Press; 2003.

6. Jain AK, Dubes RC. Algorithms for Clustering Data. Englewood Cliffs, New Jersey, U.S.A: Prentice Hall; 1988.

7. Murzin AG, et al. J Mol Biol. 1995;247:536.

8. Wang G, Dunbrack RLJ. Bioinformatics. 2003;19:1589.

9. Carugo O. Curr Protein Pept Sci. 2007;8:219.

10. Carugo O. Curr Bioinformatics. 2006;1:75.

11. Carugo O, Pongor S. Curr Protein Pept Sci. 2002;3:441.

12. Rogen P, Fain B. Proc Natl Acad Sci U S A. 2003;100:119.

13. Carugo O. Acta Crystallogr. 1995;B51:314.
