HIV-1 protease plays an important role in the late stage of viral replication by cleaving premature viral polypeptides into peptides that fold into mature viral proteins. The ability of HIV-1 protease to rapidly acquire a variety of mutations in response to various protease inhibitors (PIs) makes the enzyme highly resistant to anti-AIDS treatments. A high degree of cooperativity has been documented among the drug-resistant mutations observed in HIV-1 protease (Ohtaka et al.
). The sequence data retrieved from treated patients is likely to include mutations that reflect cooperative effects originating from late functional constraints, rather than stochastic evolutionary noise (Atchley et al.
). Extensive studies have examined the structure and dynamics of this protein (Cecconi et al.
; Hornak et al.
; Perryman et al.
; Zoete et al.
), although the molecular mechanisms of multi-drug resistance (MDR) are yet to be elucidated.
HIV-1 protease is particularly suitable for covariance analysis because of the large sets of sequences available and the fast rate of mutation observed in response to treatments. Sequence covariance analysis is a method widely used for identifying correlated sites in proteins. Such correlations are usually inferred from the statistical analysis of pairwise amino-acid substitutions among the members of the examined family of proteins. Because correlated substitutions are expected to occur between residue pairs that interact directly in the three-dimensional (3D) structure, sequence covariance analysis, also referred to as correlated mutation analysis (CMA), has long been used for detecting inter-residue contacts within proteins (Eyal et al.
; Gobel et al.
; Olmea et al.
; Shindyalov et al.
; Thomas et al.
). More recently, the same approach proved useful in identifying communication pathways in allosteric proteins (Hatley et al.
; Kass and Horovitz, 2002
; Lockless and Ranganathan, 1999
; Shulman et al.
; Süel et al.
), and in studying drug-induced mutations using clinical data (Hoffman et al.
; Wu et al.).
In general, the CMA procedure consists of three steps: (i) generating a multiple sequence alignment (MSA) from homologous protein sequences; (ii) quantifying the covariance between different columns of the MSA; and (iii) identifying groups of highly covariant positions, also called clustering. The underlying assumption is that co-varying residues reflect essential structural/functional inter-residue couplings.
These techniques have some major limitations. The purpose of the method is to identify inter-residue couplings that are directly relevant to protein structure or function. However, the observed signals may not arise solely from such couplings. In fact, sequence data are known to be noisy. A strong covariance may be detected among columns because of evolutionary signals that originate from early random mutation events. Noivirt et al. have shown that the signal due to inter-residue interactions is comparable in magnitude to the noise caused by other stochastic evolutionary events.
Several metrics have been used to quantify sequence covariance in proteins. A comparative analysis of some commonly used methods can be found in the studies of Fodor and Aldrich (2004
) and Halperin et al. Yet the clustering step has received too little attention to date. This step is important for two reasons. First, although CMA is performed in a pairwise manner (mainly for technical and statistical reasons), larger sets of residues are expected to co-evolve in nature to meet particular structural/functional requirements. Second, the clustering procedure is expected to help distinguish real correlations from background noise. The choice of clustering technique may also depend on the covariance metric adopted. When an asymmetric metric such as the statistical coupling analysis (SCA) introduced by Ranganathan and coworkers (Lockless and Ranganathan, 1999
) is used in step (ii), hierarchical clustering is conveniently applied (Chen et al.
; Hatley et al.
; Shulman et al.
; Süel et al.
). For symmetric metrics such as the Pearson correlation coefficient and mutual information (MI), on the other hand, a common procedure is to perform a principal component analysis (Wold et al.; Fleishman et al.).
We adopt the MI content as a measure of the correlation between residue substitutions (Atchley et al.
; Clarke, 1995
; Hoffman et al.
; Martin et al.
). Accordingly, each of the N columns in the MSA generated for a protein of N residues is considered as a discrete random variable Xi (1 ≤ i ≤ N) that takes on one of the 20 amino-acid types with some probability. The MI (Cover and Thomas, 1991) associated with the random variables Xi and Xj corresponding to the ith and jth columns is defined as

$$I(X_i, X_j) = \sum_{x_i} \sum_{x_j} P(X_i = x_i, X_j = x_j) \log \frac{P(X_i = x_i, X_j = x_j)}{P(X_i = x_i)\, P(X_j = x_j)}$$
Here P(Xi = xi, Xj = xj) is the joint probability of occurrence of amino-acid types xi and xj at the ith and jth positions, respectively, P(Xi = xi) and P(Xj = xj) are the corresponding singlet probabilities. I(Xi, Xj) is the ijth element of the N × N MI matrix I corresponding to the examined MSA.
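As an illustration of how such an MI matrix can be estimated from an alignment, the following is a minimal sketch, not the implementation used in this study. It assumes that probabilities are estimated as plain observed frequencies, that the natural logarithm is used, and that gap characters are treated like any other symbol; the function names `mutual_information` and `mi_matrix` and the toy alignment are hypothetical.

```python
from collections import Counter
from math import log


def mutual_information(col_i, col_j):
    """MI (in nats) between two alignment columns of equal length.

    Assumption: probabilities are raw observed frequencies, with no
    pseudocounts, sequence weighting, or special handling of gaps.
    """
    if len(col_i) != len(col_j):
        raise ValueError("columns must contain the same number of sequences")
    n = len(col_i)
    count_i = Counter(col_i)               # singlet counts at position i
    count_j = Counter(col_j)               # singlet counts at position j
    count_ij = Counter(zip(col_i, col_j))  # joint counts for the pair (i, j)
    mi = 0.0
    for (a, b), c in count_ij.items():
        p_ab = c / n
        # p_ab / (p_a * p_b), written in terms of counts
        mi += p_ab * log(c * n / (count_i[a] * count_j[b]))
    return mi


def mi_matrix(msa):
    """Symmetric N x N MI matrix for an MSA given as equal-length strings."""
    columns = list(zip(*msa))
    n_cols = len(columns)
    matrix = [[0.0] * n_cols for _ in range(n_cols)]
    for i in range(n_cols):
        for j in range(i, n_cols):
            value = mutual_information(columns[i], columns[j])
            matrix[i][j] = matrix[j][i] = value
    return matrix


# Toy alignment (hypothetical): columns 0 and 1 co-vary, column 2 is conserved.
toy_msa = ["ALV", "ALV", "GIV", "GIV"]
I = mi_matrix(toy_msa)
```

In this toy case the first two columns co-vary perfectly, so their MI equals log 2, whereas the fully conserved third column carries no information about the others. The diagonal elements reduce to the entropy of each column.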
In the present study, we introduce the use of spectral partitioning methods for efficient analysis of the MI matrices derived for HIV-1 protease sequences retrieved from the Stanford HIV Drug Resistance database (DB) (http://hivdb.stanford.edu; Rhee et al., 2003). This DB includes sequences obtained from isolates, along with information on the type of PIs given to the patients (accessible via the ‘Detailed Treatment Queries’ interface of the DB). The goal is to examine sequence covariance and distinguish between correlations of different origin. Spectral clustering was originally proposed for partitioning the nodes of an undirected weighted graph G. The weight wij of each edge eij is defined as a measure of similarity between nodes vi and vj. The weight matrix W is replaced in our work by the MI matrix. Our objective is to partition all the nodes/residues into groups such that the similarity is high among the nodes within a group and low across different groups. This goal is achieved by minimizing the normalized cut (Shi and Malik, 2000
) between groups (see Materials and Methods).
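To make the partitioning idea concrete, the following is a minimal sketch of a single two-way split based on the relaxed normalized-cut criterion. It is an illustration under stated assumptions rather than the procedure described in Materials and Methods: it assumes NumPy is available, that the weight matrix is symmetric with non-negative entries and a zero diagonal, and that the second eigenvector is thresholded at zero (one simple choice among several). The function names and the toy matrix are hypothetical.

```python
import numpy as np


def normalized_cut_bipartition(W):
    """Split nodes into two groups via the relaxed normalized-cut problem.

    The relaxation solves (D - W) y = lambda * D y, where D is the diagonal
    degree matrix, and uses the eigenvector of the second-smallest eigenvalue.
    Assumption: W is symmetric and every node has positive total weight.
    """
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)
    if np.any(d <= 0):
        raise ValueError("every node needs a positive total weight")
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}
    laplacian = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(laplacian)  # eigenvalues in ascending order
    # Map the second eigenvector back to the generalized problem.
    y = d_inv_sqrt * eigvecs[:, 1]
    return y >= 0  # boolean group labels (thresholding at 0 is an assumption)


def ncut_value(W, labels):
    """Normalized cut: cut(A,B)/assoc(A,V) + cut(A,B)/assoc(B,V)."""
    W = np.asarray(W, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    cut = W[labels][:, ~labels].sum()
    return cut / W[labels].sum() + cut / W[~labels].sum()


# Toy weight matrix (hypothetical): nodes {0, 1} and {2, 3} are tightly
# coupled within each pair and only weakly coupled across pairs.
W_toy = [
    [0.00, 1.00, 0.01, 0.01],
    [1.00, 0.00, 0.01, 0.01],
    [0.01, 0.01, 0.00, 1.00],
    [0.01, 0.01, 1.00, 0.00],
]
labels = normalized_cut_bipartition(W_toy)
score = ncut_value(W_toy, labels)
```

For this toy matrix the split separates nodes 0 and 1 from nodes 2 and 3; which group receives the label True depends on the arbitrary sign of the eigenvector. Partitioning into more than two groups could, for example, apply the same split recursively to each group.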
We show that the method successfully identifies the residues cooperatively involved in MDR, as well as the mutational patterns arising from different drug treatments. The results suggest that spectral partitioning of the covariance data can help in detecting cooperative functional relations and discriminating to a certain degree between the covariance patterns originating from functional constraints and those associated with neutral/stochastic mutation events that occur early in the evolution of the species/family.