
Results 1-14 (14)

1.  CLAP: A web-server for automatic classification of proteins with special reference to multi-domain proteins 
BMC Bioinformatics  2014;15(1):343.
The function of a protein can be deciphered with higher accuracy from its structure than from its amino acid sequence. Because of the huge gap between the available protein sequence space and structural space, tools that can generate functionally homogeneous clusters using only sequence information are of great importance. Traditional alignment-based tools work well for this in most cases, with clustering performed on the basis of sequence similarity. In the case of multi-domain proteins, however, the alignment quality might be poor due to varied lengths of the proteins, domain shuffling or circular permutations. Multi-domain proteins are ubiquitous in nature, hence alignment-free tools, which overcome the shortcomings of alignment-based protein comparison methods, are required. Further, existing tools classify proteins using only domain-level information and hence miss the information encoded in the tethered regions or accessory domains. Our method, on the other hand, takes into account the full-length sequence of a protein, consolidating the complete sequence information to understand a given protein better.
Our web-server, CLAP (Classification of Proteins), is one such alignment-free software for automatic classification of protein sequences. It utilizes a pattern-matching algorithm that assigns local matching scores (LMS) to residues that are a part of the matched patterns between two sequences being compared. CLAP works on full-length sequences and does not require prior domain definitions.
Pilot studies undertaken previously on protein kinases and immunoglobulins have shown that CLAP yields clusters with high functional and domain-architectural similarity. Moreover, parsing at a statistically determined cut-off resulted in clusters that agreed with the sub-family level classification of that particular domain family.
CLAP is a useful protein-clustering tool, independent of domain assignment, domain order, sequence length and domain diversity. Our method can be used for any set of protein sequences, yielding functionally relevant clusters with high domain architectural homogeneity. The CLAP web server is freely available for academic use at
PMCID: PMC4287353  PMID: 25282152
Alignment-free comparison; Domain architectures; Multi-domain proteins; Protein classification
2.  Arbitrary protein−protein docking targets biologically relevant interfaces 
BMC Biophysics  2012;5:7.
Protein-protein recognition is of fundamental importance in the vast majority of biological processes. However, it has already been demonstrated that it is very hard to distinguish true complexes from false complexes in so-called cross-docking experiments, where binary protein complexes are separated and the isolated proteins are all docked against each other and scored. Does this result, at least in part, reflect a physical reality? False complexes could reflect possible nonspecific or weak associations.
In this paper, we investigate the twilight zone of protein-protein interactions, building on an interesting outcome of cross-docking experiments: false complexes seem to favor residues from the true interaction site, suggesting that randomly chosen partners dock in a non-random fashion on protein surfaces. Here, we carry out arbitrary docking of a non-redundant data set of 198 proteins, with more than 300 randomly chosen "probe" proteins. We investigate the tendency of arbitrary partners to aggregate at localized regions of the protein surfaces, the shape and compositional bias of the generated interfaces, and the potential of this property to predict biologically relevant binding sites. We show that the non-random localization of arbitrary partners after protein-protein docking is a generic feature of protein structures. The interfaces generated in this way are not systematically planar or curved, but tend to be closer than average to the center of the proteins. These results can be used to predict biological interfaces with an AUC value up to 0.69 alone, and 0.72 when used in combination with evolutionary information. An appropriate choice of random partners and number of docking models make this method computationally practical. It is also noted that nonspecific interfaces can point to alternate interaction sites in the case of proteins with multiple interfaces. We illustrate the usefulness of arbitrary docking using PEBP (Phosphatidylethanolamine binding protein), a kinase inhibitor with multiple partners.
An approach using arbitrary docking, and based solely on physical properties, can successfully identify biologically pertinent protein interfaces.
PMCID: PMC3441232  PMID: 22559010
Protein structure; Protein-protein interaction; Docking; Interface prediction
3.  Dissecting protein loops with a statistical scalpel suggests a functional implication of some structural motifs 
BMC Bioinformatics  2011;12:247.
One of the strategies for protein function annotation is to search for particular structural motifs that are known to be shared by proteins with a given function.
Here, we present a systematic extraction of structural motifs of seven residues from protein loops and we explore their correspondence with functional sites. Our approach is based on the structural alphabet HMM-SA (Hidden Markov Model - Structural Alphabet), which allows simplification of protein structures into uni-dimensional sequences, and advanced pattern statistics adapted to short sequences. Structural motifs of interest are selected by looking for structural motifs significantly over-represented in SCOP superfamilies in protein loops. We discovered two types of structural motifs significantly over-represented in SCOP superfamilies: (i) ubiquitous motifs, shared by several superfamilies and (ii) superfamily-specific motifs, over-represented in a few superfamilies. A comparison of ubiquitous words with known small structural motifs shows that they contain well-described motifs such as turn, niche or nest motifs. A comparison between superfamily-specific motifs and biological annotations of Swiss-Prot reveals that some of them actually correspond to functional sites involved in the binding sites of small ligands, such as ATP/GTP, NAD(P) and SAH/SAM.
Our findings show that statistical over-representation in SCOP superfamilies is linked to functional features. The detection of over-represented motifs within structures simplified by HMM-SA is therefore a promising approach for prediction of functional sites and annotation of uncharacterized proteins.
PMCID: PMC3158783  PMID: 21689388
4.  Classification of Protein Kinases on the Basis of Both Kinase and Non-Kinase Regions 
PLoS ONE  2010;5(9):e12460.
Protein phosphorylation is a generic way to regulate signal transduction pathways in all kingdoms of life. In many organisms, it is achieved by the large family of Ser/Thr/Tyr protein kinases which are traditionally classified into groups and subfamilies on the basis of the amino acid sequence of their catalytic domains. Many protein kinases are multi-domain in nature but the diversity of the accessory domains and their organization are usually not taken into account while classifying kinases into groups or subfamilies.
Here, we present an approach which considers amino acid sequences of complete gene products, in order to suggest refinements in sets of pre-classified sequences. The strategy is based on alignment-free similarity scores and iterative Area Under the Curve (AUC) computation. Similarity scores are computed by detecting common patterns between two sequences and scoring them using a substitution matrix, with a consistent normalization scheme. This allows us to handle full-length sequences, and implicitly takes into account domain diversity and domain shuffling. We quantitatively validate our approach on a subset of 212 human protein kinases. We then employ it on the complete repertoire of human protein kinases and suggest a few qualitative refinements in the subfamily assignment stored in the KinG database, which is based on catalytic domains only. Based on our new measure, we delineate 37 cases of potential hybrid kinases: sequences for which classical classification based entirely on catalytic domains is inconsistent with the full-length similarity scores computed here, which implicitly consider multi-domain nature and regions outside the catalytic kinase domain. We also provide some examples of hybrid kinases of the protozoan parasite Entamoeba histolytica.
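As an illustration of the AUC step only, the following minimal sketch measures how well a set of pairwise similarity scores separates sequence pairs from the same subfamily from pairs in different subfamilies. The function name `auc` and the toy scores are assumptions for illustration; the abstract does not specify how the scores are normalized or how the computation is iterated, so neither is reproduced here.

```python
def auc(scores_same, scores_diff):
    """Area under the ROC curve from two lists of similarity scores.

    Computed as the probability that a randomly chosen same-subfamily pair
    scores higher than a randomly chosen different-subfamily pair, with
    ties counted as one half (the Mann-Whitney formulation of the AUC).
    """
    if not scores_same or not scores_diff:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for s in scores_same:
        for d in scores_diff:
            if s > d:
                wins += 1.0
            elif s == d:
                wins += 0.5
    return wins / (len(scores_same) * len(scores_diff))


# Hypothetical toy scores: three same-subfamily pairs, two cross-subfamily pairs.
example = auc([0.9, 0.8, 0.4], [0.5, 0.3])
```

An AUC of 1.0 would mean every same-subfamily pair outscores every cross-subfamily pair, while 0.5 corresponds to no separation.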
The implicit consideration of multi-domain architectures is a valuable inclusion to complement other classification schemes. The proposed algorithm may also be employed to classify other families of enzymes with multi-domain architecture.
PMCID: PMC2939887  PMID: 20856812
5.  Beauty Is in the Eye of the Beholder: Proteins Can Recognize Binding Sites of Homologous Proteins in More than One Way 
PLoS Computational Biology  2010;6(6):e1000821.
Understanding the mechanisms of protein–protein interaction is a fundamental problem with many practical applications. The fact that different proteins can bind similar partners suggests that convergently evolved binding interfaces are reused in different complexes. A set of protein complexes composed of non-homologous domains interacting with homologous partners at equivalent binding sites was collected in 2006, offering an opportunity to investigate this point. We considered 433 pairs of protein–protein complexes from the ABAC database (AB and AC binary protein complexes sharing a homologous partner A) and analyzed the extent of physico-chemical similarity at the atomic and residue level at the protein–protein interface. Homologous partners of the complexes were superimposed using Multiprot, and similar atoms at the interface were quantified using a five class grouping scheme and a distance cut-off. We found that the number of interfacial atoms with similar properties is systematically lower in the non-homologous proteins than in the homologous ones. We assessed the significance of the similarity by bootstrapping the atomic properties at the interfaces. We found that the similarity of binding sites is very significant between homologous proteins, as expected, but generally insignificant between the non-homologous proteins that bind to homologous partners. Furthermore, evolutionarily conserved residues are not colocalized within the binding sites of non-homologous proteins. We could only identify a limited number of cases of structural mimicry at the interface, suggesting that this property is less generic than previously thought. Our results support the hypothesis that different proteins can interact with similar partners using alternate strategies, but do not support convergent evolution.
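The significance assessment described above can be sketched, under strong simplifying assumptions, as a resampling test on atom classes. In this hypothetical setup each interface is reduced to a list of class labels (one of five classes per matched position), similarity is the number of positions with equal classes, and the null distribution is obtained by resampling one interface's labels with replacement. The function names and the data layout are assumptions; the authors' actual bootstrap procedure may differ.

```python
import random


def count_matches(classes_a, classes_b):
    """Number of matched positions whose atom classes are identical."""
    return sum(a == b for a, b in zip(classes_a, classes_b))


def resampling_p_value(classes_a, classes_b, n_resamples=1000, seed=0):
    """Empirical p-value for the observed number of matching positions.

    The labels of interface B are resampled with replacement to build a
    null distribution of match counts; the p-value is the fraction of
    resamples that match at least as well as the observed data
    (with the usual +1 correction so that it is never exactly zero).
    """
    if len(classes_a) != len(classes_b) or not classes_a:
        raise ValueError("interfaces must be non-empty and of equal length")
    rng = random.Random(seed)
    observed = count_matches(classes_a, classes_b)
    at_least = 0
    for _ in range(n_resamples):
        resampled = rng.choices(classes_b, k=len(classes_b))
        if count_matches(classes_a, resampled) >= observed:
            at_least += 1
    return (at_least + 1) / (n_resamples + 1)


# Hypothetical interface of 50 positions using five atom classes (0..4).
interface = [0, 1, 2, 3, 4] * 10
p_identical = resampling_p_value(interface, interface)
```

A small p-value indicates that the observed similarity is unlikely to arise from the class composition alone.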
Author Summary
Interaction between proteins is a fundamental process, generic to most biological pathways. The increasing number of protein–protein complexes with atomic data should help us to understand the major factors that guide protein interactions. In particular, a number of examples are available of similar proteins that interact with proteins that are very different in terms of structure and function. An intuitive hypothesis to explain the ability of these different proteins to recognize the same partner is that they display the same local region for interaction, in other words, they imitate the same binding site. Here, we quantify the similarity between these putatively mimicking binding sites. We show that it is not statistically significant. We confirm this observation on the small sets of evolutionarily conserved residues. Our results suggest that different proteins that bind the same protein do not imitate binding sites, but probably target specific locations or residues at the binding site.
PMCID: PMC2887470  PMID: 20585553
6.  Mining protein loops using a structural alphabet and statistical exceptionality 
BMC Bioinformatics  2010;11:75.
Protein loops encompass 50% of protein residues in available three-dimensional structures. These regions are often involved in protein functions, e.g. binding sites and catalytic pockets. However, describing protein loops with conventional tools is a difficult task. Regular secondary structures, helices and strands, have been widely studied whereas loops, because they are highly variable in terms of sequence and structure, are difficult to analyze. Due to data sparsity, long loops have rarely been systematically studied.
We developed a simple and accurate method that allows the description and analysis of the structures of short and long loops using structural motifs without restriction on loop length. This method is based on the structural alphabet HMM-SA. HMM-SA allows the simplification of a three-dimensional protein structure into a one-dimensional string of states, where each state is a four-residue prototype fragment, called a structural letter. The difficult task of the structural grouping of huge data sets is thus easily accomplished by handling structural letter strings as in conventional protein sequence analysis. We systematically extracted all seven-residue fragments in a bank of 93000 protein loops and grouped them according to the structural-letter sequence, named a structural word. This approach permits a systematic analysis of loops of all sizes since we consider the structural motifs of seven residues rather than complete loops. We focused the analysis on highly recurrent words of loops (observed more than 30 times). Our study reveals that 73% of loop lengths are covered by only 3310 highly recurrent structural words (out of 28274 observed words). These structural words have low structural variability (mean RMSd of 0.85 Å). As expected, half of these motifs display a flanking-region preference but interestingly, two thirds are shared by short (less than 12 residues) and long loops. Moreover, half of recurrent motifs exhibit a significant level of amino-acid conservation with at least four significant positions and 87% of long loops contain at least one such word. We complement our analysis with the detection of statistically over-represented patterns of structural letters as in conventional DNA sequence analysis. About 30% (930) of structural words are over-represented, and cover about 40% of loop lengths. Interestingly, these words exhibit lower structural variability and higher sequential specificity, suggesting structural or functional constraints.
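The word-extraction step can be sketched as plain string processing. The sketch assumes that each loop has already been encoded as a string with one character per structural letter (the encoding itself, done by HMM-SA, is not reproduced), and that a seven-residue fragment corresponds to four consecutive letters, since four overlapping four-residue fragments shifted by one residue span seven residues. The function names and the toy loops are hypothetical.

```python
from collections import Counter


def structural_words(letters, word_len=4):
    """All overlapping words of `word_len` structural letters in one loop."""
    return [letters[i:i + word_len] for i in range(len(letters) - word_len + 1)]


def recurrent_words(loops, min_count, word_len=4):
    """Words observed more than `min_count` times across a bank of loops."""
    counts = Counter(
        word for loop in loops for word in structural_words(loop, word_len)
    )
    return {word: n for word, n in counts.items() if n > min_count}


# Hypothetical toy bank of two encoded loops; the abstract uses a threshold of 30.
toy_bank = ["ABCDE", "ABCDX"]
frequent = recurrent_words(toy_bank, min_count=1)
```

Because words are counted rather than whole loops, loops shorter than one word simply contribute nothing, and long loops contribute one word per position.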
We developed a method to systematically decompose and study protein loops using recurrent structural motifs. This method is based on the structural alphabet HMM-SA and not on structural alignment and geometrical parameters. We extracted meaningful structural motifs that are found in both short and long loops. To our knowledge, this is the first time that pattern mining has helped to increase the signal-to-noise ratio in protein loops. This finding helps to better describe protein loops and might help reduce the complexity of long-loop analysis. Detailed results are available at
PMCID: PMC2833150  PMID: 20132552
7.  Exact distribution of a pattern in a set of random sequences generated by a Markov source: applications to biological data 
In bioinformatics it is common to search for a pattern of interest in a potentially large set of rather short sequences (upstream gene regions, proteins, exons, etc.). Although many methodological approaches allow practitioners to compute the distribution of a pattern count in a random sequence generated by a Markov source, no specific developments have taken into account the counting of occurrences in a set of independent sequences. We aim to address this problem by deriving efficient approaches and algorithms to perform these computations both for low and high complexity patterns in the framework of homogeneous or heterogeneous Markov models.
The latest advances in the field allowed us to use a technique of optimal Markov chain embedding based on deterministic finite automata to introduce three innovative algorithms. Algorithm 1 is the only one able to deal with heterogeneous models. It also avoids any convolution product of the pattern distributions in individual sequences. When working with homogeneous models, Algorithm 2 yields a dramatic reduction in the complexity by taking advantage of previous computations to obtain moment generating functions efficiently. In the particular case of low or moderate complexity patterns, Algorithm 3 exploits power computation and binary decomposition to further reduce the time complexity to a logarithmic scale. All these algorithms and their relative interest in comparison with existing ones were then tested and discussed on a toy example and three biological data sets: structural patterns in protein loop structures, PROSITE signatures in a bacterial proteome, and transcription factors in upstream gene regions. On these data sets, we also compared our exact approaches to the tempting approximation that consists in concatenating the sequences in the data set into a single sequence.
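To make the problem concrete, here is a minimal sketch of the straightforward baseline rather than of Algorithms 1 to 3: the exact count distribution of a single word is computed per sequence by embedding a pattern-matching automaton in a homogeneous first-order Markov chain, and the per-sequence distributions are then convolved. All function names, the single-word pattern and the toy model are assumptions for illustration.

```python
from collections import defaultdict
from functools import reduce


def build_dfa(word, alphabet):
    """DFA whose state is the length of the longest prefix of `word`
    that is a suffix of the text read so far."""
    m = len(word)
    dfa = []
    for k in range(m + 1):
        row = {}
        for a in alphabet:
            s = word[:k] + a
            j = min(m, len(s))
            while j > 0 and s[-j:] != word[:j]:
                j -= 1
            row[a] = j
        dfa.append(row)
    return dfa


def count_distribution(word, n, init, trans):
    """Exact distribution of overlapping occurrences of `word` in a
    sequence of length n >= 1 (init: first-letter probabilities,
    trans[a][b]: probability of letter b after letter a)."""
    alphabet = sorted(init)
    dfa, m = build_dfa(word, alphabet), len(word)
    cur = defaultdict(float)  # (last letter, DFA state, count) -> probability
    for a in alphabet:
        q = dfa[0][a]
        cur[(a, q, int(q == m))] += init[a]
    for _ in range(n - 1):
        nxt = defaultdict(float)
        for (a, q, c), p in cur.items():
            for b in alphabet:
                q2 = dfa[q][b]
                nxt[(b, q2, c + int(q2 == m))] += p * trans[a][b]
        cur = nxt
    dist = defaultdict(float)
    for (_, _, c), p in cur.items():
        dist[c] += p
    return dict(dist)


def convolve(d1, d2):
    """Distribution of the sum of two independent counts."""
    out = defaultdict(float)
    for c1, p1 in d1.items():
        for c2, p2 in d2.items():
            out[c1 + c2] += p1 * p2
    return dict(out)


def total_count_distribution(word, lengths, init, trans):
    """Count distribution over a set of independent sequences."""
    dists = (count_distribution(word, n, init, trans) for n in lengths)
    return reduce(convolve, dists, {0: 1.0})


# Hypothetical two-letter model with uniform, independent letters.
uniform = {"A": 0.5, "B": 0.5}
model = {"A": uniform, "B": uniform}
total = total_count_distribution("AA", [2, 2], uniform, model)
```

Each sequence starts from the initial distribution, so occurrences never straddle two sequences, which is the edge effect that concatenating the sequences into one would introduce.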
Our algorithms prove to be effective and able to handle real data sets with multiple sequences, as well as biological patterns of interest, even when the latter display a high complexity (PROSITE signatures for example). In addition, these exact algorithms allow us to avoid the edge effect observed under the single sequence approximation, which leads to erroneous results, especially when the marginal distribution of the model displays a slow convergence toward the stationary distribution. We conclude with a discussion of our method and its potential improvements.
PMCID: PMC2828453  PMID: 20205909
8.  Comparative kinomics of human and chimpanzee reveal unique kinship and functional diversity generated by new domain combinations 
BMC Genomics  2008;9:625.
Phosphorylation by protein kinases is a common event in many cellular processes. Further, many kinases perform specialized roles and are regulated by non-kinase domains tethered to the kinase domain. Perturbation in the regulation of kinases leads to malignancy. We have identified and analysed putative protein kinases encoded in the genome of the chimpanzee, which is a close evolutionary relative of human.
The shared core biology between chimpanzee and human is characterized by many orthologous protein kinases which are involved in conserved pathways. Domain architectures specific to chimp/human kinases have been observed. Chimp kinases with unique domain architectures are characterized by deletion of one or more non-kinase domains in the human kinases. Interestingly, counterparts of some of the multi-domain human kinases in chimp are characterized by identical domain architectures but with a kinase-like non-kinase domain. Remarkably, out of 587 chimpanzee kinases, no human orthologue with greater than 95% sequence identity could be identified for 160 kinases. Variations in chimpanzee kinases compared to human kinases are also brought about by differences in functions of domains tethered to the catalytic kinase domain. For example, the heterodimer-forming PB1 domain related to the fold of the ubiquitin/Ras-binding domain is seen uniquely tethered to a PKC-like chimpanzee kinase.
Though chimpanzee and human are evolutionarily very close, there are chimpanzee kinases with no close counterpart in human, suggesting differences in their functions. This analysis provides a direction for experimental analysis of human and chimpanzee protein kinases in order to enhance our understanding of their specific biological roles.
PMCID: PMC2651890  PMID: 19105813
9.  Hint2, A Mitochondrial Apoptotic Sensitizer Down-Regulated in Hepatocellular Carcinoma 
Gastroenterology  2006;130(7):2179-2188.
Background & Aims:
Hints, Histidine triad nucleotide-binding proteins, are adenosine monophosphate–lysine hydrolases of uncertain biological function. Here we report the characterization of human Hint2.
Tissue distribution was determined by real-time quantitative polymerase chain reaction and immunoblotting, cellular localization by immunocytochemistry, and transfection with green fluorescent protein constructs. Enzymatic activities for protein kinase C and adenosine phosphoramidase in the presence of Hint2 were measured. HepG2 cell lines with Hint2 overexpressed or knocked down were established. Apoptosis was assessed by immunoblotting for caspases and by flow cytometry. Tumor growth was measured in SCID mice. Expression in human tumors was investigated by microarrays.
Hint2 was predominantly expressed in liver and pancreas. Hint2 was localized in mitochondria. Hint2 hydrolyzed adenosine monophosphate linked to an amino group (AMP-pNA; kcat: 0.0223 s⁻¹; Km: 128 μmol/L). Exposed to apoptotic stress, fewer HepG2 cells overexpressing Hint2 remained viable (32.2 ± 0.6% vs 57.7 ± 4.6%), and more cells displayed changes of the mitochondrial membrane potential (87.8 ± 2.35 vs 49.7 ± 1.6%) with more cleaved caspases than control cells. The opposite was observed in HepG2 cells with knock-down expression of Hint2. Subcutaneous injection of HepG2 cells overexpressing Hint2 in SCID mice resulted in smaller tumors (0.32 ± 0.13 g vs 0.85 ± 0.35 g). Microarray analyses revealed that HINT2 messenger RNA is down-regulated in hepatocellular carcinomas (−0.42 ± 0.58 log2 vs −0.11 ± 0.28 log2). Low abundance of HINT2 messenger RNA was associated with poor survival.
Hint2 defines a novel class of mitochondrial apoptotic sensitizers down-regulated in hepatocellular carcinoma.
PMCID: PMC2569837  PMID: 16762638
10.  Tumor suppressor and hepatocellular carcinoma 
A few signaling pathways drive the growth of hepatocellular carcinoma. Each of these pathways possesses negative regulators. These enzymes, which normally suppress unchecked cell proliferation, are circumvented in the oncogenic process: either the over-activity of oncogenes is sufficient to annihilate the activity of tumor suppressors, or the tumor suppressors have been rendered ineffective. The loss of several key tumor suppressors has been described in hepatocellular carcinoma. Here, we systematically review the evidence implicating tumor suppressors in the development of hepatocellular carcinoma.
PMCID: PMC2695912  PMID: 18350603
Tumor suppressor; Hepatocellular carcinoma; Deregulation; Liver; Carcinogenesis
11.  Structural deformation upon protein-protein interaction: A structural alphabet approach 
In a number of protein-protein complexes, the 3D structures of bound and unbound partners significantly differ, supporting the induced fit hypothesis for protein-protein binding.
In this study, we explore the induced fit modifications on a set of 124 proteins available in both bound and unbound forms, in terms of local structure. The local structure is described using a structural alphabet of 27 structural letters that allows a detailed description of the backbone. Using a control set to distinguish induced fit from experimental error and natural protein flexibility, we show that the fraction of structural letters modified upon binding is significantly greater than in the control set (36% versus 28%). This proportion is even greater in the interface regions (41%). Interface regions preferentially involve coils. Our analysis further reveals that some structural letters in coil are not favored in the interface. We show that certain structural letters in coil are particularly subject to modifications at the interface, and that the severity of structural change also varies. This information is used to derive a structural letter substitution matrix that summarizes the local structural changes observed in our data set. We also illustrate the usefulness of our approach to identify common binding motifs in unrelated proteins.
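The two quantities discussed above, the fraction of modified structural letters and the raw counts behind a substitution matrix, can be sketched as follows. The sketch assumes that the unbound and bound forms of a protein have already been encoded as equal-length strings of structural letters covering the same residues; the function names and the toy strings are hypothetical, and the normalization used to turn counts into matrix scores is not given in the abstract and is not attempted.

```python
from collections import Counter


def modified_fraction(unbound, bound):
    """Fraction of positions whose structural letter differs between forms."""
    if len(unbound) != len(bound) or not unbound:
        raise ValueError("encodings must be non-empty and of equal length")
    changed = sum(u != b for u, b in zip(unbound, bound))
    return changed / len(unbound)


def substitution_counts(pairs):
    """Counts of (unbound letter, bound letter) over many protein pairs."""
    counts = Counter()
    for unbound, bound in pairs:
        if len(unbound) != len(bound):
            raise ValueError("encodings must be of equal length")
        counts.update(zip(unbound, bound))
    return counts


# Hypothetical encodings of one protein in unbound and bound form.
fraction = modified_fraction("AABC", "AACC")
counts = substitution_counts([("AABC", "AACC")])
```

Running the same functions on a control set of structure pairs gives the baseline against which the bound/unbound fraction is compared.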
Our study provides qualitative information about induced fit. These results could be of help for flexible docking.
PMCID: PMC2315654  PMID: 18307769
12.  Analysis of an optimal hidden Markov model for secondary structure prediction 
Secondary structure prediction is a useful first step toward 3D structure prediction. A number of successful secondary structure prediction methods use neural networks, but unfortunately, neural networks are not intuitively interpretable. In contrast, hidden Markov models are graphical, interpretable models. Moreover, they have been successfully used in many bioinformatic applications. Because they offer a strong statistical background and allow model interpretation, we propose a method based on hidden Markov models.
Our HMM is designed without prior knowledge. It is chosen within a collection of models of increasing size, using statistical and accuracy criteria. The resulting model has 36 hidden states: 15 that model α-helices, 12 that model coil and 9 that model β-strands. Connections between hidden states and state emission probabilities reflect the organization of protein structures into secondary structure segments. We start by analyzing the model features and see how it offers a new vision of local structures. We then use it for secondary structure prediction. Our model appears to be very efficient on single sequences, with a Q3 score of 68.8%, more than one point above PSIPRED prediction on single sequences. A straightforward extension of the method allows the use of multiple sequence alignments, raising the Q3 score to 75.5%.
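For readers unfamiliar with the Q3 score, it is the percentage of residues whose predicted three-state class (helix, strand or coil) matches the observed one. A minimal sketch, assuming both assignments are given as equal-length strings over the letters H, E and C (the function name and the example strings are hypothetical):

```python
def q3(predicted, observed):
    """Percentage of residues with a correctly predicted three-state class."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("assignments must be non-empty and of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Hypothetical five-residue example: four of five states agree.
score = q3("HHEEC", "HHECC")
```

Over a data set, the score is usually reported on all residues pooled together or averaged per protein; the abstract does not say which convention was used.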
The hidden Markov model presented here achieves valuable prediction results using only a limited number of parameters. It provides an interpretable framework for protein secondary structure architecture. Furthermore, it can be used as a tool for generating protein sequences with a given secondary structure content.
PMCID: PMC1769381  PMID: 17166267
13.  Protein secondary structure assignment revisited: a detailed analysis of different assignment methods 
A number of methods are now available to perform automatic assignment of periodic secondary structures from atomic coordinates, based on different characteristics of the secondary structures. In general, these methods exhibit a broad consensus as to the location of most helix and strand core segments in protein structures. However, the termini of the segments are often ill-defined and it is difficult to decide unambiguously which residues at the edge of the segments have to be included. In addition, there is a "twilight zone" where secondary structure segments depart significantly from the idealized models of Pauling and Corey. For these segments, one has to decide whether the observed structural variations are merely distortions or whether they constitute a break in the secondary structure.
To address these problems, we have developed a method for secondary structure assignment, called KAKSI. Assignments made by KAKSI are compared with assignments given by DSSP, STRIDE, XTLSSTR, PSEA and SECSTR, as well as secondary structures found in PDB files, on 4 datasets (X-ray structures with different resolution range, NMR structures).
A detailed comparison of KAKSI assignments with those of STRIDE and PSEA reveals that KAKSI assigns slightly longer helices and strands than STRIDE in the case of a one-to-one correspondence between the segments. However, KAKSI also tends to favor the assignment of several short helices when STRIDE and PSEA assign longer, kinked helices. Helices assigned by KAKSI have geometrical characteristics close to those described in the PDB. They are more linear than helices assigned by other methods. The same tendency to split long segments is observed for strands, although less systematically. We present a number of cases of secondary structure assignments that illustrate this behavior.
Our method provides valuable assignments which favor the regularity of secondary structure segments.
PMCID: PMC1249586  PMID: 16164759
14.  A first genotyping assay of French cattle breeds based on a new allele of the extension gene encoding the melanocortin-1 receptor (Mc1r) 
The seven transmembrane domain melanocortin-1 receptor (Mc1r) encoded by the coat color extension gene (E) plays a key role in the signaling pathway of melanin synthesis. Upon the binding of agonist (melanocortin hormone, α-MSH) or antagonist (Agouti protein) ligands, the melanosomal synthesis of eumelanin and/or phaeomelanin pigments is stimulated or inhibited, respectively. Different alleles of the extension gene were cloned from unrelated animals belonging to French cattle breeds and sequenced. The wild type E allele was mainly present in Normande cattle, the dominant ED allele in animals with black color (i.e. Holstein), whereas the recessive e allele was identified in homozygous animals exhibiting a more or less strong red coat color (Blonde d'Aquitaine, Charolaise, Limousine and Salers). A new allele, named E1, was found in either homozygous (E1/E1) or heterozygous (E1/E) individuals in Aubrac and Gasconne breeds. This allele displayed a 4 amino acid duplication (12 nucleotides) located within the third cytoplasmic loop of the receptor, a region known to interact with G proteins. A first genotyping assay of the main French cattle breeds is described based on these four extension alleles.
PMCID: PMC2706875  PMID: 14736379
cattle; genotyping; coat color; extension; polymorphism
