1.  Evaluation of Protein Dihedral Angle Prediction Methods 
PLoS ONE  2014;9(8):e105667.
Tertiary structure prediction of a protein from its amino acid sequence is one of the major challenges in the field of bioinformatics. The hierarchical approach is one of the persuasive techniques used for predicting protein tertiary structure, especially in the absence of homologous protein structures. In the hierarchical approach, intermediate states such as secondary structure, dihedral angles and Cα-Cα distance bounds are predicted. These intermediate states are used to restrain the protein backbone and assist its correct folding. In recent years, several methods have been developed for predicting dihedral angles of a protein, but it is difficult to conclude which method is better than the others. In this study, we benchmarked the performance of the dihedral prediction methods ANGLOR and SPINE X on various datasets, including independent datasets. The TANGLE dihedral prediction method was not benchmarked (due to unavailability of its standalone version) and was compared with SPINE X and ANGLOR only on the ANGLOR dataset, on which TANGLE has reported its results. It was observed that SPINE X performed better than ANGLOR and TANGLE, especially in the prediction of dihedral angles of glycine and proline residues. The analysis suggested that angle shifting was the foremost reason for the better performance of SPINE X. We further evaluated the performance of the methods on the independent ccPDB30 dataset and observed that SPINE X performed better than ANGLOR.
PMCID: PMC4148315  PMID: 25166857
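Because dihedral angles are periodic, a benchmark like the one in entry 1 has to compare predictions and observations on a circle: a prediction of 175° against an observed -178° is off by 7°, not 353°. The following is a minimal sketch of such a periodic error measure. The function names and the sample angles are illustrative assumptions, and this is not necessarily the exact metric used in the paper.

```python
def angular_error(predicted_deg, observed_deg):
    """Smallest absolute difference between two angles, in degrees (0 to 180)."""
    diff = abs(predicted_deg - observed_deg) % 360.0
    return min(diff, 360.0 - diff)


def mean_absolute_error(predicted, observed):
    """Mean periodic error over paired lists of predicted and observed angles."""
    errors = [angular_error(p, o) for p, o in zip(predicted, observed)]
    return sum(errors) / len(errors)


# Hypothetical phi angles (degrees) for three residues.
predicted_phi = [-60.0, 175.0, -170.0]
observed_phi = [-65.0, -178.0, 172.0]
print(mean_absolute_error(predicted_phi, observed_phi))
```

Without the wrap-around step, the second and third residues would contribute errors of 353° and 342°, which would dominate the average even though both predictions are close to the observed values.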
2.  In Silico Approach for Predicting Toxicity of Peptides and Proteins 
PLoS ONE  2013;8(9):e73957.
Over the past few decades, scientific research has been focused on developing peptide/protein-based therapies to treat various diseases. With the several advantages over small molecules, including high specificity, high penetration, ease of manufacturing, peptides have emerged as promising therapeutic molecules against many diseases. However, one of the bottlenecks in peptide/protein-based therapy is their toxicity. Therefore, in the present study, we developed in silico models for predicting toxicity of peptides and proteins.
We obtained toxic peptides having 35 or fewer residues from various databases for developing prediction models. Non-toxic or random peptides were obtained from SwissProt and TrEMBL. It was observed that certain residues like Cys, His, Asn, and Pro are abundant as well as preferred at various positions in toxic peptides. We developed models based on machine learning technique and quantitative matrix using various properties of peptides for predicting toxicity of peptides. The performance of dipeptide-based model in terms of accuracy was 94.50% with MCC 0.88. In addition, various motifs were extracted from the toxic peptides and this information was combined with dipeptide-based model for developing a hybrid model. In order to evaluate the over-optimization of the best model based on dipeptide composition, we evaluated its performance on independent datasets and achieved accuracy around 90%. Based on above study, a web server, ToxinPred has been developed, which would be helpful in predicting (i) toxicity or non-toxicity of peptides, (ii) minimum mutations in peptides for increasing or decreasing their toxicity, and (iii) toxic regions in proteins.
ToxinPred is a unique in silico method of its kind, which will be useful in predicting toxicity of peptides/proteins. In addition, it will be useful in designing least toxic peptides and discovering toxic regions in proteins. We hope that the development of ToxinPred will provide momentum to peptide/protein-based drug discovery.
PMCID: PMC3772798  PMID: 24058508
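The dipeptide-based model in entry 2 relies on dipeptide composition, a fixed-length feature vector that holds the percentage of each of the 400 possible adjacent residue pairs in a peptide. A minimal sketch of that encoding is shown below. The function name and the handling of non-standard residues (they are skipped) are assumptions, not details taken from the paper.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of the 20 standard amino acids.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(peptide):
    """Return a 400-element list with the percentage of each dipeptide."""
    peptide = peptide.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(peptide) - 1):
        pair = peptide[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [100.0 * counts[d] / total for d in DIPEPTIDES]


features = dipeptide_composition("ACAC")
print(features[DIPEPTIDES.index("AC")], features[DIPEPTIDES.index("CA")])
```

A vector of this kind has the same length for every peptide, which is what allows peptides of different lengths to be fed to a single classifier.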
3.  An in silico platform for predicting, screening and designing of antihypertensive peptides 
Scientific Reports  2015;5:12512.
High blood pressure or hypertension is an affliction that threatens millions of lives worldwide. Peptides from natural origin have been shown recently to be highly effective in lowering blood pressure. In the present study, we have framed a platform for predicting and designing novel antihypertensive peptides. Due to a large variation found in the length of antihypertensive peptides, we divided these peptides into four categories (i) Tiny peptides, (ii) small peptides, (iii) medium peptides and (iv) large peptides. First, we developed SVM based regression models for tiny peptides using chemical descriptors and achieved maximum correlation of 0.701 and 0.543 for dipeptides and tripeptides, respectively. Second, classification models were developed for small peptides and achieved maximum accuracy of 76.67%, 72.04% and 77.39% for tetrapeptide, pentapeptide and hexapeptides, respectively. Third, we have developed a model for medium peptides using amino acid composition and achieved maximum accuracy of 82.61%. Finally, we have developed a model for large peptides using amino acid composition and achieved maximum accuracy of 84.21%. Based on the above study, a web-based platform has been developed for locating antihypertensive peptides in a protein, screening of peptides and designing of antihypertensive peptides.
PMCID: PMC4515604  PMID: 26213115
4.  VaccineDA: Prediction, design and genome-wide screening of oligodeoxynucleotide-based vaccine adjuvants 
Scientific Reports  2015;5:12478.
Immunomodulatory oligodeoxynucleotides (IMODNs) are the short DNA sequences that activate the innate immune system via toll-like receptor 9. These sequences predominantly contain unmethylated CpG motifs. In this work, we describe VaccineDA (Vaccine DNA adjuvants), a web-based resource developed to design IMODN-based vaccine adjuvants. We collected and analyzed 2193 experimentally validated IMODNs obtained from the literature. Certain types of nucleotides (e.g., T, GT, TC, TT, CGT, TCG, TTT) are dominant in IMODNs. Based on these observations, we developed support vector machine-based models to predict IMODNs using various compositions. The developed models achieved the maximum Matthews Correlation Coefficient (MCC) of 0.75 with an accuracy of 87.57% using the pentanucleotide composition. The integration of motif information further improved the performance of our model from the MCC of 0.75 to 0.77. Similarly, models were developed to predict palindromic IMODNs and attained a maximum MCC of 0.84 with the accuracy of 91.94%. These models were evaluated using a five-fold cross-validation technique as well as validated on an independent dataset. The models developed in this study were integrated into VaccineDA to provide a wide range of services that facilitate the design of DNA-based vaccine adjuvants.
PMCID: PMC4515643  PMID: 26212482
5.  QSAR based model for discriminating EGFR inhibitors and non-inhibitors using Random forest 
Biology Direct  2015;10:10.
Epidermal Growth Factor Receptor (EGFR) is a well-characterized cancer drug target. In the past, several QSAR models have been developed for predicting inhibition activity of molecules against EGFR. These models are useful to a limited set of molecules for a particular class like quinazoline-derivatives. In this study, an attempt has been made to develop prediction models on a large set of molecules (~3500 molecules) that include diverse scaffolds like quinazoline, pyrimidine, quinoline and indole.
We trained, tested and validated our classification models on a dataset called EGFR10 that contains 508 inhibitors (having inhibition activity IC50 less than 10 nM) and 2997 non-inhibitors. Our Random forest based model achieved a maximum MCC of 0.49 with an accuracy of 83.7% on a validation set using 881 PubChem fingerprints. In this study, a frequency-based feature selection technique was used to identify the best fingerprints. It was observed that PubChem fingerprints FP380 (C(~O) (~O)), FP579 (O = C-C-C-C), FP388 (C(:C) (:N) (:N)) and FP816 (ClC1CC(Br)CCC1) are more frequent in the inhibitors than in the non-inhibitors. In addition, we created further datasets, namely EGFR100 containing inhibitors having IC50 < 100 nM and EGFR1000 containing inhibitors having IC50 < 1000 nM. We trained, tested and validated our models on the EGFR100 and EGFR1000 datasets and achieved maximum MCC of 0.58 and 0.71, respectively. In addition, models were developed for predicting quinazoline and pyrimidine based EGFR inhibitors.
In summary, models have been developed on a large set of molecules of various classes for discriminating EGFR inhibitors and non-inhibitors. These highly accurate prediction models can be used to design and discover novel EGFR inhibitors. In order to provide service to the scientific community, a web server/standalone EGFRpred has also been developed.
This article was reviewed by Dr Murphy, Prof Wang and Dr. Eisenhaber.
Electronic supplementary material
The online version of this article (doi:10.1186/s13062-015-0046-9) contains supplementary material, which is available to authorized users.
PMCID: PMC4372225  PMID: 25880749
EGFR inhibitors; Classification of EGFR inhibitors and non-inhibitors; Active substructure; Active functional groups; PubChem fingerprint; QSAR; Random forest
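The three datasets in entry 5 differ only in the IC50 cut-off that separates inhibitors from non-inhibitors. A minimal sketch of that labelling step is given below. The thresholds come from the abstract, while the function names and the sample molecules with their IC50 values are hypothetical.

```python
# IC50 cut-offs in nM, as stated in the abstract.
THRESHOLDS_NM = {"EGFR10": 10.0, "EGFR100": 100.0, "EGFR1000": 1000.0}


def label_molecules(ic50_by_molecule, threshold_nm):
    """Mark a molecule as an inhibitor (True) when its IC50 is below the cut-off."""
    return {name: ic50 < threshold_nm for name, ic50 in ic50_by_molecule.items()}


def build_datasets(ic50_by_molecule):
    """Label the same molecules once per dataset threshold."""
    return {
        dataset: label_molecules(ic50_by_molecule, threshold)
        for dataset, threshold in THRESHOLDS_NM.items()
    }


# Hypothetical molecules with made-up IC50 values in nM.
example = {"mol_a": 4.0, "mol_b": 55.0, "mol_c": 800.0, "mol_d": 5000.0}
for dataset, labels in build_datasets(example).items():
    print(dataset, sum(labels.values()), "inhibitors")
```

Relaxing the cut-off moves molecules from the negative to the positive class, so the class balance changes from one dataset to the next. This should be kept in mind when comparing the MCC values reported for the three datasets.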
6.  AHTPDB: a comprehensive platform for analysis and presentation of antihypertensive peptides 
Nucleic Acids Research  2014;43(Database issue):D956-D962.
AHTPDB is a manually curated database of experimentally validated antihypertensive peptides. Information pertaining to peptides with antihypertensive activity was collected from research articles and from various peptide repositories. These peptides were derived from 35 major sources that include milk, egg, fish, pork, chicken, soybean, etc. In AHTPDB, most of the peptides belong to a family of angiotensin-I converting enzyme inhibiting peptides. The current release of AHTPDB contains 5978 peptide entries among which 1694 are unique peptides. Each entry provides detailed information about a peptide like sequence, inhibitory concentration (IC50), toxicity/bitterness value, source, length, molecular mass and information related to purification of peptides. In addition, the database provides structural information of these peptides that includes predicted tertiary and secondary structures. A user-friendly web interface with various tools has been developed to retrieve and analyse the data. It is anticipated that AHTPDB will be a useful and unique resource for the researchers working in the field of antihypertensive peptides.
PMCID: PMC4383949  PMID: 25392419
7.  CancerPPD: a database of anticancer peptides and proteins 
Nucleic Acids Research  2014;43(Database issue):D837-D843.
CancerPPD is a repository of experimentally verified anticancer peptides (ACPs) and anticancer proteins. Data were manually collected from published research articles, patents and from other databases. The current release of CancerPPD consists of 3491 ACP and 121 anticancer protein entries. Each entry provides comprehensive information related to a peptide like its source of origin, nature of the peptide, anticancer activity, N- and C-terminal modifications, conformation, etc. Additionally, CancerPPD provides the information of around 249 types of cancer cell lines and 16 different assays used for testing the ACPs. In addition to natural peptides, CancerPPD contains peptides having non-natural, chemically modified residues and D-amino acids. Besides this primary information, CancerPPD stores predicted tertiary structures as well as peptide sequences in SMILES format. Tertiary structures of peptides were predicted using the state-of-the-art method PEPstr, and secondary structural states were assigned using DSSP. In order to assist users, a number of web-based tools have been integrated; these include keyword search, data browsing, and sequence and structural similarity search. We believe that CancerPPD will be very useful in designing peptide-based anticancer therapeutics.
PMCID: PMC4384006  PMID: 25270878
8.  QSAR-Based Models for Designing Quinazoline/Imidazothiazoles/Pyrazolopyrimidines Based Inhibitors against Wild and Mutant EGFR 
PLoS ONE  2014;9(7):e101079.
Overexpression of EGFR is responsible for causing a number of cancers, including lung cancer, as it activates various downstream signaling pathways. Thus, it is important to control EGFR function in order to treat cancer patients. It is well established that inhibiting ATP binding within the EGFR kinase domain regulates its function. The existing quinazoline derivative based drugs used for treating lung cancer inhibit the wild type of EGFR. In this study, we have made a systematic attempt to develop QSAR models for designing quinazoline derivatives that could inhibit wild EGFR and imidazothiazoles/pyrazolopyrimidines derivatives against mutant EGFR. Three types of prediction methods have been developed to design inhibitors against EGFR (wild, mutant and both). First, we developed models for predicting inhibitors against wild type EGFR by training and testing on a dataset containing 128 quinazoline based inhibitors. This dataset was divided into two subsets called wild_train and wild_valid, containing 103 and 25 inhibitors respectively. The models were trained and tested on the wild_train dataset, while performance was evaluated on the wild_valid validation dataset. We achieved a maximum correlation between predicted and experimentally determined inhibition (IC50) of 0.90 on the validation dataset. Secondly, we developed models for predicting inhibitors against mutant EGFR (L858R) on the mutant_train and mutant_valid datasets and achieved a maximum correlation between 0.834 and 0.850 on these datasets. Finally, an integrated hybrid model was developed on a dataset containing wild and mutant inhibitors and achieved a maximum correlation between 0.761 and 0.850 on different datasets. In order to promote open source drug discovery, we developed a webserver for designing inhibitors against wild and mutant EGFR, along with standalone and Galaxy versions of the software. We hope our webserver will play a vital role in designing new anticancer drugs.
PMCID: PMC4081576  PMID: 24992720
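The correlation values reported in entry 8 measure how closely predicted inhibition tracks the experimentally determined values on a held-out set. A minimal sketch of the Pearson correlation coefficient is given below, on the assumption that this is the coefficient meant, since the abstract does not name it. The function name is illustrative, and no data from the paper is reproduced.

```python
import math


def pearson_r(predicted, observed):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(predicted)
    if n != len(observed) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 values")
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    covariance = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_p == 0 or var_o == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(var_p * var_o)


# Toy values only: a perfectly linear relationship gives r = 1.
print(pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

With a validation set of only 25 inhibitors, as in wild_valid, a single outlier can shift the coefficient noticeably, so the value is best read together with the size of the set it was computed on.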
9.  ParaPep: a web resource for experimentally validated antiparasitic peptide sequences and their structures 
ParaPep is a repository of antiparasitic peptides, which provides comprehensive information related to experimentally validated antiparasitic peptide sequences and their structures. The data were collected and compiled from published research papers, patents and from various databases. The current release of ParaPep holds 863 entries among which 519 are unique peptides. In addition to peptides having natural amino acids, ParaPep also consists of peptides having d-amino acids and chemically modified residues. In ParaPep, most of the peptides have been evaluated for growth inhibition of various species of Plasmodium, Leishmania and Trypanosoma. We have provided comprehensive information about these peptides that include peptide sequence, chemical modifications, stereochemistry, antiparasitic activity, origin, nature of peptide, assay types, type of parasite, mode of action and hemolytic activity. Structures of peptides consisting of natural, as well as modified amino acids have been determined using state-of-the-art software, PEPstr. To facilitate users, various user-friendly web tools, for data fetching, analysis and browsing, have been integrated. We hope that ParaPep will be advantageous in designing therapeutic peptides against parasitic diseases.
PMCID: PMC4054663  PMID: 24923818
10.  Designing of promiscuous inhibitors against pancreatic cancer cell lines 
Scientific Reports  2014;4:4668.
Pancreatic cancer remains the most devastating disease with the worst prognosis. There is a pressing need to accelerate the drug discovery process to identify new effective drug candidates against pancreatic cancer. We have developed QSAR models for predicting promiscuous inhibitors using the pharmacological data. Our models achieved a maximum Pearson correlation coefficient of 0.86 when evaluated using 10-fold cross-validation. Our models have also successfully validated the drug-to-oncogene relationship; further, we used these models to screen FDA approved drugs and tested them in vitro. We have integrated these models in a webserver named DiPCell, which will be useful for screening and designing novel promiscuous drug molecules. We have also identified the most and least effective drugs for pancreatic cancer cell lines. On the other hand, we have identified resistant pancreatic cancer cell lines, which need further investigation to shed light on the resistance mechanism in pancreatic cancer.
PMCID: PMC3985076  PMID: 24728108
11.  Herceptin Resistance Database for Understanding Mechanism of Resistance in Breast Cancer Patients 
Scientific Reports  2014;4:4483.
The monoclonal antibody Trastuzumab/Herceptin is considered a frontline therapy for Her2-positive breast cancer patients. However, it is not effective in several patients due to acquired or de novo resistance. In the last decade, several assays have been performed to understand the mechanism of Herceptin resistance with/without supplementary drugs. This manuscript describes a database, HerceptinR, developed for understanding the mechanism of resistance at the genetic level. HerceptinR maintains information about 2500 assays performed against various breast cancer cell lines (BCCs), for improving sensitivity of Herceptin with or without supplementary drugs. In order to understand Herceptin resistance at the genetic level, we integrated genomic data of BCCs that include expression, mutations and copy number variations in different cell lines. HerceptinR will play a vital role in i) designing biomarkers to identify patients eligible for Herceptin treatment and ii) identification of an appropriate supplementary drug for a particular patient. HerceptinR is available at
PMCID: PMC3967150  PMID: 24670875
12.  PCMdb: Pancreatic Cancer Methylation Database 
Scientific Reports  2014;4:4197.
Pancreatic cancer is the fifth most aggressive malignancy and urgently requires new biomarkers to facilitate early detection. For providing impetus to the biomarker discovery, we have developed Pancreatic Cancer Methylation Database (PCMdb), a comprehensive resource dedicated to methylation of genes in pancreatic cancer. Data was collected and compiled manually from published literature. PCMdb has 65907 entries for methylation status of 4342 unique genes. In PCMdb, data was compiled for both cancer cell lines (53565 entries for 88 cell lines) and cancer tissues (12342 entries for 3078 tissue samples). Among these entries, 47.22% entries reported a high level of methylation for the corresponding genes while 10.87% entries reported low level of methylation. PCMdb covers five major subtypes of pancreatic cancer; however, most of the entries were compiled for adenocarcinomas (88.38%) and mucinous neoplasms (5.76%). A user-friendly interface has been developed for data browsing, searching and analysis. We anticipate that PCMdb will be helpful for pancreatic cancer biomarker discovery.
PMCID: PMC3935225  PMID: 24569397
13.  Correction: Hybrid Approach for Predicting Coreceptor Used by HIV-1 from Its V3 Loop Amino Acid Sequence 
PLoS ONE  2013;8(11):10.1371/annotation/5c57dcdc-e5d9-4999-a7d0-32004427cba5.
PMCID: PMC3821749
14.  Hemolytik: a database of experimentally determined hemolytic and non-hemolytic peptides 
Nucleic Acids Research  2013;42(Database issue):D444-D449.
Hemolytik is a manually curated database of experimentally determined hemolytic and non-hemolytic peptides. Data were compiled from a large number of published research articles and various databases like Antimicrobial Peptide Database, Collection of Anti-microbial Peptides, Dragon Antimicrobial Peptide Database and Swiss-Prot. The current release of Hemolytik database contains ∼3000 entries that include ∼2000 unique peptides whose hemolytic activities were evaluated on erythrocytes isolated from as many as 17 different sources. Each entry in Hemolytik provides comprehensive information about a peptide, like its name, sequence, origin, reported function, property such as chirality, types (linear and cyclic), end modifications as well as details pertaining to its hemolytic activity. In addition, tertiary structure of each peptide has been predicted, and secondary structure states have been assigned. To facilitate the scientific community, a user-friendly interface has been developed with various tools for data searching and analysis. We hope Hemolytik will be useful for researchers working in the field of designing therapeutic peptides.
PMCID: PMC3964980  PMID: 24174543
15.  In silico Platform for Prediction of N-, O- and C-Glycosites in Eukaryotic Protein Sequences 
PLoS ONE  2013;8(6):e67008.
Glycosylation is one of the most abundant and important post-translational modifications of proteins. Glycosylated proteins (glycoproteins) are involved in various cellular biological functions like protein folding, cell-cell interactions, cell recognition and host-pathogen interactions. A large number of eukaryotic glycoproteins also have therapeutic and potential technology applications. Therefore, characterization and analysis of glycosites (glycosylated residues) in these proteins is of great interest to biologists. In order to cater to these needs, a number of in silico tools have been developed over the years; however, a need for even better prediction tools remains. Therefore, in this study we have developed a new webserver, GlycoEP, for more accurate prediction of N-linked, O-linked and C-linked glycosites in eukaryotic glycoproteins using two larger datasets, namely, standard and advanced datasets. In the standard datasets, no two glycosylated proteins are more than 40% similar; the advanced datasets are highly non-redundant, with no two glycosites’ patterns (as defined in methods) having more than 60% similarity. Further, based on our results with several algorithms developed using different machine-learning techniques, we found Support Vector Machine (SVM) to be the optimum tool to develop glycosite prediction models. Accordingly, using our more stringent and non-redundant advanced datasets, the SVM based models developed in this study achieved a prediction accuracy of 84.26%, 86.87% and 91.43% with corresponding MCC of 0.54, 0.20 and 0.78, for N-, O- and C-linked glycosites, respectively. The best performing models trained on advanced datasets were then implemented as a user-friendly web server, GlycoEP. Additionally, this server provides prediction models developed on standard datasets and allows users to scan sequons in input protein sequences.
PMCID: PMC3695939  PMID: 23840574
16.  Improved Method for Linear B-Cell Epitope Prediction Using Antigen’s Primary Sequence 
PLoS ONE  2013;8(5):e62216.
One of the major challenges in designing a peptide-based vaccine is the identification of antigenic regions in an antigen that can stimulate a B-cell response, also called B-cell epitopes. In the past, several methods have been developed for the prediction of conformational and linear (or continuous) B-cell epitopes. However, the existing methods for predicting linear B-cell epitopes are far from perfect. In this study, an attempt has been made to develop an improved method for predicting linear B-cell epitopes. We have retrieved experimentally validated B-cell epitopes as well as non B-cell epitopes from Immune Epitope Database and derived two types of datasets called Lbtope_Variable and Lbtope_Fixed length datasets. The Lbtope_Variable dataset contains 14876 B-cell epitopes and 23321 non-epitopes of variable length, whereas the Lbtope_Fixed length dataset contains 12063 B-cell epitopes and 20589 non-epitopes of fixed length. We also evaluated the performance of models on the above datasets after removing highly identical peptides from the datasets. In addition, we have derived a third dataset, Lbtope_Confirm, having 1042 epitopes and 1795 non-epitopes where each epitope or non-epitope has been experimentally validated in at least two studies. A number of models have been developed to discriminate epitopes and non-epitopes using different machine-learning techniques like Support Vector Machine and K-Nearest Neighbor. We achieved accuracy from ∼54% to 86% using diverse features like binary profile, dipeptide composition and AAP (amino acid pair) profile. In this study, for the first time, experimentally validated non B-cell epitopes have been used for developing a method for predicting linear B-cell epitopes. In previous studies, random peptides have been used as non B-cell epitopes. In order to provide service to the scientific community, a web server, LBtope, has been developed for predicting and designing B-cell epitopes.
PMCID: PMC3646881  PMID: 23667458
17.  Hybrid Approach for Predicting Coreceptor Used by HIV-1 from Its V3 Loop Amino Acid Sequence 
PLoS ONE  2013;8(4):e61437.
HIV-1 infects the host cell by interacting with the primary receptor CD4 and a coreceptor, CCR5 or CXCR4. Maraviroc, a CCR5 antagonist, binds to the CCR5 receptor. Thus, it is important to identify the coreceptor used by the HIV strains dominating in the patient. In the past, a number of experimental assays and in-silico techniques have been developed for predicting the coreceptor tropism. The prediction accuracy of these methods is excellent when predicting CCR5(R5) tropic sequences but is relatively poor for CXCR4(X4) tropic sequences. Therefore, any new method for accurate determination of coreceptor usage would be of paramount importance to the successful management of HIV-infected individuals.
The dataset used in this study comprised 1799 R5-tropic and 598 X4-tropic third variable (V3) sequences of HIV-1. We compared the amino acid composition of both types of V3 sequences and observed that certain types of residues, e.g., Asparagine and Isoleucine, were preferred in R5-tropic sequences whereas residues like Lysine, Arginine, and Tryptophan were preferred in X4-tropic sequences. Initially, Support Vector Machine-based models were developed using amino acid composition, dipeptide composition, and split amino acid composition, which achieved accuracy up to 90%. We used BLAST to discriminate R5- and X4-tropic sequences and correctly predicted 93.16% of R5- and 75.75% of X4-tropic sequences. In order to improve the prediction accuracy, a Hybrid model was developed that achieved 91.66% sensitivity, 81.77% specificity, 89.19% accuracy and 0.72 Matthews Correlation Coefficient. The performance of our models was also evaluated on an independent dataset (256 R5- and 81 X4-tropic sequences) and achieved maximum accuracy of 84.87% with Matthews Correlation Coefficient 0.63.
This study describes a highly efficient method for predicting HIV-1 coreceptor usage from V3 sequences. In order to provide a service to the scientific community, a webserver, HIVcoPred, was developed for predicting the coreceptor usage.
PMCID: PMC3626595  PMID: 23596523
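The hybrid-model figures in entry 17 can be cross-checked against the stated dataset size. Assuming R5-tropic sequences are treated as the positive class (an assumption, since the abstract does not say so), the reported sensitivity and specificity imply a confusion matrix from which accuracy and the Matthews Correlation Coefficient follow. The sketch below reconstructs those counts, and the function names are illustrative.

```python
import math


def confusion_from_rates(n_pos, n_neg, sensitivity, specificity):
    """Rebuild (TP, FN, TN, FP) counts from class sizes and reported rates."""
    tp = round(sensitivity * n_pos)
    tn = round(specificity * n_neg)
    return tp, n_pos - tp, tn, n_neg - tn


def accuracy(tp, fn, tn, fp):
    """Fraction of all sequences that were classified correctly."""
    return (tp + tn) / (tp + fn + tn + fp)


def mcc(tp, fn, tn, fp):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


# 1799 R5-tropic (assumed positive) and 598 X4-tropic sequences, as reported.
counts = confusion_from_rates(1799, 598, 0.9166, 0.8177)
print(counts, round(accuracy(*counts), 4), round(mcc(*counts), 2))
```

Under that assumption the reconstructed accuracy and MCC agree with the reported 89.19% and 0.72. The same calculation shows why accuracy alone is a weak summary here: with roughly three R5 sequences for every X4 sequence, accuracy is dominated by sensitivity, whereas the MCC also reflects the weaker performance on the X4 class.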
18.  Computational approach for designing tumor homing peptides 
Scientific Reports  2013;3:1607.
Tumor homing peptides are small peptides that home specifically to tumor and tumor associated microenvironment i.e. tumor vasculature, after systemic delivery. Keeping in mind the huge therapeutic importance of these peptides, we have made an attempt to analyze and predict tumor homing peptides. It was observed that certain types of residues are preferred in tumor homing peptides. Therefore, we developed support vector machine based models for predicting tumor homing peptides using amino acid composition and binary profiles of peptides. Amino acid composition, dipeptide composition and binary profile-based models achieved a maximum accuracy of 86.56%, 82.03%, and 84.19% respectively. These methods have been implemented in a user-friendly web server, TumorHPD. We anticipate that this method will be helpful to design novel tumor homing peptides. TumorHPD web server is freely accessible at
PMCID: PMC3617442  PMID: 23558316
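Besides composition features, entry 18 uses binary profiles, in which each residue position is represented by a 20-element indicator vector, so that residue order is preserved. A minimal sketch follows. It assumes the first `length` residues from the N-terminus are encoded and that shorter peptides are padded with all-zero positions. Both choices, along with the function name and the default length, are assumptions rather than details from the paper.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def binary_profile(peptide, length=10):
    """One-hot encode the first `length` residues as a flat list of 0/1 values."""
    segment = peptide.upper()[:length]
    vector = []
    for position in range(length):
        # Positions beyond the end of a short peptide stay all-zero.
        residue = segment[position] if position < len(segment) else None
        vector.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return vector


profile = binary_profile("ACD", length=3)
print(len(profile), sum(profile))
```

Unlike amino acid composition, this representation distinguishes peptides that contain the same residues in a different order, at the cost of requiring a fixed window length.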
19.  In silico approaches for designing highly effective cell penetrating peptides 
Cell penetrating peptides have gained much recognition as a versatile transport vehicle for the intracellular delivery of a wide range of cargoes (e.g. oligonucleotides, small molecules, proteins) that otherwise lack bioavailability, thus offering great potential as future therapeutics. Keeping in mind the therapeutic importance of these peptides, we have developed in silico methods for the prediction of cell penetrating peptides, which can be used for rapid screening of such peptides prior to their synthesis.
In the present study, support vector machine (SVM)-based models have been developed for predicting and designing highly effective cell penetrating peptides. Various features like amino acid composition, dipeptide composition, binary profile of patterns, and physicochemical properties have been used as input features. The main dataset used in this study consists of 708 peptides. In addition, we have identified various motifs in cell penetrating peptides, and used these motifs for developing a hybrid prediction model. Performance of our method was evaluated on an independent dataset and also compared with that of the existing methods.
In cell penetrating peptides, certain residues (e.g. Arg, Lys, Pro, Trp, Leu, and Ala) are preferred at specific locations. Thus, it was possible to discriminate cell-penetrating peptides from non-cell penetrating peptides based on amino acid composition. All models were evaluated using five-fold cross-validation technique. We have achieved a maximum accuracy of 97.40% using the hybrid model that combines motif information and binary profile of the peptides. On independent dataset, we achieved maximum accuracy of 81.31% with MCC of 0.63.
The present study demonstrates that features like amino acid composition, binary profile of patterns and motifs can be used to train an SVM classifier that can predict cell penetrating peptides with higher accuracy. The hybrid model described in this study achieved higher accuracy than the previous methods and thus may complement the existing methods. Based on the above study, a user-friendly web server, CellPPD, has been developed to help biologists, where a user can predict and design CPPs with much ease. CellPPD web server is freely accessible at
PMCID: PMC3615965  PMID: 23517638
Cell penetrating peptides; Drug delivery; Amino acid composition; Support vector machine
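Entry 19 evaluates all models with five-fold cross-validation: the dataset is split into five parts, and each part serves once as the test set while the remaining four are used for training. A minimal sketch of the index splitting is shown below. The function name and the fixed random seed are assumptions, and the paper's actual partitioning is not described in the abstract.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal shuffled indices into k folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 708 is the size of the main dataset reported in the abstract.
for train, test in k_fold_indices(708):
    print(len(train), len(test))
```

Every peptide appears in exactly one test fold, so the averaged performance reflects predictions made on data the model did not see during training in that round.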
20.  CancerDR: Cancer Drug Resistance Database 
Scientific Reports  2013;3:1445.
Cancer therapies are limited by the development of drug resistance, and mutations in drug targets are one of the main reasons for developing acquired resistance. Adequate knowledge of these mutations in drug targets would help to design effective personalized therapies. Keeping this in mind, we have developed a database “CancerDR”, which provides information on 148 anti-cancer drugs and their pharmacological profiling across 952 cancer cell lines. CancerDR provides comprehensive information about each drug target that includes (i) sequence of natural variants, (ii) mutations, (iii) tertiary structure, and (iv) alignment profile of mutants/variants. A number of web-based tools have been integrated in CancerDR. This database will be very useful for identification of genetic alterations in genes encoding drug targets, and in turn the residues responsible for drug resistance. CancerDR allows users to identify promiscuous drug molecules that can kill a wide range of cancer cells. CancerDR is freely accessible at
PMCID: PMC3595698  PMID: 23486013
21.  Prediction of vitamin interacting residues in a vitamin binding protein using evolutionary information 
BMC Bioinformatics  2013;14:44.
Vitamins are important cofactors in various enzymatic reactions. In the past, many inhibitors have been designed against vitamin binding pockets in order to inhibit vitamin-protein interactions. Thus, it is important to identify vitamin interacting residues in a protein. It is possible to detect vitamin-binding pockets on a protein if its tertiary structure is known. Unfortunately, tertiary structures are available for only a limited number of proteins. Therefore, it is important to develop in-silico models for predicting vitamin interacting residues in a protein from its primary structure.
In this study, first we compared protein-interacting residues of vitamins with other ligands using Two Sample Logo (TSL). It was observed that ATP, GTP, NAD, FAD and mannose preferred {G,R,K,S,H}, {G,K,T,S,D,N}, {T,G,Y}, {G,Y,W} and {Y,D,W,N,E} residues respectively, whereas vitamins preferred {Y,F,S,W,T,G,H} residues for the interaction with proteins. Furthermore, compositional information of preferred and non-preferred residues along with patterns-specificity was also observed within different vitamin-classes. Vitamins A, B and B6 preferred {F,I,W,Y,L,V}, {S,Y,G,T,H,W,N,E} and {S,T,G,H,Y,N} interacting residues respectively. It suggested that protein-binding patterns of vitamins are different from other ligands, and motivated us to develop separate predictor for vitamins and their sub-classes. The four different prediction modules, (i) vitamin interacting residues (VIRs), (ii) vitamin-A interacting residues (VAIRs), (iii) vitamin-B interacting residues (VBIRs) and (iv) pyridoxal-5-phosphate (vitamin B6) interacting residues (PLPIRs) have been developed. We applied various classifiers of SVM, BayesNet, NaiveBayes, ComplementNaiveBayes, NaiveBayesMultinomial, RandomForest and IBk etc., as machine learning techniques, using binary and Position-Specific Scoring Matrix (PSSM) features of protein sequences. Finally, we selected best performing SVM modules and obtained highest MCC of 0.53, 0.48, 0.61, 0.81 for VIRs, VAIRs, VBIRs, PLPIRs respectively, using PSSM-based evolutionary information. All the modules developed in this study have been trained and tested on non-redundant datasets and evaluated using five-fold cross-validation technique. The performances were also evaluated on the balanced and different independent datasets.
This study demonstrates that it is possible to predict VIRs, VAIRs, VBIRs and PLPIRs from evolutionary information of protein sequence. In order to provide service to the scientific community, we have developed web-server and standalone software VitaPred (
PMCID: PMC3577447  PMID: 23387468
Vitamin-interacting residue; Pyridoxal-5-phosphate; SVM; PSSM; VitaPred
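The abstract above reports performance as MCC (Matthews correlation coefficient). As a reminder of how that number is derived, here is a minimal sketch that computes MCC from the four cells of a binary confusion matrix. The function name `mcc` and the counts in the usage line are hypothetical and are not taken from the VitaPred study.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix.

    tp/tn/fp/fn are true positives, true negatives, false positives and
    false negatives. Returns 0.0 when the denominator is zero (a common
    convention when one row or column of the matrix is empty).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Made-up counts for interacting vs. non-interacting residues.
score = mcc(tp=80, tn=70, fp=30, fn=20)
```

MCC ranges from -1 (total disagreement) through 0 (no better than chance) to +1 (perfect prediction), and it remains informative when interacting residues are much rarer than non-interacting ones, which is why it is often reported next to accuracy.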
22.  NPACT: Naturally Occurring Plant-based Anti-cancer Compound-Activity-Target database 
Nucleic Acids Research  2012;41(Database issue):D1124-D1129.
Plant-derived molecules have been highly valued by biomedical researchers and pharmaceutical companies for developing drugs, as they are thought to have been optimized during evolution. Therefore, we have collected and compiled a central resource, the Naturally Occurring Plant-based Anti-cancer Compound-Activity-Target database (NPACT, that gathers information on experimentally validated plant-derived natural compounds exhibiting anti-cancer activity (in vitro and in vivo), to complement the other databases. It currently contains 1574 compound entries, and each record provides information on the compound's structure, manually curated published data on in vitro and in vivo experiments along with references for users' referral, inhibitory values (IC50/ED50/EC50/GI50), properties (physical, elemental and topological), cancer types, cell lines, protein targets, commercial suppliers and drug likeness. NPACT can easily be browsed or queried using various options, and an online similarity tool has also been made available. Further, to facilitate retrieval of existing data, each record is hyperlinked to similar databases such as SuperNatural, Herbal Ingredients’ Targets, Comparative Toxicogenomics Database, PubChem and NCI-60 GI50 data.
PMCID: PMC3531140  PMID: 23203877
23.  GlycoPP: A Webserver for Prediction of N- and O-Glycosites in Prokaryotic Protein Sequences 
PLoS ONE  2012;7(7):e40155.
Glycosylation is one of the most abundant post-translational modifications (PTMs) required for various structure/function modulations of proteins in a living cell. Although elucidated only recently in prokaryotes, this type of PTM is present across all three domains of life. In prokaryotes, two types of protein-glycan linkage are the more widespread: N-linked, where a glycan moiety is attached to the amide group of Asn, and O-linked, where a glycan moiety is attached to the hydroxyl group of Ser/Thr/Tyr. Owing to their biologically ubiquitous nature, significance, and technological applications, the study of prokaryotic glycoproteins is a fast-emerging area of research. Here we describe new Support Vector Machine (SVM) based algorithms (models) developed for predicting glycosylated residues (glycosites) with high accuracy in prokaryotic protein sequences. The models are based on binary profile of patterns, composition profile of patterns, and position-specific scoring matrix profile of patterns as training features. The study employs an extensive dataset of 107 N-linked and 116 O-linked glycosites extracted from 59 experimentally characterized glycoproteins of prokaryotes. This dataset includes validated N-glycosites from phyla Crenarchaeota, Euryarchaeota (domain Archaea), Proteobacteria (domain Bacteria) and validated O-glycosites from phyla Actinobacteria, Bacteroidetes, Firmicutes and Proteobacteria (domain Bacteria). In view of the current understanding that glycosylation occurs on folded proteins in bacteria, hybrid models have been developed using information on predicted secondary structures and accessible surface area in various combinations with training features. Using these models, N-glycosites and O-glycosites could be predicted with an accuracy of 82.71% (MCC 0.65) and 73.71% (MCC 0.48), respectively.
An evaluation of the best performing models with 28 independent prokaryotic glycoproteins confirms the suitability of these models for predicting N- and O-glycosites in potential glycoproteins from the aforementioned organisms with reasonably high confidence. A web server, GlycoPP, implementing these models is freely available at http:/
PMCID: PMC3392279  PMID: 22808107
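The "binary profile of patterns" feature named in the abstract above can be sketched roughly as follows: a fixed-length sequence window is cut around each candidate residue, and every position in the window is one-hot encoded. This is only an illustrative sketch. The window size, the padding symbol `X`, the function names and the example sequence are all assumptions and are not taken from the GlycoPP paper.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = "X"  # assumed dummy symbol used to pad windows at the sequence ends


def candidate_sites(sequence: str, residues: str = "N") -> List[int]:
    """Return 0-based positions of residues that could carry a glycan."""
    return [i for i, aa in enumerate(sequence) if aa in residues]


def extract_window(sequence: str, center: int, flank: int) -> str:
    """Cut a window of 2*flank+1 residues centered on `center`, padding with X."""
    padded = PAD * flank + sequence + PAD * flank
    return padded[center:center + 2 * flank + 1]


def binary_profile(window: str) -> List[int]:
    """One-hot encode each window position over 20 residues plus the pad symbol.

    Symbols outside this 21-letter alphabet are encoded as all zeros.
    """
    alphabet = AMINO_ACIDS + PAD
    vector: List[int] = []
    for residue in window:
        vector.extend(1 if residue == symbol else 0 for symbol in alphabet)
    return vector


# Made-up sequence; Asn (N) positions are treated as candidate N-glycosites.
seq = "MKNATLLNGS"
features = [binary_profile(extract_window(seq, pos, flank=3))
            for pos in candidate_sites(seq, "N")]
```

Each resulting vector has (2 × flank + 1) × 21 entries and could serve as one training example for an SVM; for O-linked sites the same helpers could be called with `residues="STY"`.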
24.  TumorHoPe: A Database of Tumor Homing Peptides 
PLoS ONE  2012;7(4):e35187.
Cancer is responsible for millions of premature deaths every year and is an economic burden on developing countries. One of the major challenges in the present era is to design drugs that specifically target tumor cells but not normal cells. In this context, tumor homing peptides have drawn much attention. These peptides play a vital role in delivering drugs to tumor tissues with high specificity. To serve the scientific community, we have developed a database of tumor homing peptides called TumorHoPe.
TumorHoPe is a manually curated database of experimentally validated tumor homing peptides that specifically recognize tumor cells and the tumor-associated microenvironment, i.e., angiogenesis. These peptides were collected and compiled from published papers, patents and databases. The current release of TumorHoPe contains 744 peptides. Each entry provides comprehensive information about a peptide, including its sequence, target tumor, target cell, techniques of identification, peptide receptor, etc. In addition, we have derived various types of information from these peptide sequences, including secondary/tertiary structure, amino acid composition, and physicochemical properties. Peptides in this database have been found to target different types of tumors, including breast, lung, prostate, melanoma, colon, etc. These peptides share some common motifs, including the RGD (Arg-Gly-Asp) and NGR (Asn-Gly-Arg) motifs, which specifically recognize tumor angiogenic markers. TumorHoPe has been integrated with many web-based tools such as simple/complex search, database browsing and peptide mapping. These tools allow a user to search tumor homing peptides based on their amino acid composition, charge, polarity, hydrophobicity, etc.
TumorHoPe is a unique database that provides comprehensive information about experimentally validated tumor homing peptides and their target cells. This database will be very useful in designing peptide-based drugs and drug-delivery systems. It is freely available at
PMCID: PMC3327652  PMID: 22523575
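The abstract above mentions that many tumor homing peptides share the RGD and NGR motifs. As a small illustration of what a motif search over peptide sequences involves, the sketch below reports every position at which each motif occurs. The function name and the example sequences are assumptions for illustration only and do not describe how the TumorHoPe search tools are implemented.

```python
from typing import Dict, List, Sequence


def find_motifs(peptide: str,
                motifs: Sequence[str] = ("RGD", "NGR")) -> Dict[str, List[int]]:
    """Return the 0-based start positions of each motif in the peptide.

    Overlapping occurrences are reported because the search restarts one
    residue after each hit.
    """
    peptide = peptide.upper()
    hits: Dict[str, List[int]] = {}
    for motif in motifs:
        positions: List[int] = []
        start = peptide.find(motif)
        while start != -1:
            positions.append(start)
            start = peptide.find(motif, start + 1)
        hits[motif] = positions
    return hits


# Illustrative sequences used only to exercise the function.
for sequence in ("CRGDKGPDC", "CNGRC", "ACDEFGH"):
    print(sequence, find_motifs(sequence))
```

A filter built on such a function could be combined with composition or charge criteria to narrow down a list of peptides.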
25.  PolysacDB: A Database of Microbial Polysaccharide Antigens and Their Antibodies 
PLoS ONE  2012;7(4):e34613.
Vaccines based on microbial cell surface polysaccharides have long been considered an attractive means of controlling infectious diseases. To realize this goal, detailed systematic information about antigenic polysaccharides is necessary. However, only a few databases are available, and they provide limited knowledge in this area. This paper describes PolysacDB, a manually curated database of antigenic polysaccharides. We collected and compiled comprehensive information from the literature and web resources about antigenic polysaccharides of microbial origin. The current version of the database has 1,554 entries of 149 different antigenic polysaccharides from 347 different microbes. Each entry provides comprehensive information about an antigenic polysaccharide, i.e., its origin, function, protocols for its conjugation to carriers, antibodies produced, details of assay systems, specificities of antibodies, proposed epitopes involved and antibody utilities. For the user's convenience, we have integrated a web interface for searching, advanced searching and browsing data in the database. This database will be useful for researchers working on polysaccharide-based vaccines. It is freely available from the URL:
PMCID: PMC3324500  PMID: 22509333
