Analysis of protein–ligand interactions is a fundamental issue in drug design. Because detailed and accurate analysis of protein–ligand interactions involves calculating binding free energy based on thermodynamics and even quantum mechanics, which is highly expensive in terms of computing time, conformational and structural analysis of proteins and ligands has been widely employed as a screening process in computer-aided drug design. In this paper, a web server called ProteMiner-SSM, designed for efficient analysis of similar protein tertiary substructures, is presented. In one experiment reported in this paper, the web server has been exploited to obtain some clues about a biochemical hypothesis. The main distinction in the software design of the web server is the filtering process incorporated to expedite the analysis. The filtering process extracts the residues located in the caves of the protein tertiary structure for analysis and operates with O(n log n) time complexity, where n is the number of residues in the protein. In comparison, the α-hull algorithm, which is widely used in computer graphics for identifying those instances that are on the contour of a three-dimensional object, features O(n^2) time complexity. Experimental results show that the filtering process presented in this paper speeds up the analysis by a factor of 3.15 to 9.37. The ProteMiner-SSM web server can be found at http://proteminer.csie.ntu.edu.tw/. There is a mirror site at http://p4.sbl.bc.sinica.edu.tw/proteminer/.
Sedation during gastrointestinal endoscopy is often achieved using propofol or midazolam in the general population. However, impaired protein synthesis, altered drug metabolism, and compromised hepatic blood flow in patients with liver cirrhosis might affect the pharmacokinetics of sedatives, placing cirrhotic patients undergoing endoscopy at a greater risk of adverse events. The objective of this study was to assess the comparative efficacy and safety of propofol and midazolam in cirrhotic patients undergoing endoscopy.
Randomized controlled trials comparing propofol with midazolam in cirrhotic patients undergoing gastrointestinal endoscopy were selected. We performed the meta-analysis using a random-effects model in Review Manager, Version 5.2 (Cochrane Collaboration, Oxford, UK), according to the PRISMA guidelines.
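As an illustration of what random-effects pooling of mean differences involves, the following is a minimal sketch that assumes the DerSimonian-Laird estimator, a common choice for this kind of model. It is not a reproduction of the Review Manager computation, and the function name and its inputs (per-study mean differences and their variances) are hypothetical.

```python
import math


def pool_random_effects(effects, variances):
    """Pool per-study effects with a DerSimonian-Laird random-effects model.

    effects   -- per-study mean differences (hypothetical inputs)
    variances -- per-study variances of those mean differences
    Returns (pooled effect, (ci_low, ci_high), tau_squared).
    """
    if len(effects) != len(variances) or len(effects) < 2:
        raise ValueError("need at least two studies with matching variances")

    # Fixed-effect (inverse-variance) estimate, needed for Cochran's Q.
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    fixed = sum(w * y for w, y in zip(weights, effects)) / total_weight

    # Cochran's Q and the between-study variance tau^2 (truncated at zero).
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    c = total_weight - sum(w * w for w in weights) / total_weight
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # Random-effects weights add tau^2 to every study variance.
    re_weights = [1.0 / (v + tau2) for v in variances]
    re_total = sum(re_weights)
    pooled = sum(w * y for w, y in zip(re_weights, effects)) / re_total
    se = math.sqrt(1.0 / re_total)

    # 95% confidence interval under a normal approximation.
    return pooled, (pooled - 1.96 * se, pooled + 1.96 * se), tau2
```

When the studies agree closely, tau^2 shrinks to zero and the result coincides with the fixed-effect estimate; larger disagreement widens the confidence interval.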
Five studies between 2003 and 2012, comprising 433 patients, were included. Propofol provided a shorter time to sedation (weighted mean difference: -2.76 min, 95% confidence interval: -3.00 to -2.51) and a shorter recovery time (weighted mean difference: -6.17 min, 95% confidence interval: -6.81 to -5.54) than midazolam did. No intergroup difference in the incidence of hypotension, bradycardia, or hypoxemia was observed. Midazolam was associated with the deterioration of psychometric scores for a longer period than propofol.
This meta-analysis suggests that propofol sedation for endoscopy provides more rapid sedation and recovery than midazolam does. The risk of sedation-related side effects for propofol does not differ significantly from that of midazolam. The efficacy of propofol in cirrhotic patients undergoing endoscopy is superior to that of midazolam.
Regional specificity allows different skin regions to exhibit different characteristics, enabling complementary functions to make effective use of the integumentary surface. Chickens exhibit a high degree of regional specificity in the skin and can serve as a good model for when and how these regional differences begin to emerge.
We used developing feather and scale regions in embryonic chickens as a model to gauge the differences in their molecular pathways. We employed cosine similarity analysis to identify the differentially regulated and co-regulated genes. We applied low cell techniques for expression validation and chromatin immunoprecipitation (ChIP)-based enhancer identification to overcome limited cell availabilities from embryonic chicken skin.
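To make the cosine similarity step concrete, here is a minimal sketch of how expression profiles might be scored against a reference pattern. The function names, the template profile, and the ranking step are assumptions made for illustration; the source does not specify how the similarity scores were thresholded or combined.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length expression vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero profile carries no directional information
    return dot / (norm_a * norm_b)


def rank_by_template(profiles, template):
    """Rank genes by similarity of their profile to a template pattern.

    profiles -- dict mapping a gene name to its expression values across
                samples (hypothetical layout, e.g. scale then feather samples)
    template -- idealized pattern, e.g. [0, 0, 1, 1] for "high in feather"
    """
    scored = [(gene, cosine_similarity(values, template))
              for gene, values in profiles.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Because cosine similarity ignores vector magnitude, two genes with the same pattern across samples score identically even if one is expressed at a much higher level.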
We identified a specific set of highly correlated genes that are differentially expressed during feather and scale development and maturation. Some members of the WNT, TGF-beta/BMP, and Notch families known to be involved in feathering skin differentiation were found to be differentially regulated. Interestingly, we also found genes along calcium channel pathways that are differentially regulated. From the analysis of differentially regulated pathways, we used calcium signaling pathways as an example for further verification. Some voltage-gated calcium channel subunits, particularly CACNA1D, are expressed spatio-temporally in the skin epithelium. These calcium signaling pathway members may be involved in developmental decisions, morphogenesis, or epithelial maturation. We further characterized enhancers associated with histone modifications, including H3K4me1, H3K27ac, and H3K27me3, near calcium channel-related genes and identified signature-intensive hotspots that may be correlated with certain voltage-gated calcium channel genes.
We demonstrated the applicability of cosine similarity analysis for identifying novel regulatory pathways that are differentially regulated during development. Our study concerning the effects of signaling pathways and histone signatures on enhancers suggests that voltage-gated calcium signaling may be involved in early skin development. This work lays the foundation for studying the roles of these gene pathways and their genomic regulation during the establishment of skin regional specificity.
Electronic supplementary material
The online version of this article (doi:10.1186/s12864-014-1202-9) contains supplementary material, which is available to authorized users.
Development; Chicken; Skin; Cosine similarity; Calcium; Histone modifications
We sequenced four Vibrio parahaemolyticus strains, three of which caused serious acute hepatopancreatic necrosis disease. Sequence analysis of the virulent strains revealed not only genes related to cholera toxin and the type IV pilus/type IV secretion system but also a unique, previously unreported, large extrachromosomal plasmid that encodes a homolog to the insecticidal Photorhabdus insect-related binary toxin PirAB.
Agarwood is derived from Aquilaria trees, the trade of which has come under strict control with a listing in Appendix II of the Convention on International Trade in Endangered Species of Wild Fauna and Flora. Many secondary metabolites of agarwood are known to have medicinal value to humans, including compounds that have been shown to elicit sedative effects and exhibit anti-cancer properties. However, little is known about the genome, transcriptome, and the biosynthetic pathways responsible for producing such secondary metabolites in agarwood.
In this study, we present a draft genome and a putative pathway for cucurbitacin E and I, compounds with known medicinal value, from in vitro Aquilaria agallocha agarwood. DNA and RNA data are utilized to annotate many genes and protein functions in the draft genome. The expression changes for cucurbitacin E and I are shown to be consistent with known responses of A. agallocha to biotic stress, and a set of homologous genes in Arabidopsis thaliana related to cucurbitacin biosynthesis is presented and validated through qRT-PCR.
This study is the first attempt to identify cucurbitacin E and I from in vitro agarwood and the first draft genome for any species of Aquilaria. The results of this study will aid in future investigations of secondary metabolite pathways in Aquilaria and other non-model medicinal plants.
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2164-15-578) contains supplementary material, which is available to authorized users.
Agarwood; Cucurbitacin; Aquilaria; Genome
Plants respond differently to mechanical wounding and herbivore attack, using distinct pathways for defense. The versatile sweet potato sporamin possesses multiple biological functions in response to stress. However, the regulation of sporamin gene expression that is activated upon mechanical damage or herbivore attack has not been well studied.
Biochemical analysis revealed that different patterns of reactive oxygen species (ROS) and antioxidant mechanisms exist between mechanical wounding (MW) and herbivore attack (HA) in the sweet potato leaf. Using LC-ESI-MS (liquid chromatography-electrospray ionization mass spectrometry), only the endogenous JA (jasmonic acid) level was found to increase dramatically after MW in a time-dependent manner, whereas both endogenous JA and SA (salicylic acid) increased in parallel after HA. Through yeast one-hybrid screening, two transcription factors, IbNAC1 (no apical meristem (NAM), Arabidopsis transcription activation factor (ATAF), and cup-shaped cotyledon (CUC)) and IbWRKY1, were isolated, which interact with the sporamin promoter fragment of SWRE (sporamin wounding-responsive element) regulatory sequences. Exogenous application of MeJA (methyl jasmonate), SA and DIECA (diethyldithiocarbamic acid, a JA biosynthesis inhibitor) on sweet potato leaves was employed, and the results revealed that IbNAC1 mediated the expression of sporamin through a JA-dependent signaling pathway upon MW, whereas both IbNAC1 and IbWRKY1 coordinately regulated sporamin expression through JA- and SA-dependent pathways upon HA. Transcriptome analysis identified MYC2/4 and JAZ2/TIFY10A (jasmonate ZIM/tify-domain), the repressor and activator of JA and SA signaling among others, as the genes that play an intermediate role in the JA and SA pathways, and these results were further validated by qRT-PCR (quantitative real-time polymerase chain reaction).
This work has improved our understanding of the differential regulatory mechanism of sporamin expression. Our study illustrates that sweet potato sporamin expression is differentially induced upon abiotic MW and biotic HA that involves IbNAC1 and IbWRKY1 and is dependent on the JA and SA signaling pathways. Thus, we established a model to address the plant-wounding response upon physical and biotic damage.
Sporamin; Different activation; Jasmonic acid; Salicylic acid; Transcription factors: NAC, WRKY; ROS; Transcriptome; Sweet potato
Salamanders are unique among vertebrates in their ability to completely regenerate amputated limbs through the mediation of blastema cells located at the stump ends. This regeneration is nerve-dependent because blastema formation and regeneration do not occur after limb denervation. To obtain the genomic information of blastema tissues, de novo transcriptomes from both blastema tissues and denervated stump ends of Ambystoma mexicanum (axolotls) 14 days post-amputation were sequenced and compared using Solexa DNA sequencing.
The sequencing done for this study produced 40,688,892 reads that were assembled into 307,345 transcribed sequences. The N50 of transcribed sequence length was 562 bases. A similarity search against known proteins, with an E-value cut-off of 10^-5, identified 39,200 different genes expressed during limb regeneration. We annotated assembled sequences by using gene descriptions, gene ontology, and clusters of orthologous group terms. Targeted searches using these annotations showed that the majority of the genes were in the categories of essential metabolic pathways, transcription factors and conserved signaling pathways, and novel candidate genes for regenerative processes. We discovered and confirmed numerous sequences of the candidate genes by using quantitative polymerase chain reaction and in situ hybridization.
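For readers unfamiliar with the N50 statistic quoted above, the following minimal sketch computes it from a list of assembled sequence lengths. N50 is the length L such that sequences of length L or longer contain at least half of all assembled bases. The function name is hypothetical, and the sketch is not the assembler's own reporting code.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("at least one sequence length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    # Walk from the longest sequence down until half the bases are covered.
    for length in ordered:
        running += length
        if running >= half_total:
            return length
```

An N50 of 562 bases therefore means that at least half of the assembled bases lie in transcribed sequences of 562 bases or longer.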
The results of this study demonstrate that de novo transcriptome sequencing allows gene expression analysis in a species lacking genome information and provides the most comprehensive mRNA sequence resources for axolotls. The characterization of the axolotl transcriptome can help elucidate the molecular mechanisms underlying blastema formation during limb regeneration.
Ambystoma mexicanum; Axolotl; Transcriptome; Blastema; De novo transcriptome sequencing
Predicting binding sites of a transcription factor in the genome is an important, but challenging, issue in studying gene regulation. In the past decade, a large number of protein–DNA co-crystallized structures available in the Protein Data Bank have facilitated the understanding of interacting mechanisms between transcription factors and their binding sites. Recent studies have shown that both physics-based and knowledge-based potential functions can be applied to protein–DNA complex structures to deliver position weight matrices (PWMs) that are consistent with the experimental data. To further use the available structural models, the proposed Web server, PiDNA, aims at first constructing reliable PWMs by applying an atomic-level knowledge-based scoring function on numerous in silico mutated complex structures, and then using the PWM constructed by the structure models with small energy changes to predict the interaction between proteins and DNA sequences. With PiDNA, the users can easily predict the relative preference of all the DNA sequences with limited mutations from the native sequence co-crystallized in the model in a single run. More predictions on sequences with unlimited mutations can be realized by additional requests or file uploading. Three types of information can be downloaded after prediction: (i) the ranked list of mutated sequences, (ii) the PWM constructed by the favourable mutated structures, and (iii) any mutated protein–DNA complex structure models specified by the user. This study first shows that the constructed PWMs are similar to the annotated PWMs collected from databases or literature. Second, the prediction accuracy of PiDNA in detecting relatively high-specificity sites is evaluated by comparing the ranked lists against in vitro experiments from protein-binding microarrays. Finally, PiDNA is shown to be able to select the experimentally validated binding sites from 10 000 random sites with high accuracy. 
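The step of building a PWM from mutated structures with small energy changes can be sketched as follows. This is a simplified illustration under stated assumptions: the scoring function itself is not reproduced, each mutated sequence is assumed to arrive with a precomputed energy, and the threshold `max_delta`, the `pseudocount`, and the function name are hypothetical choices rather than PiDNA parameters.

```python
def build_pwm(scored_sequences, native_energy, max_delta=1.0, pseudocount=0.5):
    """Build a position weight matrix from favourable mutated sequences.

    scored_sequences -- list of (dna_sequence, energy) pairs, all sequences
                        of equal length (energies assumed precomputed)
    native_energy    -- energy of the native co-crystallized sequence
    max_delta        -- keep sequences whose energy change is at most this
    Returns a list with one dict per position mapping base -> frequency.
    """
    bases = "ACGT"
    kept = [seq for seq, energy in scored_sequences
            if energy - native_energy <= max_delta]
    if not kept:
        raise ValueError("no sequence passed the energy filter")

    pwm = []
    for position in range(len(kept[0])):
        # Pseudocounts keep unseen bases from getting zero probability.
        counts = {base: pseudocount for base in bases}
        for seq in kept:
            counts[seq[position]] += 1
        total = sum(counts.values())
        pwm.append({base: counts[base] / total for base in bases})
    return pwm
```

Each column of the result sums to one, so the matrix can be compared directly with annotated frequency matrices.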
With PiDNA, the users can design biological experiments based on the predicted sequence specificity and/or request mutated structure models for further protein design. As well, it is expected that PiDNA can be incorporated with chromatin immunoprecipitation data to refine large-scale inference of in vivo protein–DNA interactions. PiDNA is available at: http://dna.bime.ntu.edu.tw/pidna.
Insecticide resistance has recently become a critical concern for control of many insect pest species. Genome sequencing and global quantification of gene expression through analysis of the transcriptome can provide useful information relevant to this challenging problem. The oriental fruit fly, Bactrocera dorsalis, is one of the world's most destructive agricultural pests, and recently it has been used as a target for studies of genetic mechanisms related to insecticide resistance. However, prior to this study, the molecular data available for this species were largely limited to genes identified through homology. To provide a broader pool of gene sequences of potential interest with regard to insecticide resistance, this study uses whole transcriptome analysis developed through de novo assembly of short reads generated by next-generation sequencing (NGS). The transcriptome of B. dorsalis was initially constructed using Illumina's Solexa sequencing technology. Qualified reads were assembled into contigs and potential splicing variants (isotigs). A total of 29,067 isotigs have putative homologues in the non-redundant (nr) protein database from NCBI, and 11,073 of these correspond to distinct D. melanogaster proteins in the RefSeq database. Approximately 5,546 isotigs contain coding sequences that are at least 80% complete and appear to represent B. dorsalis genes. We observed a strong correlation between the completeness of the assembled sequences and the expression intensity of the transcripts. The assembled sequences were also used to identify large numbers of genes potentially belonging to families related to insecticide resistance. A total of 90 P450-, 42 GST- and 37 COE-related genes, representing three major enzyme families involved in insecticide metabolism and resistance, were identified. In addition, 36 isotigs were discovered to contain target site sequences related to four classes of resistance genes.
Identified sequence motifs were also analyzed to characterize putative polypeptide translational products and associate them with specific genes and protein functions.
By binding to short and highly conserved DNA sequences in genomes, DNA-binding proteins initiate, enhance or repress biological processes. Accurately identifying such binding sites, often represented by position weight matrices (PWMs), is an important step in understanding the control mechanisms of cells. When given coordinates of a DNA-binding domain (DBD) bound with DNA, a potential function can be used to estimate the change of binding affinity after base substitutions, where the changes can be summarized as a PWM. This technique provides an effective alternative when the chromatin immunoprecipitation data are unavailable for PWM inference. To facilitate the procedure of predicting PWMs based on protein–DNA complexes or even structures of the unbound state, the web server, DBD2BS, is presented in this study. The DBD2BS uses an atom-level knowledge-based potential function to predict PWMs characterizing the sequences to which the query DBD structure can bind. For unbound queries, a list of 1066 DBD–DNA complexes (including 1813 protein chains) is compiled for use as templates for synthesizing bound structures. The DBD2BS provides users with an easy-to-use interface for visualizing the PWMs predicted based on different templates and the spatial relationships of the query protein, the DBDs and the DNAs. The DBD2BS is the first attempt to predict PWMs of DBDs from unbound structures rather than from bound ones. This approach increases the number of existing protein structures that can be exploited when analyzing protein–DNA interactions. In a recent study, the authors showed that the kernel adopted by the DBD2BS can generate PWMs consistent with those obtained from the experimental data. The use of DBD2BS to predict PWMs can be incorporated with sequence-based methods to discover binding sites in genome-wide studies.
http://dbd2bs.csie.ntu.edu.tw/, http://dbd2bs.csbb.ntu.edu.tw/, and http://dbd2bs.ee.ncku.edu.tw.
The mechanism underlying the dysregulation of cholesterol metabolism and inflammation in atherogenesis is not understood fully. Glycine N-methyltransferase (GNMT) has been implicated in hepatic lipid metabolism and the pathogenesis of liver diseases. However, little is known about the significance of GNMT in atherosclerosis. We showed the predominant expression of GNMT in foamy macrophages of mouse atherosclerotic aortas. Genetic deletion of GNMT exacerbated the hyperlipidemia, inflammation and development of atherosclerosis in apolipoprotein E–deficient mice. In addition, ablation of GNMT in macrophages aggravated oxidized low-density lipoprotein-mediated cholesterol accumulation in macrophage foam cells by downregulating the expression of reverse cholesterol transporters including ATP-binding cassette transporters-A1 and G1 and scavenger receptor BI. Furthermore, tumor necrosis factor-α–induced inflammatory response was promoted in GNMT-null macrophages. Collectively, our data suggest that GNMT is a crucial regulator in cholesterol metabolism and in inflammation, and contributes to the pathogenesis of atherosclerosis. This finding may reveal a potential therapeutic target for atherosclerosis.
DNA-binding proteins such as transcription factors use DNA-binding domains (DBDs) to bind to specific sequences in the genome to initiate many important biological functions. Accurate prediction of such target sequences, often represented by position weight matrices (PWMs), is an important step to understand many biological processes. Recent studies have shown that knowledge-based potential functions can be applied on protein-DNA co-crystallized structures to generate PWMs that are considerably consistent with experimental data. However, this success has not been extended to DNA-binding proteins lacking co-crystallized structures. This study aims at investigating the possibility of predicting the DNA sequences bound by DNA-binding proteins from the proteins' unbound structures (structures of the unbound state). Given an unbound query protein and a template complex, the proposed method first employs structure alignment to generate synthetic protein-DNA complexes for the query protein. Once a complex is available, an atomic-level knowledge-based potential function is employed to predict PWMs characterizing the sequences to which the query protein can bind. The evaluation of the proposed method is based on seven DNA-binding proteins, which have structures of both DNA-bound and unbound forms for prediction as well as annotated PWMs for validation. Since this work is the first attempt to predict target sequences of DNA-binding proteins from their unbound structures, three types of structural variations that presumably influence the prediction accuracy were examined and discussed. Based on the analyses conducted in this study, the conformational change of proteins upon binding DNA was shown to be the key factor. 
This study sheds light on the challenge of predicting the target DNA sequences of a protein lacking co-crystallized structures, which encourages more efforts on the structure alignment-based approaches in addition to docking- and homology modeling-based approaches for generating synthetic complexes.
DNA-binding proteins perform their functions through specific or non-specific sequence recognition. Although many sequence- or structure-based approaches have been proposed to identify DNA-binding residues on proteins or protein-binding sites on DNA sequences with satisfactory performance, it remains a challenging task to unveil the exact mechanism of protein-DNA interactions without crystal complex structures. Without information from complexes, the linkages between DNA-binding proteins and their binding sites on DNA are still missing.
While it is still difficult to acquire co-crystallized structures in an efficient way, this study proposes a knowledge-based learning method to effectively predict DNA orientation and base locations around the protein’s DNA-binding sites when given a protein structure. First, the functionally important residues of a query protein are predicted by a sequential pattern mining tool. After that, surface residues falling in the predicted functional regions are determined based on the given structure. These residues are then clustered based on their spatial coordinates and the resultant clusters are ranked by a proposed DNA-binding propensity function. Clusters with high DNA-binding propensities are treated as DNA-binding units (DBUs) and each DBU is analyzed by principal component analysis (PCA) to predict potential orientation of DNA grooves. More specifically, the proposed method is developed to predict the direction of the tangent line to the helix curve of the DNA groove where a DBU is going to bind.
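The PCA step applied to a DNA-binding unit can be illustrated with a minimal sketch that extracts the first principal axis of a set of 3D residue coordinates. This is an illustration only: the source does not state which principal component is used or how it is mapped to the groove tangent, and the function name, the fixed start vector, and the use of power iteration (chosen here to avoid external dependencies) are assumptions.

```python
import math


def principal_axis(points, iterations=200):
    """Return (centroid, unit vector) of the first principal axis.

    points -- list of (x, y, z) coordinates, e.g. residue positions in a DBU
    The sign of the returned axis is arbitrary, as in any PCA.
    """
    n = len(points)
    if n < 2:
        raise ValueError("at least two points are required")

    center = [sum(p[i] for p in points) / n for i in range(3)]
    centered = [[p[i] - center[i] for i in range(3)] for p in points]

    # 3x3 sample covariance matrix of the centered coordinates.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(3)]
           for i in range(3)]

    # Power iteration from a fixed generic start vector. A start vector that
    # is exactly orthogonal to the dominant axis would not converge; this
    # sketch accepts that edge case for brevity.
    v = [0.371, 0.602, 0.707]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            raise ValueError("degenerate point set or unlucky start vector")
        v = [x / norm for x in w]
    return center, v
```

For an elongated cluster of residues, the returned axis follows the direction of greatest spatial spread, which is the kind of directional estimate a groove-orientation prediction can build on.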
This paper proposes a knowledge-based learning procedure to determine the spatial location of the DNA groove with respect to the query protein structure by considering geometric propensity between protein side chains and DNA bases. The 11 test cases used in this study reveal that the location and orientation of the DNA groove around a selected DBU can be predicted with acceptable errors.
This study presents a method to predict the location and orientation of DNA grooves with respect to the structure of a DNA-binding protein. The test cases shown in this study reveal the possibility of imaging the protein-DNA binding conformation before a co-crystallized structure can be determined. How the proposed method can be incorporated with existing protein-DNA docking tools to study protein-DNA interactions deserves further study in the near future.
Automatic extraction of motifs from biological sequences is an important research problem in the study of molecular biology. For proteins, it is desirable to discover sequence motifs containing a large number of wildcard symbols, as the residues associated with functional sites are usually widely separated in sequences. Discovering such patterns is time-consuming because abundant combinations exist when long gaps (a gap consists of one or more successive wildcards) are considered. Mining algorithms often employ constraints to narrow down the search space in order to increase efficiency. However, improper constraint models might degrade the sensitivity and specificity of the motifs discovered by computational methods. We previously proposed a new constraint model to handle large wildcard regions for discovering functional motifs of proteins. The patterns that satisfy the proposed constraint model are called W-patterns. A W-pattern is a structured motif that groups motif symbols into pattern blocks interleaved with large irregular gaps. Considering large gaps reflects the fact that functional residues are not always from a single region of protein sequences, and restricting motif symbols into clusters corresponds to the observation that short motifs are frequently present within protein families. To efficiently discover W-patterns for large-scale sequence annotation and function prediction, this paper first formally introduces the problem to solve and proposes an algorithm named WildSpan (sequential pattern mining across large wildcard regions) that incorporates several pruning strategies to substantially reduce the mining cost.
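The structure of a W-pattern, pattern blocks interleaved with large irregular gaps, can be illustrated by matching such a pattern against a sequence. The sketch below covers matching only, not the WildSpan mining algorithm, and the representation (blocks as strings with a lowercase `x` for a single wildcard, gaps as minimum and maximum lengths) is a hypothetical encoding chosen for this example.

```python
import re


def compile_w_pattern(blocks, gaps):
    """Compile a block-and-gap motif into a regular expression.

    blocks -- list of strings over residue letters; lowercase 'x' marks a
              single-residue wildcard inside a block (hypothetical notation)
    gaps   -- list of (min_len, max_len) pairs, one between each pair of
              consecutive blocks
    """
    if len(gaps) != len(blocks) - 1:
        raise ValueError("need exactly one gap between consecutive blocks")

    parts = []
    for index, block in enumerate(blocks):
        # Translate the block: 'x' matches any residue, letters match as-is.
        parts.append("".join("." if ch == "x" else re.escape(ch)
                             for ch in block))
        if index < len(gaps):
            low, high = gaps[index]
            parts.append(".{%d,%d}" % (low, high))  # bounded irregular gap
    return re.compile("".join(parts))
```

For example, `compile_w_pattern(["GxS", "HD"], [(3, 10)])` describes a three-residue block followed, after a gap of 3 to 10 residues, by the block `HD`. The number of gap-length combinations grows quickly with the number of blocks, which is why mining such patterns needs pruning.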
WildSpan is shown to efficiently find W-patterns containing conserved residues that are far separated in sequences. We conducted experiments with two mining strategies, protein-based and family-based mining, to evaluate the usefulness of W-patterns and performance of WildSpan. The protein-based mining mode of WildSpan is developed for discovering functional regions of a single protein by referring to a set of related sequences (e.g. its homologues). The discovered W-patterns are used to characterize the protein sequence and the results are compared with the conserved positions identified by multiple sequence alignment (MSA). The family-based mining mode of WildSpan is developed for extracting sequence signatures for a group of related proteins (e.g. a protein family) for protein function classification. In this situation, the discovered W-patterns are compared with PROSITE patterns as well as the patterns generated by three existing methods performing the similar task. Finally, analysis on execution time of running WildSpan reveals that the proposed pruning strategy is effective in improving the scalability of the proposed algorithm.
The mining results obtained in this study reveal that WildSpan is efficient and effective in discovering functional signatures of proteins directly from sequences. The proposed pruning strategy is effective in improving the scalability of WildSpan. It is demonstrated in this study that the W-patterns discovered by WildSpan provide useful information for characterizing protein sequences. The WildSpan executable and open source code are available on the web (http://biominer.csie.cyu.edu.tw/wildspan).
To explore potent PDE5 inhibitors from a traditional Chinese medicine (TCM), Epimedium sagittatum.
The potent compound ES03b, predicted from four Epimedium sagittatum components, and its derivatives generated by de novo evolution were verified by molecular docking, pharmacophore analysis, and a quantitative structure-activity relationship (QSAR) model constructed by multiple linear regression.
ES03b was chosen to undergo drug modification via de novo evolution. By analyzing the pharmacophore features, we found that the hydrophobic core in the binding site and the hydrogen bond generated at Asn663 played key roles in designing PDE5 inhibitors. ES03b generated 49 derivatives (Evo01-49). Evo48 had high predicted activity. Although the predicted value was overestimated, Evo48 was suggested as the potent lead.
In this study, we showed that the hydrophobic core in the binding site and hydrogen bond formation at Asn663 played key roles in designing PDE5 inhibitors. Based on several validation analyses, Evo48 was suggested to be a potent inhibitor.
Phosphodiesterase 5 (PDE5); Epimedium sagittatum; erectile dysfunction; pharmacophore analysis; quantitative structure-activity relationship
Sequence motifs are important in the study of molecular biology. Motif discovery tools efficiently deliver many function-related signatures of proteins and greatly facilitate sequence annotation. As increasing numbers of motifs are detected experimentally or predicted computationally, characterizing the functional roles of motifs and identifying the potential synergetic relationships between them are important next steps. A good way to investigate novel motifs is to utilize the abundant 3D structures that have also been accumulated at an astounding rate in recent years. This article reports the development of the web service seeMotif, which provides users with an interactive interface for visualizing sequence motifs on protein structures from the Protein Data Bank (PDB). Researchers can quickly see the locations and conformation of multiple motifs among a number of related structures simultaneously. Because PDB sequences are usually shorter than those in sequence databases and/or may have missing residues, seeMotif has two complementary approaches for selecting structures and mapping motifs to protein chains in structures. As more and more structures belonging to previously uncharacterized protein families become available, combining sequence and structure information gives good opportunities to facilitate understanding of protein functions in large-scale genome projects. Available at: http://seemotif.csie.ntu.edu.tw, http://seemotif.ee.ncku.edu.tw or http://seemotif.csbb.ntu.edu.tw.
A variety of high-throughput techniques are now available for constructing comprehensive gene regulatory networks in systems biology. In this study, we report a new statistical approach for facilitating in silico inference of regulatory network structure. The new measure of association, coefficient of intrinsic dependence (CID), is model-free and can be applied to both continuous and categorical distributions. When given two variables X and Y, CID answers whether Y is dependent on X by examining the conditional distribution of Y given X. In this paper, we apply CID to analyze the regulatory relationships between transcription factors (TFs) (X) and their downstream genes (Y) based on clinical data. More specifically, we use estrogen receptor α (ERα) as the variable X, and the analyses are based on 48 clinical breast cancer gene expression arrays (48A).
The analytical utility of CID was evaluated in comparison with four commonly used statistical methods: Galton-Pearson's correlation coefficient (GPCC), Student's t-test (STT), coefficient of determination (CoD), and mutual information (MI). When compared with GPCC, CoD, and MI, CID is better able to discover regulatory associations where the distribution of the mRNA expression levels of X and Y does not fit linear models. On the other hand, when CID is used to measure the association of a continuous variable (Y) against a discrete variable (X), it performs similarly to STT and appears to outperform CoD and MI. In addition, this study established a two-layer transcriptional regulatory network to exemplify the usage of CID, in combination with GPCC, in deciphering gene networks based on gene expression profiles from patient arrays.
CID is shown to provide useful information for identifying associations between genes and transcription factors of interest in patient arrays. When coupled with the relationships detected by GPCC, the associations predicted by CID are applicable to the construction of transcriptional regulatory networks. This study shows how information from different data sources and learning algorithms can be integrated to investigate whether relevant regulatory mechanisms identified in cell models can also be partially re-identified in clinical samples of breast cancers.
the implementation of CID in R codes can be freely downloaded from .
Though prediction of protein secondary structures has been an active research issue in bioinformatics for quite a few years and many approaches have been proposed, a new challenge emerges as the sizes of contemporary protein structure databases continue to grow rapidly. The new challenge concerns how we can effectively exploit all the information implicitly deposited in the protein structure databases and deliver ever-improving prediction accuracy as the databases expand rapidly.
The new challenge is addressed in this article by proposing a predictor designed with a novel kernel density estimation algorithm. One main distinctive feature of the kernel density estimation based approach is that the average execution time taken by the training process is on the order of O(n log n), where n is the number of instances in the training dataset. In the experiments reported in this article, the proposed predictor delivered an average Q3 (three-state prediction accuracy) score of 80.3% and an average SOV (segment overlap) score of 76.9% for a set of 27 benchmark protein chains extracted from the EVA server that are longer than 100 residues.
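For readers unfamiliar with the Q3 measure, the following minimal sketch computes it as the percentage of residues whose three-state label is predicted correctly. The label alphabet (H for helix, E for strand, C for coil), the function name, and the example strings are assumptions for illustration and are not taken from the benchmark described above.

```python
def q3_score(predicted, observed):
    """Percentage of residues whose three-state label (H, E, C) matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must be non-empty")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Hypothetical 16-residue example with a single mispredicted position.
observed = "CCHHHHHHCCEEEECC"
predicted = "CCHHHHHCCCEEEECC"
score = q3_score(predicted, observed)
print(f"Q3 = {score:.2f}%")
```

The SOV score is segment based and more involved, so it is not sketched here.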
The experimental results reported in this article reveal that we can continue to achieve higher prediction accuracy of protein secondary structures by effectively exploiting the structural information deposited in fast-growing protein structure databases. In this respect, the kernel density estimation based approach enjoys a distinctive advantage with its low time complexity for carrying out the training process.
Large-scale automatic annotation of protein sequences remains challenging in the postgenomic era. E1DS is designed for annotating enzyme sequences based on a repository of 1D signatures. The sequence signatures are derived using a novel pattern mining approach that discovers long motifs consisting of several sequential blocks (conserved segments). Each of the sequential blocks is considerably conserved among the protein members of an EC group. Moreover, a signature includes at least three sequential blocks that are concurrently conserved, i.e. frequently observed together in sequences. In other words, a sequence signature consists of residues from multiple regions of the protein sequence, which echoes the observation that an enzyme catalytic site is usually made up of residues that are widely separated in the sequence. E1DS currently contains 5421 sequence signatures that in total cover 932 four-digit EC numbers. E1DS is evaluated on a collection of enzymes with catalytic sites annotated in the Catalytic Site Atlas. Compared with the widely used pattern database PROSITE, predictions based on E1DS signatures are more sensitive in identifying catalytic sites and the residues involved. E1DS is available at http://e1ds.ee.ncku.edu.tw/ and a mirror site can be found at http://e1ds.csbb.ntu.edu.tw/.
This article presents a web server, iPDA, which aims at identifying the disordered regions of a query protein. Automatic prediction of disordered regions from protein sequences is an important problem in structural biology. The proposed classifier DisPSSMP2 differs from several existing disorder predictors in its use of position-specific scoring matrices with respect to physicochemical properties (PSSMP), where the physicochemical properties adopted here specifically take the disorder propensity of amino acids into account. The web server iPDA integrates DisPSSMP2 with several other sequence predictors in order to investigate the functional role of the detected disordered region. The predicted information includes sequence conservation, secondary structure, sequence complexity and hydrophobic clusters. According to the proportion of the secondary structure elements predicted, iPDA dynamically adjusts the cutoff threshold for determining protein disorder. Furthermore, a pattern mining package for detecting sequence conservation is embedded in iPDA for discovering potential binding regions of the query protein, which helps uncover the relationship between protein function and primary sequence. The web service is available at http://biominer.bime.ntu.edu.tw/ipda and mirrored at http://biominer.cse.yzu.edu.tw/ipda.
Identification of protein interacting sites is an important task in computational molecular biology. As more and more protein sequences are deposited without available structural information, it is highly desirable to predict protein binding regions from sequence alone. This paper presents a pattern mining approach to tackle this problem. It is observed that a functional region of a protein structure usually consists of several peptide segments linked by large wildcard regions. Thus, the proposed mining technique considers large irregular gaps when growing patterns, in order to find residues that are simultaneously conserved but widely separated on the sequences. A derived pattern is called a cluster-like pattern since the discovered conserved residues are always grouped into several blocks, each of which corresponds to a locally conserved region of the protein sequence.
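The notion of a pattern made of conserved blocks separated by large flexible gaps can be sketched with a regular expression. This is not the mining algorithm itself, which discovers such patterns from a set of sequences; the sketch only shows how an already derived pattern might be matched against a query. The block strings, the use of 'x' as a within-block wildcard, the gap limit, and the example sequence are all hypothetical.

```python
import re


def compile_block_pattern(blocks, max_gap):
    """Join blocks with wildcard gaps of 0..max_gap residues.

    Inside a block, the lowercase letter 'x' stands for any single residue.
    """
    parts = [re.escape(block).replace("x", ".") for block in blocks]
    gap = ".{0,%d}?" % max_gap  # lazy gap: prefer the nearest next block
    return re.compile(gap.join("(" + part + ")" for part in parts))


def find_blocks(sequence, blocks, max_gap=50):
    """Return the (start, end) span of each block, or None if the pattern is absent."""
    match = compile_block_pattern(blocks, max_gap).search(sequence)
    if match is None:
        return None
    return [match.span(i + 1) for i in range(len(blocks))]


# Hypothetical three-block pattern and query sequence.
blocks = ["GxGxxG", "DxK", "HxxE"]
sequence = "MA" + "GAGSTG" + "L" * 12 + "DLK" + "P" * 7 + "HQWE" + "AA"

spans = find_blocks(sequence, blocks, max_gap=50)
print(spans)
```

Lowering `max_gap` below the true spacing between blocks makes the match fail, which shows why permitting large irregular gaps matters when conserved residues are far apart in the sequence.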
The experiments conducted in this work demonstrate that the derived long patterns automatically discover the important residues that form one or several hot regions of protein-protein interactions. The methodology is evaluated by conducting experiments on the web server MAGIIC-PRO with a well-known benchmark containing 220 protein chains from 72 distinct complexes. Among the 218 proteins tested, 900 sequential blocks were discovered, 4.25 blocks per protein chain on average. About 92% of the derived blocks are observed to be clustered in space with at least one of the other blocks, and about 66% of the blocks are found to be near the interface of protein-protein interactions. In summary, for about 83% of the tested proteins, at least two interacting blocks can be discovered by this approach.
This work aims to demonstrate that the important residues associated with the interface of protein-protein interactions may be automatically discovered by sequential pattern mining. The detected regions are highly conserved and are thus regarded as computational hot regions. This information would be useful for characterizing protein sequences, predicting protein function, finding potential partners, and facilitating protein docking for drug discovery.
This paper presents a web service named MAGIIC-PRO, which aims to discover functional signatures of a query protein by sequential pattern mining. Automatic discovery of patterns from unaligned biological sequences is an important problem in molecular biology. MAGIIC-PRO differs from several previously established methods for similar tasks in two major ways. The first notable feature of MAGIIC-PRO is its efficiency in delivering long patterns. By incorporating a new type of gap constraint and several state-of-the-art data mining techniques, MAGIIC-PRO usually identifies satisfactory patterns within an acceptable response time. This efficiency enables users to quickly discover functional signatures whose residues do not come from a single region of the protein sequence or are conserved in only a few members of a protein family. The second notable feature of MAGIIC-PRO is its effort in refining the mining results. Considering large flexible gaps improves the completeness of the derived functional signatures, and users are guided directly to the patterns with the largest number of simultaneously conserved blocks. In this paper, we show by experiments that MAGIIC-PRO is efficient and effective in identifying ligand-binding sites and hot regions in protein–protein interactions directly from sequences. The web service is available at and a mirror site at .
More and more disordered regions have been discovered in protein sequences, and many of them are found to be functionally significant. Previous studies reveal that the disordered regions of a protein can be predicted from its primary structure, the amino acid sequence. One widely accepted observation is that ordered regions usually have a compositional bias toward hydrophobic amino acids, whereas disordered regions are biased toward charged amino acids. Recent studies further show that employing evolutionary information such as position specific scoring matrices (PSSMs) improves the prediction accuracy of protein disorder. As more and more machine learning techniques are introduced to protein disorder detection, extracting more useful features with biological insight is attracting increasing attention.
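The compositional observation can be made concrete with a small sliding-window sketch. It is only a composition heuristic and not any of the predictors discussed here. The residue groupings (which amino acids count as hydrophobic or charged), the window size, and the example sequence are assumptions chosen for illustration.

```python
HYDROPHOBIC = set("AVLIMFWC")  # assumed grouping, other definitions exist
CHARGED = set("DEKR")          # assumed grouping (histidine left out)


def composition(segment):
    """Return (hydrophobic fraction, charged fraction) of a sequence segment."""
    if not segment:
        raise ValueError("segment must be non-empty")
    n = len(segment)
    hydrophobic = sum(res in HYDROPHOBIC for res in segment) / n
    charged = sum(res in CHARGED for res in segment) / n
    return hydrophobic, charged


def sliding_bias(sequence, window=7):
    """Charged fraction minus hydrophobic fraction for every full window."""
    return [
        charged - hydrophobic
        for hydrophobic, charged in (
            composition(sequence[i:i + window])
            for i in range(len(sequence) - window + 1)
        )
    ]


# Hypothetical sequence moving from a hydrophobic stretch to a charged one.
profile = sliding_bias("AVLIKKEE", window=4)
print(profile)
```

Under these assumptions, negative values mark windows dominated by hydrophobic residues and positive values mark windows dominated by charged residues.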
This paper first studies the effect of a condensed position specific scoring matrix with respect to physicochemical properties (PSSMP) on the prediction accuracy, where the PSSMP is derived by merging several amino acid columns of a PSSM belonging to a certain property into a single column. Next, we decompose each conventional physicochemical property of amino acids into two disjoint groups that have a propensity for order and disorder respectively, and show by experiments that some of the new properties perform better than their parent properties in predicting protein disorder. To obtain an effective and compact feature set for this problem, we propose a hybrid feature selection method that inherits the efficiency of univariate analysis and the effectiveness of stepwise feature selection, which explores combinations of multiple features. The experimental results show that the selected feature set improves the performance of a classifier built with Radial Basis Function Networks (RBFN) in comparison with feature sets constructed with PSSMs or with PSSMPs that simply adopt the conventional physicochemical properties.
Distinguishing disordered regions from ordered regions in protein sequences facilitates the exploration of protein structures and functions. Results based on independent testing data reveal that the proposed prediction model DisPSSMP performs best among several existing packages that perform similar tasks, without either under-predicting or over-predicting the disordered regions. Furthermore, the selected properties are demonstrated to be useful in finding discriminating patterns for order/disorder classification.
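The construction of a PSSMP from a PSSM can be sketched as follows. The abstract states only that the amino acid columns belonging to a property are merged into a single column; merging by summation, the column order, and the two property groups used below are assumptions for illustration and may differ from the actual procedure.

```python
# Assumed column order of the input PSSM (one column per amino acid).
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Hypothetical property groups; the real feature set is not given in this abstract.
PROPERTIES = {
    "hydrophobic": "AVLIMFWC",
    "charged": "DEKR",
}


def condense_pssm(pssm, properties=PROPERTIES, alphabet=AMINO_ACIDS):
    """Merge the amino acid columns of each property into one column by summation.

    pssm is a list of rows, one per sequence position, each with len(alphabet) scores.
    Returns the property names and the condensed matrix (one column per property).
    """
    index = {aa: i for i, aa in enumerate(alphabet)}
    names = list(properties)
    condensed = [
        [sum(row[index[aa]] for aa in properties[name]) for name in names]
        for row in pssm
    ]
    return names, condensed


# Toy two-position PSSM: all ones, then a row scoring only A (+2) and D (-3).
row_ones = [1] * 20
row_mixed = [0] * 20
row_mixed[AMINO_ACIDS.index("A")] = 2
row_mixed[AMINO_ACIDS.index("D")] = -3

names, pssmp = condense_pssm([row_ones, row_mixed])
print(names, pssmp)
```

Condensing 20 columns into a handful of property columns shrinks the feature vector per position, which is what makes a compact feature set possible.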