We report the Gene Normalization (GN) challenge in BioCreative III where participating teams were asked to return a ranked list of identifiers of the genes detected in full-text articles. For training, 32 fully and 500 partially annotated articles were prepared. A total of 507 articles were selected as the test set. Due to the high annotation cost, it was not feasible to obtain gold-standard human annotations for all test articles. Instead, we developed an Expectation Maximization (EM) algorithm approach for choosing a small number of test articles for manual annotation that were most capable of differentiating team performance. Moreover, the same algorithm was subsequently used for inferring ground truth based solely on team submissions. We report team performance on both gold standard and inferred ground truth using a newly proposed metric called Threshold Average Precision (TAP-k).
We received a total of 37 runs from 14 different teams for the task. When evaluated using the gold-standard annotations of the 50 articles, the highest TAP-k scores were 0.3297 (k=5), 0.3538 (k=10), and 0.3535 (k=20), respectively. Higher TAP-k scores of 0.4916 (k=5, 10, 20) were observed when evaluated using the inferred ground truth over the full test set. When combining team results using machine learning, the best composite system achieved TAP-k scores of 0.3707 (k=5), 0.4311 (k=10), and 0.4477 (k=20) on the gold standard, representing improvements of 12.4%, 21.8%, and 26.6% over the best team results, respectively.
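The relative improvements quoted above follow from the reported TAP-k scores. A minimal Python sketch of that arithmetic is given here; the function name `relative_improvement` is ours and is not part of the challenge software.

```python
# TAP-k scores on the gold standard, copied from the text above.
best_team = {5: 0.3297, 10: 0.3538, 20: 0.3535}
composite = {5: 0.3707, 10: 0.4311, 20: 0.4477}


def relative_improvement(baseline: float, improved: float) -> float:
    """Return the gain of `improved` over `baseline` as a percentage."""
    return (improved - baseline) / baseline * 100.0


# One rounded percentage per value of k.
improvements = {
    k: round(relative_improvement(best_team[k], composite[k]), 1)
    for k in best_team
}
print(improvements)
```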
By using full text and being species non-specific, the GN task in BioCreative III has moved closer to a real literature curation task than similar tasks in the past and presents additional challenges for the text mining community, as revealed in the overall team results. By evaluating teams using the gold standard, we show that the EM algorithm allows team submissions to be differentiated while keeping the manual annotation effort feasible. Using the inferred ground truth we show measures of comparative performance between teams. Finally, by comparing team rankings on gold standard vs. inferred ground truth, we further demonstrate that the inferred ground truth is as effective as the gold standard for detecting good team performance.
Advances in high-throughput technologies have rapidly produced ever more data, from DNA and RNA to proteins, especially large volumes of genome-scale data. However, connecting genomic information to cellular functions and biological behaviours relies on the development of effective approaches at a higher, systems level. In particular, advances in RNA-Seq technology have helped the study of the transcriptome, the RNA expressed from the genome, while systems biology provides a more comprehensive picture of how genes and proteins actively interact to produce cellular behaviours and physiological phenotypes. As biological interactions mediate many biological processes that are essential for cellular function or disease development, it is important to systematically identify genomic information, including genetic mutations from genome-wide association studies (GWAS), differentially expressed genes, bidirectional promoters, intrinsically disordered proteins (IDPs) and protein interactions, to gain deep insights into the underlying mechanisms of gene regulation and networks. Furthermore, bidirectional promoters can co-regulate many biological pathways, and their roles can be studied systematically to identify co-regulated genes at the interactive network level. Combining information from different but related studies can ultimately help reveal the landscape of molecular mechanisms underlying complex diseases such as cancer.
Kidney Renal Clear Cell Carcinoma (KIRC) is one of the fatal genitourinary diseases and accounts for most malignant kidney tumours. KIRC has been shown to be resistant to radiotherapy and chemotherapy. As with many types of cancer, there is no curative treatment for metastatic KIRC. Using advanced sequencing technologies, The Cancer Genome Atlas (TCGA) project of NIH/NCI-NHGRI has produced large-scale sequencing data, which provide unprecedented opportunities to reveal new molecular mechanisms of cancer. We combined differentially expressed genes, pathways and network analyses to gain new insights into the underlying molecular mechanisms of the disease development.
Following an experimental design for obtaining significant genes and pathways, we performed a comprehensive analysis of sequencing data from 537 KIRC patients provided by TCGA. Differentially expressed genes were obtained from the RNA-Seq data, and pathway and network analyses were performed. We identified 186 differentially expressed genes with significant p-values and large fold changes (P < 0.01, |log(FC)| > 5). The study not only confirmed a number of differentially expressed genes identified in literature reports, but also provided new findings. We performed hierarchical clustering analysis using both the genome-wide gene expression profiles and the differentially expressed genes identified in this study. We revealed distinct groups of differentially expressed genes that can aid in the identification of subtypes of the cancer; the clustering suggested four subtypes. We found distinct enriched Gene Ontology (GO) terms associated with these groups of genes. Based on these findings, we built a support vector machine based supervised-learning classifier to predict unknown samples, and the classifier achieved high accuracy and robust classification results. In addition, we identified a number of pathways (P < 0.04) that were significantly influenced by the disease. Some of the identified pathways have been implicated in cancers in the literature, while others have not previously been reported in this cancer. The network analysis led to the identification of significantly disrupted pathways and associated genes involved in the disease development. Furthermore, this study can provide a viable alternative for identifying effective drug targets.
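The selection criterion for differentially expressed genes (P < 0.01, |log(FC)| > 5) can be illustrated with a minimal Python sketch. This is not the pipeline used in the study: the gene identifiers and statistics are invented toy values, the names `GeneStat` and `select_de_genes` are hypothetical, and the base of the logarithm is not stated in the text, so the sketch simply takes the log fold change as given.

```python
from typing import List, NamedTuple


class GeneStat(NamedTuple):
    gene: str       # hypothetical gene identifier (toy data)
    p_value: float  # p-value from a differential expression test
    log_fc: float   # log fold change; the log base is an assumption


def select_de_genes(stats: List[GeneStat],
                    p_cutoff: float = 0.01,
                    log_fc_cutoff: float = 5.0) -> List[str]:
    """Keep genes that pass both the p-value and |log(FC)| thresholds."""
    return [
        s.gene
        for s in stats
        if s.p_value < p_cutoff and abs(s.log_fc) > log_fc_cutoff
    ]


toy_stats = [
    GeneStat("GENE_A", 0.001, 6.2),    # passes both thresholds
    GeneStat("GENE_B", 0.2, 7.0),      # fails the p-value threshold
    GeneStat("GENE_C", 0.005, -5.5),   # strongly down-regulated, passes
    GeneStat("GENE_D", 0.0001, 1.3),   # fails the fold-change threshold
]
selected = select_de_genes(toy_stats)
print(selected)
```

Taking the absolute value of the log fold change keeps both strongly up-regulated and strongly down-regulated genes.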
Our study identified a set of differentially expressed genes and pathways in kidney renal clear cell carcinoma, and represents a comprehensive computational approach to analyzing large-scale next-generation sequencing data. The pathway and network analyses suggested that information from distinctly expressed genes can be utilized in the identification of aberrant upstream regulators. Identification of distinctly expressed genes and altered pathways is important in effective biomarker identification for early cancer diagnosis and treatment planning. Combining differentially expressed genes with pathway and network analyses using intelligent computational approaches provides an unprecedented opportunity to identify upstream disease causal genes and effective drug targets.
Kidney Renal Clear Cell Carcinoma; TCGA; RNA-Seq; Differentially Expressed Genes; Pathways; Gene Network Analysis; Machine Learning Classifier
Candidate disease gene prediction is a rapidly developing area of bioinformatics research with the potential to deliver great benefits to human health. As experimental studies detecting associations between genetic intervals and disease proliferate, better bioinformatic techniques that can expand and exploit the data are required.
Gentrepid is a web resource which predicts and prioritizes candidate disease genes for both Mendelian and complex diseases. The system can take input from linkage analysis of single genetic intervals or multiple marker loci from genome-wide association studies. The underlying database of the Gentrepid tool sources data from numerous gene and protein resources, taking advantage of the wealth of biological information available. Using known disease gene information from OMIM, the system predicts and prioritizes disease gene candidates that participate in the same protein pathways or share similar protein domains. Alternatively, using an ab initio approach, the system can detect enrichment of these protein annotations without prior knowledge of the phenotype.
The system aims to integrate the wealth of protein information currently available with known and novel phenotype/genotype information to acquire knowledge of biological mechanisms underpinning disease. We have updated the system to facilitate analysis of GWAS data and the study of complex diseases. Application of the system to GWAS data on hypertension using the ICBP data is provided as an example. An interesting prediction is a ZIP transporter additional to the one found by the ICBP analysis. The webserver URL is https://www.gentrepid.org/.
Candidate disease gene prediction; Candidate disease genes; Mendelian diseases; Complex diseases; Genome-wide association studies; Genotype; Phenotype; Candidate gene identification; Genetic-association studies; Hypertension
Genome-wide association studies can provide novel insights into diseases of interest, as well as into the responsiveness of an individual to specific treatments. In such studies, it is very important to correct for population stratification, which refers to allele frequency differences between cases and controls due to systematic ancestry differences. Population stratification can cause spurious associations if not adjusted for properly. Principal component analysis (PCA) has been relied upon as a highly useful method to adjust for population stratification in these types of large-scale studies. Recently, the linear mixed model (LMM) has also been proposed to account for family structure or cryptic relatedness. However, neither of these approaches may be optimal for correcting for sample structure in the presence of subject outliers.
We propose to use robust PCA combined with k-medoids clustering to deal with population stratification. This approach can adjust for population stratification for both continuous and discrete populations with subject outliers, and it can be considered as an extension of the PCA method and the multidimensional scaling (MDS) method. Through simulation studies, we compare the performance of our proposed methods with several widely used stratification methods, including PCA and MDS. We show that subject outliers can greatly influence the analysis results from several existing methods, while our proposed robust population stratification methods perform very well for both discrete and admixed populations with subject outliers. We illustrate the new method using data from a rheumatoid arthritis study.
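The clustering step alone can be sketched in a few lines of Python. This is a plain alternating k-medoids procedure under our own assumptions, not the authors' implementation: the robust PCA step is not shown, the two-dimensional toy coordinates merely stand in for principal component scores, and the function name `k_medoids` and the choice of initial medoids are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def k_medoids(points: Sequence[Point],
              initial_medoids: Sequence[int],
              max_iter: int = 100) -> Tuple[List[int], List[int]]:
    """Alternate between assigning points to the nearest medoid and
    re-choosing, per cluster, the member with the smallest total distance
    to the other members. Returns (medoid indices, cluster labels)."""
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i, p in enumerate(points):
            nearest = min(medoids, key=lambda m: math.dist(p, points[m]))
            clusters[nearest].append(i)
        # Update step: the new medoid is always an observed point.
        new_medoids = [
            min(members,
                key=lambda c: sum(math.dist(points[c], points[j])
                                  for j in members))
            for members in clusters.values()
        ]
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    labels = [
        min(range(len(medoids)),
            key=lambda k: math.dist(p, points[medoids[k]]))
        for p in points
    ]
    return medoids, labels


# Toy scores: two tight groups plus one distant subject at (50, 50).
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (12, 10), (50, 50)]
medoids, labels = k_medoids(toy_points, initial_medoids=[0, 3])
print(medoids, labels)
```

Because each cluster centre is restricted to an observed point, a distant subject cannot drag the centre to an arbitrary location between groups, which is one common motivation for preferring medoids to means when outliers are present.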
We demonstrate that subject outliers can greatly influence the analysis result in GWA studies, and propose robust methods for dealing with population stratification that outperform existing population stratification methods in the presence of subject outliers.
Population structure; Population stratification; Robust principal component analysis; Resampling by half means; Outlier detection; GWA studies
Transcription factors (TFs) and microRNAs (miRNAs) are primary metazoan gene regulators. The regulatory mechanisms of these two main regulators are of great interest to biologists and may provide insights into the causes of diseases. However, the interplay between miRNAs and TFs in a regulatory network remains largely unexplored. Currently, it is very difficult to study the regulatory mechanisms that involve both miRNAs and TFs in a biological lab. Even at the data level, a network involving miRNAs, TFs and genes is very complicated to construct. Previous research has mostly been directed at inferring either miRNA or TF regulatory networks from data. However, networks involving a single type of regulator may not fully reveal the complex gene regulatory mechanisms, for instance, the way in which a TF indirectly regulates a gene via a miRNA.
We propose a framework to learn from heterogeneous data the three-component regulatory networks, with the presence of miRNAs, TFs, and mRNAs. This method firstly utilises Bayesian network structure learning to construct a regulatory network from multiple sources of data: gene expression profiles of miRNAs, TFs and mRNAs, target information based on sequence data, and sample categories. Then, in order to produce more meaningful results for further biological experimentation and research, the method searches the learnt network to identify the interplay between miRNAs and TFs and applies a network motif finding algorithm to further infer the network.
We apply the proposed framework to the data sets of epithelial-to-mesenchymal transition (EMT). The results elucidate the complex gene regulatory mechanism for EMT which involves both TFs and miRNAs. Several discovered interactions and molecular functions have been confirmed by literature. In addition, many other discovered interactions and bio-markers are of high statistical significance and thus can be good candidates for validation by experiments. Moreover, the results generated by our method are compact, involving a small number of interactions which have been proved highly relevant to EMT.
We have designed a framework to infer gene regulatory networks involving both TFs and miRNAs from multiple sources of data, including gene expression data, target information, and sample categories. Results on the EMT data sets have shown that the proposed approach is able to produce compact and meaningful gene regulatory networks that are highly relevant to the biological conditions of the data sets. This framework has the potential for application to other heterogeneous datasets to reveal the complex gene regulatory relationships.
The Illumina sequencing platform is widely used in genome research. Quality assessment and control of sequence reads are needed for downstream analysis. However, software that provides efficient quality assessment and versatile filtration methods is still lacking.
We have developed a toolkit named HTQC (an abbreviation of High-Throughput Quality Control) for quality control of sequence reads. It consists of six programs for read quality assessment, read filtration and generation of graphic reports.
The HTQC toolkit can generate reads quality assessment faster than existing tools, providing guidance for reads filtration utilities that allow users to choose different strategies to remove low quality reads.
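One simple filtration strategy of the kind mentioned above is to discard reads whose mean base quality falls below a threshold. The Python sketch below illustrates that idea only; it is not HTQC code and does not reproduce its command-line interface. It assumes Phred+33 quality encoding, and the read tuples, function names and the threshold of 20 are hypothetical.

```python
from typing import List, Tuple

Read = Tuple[str, str, str]  # (name, sequence, quality string)


def phred33_scores(quality: str) -> List[int]:
    """Decode a quality string, assuming Phred+33 ASCII encoding."""
    return [ord(ch) - 33 for ch in quality]


def mean_quality(quality: str) -> float:
    """Average Phred score over all bases of one read."""
    scores = phred33_scores(quality)
    return sum(scores) / len(scores)


def filter_reads(reads: List[Read], min_mean_quality: float = 20.0) -> List[Read]:
    """Keep reads whose mean quality reaches the (assumed) threshold."""
    return [r for r in reads if mean_quality(r[2]) >= min_mean_quality]


toy_reads = [
    ("read1", "ACGT", "IIII"),  # all bases at Q40
    ("read2", "ACGT", "!!!!"),  # all bases at Q0
    ("read3", "ACGT", "II!!"),  # mean of Q20, exactly at the threshold
]
kept = filter_reads(toy_reads)
print([name for name, _, _ in kept])
```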
Cancers, a group of multifactorial complex diseases, are generally caused by mutation of multiple genes or dysregulation of pathways. Identifying biomarkers that can characterize cancers would help to understand and diagnose them. Traditional computational methods that detect genes differentially expressed between cancer and normal samples often fail because of small sample sizes and the assumption of independence among genes. In reality, genes work in concert to perform their functions. Therefore, dysregulated pathways are expected to serve as better biomarkers than single genes.
In this paper, we propose a novel approach to identify dysregulated pathways in cancer based on a pathway interaction network. Our contribution is three-fold. Firstly, we present a new method to construct pathway interaction network based on gene expression, protein-protein interactions and cellular pathways. Secondly, the identification of dysregulated pathways in cancer is treated as a feature selection problem, which is biologically reasonable and easy to interpret. Thirdly, the dysregulated pathways are identified as subnetworks from the pathway interaction networks, where the subnetworks characterize very well the functional dependency or crosstalk between pathways. The benchmarking results on several distinct cancer datasets demonstrate that our method can obtain more reliable and accurate results compared with existing state of the art methods. Further functional analysis and independent literature evidence also confirm that our identified potential pathogenic pathways are biologically reasonable, indicating the effectiveness of our method.
Dysregulated pathways can serve as better biomarkers compared with single genes. In this work, by utilizing pathway interaction networks and gene expression data, we propose a novel approach that effectively identifies dysregulated pathways, which can not only be used as biomarkers to diagnose cancers but also serve as potential drug targets in the future.
DNA barcoding technology, which uses a short piece of DNA sequence to identify species, has a wide range of applications. To date, a universal DNA barcode marker for plants remains elusive. The rbcL and matK regions have been proposed as the “core barcode” for plants, and the ITS2 and psbA-trnH intergenic spacer (PTIGS) regions were later added as supplemental barcodes. The use of the PTIGS region as a supplemental barcode has been limited by the lack of computational tools that can handle significant insertions and deletions in the PTIGS sequences. Here, we compared the most commonly used alignment-based and alignment-free methods and developed a web server to allow biologists to carry out PTIGS-based DNA barcoding analyses.
First, we compared several alignment-based methods, such as BLAST and methods calculating P distance and Edit distance, with the alignment-free Di-Nucleotide Frequency Profile (DNFP) method and with their combinations. We found that the DNFP and Edit-distance methods increased the identification success rate to ~80%, 20% higher than the most commonly used BLAST method. Second, the combined methods showed a better overall success rate and performance. Last, we have developed a web server that allows (1) retrieving various sub-regions and the consensus sequences of PTIGS, (2) annotating novel PTIGS sequences, (3) determining species identity by PTIGS sequences using eight methods, and (4) examining the identification efficiency and performance of the eight methods for various taxonomic groups.
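The two best-performing measures can be illustrated with a short Python sketch. It rests on our own assumptions rather than on the web server's code: the dinucleotide profile is taken to be the relative frequency of the 16 overlapping dinucleotides, two profiles are compared by Euclidean distance (the comparison measure is not specified above), and edit distance is the standard Levenshtein distance with unit costs. All function names are hypothetical.

```python
from itertools import product
from typing import List

DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]


def dinucleotide_profile(seq: str) -> List[float]:
    """Relative frequency of each of the 16 overlapping dinucleotides."""
    seq = seq.upper()
    counts = dict.fromkeys(DINUCLEOTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing ambiguous bases such as N
            counts[pair] += 1
            total += 1
    return [counts[d] / total if total else 0.0 for d in DINUCLEOTIDES]


def profile_distance(a: str, b: str) -> float:
    """Euclidean distance between two profiles (an assumed choice)."""
    pa, pb = dinucleotide_profile(a), dinucleotide_profile(b)
    return sum((x - y) ** 2 for x, y in zip(pa, pb)) ** 0.5


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


print(edit_distance("ACGT", "AGT"), profile_distance("ACGT", "ACGT"))
```

Both measures are defined for sequences of unequal length, and the profile comparison needs no alignment at all, which is relevant for a region with frequent insertions and deletions.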
The Edit distance and the DNFP methods have the highest discrimination powers. Hybrid methods can be used to achieve significant improvement in performance. These methods can be extended to applications using the core barcodes and the other supplemental DNA barcode ITS2. To our knowledge, the web server developed here is the only one that allows species determination based on PTIGS sequences. The web server can be accessed at http://psba-trnh-plantidit.dnsalias.org.
A birth and death process is frequently used for modeling the size of a gene family that may vary along the branches of a phylogenetic tree. Under the birth and death model, maximum likelihood methods have been developed to estimate the birth and death rate and the sizes of ancient gene families (numbers of gene copies at the internodes of the phylogenetic tree). This paper aims to provide a Bayesian approach for estimating parameters in the birth and death model.
We develop a Bayesian approach for estimating the birth and death rate and other parameters in the birth and death model. In addition, a Bayesian hypothesis test is developed to identify the gene families that are unlikely under the birth and death process. Simulation results suggest that the Bayesian estimate is more accurate than the maximum likelihood estimate of the birth and death rate. The Bayesian approach was applied to a real dataset of 3517 gene families across genomes of five yeast species. The results indicate that the Bayesian model assuming a constant birth and death rate among branches of the phylogenetic tree cannot adequately explain the observed pattern of the sizes of gene families across species. The yeast dataset was thus analyzed with a Bayesian heterogeneous rate model that allows the birth and death rate to vary among the branches of the tree. The unlikely gene families identified by the Bayesian heterogeneous rate model are different from those given by the maximum likelihood method.
Compared to the maximum likelihood method, the Bayesian approach can produce more accurate estimates of the parameters in the birth and death model. In addition, the Bayesian hypothesis test is able to identify unlikely gene families based on Bayesian posterior p-values. As a powerful statistical technique, the Bayesian approach can effectively extract information from gene family data and thereby provide useful information regarding the evolutionary process of gene families across genomes.
It is commonly believed that including domain knowledge in a prediction model is desirable. However, representing and incorporating domain information in the learning process is, in general, a challenging problem. In this research, we consider domain information encoded by discrete or categorical attributes. A discrete or categorical attribute provides a natural partition of the problem domain, and hence divides the original problem into several non-overlapping sub-problems. In this sense, the domain information is useful if the partition simplifies the learning task. The goal of this research is to develop an algorithm to identify discrete or categorical attributes that maximally simplify the learning task.
We consider restructuring a supervised learning problem via a partition of the problem space using a discrete or categorical attribute. A naive approach exhaustively searches all the possible restructured problems. It is computationally prohibitive when the number of discrete or categorical attributes is large. We propose a metric to rank attributes according to their potential to reduce the uncertainty of a classification task. It is quantified as a conditional entropy achieved using a set of optimal classifiers, each of which is built for a sub-problem defined by the attribute under consideration. To avoid high computational cost, we approximate the solution by the expected minimum conditional entropy with respect to random projections. This approach is tested on three artificial data sets, three cheminformatics data sets, and two leukemia gene expression data sets. Empirical results demonstrate that our method is capable of selecting a proper discrete or categorical attribute to simplify the problem, i.e., the performance of the classifier built for the restructured problem always beats that of the original problem.
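To make the idea of ranking attributes by uncertainty reduction concrete, the Python sketch below computes the plain empirical conditional entropy of the class labels given a categorical attribute. This is a deliberate simplification and not the proposed metric, which is defined through optimal classifiers and approximated with random projections; the toy labels, attribute values and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Hashable, List, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of the empirical label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def conditional_entropy(attribute: Sequence[Hashable],
                        labels: Sequence[Hashable]) -> float:
    """H(label | attribute): entropy within each partition cell,
    weighted by the fraction of samples falling in that cell."""
    cells = defaultdict(list)
    for a, y in zip(attribute, labels):
        cells[a].append(y)
    n = len(labels)
    return sum(len(cell) / n * entropy(cell) for cell in cells.values())


toy_labels: List[int] = [0, 0, 1, 1]
informative = ["x", "x", "y", "y"]    # partition separates the classes
uninformative = ["x", "y", "x", "y"]  # partition leaves classes mixed

print(entropy(toy_labels),
      conditional_entropy(informative, toy_labels),
      conditional_entropy(uninformative, toy_labels))
```

An attribute with lower conditional entropy yields sub-problems in which the class is more predictable, which is the sense in which a partition "simplifies" the learning task.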
The proposed conditional entropy based metric is effective in identifying good partitions of a classification problem, hence enhancing the prediction performance.
HIV and HCV infections have become leading global public-health threats. Even more remarkably, HIV-HCV co-infection is rapidly emerging as a major cause of morbidity and mortality throughout the world, owing to the rapid mutation common to the two viruses as well as their similarly complex influence on the immune system. Although considerable progress has been made in the study of HIV and HCV infection individually, few studies have investigated the molecular mechanism of their co-infection or the design of multi-target co-inhibitors that act on the two viruses simultaneously.
In our study, a multi-target Quantitative Structure-Activity Relationship (QSAR) study of inhibitors for HIV-HCV co-infection was carried out with an in-silico machine learning technique, multi-task learning, to help guide co-inhibitor design. Firstly, an integrated dataset was compiled from 3 HIV inhibitor subsets targeting protease, integrase and reverse transcriptase, respectively, together with another 6 subsets of HCV inhibitors targeting 2 proteins, NS3 serine protease and NS5B polymerase. Secondly, efficient multi-target QSAR modelling of HIV-HCV co-inhibitors was performed by applying an accelerated gradient method based multi-task learning to all 9 datasets. Furthermore, by solving the L-1-infinity regularized optimization, the Drug-like index features for compound description were ranked according to their joint importance in multi-target QSAR modelling of HIV and HCV. Finally, a drug structure-activity simulation for investigating the relationships between compound structures and binding affinities was presented based on our multiple target analysis, providing several novel clues for the design of multi-target HIV-HCV co-inhibitors with an increased likelihood of successful therapies for HIV, HCV and HIV-HCV co-infection.
The framework presented in our study provides an efficient way to identify and design inhibitors that simultaneously and selectively bind to multiple targets from multiple viruses with high affinity, and may shed new light on future work on inhibitor synthesis for multi-target HIV, HCV, and HIV-HCV co-infection treatments.
HDX mass spectrometry is a powerful platform to probe protein structure dynamics during ligand binding, protein folding, enzyme catalysis, and similar processes. HDX mass spectrometry analysis derives the protein structure dynamics from the mass increase of a protein whose backbone protons have exchanged with solvent deuterium. Coupled with enzyme digestion and MS/MS analysis, HDX mass spectrometry can be used to study the regional dynamics of a protein based on the m/z value or percentage of deuterium incorporation for the digested peptides in the HDX experiments. Various software packages have been developed to analyze HDX mass spectrometry data. Despite this progress, proper and explicit statistical treatment is still lacking in most current HDX mass spectrometry software. To address this issue, we have developed HDXanalyzer for the statistical analysis of HDX mass spectrometry data using R, Python, and RPY2.
Implementation and results
The HDXanalyzer package contains three major modules: the data processing module, the statistical analysis module, and the user interface. RPY2 is employed to connect these three components, where the data processing module is implemented in Python and the statistical analysis module is implemented in R. RPY2 provides a low-level interface to R and allows the effective integration of the statistical module with data processing. The data processing module generates the centroid for each peptide as an m/z value, and the differences between the centroids of peptides derived from apo and ligand-bound protein allow us to evaluate whether the regions have significant changes in structure dynamics. Another option of the software is to calculate the deuterium incorporation rate for the comparison. The two types of statistical analyses are the paired Student’s t-test and the linear combination of the intercept for multiple regression and ANCOVA model. The user interface is implemented with wxPython to facilitate data visualization in graphs and the presentation of the statistical analysis output. To evaluate the software, a previously published xylanase HDX mass spectrometry dataset is processed and presented. The results from the different statistical analysis methods are compared and shown to be similar. The statistical analysis results are overlaid on the three-dimensional structure of the protein to highlight the regional structure dynamics changes in the xylanase enzyme.
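The centroid comparison described above can be sketched with the Python standard library. The sketch is illustrative only and is not HDXanalyzer code, which delegates its statistics to R through RPY2. It assumes that the centroid is the intensity-weighted mean m/z of a peptide's isotopic peaks and that apo and ligand-bound measurements are paired by time point; the function names and toy numbers are hypothetical.

```python
import math
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def centroid_mz(peaks: Sequence[Tuple[float, float]]) -> float:
    """Intensity-weighted mean m/z over (m/z, intensity) peaks
    (assumed definition of the centroid)."""
    total_intensity = sum(intensity for _, intensity in peaks)
    return sum(mz * intensity for mz, intensity in peaks) / total_intensity


def paired_t_statistic(apo: Sequence[float],
                       bound: Sequence[float]) -> Tuple[float, int]:
    """Paired Student's t statistic and degrees of freedom for the
    per-pair differences (bound minus apo)."""
    diffs: List[float] = [b - a for a, b in zip(apo, bound)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Toy isotopic envelope: two peaks with intensities 1 and 3.
print(centroid_mz([(500.0, 1.0), (501.0, 3.0)]))

# Toy centroids for one peptide at four time points (hypothetical values).
apo_centroids = [1.0, 2.0, 3.0, 4.0]
bound_centroids = [1.5, 2.0, 4.0, 4.5]
t_value, dof = paired_t_statistic(apo_centroids, bound_centroids)
print(t_value, dof)
```

The resulting statistic would then be compared against a t distribution with the returned degrees of freedom to judge whether the region shows a significant change.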
Statistical analysis provides crucial evaluation of whether a protein region is significantly protected or unprotected during the HDX mass spectrometry studies. Although there are several other available software programs to process HDX experimental data, HDXanalyzer is the first software program to offer multiple statistical methods to evaluate the changes in protein structure dynamics based on HDX mass spectrometry analysis. Moreover, the statistical analysis can be carried out for both m/z value and deuterium incorporation rate. In addition, the software package can be used for the data generated from a wide range of mass spectrometry instruments.
Microarrays have been widely used to measure gene expression levels on the genome scale over the past decade. Many algorithms have been developed to reconstruct gene regulatory networks based on microarray data. Unfortunately, most of these models and algorithms focus on global properties of the expression of genes in regulatory networks, and few of them are able to offer intuitive parameters. We ask whether some simple but basic characteristics of microarray datasets can be found to identify potential gene regulatory relationships.
Based on expression correlation, expression level variation and vectors derived from microarray expression levels, we first introduced several novel parameters to measure the characteristics of regulating gene pairs. Subsequently, we used a naïve Bayesian network to integrate these features as well as the functional co-annotation between transcription factors and their target genes. Then, based on the time-delay characteristic of the expression profile, we were able to predict both the existence and the direction of the regulatory relationship.
Several novel parameters have been proposed and integrated to identify regulatory relationships. The integrated model is shown to be more effective than any of the individual features. We believe that our parametric approach can serve as a fast method for mining regulatory relationships.
Inference of causal regulators responsible for gene expression changes under different conditions is of great importance but remains rather challenging. To date, most approaches use direct binding targets of transcription factors (TFs) to associate TFs with expression profiles. However, the low overlap between binding targets of a TF and the affected genes of the TF knockout limits the power of those methods.
We developed a TF-centered downstream gene set enrichment analysis approach to identify potential causal regulators responsible for expression changes. We constructed hierarchical and multi-layer regulation models to derive possible downstream gene sets of a TF using not only TF-DNA interactions, but also, for the first time, post-translational modification (PTM) information. We verified our method on one expression dataset of large-scale TF knockout and another dataset involving both TF knockout and TF overexpression. Compared with the flat model using TF-DNA interactions alone, our method correctly identified five more actually perturbed TFs in the large-scale TF knockout data and six more perturbed TFs in the overexpression data. Potential regulatory pathways downstream of three perturbed regulators (SNF1, AFT1 and SUT1) were given to demonstrate the power of multilayer regulation models integrating TF-DNA interactions and PTM information. Additionally, our method successfully identified known important TFs and inferred some novel potential TFs involved in the transition from fermentative to glycerol-based respiratory growth and in the pheromone response. Downstream regulation pathways of SUT1 and AFT1 were also supported by the mRNA and/or phosphorylation changes of their mediating TFs and/or “modulator” proteins.
The results suggest that in addition to direct transcription, indirect transcription and post-translational regulation are also responsible for the effects of TF perturbation, especially for TF overexpression. Many TFs inferred by our method are supported by the literature. Multiple TF regulation models could lead to new hypotheses for future experiments. Our method provides a valuable framework for analyzing gene expression data to identify causal regulators in the context of TF-DNA interactions and PTM information.
Enzyme dynamics have recently been shown to be crucial for the structure-function relationship. Among the various platforms for analyzing structural dynamics, HDX (hydrogen deuterium exchange) mass spectrometry stands out as an efficient and high-throughput way to analyze protein dynamics upon ligand binding. Despite this potential, few studies have employed the HDX mass spectrometry platform to probe the regional structural dynamics of enzymes. In particular, the technique has never been used to analyze cell wall degrading enzymes. We therefore used xylanase as a model to explore the potential of HDX mass spectrometry for studying cell wall degrading enzymes.
HDX mass spectrometry revealed significant intrinsic dynamics for the xylanase enzyme. Different regions of the enzyme are differentially stabilized in the apo enzyme. Comparison of the substrate-bound enzymes revealed that xylohexaose can significantly stabilize the enzyme. Several regions, including those near the reaction centres, were significantly stabilized during xylohexaose binding. Compared with xylohexaose, xylan induced relatively less protection in the enzyme, which may be due to the insolubility of the substrate. The structural relevance of the enzyme dynamics was discussed with reference to the three-dimensional structure of the enzyme. HDX mass spectrometry revealed a strong relationship between dynamics and function, which can be exploited for future enzyme improvement.
Ligand binding can lead to significant stabilization at both the regional and global level for enzymes like xylanase. HDX mass spectrometry is a powerful high-throughput platform for identifying the key regions protected during ligand binding and for exploring the molecular mechanisms of enzyme function. The HDX mass spectrometry analysis of cell wall degrading enzymes has provided a novel platform to guide the rational design of enzymes.
microRNAs (miRNAs) regulate target gene expression by controlling their mRNAs post-transcriptionally. Increasing evidence demonstrates that miRNAs play important roles in various biological processes. However, the functions and precise regulatory mechanisms of most miRNAs remain elusive. Current research suggests that miRNA regulatory modules are complicated, including up-, down-, and mix-regulation for different physiological conditions. Previous computational approaches for discovering miRNA-mRNA interactions focus only on down-regulatory modules. In this work, we present a method to capture complex miRNA-mRNA interactions including all regulatory types between miRNAs and mRNAs.
We present a method to capture complex miRNA-mRNA interactions using Bayesian network structure learning with a splitting-averaging strategy. It is designed to explore all possible miRNA-mRNA interactions by integrating miRNA-targeting information, expression profiles of miRNAs and mRNAs, and sample categories. We also present an analysis of data sets for epithelial-mesenchymal transition (EMT). Our results show that the proposed method identified all possible types of miRNA-mRNA interactions from the data. Many interactions are of considerable biological significance. Some discoveries have been validated by previous research; for example, the miR-200 family negatively regulates ZEB1 and ZEB2 in EMT. Others are consistent with the literature; for example, LOX interacts widely with miR-200 family members in EMT. Furthermore, many novel interactions are statistically significant and worthy of validation in the near future.
This paper presents a new method to explore complex miRNA-mRNA interactions under different physiological conditions using Bayesian network structure learning with a splitting-averaging strategy. The method makes use of heterogeneous data including miRNA-targeting information, expression profiles of miRNAs and mRNAs, and sample categories. Results on EMT data sets show that the proposed method uncovers many known miRNA targets as well as new, potentially promising miRNA-mRNA interactions. These interactions could not be recovered by standard Bayesian network structure learning.
Understanding the constituent domains of oncogenes, their origins and their fusions may shed new light on the initiation and development of cancers.
We have developed a computational pipeline for identification of functional domains of human genes, prediction of the origins of these domains and their major fusion events during evolution through integration of existing and new tools of our own. An application of the pipeline to 124 well-characterized human oncogenes has led to the identification of a collection of domains and domain pairs that occur substantially more frequently in oncogenes than in human genes on average. Most of these enriched domains and domain pairs are related to tyrosine kinase activities. In addition, our analyses indicate that a substantial portion of the domain-fusion events of oncogenes took place in metazoans during evolution.
We expect that the computational pipeline for domain identification, domain origin and domain fusion prediction will prove to be useful for studying other groups of genes.
Schizophrenia is a complex disease with multiple factors contributing to its pathogenesis. In addition to environmental factors, genetic factors may also increase susceptibility. In other words, schizophrenia is a highly heritable disease. Some candidate genes have been deduced on the basis of their known function, and others on the basis of chromosomal location. Individuals carrying multiple candidate genes may have increased risk. However, it is not clear what kinds of gene combinations may produce the disease phenotype, and their collective effect remains to be studied.
Most pathways except metabolic pathways are rich in protein-protein interactions (PPIs). Thus, the PPI network contains pathway information, even though the upstream-downstream relation of PPI is yet to be explored. Here we have constructed a PPI sub-network by extracting the nearest neighbour of the 36 reported candidate genes described in the literature. Although these candidate genes were discovered by different approaches, most of the proteins formed a cluster. Two major protein interaction modules were identified on the basis of the pairwise distance among the proteins in this sub-network. The large and small clusters might play roles in synaptic transmission and signal transduction, respectively, based on gene ontology annotation. The protein interactions in the synaptic transmission cluster were used to explain the interaction between the NRG1 and CACNG2 genes, which was found by both linkage and association studies. This working hypothesis is supported by the co-expression analysis based on public microarray gene expression.
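The sub-network extraction step can be sketched as follows. This is a minimal illustration under stated assumptions: the protein names and edges are hypothetical placeholders, and connected components are used as a simple stand-in for the distance-based module detection described above, which the abstract does not specify in detail.

```python
from collections import defaultdict, deque

def nearest_neighbour_subnetwork(edges, seeds):
    """Keep every interaction that touches at least one seed (candidate) protein."""
    seeds = set(seeds)
    return [(a, b) for a, b in edges if a in seeds or b in seeds]

def connected_components(edges):
    """Group proteins into connected components, largest first."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return sorted(components, key=len, reverse=True)

# Hypothetical toy interactome; SEED1 and SEED2 play the role of candidate genes.
toy_edges = [("SEED1", "P1"), ("P1", "P2"), ("SEED2", "P3"),
             ("P2", "P3"), ("P4", "P5"), ("SEED1", "P3")]
sub_edges = nearest_neighbour_subnetwork(toy_edges, {"SEED1", "SEED2"})
modules = connected_components(sub_edges)
```

Here the two seeds end up in one module because they share the neighbour P3, which mirrors the observation that candidate genes found by different approaches can still cluster in the interaction network.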
On the basis of the protein interaction network, it appears that the NRG1-triggered NMDAR protein internalization and the CACNG2 mediated AMPA receptor recruiting may act together in the glutamatergic signalling process. Since both the NMDA and AMPA receptors are calcium channels, this process may regulate the influx of Ca2+. Reducing the cation influx might be one of the disease mechanisms for schizophrenia. This PPI network analysis approach combined with the support from co-expression analysis may provide an efficient way to propose pathogenetic mechanisms for various highly heritable diseases.
Piwi-interacting RNAs (piRNAs) are a class of small non-coding RNA primarily expressed in germ cells that can silence transposons at the post-transcriptional level. Accurate prediction of piRNAs remains a significant challenge.
We developed a program for piRNA annotation (Piano) using piRNA-transposon interaction information. We downloaded 13,848 Drosophila piRNAs and 261,500 Drosophila transposons. The piRNAs were aligned to transposons with a maximum of three mismatches. Then, piRNA-transposon interactions were predicted by RNAplex. Triplet elements combining structure and sequence information were extracted from piRNA-transposon matching/pairing duplexes. A support vector machine (SVM) was used on these triplet elements to classify real and pseudo piRNAs, achieving 95.3 ± 0.33% accuracy and 96.0 ± 0.5% sensitivity. The SVM classifier can be used to correctly predict human, mouse and rat piRNAs, with overall accuracy of 90.6%. We used Piano to predict piRNAs for the rice stem borer, Chilo suppressalis, an important rice insect pest that causes huge yield loss. As a result, 82,639 piRNAs were predicted in C. suppressalis.
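The alignment criterion (at most three mismatches) can be illustrated with a short sketch. It assumes an ungapped, same-strand comparison and uses hypothetical function names; the aligner actually used by Piano is not named in the abstract, and a real pipeline would also consider the reverse complement.

```python
def mismatch_hits(query, target, max_mismatches=3):
    """Return (offset, mismatches) for every ungapped window of `target`
    that matches `query` with at most `max_mismatches` substitutions."""
    hits = []
    length = len(query)
    for offset in range(len(target) - length + 1):
        mismatches = 0
        for q, t in zip(query, target[offset:offset + length]):
            if q != t:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # window rejected, stop counting early
        else:
            # Loop finished without exceeding the limit: record the hit.
            hits.append((offset, mismatches))
    return hits

# Toy sequences, not real piRNA or transposon data.
exact = mismatch_hits("ACGT", "TTACGTAA", max_mismatches=0)
near = mismatch_hits("ACGT", "ACGA", max_mismatches=1)
```

Windows that pass this filter would then be handed to a duplex predictor, from which the structure and sequence triplet features are extracted.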
Piano demonstrates excellent piRNA prediction performance by using both structure and sequence features of piRNA-transposon interactions. Piano is freely available to the academic community at http://ento.njau.edu.cn/Piano.html.
Electronic supplementary material
The online version of this article (doi:10.1186/s12859-014-0419-6) contains supplementary material, which is available to authorized users.
piRNAs; piRNA prediction; Support vector machine (SVM); Chilo suppressalis; Drosophila melanogaster; Homo sapiens; Mus musculus; Rattus norvegicus
MicroRNAs (miRNAs) are a family of non-coding RNAs approximately 21 nucleotides in length that play pivotal roles at the post-transcriptional level in animals, plants and viruses. These molecules silence their target genes by degrading transcripts or suppressing translation. Studies have shown that miRNAs are involved in biological responses to a variety of biotic and abiotic stresses. Identification of these molecules and their targets can aid the understanding of regulatory processes. Recently, prediction methods based on machine learning have been widely used for miRNA prediction. However, most of these methods were designed for mammalian miRNA prediction, and few are available for predicting miRNAs in the pre-miRNAs of specific plant species. Although the complete Solanum lycopersicum genome has been published, only 77 Solanum lycopersicum miRNAs have been identified, far fewer than the estimated number. Therefore, it is essential to develop a prediction method based on machine learning to identify new plant miRNAs.
A novel classification model based on a support vector machine (SVM) was trained to identify real and pseudo plant pre-miRNAs together with their miRNAs. An initial set of 152 novel features related to sequential structures was used to train the model. By applying feature selection, we obtained the best subset of 47 features for use with the Back Support Vector Machine-Recursive Feature Elimination (B-SVM-RFE) method for the classification of plant pre-miRNAs. Using this method, 63 features were obtained for plant miRNA classification. We then developed an integrated classification model, miPlantPreMat, which comprises MiPlantPre and MiPlantMat, to identify plant pre-miRNAs and their miRNAs. This model achieved approximately 90% accuracy using plant datasets from nine plant species, including Arabidopsis thaliana, Glycine max, Oryza sativa, Physcomitrella patens, Medicago truncatula, Sorghum bicolor, Arabidopsis lyrata, Zea mays and Solanum lycopersicum. Using miPlantPreMat, 522 Solanum lycopersicum miRNAs were identified in the Solanum lycopersicum genome sequence.
We developed an integrated classification model, miPlantPreMat, based on structure-sequence features and SVM. MiPlantPreMat was used to identify both plant pre-miRNAs and the corresponding mature miRNAs. An improved feature selection method was proposed, resulting in high classification accuracy, sensitivity and specificity.
Electronic supplementary material
The online version of this article (doi:10.1186/s12859-014-0423-x) contains supplementary material, which is available to authorized users.
MiRNA; Pre-miRNA; Prediction; SVM; Feature selection
Alternative splicing (AS) is an important post-transcriptional regulation mechanism, and its detection is an important application of RNA-seq studies in eukaryotes. A number of software packages and computational methods have been developed for detecting AS. Most of these methods, however, were designed and tested on animal data, such as human and mouse. Plant genes differ from those of animals in many ways, e.g., in average intron size and preferred AS types. These differences may require different computational approaches and raise questions about the effectiveness of existing methods on plant data. The goal of this paper is to benchmark existing computational methods for detecting differential splicing (or transcription) so that biologists can choose the most suitable tools for their goals.
This study compares eight popular, publicly available software packages for differential splicing analysis using both simulated and real Arabidopsis thaliana RNA-seq data. All of the packages are freely available. The study examines the effects of varying AS ratio, read depth, dispersion pattern, AS type and sample size, as well as the influence of annotation. Using real data, the study examines the consistency between the packages and verifies a subset of the detected AS events by PCR.
No single method performs best in all situations. The accuracy of annotation has a major impact on which method should be chosen for AS analysis. DEXSeq performs well on the simulated data when the AS signal is relatively strong and the annotation is accurate. Cufflinks achieves a better tradeoff between precision and recall and turns out to be the best choice when incomplete annotation is provided. Some methods perform inconsistently across AS types. Complex AS events that combine several simple AS events pose problems for most methods, especially for MATS. MATS stands out in the analysis of real RNA-seq data when all the AS events being evaluated are simple AS events.
Electronic supplementary material
The online version of this article (doi:10.1186/s12859-014-0364-4) contains supplementary material, which is available to authorized users.
RNAseq; Alternative splicing; Plants
Combining information from different studies is an important and useful practice in bioinformatics, including genome-wide association studies, rare variant data analysis and other set-based analyses. Many statistical methods have been proposed to combine p-values from independent studies. However, it is known that no test is uniformly most powerful under all conditions; therefore, finding a powerful test for a specific situation is important and desirable.
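For orientation, the classical Fisher method (listed among the keywords) can be sketched in a few lines. This is the textbook baseline, not the gamma-based approach proposed in the paper. Under the null hypothesis, -2 ln p follows a chi-square distribution with 2 degrees of freedom, which is a gamma distribution with shape 1 and scale 2, so Fisher's statistic is one simple member of the gamma family. The function name is illustrative.

```python
from math import exp, log

def fisher_combined_pvalue(pvalues):
    """Fisher's method for independent p-values.

    X = -2 * sum(ln p_i) follows a chi-square distribution with 2n degrees
    of freedom under the null. For even degrees of freedom the survival
    function has the closed form exp(-x/2) * sum_{i<n} (x/2)^i / i!.
    """
    n = len(pvalues)
    if n == 0:
        raise ValueError("at least one p-value is required")
    if any(not (0.0 < p <= 1.0) for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    half_stat = -sum(log(p) for p in pvalues)  # this is X / 2
    term, total = 1.0, 1.0
    for i in range(1, n):
        term *= half_stat / i  # (x/2)^i / i!
        total += term
    return exp(-half_stat) * total

combined = fisher_combined_pvalue([0.05, 0.05])
```

For two p-values the result reduces to p1*p2*(1 - ln(p1*p2)), which gives a quick sanity check on the implementation.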
In this paper, we propose a new statistical approach to combining p-values based on gamma distribution, which uses the inverse of the p-value as the shape parameter in the gamma distribution.
A simulation study and a real data application demonstrate that the proposed method performs well in some situations.
Fisher test; Lancaster method; rare variant association test; z-test
The earliest whole protein order/disorder predictor (Uversky et al., Proteins, 41: 415-427 (2000)), herein called the charge-hydropathy (C-H) plot, was originally developed using the Kyte-Doolittle (1982) hydropathy scale (Kyte & Doolittle., J. Mol. Biol, 157: 105-132(1982)). Here the goal is to determine whether the performance of the C-H plot in separating structured and disordered proteins can be improved by using an alternative hydropathy scale.
Using the performance of the C-H plot as the metric, we compared 19 alternative hydropathy scales and found that the Guy (1985) hydropathy scale (Guy, Biophys. J, 47:61-70(1985)) was the best of the tested scales for separating large collections of structured proteins and intrinsically disordered proteins (IDPs) on the C-H plot. Next, we developed a new scale, named IDP-Hydropathy, which further improves the discrimination between structured proteins and IDPs. Applying the C-H plot to a dataset containing 109 IDPs and 563 non-homologous fully structured proteins, the Kyte-Doolittle (1982) hydropathy scale, the Guy (1985) hydropathy scale, and the IDP-Hydropathy scale gave balanced two-state classification accuracies of 79%, 84%, and 90%, respectively, indicating that a very substantial overall improvement is obtained by using different hydropathy scales. A correlation study shows that IDP-Hydropathy is strongly correlated with other hydropathy scales, suggesting that IDP-Hydropathy probably has only minor contributions from amino acid properties other than hydropathy.
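A C-H plot classification can be sketched as below, under several assumptions. The scale shown is Kyte-Doolittle rescaled to the range 0 to 1; net charge counts only K, R, D and E; and the boundary coefficients (slope 2.785, intercept -1.151) are the values commonly quoted for the original C-H plot, which should be checked against the cited reference. A different hydropathy scale would generally need a refitted boundary. Function names are illustrative.

```python
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def mean_hydropathy(sequence, scale=KYTE_DOOLITTLE):
    """Mean per-residue hydropathy after rescaling the scale to [0, 1]."""
    low, high = min(scale.values()), max(scale.values())
    return sum((scale[aa] - low) / (high - low) for aa in sequence) / len(sequence)

def mean_net_charge(sequence):
    """Absolute mean net charge, counting K/R as +1 and D/E as -1 (assumed)."""
    positive = sum(sequence.count(aa) for aa in "KR")
    negative = sum(sequence.count(aa) for aa in "DE")
    return abs(positive - negative) / len(sequence)

def predicts_disorder(sequence, slope=2.785, intercept=-1.151):
    """Disordered if the point lies above the line R = slope * H + intercept."""
    return mean_net_charge(sequence) > slope * mean_hydropathy(sequence) + intercept

def balanced_accuracy(tp, fn, tn, fp):
    """Average of sensitivity and specificity for a two-state classifier."""
    return (tp / (tp + fn) + tn / (tn + fp)) / 2

# Toy sequences: one highly charged, one strongly hydrophobic.
charged = predicts_disorder("EEEEKKEEDD")
hydrophobic = predicts_disorder("LLLLIIVVAA")
```

Balanced accuracy is the natural summary here because the dataset is imbalanced (109 IDPs against 563 structured proteins), so plain accuracy would be dominated by the structured class.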
We suggest that IDP-Hydropathy would likely be the best scale to use for any type of algorithm developed to predict protein disorder.
Intrinsically disordered proteins; natively unstructured or unfolded proteins; structure and disorder prediction; support vector machines
Type 2 diabetes mellitus (T2D), also known as noninsulin-dependent diabetes mellitus (NIDDM) or adult-onset diabetes, is a common disease. It is estimated that more than 300 million people worldwide suffer from T2D. In this study, we investigated bloodstream samples from T2D, pre-diabetic and healthy (non-diabetic) humans using genomic, genealogical, and phenomic information. We identified differentially expressed genes and pathways. The study provides deeper insights into the development of T2D and useful information for more effective prevention and treatment of the disease.
A total of 142 bloodstream samples were collected, from 47 healthy humans, 22 pre-diabetic patients and 73 T2D patients. Whole-genome gene expression profiles were obtained using Agilent Oligo chips that contain over 20,000 human genes. We identified 79 significantly differentially expressed genes with a fold change ≥ 2. We mapped those genes and pinpointed their locations on human chromosomes. Of these, 3 genes did not map well to the human genome, while the remaining 76 mapped well. We found that the differentially expressed genes are most abundant on chromosome 1, which contains 9 of those genes, followed by chromosome 2, which contains 7 of the 76 differentially expressed genes. We performed gene ontology (GO) functional analysis of the 79 differentially expressed genes and found that the regulation of cell proliferation was among the most common pathways related to T2D. The expression of the 79 genes was combined with clinical information, including age, sex, and race, to construct an optimal discriminant model. The overall performance of the model reached 95.1% accuracy, with 91.5% accuracy in identifying healthy humans, 100% accuracy for pre-diabetic patients and 95.9% accuracy for T2D patients. The higher performance in identifying pre-diabetic patients resulted from more significant gene expression changes in this group, which implies that these patients were undergoing profound genetic changes towards disease development.
Differentially expressed genes were distributed across chromosomes and were more abundant on chromosomes 1 and 2 than on the rest of the human genome. We found that regulation of cell proliferation plays an important role in the development of T2D. The predictive model developed in this study uses the 79 significant genes in combination with age, sex, and race to distinguish pre-diabetic, T2D, and healthy humans. The study provides not only a deeper understanding of the molecular mechanisms of the disease but also useful information for pathway analysis and effective drug target identification.
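The reported accuracies can be checked for internal consistency with a few lines of arithmetic. The numbers of correctly classified samples per group are not given in the abstract; they are inferred here by rounding, which is an assumption, and the variable names are illustrative.

```python
# Group sizes and per-group accuracies as reported in the abstract.
group_sizes = {"healthy": 47, "pre-diabetic": 22, "T2D": 73}
reported_accuracy = {"healthy": 0.915, "pre-diabetic": 1.000, "T2D": 0.959}

# Assumption: the implied count of correct calls is the nearest integer.
implied_correct = {
    group: round(group_sizes[group] * reported_accuracy[group])
    for group in group_sizes
}

total_samples = sum(group_sizes.values())
overall_accuracy = sum(implied_correct.values()) / total_samples

print(implied_correct)
print(f"overall accuracy: {100 * overall_accuracy:.1f}%")
```

Under this rounding assumption the per-group figures reproduce the reported overall accuracy of 95.1% on the 142 samples, so the reported percentages are mutually consistent.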
T2D; pre-diabetic; differential gene expression; Gene Ontology (GO) analysis; Discriminant model