Finding out the associations between an input gene set, such as genes associated with a certain phenotype, and annotated gene sets, such as known pathways, are a very important problem in modern molecular biology. The existing approaches mainly focus on the overlap between the two, and may miss important but subtle relationships between genes. In this paper, we propose a method, NetPEA, by combining the known pathways and high-throughput networks. Our method not only considers the shared genes, but also takes the gene interactions into account. It utilizes a protein-protein interaction network and a random walk procedure to identify hidden relationships between gene sets, and uses a randomization strategy to evaluate the significance for pathways to achieve such similarity scores. Compared with the over-representation based method, our method can identify more relationships. Compared with a state of the art network-based method, EnrichNet, our method not only provides a ranked list of pathways, but also provides the statistical significant information. Importantly, through independent tests, we show that our method likely has a higher sensitivity in revealing the true casual pathways, while at the same time achieve a higher specificity. Literature review of selected results indicates that some of the novel pathways reported by our method are biologically relevant and important.
pathway; protein-protein interaction network; enrichment analysis; gene sets; random walk
In silico prediction of drug side-effects in early stage of drug development is becoming more popular now days, which not only reduces the time for drug design but also reduces the drug development costs. In this article we propose an ensemble approach to predict drug side-effects of drug molecules based on their chemical structure. Our idea originates from the observation that similar drugs have similar side-effects. Based on this observation we design an ensemble approach that combine the results from different classification models where each model is generated by a different set of similar drugs. We applied our approach to 1385 side-effects in the SIDER database for 888 drugs. Results show that our approach outperformed previously published approaches and standard classifiers. Furthermore, we applied our method to a number of uncharacterized drug molecules in DrugBank database and predict their side-effect profiles for future usage. Results from various sources confirm that our method is able to predict the side-effects for uncharacterized drugs and more importantly able to predict rare side-effects which are often ignored by other approaches. The method described in this article can be useful to predict side-effects in drug design in an early stage to reduce experimental cost and time.
adverse side-effect; drug development; chemical substructure; uncharacterized drug
The traditional de novo drug discovery is known as a high cost and high risk process. In response, recently there is an increasing interest in discovering new indications for known drugs—a process known as drug repositioning—using computational methods. In this study, we present a new systematic approach for identifying potential new indications of an existing drug through its relation to similar drugs. Different from the previous similarity-based methods, we adapted a novel bipartite-graph based method when considering common drug targets and their interaction information. Furthermore, we added drug structure information into the calculation of drug pairwise similarity. In cross-validation experiments, our method achieved a sensitivity of 0.77 and specificity of 0.92 (AUC = 0.888) and compared favorably to the state of the art. When compared with a control group of drug uses, our drug repositioning results were found to be significantly enriched in both the biomedical literature and clinical trials. Our results indicate that combining chemical structure and drug target information results in better prediction performance and that the proposed approach successfully captures the implicit information between drug targets.
drug repositioning; bipartite graph; target similarity; chemical similarity; target interaction
Identifying drug-drug interactions is an important and challenging problem in computational biology and healthcare research. There are accurate, structured but limited domain knowledge and noisy, unstructured but abundant textual information available for building predictive models. The difficulty lies in mining the true patterns embedded in text data and developing efficient and effective ways to combine heterogenous types of information. We demonstrate a novel approach of leveraging augmented text-mining features to build a logistic regression model with improved prediction performance (in terms of discrimination and calibration). Our model based on synthesized features significantly outperforms the model trained with only structured features (AUC: 96% vs. 91%, Sensitivity: 90% vs. 82% and Specificity: 88% vs. 81%). Along with the quantitative results, we also show learned “latent topics”, an intermediary result of our text mining module, and discuss their implications.
In this paper, we present a novel framework for microscopic image analysis of nuclei, data management, and high performance computation to support translational research involving nuclear morphometry features, molecular data, and clinical outcomes. Our image analysis pipeline consists of nuclei segmentation and feature computation facilitated by high performance computing with coordinated execution in multi-core CPUs and Graphical Processor Units (GPUs). All data derived from image analysis are managed in a spatial relational database supporting highly efficient scientific queries. We applied our image analysis workflow to 159 glioblastomas (GBM) from The Cancer Genome Atlas dataset. With integrative studies, we found statistics of four specific nuclear features were significantly associated with patient survival. Additionally, we correlated nuclear features with molecular data and found interesting results that support pathologic domain knowledge. We found that Proneural subtype GBMs had the smallest mean of nuclear Eccentricity and the largest mean of nuclear Extent, and MinorAxisLength. We also found gene expressions of stem cell marker MYC and cell proliferation maker MKI67 were correlated with nuclear features. To complement and inform pathologists of relevant diagnostic features, we queried the most representative nuclear instances from each patient population based on genetic and transcriptional classes. Our results demonstrate that specific nuclear features carry prognostic significance and associations with transcriptional and genetic classes, highlighting the potential of high throughput pathology image analysis as a complementary approach to human-based review and translational research.
Glioblastoma; large-scale image analysis; survival analysis; phenotype-genotype integration; translational research
The rapid determination of nucleic acid sequences is increasing the number of sequences that are available. Inherent in a template or seed alignment is the culmination of structural and functional constraints that are selecting those mutations that are viable during the evolution of the RNA. While we might not understand these structural and functional, template-based alignment programs utilize the patterns of sequence conservation to encapsulate the characteristics of viable RNA sequences that are aligned properly. We have developed a program that utilizes the different dimensions of information in rCAD, a large RNA informatics resource, to establish a profile for each position in an alignment. The most significant include sequence identity and column composition in different phylogenetic taxa. We have compared our methods with a maximum of eight alternative alignment methods on different sets of 16S and 23S rRNA sequences with sequence percent identities ranging from 50% to 100%. The results showed that CRWAlign outperformed the other alignment methods in both speed and accuracy. A web-based alignment server is available at http://www.rna.ccbb.utexas.edu/SAE/2F/CRWAlign.
RNA sequence alignment; template-based alignment; comparative analysis; phylogenetic-based alignment
We present a fast pairwise RNA sequence alignment method using structural information, named R-PASS (RNA Pairwise Alignment of Structure and Sequence), which shows good accuracy on sequences with low sequence identity and significantly faster than alternative methods. The method begins by representing RNA secondary structure as a set of structure motifs. The motifs from two RNAs are then used as input into a bipartite graph-matching algorithm, which determines the structure matches. The matches are then used as constraints in a constrained dynamic programming sequence alignment procedure. The R-PASS method has an O(nm) complexity. We compare our method with two other structure-based alignment methods, LARA and ExpaLoc, and with a sequence-based alignment method, MAFFT, across three benchmarks and obtain favorable results in accuracy and orders of magnitude faster in speed.
RNA pairwise structural alignment; structure motif; bipartite graph matching; constraint sequence alignment
Nanoparticle formulations that are being developed and tested for various medical applications are typically multi-component systems that vary in their structure, chemical composition, and function. It is difficult to compare and understand the differences between the structural and chemical descriptions of hundreds and thousands of nanoparticle formulations found in text documents. We have developed a string nomenclature to create computable string expressions that identify and enumerate the different high-level types of material parts of a nanoparticle formulation and represent the spatial order of their connectivity to each other. The string expressions are intended to be used as IDs, along with terms that describe a nanoparticle formulation and its material parts, in data sharing documents and nanomaterial research databases. The strings can be parsed and represented as a directed acyclic graph. The nodes of the graph can be used to display the string ID, name and other text descriptions of the nanoparticle formulation or its material part, while the edges represent the connectivity between the material parts with respect to the whole nanoparticle formulation. The different patterns in the string expressions can be searched for and used to compare the structure and chemical components of different nanoparticle formulations. The proposed string nomenclature is extensible and can be applied along with ontology terms to annotate the complete description of nanoparticles formulations.
string nomenclature; nanoparticle; ontology; informatics
Breast screening is the regular examination of a woman’s breasts to find breast cancer earlier. The sole exam approved for this purpose is mammography. Usually, findings are annotated through the Breast Imaging Reporting and Data System (BIRADS) created by the American College of Radiology. The BIRADS system determines a standard lexicon to be used by radiologists when studying each finding. Although the lexicon is standard, the annotation accuracy of the findings depends on the experience of the radiologist. Moreover, the accuracy of the classification of a mammography is also highly dependent on the expertise of the radiologist. A correct classification is paramount due to economical and humanitarian reasons.
The main goal of this work is to produce machine learning models that predict the outcome of a mammography from a reduced set of annotated mammography findings. In the study we used a data set consisting of 348 consecutive breast masses that underwent image guided or surgical biopsy performed between October 2005 and December 2007 on 328 female subjects. The main conclusions are threefold: (1) automatic classification of a mammography, independent on information about mass density, can reach equal or better results than the classification performed by a physician; (2) mass density seems to be a good indicator of malignancy, as previous studies suggested; (3) a machine learning model can predict mass density with a quality as good as the specialist blind to biopsy, which is one of our main contributions. Our model can predict malignancy in the absence of the mass density attribute, since we can fill up this attribute using our mass density predictor.
machine learning; mammography; BIRADS
We have previously identified several biomarkers of hepatocellular carcinoma (HCC). The levels of three of these biomarkers were analyzed individually and in combination with the currently used marker, alpha fetoprotein (AFP), for the ability to distinguish between a diagnosis of cirrhosis (n=113) and HCC (n=164). We have utilized several novel biostatistical tools, along with the inclusion of clinical factors such as age and gender, to determine if improved algorithms could be used to increase the probability of cancer detection. Using several of these methods, we are able to detect HCC in the background of cirrhosis with an AUC of at least 0.95. The use of clinical factors in combination with biomarker values to detect HCC is discussed.
Hepatocellular Carcinoma; biomarkers; Hepatitis B virus; logistic regression; penalized logistic regression; classification and regression trees
Tandem mass spectrometry experiments generate from thousands to millions of spectra. These spectra can be used to identify the presence of proteins in biological samples. In this work, we propose a new method to identify peptides, substrings of proteins, based on clustered tandem mass spectrometry data. In contrast to previously proposed approaches, which identify one representative spectrum for each cluster using traditional database searching algorithms, our method uses all available information to score all the spectra in a cluster against candidate peptides using Bayesian model selection. We illustrate the performance of our method by applying it to seven-standard-protein mixture data.
Bayesian analysis; Bioinformatics; Clustered tandem mass spectra; False discovery rate; Peptide identification; Proteomics
In this work we build the first BI-RADS parser for Portuguese free texts, modeled after existing approaches to extract BI-RADS features from English medical records. Our concept finder uses a semantic grammar based on the BIRADS lexicon and on iterative transferred expert knowledge. We compare the performance of our algorithm to manual annotation by a specialist in mammography. Our results show that our parser’s performance is comparable to the manual method.
feature extraction; breast cancer; BI-RADS descriptors
High-throughput spectrometers are capable of producing data sets containing thousands of spectra for a single biological sample. These data sets contain a substantial amount of redundancy from peptides that may get selected multiple times in a LC-MS/MS experiment. In this paper, we present an efficient algorithm, CAMS (Clustering Algorithm for Mass Spectra) for clustering mass spectrometry data which increases both the sensitivity and confidence of spectral assignment. CAMS utilizes a novel metric, called F-set, that allows accurate identification of the spectra that are similar. A graph theoretic framework is defined that allows the use of F-set metric efficiently for accurate cluster identifications. The accuracy of the algorithm is tested on real HCD and CID data sets with varying amounts of peptides. Our experiments show that the proposed algorithm is able to cluster spectra with very high accuracy in a reasonable amount of time for large spectral data sets. Thus, the algorithm is able to decrease the computational time by compressing the data sets while increasing the throughput of the data by interpreting low S/N spectra.
Clustering; Mass spectrometry; Graph Theory; Efficient Algorithms
Phosphorylation site assignment of large-scale data from high throughput tandem mass spectrometry (LC-MS/MS) data is an important aspect of phosphoproteomics. Correct assignment of phosphorylated residue(s) is important for functional interpretation of the data within a biological context. Common search algorithms (Sequest etc.) for mass spectrometry data are not designed for accurate site assignment; thus, additional algorithms are needed. In this paper, we propose a linear-time and linear-space dynamic programming strategy for phosphorylation site assignment. The algorithm, referred to as PhosSA, optimizes the objective function defined as the summation of peak intensities that are associated with theoretical phosphopeptide fragmentation ions. Quality control is achieved through the use of a post-processing criteria whose value is indicative of the signal-to-noise (S/N) properties and redundancy of the fragmentation spectra. The algorithm is tested using experimentally generated data sets of peptides with known phosphorylation sites while varying the fragmentation strategy (CID or HCD) and molar amounts of the peptides. The algorithm is also compatible with various peptide labeling strategies including SILAC and iTRAQ. PhosSA is shown to achieve > 99% accuracy with a high degree of sensitivity. The algorithm is extremely fast and scalable (able to process up to 0.5 million peptides in an hour). The implemented algorithm is freely available at http://helixweb.nih.gov/ESBL/PhosSA/ for academic purposes.
The values of data elements stored in biomedical databases often draw from biomedical ontologies. Authorization rules can be defined on these ontologies to control access to sensitive and private data elements in such databases. Authorization rules may be specified by different authorities at different times for various purposes. Since such policy rules can conflict with each other, access to sensitive information may inadvertently be allowed. Another problem in biomedical data protection is inference attacks, in which a user who has legitimate access to some data elements is able to infer information related to other data elements. We propose and evaluate two strategies; one for detecting policy inconsistencies to avoid potential inference attacks and the other for detecting policy conflicts.
Authorization policy; Biomedical ontology; Inference attacks; Policy conflicts
A new and emerging paradigm in molecular biology is revealing that RNA is implicated in nearly every aspect of the metabolism in the cell. To enhance our understanding of the function of these RNA molecules in the cell, it is essential that we have a complete understanding of their higher-order structures. While many computational tools have been developed to predict and analyse these higher-order RNA structures, few are able to visualize them for analytical purposes. In this paper, we present an interactive visualization tool of the secondary structure of RNA, named RNA2DMap. This program enables multiple-dimensions of information about RNA structure to be selected, customized and displayed to visually identify patterns and relationships. RNA2DMap facilitates the comparative analysis and understanding of RNAs that cannot be readily obtained with other graphical or text output from computer programs. Three use cases are presented to illustrate how RNA2DMap aids structural analysis.
Biological Data Visulation; RNA Struaral Analysis; Interative Application
Strains of the Mycobacterium tuberculosis complex (MTBC) can be classified into coherent lineages of similar traits based on their genotype. We present a tensor clustering framework to group MTBC strains into sublineages of the known major lineages based on two biomarkers: spacer oligonucleotide type (spoligotype) and mycobacterial interspersed repetitive units (MIRU). We represent genotype information of MTBC strains in a high-dimensional array in order to include information about spoligotype, MIRU, and their coexistence using multiple-biomarker tensors. We use multiway models to transform this multidimensional data about the MTBC strains into two-dimensional arrays and use the resulting score vectors in a stable partitive clustering algorithm to classify MTBC strains into sublineages. We validate clusterings using cluster stability and accuracy measures, and find stabilities of each cluster. Based on validated clustering results, we present a sublineage structure of MTBC strains and compare it to the sublineage structures of SpolDB4 and MIRU-VNTRplus.
Tuberculosis; Mycobacterium tuberculosis complex; multiway models; clustering; cluster validation
Biomarkers of Mycobacterium tuberculosis complex (MTBC) mutate over time. Among the biomarkers of MTBC, spacer oligonucleotide type (spoligotype) and Mycobacterium Interspersed Repetitive Unit (MIRU) patterns are commonly used to genotype clinical MTBC strains. In this study, we present an evolution model of spoligotype rearrangements using MIRU patterns to disambiguate the ancestors of spoligotypes, in a large patient dataset from the United States Centers for Disease Control and Prevention (CDC). Based on the contiguous deletion assumption and rare observation of convergent evolution, we first generate the most parsimonious forest of spoligotypes, called a spoligoforest, using three genetic distance measures. An analysis of topological attributes of the spoligoforest and number of variations at the direct repeat (DR) locus of each strain reveals interesting properties of deletions in the DR region. First, we compare our mutation model to existing mutation models of spoligotypes and find that our mutation model produces as many within-lineage mutation events as other models, with slightly higher segregation accuracy. Second, based on our mutation model, the number of descendant spoligotypes follows a power law distribution. Third, contrary to prior studies, the power law distribution does not plausibly fit to the mutation length frequency. Finally, the total number of mutation events at consecutive DR loci follows a bimodal distribution, which results in accumulation of shorter deletions in the DR region. The two modes are spacers 13 and 40, which are hotspots for chromosomal rearrangements. The change point in the bimodal distribution is spacer 34, which is absent in most MTBC strains. This bimodal separation results in accumulation of shorter deletions, which explains why a power law distribution is not a plausible fit to the mutation length frequency.
tuberculosis; Mycobacterium tuberculosis complex; DR locus; spoligotype; MIRU-VNTR; mutation
GWAS studies have been successful in finding genetic determinants of obesity. To translate discovered genetic variants into new therapies or prevention strategies, molecular or physiological mechanisms need to be discovered. One strategy is to perform data mining of data sets with detailed phenotypic data, such as those present in dbGAP (database of Genotypes and Phenotypes) for hypothesis generation. We propose a novel technique that combines the power and computational efficiency of existing Bayesian Network (BN) learning algorithms with the statistical rigor of Structural Equation Modeling (SEM) to produce an overall system that searches the space of potential networks and evaluates promising candidates using standard SEM model selection criteria. We demonstrate our method using the analysis of a candidate SNP data set from the AMERICO sample, a multi-ethnic cross-sectional cohort of roughly three hundred children with detailed obesity-related phenotypes. We demonstrate our approach by showing genetic mechanisms for three obesity-related SNPs.
We study the problem of learning classification models from complex multivariate temporal data encountered in electronic health record systems. The challenge is to define a good set of features that are able to represent well the temporal aspect of the data. Our method relies on temporal abstractions and temporal pattern mining to extract the classification features. Temporal pattern mining usually returns a large number of temporal patterns, most of which may be irrelevant to the classification task. To address this problem, we present the minimal predictive temporal patterns framework to generate a small set of predictive and non-spurious patterns. We apply our approach to the real-world clinical task of predicting patients who are at risk of developing heparin induced thrombocytopenia. The results demonstrate the benefit of our approach in learning accurate classifiers, which is a key step for developing intelligent clinical monitoring systems.
One of the major obstacles in computational modeling of a biological system is to determine a large number of parameters in the mathematical equations representing biological properties of the system. To tackle this problem, we have developed a global optimization method, called Discrete Selection Levenberg-Marquardt (DSLM), for parameter estimation. For fast computational convergence, DSLM suggests a new approach for the selection of optimal parameters in the discrete spaces, while other global optimization methods such as genetic algorithm and simulated annealing use heuristic approaches that do not guarantee the convergence. As a specific application example, we have targeted understanding phagocyte transmigration which is involved in the fibrosis process for biomedical device implantation. The goal of computational modeling is to construct an analyzer to understand the nature of the system. Also, the simulation by computational modeling for phagocyte transmigration provides critical clues to recognize current knowledge of the system and to predict yet-to-be observed biological phenomenon.
biological system modeling; nonlinear estimation; parameter estimation; reverse engineering
16S rRNA gene profiling has recently been boosted by the development of pyrosequencing methods. A common analysis is to group pyrosequences into Operational Taxonomic Units (OTUs), such that reads in an OTU are likely sampled from the same species. However, species diversity estimated from error-prone 16S rRNA pyrosequences may be inflated because the reads sampled from the same 16S rRNA gene may appear different, and current OTU inference approaches typically involve time-consuming pairwise/multiple distance calculation and clustering. I propose a novel approach AbundantOTU based on a Consensus Alignment (CA) algorithm, which infers consensus sequences, each representing an OTU, taking advantage of the sequence redundancy for abundant species. Pyrosequencing reads can then be recruited to the consensus sequences to give quantitative information for the corresponding species. As tested on 16S rRNA pyrosequence datasets from mock communities with known species, AbundantOTU rapidly reported identified sequences of the source 16S rRNAs and the abundances of the corresponding species. AbundantOTU was also applied to 16S rRNA pyrosequence datasets derived from real microbial communities and the results are in general agreement with previous studies.
16S rRNA gene; pyrosequencing; Operational Taxonomic Unit (OTU); abundant species
We describe an algorithm for finding approximate seeds for DNA homology searches. In contrast to previous algorithms that use exact or spaced seeds, our approximate seeds may contain insertions and deletions. We present a generalized heuristic for finding such seeds efficiently and prove that the heuristic does not affect sensitivity. We show how to adapt this algorithm to work over the memory efficient suffix array with provably minimal overhead in running time.
We demonstrate the effectiveness of our algorithm on two tasks: whole genome alignment of bacteria and alignment of the DNA sequences of 177 genes that are orthologous in human and mouse. We show our algorithm achieves better sensitivity and uses less memory than other commonly used local alignment tools.
local alignment; inexact seeds; suffix array
Citations are ubiquitous in scientific articles and play important roles for representing the semantic content of a full-text biomedical article. In this work, we manually examined full-text biomedical articles to analyze the semantic content of citations in full-text biomedical articles. After developing a citation relation schema and annotation guideline, our pilot annotation results show an overall agreement of 0.71, and here we report on the research challenges and the lessons we've learned while trying to overcome them. Our work is a first step toward automatic citation classification in full-text biomedical articles, which may contribute to many text mining tasks, including information retrieval, extraction, summarization, and question answering.
In homology modeling of protein structures, it is typical to find templates through a sequence search against a database of proteins with known structures. In more complicated modeling cases, such as modeling a protein structure in contact with a ligand, sequence information itself may not be enough and more biological information is required for a successful modeling process. SCOP and PFAM are two databases providing protein domain information which can be utilized in complex protein structure modeling. However, due to the manually-curated nature of both databases, they fail to provide timely coverage of protein sequences existing in the Protein Data Bank (PDB). In this paper, we introduce a new relational database, IDOPS, which integrates sequence and biological information extracted from remediated PDB files and protein domain information generated with HMM profiles of PFAM families. With a carefully designed protocol, this database is updated regularly and the coverage rate of PDB entries is guaranteed to be high.