Tandem mass spectrometry experiments generate thousands to millions of spectra. These spectra can be used to identify the proteins present in biological samples. In this work, we propose a new method to identify peptides, substrings of proteins, based on clustered tandem mass spectrometry data. In contrast to previously proposed approaches, which identify one representative spectrum for each cluster using traditional database searching algorithms, our method uses all available information to score all the spectra in a cluster against candidate peptides using Bayesian model selection. We illustrate the performance of our method by applying it to data from a seven-standard-protein mixture.
Bayesian analysis; Bioinformatics; Clustered tandem mass spectra; False discovery rate; Peptide identification; Proteomics
In this work, we build the first BI-RADS parser for Portuguese free texts, modeled after existing approaches for extracting BI-RADS features from English medical records. Our concept finder uses a semantic grammar based on the BI-RADS lexicon and on iteratively transferred expert knowledge. We compare the performance of our algorithm to manual annotation by a specialist in mammography. Our results show that our parser’s performance is comparable to the manual method.
feature extraction; breast cancer; BI-RADS descriptors
High-throughput spectrometers are capable of producing data sets containing thousands of spectra for a single biological sample. These data sets contain a substantial amount of redundancy from peptides that may be selected multiple times in an LC-MS/MS experiment. In this paper, we present an efficient algorithm, CAMS (Clustering Algorithm for Mass Spectra), for clustering mass spectrometry data, which increases both the sensitivity and confidence of spectral assignment. CAMS utilizes a novel metric, called F-set, that allows accurate identification of similar spectra. A graph-theoretic framework is defined that allows the F-set metric to be used efficiently for accurate cluster identification. The accuracy of the algorithm is tested on real HCD and CID data sets with varying amounts of peptides. Our experiments show that the proposed algorithm is able to cluster spectra with very high accuracy in a reasonable amount of time for large spectral data sets. Thus, the algorithm decreases the computational time by compressing the data sets while increasing the throughput of the data by interpreting low-S/N spectra.
Clustering; Mass spectrometry; Graph Theory; Efficient Algorithms
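The graph-theoretic grouping described in the CAMS abstract can be sketched in miniature. The F-set metric itself is not defined in the abstract, so this illustration substitutes a hypothetical cosine similarity over binned peak intensities, and recovers clusters as connected components of the similarity graph via union-find:

```python
from itertools import combinations

def cluster_spectra(spectra, similarity, threshold=0.7):
    """Group spectra whose pairwise similarity exceeds a threshold,
    taking connected components of the resulting similarity graph."""
    n = len(spectra)
    parent = list(range(n))          # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        if similarity(spectra[i], spectra[j]) >= threshold:
            parent[find(i)] = find(j)       # merge the two components

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

def cosine_binned(a, b):
    """Toy stand-in for the paper's F-set metric: cosine similarity
    between spectra stored as {m/z bin: intensity} dicts."""
    shared = set(a) & set(b)
    num = sum(a[k] * b[k] for k in shared)
    den = (sum(v * v for v in a.values()) ** 0.5) * \
          (sum(v * v for v in b.values()) ** 0.5)
    return num / den if den else 0.0
```

Connected components are the simplest graph-theoretic cluster definition; the paper's framework is more sophisticated, but the same similarity-graph structure underlies it.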
Phosphorylation site assignment for large-scale data from high-throughput tandem mass spectrometry (LC-MS/MS) is an important aspect of phosphoproteomics. Correct assignment of the phosphorylated residue(s) is important for functional interpretation of the data within a biological context. Common search algorithms for mass spectrometry data (e.g., SEQUEST) are not designed for accurate site assignment; thus, additional algorithms are needed. In this paper, we propose a linear-time and linear-space dynamic programming strategy for phosphorylation site assignment. The algorithm, referred to as PhosSA, optimizes an objective function defined as the summation of the peak intensities associated with theoretical phosphopeptide fragmentation ions. Quality control is achieved through a post-processing criterion whose value is indicative of the signal-to-noise (S/N) properties and redundancy of the fragmentation spectra. The algorithm is tested using experimentally generated data sets of peptides with known phosphorylation sites while varying the fragmentation strategy (CID or HCD) and molar amounts of the peptides. The algorithm is also compatible with various peptide labeling strategies, including SILAC and iTRAQ. PhosSA is shown to achieve >99% accuracy with a high degree of sensitivity. The algorithm is extremely fast and scalable (able to process up to 0.5 million peptides in an hour). The implemented algorithm is freely available at http://helixweb.nih.gov/ESBL/PhosSA/ for academic purposes.
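PhosSA's objective, the summed intensity of observed peaks matching theoretical phosphopeptide fragment ions, can be illustrated with a brute-force scorer. This is a sketch, not the paper's linear-time dynamic program: it enumerates phosphate placements over S/T/Y residues and scores each against singly charged b/y ions. The residue-mass table and mass tolerance are illustrative assumptions:

```python
from itertools import combinations

# Monoisotopic residue masses (Da); a minimal table for the sketch.
AA = {'A': 71.03711, 'S': 87.03203, 'T': 101.04768, 'G': 57.02146,
      'P': 97.05276, 'K': 128.09496, 'R': 156.10111, 'Y': 163.06333}
PHOS = 79.96633        # mass shift added by phosphorylation
PROTON = 1.00728
WATER = 18.01056

def score_placement(peptide, sites, peaks, tol=0.5):
    """Sum intensities of observed (m/z, intensity) peaks that match
    theoretical singly charged b/y ions for a given set of phospho sites."""
    masses = [AA[aa] + (PHOS if i in sites else 0.0)
              for i, aa in enumerate(peptide)]
    total = sum(masses)
    score, prefix = 0.0, 0.0
    for i in range(len(peptide) - 1):
        prefix += masses[i]
        b = prefix + PROTON                    # b-ion m/z (charge 1)
        y = total - prefix + WATER + PROTON    # complementary y-ion
        for mz, inten in peaks:
            if abs(mz - b) <= tol or abs(mz - y) <= tol:
                score += inten
    return score

def best_sites(peptide, n_phos, peaks):
    """Exhaustive stand-in for PhosSA's dynamic program: try every
    placement of n_phos phosphates on S/T/Y residues, keep the best."""
    candidates = [i for i, aa in enumerate(peptide) if aa in 'STY']
    return max(combinations(candidates, n_phos),
               key=lambda s: score_placement(peptide, set(s), peaks))
```

The exhaustive search is exponential in the number of phosphates; PhosSA's contribution is computing the same optimum in linear time and space.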
The values of data elements stored in biomedical databases often draw from biomedical ontologies. Authorization rules can be defined on these ontologies to control access to sensitive and private data elements in such databases. Authorization rules may be specified by different authorities at different times for various purposes, and since such policy rules can conflict with each other, access to sensitive information may inadvertently be allowed. Another problem in biomedical data protection is inference attacks, in which a user who has legitimate access to some data elements is able to infer information related to other data elements. We propose and evaluate two strategies: one for detecting policy inconsistencies to avoid potential inference attacks, and the other for detecting policy conflicts.
Authorization policy; Biomedical ontology; Inference attacks; Policy conflicts
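One way to picture the policy-conflict detection described above: attach allow/deny rules to ontology concepts, let each rule propagate down the is-a hierarchy, and flag any concept reached by both an allow and a deny. This is a minimal sketch under assumed rule semantics, not the paper's actual strategies, and the concept names are hypothetical:

```python
def ancestors(concept, parent):
    """Walk up the is-a hierarchy, given a child -> parent dict."""
    while concept is not None:
        yield concept
        concept = parent.get(concept)

def find_conflicts(rules, parent, concepts):
    """Flag concepts to which both an 'allow' and a 'deny' rule
    propagate down the ontology hierarchy (a policy conflict)."""
    conflicts = []
    for c in concepts:
        anc = set(ancestors(c, parent))
        actions = {action for target, action in rules if target in anc}
        if {'allow', 'deny'} <= actions:
            conflicts.append(c)
    return conflicts
```

For example, an allow rule on a broad concept combined with a deny rule on one of its descendants yields a conflict at that descendant, exactly the kind of inconsistency that could otherwise leak sensitive data.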
A new and emerging paradigm in molecular biology is revealing that RNA is implicated in nearly every aspect of the cell's metabolism. To enhance our understanding of the function of these RNA molecules in the cell, it is essential that we have a complete understanding of their higher-order structures. While many computational tools have been developed to predict and analyse these higher-order RNA structures, few are able to visualize them for analytical purposes. In this paper, we present an interactive tool for visualizing RNA secondary structure, named RNA2DMap. This program enables multiple dimensions of information about RNA structure to be selected, customized, and displayed to visually identify patterns and relationships. RNA2DMap facilitates a comparative analysis and understanding of RNAs that cannot be readily obtained with the graphical or text output of other computer programs. Three use cases are presented to illustrate how RNA2DMap aids structural analysis.
Biological Data Visualization; RNA Structural Analysis; Interactive Application
Strains of the Mycobacterium tuberculosis complex (MTBC) can be classified into coherent lineages of similar traits based on their genotype. We present a tensor clustering framework to group MTBC strains into sublineages of the known major lineages based on two biomarkers: spacer oligonucleotide type (spoligotype) and mycobacterial interspersed repetitive units (MIRU). We represent genotype information of MTBC strains in a high-dimensional array in order to include information about spoligotype, MIRU, and their coexistence using multiple-biomarker tensors. We use multiway models to transform this multidimensional data about the MTBC strains into two-dimensional arrays and use the resulting score vectors in a stable partitive clustering algorithm to classify MTBC strains into sublineages. We validate clusterings using cluster stability and accuracy measures, and find stabilities of each cluster. Based on validated clustering results, we present a sublineage structure of MTBC strains and compare it to the sublineage structures of SpolDB4 and MIRU-VNTRplus.
Tuberculosis; Mycobacterium tuberculosis complex; multiway models; clustering; cluster validation
Biomarkers of the Mycobacterium tuberculosis complex (MTBC) mutate over time. Among them, spacer oligonucleotide type (spoligotype) and mycobacterial interspersed repetitive unit (MIRU) patterns are commonly used to genotype clinical MTBC strains. In this study, we present an evolution model of spoligotype rearrangements that uses MIRU patterns to disambiguate the ancestors of spoligotypes, applied to a large patient dataset from the United States Centers for Disease Control and Prevention (CDC). Based on the contiguous deletion assumption and the rare observation of convergent evolution, we first generate the most parsimonious forest of spoligotypes, called a spoligoforest, using three genetic distance measures. An analysis of the topological attributes of the spoligoforest and the number of variations at the direct repeat (DR) locus of each strain reveals interesting properties of deletions in the DR region. First, comparing our mutation model to existing mutation models of spoligotypes, we find that it produces as many within-lineage mutation events as other models, with slightly higher segregation accuracy. Second, under our mutation model, the number of descendant spoligotypes follows a power law distribution. Third, contrary to prior studies, a power law distribution is not a plausible fit to the mutation length frequencies. Finally, the total number of mutation events at consecutive DR loci follows a bimodal distribution, whose two modes are spacers 13 and 40, hotspots for chromosomal rearrangements; the change point of the distribution is spacer 34, which is absent in most MTBC strains. This bimodal separation results in an accumulation of shorter deletions in the DR region, which explains why a power law distribution is not a plausible fit to the mutation length frequencies.
tuberculosis; Mycobacterium tuberculosis complex; DR locus; spoligotype; MIRU-VNTR; mutation
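The contiguous deletion assumption above can be made concrete: a candidate parent-child edge in the spoligoforest requires that the child lose, and never regain, spacers, and that all lost spacers lie within a single contiguous block of the parent's DR region. A minimal check over binary spacer strings (a sketch of the assumption, not the paper's three distance measures):

```python
def is_single_deletion_descendant(parent, child):
    """Can `child` arise from `parent` by one contiguous deletion event?
    Spoligotypes are strings of '1' (spacer present) / '0' (absent)."""
    if len(parent) != len(child):
        return False
    # Spacers lost and (illegally) gained in the putative child.
    lost = [i for i, (p, c) in enumerate(zip(parent, child))
            if p == '1' and c == '0']
    gained = any(p == '0' and c == '1' for p, c in zip(parent, child))
    if gained or not lost:
        return False   # convergent gain, or no mutation at all
    # Every spacer the parent carries between the first and last lost
    # position must also be lost: one contiguous deleted block.
    lo, hi = lost[0], lost[-1]
    return lost == [i for i in range(lo, hi + 1) if parent[i] == '1']
```

Real spoligotypes have 43 spacers; the 4-spacer strings in the test below are just for illustration.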
Genome-wide association studies (GWAS) have been successful in finding genetic determinants of obesity. To translate discovered genetic variants into new therapies or prevention strategies, molecular or physiological mechanisms need to be discovered. One strategy is to perform data mining of data sets with detailed phenotypic data, such as those present in dbGaP (the database of Genotypes and Phenotypes), for hypothesis generation. We propose a novel technique that combines the power and computational efficiency of existing Bayesian network (BN) learning algorithms with the statistical rigor of structural equation modeling (SEM) to produce an overall system that searches the space of potential networks and evaluates promising candidates using standard SEM model selection criteria. We demonstrate our method on a candidate-SNP data set from the AMERICO sample, a multi-ethnic cross-sectional cohort of roughly three hundred children with detailed obesity-related phenotypes, and show genetic mechanisms for three obesity-related SNPs.
We study the problem of learning classification models from complex multivariate temporal data encountered in electronic health record systems. The challenge is to define a good set of features that are able to represent well the temporal aspect of the data. Our method relies on temporal abstractions and temporal pattern mining to extract the classification features. Temporal pattern mining usually returns a large number of temporal patterns, most of which may be irrelevant to the classification task. To address this problem, we present the minimal predictive temporal patterns framework to generate a small set of predictive and non-spurious patterns. We apply our approach to the real-world clinical task of predicting patients who are at risk of developing heparin induced thrombocytopenia. The results demonstrate the benefit of our approach in learning accurate classifiers, which is a key step for developing intelligent clinical monitoring systems.
One of the major obstacles in computational modeling of a biological system is determining the large number of parameters in the mathematical equations representing biological properties of the system. To tackle this problem, we have developed a global optimization method for parameter estimation, called Discrete Selection Levenberg-Marquardt (DSLM). For fast computational convergence, DSLM takes a new approach to selecting optimal parameters in discrete spaces, whereas other global optimization methods, such as genetic algorithms and simulated annealing, use heuristics that do not guarantee convergence. As a specific application, we target the understanding of phagocyte transmigration, which is involved in the fibrosis process surrounding biomedical device implantation. The goal of the computational modeling is to construct an analyzer for understanding the nature of the system. The simulation of phagocyte transmigration also provides critical clues for assessing current knowledge of the system and for predicting yet-to-be-observed biological phenomena.
biological system modeling; nonlinear estimation; parameter estimation; reverse engineering
16S rRNA gene profiling has recently been boosted by the development of pyrosequencing methods. A common analysis is to group pyrosequences into Operational Taxonomic Units (OTUs), such that reads in an OTU are likely sampled from the same species. However, species diversity estimated from error-prone 16S rRNA pyrosequences may be inflated, because reads sampled from the same 16S rRNA gene may appear different, and current OTU inference approaches typically involve time-consuming pairwise/multiple distance calculation and clustering. I propose a novel approach, AbundantOTU, based on a Consensus Alignment (CA) algorithm that infers consensus sequences, each representing an OTU, by taking advantage of the sequence redundancy of abundant species. Pyrosequencing reads can then be recruited to the consensus sequences to give quantitative information for the corresponding species. As tested on 16S rRNA pyrosequence datasets from mock communities with known species, AbundantOTU rapidly reported the identified sequences of the source 16S rRNAs and the abundances of the corresponding species. AbundantOTU was also applied to 16S rRNA pyrosequence datasets derived from real microbial communities, and the results are in general agreement with previous studies.
16S rRNA gene; pyrosequencing; Operational Taxonomic Unit (OTU); abundant species
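The consensus construction at the heart of the CA algorithm can be illustrated in miniature: given reads already aligned column by column, a per-column majority vote yields the consensus sequence representing an OTU, and redundant reads from an abundant species reinforce the correct base at each position. This toy sketch assumes equal-length, pre-aligned, gap-free reads, which the real algorithm does not require:

```python
from collections import Counter

def consensus(aligned_reads):
    """Majority-vote consensus over a column-aligned set of reads;
    a toy stand-in for the CA algorithm's consensus construction."""
    columns = zip(*aligned_reads)                      # transpose to columns
    return ''.join(Counter(col).most_common(1)[0][0]   # most frequent base
                   for col in columns)
```

The vote suppresses isolated sequencing errors: a read with one wrong base still contributes correct votes at every other column, so the inferred OTU sequence converges on the true gene as coverage grows.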
We describe an algorithm for finding approximate seeds for DNA homology searches. In contrast to previous algorithms that use exact or spaced seeds, our approximate seeds may contain insertions and deletions. We present a generalized heuristic for finding such seeds efficiently and prove that the heuristic does not affect sensitivity. We show how to adapt this algorithm to work over the memory efficient suffix array with provably minimal overhead in running time.
We demonstrate the effectiveness of our algorithm on two tasks: whole-genome alignment of bacteria and alignment of the DNA sequences of 177 genes that are orthologous in human and mouse. We show that our algorithm achieves better sensitivity and uses less memory than other commonly used local alignment tools.
local alignment; inexact seeds; suffix array
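The key difference from exact or spaced seeds, tolerating insertions and deletions as well as mismatches, amounts to accepting seed matches within a bounded edit distance. A standard Levenshtein check illustrates the matching criterion; this is a sketch of that criterion only, not the paper's seeding heuristic or its suffix-array machinery:

```python
def within_edits(a, b, k):
    """Do two seed candidates match within k edits (substitutions,
    insertions, deletions)? Row-by-row Levenshtein DP, O(|a|*|b|)."""
    prev = list(range(len(b) + 1))        # distances from empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # delete from a
                           cur[-1] + 1,               # insert into a
                           prev[j - 1] + (ca != cb))) # match / substitute
        prev = cur
    return prev[-1] <= k
```

An exact seed is the special case k = 0; allowing k > 0 is what lets an approximate seed span a small indel between the two sequences.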
Citations are ubiquitous in scientific articles and play important roles in representing the semantic content of a full-text biomedical article. In this work, we manually examined full-text biomedical articles to analyze the semantic content of their citations. After developing a citation relation schema and annotation guideline, our pilot annotation achieved an overall agreement of 0.71; here we report on the research challenges and the lessons we have learned while trying to overcome them. Our work is a first step toward automatic citation classification in full-text biomedical articles, which may contribute to many text mining tasks, including information retrieval, extraction, summarization, and question answering.
In homology modeling of protein structures, it is typical to find templates through a sequence search against a database of proteins with known structures. In more complicated modeling cases, such as modeling a protein structure in contact with a ligand, sequence information alone may not be enough, and more biological information is required for a successful modeling process. SCOP and Pfam are two databases providing protein domain information that can be utilized in complex protein structure modeling. However, due to the manually curated nature of both databases, they fail to provide timely coverage of the protein sequences in the Protein Data Bank (PDB). In this paper, we introduce a new relational database, IDOPS, which integrates sequence and biological information extracted from remediated PDB files with protein domain information generated from HMM profiles of Pfam families. With a carefully designed protocol, this database is updated regularly, and a high coverage rate of PDB entries is guaranteed.
Enabling data analysis in large data repositories for high-throughput experimental data, such as gene microarrays and ChIP-seq, is challenging. In this paper, we discuss three methods for integrating QUEST, a data repository for epigenetic experiments, with the web-based data analysis platform GenePattern. These methods are universal and can serve as an exemplary implementation for the many similar database systems facing the same dilemma of integrating data analysis tools.
high-throughput database; GenePattern; ChIP-seq
Genome-wide association studies (GWAS) have been widely applied to identify informative SNPs associated with common and complex diseases. Beyond single-SNP analysis, interactions between SNPs are believed to play an important role in disease risk due to the complex networking of genetic regulation. While many approaches have been proposed for detecting SNP interactions, the relative performance and merits of these methods in practice are largely unclear. In this paper, a ground-truth-based comparative study is reported involving nine popular SNP interaction detection methods on realistic simulated datasets. The results provide general characteristics of, and guidelines on, these methods that may be informative to biological investigators.
Genome-wide association study; single-nucleotide polymorphism; SNP interaction
This paper compares the performance of keyword-based and machine learning-based chest x-ray report classification for acute lung injury (ALI). ALI mortality is approximately 30 percent; the high mortality is, in part, a consequence of delayed manual chest x-ray classification. An automated system could reduce the time to recognize ALI and lead to reductions in mortality. For our study, 96 and 857 chest x-ray reports in two corpora were labeled by domain experts for ALI. We developed a keyword-based and a Maximum Entropy-based classification system. Word unigrams and character n-grams provided the features for the machine learning system. The Maximum Entropy algorithm with character 6-grams achieved the highest performance (recall = 0.91, precision = 0.90, F-measure = 0.91) on the 857-report corpus. This study has shown that, for the classification of ALI chest x-ray reports, the machine learning approach is superior to the keyword-based system and achieves results comparable to the highest-performing physician annotators.
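The feature extraction behind the best-performing configuration is straightforward to sketch: character n-grams (6-grams in the study's top result) and word unigrams are collected as bags of counts from each report before being fed to the Maximum Entropy classifier. The functions below are an illustrative assumption about that setup, not the authors' code:

```python
from collections import Counter

def char_ngrams(text, n=6):
    """Bag of character n-grams from a report, lowercased;
    n=6 matches the best-performing setting reported."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def word_unigrams(text):
    """Bag of lowercased word unigrams (whitespace tokenization)."""
    return Counter(text.lower().split())
```

Character n-grams span token boundaries and spelling variants ("opacities", "opacity"), which is one plausible reason they outperformed word unigrams on noisy report text.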
We present a novel Bayesian network (BN) to classify strains of Mycobacterium tuberculosis Complex (MTBC) into six major genetic lineages using mycobacterial interspersed repetitive units (MIRUs), a high-throughput biomarker. MTBC is the causative agent of tuberculosis (TB), which remains one of the leading causes of disease and morbidity world-wide. DNA fingerprinting methods such as MIRU are key components of modern TB control and tracking. The BN achieves high accuracy on four large MTBC genotype collections consisting of over 4700 distinct 12-loci MIRU genotypes. The BN captures distinct MIRU signatures associated with each lineage, explaining the excellent performance of the BN. The errors in the BN support the need for additional biomarkers such as the expanded 24-loci MIRU used in CDC genotyping labs since May 2009. The conditional independence assumption of each locus given the lineage makes the BN easily extensible to additional MIRU loci and other biomarkers.
tuberculosis; MIRU-VNTR; Bayesian network; lineages
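The BN's conditional independence assumption, each MIRU locus independent given the lineage, is exactly the naive Bayes model, which can be sketched directly. The toy example in the test uses 2 loci instead of the paper's 12, with made-up copy numbers and lineage labels:

```python
from collections import defaultdict
from math import log

def train_nb(genotypes, lineages, alpha=1.0):
    """Fit per-lineage, per-locus copy-number counts with Laplace
    smoothing; genotypes are tuples of MIRU copy numbers."""
    counts = defaultdict(lambda: defaultdict(lambda: defaultdict(float)))
    prior = defaultdict(float)
    for g, lin in zip(genotypes, lineages):
        prior[lin] += 1
        for locus, val in enumerate(g):
            counts[lin][locus][val] += 1
    return prior, counts, alpha

def classify(g, model, values=range(16)):
    """Assign the lineage with the highest log posterior, multiplying
    independent per-locus likelihoods (the naive Bayes assumption)."""
    prior, counts, alpha = model
    total = sum(prior.values())
    def logpost(lin):
        lp = log(prior[lin] / total)
        for locus, val in enumerate(g):
            c = counts[lin][locus]
            lp += log((c[val] + alpha) / (prior[lin] + alpha * len(values)))
        return lp
    return max(prior, key=logpost)
```

The factorized likelihood is what makes the model "easily extensible" as the abstract notes: adding a 24-loci genotype or another biomarker only adds factors to the product, with no change to the model structure.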
Thousands of biochemical interactions are available for download from curated databases such as Reactome, the Pathway Interaction Database, and other sources in the Biological Pathway Exchange (BioPAX) format. However, the BioPAX ontology does not encode the information necessary for kinetic modeling and simulation. The current standard for kinetic modeling is the Systems Biology Markup Language (SBML), but only a small number of models are available in SBML format in public repositories. Additionally, reusing and merging SBML models presents a significant challenge, because each element often has a value only in the context of the given model, and information encoding biological meaning is absent. We describe a software system that enables a variety of operations facilitating the use of BioPAX data to create kinetic models that can be visualized, edited, and simulated using the Virtual Cell (VCell), including improved conversion to SBML (for use with other simulation tools that support this format).