Motivation: The synapse is integral to the function of the brain and may be an important source of dysfunction underlying many neuropsychiatric disorders. Consequently, it is an excellent candidate for large-scale genomic and proteomic study. However, while the tools and databases available for the annotation of high-throughput DNA and protein are generally robust, a comprehensive resource dedicated to the integration of information about the synapse is lacking.
Results: We present an integrated database, called SynaptomeDB, to retrieve and annotate genes comprising the synaptome. These genes encode components of the synapse including neurotransmitters and their receptors, adhesion/cytoskeletal proteins, scaffold proteins and membrane transporters. SynaptomeDB integrates diverse and complex data sources for synaptic genes and proteins.
Availability:
http://psychiatry.igm.jhmi.edu/SynaptomeDB/
Contact:
mpirooz1@jhmi.edu
Supplementary information:
Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/bts040
PMCID: PMC3307115
PMID: 22285564
Summary: Meta-analysis across genome-wide association studies is a common approach for discovering genetic associations. However, in some meta-analysis efforts, individual-level data cannot be broadly shared by study investigators due to privacy and Institutional Review Board concerns. In such cases, researchers cannot confirm that each study represents a unique group of people, leading to potentially inflated test statistics and false positives. To resolve this problem, we created a software tool, Gencrypt, which utilizes a security protocol known as one-way cryptographic hashes to allow overlapping participants to be identified without sharing individual-level data.
Availability: Gencrypt is freely available under the GNU general public license v3 at http://www.broadinstitute.org/software/gencrypt/
Contact:
joelh@broadinstitute.org
Supplementary information:
Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/bts045
PMCID: PMC3307118
PMID: 22302573
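The one-way hashing idea described in the Gencrypt summary can be illustrated with a minimal sketch. This is not Gencrypt's actual protocol: the function names, the shared salt, the choice of SHA-256 and the exact-match comparison are all assumptions made for illustration. Each site hashes a salted genotype string per participant and shares only the digests, so matching digests flag likely duplicates without exposing genotypes.

```python
import hashlib


def hash_genotypes(genotypes, salt):
    """Return a SHA-256 hex digest of a salted genotype string.

    `genotypes` is a list of calls at an agreed set of SNPs (hypothetical
    input format); `salt` is a secret shared by the collaborating sites.
    """
    payload = (salt + "|" + "".join(genotypes)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def shared_hashes(study_a, study_b, salt):
    """Return the digests that occur in both studies."""
    hashes_a = {hash_genotypes(g, salt) for g in study_a.values()}
    hashes_b = {hash_genotypes(g, salt) for g in study_b.values()}
    return hashes_a & hashes_b


# Toy data: participant a1 and participant b1 carry identical genotypes.
study_a = {"a1": ["AA", "AG", "GG"], "a2": ["AG", "AG", "TT"]}
study_b = {"b1": ["AA", "AG", "GG"], "b2": ["CC", "AG", "TT"]}

overlap = shared_hashes(study_a, study_b, salt="shared-secret")
print(len(overlap))  # number of digests seen in both studies
```

Exact-match hashing of this kind breaks on a single genotyping error or missing call, so a practical tool would need some tolerance for such differences; that part is not sketched here.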
Motivation: Selecting a small number of signature genes for accurate classification of samples is essential for the development of diagnostic tests. However, many genes are highly correlated in gene expression data, and hence, many possible sets of genes are potential classifiers. Because treatment outcomes are poor in advanced chronic myeloid leukemia (CML), we hypothesized that expression of classifiers of advanced phase CML when detected in early CML [chronic phase (CP) CML], correlates with subsequent poorer therapeutic outcome.
Results: We developed a method that integrates gene expression data with expert knowledge and predicted functional relationships using iterative Bayesian model averaging. Applying our integrated method to CML, we identified small sets of signature genes that are highly predictive of disease phases and that are more robust and stable than using expression data alone. The accuracy of our algorithm was evaluated using cross-validation on the gene expression data. We then tested the hypothesis that gene sets associated with advanced phase CML would predict relapse after allogeneic transplantation in 176 independent CP CML cases. Our gene signatures of advanced phase CML are predictive of relapse even after adjustment for known risk factors associated with transplant outcomes.
Availability: The source codes and data sets used are available from the web site http://expression.washington.edu/publications/kayee/integratedBMA.
Contact:
kayee@u.washington.edu
Supplementary information:
Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/bts059
PMCID: PMC3307121
PMID: 22296787
Motivation: Probabilistic logic programming offers a powerful way to describe and evaluate structured statistical models. To investigate the practicality of probabilistic logic programming for structure learning in bioinformatics, we undertook a simplified bacterial gene-finding benchmark in PRISM, a probabilistic dialect of Prolog.
Results: We evaluate Hidden Markov Model structures for bacterial protein-coding gene potential, including a simple null model structure, three structures based on existing bacterial gene finders and two novel model structures. We test standard versions as well as ADPH length modeling and three-state versions of the five model structures. The models are all represented as probabilistic logic programs and evaluated using the PRISM machine learning system in terms of statistical information criteria and gene-finding prediction accuracy, in two bacterial genomes. Neither of our implementations of the two currently most used model structures is the best performing in terms of statistical information criteria or prediction performance, suggesting that better-fitting models might be achievable.
Availability: The source code of all PRISM models, data and additional scripts are freely available for download at: http://github.com/somork/codonhmm.
Contact:
soer@ruc.dk
Supplementary information:
Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btr698
PMCID: PMC3289911
PMID: 22215819
Motivation: The folding free energy is an important characteristic of protein stability and is directly related to a protein's wild-type function. Changes in protein stability due to naturally occurring missense mutations typically cause disease. Single point mutations made in vitro are frequently used to assess the contribution of a given amino acid to the stability of the protein. In both cases, it is desirable to predict the change in folding free energy upon single point mutation, either to provide insight into the molecular mechanism of the change or to design new experimental studies.
Results: We report an approach that predicts the free energy change upon single point mutation by utilizing the 3D structure of the wild-type protein. It is based on a variation of the molecular mechanics Generalized Born (MMGB) method, scaled with optimized parameters (sMMGB) and utilizing a specific model of the unfolded state. The corresponding mutations are built in silico and the predictions are tested against a large dataset of 1109 mutations with experimentally measured changes of the folding free energy. Benchmarking resulted in a root mean square deviation of 1.78 kcal/mol, and the slope of the linear regression fit between the experimental data and the calculations was 1.04. The sMMGB is compared with other leading methods of predicting folding free energy changes upon single mutations, and the results are discussed with respect to various parameters.
Availability: All the pdb files we used in this article can be downloaded from http://compbio.clemson.edu/downloadDir/mentaldisorders/sMMGB_pdb.rar
Contact:
ealexov@clemson.edu
Supplementary information:
Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/bts005
PMCID: PMC3289912
PMID: 22238268
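The two benchmark statistics quoted in the sMMGB results, the root mean square deviation and the slope of a linear regression fit, can be computed as in the sketch below. The numbers are made-up toy values, not data from the study, and regressing the calculated values on the experimental ones (rather than the reverse) is an assumption, since the abstract does not say which variable is on which axis.

```python
import math


def rmsd(predicted, experimental):
    """Root mean square deviation between paired values (kcal/mol)."""
    squared = [(p - e) ** 2 for p, e in zip(predicted, experimental)]
    return math.sqrt(sum(squared) / len(squared))


def ols_slope(x, y):
    """Slope of the ordinary least-squares line fitted to y as a function of x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


# Toy folding free energy changes (hypothetical values).
experimental = [0.5, 1.0, 2.0, 3.5]
predicted = [1.0, 1.5, 2.5, 4.0]  # each prediction is 0.5 too high

print(rmsd(predicted, experimental))       # 0.5
print(ols_slope(experimental, predicted))  # 1.0
```

A slope near 1 with a non-zero RMSD, as in this toy case, indicates predictions that track the experimental trend but carry a systematic or random offset.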
Motivation: Molecular interaction information, such as protein–protein interactions and protein–small molecule interactions, is indispensable for understanding the mechanism of biological processes and discovering treatments for diseases. Many databases have been built by manual annotation of literature to organize such information into structured form. However, most databases focus on only one type of interactions, which are often not well annotated and integrated with related functional information.
Results: In this study, we integrate molecular interaction information from literature by automatic information extraction and from manually annotated databases. We further integrate the relationships between protein/gene and other bio-entity terms including gene ontology terms, pathways, species and diseases to build an integrated molecular interaction database (IMID). Interactions can be selected by their associated probabilities. IMID allows complex and versatile queries for context-specific molecular interactions, which are not available currently in other molecular interaction databases.
Availability: The database is located at www.integrativebiology.org.
Contact:
jinfeng@stat.fsu.edu
doi:10.1093/bioinformatics/bts010
PMCID: PMC3289914
PMID: 22238258
Summary: The Illumina Infinium HumanMethylation450 BeadChip is a newly designed high-density microarray for quantifying the methylation level of over 450 000 CpG sites within the human genome. Illumina Methylation Analyzer (IMA) is a computational package designed to automate the pipeline for exploratory analysis and summarization of site-level and region-level methylation changes in epigenetic studies utilizing the 450K DNA methylation microarray. The pipeline loads the data from the Illumina platform and provides user-customized functions commonly required to perform exploratory methylation analysis for individual sites as well as annotated regions.
Availability: IMA is implemented in the R language and is freely available from http://www.rforge.net/IMA.
Contact:
song.liu@roswellpark.org
doi:10.1093/bioinformatics/bts013
PMCID: PMC3289916
PMID: 22253290
Motivation: Pathway diagrams from PubMed and the World Wide Web (WWW) contain valuable, highly curated information that is difficult to access without tools specifically designed and customized for the biological semantics and high-content density of the images. There is currently no search engine or tool that can analyze pathway images, extract their pathway components (molecules, genes, proteins, organelles, cells, organs, etc.) and indicate their relationships.
Results: Here, we describe a resource of pathway diagrams retrieved from article and web-page images through optical character recognition, in conjunction with data mining and data integration methods. The recognized pathways are integrated into the BiologicalNetworks research environment linking them to a wealth of data available in the BiologicalNetworks' knowledgebase, which integrates data from >100 public data sources and the biomedical literature. Multiple search and analytical tools are available that allow the recognized cellular pathways, molecular networks and cell/tissue/organ diagrams to be studied in the context of integrated knowledge, experimental data and the literature.
Availability: BiologicalNetworks software and the pathway repository are freely available at www.biologicalnetworks.org.
Contact: baitaluk@sdsc.edu
Supplementary information:
Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/bts018
PMCID: PMC3289920
PMID: 22267504
Summary: We have developed medpie, a software package for preparing medical message board corpora and extracting from them patient mentions and statistics for drugs, herbs and adverse effects experienced. The package is divided into web-crawling, HTML-cleaning, de-identification and information extraction modules. It also includes a sample controlled vocabulary of drugs, herbs and adverse effect terms.
Availability:
http://www.cis.upenn.edu/~ungar/medpie.zip
Dependencies: Python 2.6 or 2.7
Contact:
ungar@cis.upenn.edu; adrianb@mail.med.upenn.edu
doi:10.1093/bioinformatics/bts030
PMCID: PMC3289922
PMID: 22262673
Summary: AnnTools is a versatile bioinformatics application designed for comprehensive annotation of a full spectrum of human genome variation: novel and known single-nucleotide substitutions (SNP/SNV), short insertions/deletions (INDEL) and structural variants/copy number variation (SV/CNV). The variants are interpreted by interrogating data compiled from 15 constantly updated sources. In addition to detailed functional characterization of the coding variants, AnnTools searches for overlaps with regulatory elements, disease/trait associated loci, known segmental duplications and artifact prone regions, thereby offering an integrated and comprehensive analysis of genomic data. The tool conveniently accepts user-provided tracks for custom annotation and offers flexibility in input data formats. The output is generated in the universal Variant Call Format. High annotation speed makes AnnTools suitable for high-throughput sequencing facilities, while a low-memory footprint and modest CPU requirements allow it to operate on a personal computer. The application is freely available for public use; the package includes installation scripts and a set of helper tools.
Availability:
http://anntools.sourceforge.net/
Contact:
vladimir.makarov@mssm.edu; chris.yoon@mssm.edu
Supplementary information:
Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/bts032
PMCID: PMC3289923
PMID: 22257670
Motivation: Multicellular systems, such as tissues, are composed of different cell types that form a heterogeneous community. Behavior of these systems is determined by complex regulatory networks within (intracellular networks) and between (intercellular networks) cells. Increasingly more studies are applying genome-wide experimental approaches to delineate the contributions of individual cell types (e.g. stromal, epithelial, vascular cells) to collective behavior of heterogeneous cell communities (e.g. tumors). Although many computational methods have been developed for analyses of intracellular networks based on genome-scale data, these efforts have not been extended toward analyzing genomic data from heterogeneous cell communities.
Results: Here, we propose a network-based approach for analyses of genome-scale data from multiple cell types to extract community-wide molecular networks comprised of intra- and intercellular interactions. Intercellular interactions in this model can be physical interactions between proteins or indirect interactions mediated by secreted metabolites of neighboring cells. Applying this method on data from a recent study on xenograft mouse models of human lung adenocarcinoma, we uncover an extensive network of intra- and intercellular interactions involved in the acquired resistance to angiogenesis inhibitors.
Contact:
kakajan.komurov@cchmc.org
Supplementary information:
Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btr718
PMCID: PMC3338330
PMID: 22210865
Motivation: A review of the available single nucleotide polymorphism (SNP) calling procedures for Illumina high-throughput sequencing (HTS) platform data reveals that most rely mainly on base-calling and mapping qualities as sources of error when calling SNPs. Thus, errors not involved in base-calling or alignment, such as those in genomic sample preparation, are not accounted for.
Results: A novel method of consensus and SNP calling, Genotype Model Selection (GeMS), is given which accounts for the errors that occur during the preparation of the genomic sample. Simulations and real data analyses indicate that GeMS has the best performance balance of sensitivity and positive predictive value among the tested SNP callers.
Availability: The GeMS package can be downloaded from https://sites.google.com/a/bioinformatics.ucr.edu/xinping-cui/home/software or http://computationalbioenergy.org/software.html
Contact:
xinping.cui@ucr.edu
Supplementary information:
Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/bts001
PMCID: PMC3338331
PMID: 22253293
Motivation: ChIPseq is rapidly becoming a common technique for investigating protein–DNA interactions. However, results from individual experiments provide a limited understanding of chromatin structure, as various chromatin factors cooperate in complex ways to orchestrate transcription. In order to quantify chromatin interactions, it is thus necessary to devise a robust similarity metric applicable to ChIPseq data. Unfortunately, moving past simple overlap calculations to give statistically rigorous comparisons of ChIPseq datasets often involves arbitrary choices of distance metrics, with significance being estimated by computationally intensive permutation tests whose statistical power may be sensitive to non-biological experimental and post-processing variation.
Results: We show that it is in fact possible to compare ChIPseq datasets through the efficient computation of exact P-values for proximity. Our method is insensitive to non-biological variation in datasets such as peak width, and can rigorously model peak location biases by evaluating similarity conditioned on a restricted set of genomic regions (such as mappable genome or promoter regions).
Applying our method to the well-studied dataset of Chen et al. (2008), we elucidate novel interactions which conform well with our biological understanding. By comparing ChIPseq data in an asymmetric way, we are able to observe clear interaction differences between cofactors such as p300 and factors that bind DNA directly.
Availability: Source code is available for download at http://sonorus.princeton.edu/IntervalStats/IntervalStats.tar.gz
Contact:
ogt@cs.princeton.edu
Supplementary information:
Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/bts009
PMCID: PMC3339511
PMID: 22262674
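The notion of an exact proximity P-value conditioned on a restricted domain, as described in the ChIPseq comparison results, can be illustrated with a deliberately simplified sketch. It treats peaks as single positions rather than intervals and assumes a uniform null over the allowed positions; the function names and this null model are assumptions for illustration and are not taken from the IntervalStats source code.

```python
def nearest_distance(position, references):
    """Distance from a position to the closest reference position."""
    return min(abs(position - r) for r in references)


def proximity_p_value(query, references, domain):
    """Exact probability that a position drawn uniformly from `domain`
    lies at least as close to a reference as `query` does."""
    observed = nearest_distance(query, references)
    hits = sum(1 for x in domain if nearest_distance(x, references) <= observed)
    return hits / len(domain)


# Toy example: a 100-position domain with reference peaks at 20 and 70.
domain = range(0, 100)
references = [20, 70]

p = proximity_p_value(22, references, domain)
print(p)  # 0.1: ten of the 100 positions lie within 2 of a reference
```

Because the probability is obtained by enumerating the domain rather than by permutation, it is exact under the stated null. Restricting `domain` (for example, to promoter positions only) changes the denominator and so conditions the comparison on that region set.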
Motivation: Peptide detection is a crucial step in mass spectrometry (MS) based proteomics. Most existing algorithms are based upon greedy isotope template matching and thus may be prone to error propagation and ineffective at detecting overlapping peptides. In addition, existing algorithms usually process different charge states separately, ignoring useful information that can be drawn from other charge states, which may lead to poor detection of low abundance peptides.
Results: BPDA2d models spectra as a mixture of candidate peptide signals and systematically evaluates all possible combinations of peptide candidates to interpret the given spectra. For each candidate, BPDA2d takes into account its elution profile, charge state distribution and isotope pattern, and it combines all evidence to infer the candidate's signal and existence probability. By piecing all evidence together, especially by deriving information across charge states, low abundance peptides can be better identified and peptide detection rates can be improved. Instead of local template matching, BPDA2d performs global optimization for all candidates and systematically optimizes their signals. Since BPDA2d looks for the optimum among all possible interpretations of the given spectra, it is capable of handling complex spectra where features overlap. BPDA2d estimates the posterior existence probability of detected peptides, which can be directly used for probability-based evaluation in subsequent processing steps. Our experiments indicate that BPDA2d outperforms state-of-the-art detection methods on both simulated data and real liquid chromatography–mass spectrometry data, according to sensitivity and detection accuracy.
Availability: The BPDA2d software package is available at http://gsp.tamu.edu/Publications/supplementary/sun11a/
Contact:
Michelle.Zhang@utsa.edu; edward@ece.tamu.edu
Supplementary information:
Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btr675
PMCID: PMC3278754
PMID: 22155863
Motivation: High-dimensional data such as microarrays have created new challenges for traditional statistical methods. One such example is class prediction with high-dimension, low-sample size data. Due to the small sample size, the sample mean estimates are usually unreliable. As a consequence, the performance of class prediction methods that use the sample mean may also be unsatisfactory. To obtain more accurate parameter estimates, statistical methods such as regularization through shrinkage are often desired.
Results: In this article, we investigate the family of shrinkage estimators for the mean value under the quadratic loss function. The optimal shrinkage parameter is proposed under the scenario when the sample size is fixed and the dimension is large. We then construct a shrinkage-based diagonal discriminant rule by replacing the sample mean by the proposed shrinkage mean. Finally, we demonstrate via simulation studies and real data analysis that the proposed shrinkage-based rule outperforms its original competitor in a wide range of settings.
Contact:
tongt@hkbu.edu.hk
doi:10.1093/bioinformatics/btr690
PMCID: PMC3278755
PMID: 22171335
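The construction described in the shrinkage results, replacing the sample mean with a shrunken mean inside a diagonal discriminant rule, can be sketched as follows. The optimal shrinkage parameter derived in the article is not given in the abstract, so the sketch takes a user-supplied `factor` and shrinks toward the origin; both choices, along with equal class priors and a shared per-feature variance, are assumptions for illustration only.

```python
def shrink_mean(sample_mean, factor):
    """Shrink a sample mean vector toward the origin by `factor` in [0, 1]."""
    return [factor * m for m in sample_mean]


def diagonal_score(x, mean, variances):
    """Variance-scaled squared distance that ignores feature correlations."""
    return sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, variances))


def classify(x, class_means, variances):
    """Assign x to the class whose (shrunken) mean gives the smallest score."""
    return min(
        class_means,
        key=lambda label: diagonal_score(x, class_means[label], variances),
    )


# Toy two-feature problem with hypothetical sample means.
sample_means = {"A": [2.0, 0.0], "B": [-2.0, 0.0]}
factor = 0.8  # placeholder value, not the optimal parameter from the article
shrunken = {label: shrink_mean(m, factor) for label, m in sample_means.items()}
variances = [1.0, 1.0]

label = classify([1.0, 0.3], shrunken, variances)
print(label)  # A
```

Setting `factor` to 1.0 recovers the ordinary diagonal discriminant rule based on the unshrunken sample means, which is the baseline the shrinkage-based rule is compared against.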
Summary: MicroRNAs (miRNAs) are small regulatory molecules that act by mRNA degradation or via translational repression. Although many miRNAs are ubiquitously expressed, a small subset have differential expression patterns that may give rise to tissue-specific complexes.
Motivation: This work studies gene targeting patterns amongst miRNAs with differential expression profiles, and links this to control and regulation of protein complexes.
Results: We find that, when a pair of miRNAs are not expressed in the same tissues, there is a higher tendency for them to target the direct partners of the same hub proteins. At the same time, they also avoid targeting the same set of hub-spokes. Moreover, the complexes corresponding to these hub-spokes tend to be specific and nonoverlapping. This suggests that the effect of miRNAs on the formation of complexes is specific.
Contact:
wongls@comp.nus.edu.sg
Supplementary information:
Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btr693
PMCID: PMC3278756
PMID: 22180412
Motivation: A plethora of bioinformatics analysis has led to the discovery of numerous gene sets, which can be interpreted as discrete measurements emitted from latent signaling pathways. Their potential to infer signaling pathway structures, however, has not been sufficiently exploited. Existing methods accommodating discrete data do not explicitly consider signal cascading mechanisms that characterize a signaling pathway. Novel computational methods are thus needed to fully utilize gene sets and broaden the scope from focusing only on pairwise interactions to the more general cascading events in the inference of signaling pathway structures.
Results: We propose a gene set based simulated annealing (SA) algorithm for the reconstruction of signaling pathway structures. A signaling pathway structure is a directed graph containing up to a few hundred nodes and many overlapping signal cascades, where each cascade represents a chain of molecular interactions from the cell surface to the nucleus. Gene sets in our context refer to discrete sets of genes participating in signal cascades, the basic building blocks of a signaling pathway, with no prior information about gene orderings in the cascades. From a compendium of gene sets related to a pathway, SA aims to search for signal cascades that characterize the optimal signaling pathway structure. In the search process, the extent of overlap among signal cascades is used to measure the optimality of a structure. Throughout, we treat gene sets as random samples from a first-order Markov chain model. We evaluated the performance of SA in three case studies. In the first study conducted on 83 KEGG pathways, SA demonstrated a significantly better performance than Bayesian network methods. Since both SA and Bayesian network methods accommodate discrete data, use a ‘search and score’ network learning strategy and output a directed network, they can be compared in terms of performance and computational time. In the second study, we compared SA and Bayesian network methods using four benchmark datasets from DREAM. In our final study, we showcased two context-specific signaling pathways activated in breast cancer.
Availability: Source codes are available from http://dl.dropbox.com/u/16000775/sa_sc.zip
Contact:
dzhu@wayne.edu
Supplementary information:
Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btr696
PMCID: PMC3278757
PMID: 22199386
Summary:
CLARE is a computational method designed to reveal sequence encryption of tissue-specific regulatory elements. Starting with a set of regulatory elements known to be active in a particular tissue/process, it learns the sequence code of the input set and builds a predictive model from features specific to those elements. The resulting model can then be applied to user-supplied genomic regions to identify novel candidate regulatory elements. CLARE's model also provides a detailed analysis of transcription factors that most likely bind to the elements, making it an invaluable tool for understanding mechanisms of tissue-specific gene regulation.
Availability: CLARE is freely accessible at http://clare.dcode.org/.
Contact:
taherl@ncbi.nlm.nih.gov; ovcharen@nih.gov
Supplementary information:
Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btr704
PMCID: PMC3278760
PMID: 22199387
Summary: ART is a set of simulation tools that generate synthetic next-generation sequencing reads. This functionality is essential for testing and benchmarking tools for next-generation sequencing data analysis including read alignment, de novo assembly and genetic variation discovery. ART generates simulated sequencing reads by emulating the sequencing process with built-in, technology-specific read error models and base quality value profiles parameterized empirically in large sequencing datasets. We currently support all three major commercial next-generation sequencing platforms: Roche's 454, Illumina's Solexa and Applied Biosystems' SOLiD. ART also allows the flexibility to use customized read error model parameters and quality profiles.
Availability: Both source and binary software packages are available at http://www.niehs.nih.gov/research/resources/software/art
Contact:
weichun.huang@nih.gov; gabor.marth@bc.edu
Supplementary information:
Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btr708
PMCID: PMC3278762
PMID: 22199392
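The general idea of simulating reads with a position-specific error model, as described in the ART summary, can be shown with a minimal sketch. This is not ART's implementation: the function name, the substitution-only error model and the toy error profile are assumptions, and no base quality values are generated.

```python
import random


def simulate_reads(reference, read_length, n_reads, error_profile, seed=0):
    """Sample reads from `reference` and add substitution errors.

    `error_profile[i]` is the assumed probability of a substitution at
    read position i. Returns a list of (start, read) tuples.
    """
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(reference) - read_length + 1)
        bases = []
        for offset in range(read_length):
            base = reference[start + offset]
            if rng.random() < error_profile[offset]:
                # Replace the true base with one of the three other bases.
                base = rng.choice([b for b in "ACGT" if b != base])
            bases.append(base)
        reads.append((start, "".join(bases)))
    return reads


reference = "ACGTACGTAGCTAGCTAGGATCCA"
# Hypothetical profile: error probability rises toward the 3' end of the read.
profile = [0.01, 0.01, 0.02, 0.02, 0.05, 0.05, 0.10, 0.10]

for start, read in simulate_reads(reference, 8, 3, profile, seed=42):
    print(start, read)
```

Keeping the true start position with each read is what makes simulated data useful for benchmarking: an aligner's reported position can be checked directly against it.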
Summary: VarSifter is a graphical software tool for desktop computers that allows investigators of varying computational skills to easily and quickly sort, filter, and sift through sequence variation data. A variety of filters and a custom query framework allow filtering based on any combination of sample and annotation information. By simplifying visualization and analyses of exome-scale sequence variation data, this program will help bring the power and promise of massively-parallel DNA sequencing to a broader group of researchers.
Availability and Implementation: VarSifter is written in Java, and is freely available in source and binary versions, along with a User Guide, at http://research.nhgri.nih.gov/software/VarSifter/.
Contact:
mullikin@mail.nih.gov
Supplementary Information: Additional figures and methods available online at the journal's website.
doi:10.1093/bioinformatics/btr711
PMCID: PMC3278764
PMID: 22210868
Summary: Next-generation sequencing is rapidly becoming the approach of choice for transcriptional analysis experiments. Substantial advances have been achieved in computational approaches to support these technologies. These approaches typically rely on existing transcript annotations, introducing a bias towards known genes, require specific experimental design and computational resources, or focus only on identification of splice variants (ignoring other biologically relevant transcribed features contained within the data that may be important for downstream analysis). Biologically relevant transcribed features also include large and small non-coding RNA, new transcription start sites, alternative promoters, RNA editing and processing of coding transcripts. Also, many existing solutions lack accessible interfaces required for wide scale adoption. We present a user-friendly, rapid and computation-efficient feature annotation framework (RNA-eXpress) that enables identification of transcripts and other genomic and transcriptional features independently of current annotations. RNA-eXpress accepts mapped reads in the standard binary alignment (BAM) format and produces a study-specific feature annotation in GTF format, comparison statistics, sequence extraction and feature counts. The framework is designed to be easily accessible while allowing advanced users to integrate new feature-identification algorithms through simple class extension, thus facilitating expansion to novel feature types or identification of study-specific feature types.
Availability and implementation: RNA-eXpress software, source code, user manuals, supporting tutorials, developer guides and example data are available at http://www.rnaexpress.org.
Contact:
paul.hertzog@monash.edu
Supplementary information:
Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btt034
PMCID: PMC3597146
PMID: 23396121
Summary: Reduced representation bisulfite sequencing (RRBS) is a powerful yet cost-efficient method for studying DNA methylation on a genomic scale. RRBS involves restriction-enzyme digestion, bisulfite conversion and size selection, resulting in DNA sequencing data that require special bioinformatic handling. Here, we describe RRBSMAP, a short-read alignment tool that is designed for handling RRBS data in a user-friendly and scalable way. RRBSMAP uses wildcard alignment, and avoids the need for any preprocessing or post-processing steps. We benchmarked RRBSMAP against a well-validated MAQ-based pipeline for RRBS read alignment and observed similar accuracy but much improved runtime performance, easier handling and better scaling to large sample sets. In summary, RRBSMAP removes bioinformatic hurdles and reduces the computational burden of large-scale epigenome association studies performed with RRBS.
Availability: http://rrbsmap.computational-epigenetics.org/ http://code.google.com/p/bsmap/
Contact: wl1@bcm.tmc.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btr668
PMCID: PMC3268241
PMID: 22155871
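Wildcard alignment, which the RRBSMAP summary names as its alignment strategy, can be illustrated with a small sketch. Bisulfite treatment converts unmethylated cytosines so that they are read as T, so a T in a read is allowed to match a C in the reference. The brute-force scan, the function names and the forward-strand-only handling below are assumptions for illustration and do not reflect RRBSMAP's actual algorithm or data structures.

```python
def bisulfite_match(ref_base, read_base):
    """A read base matches if it is identical, or if a reference C is read as T."""
    return read_base == ref_base or (ref_base == "C" and read_base == "T")


def wildcard_mismatches(reference, read, start):
    """Count mismatches of `read` placed at `start`, treating C->T as a match."""
    window = reference[start:start + len(read)]
    return sum(1 for r, q in zip(window, read) if not bisulfite_match(r, q))


def best_position(reference, read):
    """Return the leftmost start position with the fewest wildcard mismatches."""
    positions = range(len(reference) - len(read) + 1)
    return min(positions, key=lambda s: wildcard_mismatches(reference, read, s))


reference = "ACGTCCGGATCG"
read = "TTGGAT"  # derived from CCGGAT at position 4 with both C bases converted

pos = best_position(reference, read)
print(pos, wildcard_mismatches(reference, read, pos))  # 4 0
```

Because the reference is left unconverted, a C that survives in an aligned read can be interpreted as methylated and a T over a reference C as unmethylated, without separate preprocessing of reads or genome.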
Karnovsky, Alla | Weymouth, Terry | Hull, Tim | Tarcea, V. Glenn | Scardoni, Giovanni | Laudanna, Carlo | Sartor, Maureen A. | Stringer, Kathleen A. | Jagadish, H. V. | Burant, Charles | Athey, Brian | Omenn, Gilbert S.
Motivation: Metabolomics is a rapidly evolving field that holds promise to provide insights into genotype–phenotype relationships in cancers, diabetes and other complex diseases. One of the major informatics challenges is providing tools that link metabolite data with other types of high-throughput molecular data (e.g. transcriptomics, proteomics), and incorporate prior knowledge of pathways and molecular interactions.
Results: We describe a new, substantially redesigned version of our tool Metscape that allows users to enter experimental data for metabolites, genes and pathways and display them in the context of relevant metabolic networks. Metscape 2 uses an internal relational database that integrates data from KEGG and EHMN databases. The new version of the tool allows users to identify enriched pathways from expression profiling data, build and analyze the networks of genes and metabolites, and visualize changes in the gene/metabolite data. We demonstrate the applications of Metscape to annotate molecular pathways for human and mouse metabolites implicated in the pathogenesis of sepsis-induced acute lung injury, for the analysis of gene expression and metabolite data from pancreatic ductal adenocarcinoma, and for identification of the candidate metabolites involved in cancer and inflammation.
Availability: Metscape is part of the National Institutes of Health-supported National Center for Integrative Biomedical Informatics (NCIBI) suite of tools, freely available at http://metscape.ncibi.org. It can be downloaded from http://cytoscape.org or installed via Cytoscape plugin manager.
Contact: metscape-help@umich.edu; akarnovs@umich.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btr661
PMCID: PMC3268237
PMID: 22135418
Larson, David E. | Harris, Christopher C. | Chen, Ken | Koboldt, Daniel C. | Abbott, Travis E. | Dooling, David J. | Ley, Timothy J. | Mardis, Elaine R. | Wilson, Richard K. | Ding, Li
Motivation: The sequencing of tumors and their matched normals is frequently used to study the genetic composition of cancer. Despite this fact, there remains a dearth of available software tools designed to compare sequences in pairs of samples and identify sites that are likely to be unique to one sample.
Results: In this article, we describe the mathematical basis of our SomaticSniper software for comparing tumor and normal pairs. We estimate its sensitivity and precision, and present several common sources of error resulting in miscalls.
Availability and implementation: Binaries are freely available for download at http://gmt.genome.wustl.edu/somatic-sniper/current/, implemented in C and supported on Linux and Mac OS X.
Contact: delarson@wustl.edu; lding@wustl.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btr665
PMCID: PMC3268238
PMID: 22155872
Motivation: One of the challenges in interpreting high-throughput genomic studies such as genome-wide association, microarray or ChIP-seq studies is their open-ended nature: once a set of experimentally identified regions is found to be statistically significant, at least two questions arise: (i) besides P-value, do any of these significant regions stand out in terms of biological implications? (ii) Does the set of significant regions, as a whole, have anything in common genome wide? These issues are difficult to address because of the growing number of annotated genomic features (e.g. single nucleotide polymorphisms, transcription factor binding sites, methylation peaks, etc.), and it is difficult to know a priori which features would be most fruitful to analyze. Our goal is to provide partial automation of this process to begin examining associations between experimental features and annotated genomic regions in a hypothesis-free, data-driven manner.
Results: We created GenomeRunner, a tool for automating annotation and enrichment of genomic features of interest (FOI) with annotated genomic features (GFs), in different organisms. Besides simple association of FOIs with known GFs, GenomeRunner tests whether the enriched FOIs, as a group, are statistically associated with a large and growing set of genomic features.
Availability: GenomeRunner setup files and source code are freely available at http://sourceforge.net/projects/genomerunner.
Contact: mikhail-dozmorov@omrf.org; Jonathan-Wren@omrf.org; jdwren@gmail.com
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btr666
PMCID: PMC3268239
PMID: 22155868