
Results 26-50 (2555)

26.  [No title available] 
PMCID: PMC3904520  PMID: 24336642
27.  [No title available] 
PMCID: PMC3904521  PMID: 24319002
28.  [No title available] 
PMCID: PMC3904522  PMID: 24307700
29.  [No title available] 
PMCID: PMC3904524  PMID: 24292941
30.  Fast pairwise IBD association testing in genome-wide association studies 
Bioinformatics  2013;30(2):206-213.
Motivation: Recently, investigators have proposed state-of-the-art Identity-by-descent (IBD) mapping methods to detect IBD segments between purportedly unrelated individuals. The IBD information can then be used for association testing in genetic association studies. One approach for this IBD association testing strategy is to test for excessive IBD between pairs of cases (‘pairwise method’). However, this approach is inefficient because it requires a large number of permutations. Moreover, a limited number of permutations define a lower bound for P-values, which makes fine-mapping of associated regions difficult because, in practice, a much larger genomic region is implicated than the region that is actually associated.
Results: In this article, we introduce a new pairwise method, ‘Fast-Pairwise’. Fast-Pairwise uses importance sampling to improve efficiency and enable approximation of extremely small P-values. The Fast-Pairwise method takes only days to complete a genome-wide scan. In the application to the WTCCC type 1 diabetes data, Fast-Pairwise successfully fine-maps a human leukocyte antigen gene that is known to cause the disease.
Availability: Fast-Pairwise is publicly available at:
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3892684  PMID: 24158599
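The abstract above relies on importance sampling to approximate extremely small P-values. The following is a minimal, generic sketch of that idea, not the Fast-Pairwise sampler itself: it estimates the upper-tail probability of a standard normal statistic by drawing from a proposal shifted to the threshold and reweighting each draw by the density ratio. The function name, the normal null distribution and the choice of proposal are all assumptions made for illustration.

```python
import math
import random

def tail_prob_importance(threshold, n_samples=20000, seed=0):
    """Estimate P(Z > threshold) for Z ~ N(0, 1) by importance sampling.

    Draws come from the proposal N(threshold, 1), so about half of them
    land in the tail; each draw is reweighted by the density ratio
    N(0, 1) / N(threshold, 1) = exp(-threshold * x + threshold**2 / 2).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(threshold, 1.0)
        if x > threshold:
            total += math.exp(-threshold * x + threshold ** 2 / 2.0)
    return total / n_samples

def tail_prob_exact(threshold):
    """Closed-form upper-tail probability of the standard normal."""
    return 0.5 * math.erfc(threshold / math.sqrt(2.0))
```

A naive permutation or Monte Carlo scheme with the same number of draws would almost never observe an event this rare, which is why its smallest reportable P-value is bounded by the number of draws.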
31.  Testing multiple biological mediators simultaneously 
Bioinformatics  2013;30(2):214-220.
Motivation: Modern biomedical and epidemiological studies often measure hundreds or thousands of biomarkers, such as gene expression or metabolite levels. Although there is an extensive statistical literature on adjusting for ‘multiple comparisons’ when testing whether these biomarkers are directly associated with a disease, testing whether they are biological mediators between a known risk factor and a disease requires a more complex null hypothesis, thus offering additional methodological challenges.
Results: We propose a permutation approach that tests multiple putative mediators and controls the family wise error rate. We demonstrate that, unlike when testing direct associations, replacing the Bonferroni correction with a permutation approach that focuses on the maximum of the test statistics can significantly improve the power to detect mediators even when all biomarkers are independent. Through simulations, we show the power of our method is 2–5× larger than the power achieved by Bonferroni correction. Finally, we apply our permutation test to a case-control study of dietary risk factors and colorectal adenoma to show that, of 149 test metabolites, docosahexaenoate is a possible mediator between fish consumption and decreased colorectal adenoma risk.
Availability and implementation: R-package included in online Supplementary Material.
Supplementary information: Supplementary materials are available at Bioinformatics online.
PMCID: PMC3892685  PMID: 24202540
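The abstract above contrasts the Bonferroni correction with a permutation approach based on the maximum of the test statistics. As a hedged illustration only, the sketch below applies the generic single-step max-statistic permutation procedure to a simple two-group comparison. It does not implement the mediation null hypothesis described in the article; the function names and the choice of statistic (absolute difference in group means) are assumptions.

```python
import random
from statistics import mean

def abs_mean_diff(values, labels):
    """Absolute difference in means between the label-1 and label-0 groups."""
    ones = [v for v, lab in zip(values, labels) if lab == 1]
    zeros = [v for v, lab in zip(values, labels) if lab == 0]
    return abs(mean(ones) - mean(zeros))

def max_stat_permutation(markers, labels, n_perm=2000, seed=0):
    """Return one family-wise adjusted P-value per marker.

    markers: list of per-marker value lists, each aligned with labels.
    labels:  list of 0/1 group labels (both groups must be non-empty).
    """
    rng = random.Random(seed)
    observed = [abs_mean_diff(m, labels) for m in markers]
    exceed = [0] * len(markers)
    shuffled = list(labels)
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Null draw of the maximum statistic taken over all markers
        max_null = max(abs_mean_diff(m, shuffled) for m in markers)
        for j, obs in enumerate(observed):
            if max_null >= obs:
                exceed[j] += 1
    # Add-one correction keeps every P-value strictly positive
    return [(count + 1) / (n_perm + 1) for count in exceed]
```

Because every marker is compared against the same permutation distribution of the maximum, the adjusted P-values control the family-wise error rate without assuming independence between markers.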
32.  On the simultaneous association analysis of large genomic regions: a massive multi-locus association test 
Bioinformatics  2013;30(2):157-164.
Motivation: For samples of unrelated individuals, we propose a general analysis framework in which hundreds of thousands of genetic loci can be tested simultaneously for association with complex phenotypes. The approach is built on spatial-clustering methodology, assuming that genetic loci that are associated with the target phenotype cluster in certain genomic regions. In contrast to standard methodology for multilocus analysis, which has focused on the dimension reduction of the data, our multilocus association-clustering test profits from the availability of large numbers of genetic loci by detecting clusters of loci that are associated with the phenotype.
Results: The approach is computationally fast and powerful, enabling the simultaneous association testing of large genomic regions. Even the entire genome or certain chromosomes can be tested simultaneously. Using simulation studies, the properties of the approach are evaluated. In an application to a genome-wide association study for chronic obstructive pulmonary disease, we illustrate the practical relevance of the proposed method by simultaneously testing all genotyped loci of the genome-wide association study and by testing each chromosome individually. Our findings suggest that statistical methodology that incorporates spatial-clustering information will be especially useful in whole-genome sequencing studies in which millions or billions of base pairs are recorded and grouped by genomic regions or genes, and are tested jointly for association.
Availability and implementation: Implementation of the approach is available upon request.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3892690  PMID: 24262215
33.  SeqDepot: streamlined database of biological sequences and precomputed features 
Bioinformatics  2013;30(2):295-297.
Summary: Assembling and/or producing integrated knowledge of sequence features continues to be an onerous and redundant task despite a large number of existing resources. We have developed SeqDepot—a novel database that focuses solely on two primary goals: (i) assimilating known primary sequences with predicted feature data and (ii) providing the most simple and straightforward means to procure and readily use this information. Access to >28.5 million sequences and 300 million features is provided through a well-documented and flexible RESTful interface that supports fetching specific data subsets, bulk queries, visualization and searching by MD5 digests or external database identifiers. We have also developed an HTML5/JavaScript web application exemplifying how to interact with SeqDepot and Perl/Python scripts for use with local processing pipelines.
Availability: Freely available on the web at REST access via Database files and scripts may be downloaded from
PMCID: PMC3892692  PMID: 24234005
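The abstract above mentions searching by MD5 digests of sequences. A minimal sketch of computing such a digest locally with the Python standard library follows. The normalization (removing whitespace and uppercasing) and the hexadecimal output are assumptions made for illustration; the exact identifier encoding expected by SeqDepot is not stated in the abstract.

```python
import hashlib

def sequence_md5(sequence: str) -> str:
    """Hex MD5 digest of a sequence after stripping whitespace and uppercasing."""
    normalized = "".join(sequence.split()).upper()
    return hashlib.md5(normalized.encode("ascii")).hexdigest()
```

Normalizing before hashing means that the same sequence pasted with different line wrapping or letter case maps to the same digest.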
34.  Pathway Commons at Virtual Cell: use of pathway data for mathematical modeling 
Bioinformatics  2013;30(2):292-294.
Summary: Pathway Commons is a resource permitting simultaneous queries of multiple pathway databases. However, there is no standard mechanism for using these data (stored in BioPAX format) to annotate and build quantitative mathematical models. Therefore, we developed a new module within the virtual cell modeling and simulation software. It provides pathway data retrieval and visualization and enables automatic creation of executable network models directly from qualitative connections between pathway nodes.
Availability and implementation: Available at Virtual Cell ( Application runs on all major platforms and does not require registration for use on the user’s computer. Tutorials and video are available at user guide page.
PMCID: PMC3892693  PMID: 24273241
35.  Using Genome Query Language to uncover genetic variation 
Bioinformatics  2013;30(1):1-8.
Motivation: With high-throughput DNA sequencing costs dropping below $1000 for human genomes, data storage, retrieval and analysis are the major bottlenecks in biological studies. To address the large-data challenges, we advocate a clean separation between the evidence collection and the inference in variant calling. We define and implement a Genome Query Language (GQL) that allows for the rapid collection of evidence needed for calling variants.
Results: We provide a number of cases to showcase the use of GQL for complex evidence collection, such as the evidence for large structural variations. Specifically, typical GQL queries can be written in 5–10 lines of high-level code and search large datasets (100 GB) in minutes. We also demonstrate its complementarity with other variant calling tools. Popular variant calling tools can achieve one order of magnitude speed-up by using GQL to retrieve evidence. Finally, we show how GQL can be used to query and compare multiple datasets. By separating the evidence and inference for variant calling, GQL frees all variant detection tools from data-intensive evidence collection and lets them focus on statistical inference.
Availability: GQL can be downloaded from
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3866549  PMID: 23751181
36.  A C library for retrieving specific reactions from the BioModels database 
Bioinformatics  2013;30(1):129-130.
Summary: We describe libSBMLReactionFinder, a C library for retrieving specific biochemical reactions from the curated systems biology markup language models contained in the BioModels database. The library leverages semantic annotations in the database to associate reactions with human-readable descriptions, making the reactions retrievable through simple string searches. Our goal is to provide a useful tool for quantitative modelers who seek to accelerate modeling efforts through the reuse of previously published representations of specific chemical reactions.
Availability and implementation: The library is open-source and dual licensed under the Mozilla Public License Version 2.0 and GNU General Public License Version 2.0. Project source code, downloads and documentation are available at
PMCID: PMC3866552  PMID: 24078714
37.  A user-oriented web crawler for selectively acquiring online content in e-health research 
Bioinformatics  2013;30(1):104-114.
Motivation: Life stories of diseased and healthy individuals are abundantly available on the Internet. Collecting and mining such online content can offer many valuable insights into patients’ physical and emotional states throughout the pre-diagnosis, diagnosis, treatment and post-treatment stages of the disease compared with those of healthy subjects. However, such content is widely dispersed across the web. Using traditional query-based search engines to manually collect relevant materials is rather labor intensive and often incomplete due to resource constraints in terms of human query composition and result parsing efforts. The alternative option, blindly crawling the whole web, has proven inefficient and unaffordable for e-health researchers.
Results: We propose a user-oriented web crawler that adaptively acquires user-desired content on the Internet to meet the specific online data source acquisition needs of e-health researchers. Experimental results on two cancer-related case studies show that the new crawler can substantially accelerate the acquisition of highly relevant online content compared with the existing state-of-the-art adaptive web crawling technology. For the breast cancer case study using the full training set, the new method achieves a cumulative precision between 74.7 and 79.4% after 5 h of execution till the end of the 20-h long crawling session as compared with the cumulative precision between 32.8 and 37.0% using the peer method for the same time period. For the lung cancer case study using the full training set, the new method achieves a cumulative precision between 56.7 and 61.2% after 5 h of execution till the end of the 20-h long crawling session as compared with the cumulative precision between 29.3 and 32.4% using the peer method. Using the reduced training set in the breast cancer case study, the cumulative precision of our method is between 44.6 and 54.9%, whereas the cumulative precision of the peer method is between 24.3 and 26.3%; for the lung cancer case study using the reduced training set, the cumulative precisions of our method and the peer method are, respectively, between 35.7 and 46.7% versus between 24.1 and 29.6%. These numbers clearly show a consistently superior accuracy of our method in discovering and acquiring user-desired online content for e-health research.
Availability and implementation: The implementation of our user-oriented web crawler is freely available to non-commercial users via the following Web site: The Web site provides a step-by-step guide on how to execute the web crawler implementation. In addition, the Web site provides the two study datasets including manually labeled ground truth, initial seeds and the crawling results reported in this article.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3866553  PMID: 24078710
38.  MSPrep—Summarization, normalization and diagnostics for processing of mass spectrometry–based metabolomic data 
Bioinformatics  2013;30(1):133-134.
Motivation: Although R packages exist for the pre-processing of metabolomic data, they currently do not incorporate additional analysis steps of summarization, filtering and normalization of aligned data. We developed the MSPrep R package to complement other packages by providing these additional steps, implementing a selection of popular normalization algorithms and generating diagnostics to help guide investigators in their analyses.
Supplementary Information: Supplementary materials are available at Bioinformatics online.
PMCID: PMC3866554  PMID: 24174567
39.  PhosphoNetworks: a database for human phosphorylation networks 
Bioinformatics  2013;30(1):141-142.
Summary: Phosphorylation plays an important role in cellular signal transduction. Current phosphorylation-related databases often focus on the phosphorylation sites, which are mainly determined by mass spectrometry. Here, we present PhosphoNetworks, a phosphorylation database built on a high-resolution map of phosphorylation networks. This high-resolution map of phosphorylation networks provides not only the kinase–substrate relationships (KSRs), but also the specific phosphorylation sites on which the kinases act on the substrates. The database contains the most comprehensive dataset for KSRs, including the relationships from a recent high-throughput project for identification of KSRs using protein microarrays, as well as known KSRs curated from the literature. In addition, the database also includes several analytical tools for dissecting phosphorylation networks. PhosphoNetworks is expected to play a prominent role in proteomics and phosphorylation-related disease research.
PMCID: PMC3866559  PMID: 24227675
40.  Functional module identification in protein interaction networks by interaction patterns 
Bioinformatics  2013;30(1):81-93.
Motivation: Identifying functional modules in protein–protein interaction (PPI) networks may shed light on cellular functional organization and thereafter underlying cellular mechanisms. Many existing module identification algorithms aim to detect densely connected groups of proteins as potential modules. However, based on this simple topological criterion of ‘higher than expected connectivity’, those algorithms may miss biologically meaningful modules of functional significance, in which proteins have similar interaction patterns to other proteins in networks but may not be densely connected to each other. A few blockmodel module identification algorithms have been proposed to address the problem but the lack of global optimum guarantee and the prohibitive computational complexity have been the bottleneck of their applications in real-world large-scale PPI networks.
Results: In this article, we propose a novel optimization formulation LCP2 (low two-hop conductance sets) using the concept of Markov random walk on graphs, which enables simultaneous identification of both dense and sparse modules based on protein interaction patterns in given networks through searching for LCP2 by random walk. A spectral approximate algorithm SLCP2 is derived to identify non-overlapping functional modules. Based on a bottom-up greedy strategy, we further extend LCP2 to a new algorithm (greedy algorithm for LCP2) GLCP2 to identify overlapping functional modules. We compare SLCP2 and GLCP2 with a range of state-of-the-art algorithms on synthetic networks and real-world PPI networks. The performance evaluation based on several criteria with respect to protein complex prediction, high level Gene Ontology term prediction and especially sparse module detection, has demonstrated that our algorithms based on searching for LCP2 outperform all other compared algorithms.
Availability and implementation: All data and code are available at∼xqian/fmi/slcp2hop/.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3924044  PMID: 24085567
41.  BSeQC: quality control of bisulfite sequencing experiments 
Bioinformatics  2013;29(24):3227-3229.
Motivation: Bisulfite sequencing (BS-seq) has emerged as the gold standard to study genome-wide DNA methylation at single-nucleotide resolution. Quality control (QC) is a critical step in the analysis pipeline to ensure that BS-seq data are of high quality and suitable for subsequent analysis. Although several QC tools are available for next-generation sequencing data, most of them were not designed to handle QC issues specific to BS-seq protocols. Therefore, there is a strong need for a dedicated QC tool to evaluate and remove potential technical biases in BS-seq experiments.
Results: We developed a package named BSeQC to comprehensively evaluate the quality of BS-seq experiments and automatically trim nucleotides with potential technical biases that may result in inaccurate methylation estimation. BSeQC takes standard SAM/BAM files as input and generates bias-free SAM/BAM files for downstream analysis. Evaluation based on real BS-seq data indicates that the use of the bias-free SAM/BAM file substantially improves the quantification of methylation level.
Availability and implementation: BSeQC is freely available at:
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3842756  PMID: 24064417
42.  MATE-CLEVER: Mendelian-inheritance-aware discovery and genotyping of midsize and long indels 
Bioinformatics  2013;29(24):3143-3150.
Motivation: Accurately predicting and genotyping indels longer than 30 bp has remained a central challenge in next-generation sequencing (NGS) studies. While indels of up to 30 bp are reliably processed by standard read aligners and the Genome Analysis Toolkit (GATK), longer indels have still resisted proper treatment. Also, discovering and genotyping longer indels has become particularly relevant owing to the increasing attention in globally concerted projects.
Results: We present MATE-CLEVER (Mendelian-inheritance-AtTEntive CLique-Enumerating Variant findER) as an approach that accurately discovers and genotypes indels longer than 30 bp from contemporary NGS reads with a special focus on family data. For enhanced quality of indel calls in family trios or quartets, MATE-CLEVER integrates statistics that reflect the laws of Mendelian inheritance. MATE-CLEVER’s performance rates for indels longer than 30 bp are on a par with those of the GATK for indels shorter than 30 bp, achieving up to 90% precision overall, with >80% of calls correctly typed. In predicting de novo indels longer than 30 bp in family contexts, MATE-CLEVER even raises the standards of the GATK. MATE-CLEVER achieves precision and recall of ∼63% on indels of 30 bp and longer versus 55% in both categories for the GATK on indels of 10–29 bp. A special version of MATE-CLEVER has contributed to indel discovery, in particular for indels of 30–100 bp, the ‘NGS twilight zone of indels’, in the Genome of the Netherlands Project.
Supplementary Information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3842759  PMID: 24072733
43.  STAR: an integrated solution to management and visualization of sequencing data 
Bioinformatics  2013;29(24):3204-3210.
Motivation: Easy visualization of complex data features is a necessary step in studies of next-generation sequencing (NGS) data. We developed STAR, an integrated web application that enables online management, visualization and track-based analysis of NGS data.
Results: STAR is a multilayer web service system. On the client side, STAR leverages JavaScript, HTML5 Canvas and asynchronous communications to deliver a smoothly scrolling desktop-like graphical user interface with a suite of in-browser analysis tools that range from providing simple track configuration controls to sophisticated feature detection within datasets. On the server side, STAR supports private session state retention via an account management system and provides data management modules that enable collection, visualization and analysis of third-party sequencing data from the public domain, with thousands of tracks hosted to date. Overall, STAR represents a next-generation data exploration solution to match the requirements of NGS data, enabling both intuitive visualization and dynamic analysis of data.
Availability and implementation: STAR browser system is freely available on the web at and
PMCID: PMC3842760  PMID: 24078702
44.  WebGLORE: a Web service for Grid LOgistic REgression 
Bioinformatics  2013;29(24):3238-3240.
WebGLORE is a free web service that enables privacy-preserving construction of a global logistic regression model from distributed datasets that are sensitive. It only transfers aggregated local statistics (from participants) through Hypertext Transfer Protocol Secure to a trusted server, where the global model is synthesized. WebGLORE seamlessly integrates AJAX, JAVA Applet/Servlet and PHP technologies to provide an easy-to-use web service for biomedical researchers to break down policy barriers during information exchange.
Availability and implementation: WebGLORE can be used under the terms of the GNU General Public License as published by the Free Software Foundation.
PMCID: PMC3842761  PMID: 24072732
45.  Optimized atomic statistical potentials: assessment of protein interfaces and loops 
Bioinformatics  2013;29(24):3158-3166.
Motivation: Statistical potentials have been widely used for modeling whole proteins and their parts (e.g. sidechains and loops) as well as interactions between proteins, nucleic acids and small molecules. Here, we formulate the statistical potentials entirely within a statistical framework, avoiding questionable statistical mechanical assumptions and approximations, including a definition of the reference state.
Results: We derive a general Bayesian framework for inferring statistically optimized atomic potentials (SOAP) in which the reference state is replaced with data-driven ‘recovery’ functions. Moreover, we restrain the relative orientation between two covalent bonds instead of a simple distance between two atoms, in an effort to capture orientation-dependent interactions such as hydrogen bonds. To demonstrate this general approach, we computed statistical potentials for protein–protein docking (SOAP-PP) and loop modeling (SOAP-Loop). For docking, a near-native model is within the top 10 scoring models in 40% of the PatchDock benchmark cases, compared with 23 and 27% for the state-of-the-art ZDOCK and FireDock scoring functions, respectively. Similarly, for modeling 12-residue loops in the PLOP benchmark, the average main-chain root mean square deviation of the best scored conformations by SOAP-Loop is 1.5 Å, close to the average root mean square deviation of the best sampled conformations (1.2 Å) and significantly better than that selected by Rosetta (2.1 Å), DFIRE (2.3 Å), DOPE (2.5 Å) and PLOP scoring functions (3.0 Å). Our Bayesian framework may also result in more accurate statistical potentials for additional modeling applications, thus affording better leverage of the experimentally determined protein structures.
Availability and implementation: SOAP-PP and SOAP-Loop are available as part of MODELLER (
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3842762  PMID: 24078704
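The loop-modeling results above are reported as main-chain root mean square deviation (RMSD) in Å. For readers unfamiliar with the measure, here is a minimal sketch of the RMSD between two equally ordered coordinate lists. It assumes the two conformations are already in the same frame (no superposition is performed), and the function name and input layout are hypothetical rather than taken from MODELLER or SOAP-Loop.

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square deviation between two lists of (x, y, z) tuples.

    Atoms must be listed in the same order in both lists; no structural
    alignment is applied before the distances are computed.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared / len(coords_a))
```

Under this definition, an average RMSD of 1.5 Å means that the root of the mean squared displacement of the compared main-chain atoms from their reference positions is 1.5 Å.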
46.  GPCR ontology: development and application of a G protein-coupled receptor pharmacology knowledge framework 
Bioinformatics  2013;29(24):3211-3219.
Motivation: Novel tools need to be developed to help scientists analyze large amounts of available screening data with the goal of identifying entry points for the development of novel chemical probes and drugs. As the largest class of drug targets, G protein-coupled receptors (GPCRs) remain of particular interest and are pursued by numerous academic and industrial research projects.
Results: We report the first GPCR ontology to facilitate integration and aggregation of GPCR-targeting drugs and demonstrate its application to classify and analyze a large subset of the PubChem database. The GPCR ontology, based on previously reported BioAssay Ontology, depicts available pharmacological, biochemical and physiological profiles of GPCRs and their ligands. The novelty of the GPCR ontology lies in the use of diverse experimental datasets linked by a model to formally define these concepts. Using a reasoning system, GPCR ontology offers potential for knowledge-based classification of individuals (such as small molecules) as a function of the data.
Availability: The GPCR ontology is available at and the National Center for Biomedical Ontologies Web site.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3842764  PMID: 24078711
47.  Achievements and challenges in structural bioinformatics and computational biophysics 
Bioinformatics  2014;31(1):146-150.
Motivation: The field of structural bioinformatics and computational biophysics has undergone a revolution in the last 10 years. These developments are captured annually through the 3DSIG meeting, upon which this article reflects.
Results: An increase in the accessible data, computational resources and methodology has resulted in an increase in the size and resolution of studied systems and the complexity of the questions amenable to research. Concomitantly, the parameterization and efficiency of the methods have markedly improved along with their cross-validation with other computational and experimental results.
Conclusion: The field exhibits an ever-increasing integration with biochemistry, biophysics and other disciplines. In this article, we discuss recent achievements along with current challenges within the field.
PMCID: PMC4271151  PMID: 25488929
48.  PAVIS: a tool for Peak Annotation and Visualization 
Bioinformatics  2013;29(23):3097-3099.
Summary: We introduce a web-based tool, Peak Annotation and Visualization (PAVIS), for annotating and visualizing ChIP-seq peak data. PAVIS is designed with non-bioinformaticians in mind and presents a straightforward user interface to facilitate biological interpretation of ChIP-seq peak or other genomic enrichment data. PAVIS, through association with annotation, provides relevant genomic context for each peak, such as peak location relative to genomic features including transcription start site, intron, exon or 5′/3′-untranslated region. PAVIS reports the relative enrichment P-values of peaks in these functionally distinct categories, and provides a summary plot of the relative proportion of peaks in each category. PAVIS, unlike many other resources, provides a peak-oriented annotation and visualization system, allowing dynamic visualization of tens to hundreds of loci from one or more ChIP-seq experiments simultaneously. PAVIS enables rapid and easy examination and cross-comparison of the genomic context and potential functions of the underlying genomic elements, thus supporting downstream hypothesis generation.
Availability and Implementation: PAVIS is publicly accessed at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3834791  PMID: 24008416
49.  Precise inference of copy number alterations in tumor samples from SNP arrays 
Bioinformatics  2013;29(23):2964-2970.
Motivation: The accurate detection of copy number alterations (CNAs) in human genomes is important for understanding susceptibility to cancer and mechanisms of tumor progression. CNA detection in tumors from single nucleotide polymorphism (SNP) genotyping arrays is a challenging problem due to phenomena such as aneuploidy, stromal contamination, genomic waves and intra-tumor heterogeneity, issues that leading methods do not optimally address.
Results: Here we introduce methods and software (PennCNV-tumor) for fast and accurate CNA detection using signal intensity data from SNP genotyping arrays. We estimate stromal contamination by applying a maximum likelihood approach over multiple discrete genomic intervals. By conditioning on signal intensity across the genome, our method accounts for both aneuploidy and genomic waves. Finally, our method uses a hidden Markov model to integrate multiple sources of information, including total and allele-specific signal intensity at each SNP, as well as physical maps to make posterior inferences of CNAs. Using real data from cancer cell-lines and patient tumors, we demonstrate substantial improvements in accuracy and computational efficiency compared with existing methods.
Availability: Source code, documentation and example datasets are freely available at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3834792  PMID: 24021380
50.  Explicet: graphical user interface software for metadata-driven management, analysis and visualization of microbiome data 
Bioinformatics  2013;29(23):3100-3101.
Summary: Studies of the human microbiome, and microbial community ecology in general, have blossomed of late and are now a burgeoning source of exciting research findings. Along with the advent of next-generation sequencing platforms, which have dramatically increased the scope of microbiome-related projects, several high-performance sequence analysis pipelines (e.g. QIIME, MOTHUR, VAMPS) are now available to investigators for microbiome analysis. The subject of our manuscript, the graphical user interface-based Explicet software package, fills a previously unmet need for a robust, yet intuitive means of integrating the outputs of the software pipelines with user-specified metadata and then visualizing the combined data.
Availability and Implementation: Explicet is implemented in C++ via the Qt framework and supported in native code on all major operating systems (Windows, Macintosh, Linux). The source code, documents and tutorials are freely available under an open-source license at
PMCID: PMC3834795  PMID: 24021386
