Results 1-25 (2431)
 

1.  A variable selection method for genome-wide association studies 
Bioinformatics  2010;27(1):1-8.
Motivation: Genome-wide association studies (GWAS) involving half a million or more single nucleotide polymorphisms (SNPs) allow genetic dissection of complex diseases in a holistic manner. The common practice of analyzing one SNP at a time does not fully realize the potential of GWAS to identify multiple causal variants and to predict risk of disease. Existing methods for joint analysis of GWAS data tend to miss causal SNPs that are marginally uncorrelated with disease and have high false discovery rates (FDRs).
Results: We introduce GWASelect, a statistically powerful and computationally efficient variable selection method designed to tackle the unique challenges of GWAS data. This method searches iteratively over the potential SNPs conditional on previously selected SNPs and is thus capable of capturing causal SNPs that are marginally correlated with disease as well as those that are marginally uncorrelated with disease. A special resampling mechanism is built into the method to reduce false positive findings. Simulation studies demonstrate that the GWASelect performs well under a wide spectrum of linkage disequilibrium patterns and can be substantially more powerful than existing methods in capturing causal variants while having a lower FDR. In addition, the regression models based on the GWASelect tend to yield more accurate prediction of disease risk than existing methods. The advantages of the GWASelect are illustrated with the Wellcome Trust Case-Control Consortium (WTCCC) data.
Availability: The software implementing GWASelect is available at http://www.bios.unc.edu/~lin.
Access to WTCCC data: http://www.wtccc.org.uk/
Contact: lin@bios.unc.edu
Supplementary information: Supplementary data are available at Bioinformatics Online.
doi:10.1093/bioinformatics/btq600
PMCID: PMC3025714  PMID: 21036813
2.  Efficient whole-genome association mapping using local phylogenies for unphased genotype data 
Bioinformatics  2008;24(19):2215-2221.
Motivation: Recent advances in genotyping technology have made data acquisition for whole-genome association studies cost effective, and a current active area of research is developing efficient methods to analyze such large-scale datasets. Most sophisticated association mapping methods that are currently available take phased haplotype data as input. However, phase information is not readily available from sequencing methods and inferring the phase via computational approaches is time-consuming, taking days to phase a single chromosome.
Results: In this article, we devise an efficient method for scanning unphased whole-genome data for association. Our approach combines a recently found linear-time algorithm for phasing genotypes on trees with a recently proposed tree-based method for association mapping. From unphased genotype data, our algorithm builds local phylogenies along the genome, and scores each tree according to the clustering of cases and controls. We assess the performance of our new method on both simulated and real biological datasets.
Availability: The software described in this article is available at http://www.daimi.au.dk/~mailund/Blossoc and distributed under the GNU General Public License.
Contact: mailund@birc.au.dk
doi:10.1093/bioinformatics/btn406
PMCID: PMC2553438  PMID: 18667442
3.  iFoldRNA: three-dimensional RNA structure prediction and folding 
Bioinformatics  2008;24(17):1951-1952.
Summary: Three-dimensional RNA structure prediction and folding is of significant interest in the biological research community. Here, we present iFoldRNA, a novel web-based methodology for RNA structure prediction with near atomic resolution accuracy and analysis of RNA folding thermodynamics. iFoldRNA rapidly explores RNA conformations using discrete molecular dynamics simulations of input RNA sequences. Starting from simplified linear-chain conformations, RNA molecules (<50 nt) fold to native-like structures within half an hour of simulation, facilitating rapid RNA structure prediction. All-atom reconstruction of energetically stable conformations generates iFoldRNA predicted RNA structures. The predicted RNA structures are within 2–5 Å root mean square deviations (RMSDs) from corresponding experimentally derived structures. RNA folding parameters including specific heat, contact maps, simulation trajectories, gyration radii, RMSDs from native state and fraction of native-like contacts are accessible from iFoldRNA. We expect iFoldRNA will serve as a useful resource for RNA structure prediction and folding thermodynamic analyses.
Availability: http://iFoldRNA.dokhlab.org.
Contact: dokh@med.unc.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btn328
PMCID: PMC2559968  PMID: 18579566
4.  Systematic biological prioritization after a genome-wide association study: an application to nicotine dependence 
Bioinformatics  2008;24(16):1805-1811.
Motivation: A challenging problem after a genome-wide association study (GWAS) is to balance the statistical evidence of genotype–phenotype correlation with a priori evidence of biological relevance.
Results: We introduce a method for systematically prioritizing single nucleotide polymorphisms (SNPs) for further study after a GWAS. The method combines evidence across multiple domains including statistical evidence of genotype–phenotype correlation, known pathways in the pathologic development of disease, SNP/gene functional properties, comparative genomics, prior evidence of genetic linkage, and linkage disequilibrium. We apply this method to a GWAS of nicotine dependence, and use simulated data to test it on several commercial SNP microarrays.
Availability: A comprehensive database of biological prioritization scores for all known SNPs is available at http://zork.wustl.edu/gin. This can be used to prioritize nicotine dependence association studies through a straightforward mathematical formula—no special software is necessary.
Contact: ssaccone@wustl.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btn315
PMCID: PMC2610477  PMID: 18565990
5.  Comprehensive in silico mutagenesis highlights functionally important residues in proteins 
Bioinformatics  2008;24(16):i207-i212.
Motivation: Mutating residues into alanine (alanine scanning) is one of the fastest experimental means of probing hypotheses about protein function. Alanine scans can reveal functional hot spots, i.e. residues that alter function upon mutation. In vitro mutagenesis is cumbersome and costly: probing all residues in a protein is typically as impossible as substituting by all non-native amino acids. In contrast, such exhaustive mutagenesis is feasible in silico.
Results: Previously, we developed SNAP to predict functional changes due to non-synonymous single nucleotide polymorphisms. Here, we applied SNAP to all experimental mutations in the ASEdb database of alanine scans; we identified 70% of the hot spots (≥1 kCal/mol change in binding energy); more severe changes were predicted more accurately. Encouraged, we carried out a complete all-against-all in silico mutagenesis for human glucokinase. Many of the residues predicted as functionally important have indeed been confirmed in the literature, others await experimental verification, and our method is ready to aid in the design of in vitro mutagenesis.
Availability: ASEdb and glucokinase scores are available at http://www.rostlab.org/services/SNAP. For submissions of large/whole proteins for processing please contact the author.
Contact: yb2009@columbia.edu
doi:10.1093/bioinformatics/btn268
PMCID: PMC2597370  PMID: 18689826
6.  LOT: a tool for linkage analysis of ordinal traits for pedigree data 
Bioinformatics  2008;24(15):1737-1739.
Summary: Existing linkage-analysis methods address binary or quantitative traits. However, many complex diseases and human conditions, particularly behavioral disorders, are rated on ordinal scales. Herein, we introduce LOT, a tool that performs linkage analysis of ordinal traits for pedigree data. It implements a latent-variable proportional-odds logistic model that relates inheritance patterns to the distribution of the ordinal trait. The likelihood-ratio test is used for testing evidence of linkage.
Availability: The LOT program is available for download at http://c2s2.yale.edu/software/LOT/
Contact: heping.zhang@yale.edu
doi:10.1093/bioinformatics/btn258
PMCID: PMC2566542  PMID: 18535081
7.  Powerful fusion: PSI-BLAST and consensus sequences 
Bioinformatics  2008;24(18):1987-1993.
Motivation: A typical PSI-BLAST search consists of iterative scanning and alignment of a large sequence database during which a scoring profile is progressively built and refined. Such a profile can also be stored and used to search against a different database of sequences. Using it to search against a database of consensus rather than native sequences is a simple add-on that boosts performance surprisingly well. The improvement comes at a price: we hypothesized that random alignment score statistics would differ between native and consensus sequences. Thus PSI-BLAST-based profile searches against consensus sequences might incorrectly estimate statistical significance of alignment scores. In addition, iterative searches against consensus databases may fail. Here, we addressed these challenges in an attempt to harness the full power of the combination of PSI-BLAST and consensus sequences.
Results: We studied alignment score statistics for various types of consensus sequences. In general, the score distribution parameters of profile-based consensus sequence alignments differed significantly from those derived for the native sequences. PSI-BLAST partially compensated for the parameter variation. We have identified a protocol for building specialized consensus sequences that significantly improved search sensitivity and preserved score distribution parameters. As a result, PSI-BLAST profiles can be used to search specialized consensus sequences without sacrificing estimates of statistical significance. We also provided results indicating that iterative PSI-BLAST searches against consensus sequences could work very well. Overall, we showed how a very popular and effective method could be used to identify significantly more relevant similarities among protein sequences.
Availability: http://www.rostlab.org/services/consensus/
Contact: dariusz@mit.edu
doi:10.1093/bioinformatics/btn384
PMCID: PMC2577777  PMID: 18678588
8.  Modeling peptide fragmentation with dynamic Bayesian networks for peptide identification 
Bioinformatics  2008;24(13):i348-i356.
Motivation: Tandem mass spectrometry (MS/MS) is an indispensable technology for identification of proteins from complex mixtures. Proteins are digested to peptides that are then identified by their fragmentation patterns in the mass spectrometer. Thus, at its core, MS/MS protein identification relies on the relative predictability of peptide fragmentation. Unfortunately, peptide fragmentation is complex and not fully understood, and what is understood is not always exploited by peptide identification algorithms.
Results: We use a hybrid dynamic Bayesian network (DBN)/support vector machine (SVM) approach to address these two problems. We train a set of DBNs on high-confidence peptide-spectrum matches. These DBNs, known collectively as Riptide, comprise a probabilistic model of peptide fragmentation chemistry. Examination of the distributions learned by Riptide allows identification of new trends, such as prevalent a-ion fragmentation at peptide cleavage sites C-term to hydrophobic residues. In addition, Riptide can be used to produce likelihood scores that indicate whether a given peptide-spectrum match is correct. A vector of such scores is evaluated by an SVM, which produces a final score to be used in peptide identification. Using Riptide in this way yields improved discrimination when compared to other state-of-the-art MS/MS identification algorithms, increasing the number of positive identifications by as much as 12% at a 1% false discovery rate.
Availability: Python and C source code are available upon request from the authors. The curated training sets are available at http://noble.gs.washington.edu/proj/intense/. The Graphical Model Tool Kit (GMTK) is freely available at http://ssli.ee.washington.edu/bilmes/gmtk.
Contact: noble@gs.washington.edu
doi:10.1093/bioinformatics/btn189
PMCID: PMC2665034  PMID: 18586734
9.  Memory-efficient dynamic programming backtrace and pairwise local sequence alignment 
Bioinformatics  2008;24(16):1772-1778.
Motivation: A backtrace through a dynamic programming algorithm's intermediate results in search of an optimal path, or to sample paths according to an implied probability distribution, or as the second stage of a forward–backward algorithm, is a task of fundamental importance in computational biology. When there is insufficient space to store all intermediate results in high-speed memory (e.g. cache) existing approaches store selected stages of the computation, and recompute missing values from these checkpoints on an as-needed basis.
Results: Here we present an optimal checkpointing strategy, and demonstrate its utility with pairwise local sequence alignment of sequences of length 10 000.
Availability: Sample C++-code for optimal backtrace is available in the Supplementary Materials.
Contact: leen@cs.rpi.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btn308
PMCID: PMC2668612  PMID: 18558620
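The checkpoint-and-recompute idea described in entry 9 can be illustrated with a minimal sketch. It assumes a plain Smith-Waterman recurrence with a linear gap cost and a uniform checkpoint spacing (`every`); the scoring constants and function names are hypothetical, and this is not the optimal checkpointing strategy the article presents. The forward pass stores only every `every`-th row, and the backtrace recomputes the rows between two checkpoints when it needs them.

```python
MATCH, MISMATCH, GAP = 2, -1, -2  # illustrative scores, not from the article


def sw_row(prev, a_char, b):
    """Compute one Smith-Waterman row (linear gap cost) from the row above."""
    cur = [0] * (len(b) + 1)
    for j in range(1, len(b) + 1):
        s = MATCH if a_char == b[j - 1] else MISMATCH
        cur[j] = max(0, prev[j - 1] + s, prev[j] + GAP, cur[j - 1] + GAP)
    return cur


def local_align_checkpointed(a, b, every=4):
    """Return (score, aligned_a, aligned_b), storing only every `every`-th row."""
    # Forward pass: keep checkpoint rows and remember the best-scoring cell.
    checkpoints = {0: [0] * (len(b) + 1)}
    row, best, best_pos = checkpoints[0], 0, (0, 0)
    for i in range(1, len(a) + 1):
        row = sw_row(row, a[i - 1], b)
        if i % every == 0:
            checkpoints[i] = row
        if max(row) > best:
            best = max(row)
            best_pos = (i, row.index(best))

    block = {}  # rows recomputed from the most recently used checkpoint

    def get_row(i):
        if i not in block:
            block.clear()
            start = max(0, (i - 1) // every * every)
            block[start] = r = checkpoints[start]
            for k in range(start + 1, min(start + every, len(a)) + 1):
                block[k] = r = sw_row(r, a[k - 1], b)
        return block[i]

    # Backtrace from the best cell until a zero-score cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0:
        cur = get_row(i)
        if cur[j] == 0:
            break
        prev = get_row(i - 1)
        s = MATCH if a[i - 1] == b[j - 1] else MISMATCH
        if cur[j] == prev[j - 1] + s:      # diagonal move
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif cur[j] == prev[j] + GAP:      # gap in b
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:                              # gap in a
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

With `every` near the square root of the first sequence's length, this sketch holds on the order of that many rows at once instead of the full table, at the cost of roughly one extra pass of recomputation.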
11.  GPU linear and non-linear Poisson–Boltzmann solver module for DelPhi 
Bioinformatics  2013;30(4):569-570.
Summary: In this work, we present a CUDA-based GPU implementation of a Poisson–Boltzmann equation solver, in both the linear and non-linear versions, using double precision. A finite difference scheme is adopted and made suitable for the GPU architecture. The resulting code was interfaced with the electrostatics software for biomolecules DelPhi, which is widely used in the computational biology community. The algorithm has been implemented using CUDA and tested over a few representative cases of biological interest. Details of the implementation and performance test results are illustrated. A speedup of ∼10 times was achieved both in the linear and non-linear cases.
Availability and implementation: The module is open-source and available at http://www.electrostaticszone.eu/index.php/downloads.
Contact: walter.rocchia@iit.it
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btt699
PMCID: PMC3928518  PMID: 24292939
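As a rough illustration of the finite-difference approach mentioned in entry 11, the sketch below relaxes the linearized Poisson–Boltzmann equation in one dimension with Jacobi sweeps. Everything here is an assumption made for illustration: the 1D geometry, the boundary values, the absence of fixed charges and the function names. It is not the DelPhi module's scheme, which the abstract does not spell out. Each grid point is updated from the previous sweep only, which is the kind of independence that tends to map well onto GPU threads.

```python
import math


def solve_linear_pb_1d(kappa, length, n_intervals=20, n_iter=5000):
    """Jacobi relaxation for phi'' = kappa^2 * phi with phi(0)=1, phi(L)=0."""
    h = length / n_intervals
    phi = [0.0] * (n_intervals + 1)
    phi[0] = 1.0  # fixed boundary potential (illustrative units)
    denom = 2.0 + (kappa * h) ** 2
    for _ in range(n_iter):
        new = phi[:]
        for i in range(1, n_intervals):
            # Each update reads only values from the previous sweep.
            new[i] = (phi[i - 1] + phi[i + 1]) / denom
        phi = new
    return phi


def analytic_linear_pb_1d(kappa, length, x):
    """Closed-form solution of the same boundary-value problem."""
    return math.sinh(kappa * (length - x)) / math.sinh(kappa * length)
```

Comparing `solve_linear_pb_1d(1.0, 1.0)` against `analytic_linear_pb_1d` at each grid point is a quick way to check the discretization.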
12.  ChromoHub V2: cancer genomics 
Bioinformatics  2013;30(4):590-592.
Summary: Cancer genomics data produced by next-generation sequencing support the notion that epigenetic mechanisms play a central role in cancer. We have previously developed Chromohub, an open access online interface where users can map chemical, structural and biological data from public repositories on phylogenetic trees of protein families involved in chromatin-mediated signaling. Here, we describe a cancer genomics interface that was recently added to Chromohub; the frequency of mutation, amplification and change in expression of chromatin factors across large cohorts of cancer patients is regularly extracted from The Cancer Genome Atlas and the International Cancer Genome Consortium and can now be mapped on phylogenetic trees of epigenetic protein families. Explorers of chromatin signaling can now easily navigate the cancer genomics landscape of writers, readers and erasers of histone marks, chromatin remodeling complexes, histones and their chaperones.
Availability and implementation: http://www.thesgc.org/chromohub/.
Contact: matthieu.schapira@utoronto.ca
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btt710
PMCID: PMC3928521  PMID: 24319001
13.  HTS navigator: freely accessible cheminformatics software for analyzing high-throughput screening data 
Bioinformatics  2013;30(4):588-589.
Summary: We report on the development of the high-throughput screening (HTS) Navigator software to analyze and visualize the results of HTS of chemical libraries. The HTS Navigator processes output files from different plate readers' formats, computes the overall HTS matrix, automatically detects hits and has different types of baseline navigation and correction features. The software incorporates advanced cheminformatics capabilities such as chemical structure storage and visualization, fast similarity search and chemical neighborhood analysis for retrieved hits. The software is freely available for academic laboratories.
Availability and implementation: http://fourches.web.unc.edu/
Contact: fourches@email.unc.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btt718
PMCID: PMC3928525  PMID: 24376084
14.  BETASEQ: a powerful novel method to control type-I error inflation in partially sequenced data for rare variant association testing 
Bioinformatics  2013;30(4):480-487.
Summary: Despite its great capability to detect rare variant associations, next-generation sequencing is still prohibitively expensive when applied to large samples. In case-control studies, it is thus appealing to sequence only a subset of cases to discover variants and genotype the identified variants in controls and the remaining cases under the reasonable assumption that causal variants are usually enriched among cases. However, this approach leads to inflated type-I error if analyzed naively for rare variant association. Several methods have been proposed in recent literature to control type-I error at the cost of either excluding some sequenced cases or correcting the genotypes of discovered rare variants. All of these approaches thus suffer from a certain extent of information loss and are underpowered. We propose a novel method (BETASEQ), which corrects inflation of type-I error by supplementing pseudo-variants while keeping the original sequence and genotype data intact. Extensive simulations and real data analysis demonstrate that, in most practical situations, BETASEQ leads to higher testing power than existing approaches with guaranteed (controlled or conservative) type-I error.
Availability and implementation: BETASEQ and associated R files, including documentation and examples, are available at http://www.unc.edu/~yunmli/betaseq
Contact: songyan@unc.edu or yunli@med.unc.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btt719
PMCID: PMC3928526  PMID: 24336643
15.  DIVE: a data intensive visualization engine 
Bioinformatics  2013;30(4):593-595.
Summary: Modern scientific investigation is generating increasingly larger datasets, yet analyzing these data with current tools is challenging. DIVE is a software framework intended to facilitate big data analysis and reduce the time to scientific insight. Here, we present features of the framework and demonstrate DIVE’s application to the Dynameomics project, looking specifically at two proteins.
Availability and implementation: Binaries and documentation are available at http://www.dynameomics.org/DIVE/DIVESetup.exe.
Contact: daggett@uw.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btt721
PMCID: PMC3928528  PMID: 24336804
16.  Fast pairwise IBD association testing in genome-wide association studies 
Bioinformatics  2013;30(2):206-213.
Motivation: Recently, investigators have proposed state-of-the-art Identity-by-descent (IBD) mapping methods to detect IBD segments between purportedly unrelated individuals. The IBD information can then be used for association testing in genetic association studies. One approach for this IBD association testing strategy is to test for excessive IBD between pairs of cases (‘pairwise method’). However, this approach is inefficient because it requires a large number of permutations. Moreover, a limited number of permutations define a lower bound for P-values, which makes fine-mapping of associated regions difficult because, in practice, a much larger genomic region is implicated than the region that is actually associated.
Results: In this article, we introduce a new pairwise method, ‘Fast-Pairwise’. Fast-Pairwise uses importance sampling to improve efficiency and enable approximation of extremely small P-values. The Fast-Pairwise method takes only days to complete a genome-wide scan. In the application to the WTCCC type 1 diabetes data, Fast-Pairwise successfully fine-maps a human leukocyte antigen gene that is known to cause the disease.
Availability: Fast-Pairwise is publicly available at: http://genetics.cs.ucla.edu/graphibd.
Contact: eeskin@cs.ucla.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btt609
PMCID: PMC3892684  PMID: 24158599
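Entry 16 relies on importance sampling to approximate P-values that plain resampling cannot resolve. The toy sketch below shows the general mechanism on a standard normal tail rather than on IBD permutations: samples are drawn from a proposal centred on the rare region and reweighted by the likelihood ratio. The Gaussian setting, the shifted proposal and the function names are assumptions for illustration, not the Fast-Pairwise procedure.

```python
import math
import random


def tail_prob_importance(t, n=20000, seed=0):
    """Estimate P(Z > t) for Z ~ N(0, 1) by sampling from N(t, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(t, 1.0)  # proposal centred on the rare region
        if x > t:
            # Likelihood ratio phi(x) / phi(x - t) = exp(-t*x + t^2/2)
            total += math.exp(-t * x + 0.5 * t * t)
    return total / n


def tail_prob_exact(t):
    """Closed-form normal tail, used here only to check the estimate."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))
```

A naive estimate with the same number of draws from N(0, 1) would almost always return zero for a threshold such as `t = 5`, which mirrors the lower bound on P-values that a limited number of permutations imposes.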
17.  Testing multiple biological mediators simultaneously 
Bioinformatics  2013;30(2):214-220.
Motivation: Modern biomedical and epidemiological studies often measure hundreds or thousands of biomarkers, such as gene expression or metabolite levels. Although there is an extensive statistical literature on adjusting for ‘multiple comparisons’ when testing whether these biomarkers are directly associated with a disease, testing whether they are biological mediators between a known risk factor and a disease requires a more complex null hypothesis, thus offering additional methodological challenges.
Results: We propose a permutation approach that tests multiple putative mediators and controls the family wise error rate. We demonstrate that, unlike when testing direct associations, replacing the Bonferroni correction with a permutation approach that focuses on the maximum of the test statistics can significantly improve the power to detect mediators even when all biomarkers are independent. Through simulations, we show the power of our method is 2–5× larger than the power achieved by Bonferroni correction. Finally, we apply our permutation test to a case-control study of dietary risk factors and colorectal adenoma to show that, of 149 test metabolites, docosahexaenoate is a possible mediator between fish consumption and decreased colorectal adenoma risk.
Availability and implementation: R-package included in online Supplementary Material.
Contact: joshua.sampson@nih.gov
Supplementary information: Supplementary materials are available at Bioinformatics online.
doi:10.1093/bioinformatics/btt633
PMCID: PMC3892685  PMID: 24202540
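The permutation approach in entry 17 focuses on the maximum of the test statistics. A minimal sketch of that general mechanism (single-step max-statistic adjustment) is given below for the simpler case of direct association between biomarkers and a binary label. The mediation null hypothesis the article actually addresses is more complex and is not reproduced here; the statistic (absolute difference in group means) and all names are hypothetical.

```python
import random


def mean_diff_stats(rows, labels):
    """Absolute difference in group means for each biomarker column."""
    n1 = sum(labels)
    n0 = len(labels) - n1
    stats = []
    for k in range(len(rows[0])):
        s1 = sum(r[k] for r, y in zip(rows, labels) if y == 1)
        s0 = sum(r[k] for r, y in zip(rows, labels) if y == 0)
        stats.append(abs(s1 / n1 - s0 / n0))
    return stats


def maxt_adjusted_pvalues(rows, labels, n_perm=500, seed=0):
    """Family-wise adjusted P-values from the permutation maximum statistic."""
    rng = random.Random(seed)
    observed = mean_diff_stats(rows, labels)
    exceed = [0] * len(observed)
    perm = list(labels)
    for _ in range(n_perm):
        rng.shuffle(perm)  # break any label-biomarker association
        max_stat = max(mean_diff_stats(rows, perm))
        for k, obs in enumerate(observed):
            if max_stat >= obs:
                exceed[k] += 1
    # Add-one correction keeps every P-value strictly positive.
    return [(1 + e) / (1 + n_perm) for e in exceed]
```

Here `rows` is a list of per-sample biomarker lists and `labels` is a list of 0/1 values; comparing each observed statistic with the permutation maximum is what controls the family-wise error rate in this sketch.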
18.  On the simultaneous association analysis of large genomic regions: a massive multi-locus association test 
Bioinformatics  2013;30(2):157-164.
Motivation: For samples of unrelated individuals, we propose a general analysis framework in which hundreds of thousands of genetic loci can be tested simultaneously for association with complex phenotypes. The approach is built on spatial-clustering methodology, assuming that genetic loci that are associated with the target phenotype cluster in certain genomic regions. In contrast to standard methodology for multilocus analysis, which has focused on the dimension reduction of the data, our multilocus association-clustering test profits from the availability of large numbers of genetic loci by detecting clusters of loci that are associated with the phenotype.
Results: The approach is computationally fast and powerful, enabling the simultaneous association testing of large genomic regions. Even the entire genome or certain chromosomes can be tested simultaneously. Using simulation studies, the properties of the approach are evaluated. In an application to a genome-wide association study for chronic obstructive pulmonary disease, we illustrate the practical relevance of the proposed method by simultaneously testing all genotyped loci of the genome-wide association study and by testing each chromosome individually. Our findings suggest that statistical methodology that incorporates spatial-clustering information will be especially useful in whole-genome sequencing studies in which millions or billions of base pairs are recorded and grouped by genomic regions or genes, and are tested jointly for association.
Availability and implementation: Implementation of the approach is available upon request.
Contact: daq412@mail.harvard.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btt654
PMCID: PMC3892690  PMID: 24262215
19.  SeqDepot: streamlined database of biological sequences and precomputed features 
Bioinformatics  2013;30(2):295-297.
Summary: Assembling and/or producing integrated knowledge of sequence features continues to be an onerous and redundant task despite a large number of existing resources. We have developed SeqDepot—a novel database that focuses solely on two primary goals: (i) assimilating known primary sequences with predicted feature data and (ii) providing the most simple and straightforward means to procure and readily use this information. Access to >28.5 million sequences and 300 million features is provided through a well-documented and flexible RESTful interface that supports fetching specific data subsets, bulk queries, visualization and searching by MD5 digests or external database identifiers. We have also developed an HTML5/JavaScript web application exemplifying how to interact with SeqDepot and Perl/Python scripts for use with local processing pipelines.
Availability: Freely available on the web at http://seqdepot.net/. REST access via http://seqdepot.net/api/v1. Database files and scripts may be downloaded from http://seqdepot.net/download.
Contact: ulrich.luke+sci@gmail.com
doi:10.1093/bioinformatics/btt658
PMCID: PMC3892692  PMID: 24234005
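Entry 19 mentions searching by MD5 digests. As a small illustration of deriving such a digest from a sequence string, the sketch below uses Python's standard `hashlib`. The normalization step (removing whitespace and upper-casing) and the hexadecimal output are assumptions made here; the abstract does not state how SeqDepot normalizes sequences or encodes its identifiers, and no SeqDepot API call is shown.

```python
import hashlib


def sequence_md5(seq):
    """Hex MD5 digest of a sequence after a simple, assumed normalization."""
    normalized = "".join(seq.split()).upper()  # drop whitespace, fold case
    return hashlib.md5(normalized.encode("ascii")).hexdigest()
```

Because the digest depends only on the normalized residues, the same sequence copied with different line wrapping or letter case yields the same key under these assumptions.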
20.  Pathway Commons at Virtual Cell: use of pathway data for mathematical modeling 
Bioinformatics  2013;30(2):292-294.
Summary: Pathway Commons is a resource permitting simultaneous queries of multiple pathway databases. However, there is no standard mechanism for using these data (stored in BioPAX format) to annotate and build quantitative mathematical models. Therefore, we developed a new module within the virtual cell modeling and simulation software. It provides pathway data retrieval and visualization and enables automatic creation of executable network models directly from qualitative connections between pathway nodes.
Availability and implementation: Available at Virtual Cell (http://vcell.org/). Application runs on all major platforms and does not require registration for use on the user’s computer. Tutorials and a video are available on the user guide page.
Contact: vcell_support@uchc.edu
doi:10.1093/bioinformatics/btt660
PMCID: PMC3892693  PMID: 24273241
21.  Using Genome Query Language to uncover genetic variation 
Bioinformatics  2013;30(1):1-8.
Motivation: With high-throughput DNA sequencing costs dropping below $1000 for human genomes, data storage, retrieval and analysis are the major bottlenecks in biological studies. To address the large-data challenges, we advocate a clean separation between the evidence collection and the inference in variant calling. We define and implement a Genome Query Language (GQL) that allows for the rapid collection of evidence needed for calling variants.
Results: We provide a number of cases to showcase the use of GQL for complex evidence collection, such as the evidence for large structural variations. Specifically, typical GQL queries can be written in 5–10 lines of high-level code and search large datasets (100 GB) in minutes. We also demonstrate its complementarity with other variant calling tools. Popular variant calling tools can achieve one order of magnitude speed-up by using GQL to retrieve evidence. Finally, we show how GQL can be used to query and compare multiple datasets. By separating the evidence and inference for variant calling, GQL frees all variant detection tools from the data-intensive evidence collection and lets them focus on statistical inference.
Availability: GQL can be downloaded from http://cseweb.ucsd.edu/~ckozanit/gql.
Contact: ckozanit@ucsd.edu or vbafna@cs.ucsd.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btt250
PMCID: PMC3866549  PMID: 23751181
22.  A C library for retrieving specific reactions from the BioModels database 
Bioinformatics  2013;30(1):129-130.
Summary: We describe libSBMLReactionFinder, a C library for retrieving specific biochemical reactions from the curated systems biology markup language models contained in the BioModels database. The library leverages semantic annotations in the database to associate reactions with human-readable descriptions, making the reactions retrievable through simple string searches. Our goal is to provide a useful tool for quantitative modelers who seek to accelerate modeling efforts through the reuse of previously published representations of specific chemical reactions.
Availability and implementation: The library is open-source and dual licensed under the Mozilla Public License Version 2.0 and GNU General Public License Version 2.0. Project source code, downloads and documentation are available at http://code.google.com/p/lib-sbml-reaction-finder.
Contact: mneal@uw.edu
doi:10.1093/bioinformatics/btt567
PMCID: PMC3866552  PMID: 24078714
23.  A user-oriented web crawler for selectively acquiring online content in e-health research 
Bioinformatics  2013;30(1):104-114.
Motivation: Life stories of diseased and healthy individuals are abundantly available on the Internet. Collecting and mining such online content can offer many valuable insights into patients’ physical and emotional states throughout the pre-diagnosis, diagnosis, treatment and post-treatment stages of the disease compared with those of healthy subjects. However, such content is widely dispersed across the web. Using traditional query-based search engines to manually collect relevant materials is rather labor intensive and often incomplete due to resource constraints in terms of human query composition and result parsing efforts. The alternative option, blindly crawling the whole web, has proven inefficient and unaffordable for e-health researchers.
Results: We propose a user-oriented web crawler that adaptively acquires user-desired content on the Internet to meet the specific online data source acquisition needs of e-health researchers. Experimental results on two cancer-related case studies show that the new crawler can substantially accelerate the acquisition of highly relevant online content compared with the existing state-of-the-art adaptive web crawling technology. For the breast cancer case study using the full training set, the new method achieves a cumulative precision between 74.7 and 79.4% after 5 h of execution till the end of the 20-h long crawling session as compared with the cumulative precision between 32.8 and 37.0% using the peer method for the same time period. For the lung cancer case study using the full training set, the new method achieves a cumulative precision between 56.7 and 61.2% after 5 h of execution till the end of the 20-h long crawling session as compared with the cumulative precision between 29.3 and 32.4% using the peer method. Using the reduced training set in the breast cancer case study, the cumulative precision of our method is between 44.6 and 54.9%, whereas the cumulative precision of the peer method is between 24.3 and 26.3%; for the lung cancer case study using the reduced training set, the cumulative precisions of our method and the peer method are, respectively, between 35.7 and 46.7% versus between 24.1 and 29.6%. These numbers clearly show a consistently superior accuracy of our method in discovering and acquiring user-desired online content for e-health research.
Availability and implementation: The implementation of our user-oriented web crawler is freely available to non-commercial users via the following Web site: http://bsec.ornl.gov/AdaptiveCrawler.shtml. The Web site provides a step-by-step guide on how to execute the web crawler implementation. In addition, the Web site provides the two study datasets including manually labeled ground truth, initial seeds and the crawling results reported in this article.
Contact: xus1@ornl.gov
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btt571
PMCID: PMC3866553  PMID: 24078710
24.  MSPrep—Summarization, normalization and diagnostics for processing of mass spectrometry–based metabolomic data 
Bioinformatics  2013;30(1):133-134.
Motivation: Although R packages exist for the pre-processing of metabolomic data, they currently do not incorporate additional analysis steps of summarization, filtering and normalization of aligned data. We developed the MSPrep R package to complement other packages by providing these additional steps, implementing a selection of popular normalization algorithms and generating diagnostics to help guide investigators in their analyses.
Availability: http://www.sourceforge.net/projects/msprep
Contact: grant.hughes@ucdenver.edu
Supplementary Information: Supplementary materials are available at Bioinformatics online.
doi:10.1093/bioinformatics/btt589
PMCID: PMC3866554  PMID: 24174567
25.  PhosphoNetworks: a database for human phosphorylation networks 
Bioinformatics  2013;30(1):141-142.
Summary: Phosphorylation plays an important role in cellular signal transduction. Current phosphorylation-related databases often focus on the phosphorylation sites, which are mainly determined by mass spectrometry. Here, we present PhosphoNetworks, a phosphorylation database built on a high-resolution map of phosphorylation networks. This high-resolution map of phosphorylation networks provides not only the kinase–substrate relationships (KSRs), but also the specific phosphorylation sites at which the kinases act on the substrates. The database contains the most comprehensive dataset for KSRs, including the relationships from a recent high-throughput project for identification of KSRs using protein microarrays, as well as known KSRs curated from the literature. In addition, the database also includes several analytical tools for dissecting phosphorylation networks. PhosphoNetworks is expected to play a prominent role in proteomics and phosphorylation-related disease research.
Availability and implementation: http://www.phosphonetworks.org
Contact: jiang.qian@jhmi.edu
doi:10.1093/bioinformatics/btt627
PMCID: PMC3866559  PMID: 24227675
