
Results 1-25 (2620)

1.  A variable selection method for genome-wide association studies 
Bioinformatics  2010;27(1):1-8.
Motivation: Genome-wide association studies (GWAS) involving half a million or more single nucleotide polymorphisms (SNPs) allow genetic dissection of complex diseases in a holistic manner. The common practice of analyzing one SNP at a time does not fully realize the potential of GWAS to identify multiple causal variants and to predict risk of disease. Existing methods for joint analysis of GWAS data tend to miss causal SNPs that are marginally uncorrelated with disease and have high false discovery rates (FDRs).
Results: We introduce GWASelect, a statistically powerful and computationally efficient variable selection method designed to tackle the unique challenges of GWAS data. This method searches iteratively over the potential SNPs conditional on previously selected SNPs and is thus capable of capturing causal SNPs that are marginally correlated with disease as well as those that are marginally uncorrelated with disease. A special resampling mechanism is built into the method to reduce false positive findings. Simulation studies demonstrate that the GWASelect performs well under a wide spectrum of linkage disequilibrium patterns and can be substantially more powerful than existing methods in capturing causal variants while having a lower FDR. In addition, the regression models based on the GWASelect tend to yield more accurate prediction of disease risk than existing methods. The advantages of the GWASelect are illustrated with the Wellcome Trust Case-Control Consortium (WTCCC) data.
Availability: The software implementing GWASelect is available at
Access to WTCCC data:
Supplementary information: Supplementary data are available at Bioinformatics Online.
PMCID: PMC3025714  PMID: 21036813
2.  Efficient whole-genome association mapping using local phylogenies for unphased genotype data 
Bioinformatics  2008;24(19):2215-2221.
Motivation: Recent advances in genotyping technology have made data acquisition for whole-genome association studies cost effective, and a current active area of research is developing efficient methods to analyze such large-scale datasets. Most sophisticated association mapping methods that are currently available take phased haplotype data as input. However, phase information is not readily available from sequencing methods and inferring the phase via computational approaches is time-consuming, taking days to phase a single chromosome.
Results: In this article, we devise an efficient method for scanning unphased whole-genome data for association. Our approach combines a recently found linear-time algorithm for phasing genotypes on trees with a recently proposed tree-based method for association mapping. From unphased genotype data, our algorithm builds local phylogenies along the genome, and scores each tree according to the clustering of cases and controls. We assess the performance of our new method on both simulated and real biological datasets.
Availability: The software described in this article is available at and distributed under the GNU General Public License.
PMCID: PMC2553438  PMID: 18667442
3.  iFoldRNA: three-dimensional RNA structure prediction and folding 
Bioinformatics  2008;24(17):1951-1952.
Summary: Three-dimensional RNA structure prediction and folding is of significant interest in the biological research community. Here, we present iFoldRNA, a novel web-based methodology for RNA structure prediction with near atomic resolution accuracy and analysis of RNA folding thermodynamics. iFoldRNA rapidly explores RNA conformations using discrete molecular dynamics simulations of input RNA sequences. Starting from simplified linear-chain conformations, RNA molecules (<50 nt) fold to native-like structures within half an hour of simulation, facilitating rapid RNA structure prediction. All-atom reconstruction of energetically stable conformations generates iFoldRNA predicted RNA structures. The predicted RNA structures are within 2–5 Å root mean square deviations (RMSDs) from corresponding experimentally derived structures. RNA folding parameters including specific heat, contact maps, simulation trajectories, gyration radii, RMSDs from native state and fraction of native-like contacts are accessible from iFoldRNA. We expect iFoldRNA will serve as a useful resource for RNA structure prediction and folding thermodynamic analyses.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2559968  PMID: 18579566
4.  Systematic biological prioritization after a genome-wide association study: an application to nicotine dependence 
Bioinformatics  2008;24(16):1805-1811.
Motivation: A challenging problem after a genome-wide association study (GWAS) is to balance the statistical evidence of genotype–phenotype correlation with a priori evidence of biological relevance.
Results: We introduce a method for systematically prioritizing single nucleotide polymorphisms (SNPs) for further study after a GWAS. The method combines evidence across multiple domains including statistical evidence of genotype–phenotype correlation, known pathways in the pathologic development of disease, SNP/gene functional properties, comparative genomics, prior evidence of genetic linkage, and linkage disequilibrium. We apply this method to a GWAS of nicotine dependence, and use simulated data to test it on several commercial SNP microarrays.
Availability: A comprehensive database of biological prioritization scores for all known SNPs is available at This can be used to prioritize nicotine dependence association studies through a straightforward mathematical formula—no special software is necessary.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2610477  PMID: 18565990
5.  Comprehensive in silico mutagenesis highlights functionally important residues in proteins 
Bioinformatics  2008;24(16):i207-i212.
Motivation: Mutating residues into alanine (alanine scanning) is one of the fastest experimental means of probing hypotheses about protein function. Alanine scans can reveal functional hot spots, i.e. residues that alter function upon mutation. In vitro mutagenesis is cumbersome and costly: probing all residues in a protein is typically as impossible as substituting by all non-native amino acids. In contrast, such exhaustive mutagenesis is feasible in silico.
Results: Previously, we developed SNAP to predict functional changes due to non-synonymous single nucleotide polymorphisms. Here, we applied SNAP to all experimental mutations in the ASEdb database of alanine scans; we identified 70% of the hot spots (≥1 kCal/mol change in binding energy); more severe changes were predicted more accurately. Encouraged, we carried out a complete all-against-all in silico mutagenesis for human glucokinase. Many of the residues predicted as functionally important have indeed been confirmed in the literature, others await experimental verification, and our method is ready to aid in the design of in vitro mutagenesis.
Availability: ASEdb and glucokinase scores are available at For submissions of large/whole proteins for processing please contact the author.
PMCID: PMC2597370  PMID: 18689826
6.  LOT: a tool for linkage analysis of ordinal traits for pedigree data 
Bioinformatics  2008;24(15):1737-1739.
Summary: Existing linkage-analysis methods address binary or quantitative traits. However, many complex diseases and human conditions, particularly behavioral disorders, are rated on ordinal scales. Herein, we introduce LOT, a tool that performs linkage analysis of ordinal traits for pedigree data. It implements a latent-variable proportional-odds logistic model that relates inheritance patterns to the distribution of the ordinal trait. The likelihood-ratio test is used for testing evidence of linkage.
Availability: The LOT program is available for download at
PMCID: PMC2566542  PMID: 18535081
7.  Powerful fusion: PSI-BLAST and consensus sequences 
Bioinformatics  2008;24(18):1987-1993.
Motivation: A typical PSI-BLAST search consists of iterative scanning and alignment of a large sequence database during which a scoring profile is progressively built and refined. Such a profile can also be stored and used to search against a different database of sequences. Using it to search against a database of consensus rather than native sequences is a simple add-on that boosts performance surprisingly well. The improvement comes at a price: we hypothesized that random alignment score statistics would differ between native and consensus sequences. Thus PSI-BLAST-based profile searches against consensus sequences might incorrectly estimate statistical significance of alignment scores. In addition, iterative searches against consensus databases may fail. Here, we addressed these challenges in an attempt to harness the full power of the combination of PSI-BLAST and consensus sequences.
Results: We studied alignment score statistics for various types of consensus sequences. In general, the score distribution parameters of profile-based consensus sequence alignments differed significantly from those derived for the native sequences. PSI-BLAST partially compensated for the parameter variation. We have identified a protocol for building specialized consensus sequences that significantly improved search sensitivity and preserved score distribution parameters. As a result, PSI-BLAST profiles can be used to search specialized consensus sequences without sacrificing estimates of statistical significance. We also provided results indicating that iterative PSI-BLAST searches against consensus sequences could work very well. Overall, we showed how a very popular and effective method could be used to identify significantly more relevant similarities among protein sequences.
PMCID: PMC2577777  PMID: 18678588
8.  Modeling peptide fragmentation with dynamic Bayesian networks for peptide identification 
Bioinformatics  2008;24(13):i348-i356.
Motivation: Tandem mass spectrometry (MS/MS) is an indispensable technology for identification of proteins from complex mixtures. Proteins are digested to peptides that are then identified by their fragmentation patterns in the mass spectrometer. Thus, at its core, MS/MS protein identification relies on the relative predictability of peptide fragmentation. Unfortunately, peptide fragmentation is complex and not fully understood, and what is understood is not always exploited by peptide identification algorithms.
Results: We use a hybrid dynamic Bayesian network (DBN)/support vector machine (SVM) approach to address these two problems. We train a set of DBNs on high-confidence peptide-spectrum matches. These DBNs, known collectively as Riptide, comprise a probabilistic model of peptide fragmentation chemistry. Examination of the distributions learned by Riptide allows identification of new trends, such as prevalent a-ion fragmentation at peptide cleavage sites C-term to hydrophobic residues. In addition, Riptide can be used to produce likelihood scores that indicate whether a given peptide-spectrum match is correct. A vector of such scores is evaluated by an SVM, which produces a final score to be used in peptide identification. Using Riptide in this way yields improved discrimination when compared to other state-of-the-art MS/MS identification algorithms, increasing the number of positive identifications by as much as 12% at a 1% false discovery rate.
Availability: Python and C source code are available upon request from the authors. The curated training sets are available at The Graphical Model Tool Kit (GMTK) is freely available at
PMCID: PMC2665034  PMID: 18586734
9.  Memory-efficient dynamic programming backtrace and pairwise local sequence alignment 
Bioinformatics  2008;24(16):1772-1778.
Motivation: A backtrace through a dynamic programming algorithm's intermediate results in search of an optimal path, or to sample paths according to an implied probability distribution, or as the second stage of a forward–backward algorithm, is a task of fundamental importance in computational biology. When there is insufficient space to store all intermediate results in high-speed memory (e.g. cache) existing approaches store selected stages of the computation, and recompute missing values from these checkpoints on an as-needed basis.
Results: Here we present an optimal checkpointing strategy, and demonstrate its utility with pairwise local sequence alignment of sequences of length 10 000.
Availability: Sample C++-code for optimal backtrace is available in the Supplementary Materials.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2668612  PMID: 18558620
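The checkpoint-and-recompute idea described in entry 9 can be sketched as follows. This is a minimal illustration that uses evenly spaced, square-root-sized checkpoints, not the optimal strategy the authors present; the function name `backtrace_with_checkpoints` and its `step` and `visit` callbacks are hypothetical names introduced only for this example.

```python
import math


def backtrace_with_checkpoints(initial, step, n_stages, visit):
    """Visit stages n_stages, ..., 0 in reverse while storing only ~sqrt(n) stages.

    `step` maps the state of stage i to the state of stage i + 1.
    `visit(i, state)` is called once per stage, last stage first.
    Assumption: evenly spaced checkpoints, chosen for simplicity.
    """
    block = max(1, math.isqrt(n_stages))

    # Forward pass: keep only every `block`-th stage.
    checkpoints = {0: initial}
    state = initial
    for i in range(1, n_stages + 1):
        state = step(state)
        if i % block == 0:
            checkpoints[i] = state

    # Backward pass: recompute one block at a time from its checkpoint.
    hi = n_stages
    while hi >= 0:
        lo = (hi // block) * block
        states = [checkpoints[lo]]
        for _ in range(lo, hi):
            states.append(step(states[-1]))
        for offset in range(hi - lo, -1, -1):
            visit(lo + offset, states[offset])
        hi = lo - 1


visited = []
backtrace_with_checkpoints(0, lambda s: s + 2, 10,
                           lambda i, s: visited.append((i, s)))
```

With this layout, memory is on the order of the square root of the number of stages, at the cost of computing each stage roughly twice.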
14.  Census 2: isobaric labeling data analysis 
Bioinformatics  2014;30(15):2208-2209.
Motivation: We introduce Census 2, an update of a mass spectrometry data analysis tool for peptide/protein quantification. New features for analysis of isobaric labeling, such as Tandem Mass Tag (TMT) or Isobaric Tags for Relative and Absolute Quantification (iTRAQ), have been added in this version, including a reporter ion impurity correction, a reporter ion intensity threshold filter and an option for weighted normalization to correct mixing errors. TMT/iTRAQ analysis can be performed on experiments using HCD (High Energy Collision Dissociation) only, CID (Collision Induced Dissociation)/HCD dual scans or HCD triple-stage mass spectrometry data. To improve measurement accuracy, we implemented weighted normalization, multiple tandem spectral approach, impurity correction and dynamic intensity threshold features.
Availability and implementation: Census 2 supports multiple input file formats including MS1/MS2, DTASelect, mzXML and pepXML. It requires JAVA version 6 or later to run. Free download of Census 2 for academic users is available at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC4155478  PMID: 24681903
15.  Covariate-modulated local false discovery rate for genome-wide association studies 
Bioinformatics  2014;30(15):2098-2104.
Motivation: Genome-wide association studies (GWAS) have largely failed to identify most of the genetic basis of highly heritable diseases and complex traits. Recent work has suggested this could be because many genetic variants, each with individually small effects, compose their genetic architecture, limiting the power of GWAS, given currently obtainable sample sizes. In this scenario, Bonferroni-derived thresholds are severely underpowered to detect the vast majority of associations. Local false discovery rate (fdr) methods provide more power to detect non-null associations, but implicit assumptions about the exchangeability of single nucleotide polymorphisms (SNPs) limit their ability to discover non-null loci.
Methods: We propose a novel covariate-modulated local false discovery rate (cmfdr) that incorporates prior information about gene element–based functional annotations of SNPs, so that SNPs from categories enriched for non-null associations have a lower fdr for a given value of a test statistic than SNPs in unenriched categories. This readjustment of fdr based on functional annotations is achieved empirically by fitting a covariate-modulated parametric two-group mixture model. The proposed cmfdr methodology is applied to a large Crohn’s disease GWAS.
Results: Use of cmfdr dramatically improves power, e.g. increasing the number of loci declared significant at the 0.05 fdr level by a factor of 5.4. We also demonstrate that SNPs declared significant using cmfdr, compared with usual fdr, replicate in much higher numbers, while maintaining similar replication rates for a given fdr cutoff in de novo samples, using the eight Crohn’s disease substudies as independent training and test datasets.
Availability and implementation:
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC4103587  PMID: 24711653
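The two-group mixture behind entry 15 can be illustrated with a small sketch. It assumes a standard normal null density, a wider zero-centered normal as the non-null density, and a logistic link from SNP annotations to the null proportion; all three are simplifying assumptions made here, and the function names and parameters are hypothetical rather than taken from the cmfdr software.

```python
import math


def normal_pdf(z, sd):
    """Density of a zero-mean normal with standard deviation `sd`."""
    return math.exp(-0.5 * (z / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def null_proportion(covariates, intercept, coefficients):
    """Logistic link from annotation covariates to pi0 (assumed form)."""
    eta = intercept + sum(c * x for c, x in zip(coefficients, covariates))
    return 1.0 / (1.0 + math.exp(-eta))


def local_fdr(z, pi0, alt_sd=3.0):
    """fdr(z) = pi0 * f0(z) / (pi0 * f0(z) + (1 - pi0) * f1(z))."""
    null_part = pi0 * normal_pdf(z, 1.0)
    alt_part = (1.0 - pi0) * normal_pdf(z, alt_sd)
    return null_part / (null_part + alt_part)


# Same test statistic, two hypothetical annotation categories.
fdr_unenriched = local_fdr(3.0, null_proportion([0.0], 4.0, [-2.0]))
fdr_enriched = local_fdr(3.0, null_proportion([1.0], 4.0, [-2.0]))
```

Under these assumptions, a SNP in the enriched category receives a smaller null proportion and therefore a lower fdr at the same test statistic, which is the behavior the abstract describes.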
16.  PRADA: pipeline for RNA sequencing data analysis 
Bioinformatics  2014;30(15):2224-2226.
Summary: Technological advances in high-throughput sequencing necessitate improved computational tools for processing and analyzing large-scale datasets in a systematic automated manner. For that purpose, we have developed PRADA (Pipeline for RNA-Sequencing Data Analysis), a flexible, modular and highly scalable software platform that provides many different types of information available by multifaceted analysis starting from raw paired-end RNA-seq data: gene expression levels, quality metrics, detection of unsupervised and supervised fusion transcripts, detection of intragenic fusion variants, homology scores and fusion frame classification. PRADA uses a dual-mapping strategy that increases sensitivity and refines the analytical endpoints. PRADA has been used extensively and successfully in the glioblastoma and renal clear cell projects of The Cancer Genome Atlas program.
Availability and implementation:
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC4103589  PMID: 24695405
17.  Deconvolving tumor purity and ploidy by integrating copy number alterations and loss of heterozygosity 
Bioinformatics  2014;30(15):2121-2129.
Motivation: Next-generation sequencing (NGS) has revolutionized the study of cancer genomes. However, the reads obtained from NGS of tumor samples often consist of a mixture of normal and tumor cells, which themselves can be of multiple clonal types. A prominent problem in the analysis of cancer genome sequencing data is deconvolving the mixture to identify the reads associated with tumor cells or a particular subclone of tumor cells. Solving the problem is, however, challenging because of the so-called ‘identifiability problem’, where different combinations of tumor purity and ploidy often explain the sequencing data equally well.
Results: We propose a new model to resolve the identifiability problem by integrating two types of sequencing information—somatic copy number alterations and loss of heterozygosity—within a unified probabilistic framework. We derive algorithms to solve our model, and implement them in a software package called PyLOH. We benchmark the performance of PyLOH using both simulated data and 12 breast cancer sequencing datasets and show that PyLOH outperforms existing methods in disambiguating the identifiability problem and estimating tumor purity.
Availability and implementation: The PyLOH package is written in Python and is publicly available at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC4103592  PMID: 24695406
18.  MetDisease—connecting metabolites to diseases via literature 
Bioinformatics  2014;30(15):2239-2241.
Motivation: In recent years, metabolomics has emerged as an approach to perform large-scale characterization of small molecules in biological systems. Metabolomics posed a number of bioinformatics challenges associated with data analysis and interpretation. Genome-based metabolic reconstructions have established a powerful framework for connecting metabolites to genes through metabolic reactions and enzymes that catalyze them. Pathway databases and bioinformatics tools that use this framework have proven to be useful for annotating experimental metabolomics data. This framework can be used to infer connections between metabolites and diseases through annotated disease genes. However, only about half of experimentally detected metabolites can be mapped to canonical metabolic pathways. We present a new Cytoscape 3 plug-in, MetDisease, which uses an alternative approach to link metabolites to disease information. MetDisease uses Medical Subject Headings (MeSH) disease terms mapped to PubChem compounds through literature to annotate compound networks.
Availability and implementation: MetDisease can be downloaded from or installed via the Cytoscape app manager. Further information about MetDisease can be found at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC4103594  PMID: 24713438
19.  A fast and powerful tree-based association test for detecting complex joint effects in case–control studies 
Bioinformatics  2014;30(15):2171-2178.
Motivation: Multivariate tests derived from the logistic regression model are widely used to assess the joint effect of multiple predictors on a disease outcome in case–control studies. These tests become less optimal if the joint effect cannot be approximated adequately by the additive model. The tree-structure model is an attractive alternative, as it is more apt to capture non-additive effects. However, the tree model is used most commonly for prediction and seldom for hypothesis testing, mainly because of the computational burden associated with the resampling-based procedure required for estimating the significance level.
Results: We designed a fast algorithm for building the tree-structure model and proposed a robust TREe-based Association Test (TREAT) that incorporates an adaptive model selection procedure to identify the optimal tree model representing the joint effect. We applied TREAT as a multilocus association test on >20 000 genes/regions in a study of esophageal squamous cell carcinoma (ESCC) and detected a highly significant novel association between the gene CDKN2B and ESCC (). We also demonstrated, through simulation studies, the power advantage of TREAT over other commonly used tests.
Availability and implementation: The package TREAT is freely available for download at, implemented in C++ and R and supported on 64-bit Linux and 64-bit MS Windows.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC4103596  PMID: 24794927
20.  A change-point model for identifying 3′UTR switching by next-generation RNA sequencing 
Bioinformatics  2014;30(15):2162-2170.
Motivation: Next-generation RNA sequencing offers an opportunity to investigate the transcriptome on an unprecedented scale. Recent studies have revealed widespread alternative polyadenylation (polyA) in eukaryotes, leading to various mRNA isoforms differing in their 3′ untranslated regions (3′UTR), through which the stability, localization and translation of mRNA can be regulated. However, very few, if any, methods and tools are available for directly analyzing this special alternative RNA processing event. Conventional methods rely on annotation of polyA sites; yet, such knowledge remains incomplete, and identification of polyA sites is still challenging. The goal of this article is to develop methods for detecting 3′UTR switching without any prior knowledge of polyA annotations.
Results: We propose a change-point model based on a likelihood ratio test for detecting 3′UTR switching. We develop a directional testing procedure for identifying dramatic shortening or lengthening events in 3′UTR, while controlling mixed directional false discovery rate at a nominal level. To our knowledge, this is the first approach to analyze 3′UTR switching directly without relying on any polyA annotations. Simulation studies and applications to two real datasets reveal that our proposed method is powerful, accurate and feasible for the analysis of next-generation RNA sequencing data.
Conclusions: The proposed method will fill a void among alternative RNA processing analysis tools for transcriptome studies. It can help to obtain additional insights from RNA sequencing data by understanding gene regulation mechanisms through the analysis of 3′UTR switching.
Availability and implementation: The software is implemented in Java and can be freely downloaded from
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC4103598  PMID: 24728858
21.  methylC Track: visual integration of single-base resolution DNA methylation data on the WashU EpiGenome Browser 
Bioinformatics  2014;30(15):2206-2207.
Summary: We present methylC track, an efficient mechanism for visualizing single-base resolution DNA methylation data on a genome browser. The methylC track dynamically integrates the level of methylation, the position and context of the methylated cytosine (i.e. CG, CHG and CHH), strand and confidence level (e.g. read coverage depth in the case of whole-genome bisulfite sequencing data). Investigators can access and integrate this information visually at a specific locus or at the genome-wide level on the WashU EpiGenome Browser in the context of other rich epigenomic datasets.
Availability and implementation: The methylC track is part of the WashU EpiGenome Browser, which is open source and freely available at The most up-to-date instructions and tools for preparing methylC track are available at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC4103599  PMID: 24728854
22.  Association analysis using next-generation sequence data from publicly available control groups: the robust variance score statistic 
Bioinformatics  2014;30(15):2179-2188.
Motivation: Sufficiently powered case–control studies with next-generation sequence (NGS) data remain prohibitively expensive for many investigators. If feasible, a more efficient strategy would be to include publicly available sequenced controls. However, these studies can be confounded by differences in sequencing platform; alignment, single nucleotide polymorphism and variant calling algorithms; read depth; and selection thresholds. Assuming one can match cases and controls on the basis of ethnicity and other potential confounding factors, and one has access to the aligned reads in both groups, we investigate the effect of systematic differences in read depth and selection threshold when comparing allele frequencies between cases and controls. We propose a novel likelihood-based method, the robust variance score (RVS), that substitutes genotype calls by their expected values given observed sequence data.
Results: We show theoretically that the RVS eliminates read depth bias in the estimation of minor allele frequency. We also demonstrate that, using simulated and real NGS data, the RVS method controls Type I error and has comparable power to the ‘gold standard’ analysis with the true underlying genotypes for both common and rare variants.
Availability and implementation: An RVS R script and instructions can be found at, and at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC4103600  PMID: 24733292
23.  Inferring rare disease risk variants based on exact probabilities of sharing by multiple affected relatives 
Bioinformatics  2014;30(15):2189-2196.
Motivation: Family-based designs are regaining popularity for genomic sequencing studies because they provide a way to test cosegregation with disease of variants that are too rare in the population to be tested individually in a conventional case–control study.
Results: Where only a few affected subjects per family are sequenced, the probability that any variant would be shared by all affected relatives (given it occurred in any one family member) provides evidence against the null hypothesis of a complete absence of linkage and association. A P-value can be obtained as the sum of the probabilities of sharing events as (or more) extreme in one or more families. We generalize an existing closed-form expression for exact sharing probabilities to more than two relatives per family. When pedigree founders are related, we show that an approximation of sharing probabilities based on empirical estimates of kinship among founders obtained from genome-wide marker data is accurate for low levels of kinship. We also propose a more generally applicable approach based on Monte Carlo simulations. We applied this method to a study of 55 multiplex families with apparent non-syndromic forms of oral clefts from four distinct populations, with whole exome sequences available for two or three affected members per family. The rare single nucleotide variant rs149253049 in ADAMTS9 shared by affected relatives in three Indian families achieved significance after correcting for multiple comparisons (p = 2×10⁻⁶).
Availability and implementation: Source code and binaries of the R package RVsharing are freely available for download at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC4103601  PMID: 24740360
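The P-value described in entry 23 (a sum over sharing events as or more extreme than the one observed) can be sketched by brute-force enumeration. The sketch assumes that each family's sharing probability is already known, that families are independent, and that "as or more extreme" means "no more probable than the observed configuration"; that reading is an assumption, and `sharing_pvalue` is a hypothetical name rather than a function of the RVsharing package.

```python
from itertools import product


def sharing_pvalue(share_probs, observed):
    """Sum the probabilities of all sharing configurations that are no more
    probable than the observed one.

    share_probs[i] is the probability that all sequenced affected relatives
    in family i share the variant; observed[i] is 1 if they do, else 0.
    """
    def config_prob(config):
        prob = 1.0
        for p, shared in zip(share_probs, config):
            prob *= p if shared else 1.0 - p
        return prob

    observed_prob = config_prob(observed)
    total = 0.0
    for config in product((0, 1), repeat=len(share_probs)):
        prob = config_prob(config)
        # Small relative tolerance guards against floating-point ties.
        if prob <= observed_prob * (1.0 + 1e-12):
            total += prob
    return total


# Two hypothetical families, each with sharing probability 1/3, both sharing.
p_value = sharing_pvalue([1.0 / 3.0, 1.0 / 3.0], (1, 1))
```

Enumeration grows as 2 to the number of families, so this form is only practical for a small number of families carrying the variant.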
24.  The automated function prediction SIG looks back at 2013 and prepares for 2014 
Bioinformatics  2014;30(14):2091-2092.
PMCID: PMC4080736  PMID: 24590444
25.  TroX: a new method to learn about the genesis of aneuploidy from trisomic products of conception 
Bioinformatics  2014;30(14):2035-2042.
Motivation: An estimated 10–30% of clinically recognized conceptions are aneuploid, leading to spontaneous miscarriages, in vitro fertilization failures and, when viable, severe developmental disabilities. With the ongoing reduction in the cost of genotyping and DNA sequencing, the use of high-density single nucleotide polymorphism (SNP) markers for clinical diagnosis of aneuploidy and biomedical research into its causes is becoming common practice. A reliable, flexible and computationally feasible method for inferring the sources of aneuploidy is thus crucial.
Results: We propose a new method, TroX, for analyzing human trisomy data using high density SNP markers from a trisomic individual or product of conception and one parent. Using a hidden Markov model, we infer the stage of the meiotic error (I or II) and the individual in which the non-disjunction event occurred, as well as the crossover locations on the trisomic chromosome. A novel and important feature of the method is its reliance on data from the proband and only one parent, reducing the experimental cost by a third and enabling a larger set of data to be used. We evaluate our method by applying it to simulated trio data as well as to genotype data for 282 trios that include a child trisomic for chromosome 21. The analyses show the method to be highly reliable even when data from only one parent are available. With the increasing availability of DNA samples from mother and fetus, application of approaches such as ours should yield unprecedented insights into the genetic risk factors for aneuploidy.
Availability and implementation: An R package implementing TroX is available for download at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC4080739  PMID: 24659032
