Results 1-25 (369)
1.  A variable selection method for genome-wide association studies 
Bioinformatics  2011;27(1):1-8.
Genome-wide association studies (GWAS) involving half a million or more single nucleotide polymorphisms (SNPs) allow genetic dissection of complex diseases in a holistic manner. The common practice of analyzing one SNP at a time does not fully realize the potential of GWAS to identify multiple causal variants and to predict risk of disease. Existing methods for joint analysis of GWAS data tend to miss causal SNPs that are marginally uncorrelated with disease and have high false discovery rates (FDRs).
We introduce GWASelect, a statistically powerful and computationally efficient variable selection method designed to tackle the unique challenges of GWAS data. This method searches iteratively over the potential SNPs conditional on previously selected SNPs and is thus capable of capturing causal SNPs that are marginally correlated with disease as well as those that are marginally uncorrelated with disease. A special resampling mechanism is built into the method to reduce false-positive findings. Simulation studies demonstrate that the GWASelect performs well under a wide spectrum of linkage disequilibrium (LD) patterns and can be substantially more powerful than existing methods in capturing causal variants while having a lower FDR. In addition, the regression models based on the GWASelect tend to yield more accurate prediction of disease risk than existing methods. The advantages of the GWASelect are illustrated with the Wellcome Trust Case-Control Consortium (WTCCC) data.
PMCID: PMC3025714  PMID: 21036813
2.  PathVisio-Validator: a rule-based validation plugin for graphical pathway notations 
Bioinformatics  2011;28(6):889-890.
Purpose: The PathVisio-Validator plugin aims to simplify the task of producing biological pathway diagrams that follow graphical standardized notations, such as Molecular Interaction Maps or the Systems Biology Graphical Notation. This plugin assists in the creation of pathway diagrams by ensuring correct usage of a notation, and thereby reducing ambiguity when diagrams are shared among biologists. Rulesets, needed in the validation process, can be generated for any graphical notation that a developer desires, using either Schematron or Groovy. The plugin also provides support for filtering validation results, validating on a subset of rules, and distinguishing errors and warnings.
Availability: The PathVisio-Validator plugin works with versions of PathVisio 2.0.11 and later on Windows, Mac OS X and Linux. The plugin along with the instructions, example rulesets for Groovy and Schematron, and Java source code can be downloaded at The software is developed under the open-source Apache 2.0 License and is freely available for both commercial and academic use.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3307104  PMID: 22199389
3.  Modeling community-wide molecular networks of multicellular systems 
Bioinformatics  2011;28(5):694-700.
Motivation: Multicellular systems, such as tissues, are composed of different cell types that form a heterogeneous community. Behavior of these systems is determined by complex regulatory networks within (intracellular networks) and between (intercellular networks) cells. Increasingly more studies are applying genome-wide experimental approaches to delineate the contributions of individual cell types (e.g. stromal, epithelial, vascular cells) to collective behavior of heterogeneous cell communities (e.g. tumors). Although many computational methods have been developed for analyses of intracellular networks based on genome-scale data, these efforts have not been extended toward analyzing genomic data from heterogeneous cell communities.
Results: Here, we propose a network-based approach for analyses of genome-scale data from multiple cell types to extract community-wide molecular networks comprised of intra- and intercellular interactions. Intercellular interactions in this model can be physical interactions between proteins or indirect interactions mediated by secreted metabolites of neighboring cells. Applying this method on data from a recent study on xenograft mouse models of human lung adenocarcinoma, we uncover an extensive network of intra- and intercellular interactions involved in the acquired resistance to angiogenesis inhibitors.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3338330  PMID: 22210865
4.  BPDA2d—a 2D global optimization-based Bayesian peptide detection algorithm for liquid chromatograph–mass spectrometry 
Bioinformatics  2011;28(4):564-572.
Motivation: Peptide detection is a crucial step in mass spectrometry (MS) based proteomics. Most existing algorithms are based upon greedy isotope template matching and thus may be prone to error propagation and ineffective at detecting overlapping peptides. In addition, existing algorithms usually work at different charge states separately, isolating useful information that can be drawn from other charge states, which may lead to poor detection of low abundance peptides.
Results: BPDA2d models spectra as a mixture of candidate peptide signals and systematically evaluates all possible combinations of peptide candidates to interpret the given spectra. For each candidate, BPDA2d takes into account its elution profile, charge state distribution and isotope pattern, and it combines all evidence to infer the candidate's signal and existence probability. By piecing all evidence together, especially by deriving information across charge states, low abundance peptides can be better identified and peptide detection rates can be improved. Instead of local template matching, BPDA2d performs global optimization for all candidates and systematically optimizes their signals. Since BPDA2d looks for the optimum among all possible interpretations of the given spectra, it is capable of handling complex spectra where features overlap. BPDA2d estimates the posterior existence probability of detected peptides, which can be directly used for probability-based evaluation in subsequent processing steps. Our experiments indicate that BPDA2d outperforms state-of-the-art detection methods on both simulated data and real liquid chromatography–mass spectrometry data, according to sensitivity and detection accuracy.
Availability: The BPDA2d software package is available at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3278754  PMID: 22155863
5.  Improved mean estimation and its application to diagonal discriminant analysis 
Bioinformatics  2011;28(4):531-537.
Motivation: High-dimensional data such as microarrays have created new challenges to traditional statistical methods. One such example is class prediction with high-dimensional, low-sample-size data. Due to the small sample size, the sample mean estimates are usually unreliable. As a consequence, the performance of the class prediction methods using the sample mean may also be unsatisfactory. To obtain more accurate estimation of parameters, some statistical methods, such as regularizations through shrinkage, are often desired.
Results: In this article, we investigate the family of shrinkage estimators for the mean value under the quadratic loss function. The optimal shrinkage parameter is proposed under the scenario when the sample size is fixed and the dimension is large. We then construct a shrinkage-based diagonal discriminant rule by replacing the sample mean by the proposed shrinkage mean. Finally, we demonstrate via simulation studies and real data analysis that the proposed shrinkage-based rule outperforms its original competitor in a wide range of settings.
PMCID: PMC3278755  PMID: 22171335
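The shrinkage idea in this abstract can be illustrated with a minimal sketch. It assumes a positive-part James-Stein-style factor that shrinks the sample mean toward zero; this factor is a stand-in chosen for illustration and is not the optimal shrinkage parameter derived in the article. The function names `shrink_mean` and `diagonal_discriminant` are hypothetical.

```python
def shrink_mean(xbar, var, n):
    """Positive-part James-Stein-style shrinkage of a sample mean toward 0.

    xbar: per-feature sample means; var: per-feature variances; n: sample size.
    NOTE: illustrative stand-in, not the article's optimal shrinkage parameter.
    """
    p = len(xbar)
    # Squared length of the standardized mean (the variance of a mean is var/n).
    norm2 = sum(n * m * m / v for m, v in zip(xbar, var))
    if norm2 == 0:
        return [0.0 for _ in xbar]
    factor = max(0.0, 1.0 - max(p - 2, 0) / norm2)
    return [factor * m for m in xbar]


def diagonal_discriminant(x, class_means, var):
    """Assign x to the class whose mean is nearest in variance-scaled distance.

    class_means maps a label to a (possibly shrunken) mean vector; the
    covariance is treated as diagonal, so features are scored independently.
    """
    def distance(mu):
        return sum((xi - mi) ** 2 / v for xi, mi, v in zip(x, mu, var))
    return min(class_means, key=lambda label: distance(class_means[label]))
```

In use, each class mean would be passed through `shrink_mean` before being handed to `diagonal_discriminant`, which mirrors the abstract's description of replacing the sample mean with a shrinkage mean.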
6.  The role of miRNAs in complex formation and control 
Bioinformatics  2011;28(4):453-456.
Summary: MicroRibonucleic acids (miRNAs) are small regulatory molecules that act by mRNA degradation or via translational repression. Although many miRNAs are ubiquitously expressed, a small subset have differential expression patterns that may give rise to tissue-specific complexes.
Motivation: This work studies gene targeting patterns amongst miRNAs with differential expression profiles, and links this to control and regulation of protein complexes.
Results: We find that, when a pair of miRNAs are not expressed in the same tissues, there is a higher tendency for them to target the direct partners of the same hub proteins. At the same time, they also avoid targeting the same set of hub-spokes. Moreover, the complexes corresponding to these hub-spokes tend to be specific and nonoverlapping. This suggests that the effect of miRNAs on the formation of complexes is specific.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3278756  PMID: 22180412
7.  Optimal structural inference of signaling pathways from unordered and overlapping gene sets 
Bioinformatics  2011;28(4):546-556.
Motivation: A plethora of bioinformatics analysis has led to the discovery of numerous gene sets, which can be interpreted as discrete measurements emitted from latent signaling pathways. Their potential to infer signaling pathway structures, however, has not been sufficiently exploited. Existing methods accommodating discrete data do not explicitly consider signal cascading mechanisms that characterize a signaling pathway. Novel computational methods are thus needed to fully utilize gene sets and broaden the scope from focusing only on pairwise interactions to the more general cascading events in the inference of signaling pathway structures.
Results: We propose a gene set based simulated annealing (SA) algorithm for the reconstruction of signaling pathway structures. A signaling pathway structure is a directed graph containing up to a few hundred nodes and many overlapping signal cascades, where each cascade represents a chain of molecular interactions from the cell surface to the nucleus. Gene sets in our context refer to discrete sets of genes participating in signal cascades, the basic building blocks of a signaling pathway, with no prior information about gene orderings in the cascades. From a compendium of gene sets related to a pathway, SA aims to search for signal cascades that characterize the optimal signaling pathway structure. In the search process, the extent of overlap among signal cascades is used to measure the optimality of a structure. Throughout, we treat gene sets as random samples from a first-order Markov chain model. We evaluated the performance of SA in three case studies. In the first study conducted on 83 KEGG pathways, SA demonstrated a significantly better performance than Bayesian network methods. Since both SA and Bayesian network methods accommodate discrete data, use a ‘search and score’ network learning strategy and output a directed network, they can be compared in terms of performance and computational time. In the second study, we compared SA and Bayesian network methods using four benchmark datasets from DREAM. In our final study, we showcased two context-specific signaling pathways activated in breast cancer.
Availability: Source codes are available from
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3278757  PMID: 22199386
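As a rough illustration of the search strategy described in this abstract, the sketch below runs simulated annealing over orderings of a single unordered gene set. It scores an ordering by its log-likelihood under a first-order Markov chain; the article instead measures optimality by the overlap among signal cascades, which is not reproduced here. The transition table `trans`, the swap move and the geometric cooling schedule are all assumptions made for the example.

```python
import math
import random


def chain_log_likelihood(order, trans, floor=1e-6):
    """Log-likelihood of a gene ordering under a first-order Markov chain.

    trans[(a, b)] is a hypothetical probability that gene a signals to gene b;
    pairs missing from the table get a small floor probability.
    """
    return sum(math.log(trans.get((a, b), floor))
               for a, b in zip(order, order[1:]))


def anneal_ordering(genes, trans, steps=2000, t0=1.0, cooling=0.995, seed=0):
    """Search orderings of an unordered gene set (two or more genes)."""
    rng = random.Random(seed)
    current = list(genes)
    cur_score = chain_log_likelihood(current, trans)
    best, best_score = current[:], cur_score
    temp = t0
    for _ in range(steps):
        i, j = rng.sample(range(len(current)), 2)  # propose swapping two genes
        current[i], current[j] = current[j], current[i]
        new_score = chain_log_likelihood(current, trans)
        delta = new_score - cur_score
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            cur_score = new_score
            if cur_score > best_score:
                best, best_score = current[:], cur_score
        else:
            current[i], current[j] = current[j], current[i]  # undo the swap
        temp *= cooling
    return best, best_score
```

The function returns the best ordering seen and its score, so the result is never worse than the starting ordering even when late moves are rejected.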
8.  CLARE: Cracking the LAnguage of Regulatory Elements 
Bioinformatics  2011;28(4):581-583.
Summary: CLARE is a computational method designed to reveal sequence encryption of tissue-specific regulatory elements. Starting with a set of regulatory elements known to be active in a particular tissue/process, it learns the sequence code of the input set and builds a predictive model from features specific to those elements. The resulting model can then be applied to user-supplied genomic regions to identify novel candidate regulatory elements. CLARE's model also provides a detailed analysis of transcription factors that most likely bind to the elements, making it an invaluable tool for understanding mechanisms of tissue-specific gene regulation.
Availability: CLARE is freely accessible at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3278760  PMID: 22199387
9.  ART: a next-generation sequencing read simulator 
Bioinformatics  2011;28(4):593-594.
Summary: ART is a set of simulation tools that generate synthetic next-generation sequencing reads. This functionality is essential for testing and benchmarking tools for next-generation sequencing data analysis including read alignment, de novo assembly and genetic variation discovery. ART generates simulated sequencing reads by emulating the sequencing process with built-in, technology-specific read error models and base quality value profiles parameterized empirically in large sequencing datasets. We currently support all three major commercial next-generation sequencing platforms: Roche's 454, Illumina's Solexa and Applied Biosystems' SOLiD. ART also allows the flexibility to use customized read error model parameters and quality profiles.
Availability: Both source and binary software packages are available at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3278762  PMID: 22199392
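The general idea of read simulation with a position-specific error model can be sketched as follows. This is a toy illustration only: it does not use ART's code, command-line interface or empirical profiles, and the function name `simulate_reads` and the `error_profile` values are hypothetical.

```python
import random


def simulate_reads(reference, n_reads, read_len, error_profile, seed=0):
    """Sample reads uniformly from a reference and add substitution errors.

    error_profile[i] is a hypothetical substitution probability at read
    position i. Indels and quality values are not modeled in this sketch.
    Returns a list of (start, read) tuples.
    """
    if len(error_profile) != read_len:
        raise ValueError("error_profile needs one entry per read position")
    if read_len > len(reference):
        raise ValueError("read_len exceeds reference length")
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(reference) - read_len + 1)
        bases = list(reference[start:start + read_len])
        for i, p_err in enumerate(error_profile):
            if rng.random() < p_err:
                # Replace the base with one of the three other nucleotides.
                bases[i] = rng.choice([b for b in "ACGT" if b != bases[i]])
        reads.append((start, "".join(bases)))
    return reads
```

Keeping the true start position alongside each read is what makes such output useful for benchmarking aligners, since the mapped position can be compared with the known origin.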
10.  VarSifter: Visualizing and analyzing exome-scale sequence variation data on a desktop computer 
Bioinformatics  2011;28(4):599-600.
Summary: VarSifter is a graphical software tool for desktop computers that allows investigators of varying computational skills to easily and quickly sort, filter, and sift through sequence variation data. A variety of filters and a custom query framework allow filtering based on any combination of sample and annotation information. By simplifying visualization and analyses of exome-scale sequence variation data, this program will help bring the power and promise of massively-parallel DNA sequencing to a broader group of researchers.
Availability and Implementation: VarSifter is written in Java, and is freely available in source and binary versions, along with a User Guide, at
Supplementary Information: Additional figures and methods available online at the journal's website.
PMCID: PMC3278764  PMID: 22210868
11.  RRBSMAP: a fast, accurate and user-friendly alignment tool for reduced representation bisulfite sequencing 
Bioinformatics  2011;28(3):430-432.
Summary: Reduced representation bisulfite sequencing (RRBS) is a powerful yet cost-efficient method for studying DNA methylation on a genomic scale. RRBS involves restriction-enzyme digestion, bisulfite conversion and size selection, resulting in DNA sequencing data that require special bioinformatic handling. Here, we describe RRBSMAP, a short-read alignment tool that is designed for handling RRBS data in a user-friendly and scalable way. RRBSMAP uses wildcard alignment, and avoids the need for any preprocessing or post-processing steps. We benchmarked RRBSMAP against a well-validated MAQ-based pipeline for RRBS read alignment and observed similar accuracy but much improved runtime performance, easier handling and better scaling to large sample sets. In summary, RRBSMAP removes bioinformatic hurdles and reduces the computational burden of large-scale epigenome association studies performed with RRBS.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3268241  PMID: 22155871
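Wildcard matching for bisulfite reads can be illustrated with a small sketch. Bisulfite treatment converts unmethylated cytosines so that they are read as T, so a reference C is allowed to pair with either C or T in the read. The code below is an exact-match, forward-strand toy with hypothetical function names; it is not RRBSMAP's implementation and ignores mismatches, indels and the reverse strand.

```python
def bisulfite_match(read, ref_window):
    """True if a bisulfite-treated read is compatible with a reference window.

    A reference C acts as a wildcard: it may appear in the read as C
    (methylated, protected) or as T (unmethylated, converted). All other
    reference bases must match the read exactly.
    """
    if len(read) != len(ref_window):
        return False
    return all(r == g or (g == "C" and r == "T")
               for r, g in zip(read, ref_window))


def find_hits(read, reference):
    """Return all start positions where the read matches under the wildcard rule."""
    k = len(read)
    return [i for i in range(len(reference) - k + 1)
            if bisulfite_match(read, reference[i:i + k])]
```

Because the wildcard is applied during comparison, neither the reads nor the reference need to be converted in a separate step, which is the point the abstract makes about avoiding preprocessing.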
12.  Metscape 2 bioinformatics tool for the analysis and visualization of metabolomics and gene expression data 
Bioinformatics  2011;28(3):373-380.
Motivation: Metabolomics is a rapidly evolving field that holds promise to provide insights into genotype–phenotype relationships in cancers, diabetes and other complex diseases. One of the major informatics challenges is providing tools that link metabolite data with other types of high-throughput molecular data (e.g. transcriptomics, proteomics), and incorporate prior knowledge of pathways and molecular interactions.
Results: We describe a new, substantially redesigned version of our tool Metscape that allows users to enter experimental data for metabolites, genes and pathways and display them in the context of relevant metabolic networks. Metscape 2 uses an internal relational database that integrates data from KEGG and EHMN databases. The new version of the tool allows users to identify enriched pathways from expression profiling data, build and analyze the networks of genes and metabolites, and visualize changes in the gene/metabolite data. We demonstrate the applications of Metscape to annotate molecular pathways for human and mouse metabolites implicated in the pathogenesis of sepsis-induced acute lung injury, for the analysis of gene expression and metabolite data from pancreatic ductal adenocarcinoma, and for identification of the candidate metabolites involved in cancer and inflammation.
Availability: Metscape is part of the National Institutes of Health-supported National Center for Integrative Biomedical Informatics (NCIBI) suite of tools, freely available at It can be downloaded from or installed via Cytoscape plugin manager.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3268237  PMID: 22135418
13.  SomaticSniper: identification of somatic point mutations in whole genome sequencing data 
Bioinformatics  2011;28(3):311-317.
Motivation: The sequencing of tumors and their matched normals is frequently used to study the genetic composition of cancer. Despite this fact, there remains a dearth of available software tools designed to compare sequences in pairs of samples and identify sites that are likely to be unique to one sample.
Results: In this article, we describe the mathematical basis of our SomaticSniper software for comparing tumor and normal pairs. We estimate its sensitivity and precision, and present several common sources of error resulting in miscalls.
Availability and implementation: Binaries are freely available for download at, implemented in C and supported on Linux and Mac OS X.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3268238  PMID: 22155872
14.  GenomeRunner: automating genome exploration 
Bioinformatics  2011;28(3):419-420.
Motivation: One of the challenges in interpreting high-throughput genomic studies such as genome-wide association, microarray or ChIP-seq studies is their open-ended nature. Once a set of experimentally identified regions is found to be statistically significant, at least two questions arise: (i) besides P-value, do any of these significant regions stand out in terms of biological implications? (ii) Does the set of significant regions, as a whole, have anything in common genome wide? These issues are difficult to address because of the growing number of annotated genomic features (e.g. single nucleotide polymorphisms, transcription factor binding sites, methylation peaks, etc.), and it is difficult to know a priori which features would be most fruitful to analyze. Our goal is to provide partial automation of this process to begin examining associations between experimental features and annotated genomic regions in a hypothesis-free, data-driven manner.
Results: We created GenomeRunner, a tool for automating annotation and enrichment of genomic features of interest (FOI) with annotated genomic features (GFs), in different organisms. Besides simple association of FOIs with known GFs, GenomeRunner tests whether the enriched FOIs, as a group, are statistically associated with a large and growing set of genomic features.
Availability: GenomeRunner setup files and source code are freely available at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3268239  PMID: 22155868
15.  Integrated annotation and analysis of genetic variants from next-generation sequencing studies with variant tools 
Bioinformatics  2011;28(3):421-422.
Motivation: Storing, annotating and analyzing variants from next-generation sequencing projects can be difficult due to the availability of a wide array of data formats, tools and annotation sources, as well as the sheer size of the data files. Useful tools, including the GATK, ANNOVAR and BEDTools can be integrated into custom pipelines for annotating and analyzing sequence variants. However, building flexible pipelines that support the tracking of variants alongside their samples, while enabling updated annotation and reanalyses, is not a simple task.
Results: We have developed variant tools, a flexible annotation and analysis toolset that greatly simplifies the storage, annotation and filtering of variants and the analysis of the underlying samples. variant tools can be used to manage and analyze genetic variants obtained from sequence alignments, and the command-line driven toolset could be used as a foundation for building more sophisticated analytical methods.
Availability and implementation: variant tools consists of two command-line driven programs vtools and vtools_report. It is freely available at, distributed under a GPL license.
PMCID: PMC3268240  PMID: 22138362
16.  M3: an improved SNP calling algorithm for Illumina BeadArray data 
Bioinformatics  2011;28(3):358-365.
Summary: Genotype calling from high-throughput platforms such as Illumina and Affymetrix is a critical step in data processing, so that accurate information on genetic variants can be obtained for phenotype–genotype association studies. A number of algorithms have been developed to infer genotypes from data generated through the Illumina BeadStation platform, including GenCall, GenoSNP, Illuminus and CRLMM. Most of these algorithms are built on population-based statistical models to genotype every SNP in turn, such as GenCall with the GenTrain clustering algorithm, and require a large reference population to perform well. These approaches may not work well for rare variants where only a small proportion of the individuals carry the variant. A fundamentally different approach, implemented in GenoSNP, adopts a single nucleotide polymorphism (SNP)-based model to infer genotypes of all the SNPs in one individual, making it an appealing alternative to call rare variants. However, compared to the population-based strategies, more SNPs in GenoSNP may fail the Hardy–Weinberg Equilibrium test. To take advantage of both strategies, we propose a two-stage SNP calling procedure, named the modified mixture model (M3), to improve call accuracy for both common and rare variants. The effectiveness of our approach is demonstrated through applications to genotype calling on a set of HapMap samples used for quality control purposes in a large case–control study of cocaine dependence. The increase in power with M3 is greater for rare variants than for common variants depending on the model.
Availability: M3 algorithm:
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3268244  PMID: 22155947
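The Hardy–Weinberg Equilibrium test mentioned in this abstract is commonly run as a chi-square goodness-of-fit test on the three genotype counts of a SNP. The sketch below shows that standard textbook form; whether the article used this version or an exact test is not stated in the abstract, and the function name is hypothetical.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic (1 degree of freedom) for Hardy-Weinberg equilibrium.

    n_aa, n_ab, n_bb: observed counts of the AA, AB and BB genotypes.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotype calls")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Skip classes with zero expectation (monomorphic SNPs).
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
```

For example, counts of 30, 40 and 30 give an allele frequency of 0.5, expected counts of 25, 50 and 25, and a statistic of 4.0, which exceeds the 5% critical value of about 3.84 for one degree of freedom.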
17.  MTBindingSim: simulate protein binding to microtubules 
Bioinformatics  2011;28(3):441-443.
Summary: Many protein–protein interactions are more complex than can be accounted for by 1:1 binding models. However, biochemists have few tools available to help them recognize and predict the behaviors of these more complicated systems, making it difficult to design experiments that distinguish between possible binding models. MTBindingSim provides researchers with an environment in which they can rapidly compare different models of binding for a given scenario. It is written specifically with microtubule polymers in mind, but many of its models apply equally well to any polymer or any protein–protein interaction. MTBindingSim can thus both help in training intuition about binding models and with experimental design.
Availability and implementation: MTBindingSim is implemented in MATLAB and runs either within MATLAB (on Windows, Mac or Linux) or as a binary without MATLAB (on Windows or Mac). The source code (licensed under the GNU General Public License) and binaries are freely available at
PMCID: PMC3268247  PMID: 22171336
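For reference, the simple 1:1 binding model that this abstract contrasts with more complex schemes has a closed-form solution. The sketch below computes the complex concentration from total concentrations and a dissociation constant; it is written in Python for illustration and is not taken from MTBindingSim, which is implemented in MATLAB. The function name is hypothetical.

```python
import math


def bound_one_to_one(a_total, b_total, kd):
    """Concentration of the AB complex for simple 1:1 binding, A + B <-> AB.

    Solves Kd = [A][B]/[AB] together with mass conservation for A and B.
    All three inputs must use the same concentration unit.
    """
    s = a_total + b_total + kd
    # Smaller root of the quadratic; the larger root exceeds the totals.
    return (s - math.sqrt(s * s - 4.0 * a_total * b_total)) / 2.0
```

With total concentrations of 1 each and a Kd of 0.5, the complex concentration is 0.5, leaving 0.5 of each partner free, which satisfies 0.5 × 0.5 / 0.5 = Kd.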
18.  Discovering transcription factor regulatory targets using gene expression and binding data 
Bioinformatics  2011;28(2):206-213.
Motivation: Identifying the target genes regulated by transcription factors (TFs) is the most basic step in understanding gene regulation. Recent advances in high-throughput sequencing technology, together with chromatin immunoprecipitation (ChIP), enable mapping TF binding sites genome wide, but it is not possible to infer function from binding alone. This is especially true in mammalian systems, where regulation often occurs through long-range enhancers in gene-rich neighborhoods, rather than proximal promoters, preventing straightforward assignment of a binding site to a target gene.
Results: We present EMBER (Expectation Maximization of Binding and Expression pRofiles), a method that integrates high-throughput binding data (e.g. ChIP-chip or ChIP-seq) with gene expression data (e.g. DNA microarray) via an unsupervised machine learning algorithm for inferring the gene targets of sets of TF binding sites. Genes selected are those that match overrepresented expression patterns, which can be used to provide information about multiple TF regulatory modes. We apply the method to genome-wide human breast cancer data and demonstrate that EMBER confirms a role for the TFs estrogen receptor alpha, retinoic acid receptors alpha and gamma in breast cancer development, whereas the conventional approach of assigning regulatory targets based on proximity does not. Additionally, we compare several predicted target genes from EMBER to interactions inferred previously, examine combinatorial effects of TFs on gene regulation and illustrate the ability of EMBER to discover multiple modes of regulation.
Availability: All code used for this work is available at
Supplementary Information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3259433  PMID: 22084256
19.  Identifying quantitative trait loci via group-sparse multitask regression and feature selection: an imaging genetics study of the ADNI cohort 
Bioinformatics  2011;28(2):229-237.
Motivation: Recent advances in high-throughput genotyping and brain imaging techniques enable new approaches to study the influence of genetic variation on brain structures and functions. Traditional association studies typically employ independent and pairwise univariate analysis, which treats single nucleotide polymorphisms (SNPs) and quantitative traits (QTs) as isolated units and ignores important underlying interacting relationships between the units. New methods are proposed here to overcome this limitation.
Results: Taking into account the interlinked structure within and between SNPs and imaging QTs, we propose a novel Group-Sparse Multi-task Regression and Feature Selection (G-SMuRFS) method to identify quantitative trait loci for multiple disease-relevant QTs and apply it to a study in mild cognitive impairment and Alzheimer's disease. Built upon regression analysis, our model uses a new form of regularization, group ℓ2,1-norm (G2,1-norm), to incorporate the biological group structures among SNPs induced from their genetic arrangement. The new G2,1-norm considers the regression coefficients of all the SNPs in each group with respect to all the QTs together and enforces sparsity at the group level. In addition, an ℓ2,1-norm regularization is utilized to couple feature selection across multiple tasks to make use of the shared underlying mechanism among different brain regions. The effectiveness of the proposed method is demonstrated by both clearly improved prediction performance in empirical evaluations and a compact set of selected SNP predictors relevant to the imaging QTs.
Availability: Software is publicly available at:
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3259438  PMID: 22155867
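The two regularizers named in this abstract can be made concrete with a short sketch. It assumes the coefficient matrix `W` has one row per SNP and one column per QT, and that SNP groups are given as lists of row indices; the grouping used in any example is made up. Only the penalty terms are computed here, not the full G-SMuRFS optimization.

```python
import math


def l21_norm(W):
    """l2,1-norm: sum over rows (SNPs) of each row's Euclidean norm across QTs."""
    return sum(math.sqrt(sum(w * w for w in row)) for row in W)


def group_l21_norm(W, groups):
    """Group l2,1-norm: sum over SNP groups of the norm of each group's block.

    groups: list of lists of row indices. All coefficients of all SNPs in a
    group, across all QTs, are pooled under one square root, so the penalty
    pushes whole groups toward zero together.
    """
    return sum(math.sqrt(sum(w * w for i in idx for w in W[i]))
               for idx in groups)
```

Pooling a whole group under one square root is what produces sparsity at the group level, while the row-wise penalty selects individual SNPs jointly across all tasks.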
20.  FTSite: high accuracy detection of ligand binding sites on unbound protein structures 
Bioinformatics  2011;28(2):286-287.
Motivation: Binding site identification is a classical problem that is important for a range of applications, including the structure-based prediction of function, the elucidation of functional relationships among proteins, protein engineering and drug design. We describe an accurate method of binding site identification, namely FTSite. This method is based on experimental evidence that ligand binding sites also bind small organic molecules of various shapes and polarity. The FTSite algorithm does not rely on any evolutionary or statistical information, but achieves near experimental accuracy: it is capable of identifying the binding sites in over 94% of apo proteins from established test sets that have been used to evaluate many other binding site prediction methods.
Availability: FTSite is freely available as a web-based server at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3259439  PMID: 22113084
21.  Sensitive and fast mapping of di-base encoded reads 
Bioinformatics  2011;28(1):150.
PMCID: PMC3276229
22.  QuRe: software for viral quasispecies reconstruction from next-generation sequencing data 
Bioinformatics  2011;28(1):132-133.
Summary: Next-generation sequencing (NGS) is an ideal framework for the characterization of highly variable pathogens, with a deep resolution able to capture minority variants. However, the reconstruction of all variants of a viral population infecting a host is a challenging task for genome regions larger than the average NGS read length. QuRe is a program for viral quasispecies reconstruction, specifically developed to analyze long read (>100 bp) NGS data. The software performs alignments of sequence fragments against a reference genome, finds an optimal division of the genome into sliding windows based on coverage and diversity and attempts to reconstruct all the individual sequences of the viral quasispecies—along with their prevalence—using a heuristic algorithm, which matches multinomial distributions of distinct viral variants overlapping across the genome division. QuRe comes with a built-in Poisson error correction method and a post-reconstruction probabilistic clustering, both parameterized on given error rates in homopolymeric and non-homopolymeric regions.
Availability: QuRe is platform-independent, multi-threaded software implemented in Java. It is distributed under the GNU General Public License, available at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3244773  PMID: 22088846
23.  Graph accordance of next-generation sequence assemblies 
Bioinformatics  2011;28(1):13-16.
Motivation: No individual assembly algorithm addresses all the known limitations of assembling short-length sequences. Overall reduced sequence contig length is the major problem that challenges the usage of these assemblies. We describe an algorithm to take advantage of different assembly algorithms or sequencing platforms to improve the quality of next-generation sequence (NGS) assemblies.
Results: The algorithm is implemented as a graph accordance assembly (GAA) program. The algorithm constructs an accordance graph to capture the mapping information between the target and query assemblies. Based on the accordance graph, the contigs or scaffolds of the target assembly can be extended, merged or bridged together. Extra constraints, including gap sizes, mate pairs, scaffold order and orientation, are explored to enforce those accordance operations in the correct context. We applied GAA to various chicken NGS assemblies and the results demonstrate improved contiguity statistics and higher genome and gene coverage.
Availability: GAA is implemented in OO perl and is available here:
PMCID: PMC3244760  PMID: 22025481
24.  Detecting genome-wide epistases based on the clustering of relatively frequent items 
Bioinformatics  2011;28(1):5-12.
Motivation: In genome-wide association studies (GWAS), up to millions of single nucleotide polymorphisms (SNPs) are genotyped for thousands of individuals. However, conventional single locus-based approaches are usually unable to detect gene–gene interactions underlying complex diseases. Due to the huge search space for complicated high order interactions, many existing multi-locus approaches are slow and may suffer from low detection power for GWAS.
Results: In this article, we develop a simple, fast and effective algorithm to detect genome-wide multi-locus epistatic interactions based on the clustering of relatively frequent items. Extensive experiments on simulated data show that our algorithm is fast and more powerful in general than some recently proposed methods. On a real genome-wide case–control dataset for age-related macular degeneration (AMD), the algorithm has identified genotype combinations that are significantly enriched in the cases.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3244765  PMID: 22053078
25.  Detecting differential binding of transcription factors with ChIP-seq 
Bioinformatics  2011;28(1):121-122.
Summary: An increasing number of ChIP-seq experiments are investigating transcription factor binding under multiple experimental conditions, for example, various treatment conditions, several distinct time points and different treatment dosage levels. Hence, identifying differential binding sites across multiple conditions is of practical importance in biological and medical research. To this end, we have developed a powerful and flexible program, called DBChIP, to detect differentially bound sharp binding sites across multiple conditions, with or without matching control samples. By assigning an uncertainty measure to the putative differential binding sites, DBChIP facilitates downstream analysis. DBChIP is implemented in the R programming language and can work with a wide range of sequencing file formats.
Availability: R package DBChIP is available at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3244766  PMID: 22057161