Results 1-25 (487)

1.  Comment on ‘MeSH-up: effective MeSH text classification for improved document retrieval’ 
Bioinformatics  2009;25(20):2770-2771.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2765257  PMID: 19671694
2.  The impact of incomplete knowledge on evaluation: an experimental benchmark for protein function prediction 
Bioinformatics  2009;25(18):2404-2410.
Motivation: Rapidly expanding repositories of highly informative genomic data have generated increasing interest in methods for protein function prediction and inference of biological networks. The successful application of supervised machine learning to these tasks requires a gold standard for protein function: a trusted set of correct examples, which can be used to assess performance through cross-validation or other statistical approaches. Since gene annotation is incomplete for even the best studied model organisms, the biological reliability of such evaluations may be called into question.
Results: We address this concern by constructing and analyzing an experimentally based gold standard through comprehensive validation of protein function predictions for mitochondrion biogenesis in Saccharomyces cerevisiae. Specifically, we determine that (i) current machine learning approaches are able to generalize and predict novel biology from an incomplete gold standard and (ii) incomplete functional annotations adversely affect the evaluation of machine learning performance. While computational approaches performed better than predicted in the face of incomplete data, relative comparison of competing approaches—even those employing the same training data—is problematic with a sparse gold standard. Incomplete knowledge causes individual methods' performances to be differentially underestimated, resulting in misleading performance evaluations. We provide a benchmark gold standard for yeast mitochondria to complement current databases and an analysis of our experimental results in the hopes of mitigating these effects in future comparative evaluations.
Availability: The mitochondrial benchmark gold standard, as well as experimental results and additional data, is available at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2735660  PMID: 19561015
3.  HAPLOWSER: a whole-genome haplotype browser for personal genome and metagenome 
Bioinformatics  2009;25(18):2430-2431.
Summary: Haplotype assembly is becoming a very important tool in genome sequencing of human and other organisms. Although haplotypes were previously inferred from genome assemblies, there has never been a comparative haplotype browser that depicts a global picture of whole-genome alignments among haplotypes of different organisms. We introduce a whole-genome HAPLotype brOWSER (HAPLOWSER), providing evolutionary perspectives from multiple aligned haplotypes and functional annotations. Haplowser enables the comparison of haplotypes from metagenomes, and associates conserved regions or the bases at the conserved regions with functional annotations and custom tracks. The associations are quantified for further analysis and presented as pie charts. Functional annotations and custom tracks that are projected onto haplotypes are saved as multiple files in FASTA format. Haplowser provides a user-friendly interface, and can display alignments of haplotypes with functional annotations at any resolution.
Availability: Haplowser, written in Java, supports multiple platforms including Windows and Linux. Haplowser is publicly available at
Supplementary information: Supplementary data are available at
PMCID: PMC2735662  PMID: 19561337
4.  Antimony: a modular model definition language 
Bioinformatics  2009;25(18):2452-2454.
Motivation: Model exchange in systems and synthetic biology has been standardized for computers with the Systems Biology Markup Language (SBML) and CellML, but specialized software is needed for the generation of models in these formats. Text-based model definition languages allow researchers to create models simply, and then export them to a common exchange format. Modular languages allow researchers to create and combine complex models more easily. We saw a use for a modular text-based language, together with a translation library to allow other programs to read the models as well.
Summary: The Antimony language provides a way for a researcher to use simple text statements to create, import, and combine biological models, allowing complex models to be built from simpler models, and provides a special syntax for the creation of modular genetic networks. The libAntimony library allows other software packages to import these models and convert them either to SBML or their own internal format.
Availability: The Antimony language specification and the libAntimony library are available under a BSD license from
PMCID: PMC2735663  PMID: 19578039
5.  Unite and conquer: univariate and multivariate approaches for finding differentially expressed gene sets 
Bioinformatics  2009;25(18):2348-2354.
Motivation: Recently, many univariate and several multivariate approaches have been suggested for testing differential expression of gene sets between different phenotypes. However, despite a wealth of literature studying their performance on simulated and real biological data, there is still a need to quantify their relative performance when they test different null hypotheses.
Results: In this article, we compare the performance of univariate and multivariate tests on both simulated and biological data. In the simulation study we demonstrate that high correlations equally affect the power of both univariate and multivariate tests. In addition, for most of them the power is similarly affected by the dimensionality of the gene set and by the percentage of genes in the set whose expression changes between the two phenotypes. The application of different test statistics to biological data reveals that three statistics (sum of squared t-tests, Hotelling's T2, N-statistic), which test different null hypotheses, find some common but also some complementary differentially expressed gene sets under specific settings. This demonstrates that, owing to their complementary null hypotheses, each test highlights different aspects of the data, and for the analysis of biological data it is beneficial to use all three tests simultaneously instead of focusing exclusively on just one. (A toy sketch of one of these statistics follows this entry.)
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2735665  PMID: 19574285
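The following is a minimal, illustrative sketch (not the authors' implementation) of one of the three statistics above, the sum of squared gene-wise t-statistics, with significance assessed by permuting phenotype labels; the data, gene-set size and permutation count are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

def sum_sq_t(expr, labels):
    """Sum of squared two-sample t-statistics over the genes in a set.
    expr: genes x samples matrix restricted to the gene set;
    labels: boolean array marking samples of phenotype A."""
    t = stats.ttest_ind(expr[:, labels], expr[:, ~labels], axis=1).statistic
    return float(np.sum(t ** 2))

def permutation_p(expr, labels, n_perm=1000, seed=0):
    """Permutation p-value obtained by shuffling the phenotype labels."""
    rng = np.random.default_rng(seed)
    observed = sum_sq_t(expr, labels)
    null = [sum_sq_t(expr, rng.permutation(labels)) for _ in range(n_perm)]
    return (1 + sum(n >= observed for n in null)) / (1 + n_perm)

# toy gene set: 20 genes, 10 + 10 samples, 5 genes shifted in phenotype A
rng = np.random.default_rng(1)
expr = rng.normal(size=(20, 20))
expr[:5, :10] += 1.5
labels = np.array([True] * 10 + [False] * 10)
print(permutation_p(expr, labels))
```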
6.  Cross-scale, cross-pathway evaluation using an agent-based non-small cell lung cancer model 
Bioinformatics  2009;25(18):2389-2396.
We present a multiscale agent-based non-small cell lung cancer model that consists of a 3D environment with which cancer cells interact while processing phenotypic changes. At the molecular level, transforming growth factor β (TGFβ) has been integrated into our previously developed in silico model as a second extrinsic input in addition to epidermal growth factor (EGF). The main aim of this study is to investigate how the effects of individual and combinatorial change in EGF and TGFβ concentrations at the molecular level alter tumor growth dynamics on the multi-cellular level, specifically tumor volume and expansion rate. Our simulation results show that separate EGF and TGFβ fluctuations trigger competing multi-cellular phenotypes, yet synchronous EGF and TGFβ signaling yields a spatially more aggressive tumor that overall exhibits an EGF-driven phenotype. By altering EGF and TGFβ concentration levels simultaneously and asynchronously, we discovered a particular region of EGF-TGFβ profiles that ensures phenotypic stability of the tumor system. Within this region, concentration changes in EGF and TGFβ do not impact the resulting multi-cellular response substantially, while outside these concentration ranges, a change at the molecular level will substantially alter either tumor volume or tumor expansion rate, or both. By evaluating tumor growth dynamics across different scales, we show that, under certain conditions, therapeutic targeting of only one signaling pathway may be insufficient. Potential implications of these in silico results for future clinico-pharmacological applications are discussed.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2735669  PMID: 19578172
7.  WebArrayDB: cross-platform microarray data analysis and public data repository 
Bioinformatics  2009;25(18):2425-2429.
Motivation: Cross-platform microarray analysis is an increasingly important research tool, but researchers still lack open source tools for storing, integrating and analyzing large amounts of microarray data obtained from different array platforms.
Results: An open source integrated microarray database and analysis suite, WebArrayDB (, has been developed that features convenient uploading of data for storage in a MIAME (Minimal Information about a Microarray Experiment) compliant fashion, and allows data to be mined with a large variety of R-based tools, including data analysis across multiple platforms. Different methods for probe alignment, normalization and statistical analysis are included to account for systematic bias. Student's t-test, moderated t-tests, non-parametric tests and analysis of variance or covariance (ANOVA/ANCOVA) are among the choices of algorithms for differential analysis of data. Users also have the flexibility to define new factors and create new analysis models to fit complex experimental designs. All data can be queried or browsed through a web browser. The computations can be performed in parallel on symmetric multiprocessing (SMP) systems or Linux clusters.
Availability: The software package is available for use on a public web server ( or can be downloaded.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2735672  PMID: 19602526
8.  A genetic programming approach for Burkholderia Pseudomallei diagnostic pattern discovery 
Bioinformatics  2009;25(17):2256-2262.
Motivation: Finding biomarker-based diagnostic patterns for infections such as those caused by Burkholderia pseudomallei involves two key issues. First, exhaustively evaluating all subsets of testable biomarkers (antigens in this context) to find the best one is computationally infeasible; a proper optimization approach, such as evolutionary computation, should therefore be investigated. Second, the function of the antigens that serves as the diagnostic pattern is commonly unknown, yet its choice is key to diagnostic accuracy and to effectiveness in clinical use.
Results: A conversion function is proposed to convert serum tests of antigens on patients into binary values, from which Boolean functions serving as diagnostic patterns are developed. A genetic programming approach is designed to optimize the diagnostic patterns in terms of their accuracy and effectiveness. The optimization aims to maximize coverage (the rate of positive response to antigens) in infected patients and minimize coverage in non-infected patients, while using as few testable antigens in the Boolean functions as possible. The final coverage in infected patients is 96.55%, using 17 of 215 (7.4%) antigens, with zero coverage in non-infected patients. Among these 17 antigens, BPSL2697 is the most frequently selected for the diagnosis of Burkholderia pseudomallei infection. The approach has been evaluated using both cross-validation and jackknife simulation, with prediction accuracies of 93% and 92%, respectively. A novel approach is also proposed in this study for evaluating a model with binary data using ROC analysis. (A toy sketch of coverage-based scoring of a candidate antigen panel follows this entry.)
PMCID: PMC2734322  PMID: 19561021
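Below is a minimal sketch of the scoring step described above: serum readings are thresholded to binary calls, and a candidate Boolean (OR-type) antigen panel is scored by its coverage in infected versus non-infected samples. The threshold, the size penalty and the OR-only pattern are illustrative assumptions; the genetic-programming search itself is not shown.

```python
import numpy as np

def binarize(serum, threshold):
    """Convert continuous antigen readings (samples x antigens) to 0/1 calls
    with a cutoff, a stand-in for the paper's conversion function."""
    return (serum >= threshold).astype(int)

def coverage(calls, panel):
    """Fraction of samples positive for at least one antigen in the panel
    (an OR-type Boolean pattern)."""
    return calls[:, panel].any(axis=1).mean()

def fitness(calls_pos, calls_neg, panel, size_penalty=0.01):
    """Reward coverage in infected samples, penalize coverage in
    non-infected samples and the number of antigens used."""
    return (coverage(calls_pos, panel)
            - coverage(calls_neg, panel)
            - size_penalty * len(panel))

# toy data: 30 infected and 30 non-infected sera, 50 candidate antigens
rng = np.random.default_rng(0)
infected = binarize(rng.normal(1.0, 1.0, (30, 50)), threshold=1.2)
controls = binarize(rng.normal(0.0, 1.0, (30, 50)), threshold=1.2)
print(fitness(infected, controls, panel=[3, 17, 29]))
```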
9.  VarScan: variant detection in massively parallel sequencing of individual and pooled samples 
Bioinformatics  2009;25(17):2283-2285.
Summary: Massively parallel sequencing technologies hold incredible promise for the study of DNA sequence variation, particularly the identification of variants affecting human disease. The unprecedented throughput and relatively short read lengths of Roche/454, Illumina/Solexa, and other platforms have spurred development of a new generation of sequence alignment algorithms. Yet detection of sequence variants based on short read alignments remains challenging, and most currently available tools are limited to a single platform or aligner type. We present VarScan, an open source tool for variant detection that is compatible with several short read aligners. We demonstrate VarScan's ability to detect SNPs and indels with high sensitivity and specificity, in both Roche/454 sequencing of individuals and deep Illumina/Solexa sequencing of pooled samples. (A toy sketch of read-count thresholding at a single position follows this entry.)
Availability and Implementation: Source code and documentation freely available at implemented as a Perl package and supported on Linux/UNIX, MS Windows and Mac OSX.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2734323  PMID: 19542151
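VarScan itself is a Perl tool with its own heuristics; the snippet below is only a conceptual Python sketch of the kind of read-count thresholding involved in calling a variant at a single position (minimum coverage, minimum supporting reads, minimum allele frequency). All thresholds are illustrative assumptions, not VarScan's defaults.

```python
from collections import Counter

def call_variant(ref_base, observed_bases, min_coverage=10,
                 min_reads=3, min_freq=0.2):
    """Naive single-position call from the read bases aligned over a site:
    require minimum coverage, minimum supporting reads and minimum allele
    frequency for the most common non-reference base."""
    coverage = len(observed_bases)
    if coverage < min_coverage:
        return None
    counts = Counter(b for b in observed_bases if b != ref_base)
    if not counts:
        return None
    alt, support = counts.most_common(1)[0]
    freq = support / coverage
    if support >= min_reads and freq >= min_freq:
        return {"alt": alt, "reads": support, "freq": round(freq, 3)}
    return None

print(call_variant("A", "AAAAAAGGGGAAAAG"))   # G supported by 5/15 reads -> called
```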
10.  PROMISE: a tool to identify genomic features with a specific biologically interesting pattern of associations with multiple endpoint variables 
Bioinformatics  2009;25(16):2013-2019.
Motivation: In some applications, prior biological knowledge can be used to define a specific pattern of association of multiple endpoint variables with a genomic variable that is biologically most interesting. However, to our knowledge, there is no statistical procedure designed to detect specific patterns of association with multiple endpoint variables.
Results: Projection onto the most interesting statistical evidence (PROMISE) is proposed as a general procedure to identify genomic variables that exhibit a specific biologically interesting pattern of association with multiple endpoint variables. Biological knowledge of the endpoint variables is used to define a vector that represents the biologically most interesting values for statistics that characterize the associations of the endpoint variables with a genomic variable. A test statistic is defined as the dot-product of the vector of the observed association statistics and the vector of the most interesting values of the association statistics. By definition, this test statistic is proportional to the length of the projection of the observed vector of correlations onto the vector of most interesting associations. Statistical significance is determined via permutation. In simulation studies and an example application, PROMISE shows greater statistical power to identify genes with the interesting pattern of associations than classical multivariate procedures, individual endpoint analyses or listing genes that have the pattern of interest and are significant in more than one individual endpoint analysis. (A minimal numerical sketch of this projection statistic follows this entry.)
Availability: Documented R routines are freely available from and will soon be available as a Bioconductor package from
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2723006  PMID: 19528086
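A minimal numerical sketch of the projection idea described above, assuming Pearson correlation as the per-endpoint association statistic and permutation of the genomic variable for significance; the pattern vector, data and permutation count are illustrative, and this is not the authors' R implementation.

```python
import numpy as np

def promise_stat(genomic, endpoints, pattern):
    """Dot product of the observed association statistics (here Pearson
    correlations of one genomic variable with each endpoint) with the
    pre-specified, biologically most interesting pattern vector."""
    assoc = np.array([np.corrcoef(genomic, e)[0, 1] for e in endpoints])
    return float(assoc @ np.asarray(pattern))

def promise_p(genomic, endpoints, pattern, n_perm=2000, seed=0):
    """Permutation p-value: shuffling the genomic variable across subjects
    breaks its association with every endpoint simultaneously."""
    rng = np.random.default_rng(seed)
    obs = promise_stat(genomic, endpoints, pattern)
    null = [promise_stat(rng.permutation(genomic), endpoints, pattern)
            for _ in range(n_perm)]
    return (1 + sum(n >= obs for n in null)) / (1 + n_perm)

# toy example: expression tied positively to endpoint 1, negatively to endpoint 2
rng = np.random.default_rng(2)
expr = rng.normal(size=100)
endpoints = [expr + rng.normal(size=100), -expr + rng.normal(size=100)]
print(promise_p(expr, endpoints, pattern=[1, -1]))
```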
11.  A statistical framework for protein quantitation in bottom-up MS-based proteomics 
Bioinformatics  2009;25(16):2028-2034.
Motivation: Quantitative mass spectrometry-based proteomics requires protein-level estimates and associated confidence measures. Challenges include the presence of low quality or incorrectly identified peptides and informative missingness. Furthermore, models are required for rolling peptide-level information up to the protein level.
Results: We present a statistical model that carefully accounts for informative missingness in peak intensities and allows unbiased, model-based, protein-level estimation and inference. The model is applicable to both label-based and label-free quantitation experiments. We also provide automated, model-based, algorithms for filtering of proteins and peptides as well as imputation of missing values. Two LC/MS datasets are used to illustrate the methods. In simulation studies, our methods are shown to achieve substantially more discoveries than standard alternatives.
Availability: The software has been made available in the open-source proteomics platform DAnTE (
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2723007  PMID: 19535538
12.  Copy number variation has little impact on bead-array-based measures of DNA methylation 
Bioinformatics  2009;25(16):1999-2005.
Motivation: Integration of various genome-scale measures of molecular alterations is of great interest to researchers aiming to better define disease processes or identify novel targets with clinical utility. Particularly important in cancer are measures of gene copy number and DNA methylation. However, copy number variation may bias the measurement of DNA methylation. To investigate possible bias, we analyzed integrated data obtained from 19 head and neck squamous cell carcinoma (HNSCC) tumors and 23 mesothelioma tumors.
Results: Statistical analysis of the observational data produced results consistent with those anticipated from theoretical mathematical properties. The average beta value reported by Illumina GoldenGate (a bead-array platform) was significantly smaller than a similar measure constructed from the ratio of average dye intensities. Among CpGs with only small variation in measured methylation across tumors (filtering out clearly biological methylation signatures), there were no systematic copy number effects on methylation for three copies or for more than four copies; however, a single copy led to small systematic negative effects, and zero copies led to substantial and significant negative effects. (A small numerical sketch of the beta-value measure follows this entry.)
Conclusions: Since mathematical considerations suggest little bias in methylation assayed using bead arrays, the consistency of the observational data with the anticipated properties likewise suggests little bias. However, further analysis of systematic copy number effects across CpGs suggests that, although there may be little bias in the presence of copy number gains, small biases may result when one allele is lost and substantial biases when both alleles are lost. These results suggest that further integration of these measures can be useful for characterizing the biological relationships between these somatic events.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2723008  PMID: 19542153
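As a small numerical illustration of the measures discussed above, the sketch below computes per-CpG beta values as methylated intensity over total intensity (real platforms add a small offset to the denominator, omitted here), contrasts the average of per-CpG betas with a measure built from the ratio of average intensities, and shows that rescaling both dye channels together, as an idealized uniform copy number change would, leaves beta unchanged. The intensity distributions are arbitrary toy choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n_cpg = 1000
meth = rng.gamma(2.0, 400.0, n_cpg)      # methylated dye intensities (toy)
unmeth = rng.gamma(2.0, 600.0, n_cpg)    # unmethylated dye intensities (toy)

# per-CpG beta value, then averaged ("average of ratios")
beta = meth / (meth + unmeth)
avg_beta = beta.mean()

# similar measure built from the ratio of average intensities ("ratio of averages")
ratio_of_means = meth.mean() / (meth.mean() + unmeth.mean())

# halving both channels (as an idealized uniform copy-number loss would)
# cancels out of the ratio, so beta is unchanged
beta_half_signal = (0.5 * meth) / (0.5 * meth + 0.5 * unmeth)

print(round(float(avg_beta), 3), round(float(ratio_of_means), 3),
      bool(np.allclose(beta, beta_half_signal)))
```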
13.  Comments on the analysis of unbalanced microarray data 
Bioinformatics  2009;25(16):2035-2041.
Motivation: Permutation testing is very popular for analyzing microarray data to identify differentially expressed (DE) genes; estimating false discovery rates (FDRs) is a very popular way to address the inherent multiple testing problem. However, combining these approaches may be problematic when sample sizes are unequal.
Results: With unbalanced data, permutation tests may not be suitable because they do not test the hypothesis of interest. In addition, permutation tests can be biased, and using biased P-values to estimate the FDR can produce unacceptable bias in those estimates. Results also show that the approach of pooling permutation null distributions across genes can produce invalid P-values, since even non-DE genes can have different permutation null distributions. We encourage researchers to use statistics that have been shown to reliably discriminate DE genes, but caution that the associated P-values may be either invalid or a less effective metric for discriminating DE genes. (A toy sketch contrasting per-gene and pooled permutation nulls follows this entry.)
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2732368  PMID: 19528084
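The sketch below is only meant to make the pooling issue above concrete: two non-differentially-expressed genes with different variances are given permutation nulls of a plain difference-of-means statistic under an unbalanced design, and p-values computed against each gene's own null are compared with p-values computed against the nulls pooled across genes. The statistic, design and data are illustrative assumptions; the example does not reproduce the paper's analysis.

```python
import numpy as np

def perm_null(x, y, n_perm, rng):
    """Permutation null of the difference-of-means statistic for one gene,
    obtained by reshuffling the (unbalanced) group labels."""
    data, n_x = np.concatenate([x, y]), len(x)
    null = np.empty(n_perm)
    for i in range(n_perm):
        p = rng.permutation(data)
        null[i] = p[:n_x].mean() - p[n_x:].mean()
    return null

rng = np.random.default_rng(4)
n_x, n_y = 4, 20                                   # unbalanced design
genes = [rng.normal(0, s, n_x + n_y) for s in (1.0, 4.0)]  # non-DE genes, unequal variance

nulls = [perm_null(g[:n_x], g[n_x:], 2000, rng) for g in genes]
obs = [g[:n_x].mean() - g[n_x:].mean() for g in genes]

own_p = [float(np.mean(np.abs(nulls[i]) >= abs(obs[i]))) for i in range(2)]
pooled = np.concatenate(nulls)                     # pooled null across the two genes
pooled_p = [float(np.mean(np.abs(pooled) >= abs(obs[i]))) for i in range(2)]
print("p-values against each gene's own null:", own_p)
print("p-values against the pooled null:     ", pooled_p)
```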
14.  apLCMS—adaptive processing of high-resolution LC/MS data 
Bioinformatics  2009;25(15):1930-1936.
Motivation: Liquid chromatography-mass spectrometry (LC/MS) profiling is a promising approach for the quantification of metabolites from complex biological samples. Significant challenges exist in the analysis of LC/MS data, including noise reduction, feature identification/quantification, feature alignment and computational efficiency.
Results: Here we present a set of algorithms for the processing of high-resolution LC/MS data. The major technical improvements include adaptive tolerance-level searching rather than hard cutoffs or binning, the use of non-parametric methods to fine-tune intensity grouping, the use of a run filter to better preserve weak signals, and model-based estimation of peak intensities for absolute quantification. The algorithms are implemented in an R package, apLCMS, which can efficiently process large LC/MS datasets.
Availability: The R package apLCMS is available at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2712336  PMID: 19414529
15.  Integrative analysis of transcriptomic and proteomic data of Desulfovibrio vulgaris: a non-linear model to predict abundance of undetected proteins 
Bioinformatics  2009;25(15):1905-1914.
Motivation: Gene expression profiling technologies can generally produce mRNA abundance data for all genes in a genome. A dearth of proteomic data persists because the identification range and sensitivity of proteomic measurements lag behind those of transcriptomic measurements. With only partial proteomic data, integrative transcriptomic and proteomic analysis is likely to introduce significant bias. Developing methodologies to accurately estimate missing proteomic data will allow better integration of transcriptomic and proteomic datasets and provide deeper insight into the metabolic mechanisms underlying complex biological systems.
Results: In this study, we present a non-linear data-driven model to predict abundance for undetected proteins using two independent datasets of cognate transcriptomic and proteomic data collected from Desulfovibrio vulgaris. We use stochastic gradient boosted trees (GBT) to uncover possible non-linear relationships between transcriptomic and proteomic data, and to predict protein abundance for the proteins not experimentally detected, based on relevant predictors such as mRNA abundance, cellular role, molecular weight, sequence length, protein length, guanine-cytosine (GC) content and triple codon counts. Initially, we constructed a GBT model using all possible variables to assess their relative importance and to characterize the behavior of the predictive model. A strong plateau effect occurred in this model in regions of high mRNA values and sparse data. Hence, we removed genes in those areas based on thresholds estimated from the partial dependency plots where this behavior was captured. At this stage, only the strongest predictors of protein abundance were retained to reduce the complexity of the GBT model. After removing genes in the plateau region, mRNA abundance, the main cellular functional categories and a few triple codon counts emerged as the top-ranked predictors of protein abundance. We then created a new, tuned GBT model using the five most significant predictors. Our non-linear model consists of a set of serial regression tree models with implicit strength in variable selection, and it provides measures of relative variable importance using mean squared error as the criterion. The coefficients of determination for our non-linear models ranged from 0.393 to 0.582 across both datasets, better than the linear regression used in the past. We evaluated the validity of this non-linear model using biological information on operons, regulons and pathways, and the results demonstrated that the coefficients of variation of estimated protein abundance values within operons, regulons or pathways are indeed smaller than those for random groups of proteins. (A minimal gradient-boosting sketch follows this entry.)
Supplementary Information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2712339  PMID: 19447782
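The snippet below is a minimal sketch of the modeling step described above: fit a gradient boosted tree regressor to predict protein abundance from a few predictors and read off relative variable importances. It uses scikit-learn and synthetic data; the predictors, their distributions and the hyperparameters are assumptions for illustration, not the authors' GBT configuration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# synthetic stand-ins for a few of the predictors named above and for protein abundance
rng = np.random.default_rng(5)
n = 800
X = np.column_stack([
    rng.lognormal(2.0, 1.0, n),    # mRNA abundance (arbitrary units)
    rng.uniform(0.3, 0.7, n),      # GC content
    rng.integers(100, 3000, n),    # sequence length
])
y = 0.6 * np.log(X[:, 0]) + 2.0 * X[:, 1] + rng.normal(0, 0.3, n)  # toy relationship

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
gbt = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05,
                                max_depth=3, random_state=0)
gbt.fit(X_train, y_train)

print("held-out R^2:", round(gbt.score(X_test, y_test), 3))
print("relative importances (mRNA, GC, length):",
      np.round(gbt.feature_importances_, 3))
```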
16.  A clustering approach for identification of enriched domains from histone modification ChIP-Seq data 
Bioinformatics  2009;25(15):1952-1958.
Motivation: Chromatin states are the key to gene regulation and cell identity. Chromatin immunoprecipitation (ChIP) coupled with high-throughput sequencing (ChIP-Seq) is increasingly being used to map epigenetic states across genomes of diverse species. Chromatin modification profiles are frequently noisy and diffuse, spanning regions ranging from several nucleosomes to large domains of multiple genes. Much of the early work on the identification of ChIP-enriched regions for ChIP-Seq data has focused on identifying localized regions, such as transcription factor binding sites. Bioinformatic tools to identify diffuse domains of ChIP-enriched regions have been lacking.
Results: Based on the biological observation that histone modifications tend to cluster to form domains, we present a method that identifies spatial clusters of signals unlikely to appear by chance. This method pools together enrichment information from neighboring nucleosomes to increase sensitivity and specificity. Using genome-scale analysis, as well as the examination of loci with validated epigenetic states, we demonstrate that this method outperforms existing methods in identifying ChIP-enriched signals for histone modification profiles. We also demonstrate the application of this unbiased method to important issues in ChIP-Seq data analysis, such as data normalization for quantitative comparison of epigenetic modification levels across cell types and growth conditions. (A toy sketch of window scoring and merging follows this entry.)
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2732366  PMID: 19505939
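The sketch below illustrates, in a heavily simplified form, the general idea of pooling enrichment across neighboring windows: score fixed-size windows against a genome-wide Poisson background and merge nearby enriched windows into broad domains, tolerating short gaps. The window counts, background rate, p-value cutoff and gap rule are all illustrative assumptions, and the scoring is far simpler than the published method.

```python
import numpy as np
from scipy.stats import poisson

def call_islands(window_counts, background_rate, p_cutoff=1e-3, max_gap=1):
    """Merge consecutive enriched windows (Poisson tail probability < cutoff)
    into islands, tolerating up to `max_gap` non-enriched windows inside one."""
    enriched = poisson.sf(np.asarray(window_counts) - 1, background_rate) < p_cutoff
    islands, start, gap = [], None, 0
    for i, hit in enumerate(enriched):
        if hit:
            start, gap = (i if start is None else start), 0
        elif start is not None:
            gap += 1
            if gap > max_gap:
                islands.append((start, i - gap))
                start, gap = None, 0
    if start is not None:
        islands.append((start, len(enriched) - 1 - gap))
    return islands

counts = [2, 1, 9, 11, 2, 10, 12, 13, 1, 2, 2, 8, 1, 1]
print(call_islands(counts, background_rate=2.0))
# -> [(2, 7)]: windows 2-7 merge into one domain, bridging the dip at window 4
```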
17.  Rapid detection, classification and accurate alignment of up to a million or more related protein sequences 
Bioinformatics  2009;25(15):1869-1875.
Motivation: The patterns of sequence similarity and divergence present within functionally diverse, evolutionarily related proteins contain implicit information about corresponding biochemical similarities and differences. A first step toward accessing such information is to statistically analyze these patterns, which, in turn, requires that one first identify and accurately align a very large set of protein sequences. Ideally, the set should include many distantly related, functionally divergent subgroups. Because it is extremely difficult, if not impossible for fully automated methods to align such sequences correctly, researchers often resort to manual curation based on detailed structural and biochemical information. However, multiply-aligning vast numbers of sequences in this way is clearly impractical.
Results: This problem is addressed using Multiply-Aligned Profiles for Global Alignment of Protein Sequences (MAPGAPS). The MAPGAPS program uses a set of multiply-aligned profiles both as a query to detect and classify related sequences and as a template to multiply-align the sequences. It relies on Karlin–Altschul statistics for sensitivity and on PSI-BLAST (and other) heuristics for speed. Using as input a carefully curated multiple-profile alignment for P-loop GTPases, MAPGAPS correctly aligned weakly conserved sequence motifs within 33 distantly related GTPases of known structure. By comparison, the sequence- and structurally based alignment methods hmmalign and PROMALS3D misaligned at least 11 and 23 of these regions, respectively. When applied to a dataset of 65 million protein sequences, MAPGAPS identified, classified and aligned (with comparable accuracy) nearly half a million putative P-loop GTPase sequences.
Availability: A C++ implementation of MAPGAPS is available at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2732367  PMID: 19505947
18.  Estimating the posterior probability that genome-wide association findings are true or false 
Bioinformatics  2009;25(14):1807-1813.
Motivation: A limitation of current methods used to declare significance in genome-wide association studies (GWAS) is that they do not provide clear information about the probability that GWAS findings are true or false. This lack of information increases the chance of false discoveries and may result in real effects being missed.
Results: We propose a method to estimate the posterior probability that a marker has (no) effect given its test statistic value, also called the local false discovery rate (FDR), in a GWAS. A critical step involves estimating the parameters of the distribution of the true alternative tests. For this, we derived and implemented the real maximum likelihood function, which turned out to provide significantly more accurate estimates than the widely used mixture model likelihood. Actual GWAS data are used to illustrate the properties of the posterior probability estimates empirically. In addition to evaluating individual markers, a variety of applications are conceivable; for instance, posterior probability estimates can be used to control the FDR more precisely than the Benjamini–Hochberg procedure. (A small numerical sketch of the local FDR follows this entry.)
Availability: The codes are freely downloadable from the web site∼jbukszar.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2705227  PMID: 19420056
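As a small numerical illustration of the quantity being estimated, the sketch below evaluates the local FDR under a two-component mixture, lfdr(z) = pi0*f0(z) / (pi0*f0(z) + (1 - pi0)*f1(z)), with a standard normal null and a shifted normal alternative. In the paper the mixture parameters are estimated by maximum likelihood from the GWAS test statistics; here they are simply assumed for illustration.

```python
from scipy.stats import norm

def local_fdr(z, pi0=0.95, mu1=3.0, sigma1=1.5):
    """Posterior probability of 'no effect' given test statistic z under the
    mixture pi0*N(0,1) + (1-pi0)*N(mu1, sigma1^2); parameters assumed known
    here, whereas the paper estimates them by maximum likelihood."""
    f0 = norm.pdf(z, 0.0, 1.0)
    f1 = norm.pdf(z, mu1, sigma1)
    return pi0 * f0 / (pi0 * f0 + (1.0 - pi0) * f1)

for z in (1.0, 3.0, 5.0):
    print(z, round(float(local_fdr(z)), 4))   # posterior of a true effect is 1 - lfdr
```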
19.  ESG: extended similarity group method for automated protein function prediction 
Bioinformatics  2009;25(14):1739-1745.
Motivation: The importance of accurate automatic protein function prediction is ever increasing in the face of the large number of newly sequenced genomes and the proteomics data awaiting biological interpretation. Conventional methods have focused on annotation transfer based on high sequence similarity, which relies on the concept of homology. However, many cases have been reported in which simple transfer of function from the top hits of a homology search causes erroneous annotation. New methods are required that handle sequence similarity in a more robust way, combining signals from strongly and weakly similar proteins to predict function for unknown proteins with high reliability.
Results: We present the extended similarity group (ESG) method, which performs iterative sequence database searches and annotates a query sequence with Gene Ontology terms. Each annotation is assigned a probability based on its relative similarity score with the multiple-level neighbors in the protein similarity graph. We show how the statistical framework of ESG improves prediction accuracy by iteratively taking into account the neighborhood of the query protein in sequence similarity space. ESG outperforms conventional PSI-BLAST and the protein function prediction (PFP) algorithm. The iterative search is effective in capturing multiple domains in a query protein, enabling accurate prediction of several functions that originate from different domains.
Availability: ESG web server is available for automated protein function prediction at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2705228  PMID: 19435743
20.  Apollo: a community resource for genome annotation editing 
Bioinformatics  2009;25(14):1836-1837.
Summary: Apollo is a genome annotation-editing tool with an easy to use graphical interface. It is a component of the GMOD project, with ongoing development driven by the community. Recent additions to the software include support for the generic feature format version 3 (GFF3), continuous transcriptome data, a full Chado database interface, integration with remote services for on-the-fly BLAST and Primer BLAST analyses, graphical interfaces for configuring user preferences and full undo of all edit operations. Apollo's user community continues to grow, including its use as an educational tool for college and high-school students.
Availability: Apollo is a Java application distributed under a free and open source license. Installers for Windows, Linux, Unix, Solaris and Mac OS X are available at, and the source code is available from the SourceForge CVS repository at
PMCID: PMC2705230  PMID: 19439563
21.  Data structures and compression algorithms for genomic sequence data 
Bioinformatics  2009;25(14):1731-1738.
Motivation: The continuing exponential accumulation of full genome data, including full diploid human genomes, creates new challenges not only for understanding genomic structure, function and evolution, but also for the storage, navigation and privacy of genomic data. Here, we develop data structures and algorithms for the efficient storage of genomic and other sequence data that may also facilitate querying and protecting the data.
Results: The general idea is to encode only the differences between a genome sequence and a reference sequence, using absolute or relative coordinates for the locations of the differences. These locations and the corresponding differential variants can be encoded into binary strings using various entropy coding methods, from fixed codes, such as Golomb and Elias codes, to variable codes, such as Huffman codes. We demonstrate the approach and the various tradeoffs using highly variable human mitochondrial genome sequences as a testbed. With only a partial level of optimization, 3615 genome sequences occupying 56 MB in GenBank are compressed down to only 167 KB, achieving a 345-fold compression rate, using the revised Cambridge Reference Sequence as the reference sequence. Using the consensus sequence as the reference sequence, the data can be stored using only 133 KB, corresponding to a 433-fold level of compression, roughly a 23% improvement. Extensions to nuclear genomes and high-throughput sequencing data are discussed. (A toy encoding sketch follows this entry.)
Availability: Data are publicly available from GenBank, the HapMap web site, and the MITOMAP database. Supplementary materials with additional results, statistics, and software implementations are available from
PMCID: PMC2705231  PMID: 19447783
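A toy sketch of the core idea above: store only the substituted positions (as gaps between consecutive differences) and the substituted bases, entropy-coding the gaps with an Elias gamma code. Real reference-based schemes also handle insertions and deletions and choose among several codes; this sketch handles substitutions only, and the example sequences are made up.

```python
def elias_gamma(n):
    """Elias gamma code for a positive integer n: floor(log2 n) zeros,
    then the binary representation of n."""
    assert n >= 1
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary

def encode_substitutions(seq, ref):
    """Encode a sequence as gamma-coded gaps between mismatch positions
    plus a 2-bit code for each substituted base (SNVs only, equal lengths)."""
    base_bits = {"A": "00", "C": "01", "G": "10", "T": "11"}
    bits, last = [], -1
    for i, (s, r) in enumerate(zip(seq, ref)):
        if s != r:
            bits.append(elias_gamma(i - last))   # gap >= 1 since positions increase
            bits.append(base_bits[s])
            last = i
    return "".join(bits)

ref = "ACGTACGTACGTACGT"
seq = "ACGTACTTACGTACGA"        # differs from the reference at positions 6 and 15
code = encode_substitutions(seq, ref)
print(code, len(code), "bits vs", 2 * len(seq), "bits raw")
```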
22.  Relating periodicity of nucleosome organization and gene regulation 
Bioinformatics  2009;25(14):1782-1788.
Motivation: The relationship between nucleosome positioning and gene regulation is fundamental yet complex. Previous studies on genomic nucleosome positions have revealed a correlation between nucleosome occupancy on promoters and gene expression levels. Many of these studies focused on individual nucleosomes, especially those proximal to transcription start sites. To study the collective effect of multiple nucleosomes on the gene expression, we developed a mathematical approach based on autocorrelation to relate genomic nucleosome organization to gene regulation.
Results: We found that nucleosome organization in gene promoters can be well described by an autocorrelation transformation. Some promoters show obvious periods in their nucleosome organization, while others have no clear periodicity. Genes with periodic nucleosome organization in their promoters tend to be expressed at lower levels than genes without periodic nucleosome organization. These results suggest that the regular organization of nucleosomes plays a critical role in gene regulation. To quantitatively associate nucleosome organization and gene expression, we predicted gene expression solely from nucleosome status and found that nucleosome status accounts for ∼25% of the observed gene expression variability. Furthermore, we explored the underlying forces that maintain the periodicity of nucleosome organization, namely intrinsic forces (i.e. DNA sequence) and extrinsic forces (i.e. chromatin remodeling factors), and found that the extrinsic factors play a critical role in maintaining the periodic nucleosome organization. (A small autocorrelation sketch follows this entry.)
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2705233  PMID: 19447785
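The sketch below illustrates the kind of autocorrelation read-out described above: compute the normalized autocorrelation of a promoter's nucleosome occupancy signal and take the lag of the strongest peak beyond a minimum lag as a crude estimate of the repeat period. The signal is synthetic, the ~165 bp spacing and the minimum lag are arbitrary assumptions, and this is not the authors' transformation.

```python
import numpy as np

def autocorrelation(signal):
    """Normalized autocorrelation of a 1-D occupancy signal at all lags >= 0."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]
    return acf / acf[0]

def dominant_period(signal, min_lag=50):
    """Lag (in bp) of the highest autocorrelation peak beyond min_lag,
    a crude read-out of the nucleosome spacing."""
    acf = autocorrelation(signal)
    return int(min_lag + np.argmax(acf[min_lag:]))

# toy promoter: occupancy oscillating with a ~165 bp period plus noise
rng = np.random.default_rng(6)
pos = np.arange(1000)
occupancy = 1 + np.cos(2 * np.pi * pos / 165) + rng.normal(0, 0.3, pos.size)
print(dominant_period(occupancy))   # close to 165
```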
23.  Hierarchical hidden Markov model with application to joint analysis of ChIP-chip and ChIP-seq data 
Bioinformatics  2009;25(14):1715-1721.
Motivation: Chromatin immunoprecipitation (ChIP) followed by array hybridization, or ChIP-chip, is a powerful and widely used approach for identifying transcription factor binding sites (TFBS). Recently, massively parallel sequencing coupled with ChIP experiments (ChIP-seq) has been increasingly used as an alternative to ChIP-chip, offering cost-effective genome-wide coverage and resolution up to a single base pair. For many well-studied TFs, both ChIP-seq and ChIP-chip experiments have been applied and their data are publicly available. Previous analyses have revealed substantial technology-specific binding signals despite strong correlation between the two sets of results. Therefore, it is of interest to see whether the two data sources can be combined to enhance the detection of TFBS.
Results: In this work, a hierarchical hidden Markov model (HHMM) is proposed for combining data from ChIP-seq and ChIP-chip. In HHMM, inference results from individual HMMs for the ChIP-seq and ChIP-chip experiments are summarized by a higher-level HMM. Simulation studies show the advantage of HHMM when data from both technologies coexist. Analysis of two well-studied TFs, NRSF and CCCTC-binding factor (CTCF), also suggests that HHMM yields improved TFBS identification in comparison with analyses using individual data sources or a simple merger of the two.
Availability: Source code for the software ChIPmeta is freely available for download at∼hwchoi/, implemented in C and supported on linux.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2732365  PMID: 19447789
24.  A graphical algorithm for fast computation of identity coefficients and generalized kinship coefficients 
Bioinformatics  2009;25(12):1561-1563.
Summary: Computing the probability of identity-by-descent sharing among n genes, given only the pedigree of those genes, is a computationally challenging problem if n or the pedigree size is large. Here, I present a novel graphical algorithm for efficiently computing all generalized kinship coefficients for n genes. The graphical description transforms the problem from performing many recursions on the pedigree to performing a single traversal of a structure referred to as the kinship graph. (For background, a sketch of the classical pairwise kinship recursion follows this entry.)
Availability: The algorithm is implemented for n = 4 in the software package IdCoefs at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2687941  PMID: 19359355
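For background only: the sketch below implements the classical recursion for the pairwise kinship coefficient on a small, made-up pedigree, the quantity that generalized kinship and identity coefficients extend to n genes. It is not the paper's graphical algorithm; the pedigree and the recursion-ordering rule (always recurse on the later-generation individual) are standard textbook choices.

```python
from functools import lru_cache

# toy pedigree: child -> (father, mother); founders do not appear as keys
pedigree = {
    "C": ("A", "B"),
    "D": ("A", "B"),
    "E": ("C", "D"),   # E is inbred: its parents are full sibs
}

def parents(x):
    return pedigree.get(x, (None, None))

@lru_cache(maxsize=None)
def depth(x):
    """Generation depth: 0 for founders, 1 + max parental depth otherwise."""
    f, m = parents(x)
    return 0 if f is None else 1 + max(depth(f), depth(m))

@lru_cache(maxsize=None)
def kinship(i, j):
    """Classical pairwise kinship coefficient phi(i, j): the probability that a
    gene drawn at random from i and one drawn from j are identical by descent."""
    if i is None or j is None:
        return 0.0
    if i == j:
        f, m = parents(i)
        return 0.5 * (1.0 + kinship(f, m))
    # recurse on the later-generation individual, which cannot be an ancestor of the other
    if depth(i) < depth(j):
        i, j = j, i
    f, m = parents(i)
    return 0.5 * (kinship(f, j) + kinship(m, j))

print(kinship("A", "B"))   # unrelated founders: 0.0
print(kinship("C", "D"))   # full sibs: 0.25
print(kinship("C", "E"))   # parent and child of a full-sib mating: 0.375
print(kinship("E", "E"))   # self-kinship of an inbred individual: 0.625
```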
25.  Enrichment constrained time-dependent clustering analysis for finding meaningful temporal transcription modules 
Bioinformatics  2009;25(12):1521-1527.
Motivation: Clustering is a popular data exploration technique widely used in microarray data analysis. When dealing with time-series data, however, most conventional clustering algorithms either use one-way clustering methods, which fail to consider the heterogeneity of the temporal domain, or use two-way clustering methods that do not take into account the time dependency between samples, thus producing less informative results. Furthermore, enrichment analysis is often performed independently of, and after, clustering; such practice, though capable of revealing biologically significant clusters, cannot guide the clustering toward biologically significant results.
Results: We present a new enrichment constrained framework (ECF) coupled with a time-dependent iterative signature algorithm (TDISA) which, by applying a sliding time window to incorporate the time dependency of samples and imposing an enrichment constraint on the clustering parameters, allows supervised identification of temporal transcription modules (TTMs) that are biologically meaningful. Rigorous mathematical definitions of TTMs and of the enrichment constraint framework are also provided; these serve as objective functions for retrieving biologically significant modules. We applied the enrichment constrained time-dependent iterative signature algorithm (ECTDISA) to human gene expression time-series data from Kaposi's sarcoma-associated herpesvirus (KSHV) infection of human primary endothelial cells; the result not only confirms known biological facts, but also reveals new insight into the molecular mechanism of KSHV infection.
Availability: Data and Matlab code are available at∼yfhuang/ECTDISA.html
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2687989  PMID: 19351618
