
Results 1-25 (418)

1.  A variable selection method for genome-wide association studies 
Bioinformatics  2010;27(1):1-8.
Motivation: Genome-wide association studies (GWAS) involving half a million or more single nucleotide polymorphisms (SNPs) allow genetic dissection of complex diseases in a holistic manner. The common practice of analyzing one SNP at a time does not fully realize the potential of GWAS to identify multiple causal variants and to predict risk of disease. Existing methods for joint analysis of GWAS data tend to miss causal SNPs that are marginally uncorrelated with disease and have high false discovery rates (FDRs).
Results: We introduce GWASelect, a statistically powerful and computationally efficient variable selection method designed to tackle the unique challenges of GWAS data. This method searches iteratively over the potential SNPs conditional on previously selected SNPs and is thus capable of capturing causal SNPs that are marginally correlated with disease as well as those that are marginally uncorrelated with disease. A special resampling mechanism is built into the method to reduce false positive findings. Simulation studies demonstrate that GWASelect performs well under a wide spectrum of linkage disequilibrium patterns and can be substantially more powerful than existing methods in capturing causal variants while having a lower FDR. In addition, regression models based on GWASelect tend to yield more accurate prediction of disease risk than existing methods. The advantages of GWASelect are illustrated with the Wellcome Trust Case-Control Consortium (WTCCC) data.
Availability: The software implementing GWASelect is available at
Access to WTCCC data:
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3025714  PMID: 21036813
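The iterative conditional search with resampling described above can be illustrated with a toy sketch: repeatedly pick the SNP most associated with the residual of the phenotype given the SNPs already selected, and retain only SNPs chosen stably across resampled half-samples. Everything here (function name, scoring rule, parameter values) is an invented simplification, not the authors' GWASelect implementation.

```python
import numpy as np

def iterative_select(X, y, n_select=5, n_resample=20, keep_frac=0.75, rng=None):
    """Toy sketch of iterative conditional SNP selection with resampling.

    At each step the SNP most correlated with the residual of y given the
    already-selected SNPs is chosen; a SNP is retained only if it is picked
    in at least `keep_frac` of resampled half-samples."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_resample):
        idx = rng.choice(n, n // 2, replace=False)
        Xs, ys = X[idx], y[idx]
        selected = []
        resid = ys - ys.mean()
        for _ in range(n_select):
            scores = np.abs(Xs.T @ resid)
            scores[selected] = -np.inf          # never re-pick a selected SNP
            selected.append(int(np.argmax(scores)))
            # refit y on the selected SNPs; the residual conditions the next pick
            beta, *_ = np.linalg.lstsq(Xs[:, selected], ys, rcond=None)
            resid = ys - Xs[:, selected] @ beta
        counts[selected] += 1
    return np.flatnonzero(counts / n_resample >= keep_frac)
```

On simulated data with one strong causal SNP, the selection frequency concentrates on that SNP while unstable noise picks fall below the retention threshold.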
2.  A machine learning approach to predicting protein-ligand binding affinity with applications to molecular docking 
Bioinformatics  2010;26(9):1169-1175.
Accurately predicting the binding affinities of large sets of diverse protein-ligand complexes is an extremely challenging task. The scoring functions that attempt such computational prediction are essential for analysing the outputs of molecular docking, which is in turn an important technique for drug discovery, chemical biology and structural biology. Each scoring function assumes a predetermined, theory-inspired functional form for the relationship between the variables that characterise the complex (which also include parameters fitted to experimental or simulation data) and its predicted binding affinity. The inherent problem of this rigid approach is that it leads to poor predictivity for those complexes that do not conform to the modelling assumptions. Moreover, resampling strategies, such as cross-validation or bootstrapping, are still not systematically used to guard against overfitting of the calibration data when estimating scoring-function parameters.
We propose a novel scoring function (RF-Score) that circumvents the need for problematic modelling assumptions via non-parametric machine learning. In particular, Random Forest was used to implicitly capture binding effects that are hard to model explicitly. RF-Score is compared with the state of the art on the demanding PDBbind benchmark. Results show that RF-Score is a very competitive scoring function. Importantly, RF-Score’s performance was shown to improve dramatically with training set size and hence the future availability of more high quality structural and interaction data is expected to lead to improved versions of RF-Score.
PMCID: PMC3524828  PMID: 20236947
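The RF-Score idea above, a Random Forest regression mapping complex descriptors to binding affinity, can be sketched in a few lines. The feature design below (counts of protein-ligand atom-type contacts) follows the general descriptor style described for RF-Score, but the data are synthetic and the dimensions and coefficients are invented.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical feature table: each row counts protein-ligand atom-type
# contacts within a distance cutoff; the target is a binding affinity
# (e.g. pKd). Data here are synthetic stand-ins.
rng = np.random.default_rng(0)
X = rng.poisson(5, size=(300, 36)).astype(float)   # 36 contact-count features
y = 0.4 * X[:, 0] - 0.2 * X[:, 5] + rng.normal(0, 0.5, 300)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X, y)
r2 = model.score(X, y)   # in-sample R^2; a held-out set (as in the PDBbind
                         # benchmark) would be used for honest evaluation
```

The non-parametric forest captures interactions between contact features without a predetermined functional form, which is the point the abstract contrasts against classical scoring functions.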
3.  Post-hoc power estimation in large-scale multiple testing problems 
Bioinformatics  2010;26(8):1050-1056.
The statistical power, or multiple Type II error rate, in large-scale multiple testing problems such as gene expression microarray experiments depends on typically unknown parameters and is therefore difficult to assess a priori. However, it has been suggested to estimate the multiple Type II error rate post-hoc, based on the observed data.
We consider a class of post-hoc estimators that are functions of the estimated proportion of true null hypotheses among all hypotheses. Numerous estimators for this proportion have been proposed and we investigate the statistical properties of the derived multiple Type II error rate estimators in an extensive simulation study.
The performance of the estimators in terms of the mean squared error depends sensitively on the distributional scenario. Estimators based on empirical distributions of the null hypotheses are superior in the presence of strongly correlated test statistics.
PMCID: PMC3500624  PMID: 20189938
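The class of estimators discussed above is built on an estimate of the proportion of true null hypotheses. A common choice is the Storey-type estimator, sketched below; the cutoff `lam` and the simulation in the comment are illustrative, and this is one member of the family of estimators the paper compares, not its full Type II error machinery.

```python
import numpy as np

def storey_pi0(pvals, lam=0.5):
    """Storey-type estimate of the proportion of true null hypotheses:
    p-values above `lam` are assumed to come mostly from nulls (uniform on
    [0, 1]), so pi0 is approximately #{p > lam} / ((1 - lam) * m).
    A plug-in pi0 then feeds post-hoc estimates of the multiple Type II
    error rate via m1 = m * (1 - pi0), the estimated number of false nulls."""
    pvals = np.asarray(pvals)
    return min(1.0, float(np.mean(pvals > lam)) / (1.0 - lam))
```

For instance, on a mixture of uniform (null) and near-zero (alternative) p-values in equal proportion, the estimate lands near 0.5.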
4.  Multi-level mixed effects models for bead arrays 
Bioinformatics  2010;27(5):633-640.
Motivation: Bead arrays are becoming a popular platform for high-throughput expression profiling. However, the number of beads targeting a transcript and the variation of their intensities differ from sample to sample in these arrays. As a result, the accuracy of a transcript's expression intensity differs across arrays.
Results: We provide evidence, with publicly available spike-in data, that the false discovery rate of differential expression is reduced by modeling bead-level variability with a multi-level mixed effects model. We compare the performance of our proposed model to existing analysis methods for bead arrays: the unweighted t-test and other weighted methods. Additionally, we provide theoretical insights into when the multi-level mixed effects model outperforms other methods. Finally, we provide a software program for differential expression analysis using the multi-level mixed effects model that analyzes tens of thousands of genes efficiently.
Availability: The software program is freely available on web at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3042178  PMID: 21169374
5.  Adjustment for local ancestry in genetic association analysis of admixed populations 
Bioinformatics  2010;27(5):670-677.
Motivation: Admixed populations offer a unique opportunity for mapping diseases that have large disease allele frequency differences between ancestral populations. However, association analysis in such populations is challenging because population stratification may lead to association with loci unlinked to the disease locus.
Methods and results: We show that local ancestry at a test single nucleotide polymorphism (SNP) may confound the association signal and ignoring it can lead to spurious association. We demonstrate theoretically that adjustment for local ancestry at the test SNP is sufficient to remove the spurious association regardless of the mechanism of population stratification, whether due to local or global ancestry differences among study subjects; however, global ancestry adjustment procedures may not be effective. We further develop two novel association tests that adjust for local ancestry. Our first test is based on a conditional likelihood framework which models the distribution of the test SNP given disease status and flanking marker genotypes. A key advantage of this test lies in its ability to incorporate different directions of association in the ancestral populations. Our second test, which is computationally simpler, is based on logistic regression, with adjustment for local ancestry proportion. We conducted extensive simulations and found that the Type I error rates of our tests are under control; however, the global adjustment procedures yielded inflated Type I error rates when stratification is due to local ancestry differences.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3042179  PMID: 21169375
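The simpler of the two tests above, logistic regression with local ancestry as a covariate, is easy to demonstrate on synthetic data. The simulation design and all parameter values below are invented, and sklearn's regularized logistic regression stands in for the paper's tests; the point shown is only that conditioning on local ancestry shrinks a stratification-induced SNP effect.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic admixed cohort: disease risk depends on local ancestry (a
# confounder), not on the test SNP; ancestry also shifts SNP frequency,
# which creates a spurious marginal SNP-disease association.
rng = np.random.default_rng(0)
n = 2000
ancestry = rng.binomial(2, 0.5, n)                    # local ancestry dosage 0-2
snp = rng.binomial(2, 0.2 + 0.25 * ancestry / 2, n)   # SNP frequency tracks ancestry
risk = 1 / (1 + np.exp(-(-1.0 + 1.2 * ancestry)))     # disease driven by ancestry only
case = rng.binomial(1, risk)

naive = LogisticRegression().fit(snp[:, None], case)
adjusted = LogisticRegression().fit(np.column_stack([snp, ancestry]), case)
# naive.coef_[0, 0] is inflated by stratification; adjusted.coef_[0, 0]
# (the SNP coefficient conditional on local ancestry) shrinks toward zero.
```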
6.  PUGSVM: a caBIG™ analytical tool for multiclass gene selection and predictive classification 
Bioinformatics  2010;27(5):736-738.
Summary: Phenotypic Up-regulated Gene Support Vector Machine (PUGSVM) is a cancer Biomedical Informatics Grid (caBIG™) analytical tool for multiclass gene selection and classification. PUGSVM addresses the problem of imbalanced class separability, small sample size and high gene space dimensionality, where multiclass gene markers are defined by the union of one-versus-everyone phenotypic upregulated genes, and used by a well-matched one-versus-rest support vector machine. PUGSVM provides a simple yet more accurate strategy to identify statistically reproducible mechanistic marker genes for characterization of heterogeneous diseases.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3042183  PMID: 21186245
7.  CompleteMOTIFs: DNA motif discovery platform for transcription factor binding experiments 
Bioinformatics  2010;27(5):715-717.
Summary: CompleteMOTIFs (cMOTIFs) is an integrated web tool developed to facilitate systematic discovery of overrepresented transcription factor binding motifs from high-throughput chromatin immunoprecipitation experiments. Comprehensive annotations and Boolean logic operations on multiple peak locations enable users to focus on genomic regions of interest for de novo motif discovery using tools such as MEME, Weeder and ChIPMunk. The pipeline incorporates a scanning tool for known motifs from the TRANSFAC and JASPAR databases, and performs an enrichment test using local or precalculated background models that significantly improve the motif scanning result. Furthermore, using the cMOTIFs pipeline, we demonstrated that multiple transcription factors could cooperatively bind upstream of important stem cell differentiation regulators.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3105477  PMID: 21183585
8.  Multilign: an algorithm to predict secondary structures conserved in multiple RNA sequences 
Bioinformatics  2010;27(5):626-632.
Motivation: With recent advances in sequencing, structural and functional studies of RNA lag behind the discovery of sequences. Computational analysis of RNA is increasingly important to reveal structure–function relationships with low cost and speed. The purpose of this study is to use multiple homologous sequences to infer a conserved RNA structure.
Results: A new algorithm, called Multilign, is presented to find the lowest free energy RNA secondary structure common to multiple sequences. Multilign is based on Dynalign, which is a program that simultaneously aligns and folds two sequences to find the lowest free energy conserved structure. For Multilign, Dynalign is used to progressively construct a conserved structure from multiple pairwise calculations, with one sequence used in all pairwise calculations. A base pair is predicted only if it is contained in the set of low free energy structures predicted by all Dynalign calculations. In this way, Multilign improves prediction accuracy by keeping the genuine base pairs and excluding competing false base pairs. Multilign has computational complexity that scales linearly in the number of sequences. Multilign was tested on extensive datasets of sequences with known structure and its prediction accuracy is among the best of available algorithms. Multilign can run on long sequences (> 1500 nt) and an arbitrarily large number of sequences.
Availability: The algorithm is implemented in ANSI C++ and can be downloaded as part of the RNAstructure package at:
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3042186  PMID: 21193521
9.  A Bayesian approach for estimating calibration curves and unknown concentrations in immunoassays 
Bioinformatics  2010;27(5):707-712.
Motivation: Immunoassays are primary diagnostic and research tools throughout the medical and life sciences. The common approach to the processing of immunoassay data involves estimation of the calibration curve followed by inversion of the calibration function to read off the concentration estimates. This approach, however, does not lend itself easily to acceptable estimation of confidence limits on the estimated concentrations. Such estimates must account for uncertainty in the calibration curve as well as uncertainty in the target measurement. Even point estimates can be problematic: because of the non-linearity of calibration curves and error heteroscedasticity, the neglect of components of measurement error can produce significant bias.
Methods: We have developed a Bayesian approach for the estimation of concentrations from immunoassay data that treats the propagation of measurement error appropriately. The method uses Markov Chain Monte Carlo (MCMC) to approximate the posterior distribution of the target concentrations and numerically compute the relevant summary statistics. Software implementing the method is freely available for public use.
Results: The new method was tested on both simulated and experimental datasets with different measurement error models. The method outperformed the common inverse method on samples with large measurement errors. Even in cases with extreme measurements where the common inverse method failed, our approach always generated reasonable estimates for the target concentrations.
Availability: Project name: Baecs; Project home page:; Operating systems: Linux, MacOS X and Windows; Programming language: C++; License: Free for Academic Use.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3465100  PMID: 21149344
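The calibration-inversion problem above can be made concrete with the standard four-parameter logistic curve and a grid approximation to the posterior over concentration. This is a deliberately simplified stand-in for the paper's MCMC approach: the curve parameters are treated as known, the error is Gaussian, the prior is flat on the grid, and all numbers are invented.

```python
import numpy as np

def four_pl(x, a, b, c, d):
    """Four-parameter logistic calibration curve, a standard immunoassay
    model: response falls from a (at zero concentration) to d, with
    midpoint c and slope b. This parameterization is an assumption here."""
    return d + (a - d) / (1.0 + (x / c) ** b)

def posterior_concentration(y_obs, sigma, xs, a, b, c, d):
    """Grid approximation to the posterior over concentration given one
    measurement y_obs with Gaussian error sigma -- a stand-in for the
    paper's MCMC, which also propagates calibration-curve uncertainty."""
    like = np.exp(-0.5 * ((y_obs - four_pl(xs, a, b, c, d)) / sigma) ** 2)
    return like / like.sum()
```

Reading a concentration off the posterior mode (or reporting a credible interval from the posterior) replaces the point inversion of the fitted curve that the abstract criticizes.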
10.  The Bayesian lasso for genome-wide association studies 
Bioinformatics  2010;27(4):516-523.
Motivation: Despite their success in identifying genes that affect complex disease or traits, current genome-wide association studies (GWASs) based on a single SNP analysis are too simple to elucidate a comprehensive picture of the genetic architecture of phenotypes. A simultaneous analysis of a large number of SNPs, although statistically challenging, especially with a small number of samples, is crucial for genetic modeling.
Method: We propose a two-stage procedure for multi-SNP modeling and analysis in GWASs, by first producing a ‘preconditioned’ response variable using a supervised principal component analysis and then formulating a Bayesian lasso to select a subset of significant SNPs. The Bayesian lasso is implemented with a hierarchical model, in which scale mixtures of normal distributions are used as prior distributions for the genetic effects and exponential priors are considered for their variances, and then solved by using the Markov chain Monte Carlo (MCMC) algorithm. Our approach obviates the choice of the lasso parameter by imposing a diffuse hyperprior on it and estimating it along with other parameters and is particularly powerful for selecting the most relevant SNPs for GWASs, where the number of predictors exceeds the number of observations.
Results: The new approach was examined through a simulation study. By using the approach to analyze a real dataset from the Framingham Heart Study, we detected several significant genes that are associated with body mass index (BMI). Our findings support the previous results about BMI-related SNPs and, meanwhile, gain new insights into the genetic control of this trait.
Availability: The computer code for the approach developed is available at Penn State Center for Statistical Genetics web site,
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3105480  PMID: 21156729
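The two-stage "precondition, then select" scheme above can be sketched with point estimates. The paper preconditions with supervised principal components and then runs a fully Bayesian, MCMC-solved lasso; in this sketch a least-squares projection onto screened SNPs stands in for the preconditioner and plain Lasso stands in for the Bayesian fit, so the screening size and penalty below are invented.

```python
import numpy as np
from sklearn.linear_model import Lasso

def precondition_then_lasso(X, y, n_screen=10, alpha=0.1):
    """Sketch of a two-stage 'precondition, then select' scheme.

    Stage 1: screen the SNPs most correlated with y and project y onto
    them, giving a denoised ('preconditioned') response.
    Stage 2: lasso of the preconditioned response on all SNPs; the
    nonzero coefficients are the selected SNPs."""
    Xc = X - X.mean(0)
    yc = y - y.mean()
    corr = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) + 1e-12)
    top = np.argsort(corr)[-n_screen:]
    beta, *_ = np.linalg.lstsq(Xc[:, top], yc, rcond=None)
    y_pre = Xc[:, top] @ beta                 # preconditioned response
    fit = Lasso(alpha=alpha).fit(Xc, y_pre)
    return np.flatnonzero(fit.coef_)
```

Preconditioning first reduces the noise the selector must fight, which is the motivation for the two-stage design when predictors outnumber observations.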
11.  Variable selection for discriminant analysis with Markov random field priors for the analysis of microarray data 
Bioinformatics  2010;27(4):495-501.
Motivation: Discriminant analysis is an effective tool for the classification of experimental units into groups. Here, we consider the typical problem of classifying subjects according to phenotypes via gene expression data and propose a method that incorporates variable selection into the inferential procedure, for the identification of the important biomarkers. To achieve this goal, we build upon a conjugate normal discriminant model, both linear and quadratic, and include a stochastic search variable selection procedure via an MCMC algorithm. Furthermore, we incorporate into the model prior information on the relationships among the genes as described by a gene–gene network. We use a Markov random field (MRF) prior to map the network connections among genes. Our prior model assumes that neighboring genes in the network are more likely to have a joint effect on the relevant biological processes.
Results: We use simulated data to assess performances of our method. In particular, we compare the MRF prior to a situation where independent Bernoulli priors are chosen for the individual predictors. We also illustrate the method on benchmark datasets for gene expression. Our simulation studies show that employing the MRF prior improves on selection accuracy. In real data applications, in addition to identifying markers and improving prediction accuracy, we show how the integration of existing biological knowledge into the prior model results in an increased ability to identify genes with strong discriminatory power and also aids the interpretation of the results.
PMCID: PMC3105481  PMID: 21159623
12.  Using bioinformatics to predict the functional impact of SNVs 
Bioinformatics  2010;27(4):441-448.
Motivation: The past decade has seen the introduction of fast and relatively inexpensive methods to detect genetic variation across the genome and exponential growth in the number of known single nucleotide variants (SNVs). There is increasing interest in bioinformatics approaches to identify variants that are functionally important from millions of candidate variants. Here, we describe the essential components of bioinformatics tools that predict functional SNVs.
Results: Bioinformatics tools have great potential to identify functional SNVs, but the black box nature of many tools can be a pitfall for researchers. Understanding the underlying methods, assumptions and biases of these tools is essential to their intelligent application.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3105482  PMID: 21159622
13.  A computationally efficient modular optimal discovery procedure 
Bioinformatics  2010;27(4):509-515.
Motivation: It is well known that patterns of differential gene expression across biological conditions are often shared by many genes, particularly those within functional groups. Taking advantage of these patterns can lead to increased statistical power and biological clarity when testing for differential expression in a microarray experiment. The optimal discovery procedure (ODP), which maximizes the expected number of true positives for each fixed number of expected false positives, is a framework aimed at this goal. Storey et al. introduced an estimator of the ODP for identifying differentially expressed genes. However, their ODP estimator grows quadratically in computational time with respect to the number of genes. Reducing this computational burden is a key step in making the ODP practical for usage in a variety of high-throughput problems.
Results: Here, we propose a new estimate of the ODP called the modular ODP (mODP). The existing ‘full ODP’ requires that the likelihood function for each gene be evaluated according to the parameter estimates for all genes. The mODP assigns genes to modules according to a Kullback–Leibler distance, and then evaluates the statistic only at the module-averaged parameter estimates. We show that the mODP is relatively insensitive to the choice of the number of modules, but dramatically reduces the computational complexity from quadratic to linear in the number of genes. We compare the full ODP algorithm and mODP on simulated data and gene expression data from a recent study of Moroccan Amazighs. The mODP and full ODP algorithm perform very similarly across a range of comparisons.
Availability: The mODP methodology has been implemented into EDGE, a comprehensive gene expression analysis software package in R, available at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3105483  PMID: 21186247
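The module-assignment step that gives mODP its linear cost can be sketched as a clustering of per-gene parameter estimates: each gene's likelihood is then evaluated only at the K module averages rather than at all m per-gene estimates. In this toy version the genes are summarized by a single estimated mean and plain squared distance stands in for the paper's Kullback-Leibler assignment.

```python
import numpy as np

def assign_modules(means, K=10, n_iter=20):
    """Toy sketch of mODP's module step: cluster m per-gene parameter
    estimates into K modules, so downstream statistics are evaluated at
    K module averages (linear in m) instead of all m estimates
    (quadratic). Squared distance replaces the KL distance used in the
    paper, and genes are reduced to a single scalar estimate."""
    # initialize module centers from evenly spaced genes
    centers = means[np.linspace(0, len(means) - 1, K).astype(int)]
    for _ in range(n_iter):
        assign = np.argmin((means[:, None] - centers[None, :]) ** 2, axis=1)
        centers = np.array([means[assign == k].mean() if np.any(assign == k)
                            else centers[k] for k in range(K)])
    return assign, centers
```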
14.  Improving evolutionary models of protein interaction networks 
Bioinformatics  2010;27(3):376-382.
Motivation: Theoretical models of biological networks are valuable tools in evolutionary inference. Theoretical models based on gene duplication and divergence provide biologically plausible evolutionary mechanics. Similarities found between empirical networks and their theoretically generated counterpart are considered evidence of the role modeled mechanics play in biological evolution. However, the method by which these models are parameterized can lead to questions about the validity of the inferences. Selecting parameter values in order to produce a particular topological value obfuscates the possibility that the model may produce a similar topology for a large range of parameter values. Alternatively, a model may produce a large range of topologies, allowing (incorrect) parameter values to produce a valid topology from an otherwise flawed model. In order to lend biological credence to the modeled evolutionary mechanics, parameter values should be derived from the empirical data. Furthermore, recent work indicates that the timing and fate of gene duplications are critical to proper derivation of these parameters.
Results: We present a methodology for deriving evolutionary rates from empirical data that is used to parameterize duplication and divergence models of protein interaction network evolution. Our method avoids shortcomings of previous methods, which failed to consider the effect of subsequent duplications. From our parameter values, we find that existing duplication and divergence models are insufficient for modeling protein interaction network evolution. We introduce a model enhancement based on heritable interaction sites on the surface of a protein and find that it more closely reflects the high clustering found in the empirical network.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3031028  PMID: 21067999
15.  Improving the quality of protein similarity network clustering algorithms using the network edge weight distribution 
Bioinformatics  2010;27(3):326-333.
Motivation: Clustering protein sequence data into functionally specific families is a difficult but important problem in biological research. One useful approach for tackling this problem involves representing the sequence dataset as a protein similarity network, and afterwards clustering the network using advanced graph analysis techniques. Although a multitude of such network clustering algorithms have been developed over the past few years, comparing algorithms is often difficult because performance is affected by the specifics of network construction. We investigate an important aspect of network construction used in analyzing protein superfamilies and present a heuristic approach for improving the performance of several algorithms.
Results: We analyzed how the performance of network clustering algorithms relates to thresholding the network prior to clustering. Our results, over four different datasets, show that for each input dataset there exists an optimal threshold range over which an algorithm generates its most accurate clustering output. Our results further show that the optimal threshold range correlates with the shape of the edge weight distribution for the input similarity network. We used this correlation to develop an automated threshold selection heuristic to optimally filter a similarity network prior to clustering. This heuristic allows researchers to process their protein datasets with runtime-efficient network clustering algorithms without sacrificing the clustering accuracy of the final results.
Availability: Python code for implementing the automated threshold selection heuristic, together with the datasets used in our analysis, are available at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3031030  PMID: 21118823
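The pre-clustering filtering step described above amounts to zeroing out edges below a threshold taken from the edge-weight distribution. The paper derives the threshold from the distribution's shape; in the sketch below a fixed quantile is an illustrative stand-in, and the function name and parameters are invented.

```python
import numpy as np

def threshold_network(W, quantile=0.9):
    """Filter a symmetric similarity network before clustering: drop all
    edges whose weight falls below the chosen quantile of the observed
    edge-weight distribution (a fixed quantile stands in for the paper's
    shape-based automated threshold selection)."""
    W = np.asarray(W, float)
    iu = np.triu_indices_from(W, k=1)          # each undirected edge once
    cut = np.quantile(W[iu], quantile)
    Wt = np.where(W >= cut, W, 0.0)
    np.fill_diagonal(Wt, 0.0)
    return Wt, cut
```

A sparser thresholded network is both faster to cluster and, within the optimal threshold range the paper identifies, yields more accurate clusters.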
16.  Bayesian ensemble methods for survival prediction in gene expression data 
Bioinformatics  2010;27(3):359-367.
Motivation: We propose a Bayesian ensemble method for survival prediction in high-dimensional gene expression data. We specify a fully Bayesian hierarchical approach based on an ensemble ‘sum-of-trees’ model and illustrate our method using three popular survival models. Our non-parametric method incorporates both additive and interaction effects between genes, which results in high predictive accuracy compared with other methods. In addition, our method provides model-free variable selection of important prognostic markers based on controlling the false discovery rates; thus providing a unified procedure to select relevant genes and predict survivor functions.
Results: We assess the performance of our method on several simulated and real microarray datasets. We show that our method selects genes potentially related to the development of the disease and yields predictive performance that is very competitive with many other existing methods.
Supplementary Information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3031034  PMID: 21148161
17.  Toward an automatic method for extracting cancer- and other disease-related point mutations from the biomedical literature 
Bioinformatics  2010;27(3):408-415.
Motivation: A major goal of biomedical research in personalized medicine is to find relationships between mutations and their corresponding disease phenotypes. However, most of the disease-related mutational data are currently buried in the biomedical literature in textual form and lack the necessary structure to allow easy retrieval and visualization. We introduce a high-throughput computational method for the identification of relevant disease mutations in PubMed abstracts applied to prostate (PCa) and breast cancer (BCa) mutations.
Results: We developed the extractor of mutations (EMU) tool to identify mutations and their associated genes. We benchmarked EMU against MutationFinder—a tool to extract point mutations from text. Our results show that both methods achieve comparable performance on two manually curated datasets. We also benchmarked EMU's performance for extracting the complete mutational information and phenotype. Remarkably, we show that one of the steps in our approach, a filter based on sequence analysis, increases the precision for that task from 0.34 to 0.59 (PCa) and from 0.39 to 0.61 (BCa). We also show that this high-throughput approach can be extended to other diseases.
Discussion: Our method improves the current status of disease-mutation databases by significantly increasing the number of annotated mutations. We found 51 and 128 mutations manually verified to be related to PCa and BCa, respectively, that are not currently annotated for these cancer types in the OMIM or Swiss-Prot databases. EMU's retrieval performance represents a 2-fold improvement in the number of annotated mutations for PCa and BCa. We further show that our method can benefit from full-text analysis once there is an increase in Open Access availability of full-text articles.
Availability: Freely available at:
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3031038  PMID: 21138947
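The core identification step in tools like EMU and MutationFinder, recognizing point mutations mentioned in text, can be illustrated with a minimal regular expression for the common wild-type/position/mutant notation. Real tools handle far more variants (three-letter codes, textual forms like "arginine 273 to histidine"), so this pattern is a deliberately reduced sketch.

```python
import re

# Minimal pattern in the spirit of MutationFinder/EMU: match point mutations
# written as <wild-type AA><position><mutant AA> using one-letter amino-acid
# codes, e.g. "R273H". Word boundaries keep gene symbols like "TP53" from
# matching.
POINT_MUTATION = re.compile(
    r'\b([ACDEFGHIKLMNPQRSTVWY])(\d+)([ACDEFGHIKLMNPQRSTVWY])\b')

text = "The tumor harbored R273H and G245S substitutions in TP53."
hits = POINT_MUTATION.findall(text)   # [('R', '273', 'H'), ('G', '245', 'S')]
```

EMU's sequence-based filter then checks that the extracted wild-type residue actually occurs at the stated position in the candidate gene's sequence, which is the step the abstract credits with the large precision gain.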
18.  Principal network analysis: identification of subnetworks representing major dynamics using gene expression data 
Bioinformatics  2010;27(3):391-398.
Motivation: Systems biology attempts to describe complex systems behaviors in terms of dynamic operations of biological networks. However, there is a lack of tools that can effectively decode complex network dynamics over multiple conditions.
Results: We present principal network analysis (PNA) that can automatically capture major dynamic activation patterns over multiple conditions and then generate protein and metabolic subnetworks for the captured patterns. We first demonstrated the utility of this method by applying it to a synthetic dataset. The results showed that PNA correctly captured the subnetworks representing dynamics in the data. We further applied PNA to two time-course gene expression profiles collected from (i) MCF7 cells after treatments of HRG at multiple doses and (ii) brain samples of four strains of mice infected with two prion strains. The resulting subnetworks and their interactions revealed network dynamics associated with HRG dose-dependent regulation of cell proliferation and differentiation and early PrPSc accumulation during prion infection.
Availability: The web-based software is available at:
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3031040  PMID: 21193522
19.  Real-world comparison of CPU and GPU implementations of SNPrank: a network analysis tool for GWAS 
Bioinformatics  2010;27(2):284-285.
Motivation: Bioinformatics researchers have a variety of programming languages and architectures at their disposal, and recent advances in graphics processing unit (GPU) computing have added a promising new option. However, many performance comparisons inflate the actual advantages of GPU technology. In this study, we carry out a realistic performance evaluation of SNPrank, a network centrality algorithm that ranks single nucleotide polymorphisms (SNPs) based on their importance in the context of a phenotype-specific interaction network. Our goal is to identify the best computational engine for the SNPrank web application and to provide a variety of well-tested implementations of SNPrank for bioinformaticians to integrate into their research.
Results: Using SNP data from the Wellcome Trust Case Control Consortium genome-wide association study of Bipolar Disorder, we compare multiple SNPrank implementations, including Python, Matlab and Java as well as CPU versus GPU implementations. When compared with naïve, single-threaded CPU implementations, the GPU yields a large improvement in the execution time. However, with comparable effort, multi-threaded CPU implementations negate the apparent advantage of GPU implementations.
Availability: The SNPrank code is open source and available at
PMCID: PMC3018810  PMID: 21115438
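The kind of network centrality ranking SNPrank performs can be sketched with a PageRank-style power iteration over the SNP interaction matrix. SNPrank itself uses a related but more elaborate formulation, so the damping factor and normalization below are assumptions for illustration; the dense matrix-vector product in the loop is also exactly the operation the CPU/GPU comparison above is measuring.

```python
import numpy as np

def snp_centrality(A, damping=0.85, tol=1e-10):
    """Illustrative eigenvector-centrality ranking of SNPs from a symmetric
    interaction matrix A, computed by power iteration with a PageRank-style
    teleport term (damping) to guarantee convergence."""
    A = np.asarray(A, float)
    n = A.shape[0]
    col = A.sum(0)
    # column-normalize, then mix with a uniform teleport distribution
    M = damping * (A / np.where(col > 0, col, 1)) + (1 - damping) / n
    r = np.full(n, 1.0 / n)
    while True:
        r_new = M @ r              # the dominant cost: dense mat-vec per step
        r_new /= r_new.sum()
        if np.abs(r_new - r).sum() < tol:
            return r_new
        r = r_new
```

On a star-shaped toy network the hub SNP receives the highest score, as expected of a centrality measure.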
20.  PriSM: a primer selection and matching tool for amplification and sequencing of viral genomes 
Bioinformatics  2010;27(2):266-267.
Summary: PriSM is a set of algorithms designed to select and match degenerate primer pairs for the amplification of viral genomes. The design of panels of hundreds of primer pairs takes just hours using this program, compared with days using a manual approach. PriSM allows for rapid in silico optimization of primers for downstream applications such as sequencing. As a validation, PriSM was used to create an amplification primer panel for human immunodeficiency virus (HIV) Clade B.
Availability: The program is freely available for use at:
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3018813  PMID: 21068001
21.  Predicting in vitro drug sensitivity using Random Forests 
Bioinformatics  2010;27(2):220-224.
Motivation: Panels of cell lines such as the NCI-60 have long been used to test drug candidates for their ability to inhibit proliferation. Predictive models of in vitro drug sensitivity have previously been constructed using gene expression signatures generated from gene expression microarrays. These statistical models allow the prediction of drug response for cell lines not in the original NCI-60. We improve on existing techniques by developing a novel multistep algorithm that builds regression models of drug response using Random Forest, an ensemble approach based on classification and regression trees (CART).
Results: This method proved successful in predicting drug response for both a panel of 19 breast cancer cell lines and a panel of 7 glioma cell lines, outperformed other methods based on differential gene expression, and has general utility for any application that seeks to relate gene expression data to a continuous output variable.
Implementation: Software was written in the R language and will be available together with associated gene expression and drug response data as the package ivDrug at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3018816  PMID: 21134890
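The basic setup in the ivDrug entry above, regressing a continuous drug-response value on expression profiles with a Random Forest, can be sketched as follows. This is a minimal illustration on synthetic data using scikit-learn's `RandomForestRegressor` (the paper's own multistep algorithm is in R and includes additional steps not shown here; the data shapes and coefficients below are invented):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for expression data: 120 cell lines x 100 genes,
# with a continuous response (e.g. log GI50) driven by a few genes.
X = rng.normal(size=(120, 100))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.3, size=120)

# Fit on a training panel, predict response for held-out cell lines.
X_train, X_test = X[:90], X[90:]
y_train, y_test = y[:90], y[90:]

model = RandomForestRegressor(n_estimators=300, random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)

# Predictions on unseen lines should track the true responses.
r = np.corrcoef(pred, y_test)[0, 1]
```

The ensemble averages many CART trees grown on bootstrap samples, which is what lets it cope with far more genes than cell lines.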
22.  Analyzing marginal cases in differential shotgun proteomics 
Bioinformatics  2010;27(2):275-276.
Summary: We present an approach to statistically pinpoint differentially expressed proteins that have quantitation values near the quantitation threshold and are not identified in all replicates (marginal cases). Our method uses a Bayesian strategy to combine parametric statistics with an empirical distribution built from the reproducibility quality of the technical replicates.
Availability: The software is freely available for academic use at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3018820  PMID: 21075743
23.  Inverse perturbation for optimal intervention in gene regulatory networks 
Bioinformatics  2010;27(1):103-110.
Motivation: Analysis and intervention in the dynamics of gene regulatory networks is at the heart of emerging efforts in the development of modern treatment of numerous ailments including cancer. The ultimate goal is to develop methods to intervene in the function of living organisms in order to drive cells away from a malignant state into a benign form. A serious limitation of much of the previous work in cancer network analysis is the use of external control, which requires intervention at each time step, for an indefinite time interval. This is in sharp contrast to the proposed approach, which relies on the solution of an inverse perturbation problem to introduce a one-time intervention in the structure of regulatory networks. This isolated intervention transforms the steady-state distribution of the dynamic system to the desired steady-state distribution.
Results: We formulate the optimal intervention problem in gene regulatory networks as a minimal perturbation of the network that forces it to converge to a desired steady-state distribution of gene regulation. We cast optimal intervention in gene regulation as a convex optimization problem, thus providing a globally optimal solution which can be efficiently computed using standard toolboxes for convex optimization. The criterion adopted for optimality is chosen to minimize potential adverse effects of the intervention strategy. We consider a perturbation that minimizes (i) the overall energy of change between the original and controlled networks and (ii) the time needed to reach the desired steady-state distribution of gene regulation. Furthermore, we show that there is an inherent trade-off between minimizing the energy of the perturbation and the convergence rate to the desired distribution. We apply the proposed control to the human melanoma gene regulatory network.
Availability: The MATLAB code for optimal intervention in gene regulatory networks can be found online:
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3008638  PMID: 21062762
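The "minimal energy" flavor of the inverse perturbation problem above can be illustrated on a small Markov-chain model of network dynamics: find the smallest (Frobenius-norm) one-time change to the transition matrix that makes a desired distribution stationary. The sketch below is our own simplification, not the paper's MATLAB formulation: it handles only the linear constraints (stationarity and preserved row sums) via a closed-form least-norm solution, and ignores the nonnegativity of transition probabilities that a full convex program would also enforce:

```python
import numpy as np

def minimal_perturbation(P, pi_target):
    """Minimum Frobenius-norm Delta such that (P + Delta) keeps row sums of 1
    and has pi_target as a stationary distribution.  Nonnegativity of the
    perturbed entries is ignored to keep the sketch closed-form."""
    n = P.shape[0]
    rows, rhs = [], []
    # Stationarity: pi^T (P + Delta) = pi^T, i.e. for each column j:
    # sum_i pi_i * Delta[i, j] = (pi - pi P)_j
    resid = pi_target - pi_target @ P
    for j in range(n):
        a = np.zeros((n, n))
        a[:, j] = pi_target
        rows.append(a.ravel())
        rhs.append(resid[j])
    # Row sums of Delta must be zero so P + Delta stays stochastic.
    for i in range(n):
        a = np.zeros((n, n))
        a[i, :] = 1.0
        rows.append(a.ravel())
        rhs.append(0.0)
    A, b = np.array(rows), np.array(rhs)
    # Least-norm solution of the consistent linear system A x = b.
    return (np.linalg.pinv(A) @ b).reshape(n, n)

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
pi = np.array([0.5, 0.5])        # desired steady-state distribution
Delta = minimal_perturbation(P, pi)
```

Because the objective is a quadratic norm and all constraints are linear, this special case has a global optimum by pseudoinverse; the general problem in the paper adds further convex criteria (e.g. convergence rate) and is solved with standard convex-optimization toolboxes.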
24.  von Bertalanffy 1.0: a COBRA toolbox extension to thermodynamically constrain metabolic models 
Bioinformatics  2010;27(1):142-143.
Motivation: In flux balance analysis of genome-scale stoichiometric models of metabolism, the principal constraints are uptake or secretion rates, the steady-state mass-conservation assumption and reaction directionality. Here, we introduce an algorithmic pipeline for quantitative assignment of reaction directionality in multi-compartmental genome-scale models based on an application of the second law of thermodynamics to each reaction. Given experimental or computationally estimated standard metabolite species Gibbs energies and metabolite concentrations, the algorithm bounds reaction Gibbs energy, transformed to in vivo pH, temperature, ionic strength and electrical potential.
Results: This cross-platform MATLAB extension to the COnstraint-Based Reconstruction and Analysis (COBRA) toolbox is computationally efficient, extensively documented and open source.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3008639  PMID: 21115436
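The second-law assignment described above reduces, in its simplest form, to bounding the transformed reaction Gibbs energy ΔG' = ΔG'° + RT ln Q over the allowed metabolite concentration ranges, then fixing directionality wherever the whole interval lies on one side of zero. The sketch below assumes unit stoichiometry and omits the pH/ionic-strength/potential transforms that the actual COBRA extension performs:

```python
import math

R = 8.314462618e-3   # gas constant, kJ/(mol*K)
T = 310.15           # 37 degrees C, approximating in vivo temperature

def reaction_dG_bounds(dG0, substrates, products, conc_bounds):
    """Bound dG' = dG'0 + RT ln(Q) given (lo, hi) concentration ranges in M
    for each metabolite.  Unit stoichiometric coefficients are assumed."""
    # Q is largest with products high / substrates low, smallest the other way.
    ln_q_max = sum(math.log(conc_bounds[m][1]) for m in products) \
             - sum(math.log(conc_bounds[m][0]) for m in substrates)
    ln_q_min = sum(math.log(conc_bounds[m][0]) for m in products) \
             - sum(math.log(conc_bounds[m][1]) for m in substrates)
    return dG0 + R * T * ln_q_min, dG0 + R * T * ln_q_max

def directionality(dG_min, dG_max):
    """Second-law assignment: flux can only run where dG' < 0."""
    if dG_max < 0:
        return "forward"
    if dG_min > 0:
        return "reverse"
    return "reversible"

bounds = {"a": (1e-6, 1e-2), "b": (1e-6, 1e-2)}
lo, hi = reaction_dG_bounds(-30.0, ["a"], ["b"], bounds)
```

With a strongly negative ΔG'° of -30 kJ/mol and four orders of magnitude of concentration freedom, the upper bound stays below zero, so the reaction a → b would be constrained to run forward only.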
25.  Biological impact of missing-value imputation on downstream analyses of gene expression profiles 
Bioinformatics  2010;27(1):78-86.
Motivation: Microarray experiments frequently produce multiple missing values (MVs) due to flaws such as dust, scratches, insufficient resolution or hybridization errors on the chips. Unfortunately, many downstream algorithms require a complete data matrix. The motivation of this work is to determine the impact of MV imputation on downstream analysis, and whether ranking of imputation methods by imputation accuracy correlates well with the biological impact of the imputation.
Methods: Using eight datasets for differential expression (DE) and classification analysis and eight datasets for gene clustering, we demonstrate the biological impact of missing-value imputation on statistical downstream analyses, including three commonly employed DE methods, four classifiers and three gene-clustering methods. Correlation between the rankings of imputation methods based on three root-mean squared error (RMSE) measures and the rankings based on the downstream analysis methods was used to investigate which RMSE measure was most consistent with the biological impact measures, and which downstream analysis methods were the most sensitive to the choice of imputation procedure.
Results: DE was the most sensitive to the choice of imputation procedure, while classification was the least sensitive and clustering was intermediate between the two. The logged RMSE (LRMSE) measure had the highest correlation with the imputation rankings based on the DE results, indicating that the LRMSE is the best representative surrogate among the three RMSE-based measures. Bayesian principal component analysis and least squares adaptive appeared to be the best performing methods in the empirical downstream evaluation.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3008641  PMID: 21045072
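The distinction drawn above between raw-scale RMSE and the logged RMSE (LRMSE) is easy to make concrete: mask some entries of an intensity matrix, impute them, and score the same imputation on both scales. The sketch below uses simple column-mean imputation on synthetic log-normal intensities purely to illustrate the two measures; it is not any of the imputation methods benchmarked in the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

def impute_col_mean(X):
    """Replace NaNs with the column (gene) mean of the observed values."""
    Xi = X.copy()
    col_means = np.nanmean(Xi, axis=0)
    idx = np.where(np.isnan(Xi))
    Xi[idx] = np.take(col_means, idx[1])
    return Xi

def rmse(truth, imputed, mask):
    """Root-mean-square error restricted to the artificially masked entries."""
    return np.sqrt(np.mean((truth[mask] - imputed[mask]) ** 2))

# Synthetic log-normal "intensities": 50 arrays x 20 genes, ~10% missing.
log_true = rng.normal(loc=8.0, scale=1.0, size=(50, 20))
truth = 2.0 ** log_true
mask = rng.random(truth.shape) < 0.10
X = truth.copy()
X[mask] = np.nan

imputed = impute_col_mean(X)

raw_rmse = rmse(truth, imputed, mask)                 # raw-intensity scale
lrmse = rmse(np.log2(truth), np.log2(imputed), mask)  # logged RMSE (LRMSE)
```

Because microarray analyses are almost always run on log-scale data, errors on the raw scale are dominated by the brightest spots; scoring on the log scale, as the paper's LRMSE result suggests, matches the scale on which the downstream methods actually operate.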
