1.  Extraction and comparison of gene expression patterns from 2D RNA in situ hybridization images 
Bioinformatics  2009;26(6):761-769.
Motivation: Recent advancements in high-throughput imaging have created new large datasets with tens of thousands of gene expression images. Methods for capturing these spatial and/or temporal expression patterns include in situ hybridization or fluorescent reporter constructs or tags, and results are still frequently assessed by subjective qualitative comparisons. In order to deal with available large datasets, fully automated analysis methods must be developed to properly normalize and model spatial expression patterns.
Results: We have developed image segmentation and registration methods to identify and extract spatial gene expression patterns from RNA in situ hybridization experiments of Drosophila embryos. These methods allow us to normalize and extract expression information for 78 621 images from 3724 genes across six time stages. The similarity between gene expression patterns is computed using four scoring metrics: mean squared error, Haar wavelet distance, mutual information and spatial mutual information (SMI). We additionally propose a strategy to calculate the significance of the similarity between two expression images, by generating surrogate datasets with similar spatial expression patterns using a Monte Carlo swap sampler. On data from an early development time stage, we show that SMI provides the most biologically relevant metric of comparison, and that our significance testing generalizes metrics to achieve similar performance. We exemplify the application of spatial metrics on the well-known Drosophila segmentation network.
Availability: A Java webstart application to register and compare patterns, as well as all source code, are available from:
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3140183  PMID: 19942587
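Two of the four scoring metrics named in entry 1, mean squared error and mutual information, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes two registered images of equal size, flattened to lists of intensities in [0, 1], and the function names and the `bins` parameter are hypothetical. The spatial variant (SMI) and the Haar wavelet distance are not shown.

```python
import math
from collections import Counter


def mean_squared_error(a, b):
    """Mean squared difference between two equal-length intensity lists."""
    if len(a) != len(b):
        raise ValueError("images must have the same number of pixels")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def mutual_information(a, b, bins=4):
    """Mutual information (in bits) of two intensity lists.

    Assumes intensities lie in [0, 1]; each is discretized into `bins`
    equal-width bins before the joint histogram is built.
    """
    if len(a) != len(b):
        raise ValueError("images must have the same number of pixels")

    def bin_of(value):
        # Clamp the top edge so that 1.0 falls into the last bin.
        return min(int(value * bins), bins - 1)

    n = len(a)
    joint = Counter((bin_of(x), bin_of(y)) for x, y in zip(a, b))
    marginal_a = Counter(bin_of(x) for x in a)
    marginal_b = Counter(bin_of(y) for y in b)

    mi = 0.0
    for (i, j), count in joint.items():
        p_ij = count / n
        # p_ij / (p_i * p_j) simplifies to count * n / (count_i * count_j).
        mi += p_ij * math.log2(count * n / (marginal_a[i] * marginal_b[j]))
    return mi
```

A lower mean squared error and a higher mutual information both indicate more similar patterns, so the two scores are read in opposite directions.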
2.  ConceptGen: a gene set enrichment and gene set relation mapping tool 
Bioinformatics  2009;26(4):456-463.
Motivation: The elucidation of biological concepts enriched with differentially expressed genes has become an integral part of the analysis and interpretation of genomic data. Of additional importance is the ability to explore networks of relationships among previously defined biological concepts from diverse information sources, and to explore results visually from multiple perspectives. Accomplishing these tasks requires a unified framework for agglomeration of data from various genomic resources, novel visualizations, and user functionality.
Results: We have developed ConceptGen, a web-based gene set enrichment and gene set relation mapping tool that is streamlined and simple to use. ConceptGen offers over 20 000 concepts comprising 14 different types of biological knowledge, including data not currently available in any other gene set enrichment or gene set relation mapping tool. We demonstrate the functionalities of ConceptGen using gene expression data modeling TGF-beta-induced epithelial-mesenchymal transition and metabolomics data comparing metastatic versus localized prostate cancers.
Availability: ConceptGen is part of the NIH's National Center for Integrative Biomedical Informatics (NCIBI) and is freely available at For terms of use, visit
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2852214  PMID: 20007254
3.  Power to detect selective allelic amplification in genome-wide scans of tumor data 
Bioinformatics  2009;26(4):518-528.
Motivation: Somatic amplification of particular genomic regions and selection of cellular lineages with such amplifications drives tumor development. However, pinpointing genes under such selection has been difficult due to the large span of these regions. Our recently-developed method, the amplification distortion test (ADT), identifies specific nucleotide alleles and haplotypes that confer better survival for tumor cells when somatically amplified. In this work, we focus on evaluating ADT's power to detect such causal variants across a variety of tumor dataset scenarios.
Results: Towards this end, we generated multiple parameter-based, synthetic datasets—derived from real data—that contain somatic copy number aberrations (CNAs) of various lengths and frequencies over germline single nucleotide polymorphisms (SNPs) genome-wide. Gold-standard causal sub-regions were assigned within these CNAs, followed by an assessment of ADT's ability to detect these sub-regions. Results indicate that ADT possesses high sensitivity and specificity in large sample sizes across most parameter cases, including those that more closely reflect existing SNP and CNA cancer data.
Availability: ADT is implemented in the Java software HADiT and can be downloaded through the SVN repository (via Develop→ Code→SVN Browse) at:
Supplementary Information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2852215  PMID: 20031965
4.  GonadSAGE: a comprehensive SAGE database for transcript discovery on male embryonic gonad development 
Bioinformatics  2009;26(4):585-586.
Summary: Serial analysis of gene expression (SAGE) provides an alternative, with additional advantages, to microarray gene expression studies. GonadSAGE is the first publicly available web-based SAGE database on male gonad development, covering six male mouse embryonic gonad stages: E10.5, E11.5, E12.5, E13.5, E15.5 and E17.5. The sequence coverage of each SAGE library exceeds 150K, making this the most extensive sequence-based male gonadal transcriptome to date. An interactive web interface with customizable parameters is provided for analyzing male gonad transcriptome information. Furthermore, the data can be visualized and analyzed alongside other genomic features in the UCSC genome browser. It represents an integrated platform that leads to a better understanding of male gonad development, and allows discovery of related novel targets and regulatory pathways.
Availability: GonadSAGE is at
PMCID: PMC2852216  PMID: 20028690
5.  Penalized mixtures of factor analyzers with application to clustering high-dimensional microarray data 
Bioinformatics  2009;26(4):501-508.
Motivation: Model-based clustering has been widely used, e.g. in microarray data analysis. Since for high-dimensional data variable selection is necessary, several penalized model-based clustering methods have been proposed to realize simultaneous variable selection and clustering. However, the existing methods all assume that the variables are independent with the use of diagonal covariance matrices.
Results: To model non-independence of variables (e.g. correlated gene expressions) while alleviating the problem with the large number of unknown parameters associated with a general non-diagonal covariance matrix, we generalize the mixture of factor analyzers to that with penalization, which, among others, can effectively realize variable selection. We use simulated data and real microarray data to illustrate the utility and advantages of the proposed method over several existing ones.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2852217  PMID: 20031967
6.  CMDS: a population-based method for identifying recurrent DNA copy number aberrations in cancer from high-resolution data 
Bioinformatics  2009;26(4):464-469.
Motivation: DNA copy number aberration (CNA) is a hallmark of genomic abnormality in tumor cells. Recurrent CNA (RCNA) occurs in multiple cancer samples across the same chromosomal region and has greater implication in tumorigenesis. Current commonly used methods for RCNA identification require CNA calling for individual samples before cross-sample analysis. This two-step strategy may result in a heavy computational burden, as well as a loss of the overall statistical power due to segmentation and discretization of individual sample's data. We propose a population-based approach for RCNA detection with no need of single-sample analysis, which is statistically powerful, computationally efficient and particularly suitable for high-resolution and large-population studies.
Results: Our approach, correlation matrix diagonal segmentation (CMDS), identifies RCNAs based on a between-chromosomal-site correlation analysis. Directly using the raw intensity ratio data from all samples and adopting a diagonal transformation strategy, CMDS substantially reduces computational burden and can obtain results very quickly from large datasets. Our simulation indicates that the statistical power of CMDS is higher than that of single-sample CNA calling based two-step approaches. We applied CMDS to two real datasets of lung cancer and brain cancer from Affymetrix and Illumina array platforms, respectively, and successfully identified known regions of CNA associated with EGFR, KRAS and other important oncogenes. CMDS provides a fast, powerful and easily implemented tool for the RCNA analysis of large-scale data from cancer genomes.
Availability: The R and C programs implementing our method are available at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2852218  PMID: 20031968
7.  GWAF: an R package for genome-wide association analyses with family data 
Bioinformatics  2009;26(4):580-581.
Summary: GWAF (Genome-Wide Association analyses with Family data) is an R package designed for genome-wide association analyses with family data. It implements association tests between a batch of genotyped or imputed single nucleotide polymorphisms (SNPs) and a binary or continuous trait under a user-specified genetic model, and generates informative results from the analyses. In addition, GWAF provides functions to visualize results. We evaluated GWAF using a simulated continuous trait and a binary trait dichotomized from the simulated continuous trait with real genotype data from the Framingham Heart Study's SNP Health Association Resource project.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2852219  PMID: 20040588
8.  BRAT: bisulfite-treated reads analysis tool 
Bioinformatics  2009;26(4):572-573.
Summary: We present a new, accurate and efficient tool for mapping short reads obtained from the Illumina Genome Analyzer following sodium bisulfite conversion. Our tool, BRAT, supports single and paired-end reads and handles input files containing reads and mates of different lengths. BRAT is faster, maps more unique paired-end reads and has higher accuracy than existing programs. The software package includes tools to end-trim low-quality bases of the reads and to report nucleotide counts for mapped reads on the reference genome.
Availability: The source code is freely available for download at and is distributed as Open Source software under the GPLv3.0.
PMCID: PMC3716225  PMID: 20031974
9.  Globally, unrelated protein sequences appear random 
Bioinformatics  2009;26(3):310-318.
Motivation: To test whether protein folding constraints and secondary structure sequence preferences significantly reduce the space of amino acid words in proteins, we compared the frequencies of four- and five-amino acid word clumps (independent words) in proteins to the frequencies predicted by four random sequence models.
Results: While the human proteome has many overrepresented word clumps, these words come from large protein families with biased compositions (e.g. Zn-fingers). In contrast, in a non-redundant sample of Pfam-AB, only 1% of four-amino acid word clumps (4.7% of 5mer words) are 2-fold overrepresented compared with our simplest random model [MC(0)], and 0.1% (4mers) to 0.5% (5mers) are 2-fold overrepresented compared with a window-shuffled random model. Using a false discovery rate q-value analysis, the number of exceptional four- or five-letter words in real proteins is similar to the number found when comparing words from one random model to another. Consensus overrepresented words are not enriched in conserved regions of proteins, but four-letter words are enriched 1.18- to 1.56-fold in α-helical secondary structures (but not β-strands). Five-residue consensus exceptional words are enriched for α-helix 1.43- to 1.61-fold. Protein word preferences in regular secondary structure do not appear to significantly restrict the use of sequence words in unrelated proteins, although the consensus exceptional words have a secondary structure bias for α-helix. Globally, words in protein sequences appear to be under very few constraints; for the most part, they appear to be random.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2852211  PMID: 19948773
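The comparison against the simplest random model in entry 9, MC(0), can be sketched as follows. Under a zero-order model each residue is drawn independently from the overall composition, so the expected count of a word is the number of positions where it could start multiplied by the product of its residue frequencies. This is a simplified illustration only: it counts overlapping occurrences rather than the word clumps used in the article, and the function name is hypothetical.

```python
from collections import Counter


def observed_and_expected(sequences, word):
    """Return (observed, expected) counts of `word` under a zero-order model.

    Residue frequencies are estimated from the pooled sequences.
    Overlapping occurrences are counted; clumping is NOT handled.
    """
    residues = Counter("".join(sequences))
    total = sum(residues.values())

    # Probability of the word at any one position: product of frequencies.
    p_word = 1.0
    for residue in word:
        p_word *= residues[residue] / total

    k = len(word)
    positions = sum(max(len(s) - k + 1, 0) for s in sequences)
    expected = positions * p_word

    observed = sum(
        1
        for s in sequences
        for i in range(len(s) - k + 1)
        if s[i:i + k] == word
    )
    return observed, expected
```

The ratio of observed to expected gives the fold overrepresentation; a word would be called 2-fold overrepresented when that ratio is at least 2.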
10.  ChiBE: interactive visualization and manipulation of BioPAX pathway models 
Bioinformatics  2009;26(3):429-431.
Summary: Representing models of cellular processes or pathways in a graphically rich form facilitates interpretation of biological observations and generation of new hypotheses. Solving biological problems using large pathway datasets requires software that can combine data mapping, querying and visualization as well as providing access to diverse data resources on the Internet. ChiBE is an open source software application that features user-friendly multi-view display, navigation and manipulation of pathway models in BioPAX format. Pathway views are rendered in a feature-rich format, and may be laid out and edited with state-of-the-art visualization methods, including compound or nested structures for visualizing cellular compartments and molecular complexes. Users can easily query and visualize pathways through an integrated Pathway Commons query tool and analyze molecular profiles in pathway context.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2815657  PMID: 20007251
11.  Biomarker detection in the integration of multiple multi-class genomic studies 
Bioinformatics  2009;26(3):333-340.
Motivation: Systematic information integration of multiple related microarray studies has become an important issue as the technology has matured and become prevalent over the past decade. The aggregated information provides more robust and accurate biomarker detection. So far, published meta-analysis methods for this purpose mostly consider two-class comparison. Methods for combining multi-class studies and considering expression pattern concordance are rarely explored.
Results: In this article, we develop three integration methods for biomarker detection in multiple multi-class microarray studies: ANOVA-maxP, min-MCC and OW-min-MCC. We first consider a natural extension of combining P-values from the traditional ANOVA model. Since P-values from ANOVA are not guaranteed to reflect the concordant expression pattern information across studies, we propose a multi-class correlation (MCC) measure to specifically seek biomarkers of concordant inter-class patterns across a pair of studies. For both ANOVA and MCC approaches, we use extreme order statistics to identify biomarkers differentially expressed (DE) in all studies (i.e. ANOVA-maxP and min-MCC). The min-MCC method is further extended to identify biomarkers DE in only a subset of the studies by incorporating a recently developed optimally weighted (OW) technique (OW-min-MCC). All methods are evaluated by simulation studies and by three meta-analysis applications to multi-tissue mouse metabolism datasets, multi-condition mouse trauma datasets and multi-malignant-condition human prostate cancer datasets. The results show complementary strengths of the three methods for different biological purposes.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2815659  PMID: 19965884
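The idea behind a maxP statistic in entry 11 can be shown with the textbook form of the test. If the K per-study P-values for a gene are independent and uniform under the null, the probability that their maximum is at most x equals x to the power K, so a small combined value requires the gene to be significant in every study. This is a sketch of that general rule only; the article may assess significance differently (for example by permutation), and the function name is hypothetical.

```python
def max_p_combined(p_values):
    """Combine per-study P-values with the maximum-P rule.

    Assumes the P-values are independent and uniform on [0, 1] under
    the null, in which case P(max <= x) = x ** K for K studies.
    """
    if not p_values:
        raise ValueError("at least one P-value is required")
    if any(p < 0.0 or p > 1.0 for p in p_values):
        raise ValueError("P-values must lie in [0, 1]")
    return max(p_values) ** len(p_values)
```

A single weak study dominates the result, which is what makes the rule suitable for finding biomarkers that are differentially expressed in all studies rather than in any one of them.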
12.  A new gene selection procedure based on the covariance distance 
Bioinformatics  2009;26(3):348-354.
Motivation: Very little attention has been given to gene selection procedures based on intergene correlation structure, which is often neglected in the context of differential gene expression analysis. We propose a statistical procedure to select genes that have different associations with others across different phenotypes. This procedure is based on a new gene association score, called the covariance distance.
Results: We apply the proposed method, along with two alternative methods, to several simulated datasets and find out that our method is much more powerful than the other two. For biological data, we demonstrate that the analysis of differentially associated genes complements the analysis of differentially expressed genes. Combining both procedures provides a more comprehensive functional interpretation of the experimental results.
Availability: The code is downloadable from
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2815661  PMID: 19996162
13.  Tmod: toolbox of motif discovery 
Bioinformatics  2009;26(3):405-407.
Summary: Motif discovery is an important topic in computational transcriptional regulation studies. In the past decade, many researchers have contributed to the field and many de novo motif-finding tools have been developed, each of which may have different strengths. However, most of these tools do not have a user-friendly interface and their results are not easily comparable. We present software called Toolbox of Motif Discovery (Tmod) for Windows operating systems. The current version of Tmod integrates 12 widely used motif discovery programs: MDscan, BioProspector, AlignACE, Gibbs Motif Sampler, MEME, CONSENSUS, MotifRegressor, GLAM, MotifSampler, SeSiMCMC, Weeder and YMF. Tmod provides a unified interface to ease the use of these programs and help users to understand the tuning parameters. It allows plug-in motif-finding programs to run either separately or in a batch mode with predetermined parameters, and provides a summary comprising outputs from multiple programs. Tmod is developed in C++ with the support of Microsoft Foundation Classes and Cygwin. Tmod can also be easily expanded to include future algorithms.
Availability: Tmod is available for download at∼junliu/Tmod/
PMCID: PMC2815662  PMID: 20007740
14.  Bayesian model selection for characterizing genomic imprinting effects and patterns 
Bioinformatics  2009;26(2):235-241.
Motivation: Although imprinted genes have been ubiquitously observed in nature, statistical methodology still has not been systematically developed for jointly characterizing genomic imprinting effects and patterns. To detect imprinting genes influencing quantitative traits, the least square and maximum likelihood approaches for fitting a single quantitative trait loci (QTL) and Bayesian method for simultaneously modeling multiple QTLs have been adopted in various studies.
Results: In a widely used F2 reciprocal mating population for mapping imprinting genes, we herein propose a genomic imprinting model which describes additive, dominance and imprinting effects of multiple imprinted quantitative trait loci (iQTL) for traits of interest. Depending upon the estimates of the above genetic effects, we categorized imprinting patterns into seven types, which provides a complete classification scheme for describing imprinting patterns. Bayesian model selection was employed to identify iQTL along with many genetic parameters in a computationally efficient manner. To make statistical inference on the imprinting types of iQTL detected, a set of Bayes factors were formulated using the posterior probabilities for the genetic effects being compared. We demonstrated the performance of the proposed method by computer simulation experiments and then applied this method to two real datasets. Our approach can be generally used to identify inheritance modes and determine the contribution of major genes for quantitative variations.
PMCID: PMC2804294  PMID: 19880366
15.  Quantifying uncertainty in genotype calls 
Bioinformatics  2009;26(2):242-249.
Motivation: Genome-wide association studies (GWAS) are used to discover genes underlying complex, heritable disorders for which less powerful study designs have failed in the past. The number of GWAS has skyrocketed recently with findings reported in top journals and the mainstream media. Microarrays are the genotype calling technology of choice in GWAS as they permit exploration of more than a million single nucleotide polymorphisms (SNPs) simultaneously. The starting point for the statistical analyses used by GWAS to determine association between loci and disease is making genotype calls (AA, AB or BB). However, the raw data, microarray probe intensities, are heavily processed before arriving at these calls. Various sophisticated statistical procedures have been proposed for transforming raw data into genotype calls. We find that variability in microarray output quality across different SNPs, different arrays and different sample batches have substantial influence on the accuracy of genotype calls made by existing algorithms. Failure to account for these sources of variability can adversely affect the quality of findings reported by the GWAS.
Results: We developed a method based on an enhanced version of the multi-level model used by CRLMM version 1. Two key differences are that we now account for variability across batches and improve the call-specific assessment of each call. The new model permits the development of quality metrics for SNPs, samples and batches of samples. Using three independent datasets, we demonstrate that the CRLMM version 2 outperforms CRLMM version 1 and the algorithm provided by Affymetrix, Birdseed. The main advantage of the new approach is that it enables the identification of low-quality SNPs, samples and batches.
Availability: Software implementing the method described in this article is available as free and open source code in the crlmm R/BioConductor package.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2804295  PMID: 19906825
16.  hPDI: a database of experimental human protein–DNA interactions 
Bioinformatics  2009;26(2):287-289.
Summary: The human protein DNA Interactome (hPDI) database holds experimental protein–DNA interaction data for humans identified by protein microarray assays. The unique characteristic of hPDI is that it contains consensus DNA-binding sequences not only for nearly 500 human transcription factors but also for >500 unconventional DNA-binding proteins that were previously completely uncharacterized. Users can browse, search and download a subset of the data or the entire dataset via a web interface. This database is freely accessible for academic purposes.
PMCID: PMC2804296  PMID: 19900953
17.  Co-expression networks: graph properties and topological comparisons 
Bioinformatics  2009;26(2):205-214.
Motivation: Microarray-based gene expression data have been generated widely to study different biological processes and systems. Gene co-expression networks are often used to extract information about groups of genes that are ‘functionally’ related or co-regulated. However, the structural properties of such co-expression networks have not been rigorously studied and fully compared with known biological networks. In this article, we aim to investigate the structural properties of co-expression networks inferred for the species Saccharomyces cerevisiae and to compare them with the topological properties of the known, well-established transcriptional network, MIPS physical network and protein–protein interaction (PPI) network of yeast.
Results: These topological comparisons indicate that co-expression networks are not distinctly related with either the PPI or the MIPS physical interaction networks, showing important structural differences between them. When focusing on a more literal comparison, vertex by vertex and edge by edge, the conclusion is the same: the fact that two genes exhibit a high gene expression correlation degree does not seem to obviously correlate with the existence of a physical binding between the proteins produced by these genes or the existence of a MIPS physical interaction between the genes. The comparison of the yeast regulatory network with inferred yeast co-expression networks would suggest, however, that they could somehow be related.
Conclusions: We conclude that the gene expression-based co-expression networks reflect more on the gene regulatory networks but less on the PPI or MIPS physical interaction networks.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2804297  PMID: 19910304
18.  Pathway analysis using random forests with bivariate node-split for survival outcomes 
Bioinformatics  2009;26(2):250-258.
Motivation: There is great interest in pathway-based methods for genomics data analysis in the research community. Although machine learning methods, such as random forests, have been developed to correlate survival outcomes with a set of genes, no study has assessed the abilities of these methods in incorporating pathway information for analyzing microarray data. In general, genes that are identified without incorporating biological knowledge are more difficult to interpret. Correlating pathway-based gene expression with survival outcomes may lead to biologically more meaningful prognosis biomarkers. Thus, a comprehensive study on how these methods perform in a pathway-based setting is warranted.
Results: In this article, we describe a pathway-based method using random forests to correlate gene expression data with survival outcomes and introduce novel bivariate node-splitting random survival forests. The proposed method allows researchers to identify important pathways for predicting patient prognosis and time to disease progression, and discover important genes within those pathways. We compared different implementations of random forests with different split criteria and found that bivariate node-splitting random survival forests with the log-rank test are among the best. We also performed simulation studies that showed that random forests outperform several other machine learning algorithms and have results comparable with a newly developed component-wise Cox boosting model. Thus, pathway-based survival analysis using machine learning tools represents a promising approach for dissecting pathways and for generating new biological hypotheses from microarray studies.
Availability: R package Pwayrfsurvival is available from URL:∼hp44/pwayrfsurvival.htm
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2804301  PMID: 19933158
19.  EGAN: exploratory gene association networks 
Bioinformatics  2009;26(2):285-286.
Summary: Exploratory Gene Association Networks (EGAN) is a Java desktop application that provides a point-and-click environment for contextual graph visualization of high-throughput assay results. By loading the entire network of genes, pathways, interactions, annotation terms and literature references directly into memory, EGAN allows a biologist to repeatedly query and interpret multiple experimental results without incurring additional delays for data download/integration. Other compelling features of EGAN include: support for diverse -omics technologies, a simple and interactive graph display, sortable/searchable data tables, links to external web resources including ≥240 000 articles at PubMed, hypergeometric and GSEA-like enrichment statistics, pipeline-compatible automation via scripting and the ability to completely customize and/or supplement the network with new/proprietary data.
Availability: Runs on most operating systems via Java; downloadable from
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2804305  PMID: 19933825
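The hypergeometric enrichment statistic mentioned in entry 19 can be illustrated with the standard upper-tail test: given a population of genes, some of which carry an annotation, it gives the probability of seeing at least the observed number of annotated genes in a selected subset by chance. This is a generic sketch, not EGAN's code, and the function and parameter names are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, selected, overlap):
    """Upper-tail hypergeometric P-value for gene set enrichment.

    population: total number of genes considered
    annotated:  genes in the population carrying the annotation
    selected:   genes in the experimental hit list
    overlap:    selected genes that carry the annotation
    Returns P(X >= overlap) when `selected` genes are drawn without
    replacement from the population.
    """
    upper = min(annotated, selected)
    total = comb(population, selected)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total
```

For example, with 10 genes of which 5 are annotated, selecting 5 genes and finding all 5 annotated has probability 1/252, since only one of the 252 possible selections consists entirely of annotated genes.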
20.  SCAN: SNP and copy number annotation 
Bioinformatics  2009;26(2):259-262.
Motivation: Genome-wide association studies (GWAS) generate relationships between hundreds of thousands of single nucleotide polymorphisms (SNPs) and complex phenotypes. The contribution of the traditionally overlooked copy number variations (CNVs) to complex traits is also being actively studied. To facilitate the interpretation of the data and the designing of follow-up experimental validations, we have developed a database that enables the sensible prioritization of these variants by combining several approaches, involving not only publicly available physical and functional annotations but also multilocus linkage disequilibrium (LD) annotations as well as annotations of expression quantitative trait loci (eQTLs).
Results: For each SNP, the SCAN database provides: (i) summary information from eQTL mapping of HapMap SNPs to gene expression (evaluated by the Affymetrix exon array) in the full set of HapMap CEU (Caucasians from UT, USA) and YRI (Yoruba people from Ibadan, Nigeria) samples; (ii) LD information, in the case of a HapMap SNP, including what genes have variation in strong LD (pairwise or multilocus LD) with the variant and how well the SNP is covered by different high-throughput platforms; (iii) summary information available from public databases (e.g. physical and functional annotations); and (iv) summary information from other GWAS. For each gene, SCAN provides annotations on: (i) eQTLs for the gene (both local and distant SNPs) and (ii) the coverage of all variants in the HapMap at that gene on each high-throughput platform. For each genomic region, SCAN provides annotations on: (i) physical and functional annotations of all SNPs, genes and known CNVs within the region and (ii) all genes regulated by the eQTLs within the region.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2852202  PMID: 19933162
21.  Joint estimation of DNA copy number from multiple platforms 
Bioinformatics  2009;26(2):153-160.
Motivation: DNA copy number variants (CNVs) are gains and losses of segments of chromosomes, and comprise an important class of genetic variation. Recently, various microarray hybridization-based techniques have been developed for high-throughput measurement of DNA copy number. In many studies, multiple technical platforms or different versions of the same platform were used to interrogate the same samples; and it became necessary to pool information across these multiple sources to derive a consensus molecular profile for each sample. An integrated analysis is expected to maximize resolution and accuracy, yet currently there is no well-formulated statistical method to address the between-platform differences in probe coverage, assay methods, sensitivity and analytical complexity.
Results: The conventional approach is to apply one of the CNV detection (‘segmentation’) algorithms to search for DNA segments of altered signal intensity. The results from multiple platforms are combined after segmentation. Here we propose a new method, Multi-Platform Circular Binary Segmentation (MPCBS), which pools statistical evidence across platforms during segmentation, and does not require pre-standardization of different data sources. It involves a weighted sum of t-statistics, which arises naturally from the generalized log-likelihood ratio of a multi-platform model. We show by comparing the integrated analysis of Affymetrix and Illumina SNP array data with Agilent and fosmid clone end-sequencing results on eight HapMap samples that MPCBS achieves improved spatial resolution, detection power and provides a natural consensus across platforms. We also apply the new method to analyze multi-platform data for tumor samples.
Availability: The R package for MPCBS is registered on R-Forge under project name MPCBS.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2852203  PMID: 19933593
22.  Identification of non-Hodgkin's lymphoma prognosis signatures using the CTGDR method 
Bioinformatics  2009;26(1):15-21.
Motivation: Although NHL (non-Hodgkin's lymphoma) is the fifth leading cause of cancer incidence and mortality in the USA, it remains poorly understood and is largely incurable. Biomedical studies have shown that genomic variations, measured with SNPs (single nucleotide polymorphisms) in genes, may have independent predictive power for disease-free survival in NHL patients beyond clinical measurements.
Results: We apply the CTGDR (clustering threshold gradient directed regularization) method to genetic association studies using SNPs, analyze data from an association study of NHL and identify prognosis signatures for diffuse large B cell lymphoma (DLBCL) and follicular lymphoma (FL), the two most common subtypes of NHL. With the CTGDR method, we are able to account for the joint effects of multiple genes/SNPs, whereas most existing studies are single-marker based. In addition, we are able to account for the ‘gene and SNP-within-gene’ hierarchical structure and identify not only predictive genes but also predictive SNPs within identified genes. In contrast, existing studies are limited to either gene or SNP identification, but not both. We propose using resampling methods to evaluate the predictive power and reproducibility of identified genes and SNPs. A simulation study and data analysis suggest satisfactory performance of the CTGDR method.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2796812  PMID: 19850755
23.  The gputools package enables GPU computing in R 
Bioinformatics  2009;26(1):134-135.
Motivation: By default, the R statistical environment does not make use of parallelism. Researchers may resort to expensive solutions such as cluster hardware for large analysis tasks. Graphics processing units (GPUs) provide an inexpensive and computationally powerful alternative. Using R and the CUDA toolkit from Nvidia, we have implemented several functions commonly used in microarray gene expression analysis for GPU-equipped computers.
Results: R users can take advantage of the better performance provided by an Nvidia GPU.
Availability: The package is available from CRAN, the R project's repository of packages, at More information about our gputools R package is available at
PMCID: PMC2796814  PMID: 19850754
24.  GATE: software for the analysis and visualization of high-dimensional time series expression data 
Bioinformatics  2009;26(1):143-144.
Summary: We present Grid Analysis of Time series Expression (GATE), an integrated computational software platform for the analysis and visualization of high-dimensional biomolecular time series. GATE uses a correlation-based clustering algorithm to arrange molecular time series on a two-dimensional hexagonal array and dynamically colors individual hexagons according to the expression level of the molecular component to which they are assigned, to create animated movies of systems-level molecular regulatory dynamics. In order to infer potential regulatory control mechanisms from patterns of correlation, GATE also allows interactive interrogation of movies against a wide variety of prior knowledge datasets. GATE movies can be paused and are interactive, allowing users to reconstruct networks and perform functional enrichment analyses. Movies created with GATE can be saved in Flash format and can be inserted directly into PDF manuscript files as interactive figures.
Availability: GATE is available for download and is free for academic use from
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2796822  PMID: 19892805
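The abstract does not detail GATE's clustering algorithm. As a rough illustration of the kind of correlation measure a correlation-based arrangement of time series could start from, the Python sketch below computes Pearson correlation between expression profiles and finds the most similar profile for a query. The function names and the example data are hypothetical and are not taken from GATE.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length time series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant series has no defined correlation; use 0
    return cov / math.sqrt(vx * vy)

def most_similar(query, series):
    """Name of the series most correlated with the query (hypothetical helper)."""
    others = {name: ts for name, ts in series.items() if name != query}
    return max(others, key=lambda name: pearson(series[query], others[name]))

# Made-up expression profiles over four time points.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 5.9, 8.0],   # rises with geneA
    "geneC": [4.0, 3.0, 2.0, 1.0],   # falls as geneA rises
}
neighbor = most_similar("geneA", profiles)
```

In a grid layout, profiles with high mutual correlation would be candidates for neighboring cells, while anticorrelated profiles would be placed apart.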
25.  Model aggregation: a building-block approach to creating large macromolecular regulatory networks 
Bioinformatics  2009;25(24):3289-3295.
Motivation: Models of regulatory networks become more difficult to construct and understand as they grow in size and complexity. Modelers naturally build large models from smaller components that each represent subsets of reactions within the larger network. To assist modelers in this process, we present model aggregation, which defines models in terms of components that are designed for the purpose of being combined.
Results: We have implemented a model editor that incorporates model aggregation, and we suggest supporting extensions to the Systems Biology Markup Language (SBML) Level 3. We illustrate aggregation with a model of the eukaryotic cell cycle ‘engine’ created from smaller pieces.
Availability: Java implementations are available in the JigCell Aggregation Connector. See
PMCID: PMC2788926  PMID: 19880372
