
Results 1-8 (8)

1.  Visualization and probability-based scoring of structural variants within repetitive sequences 
Bioinformatics  2014;30(11):1514-1521.
Motivation: Repetitive sequences account for approximately half of the human genome. Accurately ascertaining sequences in these regions with next generation sequencers is challenging, and requires a different set of analytical techniques than for reads originating from unique sequences. Complicating the matter are repetitive regions subject to programmed rearrangements, as is the case with the antigen-binding domains in the Immunoglobulin (Ig) and T-cell receptor (TCR) loci.
Results: We developed a probability-based score and visualization method to aid in distinguishing true structural variants from alignment artifacts. We demonstrate the usefulness of this method by showing that it separates real structural variants from false positives generated by existing upstream analysis tools. We validated our approach using both target-capture and whole-genome experiments. Capture sequencing reads were generated from primary lymphoid tumors, cancer cell lines and an EBV-transformed lymphoblast cell line over the Ig and TCR loci. Whole-genome sequencing reads were from a lymphoblastoid cell line.
Availability: We implement our method as an R package. Code to reproduce the figures and results is also available.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC4029030  PMID: 24501098
2.  fRMA ST: frozen robust multiarray analysis for Affymetrix Exon and Gene ST arrays 
Bioinformatics  2012;28(23):3153-3154.
Summary: Frozen robust multiarray analysis (fRMA) is a single-array preprocessing algorithm that retains the advantages of multiarray algorithms and removes certain batch effects by downweighting probes that have high between-batch residual variance. Here, we extend the fRMA algorithm to two new microarray platforms—Affymetrix Human Exon and Gene 1.0 ST—by modifying the fRMA probe-level model and extending the frma package to work with oligo ExonFeatureSet and GeneFeatureSet objects.
Availability and implementation: All packages are implemented in R. Source code and binaries are freely available through the Bioconductor project. Convenient links to all software and data packages can be found at
PMCID: PMC3509489  PMID: 23044545
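The downweighting idea in the fRMA abstract above can be illustrated with a minimal sketch. It assumes simple inverse-variance weights and a weighted mean; the published fRMA algorithm fits a robust probe-level model, so this is a simplification and not the frma package's implementation. The function name `weighted_probe_summary` and the toy numbers are hypothetical.

```python
def weighted_probe_summary(probe_values, batch_residual_variance):
    """Summarize one probe set as a weighted mean of its probe values.

    Assumption for illustration: each probe's weight is the inverse of its
    between-batch residual variance, so batch-sensitive probes count less.
    """
    weights = [1.0 / v for v in batch_residual_variance]
    total = sum(weights)
    return sum(w * x for w, x in zip(weights, probe_values)) / total


# Toy data: the third probe disagrees with the others and has a high
# between-batch residual variance, so it is downweighted.
probe_values = [8.0, 8.0, 12.0]             # log2 intensities (hypothetical)
batch_residual_variance = [1.0, 1.0, 4.0]   # hypothetical variances

summary = weighted_probe_summary(probe_values, batch_residual_variance)
plain_mean = sum(probe_values) / len(probe_values)
```

With these inputs the weighted summary lies closer to 8 than the unweighted mean does, which is the qualitative effect the abstract describes.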
3.  ChIP-PED enhances the analysis of ChIP-seq and ChIP-chip data 
Bioinformatics  2013;29(9):1182-1189.
Motivation: Although chromatin immunoprecipitation coupled with high-throughput sequencing (ChIP-seq) or tiling array hybridization (ChIP-chip) is increasingly used to map genome-wide binding sites of transcription factors (TFs), it remains difficult to generate a quality ChIPx (i.e. ChIP-seq or ChIP-chip) dataset because of the tremendous amount of effort required to develop effective antibodies and efficient protocols. Moreover, most laboratories are unable to easily obtain ChIPx data for one or more TF(s) in more than a handful of biological contexts. Thus, standard ChIPx analyses primarily focus on analyzing data from one experiment, and the discoveries are restricted to a specific biological context.
Results: We propose to enrich this existing data analysis paradigm by developing a novel approach, ChIP-PED, which superimposes ChIPx data on large amounts of publicly available human and mouse gene expression data containing a diverse collection of cell types, tissues and disease conditions to discover new biological contexts with potential TF regulatory activities. We demonstrate ChIP-PED using a number of examples, including a novel discovery that MYC, a human TF, plays an important functional role in pediatric Ewing sarcoma cell lines. These examples show that ChIP-PED increases the value of ChIPx data by allowing one to expand the scope of possible discoveries made from a ChIPx experiment.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3658457  PMID: 23457041
4.  Performance assessment of copy number microarray platforms using a spike-in experiment 
Bioinformatics  2011;27(8):1052-1060.
Motivation: Changes in the copy number of chromosomal DNA segments [copy number variants (CNVs)] have been implicated in human variation, heritable diseases and cancers. Microarray-based platforms are the current established technology of choice for studies reporting these discoveries and constitute the benchmark against which emergent sequence-based approaches will be evaluated. Research that depends on CNV analysis is rapidly increasing, and systematic platform assessments that distinguish strengths and weaknesses are needed to guide informed choice.
Results: We evaluated the sensitivity and specificity of six platforms, provided by four leading vendors, using a spike-in experiment. NimbleGen and Agilent platforms outperformed Illumina and Affymetrix in accuracy and precision of copy number dosage estimates. However, Illumina and Affymetrix algorithms that leverage single nucleotide polymorphism (SNP) information make up for this disadvantage and perform well at variant detection. Overall, the NimbleGen 2.1M platform outperformed others, but only with the use of an alternative data analysis pipeline to the one offered by the manufacturer.
Availability: The data is available from
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3072561  PMID: 21478196
5.  A framework for oligonucleotide microarray preprocessing 
Bioinformatics  2010;26(19):2363-2367.
Motivation: The availability of flexible open source software for the analysis of gene expression raw level data has greatly facilitated the development of widely used preprocessing methods for these technologies. However, the expansion of microarray applications has exposed the limitation of existing tools.
Results: We developed the oligo package to provide a more general solution that supports a wide range of applications. The package is based on the BioConductor principles of transparency, reproducibility and efficiency of development. It extends the existing tools and leverages existing code for visualization, accessing data and widely used preprocessing routines. The oligo package implements a unified paradigm for preprocessing data and interfaces with other BioConductor tools for downstream analysis. Our infrastructure is general and can be used by other BioConductor packages.
Availability: The oligo package is freely available through BioConductor.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2944196  PMID: 20688976
6.  Quantifying uncertainty in genotype calls 
Bioinformatics  2009;26(2):242-249.
Motivation: Genome-wide association studies (GWAS) are used to discover genes underlying complex, heritable disorders for which less powerful study designs have failed in the past. The number of GWAS has skyrocketed recently with findings reported in top journals and the mainstream media. Microarrays are the genotype calling technology of choice in GWAS as they permit exploration of more than a million single nucleotide polymorphisms (SNPs) simultaneously. The starting point for the statistical analyses used by GWAS to determine association between loci and disease is making genotype calls (AA, AB or BB). However, the raw data, microarray probe intensities, are heavily processed before arriving at these calls. Various sophisticated statistical procedures have been proposed for transforming raw data into genotype calls. We find that variability in microarray output quality across different SNPs, different arrays and different sample batches has substantial influence on the accuracy of genotype calls made by existing algorithms. Failure to account for these sources of variability can adversely affect the quality of findings reported by the GWAS.
Results: We developed a method based on an enhanced version of the multi-level model used by CRLMM version 1. Two key differences are that we now account for variability across batches and improve the call-specific assessment of each call. The new model permits the development of quality metrics for SNPs, samples and batches of samples. Using three independent datasets, we demonstrate that the CRLMM version 2 outperforms CRLMM version 1 and the algorithm provided by Affymetrix, Birdseed. The main advantage of the new approach is that it enables the identification of low-quality SNPs, samples and batches.
Availability: Software implementing the method described in this article is available as free and open source code in the crlmm R/BioConductor package.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2804295  PMID: 19906825
7.  R/Bioconductor software for Illumina's Infinium whole-genome genotyping BeadChips 
Bioinformatics  2009;25(19):2621-2623.
Summary: Illumina produces a number of microarray-based technologies for human genotyping. An Infinium BeadChip is a two-color platform that types between 10^5 and 10^6 single nucleotide polymorphisms (SNPs) per sample. Despite being widely used, there is a shortage of open source software to process the raw intensities from this platform into genotype calls. To this end, we have developed the R/Bioconductor package crlmm for analyzing BeadChip data. After careful preprocessing, our software applies the CRLMM algorithm to produce genotype calls, confidence scores and other quality metrics at both the SNP and sample levels. We provide access to the raw summary-level intensity data, allowing users to develop their own methods for genotype calling or copy number analysis if they wish.
Availability and Implementation: The crlmm package is available through Bioconductor. Data packages and documentation are also available.
PMCID: PMC2752620  PMID: 19661241
8.  High-resolution spatial normalization for microarrays containing embedded technical replicates 
Bioinformatics (Oxford, England)  2006;22(24):3054-3060.
Microarray data are susceptible to a wide range of artifacts, many of which occur on physical scales comparable to the spatial dimensions of the array. These artifacts introduce biases that are spatially correlated. The ability of current methodologies to detect and correct such biases is limited.
We introduce a new approach for analyzing spatial artifacts, termed ‘conditional residual analysis for microarrays’ (CRAM). CRAM requires a microarray design that contains technical replicates of representative features and a limited number of negative controls, but is free of the assumptions that constrain existing analytical procedures. The key idea is to extract residuals from sets of matched replicates to generate residual images. The residual images reveal spatial artifacts with single-feature resolution. Surprisingly, spatial artifacts were found to coexist independently as additive and multiplicative errors. Efficient procedures for bias estimation were devised to correct the spatial artifacts on both intensity scales. In a survey of 484 published single-channel datasets, variance fell 4- to 12-fold in 5% of the datasets after bias correction. Thus, inclusion of technical replicates in a microarray design affords benefits far beyond what one might expect with a conventional ‘n = 5’ averaging, and should be considered when designing any microarray for which randomization is feasible.
PMCID: PMC2262854  PMID: 17060357
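The key idea stated in the CRAM abstract above, extracting residuals from sets of matched replicates to form a residual image, can be sketched as follows. This is an illustrative reading under stated assumptions, not the authors' procedure: residuals are taken on the log2 scale relative to each replicate set's median, and the names `residual_image`, `intensity` and `replicate_id` are hypothetical.

```python
import math
from collections import defaultdict
from statistics import median


def residual_image(intensity, replicate_id):
    """Return a grid of log2 residuals, one per array feature.

    intensity    -- 2D list of positive feature intensities (rows x columns)
    replicate_id -- 2D list of the same shape; features sharing an id are
                    technical replicates, and None marks features without one
    Assumption: each residual is the feature's log2 intensity minus the
    median log2 intensity of its replicate set.
    """
    groups = defaultdict(list)
    for row_vals, row_ids in zip(intensity, replicate_id):
        for value, rid in zip(row_vals, row_ids):
            if rid is not None:
                groups[rid].append(math.log2(value))
    centers = {rid: median(vals) for rid, vals in groups.items()}
    return [
        [math.log2(v) - centers[rid] if rid is not None else None
         for v, rid in zip(row_vals, row_ids)]
        for row_vals, row_ids in zip(intensity, replicate_id)
    ]


# Toy 2x2 array: replicate set "a" is dimmer in the top row than the bottom
# row, while replicate set "b" is identical in both positions.
intensity = [[100.0, 200.0], [400.0, 200.0]]
replicate_id = [["a", "b"], ["a", "b"]]
image = residual_image(intensity, replicate_id)
```

Because each residual keeps its array coordinates, plotting the returned grid as a heat map would show where on the array replicates deviate from one another, which is the sense in which the abstract speaks of single-feature resolution.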
