Microarray technology allows researchers to efficiently profile gene expression and discover associations between disease and gene expression levels. Various microarray platforms are commercially available. Illumina BeadArray is a recent microarray technology with attractive features not found in other widely used arrays such as Affymetrix GeneChips. In a BeadArray, hundreds of thousands of copies of a specific 50-mer oligonucleotide, used as the probe for a gene, are attached to 3-micron silica beads that are then randomly assembled in equally spaced microwells on either fiber optic bundles or planar silica slides (
Kuhn et al., 2004). An array holds up to tens of thousands of different bead types and hundreds of thousands of beads, resulting in high redundancy: about 30 replicates on average for each bead type. Because the beads are located randomly on a chip, a decoding scheme (
Gunderson et al., 2004) is used to identify the types of beads through sequential hybridization to the ~25-mer identifier sequences that are also attached to the beads. The features of BeadArray technology bring several advantages: randomness helps to reduce the impact of localization artifacts; redundancy improves both the precision and the robustness of intensity measurements through replicates of the same type of bead (
Kuhn et al., 2004); the decoding process also validates the hybridization performance of each bead to ensure that all beads are functional; multiple arrays can be arranged in a single chip so that several samples can be processed simultaneously, improving the throughput and reducing the variability; and the technology is cost efficient in that it allows rapid development of new products and quick delivery of custom-designed high-density chips since it is easy to produce new beads and to assemble them onto substrates. Because of these appealing attributes, the Illumina BeadArray platform has become increasingly popular in gene expression profiling.
When pre-processing microarray data, one step that is critical to the analysis of gene expression is background noise adjustment. Noise can be introduced into the observed expression level, or intensity, during the processing of the samples. For example, when an mRNA sample is labeled and hybridized to the probes, part of the hybridization is nonspecific, i.e., RNA sequences other than the intended target bind to the probe; and when the array is scanned, optical variations can also affect signal intensity. Here, following
Wu et al. (2004), we define the background noise as a part of the intensity not attributed to the target gene, which includes non-specific hybridization and errors in optical scanning and data extraction. In this article we focus on the background noise correction for the gene expression data of Illumina whole genome BeadArrays.
Because microarray products are highly commercialized, platforms from different vendors differ significantly in the design of the arrays, the scanning devices, and the data extraction processes. For example, the Affymetrix GeneChip and the Illumina BeadArray control for non-specific hybridization differently. In Affymetrix GeneChips, each perfect match (PM) probe is paired with a mismatch (MM) probe obtained by changing its middle base. BeadArrays have no such PM-MM pairs; instead, negative control beads, attached with arbitrary oligonucleotide sequences that have no targets in the genome, are designed to detect non-specific hybridization. Consequently, background adjustment methods are highly platform dependent. For Affymetrix oligonucleotide arrays, extensive efforts have been devoted to the problem of background correction, yielding a rich set of methods in the literature, for example, the MAS 5.0 algorithm of Affymetrix, the multiplicative model based expression index (MMBE) proposed by
Li and Wong (2001), the robust multi-array average (RMA) method by
Irizarry et al. (2003), the GC-RMA methods by
Wu et al. (2004), and the maximum likelihood estimation method based on the normal-exponential convolution model by
Silver et al. (2009). Comparisons of various methods can be found in
Ritchie et al. (2007). However, work on background correction models for Illumina BeadArrays has been modest, partly because the technology is new and differs so much from Affymetrix arrays that many existing methods, especially those involving MM probes, cannot be extended directly to BeadArrays.
In a recent paper,
Dunning et al. (2008) discussed important statistical issues in preprocessing Illumina data. Illumina Inc. supplies a background correction algorithm that simply subtracts the average of the negative control beads from the intensity values of the genes. However,
Barnes et al. (2005) found that “background subtraction had a negative impact on Illumina data quality”, and so they chose not to perform background correction. Also, as reported by
Ding et al. (2008), subtraction as proposed by Illumina results in a substantial number of negative values that cannot be used directly in further analyses, which is a significant loss of information from the experiment. Furthermore, a large number of probes can have negative values in one sample but positive values in another, which casts doubt on the efficiency of this algorithm. Note that
Lin et al. (2008) proposed a variance-stabilizing transformation (VST) method that can recover the negative values. The popular RMA algorithm, initially developed for Affymetrix microarrays by
Irizarry et al. (2003), can be applied to BeadArray data because it uses only PM probes. Although it works well empirically on Affymetrix microarrays (
Bolstad et al., 2003), it uses
ad hoc parameter estimation (
McGee and Chen, 2006) and it is not an efficient background correction method for BeadArray data since it does not make use of the negative control data on the array (
Xie et al., 2009). Recently,
Xie et al. (2009) proposed an exponential-normal convolution model, which we will refer to as NMLE (normal distribution using maximum likelihood estimator) hereafter. The NMLE model incorporates negative control data into the background correction model and has been shown to perform better than other existing methods. The NMLE model assumes a Gaussian distribution for the noise term; however, the noise can sometimes be asymmetrically distributed (an example will be shown in Section 2.1). In this paper, we propose a Gamma distribution for the noise term and develop a new background adjustment approach based on it. The Gamma distribution is widely used when values are non-negative, as is the case here because the noise is believed to be positive. More importantly, it is flexible enough to accommodate right-skewed as well as roughly symmetric distributions.
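The negative-value problem of plain control-mean subtraction can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions, not the Illumina implementation and not the estimation method developed in this paper: the observed intensity is assumed to be an exponentially distributed true signal plus Gamma-distributed noise, negative control beads are assumed to record noise only, and all function names and parameter values are hypothetical choices made for illustration.

```python
import random
import statistics


def simulate_array(n_genes=5000, n_controls=500, signal_mean=50.0,
                   noise_shape=4.0, noise_scale=25.0, seed=0):
    """Simulate observed intensities as true signal plus positive noise.

    All parameter values are illustrative assumptions, not estimates
    obtained from real BeadArray data.
    """
    rng = random.Random(seed)
    # Negative control beads have no target, so they record noise only.
    controls = [rng.gammavariate(noise_shape, noise_scale)
                for _ in range(n_controls)]
    # Gene probes record an (assumed exponential) signal plus Gamma noise.
    genes = [rng.expovariate(1.0 / signal_mean)
             + rng.gammavariate(noise_shape, noise_scale)
             for _ in range(n_genes)]
    return genes, controls


def subtract_control_mean(genes, controls):
    """Subtract the mean negative-control intensity from every gene intensity."""
    background = statistics.mean(controls)
    return [g - background for g in genes]


genes, controls = simulate_array()
corrected = subtract_control_mean(genes, controls)
n_negative = sum(1 for value in corrected if value < 0)
print(f"{n_negative} of {len(corrected)} corrected intensities are negative")
```

Every simulated observed intensity is positive, yet subtracting a single average background pushes the weakly expressed probes below zero. A model-based correction that treats signal and noise as separate random components is intended to avoid exactly this.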
This paper is organized as follows: in Section 2 we present the model and discuss methods of parameter estimation; in Section 3 we develop background adjustment methods based on the model; in Section 4 we report simulation studies and their results; and in Section 5 we apply the background correction method to three real data sets obtained with Illumina Human WG-6 V2 and Illumina Mouse-6 V1 BeadChips.