|Home | About | Journals | Submit | Contact Us | Français|
Despite the tremendous growth of microarray usage in scientific studies, there is a lack of standards for background correction methodologies, especially in single-color microarray platforms. Traditional background subtraction methods often generate negative signals and thus cause large amounts of data loss. Hence, some researchers prefer to avoid background corrections, which typically result in the underestimation of differential expression. Here, by utilizing nonspecific negative control features integrated into Illumina whole genome expression arrays, we have developed a method of model-based background correction for BeadArrays (MBCB). We compared the MBCB with a method adapted from the Affymetrix robust multi-array analysis algorithm and with no background subtraction, using a mouse acute myeloid leukemia (AML) dataset. We demonstrated that differential expression ratios obtained by using the MBCB had the best correlation with quantitative RT–PCR. MBCB also achieved better sensitivity in detecting differentially expressed genes with biological significance. For example, we demonstrated that the differential regulation of Tnfr2, Ikk and NF-kappaB, the death receptor pathway, in the AML samples, could only be detected by using data after MBCB implementation. We conclude that MBCB is a robust background correction method that will lead to more precise determination of gene expression and better biological interpretation of Illumina BeadArray data.
With the advent of microarray technology, gene expression analysis has become a valuable tool in biological research from development to cancer. Researchers now can choose from a number of commercially available microarray platforms. While early results were limited by lack of reproducibility, the recently completed Microarray Quality Control (MAQC) project demonstrated that through the use of standardized protocols both intraplatform consistency and interplatform concordance could be achieved (1–6). However, one area of data processing where standardization and optimization lags is background signal correction, that is, the removal of nonspecific signals from total signal intensity in the process of microarray data analysis. Comprehensive comparisons have been conducted to evaluate the performance of different background correction methodologies in two-color arrays (7); however, because of the high extent of commercialization, background correction methods for single-color arrays are mostly platform dependent. This is a consequence of different array designs, image scanning and data extraction processes developed by different vendors. Even in this setting, there is a lack of standardization for background correction—even within the same platform. For example, the robust multi-array analysis (RMA), a popular algorithm to preprocess Affymetrix GeneChip data (8), uses only perfect match (PM) probes and ignores mismatch (MM) probes when correcting for background signal. Although the empirical experience shows that RMA background correction works well in practice (8,9), it uses the ad hoc parameter estimation procedure and McGee et al. (8) have identified problems associated with its parameter estimation.
Another single color microarray platform tested in the MAQC project is from Illumina Inc., San Diego, CA, USA. This platform utilizes the company's BeadArray technology; 3-μm beads coated with hundreds of thousands of copies of 50-mer oligonucleotides sequences. The beads are randomly assembled into microwells on each array. A postscanning ‘decoding’ process is required to determine the location of each probe (10,11). One feature of Illumina arrays is that its noise is controlled by beads conjugated with nonspecific oligonucleotides. Illumina Inc. provides a method to perform background subtraction using the average value of control beads. Unfortunately, substantial negative data values can be generated by this method. For example, with the samples used in this study, where samples from one group are compared against another, use of the Illumina default background correction resulted in more than half of the probe values in one group being negative. More importantly, over 6000 probes had negative values in one group but positive values in the other group. Exclusion of probes with negative values can result in loss of large amounts of information on the chip, especially genes that ‘switch’ on and off between different sample groups. In this context, there has been report suggesting that method provided by Illumina has a negative impact on Illumina data quality and the use of the raw data before normalization was recommended (12). However, significant data compression was observed when expression ratios were calculated between two experimental groups when no background subtraction was performed. These data compression resulted in far fewer genes identified as differentially expressed than expected. For these reasons, our intent in this study was to address the necessity of performing background correction and, in the mean time, develop a robust background correction model to facilitate further statistical and data mining analysis.
In addition to using no background correction or the default manufacturer's protocol, there are two background subtraction methods for BeadArrays proposed in the Bioconductor ‘lumi’ package. The first method is to raise all data uniformly to a ‘floor’ value, which assures that all signals are positive. The other approach applies the Affymetrix RMA background correction model to BeadArray data. The former method is arbitrary and induces data compression that results in diminished expression ratios. The latter method ignores the nonspecific oligo-beads contained on Illumina arrays. By examining the histogram of signal intensities on BeadArrays, we noted that gene and control signal intensity values exhibited different distributions. The intensity values from genes did not follow the same distribution as that of the control beads in that they exhibited much heavier tails than the control intensity values (Figure 1). It was our contention that the intensity values associated with the nonspecific oligo-labeled beads is valuable information and by using the information from the nonspecific oligo-labeled beads, we could develop a model to estimate a true background and perform first-round data processing for Illumina arrays. This background correction method, referred to as model-based background correction for BeadArrays (MBCB), incorporates the negative control bead information into a statistical algorithm for background correction of Illumina arrays. Using the same set of array data, we compared the results after implementing the MBCB method versus using the data without background subtraction (RAW) as recommended by Barnes et al. (12). We also adapted the Affymetrix RMA algorithm and compared it with MBCB as well.
Although we could not apply it to Illumina data, another approach we considered was the Li and Wong model (13,14) found in the popular Affymetrix expression analysis package dCHIP. Their model-based analysis algorithm has a similar philosophical approach to MBCB in that it is a model-based approach to oligonucleotide expression array analysis. Unfortunately, it is not directly applicable to Illumina arrays because of inherent platform differences. For instance, unlike the Affymetrix MM probe that has single nucleotide differences to perfect match probes, Illumina control beads are conjugated to nonspecific sequences that are not associated with gene-specific probes. Furthermore, the main feature of Li–Wong's model is to take account of the probe-specific effects into account for the computation of expression index. They use a parameter (Φj) to represent the sensitivity of PM probe of probe pair j, and the parameter is estimated by using information from multiple arrays, a minimum of 10 being optimal. This is appropriate for Affymetrix array because each probe in the probe set is different, and some probes are more sensitive for hybridization. However, for Illumina BeadArray data, the beads in each bead type are identical and there is no sensitivity difference among beads within same bead type. Therefore, the parameter (Φj) in Li–Wong's model is not appropriate for Illumina BeadArrays and dCHIP software could not be adapted to analyze output data from the Illumina platform.
For our approach, samples from spleens of CBA mice, positive or not for acute myelogenous leukemia (AML), were used to generate comparative data sets. The MBCB, RMA and RAW methods were used to process probe signal for subsequent statistical analysis. Relative gene expression values were validated using quantitative RT–PCR (QPCR). In some cases, western analysis of protein expression was also examined. Data preprocessed using the MBCB method had the highest correlation with QPCR results and provided a greater sensitivity for detecting differentially expressed genes. The result was a better discrimination of gene expression between the AML and non-AML groups, and subsequent to that a better interpretation of differences in signal transduction pathway activation in AML-positive spleen samples.
In this study, total RNA from samples of spleen cells from four normal CBA mice and samples of spleen cells from four mice confirms by a pathologist as positive for AML were used. The tissues were homogenized in Trizol Reagent (Invitrogen, CA, USA) using an Omni Tip Disposable Generator Probe (Omni International Inc., Marietta, GA, USA). RNA isolation was performed according to the Qiagen RNeasy Mini Kit column (Qiagen, CA, USA) protocol. Chloroform was added to the homogenate for phase separation, the aqueous phase was removed and mixed with 1 Vol. 70% ethanol and the mixture was loaded into the Qiagen column, which was then centrifuged at 11 000 r.p.m. for 1 min. The flow-through liquid was discarded, buffer RW1 was added, the column was washed and the RNA eluted. The RNA was quantified and the quality was checked by using an Experion automated electrophoresis system and Experion RNA StdSens Chips (Bio-Rad Inc., CA, USA).
Illumina Mouse-6 V1 BeadChip (Illumina, Inc.) mouse whole-genome expression arrays were used in this study. Of the eight samples, four were hybridized twice and were used as technical replicates. Each RNA sample was amplified using the Ambion Illumina RNA amplification kit with biotin UTP (Enzo) labeling. The Ambion Illumina RNA amplification kit uses T7 oligo(dT) primer to generate single stranded cDNA followed by a second strand synthesis to generate double-stranded cDNA, which is then column purified. In vitro transcription was done to synthesize biotin-labeled cRNA using T7 RNA polymerase. The cRNA was then column purified. The cRNA was then checked for size and yield using the Bio-Rad Experion system. A total of 1.5 μg of cRNA was hybridized for each array using standard Illumina protocols with streptavidin-Cy3 (Amersham, Piscataway, NJ, USA) being used for detection. Slides were scanned on an Illumina Beadstation and analyzed using BeadStudio (Illumina, Inc).
The expression data described by Figure 1 motivated a background plus signal model to explain the observed intensity of gene i, where Si = Xi + Yi, and Si represents the observed intensity for gene i, Xi is the signal intensity for gene i, and Yi is the noise intensity for gene i. The goal was to adjust the observed intensity Si by removing the effects of background Yi. The observed intensity of the negative control bead comes only from the noise intensity. In order to estimate Xi, we assume the signal Xi comes from an exponential distribution with mean α and the noise intensity for both genes and negative controls comes from a normal distribution with mean µ and variance σ2. We applied the model to the observed intensity of genes and negative controls, and used Markov chain Monte Carlo simulations to estimate the parameters. The estimated signal intensity Xi is used as the background corrected expression for gene i. The negative control values were extracted from the bead-level intensity data of all nonspecific control beads and summarized using median values of each control bead type. The summarized control bead-type values for each array were used for the modeling. The R script and data files used for MBCB model was posted as Supplementary materials at the journal website. The model details could be found in the Supplementary material.
We adopted the convolution model in RMA for Affymetrix Arrays to Illumina BeadArrays. The parameter estimation also follows RMA's procedure: first, a nonparametric density function was fitted to the observed intensities, the mode of this density was used as the estimate of the mean of the noise; second, the lower tail of the mode was used to estimate the variability of noise and finally, the right tail of the mode was used to estimate the rate of exponential distribution of the signal. The model was applied to the observed intensity of each gene. The expected true signal value condition on the observed total signal is used as the background corrected gene expression level.
After performing MBCB and RMA background subtraction, we applied quantile normalization across samples using the ‘Affy’ package in Bioconductor. For comparison of groups without background subtraction (RAW), we carried out quantile normalization using the raw intensity data on the arrays. The scatter plots and Venn diagram were generated using GeneSpring GX 7.3.1. Significance analysis was done using significant analysis of microarray (SAM) and BRB-ArrayTool. Pathway analysis was performed using ingenuity pathway analysis (IPA) (www.ingenuity.com).
QPCR was performed on one splenic sample of AML-positive mouse and one normal mouse splenic sample. Fourteen genes were randomly selected to test for validation of microarray results. RNA solutions were treated with DNase I before reverse transcription. Complementary DNA was synthesized from the treated RNA solution in a reaction containing SuperScript III reverse transcriptase (Invitrogen) and random hexamer primers. The gene-specific primers were designed by using Primer3 software. PCR reactions were performed using a SYBR PCR master kit (AB Biosystems, Inc., Foster City, CA, USA), and a Chromo4 Fluorescence Detector (Bio-Rad, Inc.). The PCR protocol was designed with an initial denaturing step of 95°C 10 min, followed by 40 cycles of 95°C 15 s and 60°C 1 min. Mouse 18S RNA was used as an internal control between samples. The PCR reactions were performed in triplicates for each gene being validated. Primers used in this experiment can be found in the Supplementary materials.
Samples of frozen mouse spleen tissues were lyzed with lysis buffer (150 mM NaCl, 1% NP-40, 0.5% sodium deoxycholate, 0.1% SDS and 50 mM Tris pH = 8.0), homogenized and then sonicated. After centrifugation for 10 min at 4°C, the protein concentration was measured. For western blot analysis, 25 μg of total protein was resolved on a 7.5% polyacrylamide gel and transferred onto immunobilon-FL membrane (Millipore, Billerica, MA, USA). Western blotting was performed as follows. Briefly, after 1-h incubation with blocking buffer at room temperature, membranes were incubated with the primary antibody overnight at 4°C. The membranes were then washed five times with washing buffer and incubated with the secondary antibody for 1 h at room temperature. After washing with washing buffer three times, the membrane was incubated with ECL plus reagents and exposed to X-film. The antibodies for Tnfr2 and NF-kappaB were purchased from Santa Cruz Biotechnology (Santa Cruz, CA, USA), and the actin antibody was purchased from Sigma, St. Louis, MO, USA.
We compared the normalization results of median polish and quantile normalization, in order to determine which method to use in combination with our background correction models. The difference between the two normalization methods was best manifested in two technical replicate samples. These replicates are hybridizations that used the same amplified cRNA samples. The comparison indicated that the MBCB method, in combination with quantile normalization, demonstrated a better correction of variations between chips than the commonly used median polish normalization (Figure 2). In subsequent analyses, quantile normalization after implementation of MBCB was always performed. We also performed quantile normalization when using data without background subtraction (RAW) and with the RMA method.
Two different statistical models, widely used to identify differentially expressed genes, were used to interrogate normal spleen samples versus the AML samples. In the first analysis the random variance F-test, developed by the NCBI and implemented in BRB-ArrayTools, was used. The default significance (P < 0.001) provided by the software was used. In the second analysis, the SAM model, a widely used algorithm developed by groups in Stanford University, was used. The cutoff value for SAM is median false discovery rate (FDR) <0.01. In both of the analyses, more genes were identified as differentially regulated when the MBCB versus either the RMA or RAW methodologies was used (Table 1). When compared to the RAW method, there were 24 or 42% more genes identified when the MBCB background subtraction model was used in combination with either SAM or the BRB ArrayTools, respectively. The RMA method also detected 20 or 42% more genes than the RAW methodology. Plot of FDRs and numbers of significant genes generated from SAM output indicated that the MBCB method facilitated the lowest FDR for a given number of significant genes when comparing all three background correction methods (Figure 3).
QPCR for 14 randomly picked genes was performed in order to validate the Illumina array results. Our comparisons indicate that the MBCB method has the best overall correlation with quantitative PCR results than RMA and RAW method (Figure 4). In the plot of Figure 4, each gene was plotted according to the log-expression ratio generated by RT–PCR and one of the three background correction methods. The MBCB has the smallest mean square error (MBCB: 0.09; RMA: 0.19; and RAW: 0.17). Assuming a hypothetical perfect fit for array data versus PCR data for the same genes, the slope of the fitted line would be 1.0, with a y-intercept equal to 0. The fit of the MBCB/PCR data are the closest to meeting that hypothetical condition, see Figure 4. Both the RAW and RMA fits have slopes that significantly deviate from 1.0 and both have y-intercepts that deviate from 0. The values for each slope can be found in Figure 4.
SAM analysis in combination with the MBCB model identified 873 more genes as differentially regulated between the two groups than the combination of RAW and SAM (FDR < 0.01), and 44 of these 873 genes have an association with AML according to the designations identified by the IPA software package. If RMA was used in combination with SAM analysis 32 of these 44 genes were identified (Figure 5). In comparison, there were only small portions of genes that could be detected by either RAW (four genes) or RMA (five genes) but not by MBCB (Figure 5). Although there were major commonalities of the three methods as shown in Figure 5, the 44 genes that could not be detected by the RAW method represented 28% of all AML-related genes that changed and highlighted the fact that data compression can cause substantial loss of biologically significant information (Table 2). Included in the 44 genes not identified by the RAW/SAM combination were: Aml1, which is frequently found translocated and fused with the Eto gene to form the Aml1-Eto fusion protein in AML and which was upregulated by >5-fold in the AML samples; and Nqo-1, which is reported as suppressed in leukemia, was downregulated in the AML spleen samples. These data demonstrated again that the MBCB model is much more sensitive in detecting biologically relevant gene expression changes. Table 3 lists 12 genes that show significant expression differences identified by MBCB, but not by either the RAW or RMA methods. This list includes several important apoptosis-related genes such as Tnfrsf6/Fas and Caspase 3.
IPA was used to study gene signaling pathway that was involved in biological processes of AML. Differentially expressed genes and their fold changes that were obtained by using the MBCB, RMA or RAW method, were uploaded into IPA for analysis. The results showed that in most gene groups, the MBCB method detected more genes than the RAW method (Figure 6A). One ontology category that showed the greatest difference in gene identified was the death receptor signaling pathway. With the MBCB method, Tnfr2 and NF-kappaB mediated signaling pathway was clearly activated, while the RMA and RAW methods failed to allow the detection of this pathway (Figure 6B–D). Although the RMA method showed differential expression of several genes in this pathway, it failed to detect one of the key molecules in this pathway, that being NF-kappaB up-regulation (Figure 6C). The NF-kappaB, Ikk and Tnfr2 expression changes were confirmed by western blot (Figure 7).
Traditional methodologies involving subtraction of local background or nonspecific control spots often generate negative intensities. These signals are often filtered out and subsequently result in missing values (15,16). There have been studies suggesting that performing a background correction should be avoided (17,18), and this holds good for data generated by Illumina arrays as suggested by a previous study (12). On the other hand, not performing background subtraction can exacerbate the problem of data compression, which can then result in an underestimation of signal. There are numerous reports where <50% of genes on an array have detectable signal (19–21). And while this could very well be true, it is likely that data compression had led to an increase in false negative genes which are then ignored in subsequent analyses such as predicting group membership or signal pathway and regulatory network discoveries.
Illumina BeadArrays have more than 50 000 beads linked with nonspecific oligonucleotides, which constitute a large population of true negative controls for hybridization. These beads are the same carriers of the gene-specific oligo probes. It is likely that a background correction method using these controls will outperform methods that subtract background from the localized array attachment substrate, be it glass, silica or other (7). There are two models available that consider the nonspecific probes as background controls aside from the local substrate. One is the model described by Li and Wong (13) and the other is RMA, both of which are widely used for Affymetrix GeneChip analysis. The Li and Wong model-based expression analysis approach in dCHIP is not directly applicable to Illumina arrays because there are no sensitivity differences amongst beads within the same bead type. Therefore, we modified the RMA method and adapted it for use on the Illumina platform. However, the RMA model is limited because it completely ignores information from nonspecific control beads. The data from this experiment and simulation results (22) suggested that RMA overestimated background values. This resulted in excess background subtraction and increased variances in arrays.
To overcome these limitations, we have described a model-based background correction method for Illumina BeadArrays. Our model was motivated by the different distributions of the observed intensity level for both genes and negative controls. We verified that this difference of distributions is typical for Illumina expression arrays as we could observe it in all hybridizations (data not shown). Our model makes use of the negative control beads of the arrays and the parameter estimation was based on the Markov chain Monte Carlo simulations (23). This method provides the positive estimation of background corrected expression level for each gene. We demonstrated that our MBCB model had a better correlation with QPCR results.
Our results clearly showed that both RMA and MBCB methods out-performed a no background correction approach. Although adopting RMA to Illumina background correction is still relatively new, MBCB incorporates the negative control beads, which is not available for the RMA method. MBCB improves the parameter estimation compared to the RMA method and we therefore, recommend the MBCB method for Illumina background correction.
We applied two popular significance analysis methods, SAM and BRB ArrayTools, to further validate the performance of our model. Both of the methods resulted in detecting the most differentially expressed genes when using the MBCB model when comparing with RMA and RAW methods, demonstrating improved sensitivity for gene discovery through the use of the MBCB model.
Through a literature search and with help of IPA software, we showed that the MBCB background correction model detected 44 more AML-associated genes considered to be differentially regulated than were found when the RAW methodology was used. As shown in Table 2, Aml1 is a gene that has been shown frequently translocated and forms fusion protein with Eto (24,25). In our MBCB model, Aml1 was overexpressed >5-fold in the AML samples when compared to nonleukemic samples, whereas in RAW data set, there was no significant change of Aml1. Aberrations in cell cycle regulation and p53-dependent apoptosis have been frequently found in hematological malignancies and we found in the MBCB generated data set that several cell cycle and apoptosis-related genes, such as Cdkn1a, Cdkn2a, Brca1 and Nfil3, were significantly changed in leukemic mice. These genes were not detected in the RAW data set. Brca1 has been reported to be hypermethylated in therapy-related AML (26) and our results suggest that BRCA1 expression was suppressed in radiation-induced mouse AML model. On the other hand, lack of methylation was found for p21 in AML (27), and our data showed an increased expression for this gene. Myb is associated with differentiation and proliferation in leukemia cell lines (28) and our data indicate a 3-fold overexpression in the leukemic samples. Also on the list, Nqo1 a member of the NAD(P)H dehydrogenase (quinone) family, which encodes a cytoplasmic 2-electron reductase, was underexpressed in the leukemic samples in line with the notion that no or low Nqo1 activity is associated with an increased risk of de novo acute leukemia in humans (29). The data above clearly indicate that if one chose not to perform a background subtraction there is a risk for significant loss of biological information. By using the RMA methodology, 32 of the 44 genes identified by the MBCB method were detected. Within the remaining 12 genes were several important apoptosis-associated molecules, such as Tnfrsf6/Fas, Nfkbia, Casp3 and Ripk1. In fairness, there were six genes that showed significant change in either the RAW (four genes) or RMA methods (five genes) that were not detected when the MBCB method was used. In summary, our data supported the conclusion that MBCB discovered more biologically relevant findings than the other two methods.
Our data demonstrated that the new MBCB background correction model enhanced sensitivity and in the mean time, remained highly specific during the gene discovery process. This is evident by the fact that more differentially expressed genes were detected and had better correlations with QPCR data. We also demonstrated an enhanced ability to perform functional pathway analysis due to the higher sensitivity. By using the MBCB model, we were able to discover that Tnfr2 and NF-kappaB mediated death receptor pathway was activated in the radiation-induced AML samples. Activation of the NF-kappaB pathway has been reported in human leukemia to support cancerous proliferation, resistance to apoptosis and sustain angiogenesis (30). Targeting NF-kappaB pathway activation via pharmacological inhibition induced apoptosis in human AML cells (31–33). It has been shown in human leukemia cells that the TNF signal machinery is necessary to switch cells between a proliferative versus an apoptotic phenotype (34,35). Sustained Tnf/Tnfr2-induced NF-kappaB signaling and transcription allow the cells to survive and proliferate. On the other hand, switching the NF-kappaB pathway off results in Tnf/Tnfr1-driven stimulation of proapoptotic pathways, such as sustained Jnk and p38 MAPK activity. In our mouse AML expression profile generated without background subtraction (RAW), there was no suggestion of any NF-kappaB pathway activity. The RMA method indicated partial activation but failed to detect NF-kappaB upregulation. Using the MBCB model, it was clearly shown that the Tnfr2 and NF-kappaB death receptor pathway was activated given the differential expression of Tnfr2, Ikk and NF-kappaB. Since the activation of this pathway has not been systemically confirmed in mouse AML models, western blotting was performed to validate the protein expression of these genes. The results were consistent with the gene-expression profile.
In summary, we have demonstrated that appropriate background correction (MBCB) will lead to better detection of differentially expressed genes, which then results in an improved sensitivity for performing gene function and pathway analysis. The MBCB is a robust model that avoids the problem of generating negative values after background subtraction, and substantially reduces data compression that occurs when using RAW data without background correction. Our data indicate that the large population of negative control features found on Illumina arrays are useful for estimating noise value and should be considered in array designs.
Supplementary Data are available at NAR Online.
We would like to thank Shane Scoggin of the Simmons Cancer Center Genomics Core, UT Southwestern Medical Center, for processing all samples for microarray analysis. This study was funded by grants from NASA NSCORS NAG9-1569, NNJ05HD36G, NCI CA06294 and NIH UL1RR024982. Funding to pay the Open Access publication charges for this article was provided by NASA.
Conflict of interest statement. None declared.