Great strides have been made in the development of gene expression profiling technologies that can accommodate partially degraded mRNA samples (Fan et al., 2004; April et al., 2009). These technologies are especially useful for assaying gene expression levels in unique tissue sources, such as the brain, where the conditions for preserving mRNA quality are typically not ideal (Mirnics and Pevsner, 2004). Gene expression assays that can accommodate the degraded or partially degraded mRNA often obtained from the brain could help identify molecular mechanisms underlying neuropsychiatric disorders, especially those that cannot be studied with animal models (Horváth et al., 2010). However, as relevant and sophisticated as such assays may be, their application also requires appropriate methods for handling and preprocessing the resulting data, to ensure that the samples have been assayed properly and that residual effects of the degraded RNA are minimal.
While many preprocessing, transformation, and normalization procedures exist for gene expression assays, such as those implemented in the widely used software package Bioconductor (Gentleman et al., 2004), most procedures differ in the way they remove systematic variance and prepare datasets for downstream processing (Lim et al., 2007; Schmid et al., 2010). For example, batch effects and antemortem conditions documented in medical records, both of which often complicate the analysis of brain samples, are not routinely accommodated by available methods (Johnson et al., 2007), but can be dealt with in a variety of ways. It is therefore important to compare and evaluate the utility of the various methods (Gold et al., 2005). Such comparisons can be made by testing for association between the processed expression data and other variables of interest, such as batch or level of sample degradation, using analysis of variance (ANOVA)-based techniques such as multivariate distance matrix regression (MDMR; Zapala and Schork, 2006).
We assessed the potential effects of different preprocessing strategies on single-channel postmortem brain gene expression data obtained with the Illumina DASL-based assay. The work that motivated our development of a preprocessing strategy explored gene expression differences between autistic and control individuals as part of an ongoing study of autism pathology. To this end, we used MDMR in combination with a number of standard gene expression transformation and normalization methods to quantify the effect of defined preprocessing steps on a dataset derived from a DASL-based assay of partially degraded brain samples. The transformation and normalization methods we considered were those implemented in the R/Bioconductor package lumi (Du et al., 2008). We also considered the utility of Bayesian approaches to correct for batch effects (Johnson et al., 2007). Our results suggest that a preprocessing strategy that effectively identifies outliers, normalizes the data, and corrects for batch effects can be fashioned for gene expression assays designed to accommodate degraded samples.
Overview of preprocessing strategy
The strategy that we developed for objectively assessing outliers, normalization, and batch effects can be described as a series of steps. Before presenting the results of each individual step, we offer a brief overview of their main elements (Figure 1). Raw intensity data, without normalization or background subtraction, were exported from GenomeStudio software for the 57 total samples that we collected (see Materials and Methods), and quality control and outlier removal analyses were performed. Following these steps, transformation and normalization were performed with the R/Bioconductor package lumi (Du et al., 2008). Then, to remove batch effects, we used the ComBat algorithm (Johnson et al., 2007). We used MDMR analysis to probabilistically assess the effect of each step on the removal of systematic variation from the samples.
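To make the role of MDMR in this scheme concrete, the following is a minimal NumPy sketch of a distance-matrix regression test of the kind described by Zapala and Schork (2006). It is not the authors' implementation. The function names, the correlation-based distance, the permutation count, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def gower_center(dist):
    """Convert a distance matrix into a Gower-centered inner-product matrix."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    return centering @ (-0.5 * dist ** 2) @ centering

def pseudo_f(g, x):
    """Pseudo-F statistic and proportion of variation explained by design x."""
    n, m = x.shape                      # m counts the intercept column
    hat = x @ np.linalg.pinv(x.T @ x) @ x.T
    resid = np.eye(n) - hat
    explained = np.trace(hat @ g @ hat)
    unexplained = np.trace(resid @ g @ resid)
    f_stat = (explained / (m - 1)) / (unexplained / (n - m))
    return f_stat, explained / np.trace(g)

def mdmr(expr, predictor, n_perm=999, seed=0):
    """Test whether a per-array predictor explains between-array distances.

    expr is probes x arrays; predictor holds one numeric value per array.
    """
    corr = np.corrcoef(expr, rowvar=False)
    # Assumed distance: sqrt(2 * (1 - r)) between array expression profiles.
    dist = np.sqrt(np.clip(2.0 * (1.0 - corr), 0.0, None))
    g = gower_center(dist)
    x = np.column_stack([np.ones(len(predictor)), predictor])
    f_obs, r2 = pseudo_f(g, x)
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_perm):
        shuffled = x[rng.permutation(len(predictor))]
        if pseudo_f(g, shuffled)[0] >= f_obs:
            hits += 1
    return f_obs, r2, (hits + 1) / (n_perm + 1)

# Hypothetical toy data: 300 probes x 12 arrays, two batches of six arrays,
# with a batch shift applied to the first 100 probes.
rng = np.random.default_rng(1)
base = rng.normal(8.0, 1.5, size=(300, 1))
batch = np.repeat([0.0, 1.0], 6)
shift = np.zeros((300, 1))
shift[:100] = 1.0
expr = base + rng.normal(0.0, 0.3, size=(300, 12)) + shift * batch

f_stat, r2, p_value = mdmr(expr, batch)
```

Under this reading, a small permutation p-value for a variable such as batch indicates that systematic variation tied to that variable remains in the data. Repeating the test after each preprocessing step shows whether the step reduced it.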
Figure 1 Data preprocessing steps and quality assessment scheme. (A) Flowchart depicting preprocessing steps taken for microarray data generated by DASL-based profiling of 33 frozen tissue samples passing quality control from male autistic and control cases in ...
Normalization and transformation
We examined the effect on our dataset of different combinations of the transformation and normalization methods in the Bioconductor package lumi (Du et al., 2008). These combinations are depicted in Figure 1B (e.g., log2–Loess, cubic root–rank invariant, etc.). Mean inter-array correlation (IAC) and lumi visualization plots were used as preliminary outcome measures to compare the combinations. The correlations measure how effectively the normalization steps remove systematic error from the dataset.
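As an illustration of the mean IAC outcome measure, the sketch below applies one transformation and normalization combination to toy data and computes the mean pairwise Pearson correlation between arrays. It does not use or reproduce the lumi API. The log2 offset, the choice of quantile normalization, and the simulated intensities are assumptions made purely for illustration.

```python
import numpy as np

def log2_transform(raw, offset=1.0):
    """Log2-transform raw intensities; the offset guards against log2(0)."""
    return np.log2(np.asarray(raw, dtype=float) + offset)

def quantile_normalize(expr):
    """Give every array (column) the same empirical intensity distribution."""
    order = np.argsort(expr, axis=0)
    ranks = np.argsort(order, axis=0)               # rank of each probe within its array
    reference = np.sort(expr, axis=0).mean(axis=1)  # mean of the sorted columns
    return reference[ranks]

def mean_iac(expr):
    """Mean inter-array correlation for a probes x arrays matrix."""
    corr = np.corrcoef(expr, rowvar=False)          # arrays x arrays Pearson matrix
    upper = np.triu_indices_from(corr, k=1)         # count each pair of arrays once
    return float(corr[upper].mean())

# Hypothetical toy data: 500 probes x 6 arrays with array-specific scaling.
rng = np.random.default_rng(0)
signal = rng.normal(8.0, 1.5, size=(500, 1))
scale = rng.uniform(0.8, 1.2, size=(1, 6))
raw = 2.0 ** (signal * scale + rng.normal(0.0, 0.3, size=(500, 6)))

logged = log2_transform(raw)
normalized = quantile_normalize(logged)
iac_before = mean_iac(logged)
iac_after = mean_iac(normalized)
```

Comparing `iac_before` with `iac_after` across candidate combinations mirrors the preliminary comparison described above. An array whose average correlation with the others is unusually low would also be a natural outlier candidate.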
In addition to standard transformation and normalization procedures, it was necessary to consider batch and covariate correction procedures. First, because the frozen tissue samples were processed in two separate batches, samples within the same batch tended to group together, creating a possible confounding effect for downstream analyses. Furthermore, because epileptiform abnormalities are present in 5–44% of children with autism (Tuchman and Rapin, 2002), it was important to account for the variance attributable to seizures noted in the medical records of the cases assayed (Table S1 in Supplementary Material), since we wanted to focus on differences due to autism pathology rather than seizure-related activity.
Batch correction and adjustment for seizures as a covariate were performed using ComBat, which applies an empirical Bayes method (Johnson et al., 2007) to the dataset. Although other batch correction techniques are available, ComBat has been shown to outperform some other algorithms, particularly for small sample sets (Chen et al., 2011). MDMR and mean IAC were again used to gauge the effectiveness of this stage of processing (Table ).
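To show what removing a batch effect means at its simplest, the sketch below re-centers each probe's per-batch mean on that probe's overall mean. This is a deliberately simplified stand-in and is not ComBat: it has no empirical Bayes shrinkage, no scale adjustment, and no covariate modeling. The function name and toy data are hypothetical.

```python
import numpy as np

def center_batches(expr, batch):
    """Shift each probe so every batch shares that probe's overall mean.

    expr is probes x arrays; batch holds one label per array.
    Location-only adjustment; NOT an implementation of ComBat.
    """
    expr = np.asarray(expr, dtype=float)
    batch = np.asarray(batch)
    adjusted = expr.copy()
    grand_mean = expr.mean(axis=1, keepdims=True)
    for label in np.unique(batch):
        cols = batch == label
        batch_mean = expr[:, cols].mean(axis=1, keepdims=True)
        adjusted[:, cols] = expr[:, cols] - batch_mean + grand_mean
    return adjusted

# Hypothetical toy data: 200 probes x 10 arrays in two batches of unequal
# size, with a probe-specific additive offset in the second batch.
rng = np.random.default_rng(2)
batch_labels = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
toy = rng.normal(8.0, 1.0, size=(200, 1)) + rng.normal(0.0, 0.3, size=(200, 10))
toy[:, batch_labels == 1] += rng.normal(0.5, 0.2, size=(200, 1))

corrected = center_batches(toy, batch_labels)
```

A plain centering like this also removes any biological signal that happens to be unevenly distributed across batches. That is one reason a model-based method that adjusts for covariates such as diagnosis or seizure history is preferable in practice.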
Multivariate distance matrix regression and mean IAC of preprocessing techniques before and after batch and seizure correction.