|Home | About | Journals | Submit | Contact Us | Français|
Available statistical preprocessing or quality control analysis tools for gene expression microarray datasets are known to greatly affect downstream data analysis, especially when degraded samples, unique tissue samples, or novel expression assays are used. It is therefore important to assess the validity and impact of the assumptions built in to preprocessing schemes for a dataset. We developed and assessed a data preprocessing strategy for use with the Illumina DASL-based gene expression assay with partially degraded postmortem prefrontal cortex samples. The samples were obtained from individuals with autism as part of an investigation of the pathogenic factors contributing to autism. Using statistical analysis methods and metrics such as those associated with multivariate distance matrix regression and mean inter-array correlation, we developed a DASL-based assay gene expression preprocessing pipeline to accommodate and detect problems with microarray-based gene expression values obtained with degraded brain samples. Key steps in the pipeline included outlier exclusion, data transformation and normalization, and batch effect and covariate corrections. Our goal was to produce a clean dataset for subsequent downstream differential expression analysis. We ultimately settled on available transformation and normalization algorithms in the R/Bioconductor package lumi based on an assessment of their use in various combinations. A log2-transformed, quantile-normalized, and batch and seizure-corrected procedure was likely the most appropriate for our data. We empirically tested different components of our proposed preprocessing strategy and believe that our results suggest that a preprocessing strategy that effectively identifies outliers, normalizes the data, and corrects for batch effects can be applied to all studies, even those pursued with degraded samples.
Great strides have been made in the development of gene expression profiling technologies that can accommodate partially degraded mRNA samples (Fan et al., 2004; April et al., 2009). These technologies are especially useful in assaying gene expression levels from unique tissue sources, such as the brain, where the conditions for the preservation of mRNA quality are not typically ideal (Mirnics and Pevsner, 2004). Gene expression assays that can accommodate the often degraded or partially degraded mRNA obtained from the brain could help identify molecular mechanisms underlying neuropsychiatric disorders, especially those that cannot be studied with animal models (Horváth et al., 2010). However, as relevant and sophisticated as gene expression assays that can accommodate partially degraded mRNA may, the application of these assays also requires appropriate methods for handling and preprocessing the information resulting from the assay in order to make sure the samples have been assayed properly with minimal residual effects of the degraded RNA.
While many gene expression assay preprocessing transformation and normalization procedures exist, such as those implemented in the available and widely used software package Bioconductor (Gentleman et al., 2004), most procedures differ in the way they remove systemic variance and prepare datasets for downstream processing (Lim et al., 2007; Schmid et al., 2010). For example, batch effects and issues of antemortem conditions documented by medical records that are often associated with the analysis of brain samples are not routinely accommodated by available methods (Johnson et al., 2007), but can be dealt with in a variety of ways. It is therefore important to compare and evaluate the utility of the various methods (Gold et al., 2005). Such comparisons can be achieved by considering resulting tests of associations between the processed expression data and other variables of interest, such as batch or level of sample degradation, using analysis of variance (ANOVA)-based techniques such as multivariate distance matrix regression (MDMR; Zapala and Schork, 2006).
We assessed the potential effects of different preprocessing strategies on single-channel postmortem brain gene expression data obtained with the Illumina DASL-based assay. The study that motivated our development of a preprocessing strategy involved exploring gene expression differences between autistic and normal individuals as part of an ongoing study of autism pathology. To achieve this, we considered the use of MDMR in combination with a number of standard gene expression level transformation and normalization measures to quantify the effect of defined preprocessing steps on a data set resulting from a DASL-based assay and partially degraded brain samples. The transformation and normalization measures we considered were those implemented in the R/Bioconductor package lumi (Du et al., 2008). We also considered the utility of Bayesian approaches to correct for batch effects (Johnson et al., 2007). Our results suggest that a preprocessing strategy that effectively identifies outliers, normalizes the data, and corrects for batch effects can be fashioned for gene expression assays designed to accommodate degraded samples.
The strategy that we developed for objectively assessing outliers, normalization, and batch effects can be described in a series of steps. Before providing the results of each individual step, we offer a brief overview of the main elements of these steps (Figure (Figure1).1). Essentially, raw intensity data without normalization or background subtraction was output from GenomeStudio software for the 57 total samples that we collected (see Materials and Methods), and quality control and outlier removal analyses were performed. Following these steps, transformation and normalization were performed by R/Bioconductor package lumi (Du et al., 2008). Then, to remove batch effects, we used the ComBat algorithm (Johnson et al., 2007). We leveraged MDMR analysis to probabilistically assess the effect of each step on the removal of systematic variation from the samples.
We examined the effect of different transformation and normalization method combinations in the Bioconductor package lumi (Du et al., 2008) on our dataset. These combinations are depicted in Figure Figure1B1B (e.g., log2–Loess, cubic root–rank invariant, etc.). Mean inter-array correlation (IAC) and lumi visualization plots were used as preliminary outcome measures to compare them. These correlations are a measure of the efficacy of normalization steps in removing systemic error from the dataset.
In addition to standard transformation and normalization procedures, it was necessary to consider batch and covariate correction procedures. First, since the frozen tissue samples were processed in two separate batches, samples within the same batch tended to group together, creating a possible confounding effect for downstream analyses. Furthermore, since epileptiform abnormalities are present in as many as 5–44% of children with autism (Tuchman and Rapin, 2002), it was important to account for the variance attributable to seizures noted by medical records in cases assayed (Table S1 in Supplementary Material) since we wanted to focus on differences due to autism pathology, not seizure-related activity.
Batch correction and adjustment for seizures as a covariate was performed using ComBat, which applied an empirical Bayes method (Johnson et al., 2007) to the dataset. Although batch correction techniques other than ComBat (Johnson et al., 2007) are available, Combat has been shown to outperform some other algorithms, particularly for small sample sets (Chen et al., 2011). MDMR and mean IAC were again used to gage the effectiveness of this stage of processing (Table (Table33).
Fifty-seven frozen blocks of fresh frozen brain tissue from the prefrontal cortex of control and autistic male and female cases were obtained from the Harvard Brain and Tissue Resource Center (United States Public Health Service) and from the University of Miami/University of Maryland Brain and Tissue Bank (National Institute of Child Health and Human Development; Table S1 in Supplementary Material).
Diagnostic criteria of autistic disorder was verified for all autistic cases by review of psychological and medical records, including the Autism Diagnostic Interview-Revised (ADI-R; Lord et al., 1994), and the Autism Diagnostic Observation Schedule (ADOS; Lord et al., 2000) by a psychologist with extensive diagnostic experience with autism (CCB; Table S1 in Supplementary Material). Seizure incidence of autistic cases was also assessed through case records.
Due to documented variability of gene expression in neighboring brain areas (Rehen et al., 2005; Lein et al., 2007), it is of extreme importance that the blocks of tissue chosen for gene expression profiling are from comparable regions between cases. Anatomical landmarks were identified as consistently as possible for dissection across cases with the goal of obtaining a set of highly controlled, comparable tissue for brain gene expression profiling. When available, tissue from the superior frontal gyrus of the dorsal lateral prefrontal cortex (DLPFC) was dissected in each case. When this area was not available, we sampled from the middle frontal gyrus.
Extraction of total RNA from 5–10mg of frozen tissue from both gray and white matter, with as many layers of cortex as possible, was performed using MELT® kit from Ambion according to manufacturer’s instructions1. Select RNA samples were analyzed with BioAnalyzer® (Agilent) according to the manufacturer’s protocol for quality control and quantification, and available RNA integrity numbers (RIN) from three RNA quality assessments are reported in Table S1 in Supplementary Material. Because RNA quality was not expected to be a good predictor of array quality (Abramovitz et al., 2008), all samples regardless of RIN were assayed. Whole RNA from remaining samples was quantified using a NanoDrop® spectrophotometer.
Total RNA from frozen samples underwent cDNA synthesis, and cDNA-mediated annealing, selection, and ligation (DASL)-based labeling, hybridization to Illumina HumanRef8 v3 and scanning on two separate occasions as described previously (April et al., 2009). Both biological and technical replicates were included for quality control. Using biotinylated random primers and oligo-dT, 200ng RNA was converted to cDNA. The biotinylated cDNA was then immobilized to a streptavidin-coated solid support, and annealed with a pool of gene-specific oligonucleotides. Following extension and ligation, the ligated oligonucleotides were PCR amplified with a biotinylated and a fluorophore-labeled universal primer, and captured using streptavidin paramagnetic beads. Finally, the single-stranded PCR products were eluted and hybridized to the BeadChips at 58°C for 16h. A BeadArray Reader was used to scan array images and extract fluorescence intensities, and all data were uploaded into GenomeStudio software without normalization or background subtraction for quality control and processing. All raw data is available on the NCBI Gene Expression Omnibus under accession number GSE284752. Array chip and position of each sample are detailed in Table S2 in Supplementary Material.
Exclusion criteria are outlined in detail in the results section. The lumi package in Bioconductor (Du et al., 2008), MDMR, and mean IAC were used as unbiased statistical metrics and visualization techniques for quality control of the outlier exclusion process. The final dataset consists of high quality arrays of 33 male ASD and control cases. All outliers were removed before transformation, normalization, and batch correction procedures.
For data preprocessing and normalization, we aimed to identify a workflow that would: (1) maximize mean IAC across the dataset (Oldham et al., 2008); (2) remove known confounds from the dataset; and (3) prepare the data for downstream processing (e.g., differential expression and enrichment analysis).
Average clustering with Euclidean distances, scatterplots, distribution histograms, correlation measures, and boxplots were used to visualize the data before and after processing steps. For details on the implementation of the transformation and normalization techniques we used, see (Du et al., 2007, 2008; Lin et al., 2008). Mean IACs (Oldham et al., 2008) were used as a basis to identify reasonable processing candidates at each step for further investigation.
Transformation methods involving the log2, variance stabilizing transformation (VST; Lin et al., 2008), and cubic root were implemented using the lumi package (Du et al., 2008) before data normalization.
Robust spline normalization (RSN), simple scaling normalization (SSN), quantile normalization, variance stabilizing normalization (VSN), Loess, and Rank Invariant normalization methods were tested in conjunction with the above transformation procedures (Du et al., 2008; Lin et al., 2008).
The software suite Combat (Johnson et al., 2007) was used to remove the variance attributed to batch effect, since sets of our frozen samples were processed at different times on the DASL platform. In addition, we also attempted to remove the confounding effects of seizures in our dataset, since many individuals with autism have comorbid seizure incidence. MDMR was used to assess the efficacy of the correction methods as described below.
To assess the variance within the dataset attributable to a set of additional or ancillary variables before and after manipulating and preprocessing the expression assay results (e.g., batch correction), MDMR (Zapala and Schork, 2006) with 1000 permutations was applied to the Euclidean distance matrices constructed from the expression values between each sample3. Variables of interest that were tested for association with the expression profiles (reflected in distance matrices) included batch, diagnosis, age, and seizure incidence of cases from which we sampled. We leveraged both single independent variable and multiple independent MDMR results.
Multivariate distance matrix regression (Zapala and Schork, 2006) is a statistical method that considers variation in the degree of pairwise similarity among individuals based on multivariate profiles or data collected on those individuals. MDMR tests the hypothesis that a measure variable (e.g., diagnosis) “explains” variation in the similarities/dissimilarities exhibited by the individuals. It requires two main inputs: (1) a distance matrix quantifying distances (i.e., differences or lack of similarity) between gene expression profiles of the samples in the study; and (2) additional variables, such as diagnosis, age, sex, etc., which are considered independent variables. The method uses these additional variables to determine how much variation in the similarities among the individuals can be explained by these variables. It essentially works in a manner analogous to regression analysis, but accounts for dependencies in the data, and can produce estimates of variance explained as is both single independent and multiple independent variable regression models. The single independent variable regression results consider the predictive value each of the independent variables separately, while the multiple regression model considers all the independent variables entered cumulatively. In the current manuscript, we use MDMR as a method to assess the variance attributable ‘to, e.g., batch, age, gender, etc.’
The final transformed, normalized, and batch and covariate-corrected dataset is available on the NCBI Gene Expression Omnibus accession # GSE28475 (see text footnote 2). qPCR validation of selected genes was performed (Figure S27 in Supplementary Material).
An initial step in microarray-based gene expression assay data processing involves identifying and removing outlying individuals whose assay results exhibit marked, and likely artifactual, deviation from the other assay results. Since normalization procedures depend on biological variability within the dataset examined and assume that most genes are not differentially expressed, normalizing experimental samples with outliers that create artificial variation in a dataset could potentially confound analyses after the preprocessing is complete. In addition, most normalization procedures assume that most of the genes in a dataset are not differentially expressed across experimental conditions, so artificial variance induced by outliers can have an amplified effect on downstream analyses.
As noted, our data set was based on brain tissue samples from 57 autistic and control cases, and were labeled by a sample number (numbers 1–74), biological replicate letter (a, b, or c), and technical replicate number (rep 1 or rep 2). We refer to this labeling in order to describe the results of the outlier analyses. These brain samples were subjected to DASL-based expression array analysis. Assay results were then analyzed using criteria and analysis steps outlined above and detailed below for outlier detection and exclusion in the presence of potential batch effects:
Fifteen combinations of normalization and transformation methods were tested (Figure (Figure1B).1B). Plots of the genome-wide data before transformation and normalization procedures are shown in Figures S13–S15 in Supplementary Material, and plots after transformation and normalization are shown in Figures S16–S24 in Supplementary Material.
Of the 15 combinations tested (Table (Table2),2), four were chosen for further analysis. Log2 transformation of the resulting assay data was chosen for further investigation based on convention and high mean IAC, and VST transformation was chosen based on high mean IAC (Table (Table2).2). RSN and quantile normalization for each of these transformation methods yielded the highest mean IACs and so were further investigated. Slight differences in data distribution were evident between these four techniques (Figure (Figure2;2; Figures S18 and 21 in Supplementary Material). These differences are important due to assumptions of normality that must be met for downstream preprocessing and analyses (Giles, 2003; Zapala and Schork, 2006; Johnson et al., 2007).
Qualitatively, log2-transformed expression values showed a more normal distribution, while VST-transformed values showed a distribution skewed to lower intensity values (Figure (Figure22 and Figures S18 in Supplementary Material). Correlations between arrays were slightly higher in the log2/quantile combination, but more genes were detected as differentially up/down regulated between technical replicates in this combination.
We again used MDMR to check the predictors of variance in these preprocessed datasets. Comparing multiple regression MDMR results after quality control (Table (Table1)1) and after transformation and normalization but before batch correction (Table (Table3),3), we see that cumulative percentage of variance explained (PVE) using batch, age, diagnosis, and seizures as predictors could predict 75.6% of variance before preprocessing, but only ~35% after preprocessing. Batch effects were still the primary predictor of variance in the dataset. Furthermore, the importance of seizures as a predictor of variance increased following all four preprocessing strategies, which necessitated implementing other statistical strategies to ensure that seizure incidence was not a confounding variable for differential expression analysis by diagnosis. Nonetheless, the variance predicted by the main variable of interest, diagnosis, did not appear to be affected by any of the four preprocessing techniques. In fact, following transformation and normalization, each of the four strategies appeared to yield similar PVEs for each of the four variables queried.
Substantial batch effects could be observed through hierarchical clustering by average linkage (Figure S7 in Supplementary Material). Batch correction and adjustment for seizures as a covariate was performed using ComBat on the four pipeline protocols highlighted in Table Table22 (Johnson et al., 2007).
Batch correction by ComBat decreased the percentage of variation that could be attributable to batch from ~25% after transformation and normalization to less than 1% in each of the four pipeline pathways. Scatterplots and IAC calculations indeed showed higher correlation between samples in two separate batches after correction (Figure (Figure2;2; Figures S25 and 26 in Supplementary Material; Table Table3).3). After the correction step, seizures became the most important predictor of variance in the dataset, but the PVE remained similar before and after correction. These results suggested that the use of Combat and appropriate preprocessing can effectively eliminate the potential for artifactual associations between gene expression levels and important covariates.
Based on these results with ComBat and the combination of factors listed above, including the intensity distribution, MDMR analysis, and IACs, we chose the log2–quantile method for our DASL-based brain gene expression dataset. Following these steps, the dataset was ready for probe-based filtering and differential expression analysis in BRB Array Tools, followed by enrichment analysis in the MetaCore software suite (Chow et al., in review).
We have described a DASL assay-based gene expression preprocessing analysis and quality control strategy meant to accommodate problems associated with the use of degraded brain samples. The motivation for developing this strategy was to investigate aberrant molecular pathways in the brains of individuals with autism (Chow et al., in review). Preprocessing strategies have important downstream consequences, and should therefore be vetted appropriately. We exploited statistical methods such as MDMR (Zapala and Schork, 2006), the algorithms and procedures described in the lumi analysis package (Du et al., 2008), and mean IAC analysis (Oldham et al., 2008) to quantify and visualize the effects of each of our proposed preprocessing steps. Ultimately, it is crucially important to remove systemic error in microarray-based gene expression studies, so as not to unduly influence inferences made in such studies. The main goal of our preprocessing technique was to produce a clean dataset suitable for downstream differential expression analyses using statistical measures to quantify and visualize the effects of each preprocessing step. An alternate statistical modeling strategy may also be applied to adjust for batch effects and other confounding variables.
Our analysis suggests that not all bioinformatics and biostatistical pre- and post-processing techniques will generate reliable results from brain gene expression datasets, when degraded samples are considered and the DASL assay is used. This is consistent with previous reports that consider general microarray-based gene expression studies (Lim et al., 2007; Schmid et al., 2010). For example, the transformation methods resulting in the highest inter-array gene expression profile correlations and that are historically used for expression microarray preprocessing all yielded varying results as assessed by the use of MDMR and mean IAC analyses. Our decision to ultimately transform and normalize our dataset by the log2–quantile method, followed by batch and covariate correction analyses, was based on careful consideration of whether a removal of potentially artifactual variation across the expression profiles of the samples could be achieved and quantified. The results of the analyses comparing autistic and control brains involving the samples processed in this report are described elsewhere (Chow et al., in review).
In order to assess the effects of autism diagnosis on brain gene expression, while controlling for important covariates such as gender and age, it was necessary to remove the systematic variance introduced by experimental data handling factors. Furthermore, if autism-specific mechanisms are to be uncovered, the effect of medications, lifestyle, comorbid conditions, and other confounding variables on brain gene expression must be accommodated and controlled for (Horváth et al., 2010). As noted, we attempted to remove seizure-specific variance through the use of the algorithm implement in ComBat (Johnson et al., 2007) since it was a primary confounding variable to diagnostic differences between cases. Despite the sophistication of these statistical techniques and their potential to control for effects such as seizure when considering the effects of autism on brain gene expression, care should be taken to select postmortem cases and samples for gene expression studies without such confounding conditions. It is possible that overuse of batch effect correction and normalization techniques can modify variance in the dataset and thus confound biological results. Thus, although we have developed a procedure for objectively removing outliers, normalizing data, and removing batch effects from DASL chip-based gene expression on degraded brain samples, it is no substitute for good study design and appropriate collection, storage and maintenance of samples.
Dr. Jian-Bing Fan and Dr. Craig April declare stock and employment interest in Illumina, Inc.
The Supplementary Material for this article can be found online at http://www.frontiersin.org/Statistical_Genetics_and_Methodology/10.3389/fgene.2012.00011/abstract
This research was supported by funds from Cure Autism Now, the Simons Foundation, The Peter Emch Family Foundation, and the UCSD-NIH Autism Center of Excellence (P50-MH081755) awarded to EC. NJS and MEW are supported in part by the following research grants: U19 AG023122-05; R01 MH078151-03; N01 MH22005; U01 DA024417-01; R01 AG030474-02; N01 MH022005; R01 HL089655-02; R01 MH080134-03; U54 CA143906-01; UL1 RR025774-03 as well as the Price Foundation and Scripps Genomic Medicine. We send our appreciation to all parents who made the difficult choice to support brain research through the donation of brain tissue from their loved ones. Tissue for this study was provided by the National Institute of Child Health and Human Development Brain and Tissue Bank for Developmental Disorders (Baltimore, MD) under contracts N01-HD-4-3368 and N01-HD-4-3383, the Brain and Tissue Bank for Developmental Disorders (Miami, FL), Autism Tissue Program (Princeton, NJ) and Harvard Brain Tissue Resource Center (Belmont, MA). We thank Dr. Ronald Zielke at the National Institute of Child Health and Human Development Brain and Tissue Bank for Developmental Disorders and Dr. Jane Pickett at the Autism Tissue Program for facilitation of tissue acquisition and Dr. Joeseph Buckwalter, Dr. Cynthia Schumann, Robert Johnson, and Robert Vigorito for help in tissue dissection and collection. We also thank Dr. Brandy Klotzle, Dr. Gary Hardiman, and James Sprague for microarray processing.