Data sets from five independent microarray studies comparing PBMC samples from SLE patients with those from healthy individuals were obtained from prominent SLE researchers. These data sets are referred to as data sets 1, 2, 3, 4, and 5. Data sets 1, 2, 4, and 5 are associated with peer-reviewed publications (11-14). Data set 3 is composed of unpublished data. Three of the studies (studies 1, 4, and 5) included only pediatric patients, while the remaining two included only adults. All studies employed the Affymetrix GeneChip microarray platform (Affymetrix, Inc., Santa Clara, CA, USA), but the array versions varied (Table ). Where two different array types were used in the same study (that is, data sets 1 and 2), we treated them as separate data sets (data sets 1a, 1b, 2a, and 2b) during the meta-analysis. Raw data in the form of Affymetrix CEL files were provided for studies 1, 2, 3, and 5. For data set 4, however, expression values for a short-listed set of genes were provided. While data sets 1 to 4 were used in the meta-analysis workflow, data set 5 served as an independent data set to validate the gene signature derived from the meta-analysis.
Information on data sets used for meta-analysis
Workflow of the pathway-based meta-analysis approach
The overall workflow of the pathway-based meta-analysis is summarized in Figure . The meta-analysis used a leave one data set out validation process. Both principal component analysis (PCA) and hierarchical cluster analysis (HCA) were used to visually inspect the leave one data set out cross-validation results. Finally, the combined metasignature obtained from the four data sets was validated against an independent fifth data set (data set 5).
For the individual quality control and data analysis steps mentioned below, each data set was considered separately. Additionally, since data sets 1 and 2 used two chip types each, they were considered as four different data sets (1a, 1b, 2a, and 2b) for the initial analysis.
Quality assessment was performed for each data set using Genedata Expressionist software (Genedata, San Francisco, CA, USA) [22] (Figure , step 1). Only one sample, in data set 2a, was discarded from further analysis because its defective area percentage was too high.
Individual data processing and analysis
Following quality control assessment, each data set was analyzed individually using the ArrayTrack™ tool (US Food and Drug Administration's National Center for Toxicological Research, Jefferson, AR, USA) [23]. ArrayTrack is a comprehensive tool for microarray data storage, analysis, and interpretation that has been developed at the FDA's National Center for Toxicological Research. To maintain consistency during the individual analysis of data sets, similar normalization methods, statistical tests, and parameters were used with all data sets. First, all data sets except data set 4 were normalized using Robust Multi-array Average (RMA). Then Welch's t test was performed on each data set individually. P value and fold change filters (0.01 and 1.5, respectively) were used to identify a unique list of DEGs from each data set (Figure , step 2). This list represented genes that were either notably upregulated or downregulated in the PBMCs of SLE patients when compared to the PBMCs of healthy controls. Each DEG list was then used to identify biological pathways significantly represented in SLE samples compared to the healthy controls (P < 0.01) in each data set (Figure , step 3). This pathway analysis was done using Ingenuity Pathway Analysis (IPA) software (Ingenuity Systems Inc., Redwood City, CA, USA).
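The per-gene test and filtering step can be illustrated with a minimal Python sketch. This is not the ArrayTrack implementation; it only shows Welch's t statistic with Welch-Satterthwaite degrees of freedom, followed by the two filters. The function names, the use of log2 ratios for the fold change, and the inclusive fold change boundary are assumptions made for illustration, and the P value itself is taken as a precomputed input (in practice it would come from a t distribution with the computed degrees of freedom).

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Return Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(group_a), len(group_b)
    # Squared standard error of each group mean (sample variance / n).
    se2_a = variance(group_a) / n_a
    se2_b = variance(group_b) / n_b
    t = (mean(group_a) - mean(group_b)) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, df


def select_degs(gene_stats, p_cutoff=0.01, fc_cutoff=1.5):
    """Keep genes passing both filters.

    gene_stats maps gene -> (p_value, log2_ratio), where log2_ratio is the
    log2 of SLE over control expression (an assumed convention).
    """
    min_abs_log2 = math.log2(fc_cutoff)
    return {
        gene
        for gene, (p_value, log2_ratio) in gene_stats.items()
        if p_value < p_cutoff and abs(log2_ratio) >= min_abs_log2
    }
```

Taking the absolute log2 ratio keeps both upregulated and downregulated genes, matching the description of the DEG lists above.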
Pathways common to all of the data sets were identified from the individual lists of pathways enriched in SLE patients compared to healthy controls (one for each data set) (Figure , step 4). The resulting list of pathways was indicative of processes significantly affected in all of the SLE data sets and comprised a pathway signature representative of all data sets and of the disease. From this common pathway signature, gene markers that met all of the following criteria were selected: (1) exhibited a fold change greater than 2 in at least one of the data sets (stringency increased from 1.5-fold to 2-fold to obtain a robust signature), (2) present in the DEG list in at least one of the data sets, and (3) involved in at least one of the commonly enriched pathways (Figure , step 5). These DEGs composed the collective signature (Figure , step 6).
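The selection logic in steps 4 to 6 amounts to a set intersection followed by three membership checks. The sketch below is a hedged illustration only: the input structures (`pathways_by_dataset`, `degs_by_dataset`, `abs_fold_change`, `pathway_genes`) are hypothetical containers, not formats produced by IPA or ArrayTrack, and at least one data set is assumed to be present.

```python
def common_pathways(pathways_by_dataset):
    """Intersect the per-data-set lists of enriched pathways."""
    pathway_sets = [set(p) for p in pathways_by_dataset.values()]
    return set.intersection(*pathway_sets)


def collective_signature(degs_by_dataset, abs_fold_change, pathway_genes, common):
    """Select genes that satisfy all three criteria.

    degs_by_dataset: data set -> set of DEG symbols
    abs_fold_change: data set -> {gene: absolute fold change}
    pathway_genes:   pathway -> set of member genes
    common:          set of pathways enriched in every data set
    """
    # Criterion 3: member of at least one commonly enriched pathway.
    genes_in_common = set().union(*(pathway_genes[p] for p in common))
    # Criterion 2: present in the DEG list of at least one data set.
    in_any_deg_list = set().union(*degs_by_dataset.values())
    signature = set()
    for gene in in_any_deg_list & genes_in_common:
        # Criterion 1: fold change greater than 2 in at least one data set.
        if any(fc.get(gene, 0.0) > 2.0 for fc in abs_fold_change.values()):
            signature.add(gene)
    return signature
```

Keeping the pathway intersection separate from the gene selection makes it easy to rerun the selection with a different set of data sets.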
Validation with the leave one data set out permutation method
To validate this technique, a leave one data set out permutation approach was employed (Figure , step 7). The meta-analysis technique described above was repeated four times, each time leaving out one of the four data sets (data sets 1 to 4) and performing the analysis using the remaining three data sets. This gave rise to four different scenarios (Table ). The gene signature obtained using the three data sets (for example, data sets 1 to 3) was then applied to the data set left out (for example, data set 4). Unsupervised visualization techniques such as PCA and HCA were performed to examine how well the signature could differentiate SLE patients from healthy controls (Figures and ).
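The leave one data set out loop can be sketched as follows. The callables `derive_signature` and `evaluate` are hypothetical placeholders for the meta-analysis workflow and for the PCA/HCA inspection, respectively, and the sketch makes no assumption about which held-out data set corresponds to which scenario in the table.

```python
def leave_one_out_scenarios(datasets):
    """Yield (training data sets, held-out data set) for each scenario."""
    for held_out in datasets:
        training = [d for d in datasets if d != held_out]
        yield training, held_out


def run_validation(datasets, derive_signature, evaluate):
    """Derive a signature from each training group and apply it to the
    data set that was left out."""
    results = {}
    for training, held_out in leave_one_out_scenarios(datasets):
        signature = derive_signature(training)
        results[held_out] = evaluate(signature, held_out)
    return results
```

With four input data sets this produces four scenarios, each using three data sets for signature derivation and one for evaluation.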
Scenarios for leave one data set out validation
Figure 2 Principal component analysis from all scenarios. There is a clear distinction between healthy samples and systemic lupus erythematosus (SLE) patients, shown in blue and red, respectively. (A) Scenario I. (B) Scenario II. (C) Scenario III. (D) Scenario IV.
Hierarchical clustering analysis for all scenarios. Blue branches indicate healthy samples, and red branches indicate SLE patients. (A) Scenario I. (B) Scenario II. (C) Scenario III. (D) Scenario IV.
Gene markers were generated for each of the four scenarios as described in the Meta-analysis section. Gene markers present in at least three of the four scenarios were grouped to comprise a 37-gene metasignature.
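The rule of keeping genes present in at least three of the four scenarios is a simple vote count. A minimal sketch, with a hypothetical function name and placeholder gene symbols in the usage note:

```python
from collections import Counter


def metasignature(scenario_signatures, min_scenarios=3):
    """Return genes that appear in at least `min_scenarios` signatures."""
    # set(sig) guards against a gene being counted twice within one scenario.
    counts = Counter(
        gene for sig in scenario_signatures for gene in set(sig)
    )
    return {gene for gene, n in counts.items() if n >= min_scenarios}
```

For example, given four scenario signatures in which a gene appears in only two, that gene is excluded, while a gene found in three or all four is retained.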
Confirmation using a fifth independent data set
Confirmation of the final 37-gene metasignature was done using an independent fifth data set, data set 5 (Figure , step 8). Again, PCA and HCA were carried out to evaluate the ability of the metasignature to differentiate SLE patients from healthy controls in this independent data set (Figure ).
Figure 4 Validating the 37-gene signature using independent data set 5. (A) Hierarchical clustering analysis shows blue branches indicating healthy samples and red branches indicating SLE patients. (B) Principal component analysis.
Results and discussion
SLE and healthy control data sets derived from Affymetrix microarrays were analyzed individually using ArrayTrack following quality control (Figure , step 1) and normalization procedures. DEGs for individual data sets were identified using a P value cutoff of 0.01 and a fold change cutoff of 1.5 (Figure , step 2).
Biological pathways identified in SLE patients through the leave one data set out permutation method
After applying the leave one data set out approach for each of the four scenarios (Table ), commonly enriched biological pathways were identified using IPA software (Table ). Three biological pathways were consistently enriched in SLE patients in all four scenarios: interferon (IFN) signaling, interleukin (IL)-10 signaling, and glucocorticoid receptor signaling. An additional pathway, LXR/RXR signaling, was identified only in scenario IV.
Biological pathways commonly and significantly enriched in the four scenarios
Previous studies have provided evidence of increased autoimmunity in patients undergoing IFN treatment [24]. More specifically, there is evidence of women developing SLE during IFN-α treatment [25]. Several studies have shown upregulation of the IFN signaling pathway in SLE patients [9]. Therefore, it is understandable that IFN signaling appears to be affected across all data sets.
IL-10 signaling appears to be dysregulated and may be indicative of the inflammatory processes involved in SLE. IL-10 binds to IL-10 receptor 1 on immune cells and activates the JAK-STAT signaling pathway, which is the key IFN signaling mechanism [31]. In support of the hypothesis that IL-10 is involved in SLE, IL-10 has been identified as one of the risk loci for SLE in a large genome-wide association study [32].
Glucocorticoid receptors are also believed to influence cytokine signaling and may be indirectly involved in the pathways underpinning SLE [33]. In fact, glucocorticoids are routinely used in the treatment of SLE patients.
Genes differentially expressed in SLE
Each of the four scenarios produced a gene signature: scenario I produced a signature comprising 51 genes, scenario II a signature with 31 genes, scenario III a signature with 34 genes, and scenario IV a signature with 28 genes. These DEGs represent the three main SLE disease pathways (IFN signaling, IL-10 signaling, and glucocorticoid receptor signaling) discussed in the section above.
A separate analysis of the pediatric and adult data sets was conducted to identify DEGs and pathways in the two populations. Similar gene expression patterns were observed in adult and pediatric populations, although the extent of upregulation of some of the genes was higher in the pediatric data sets (unpublished results).
Validation of the meta-analysis approach
For each scenario, the signature obtained using the three data sets was applied to the fourth data set to observe how effectively the expression of the signature genes could distinguish between the SLE and healthy populations.
The PCAs and HCAs obtained for each scenario are presented in Figures and , respectively. The PCA and HCA produced similar and consistent results. Grouping of samples based on the expression of signature genes alone produced a clear distinction between SLE patients and healthy controls. The results suggest that the DEG signatures derived by using the leave one data set out permutation approach in the four scenarios (Table ) can potentially identify a robust gene expression signature for SLE.
Gene expression signatures for SLE and Systemic Lupus Erythematosus Disease Activity Index scores
The Systemic Lupus Erythematosus Disease Activity Index (SLEDAI) is a validated scoring system that describes the range of disease activity; it is a weighted score calculated from the presence or absence of 24 symptoms. The association of SLEDAI scores with the expression profiles of SLE patients was evaluated. While the majority of the samples were grouped into their respective classes (SLE or control; see Figure ), 12 SLE patients exhibited expression profiles similar to the control samples. On closer examination of these samples, it was found that the scores for nine of the patients indicated that they were either in remission (SLEDAI score 0) or had mild disease activity (SLEDAI score 2 or 3). These findings lend further credence to the ability of the pathway-based meta-analysis approach used here to distinguish SLE patients from healthy controls. Correlation between SLEDAI scores and gene expression signatures has also been reported in the literature [9].
Metasignature for SLE
A 37-gene signature was generated by the meta-analysis workflow (Table ). Many IFN-induced genes involved in the IFN signaling pathway (Figure ), such as IFIT1, IFIT3, IFITM1, IFIT35, MX1, and OAS1, were present in the signature. Overexpression of IFN-regulated genes in PBMCs of SLE patients has been reported in several publications [9]. In addition to genes involved in the IFN signaling pathway, genes in cytokine signal transduction (SOCS1) were also among the DEGs in SLE patients. Differential expression of many other biomarkers associated with inflammatory and/or immune responses and with cellular proliferation was also observed, as shown in Table .
Signature genes and their functionalities
Interferon signaling pathways. Interferon-α, interferon-β, and interferon-γ signaling pathways are shown. The genes in blue represent differentially expressed genes that are part of the SLE metasignature.
Confirmation of the metasignature using an independent data set
This signature was applied to an independent fifth data set (data set 5) to evaluate its ability to distinguish the SLE samples from the control samples. Figure shows that the signature clearly differentiated SLE patients from healthy controls. In the HCA, nine of ten healthy samples clustered together and were clearly separated from the cluster of SLE samples (Figure ). The PCA also showed that the majority of the SLE samples and healthy samples were grouped separately (Figure ). The one SLE sample that clustered with the healthy samples had a SLEDAI score of 2, confirming our earlier observations with different data sets (Figure ).