Construction of peripheral blood mononuclear cell (PBMC) transcriptional modules
Comparing transcriptional profiles of two or more study groups generates long lists of differentially expressed genes. Because of the large number of comparisons performed (usually >10,000), these results are permissive to noise, which in turn can affect biomarker discovery and data interpretation (Ioannidis, 2005
; Michiels et al., 2005
In order to circumvent these hurdles we focused the analysis on small sets of coordinately expressed transcripts. Indeed, the probability for multiple transcripts to follow a complex pattern of expression across dozens or hundreds of conditions only by chance is low, and such sets of genes should therefore constitute coherent and biologically meaningful transcriptional units. Thus, we designed an algorithm for constructing sets of coordinately expressed transcripts (i.e. modules) from PBMC microarray profiles generated from a wide range of diseases. This stable modular framework was then used as a basis for the analysis of separate PBMC dataset.
The algorithm used for the construction of transcriptional modules is described in detail in the methods section (Supplementary Figure 1
). Briefly, the first step of the module construction process analyzes expression patterns of transcripts across samples for individual diseases: sets of coordinately expressed transcripts were identified using an unsupervised clustering algorithm; in this case, the GeneSpring Version 7.1 (Agilent) implementation of the K-Means algorithm (k=30). All transcripts detected in at least one sample were used as input; no screening for differential expression was performed. The second step of the module construction process analyzed the “clustering behavior” of transcripts across diseases, taking into account the possibility that genes may co-cluster in some diseases but not others. Also, in our example the transcripts that clustered together across all 8 diseases were grouped to form a set of modules (round 1 of selection) and the stringency of the analysis was then decreased gradually to identify transcripts that belong to similar K-means cluster in only a subset of diseases (round 2: 7 out of 8 diseases & round 3: 6 out of 8 diseases). This analysis of gene cluster membership across diseases relates to “graph theory” which is used in the mathematics and computer science fields to model pairwise relations between objects (Biggs, 1986
). It is important to note that the module selection process is “data-driven” and does not involve manual selection of genes by the investigator.
We implemented the module construction strategy described above using as input a total of 239 peripheral blood mononuclear cell (PBMC) samples obtained from individuals with one of the following conditions: systemic juvenile idiopathic arthritis (n=47), systemic lupus erythematosus (n=40), type I diabetes (n=20), metastatic melanoma (n=39), acute infections (Escherichia coli
(n=22), Staphylococcus aureus
(n=18), Influenza A (n=16)), or liver transplant recipients undergoing immunosuppressive therapy (n=37). Transcriptional profiles were generated using Affymetrix U133A and U133B GeneChips (>44,000 probesets). A total of 4742 transcripts distributed among 28 sets were selected upon running the module construction algorithm described above (Supplementary Figure 1
; a complete list is provided in Supplementary Table 1
). Each module is assigned a unique identifier indicating the round and order of selection (i.e. M3.1 is the first module identified in the third round of selection).
The stringency of this algorithm was tested statistically by implementing the same module construction procedure after randomization of the original dataset. This process was repeated two hundred times without a single module being identified (See supplementary experimental procedures
for details). Therefore, the analysis of gene cluster membership across multiple diseases provided a stringent means to identify PBMC transcriptional modules.
The next step consists in characterizing each module functionally. Keyword occurrence in PubMed abstracts associated with the genes within each module were analyzed by literature profiling (Chaussabel and Sher, 2002
). Differences in patterns of keyword occurrence across modules were observed as illustrated in Supplementary Figure 2
, and functional associations were identified for each of the 28 PBMC transcriptional modules (). Out of 28 PBMC modules, 14 could be clearly linked with pathways or cell types involved in immune processes (as detailed below). Functional associations were also observed in the remaining 14 modules. M2.5, for example, includes genes encoding immune-related molecules - CD40
- as well as cytoskeleton-related molecules - Myosin
, Dedicator of Cytokenesis
, Syndecan 2
, Plexin C1
). (See for details). Thus transcriptional modules form coherent transcriptional and functional units.
Using modules to map transcriptional changes in health and disease
After identifying sets of coordinately expressed PBMC transcripts based on the analysis of patterns found in a wide range of diseases, we used this modular framework as a stable basis for analyzing individual PBMC datasets.
Modules were conceived as a stable framework for the analysis of data generated independently from the sets initially used for module construction. We analyzed PBMC microarray transcriptional profiles generated from 14 patients with acute Streptococcus pneumoniae infection and 10 age- and sex-matched healthy control subjects. This dataset was not used in the module selection process. Statistical comparisons between patient and healthy control groups were performed independently on a module-by-module basis (Mann-Whitney rank test, p<0.05). The transcriptional profiles of differentially expressed genes were then represented on a graph for individual modules (). The pie-chart indicates the proportion of differentially expressed transcripts for a given module (e.g. 49% of the 322 transcripts forming module M3.2 were overexpressed in patients with acute S. pneumoniae infection compared to healthy controls). As shown in , differentially expressed genes in each module were either predominantly overexpressed or predominantly underexpressed. This observation is notable because modules were not selected based on differences in expression between study groups but instead on clustering patterns.
Analysis of patient blood leukocyte transcriptional profiles
To graphically represent global transcriptional changes, spots were aligned on a grid, with each position corresponding to a different module (). Spot intensity positively correlates with the proportion of differentially expressed transcripts, whereas spot color indicates the polarity of the change (red: overexpressed, blue: underexpressed). The resulting map represents molecular perturbations associated with a disease state. A blank grid would indicate that no significant differences exist between the disease and healthy baseline. Conversely, the presence of blue and red spots as in indicates that expression levels of sets of transcripts are increased or decreased in patients with acute S. pneumoniae infection in comparison to healthy controls. In addition, to facilitate data interpretation, modules' coordinates were associated to the functional categories which have been previously assigned (, ). For instance, Modules M1.3 and M2.1, respectively associated with B-cells and cytotoxic cells, are under-expressed in the S. aureus group when compared to healthy, while modules M2.2 and M3.2, respectively associated with neutrophils and inflammation are over-expressed.
Thus, we showed that sets of transcriptional modules can be used as a reference for the analysis and interpretation of data generated independently from those used for module construction. Furthermore we have developed means to visualize transcriptional changes on a module-by-module basis, which in conjunction with functional annotations yield an interpretable representation of microarray results.
We next generated module maps for three additional groups of patients (22 Systemic Lupus Erythematosus (SLE), 16 metastatic melanoma, and 16 liver transplant recipients) compared to their respective control groups composed of 10 to 12 healthy donors who were matched for age and sex (, Supplementary Table 2
). Results for M1.1 and M1.2 alone distinguished all four diseases (S. pneumoniae
: M1.1 = no change, M1.2 = over-expressed; SLE: M1.1= over-expressed, M1.2 = no change; melanoma M1.1 = under-expressed, M1.2 = over-expressed; transplant: M1.1 = under-expressed, M1.2 = under-expressed). A number of genes in M3.2 (“inflammation”) were overexpressed in patients with melanoma, S. pneumoniae
infection as well as transplant recipients, while genes in M3.1 (“interferon-inducible”) were overexpressed in patients with SLE and, to a lesser extent, in transplant recipients. M2.1 and M2.8 include, respectively, cytotoxic cell-related genes and T-cell transcripts, which are underexpressed in lymphopenic SLE patients and transplant recipients treated with immunosuppressive drugs. Overall, whereas these comparisons showed that modules can be shared between diseases (e.g. under-expression for M1.3 transcripts in both S. pneumoniae
and Melanoma groups) global modular changes remained disease-specific.
Module maps provide a means to organize and reduce the dimension of complex data, and thereby to facilitate its interpretation. However useful, this oversimplified representation lacks at the same time the depth that systems-scale analyses are able to provide. Representing changes at the module-level with a red or blue spot for instance does not indicate which of the genes are significantly changed. Indeed, a spot of the same color in two different diseases may be attributed to two different subsets of genes belonging to the same module. The disconnect between gene-level and module-level data is especially apparent when the results are presented in a static format; i.e. on paper. Thus, we have developed an interactive web-interface allowing users to switch seamlessly between the module-level and gene-level (Supplementary Figure 3
). Interactive module maps can be accessed at: www.biir.net/modules
. In addition to the four datasets analyzed in the context of this manuscript we loaded on this tool third party datasets made publicly available by others (Burczynski et al., 2006
). Mapping transcriptional changes in patients with Crohn's disease and ulcerative colitis highlighted similarity and differences between these diseases, with for instance a characteristic over-expression of transcripts linked to plasma cells (M1.1) in ulcerative colitis. This repository will be updated as more blood transcriptional data become available from our and other groups.
Using modules as a basis for the discovery of blood transcriptional biomarkers
Microarray gene expression data generated from blood not only provide valuable insights into mechanisms of disease pathogenesis but also constitute a promising source of biomarkers. The difficulty, however, lies in the extraction of indicators of potential clinical value from the vast amounts of data generated. We used modular transcriptional data as the foundation of our biomarker discovery strategy. This approach was implemented using a dataset generated from patients with SLE.
Microarray analyses have been carried out on peripheral blood mononuclear cells obtained from pediatric and adult SLE patients (Baechler et al., 2003
; Bennett et al., 2003
; Crow et al., 2003
; Kirou et al., 2004
). Using an earlier generation of Affymetrix arrays (~12,600 probe sets), we identified a type I interferon (IFN) signature in all active pediatric patients (Bennett et al., 2003
). This analysis also revealed the presence of neutrophil, immunoglobulin (Ig) and lymphopenic signatures that correlated with the presence of low density granulocytes, plasma cell precursors and a reduction in lymphocyte numbers in SLE blood, respectively (Bennett et al., 2003
These findings were confirmed in the present study, with significant changes observed in modules M3.1, M2.2, M1.1 and M2.8 (interferon-inducible, neutrophils, plasma cells and T lymphocytes, respectively) for a new dataset generated from a cohort of 22 pediatric lupus patients sampled at the time of diagnosis and before initiation of treatment. Transcriptional changes were observed in 7 additional modules (M1.7, M2.1, M2.3, M2.4, M2.5, M2.6, and M2.7). Two of these modules, M1.7 and M2.4, included transcripts encoding ribosomal protein family members whose expression was recently found altered in acute infection and sepsis (Calvano et al., 2005
; Thach et al., 2005
). Furthermore, our unpublished observations have shown that in vitro exposure of purified human monocytes to interferon alpha results in a late downregulation of the transcripts forming these modules. In addition, marked changes in gene expression were also observed for modules M2.1 and M2.3 which include transcripts expressed in cytotoxic cells and erythrocytes, respectively. Interestingly, the pattern of change in M2.1, M2.2, M2.3 and M2.4 for the SLE group was well conserved across diseases. Indeed, increased expression for M2.2 & M2.3, and decreased expression for M2.1 and M2.4 was also observed in transplant recipients, as well as in patients with acute S. pneumoniae
infections. This partial convergence is likely to reflect the existence of core transcriptional responses to disease or injury (e.g. inflammation).
The proposed biomarker selection strategy relies on modules for reducing highly dimensional microarray datasets in a step-wise manner (). Starting from the full set of 28 modules only those for which a set minimum proportion of transcripts are significantly changed between the study groups are selected (; e.g. minimum proportion of differentially expressed transcripts at p<0.05 = 15% over-expressed or under-expressed transcripts; in the example given 11 SLE modules meet this criterion). This eliminates from the selection pool the modules registering fewer consistent changes that may be attributed to noise. The cutoffs used for gene selection can be adjusted to adapt the number of candidate markers that will be returned by this analysis.
Module-based biomarker selection strategy
We next generated composite values for each sample. The arithmetic average of normalized expression values across significantly over-expressed or under-expressed genes selected from each module was calculated (). Each resulting “transcriptional vector” recapitulates the expression of a given module (or select set of genes within a module) in a given patient. A spider graph connects all the vector values obtained for each patient (). This is in contrast with the module maps defined earlier, which display the frequency of significant changes for an entire patient cohort module-by-module.
SLE patient profiles are linked to disease activity
Transcriptional vectors were derived for the entire cohort of 22 untreated pediatric SLE patients using the set of 11 SLE modules detailed above (the 628 differentially expressed genes distributed among those 11 vectors are listed in Supplementary Table 3
). On each line represents the expression profile of one patient; the thicker line shows the average expression for the patients forming this group. The values are normalized per-gene using the median expression value of healthy and are represented on a logarithmic scale. displays the expression pattern characteristic of healthy volunteers. Differences between the healthy and SLE groups were statistically significant for each of the modules (p<0.01 Mann-Whitney U test). Patient profiles were also generated for an independent set of 31 children with SLE treated orally with steroids and/or cytotoxic drugs and/or hydroxychloroquine ( – Supplementary Table 4
). Interestingly, average profiles for both treated and untreated patient cohorts were almost superimposable ( – no significant difference at p<0.01; V2.2 p=0.04; Mann Whitney U test). However, patient selection in both groups was such that they presented similar disease activity as measured by the clinical index SLEDAI (SLE disease activity index – untreated patients average=11.5 ± 7.9; treated patients=9.4 ± 6.4, Student's t-Test p=0.3).
In order to investigate a possible link between SLE activity and patient transcriptional profiles we stratified samples based solely on SLEDAI scores. Samples from patients with mild disease activity (SLEDAI [0-6]) presented a profile closer to that of healthy subjects (); whereas patients with high disease activity (SLEDAI [14-28]) presented an exacerbated profile ( – comparison of mild vs. high disease activity: V1.7, V2.2, V2.3, V2.4, V2.8 & V3.1: p<0.01; V1.1 p=0.07; V2.1 p=0.06; V2.5 p=0.8; V2.6 p=0.02; V2.7 p=0.08; Mann Whitney U test).
These results suggest that composite transcriptional vectors identified in SLE patients are associated with disease severity and have potential value as biomarkers.
Generating multivariate transcriptional scores as indicators of SLE disease progression
SLE is a heterogeneous multisystemic disease presenting a wide range of clinical and laboratory abnormalities. Objectively assessing disease activity across patients or longitudinally in individual patients can therefore be challenging. At least 6 composite measures of SLE global disease activity have been developed (Bae et al., 2001
; Bencivelli et al., 1992
; Bombardier et al., 1992
; Hay et al., 1993
; Liang et al., 1989
; Petri et al., 1999
) and have been used to assess disease progression during clinical trials. These measures, however, rely on a series of clinical and laboratory findings and are cumbersome to obtain. The SLEDAI, one of the simplest measures, considers 24 different attributes that need to be obtained at every clinic visit. Additionally, given the heterogeneous nature of the clinical disease, not all SLE manifestations are computed within these measures, making the overall assessment of the patient sometimes difficult. Thus, establishment of an objective disease activity index would be beneficial. We engaged to assess whether such activity index could be generated from blood leukocyte microarray transcriptional data.
The analysis of pediatric SLE patient profiles carried out above showed a link between transcriptional vectors and clinical disease manifestations. We have previously found that expression of individual genes, or gene signatures (such as interferon-inducible genes) could be affected by treatment (Bennett et al., 2003
). Therefore, our aim was to maximize the number of transcriptional signatures used as a basis for the generation of a clinical indicator of disease activity. We computed correlations between composite expression values for individual dimensions (transcriptional vectors) and the clinical activity index (SLEDAI) for each of the patients in our untreated cohort (). We found that two dimensions (corresponding to V2.2 - “neutrophil” - and V3.1 - “interferon-inducible” - modules) correlated positively with disease activity, whereas dimensions corresponding to V1.7, V2.4 and V2.8 (“ribosomal proteins” and “T cells”) correlated negatively. We next verified that differences in expression observed for a selection of transcripts that belong to these five modules could be confirmed using real-time PCR. Two transcripts from M1.7, M2.2, M2.4, M2.8 and M3.1 were tested in 10 healthy controls and 25 patients. Differences in expression were significant in 9 out of the 10 transcripts tested (CCDC72
p=0.8). Only one of the transcripts tested, GLTSCR2
, did not display the expected difference between control and SLE groups. This degree of concordance is consistent with rates reported in the literature (Bosotti et al., 2007
). The discrepancy could be attributed to differences in probe selection for the respective assays. Expression values obtained by real-time PCR for the 9 differentially expressed transcripts were also significantly correlated with microarray data (Supplementary Figures 4 & 5
SLE transcriptional vectors correlate with disease activity
A non-parametric method for analyzing multivariate ordinal data was used to score the patients based on these five dimensions (Spangler et al., 2004
; Wittkowski et al., 2004
). The advantage of this approach is that there is no need for additional assumptions and validations. Once available knowledge has been incorporated by making the initial transformations the proposed scores are valid by construction, as long as each variable increases or decreases with the unobservable latent factor. Thus, no empirical evaluation is needed. Because no assumptions are made regarding the functional form of the relationship, U- scores are scale independent.
U-scores were obtained for all patients in the untreated cohort (n=22). A polarity of 1 was attributed to vectors correlating positively with disease activity (i.e. Neutrophil: V2.2, Interferon: V3.1). The polarity of vectors correlating inversely with disease activity was set to -1 (T cells: V2.8, and Ribosomal proteins: V1.7 and V2.4). This allowed the ranking of all patients within this group. U-scores have positive (most severe disease) or negative values (less severe disease) reflecting the rank of each sample vs. the other patients forming this cohort. The association between the multivariate “transcriptional scores” and SLEDAI was assessed using linear regression and was determined to be statistically significant (; r=0.83, df=1, t=6.66, and p-value<.0001). The correlation achieved by this score was superior to that of its individual components. Using the same process, correlation between “Transcriptional score” and SLEDAI was examined for the treated pediatric SLE patient cohort (n=31) and was found to be statistically significant as well (; r=0.63, df=1, t=4.40, and p-value=0.0001).
Thus, distinct immunological signatures associated with the pathogenesis of SLE have been reduced to a unique multivariate score correlating with disease activity.
Multivariate transcriptional scores are used to monitor disease progression in patients with SLE
Lupus disease flares can lead to irreversible worsening of the status of the patient. We tested the relevance of the multivariate transcriptional score for the longitudinal monitoring of disease activity a cohort of 20 pediatric SLE patients (two to four time points/patient, intervals between each time point varied from one month to 18 months). Half of the patients had been included in our cross-sectional analysis before they were enrolled in this longitudinal study.
During the follow up period, the SLEDAI fluctuated in 10 patients whereas it remained constant in the other 10 (). Parallel trends were observed between transcriptional U-scores and SLEDAI longitudinal measures in a majority of patients. The positive association between SLEDAI and transcriptional scores was verified statistically using a linear regression model. The estimated model was: transcriptional score = 18.13 + 1.26 (SLEDAI) - 0.03 (Days). The overall model was statistically significant (df=1, chi-sq=28.44, and p-value<.0001), as was the association between the SLEDAI and transcriptional scores (df=28, t=2.41, and p-value=0.0229). For every one unit increase in SLEDAI score the transcriptional score increases by 1.3. Overall SLEDAI index and transcriptional scores reflected similar activities according to their respective scales in all but 6 patients (SLE31, SLE78, SLE125, SLE130, SLE135 and SLE 99) in whom the transcriptional U-scores were disproportionately high compared to SLEDAI index (SLEDAI values are positive while multivariate U-scores can be positive or negative). One of the patients with the highest discrepancy (SLE78) was diagnosed during the follow-up period with a life-threatening complication (pulmonary hypertension) which is not computed within the SLEDAI. Thus, severity of disease was more accurately assessed by the transcriptional score. Disease flaring and subsequent recovery was detected in one patient (SLE31) upon longitudinal follow up using both SLEDAI and transcriptional score. Interestingly, however, the amplitude of change observed in the case of the transcriptional U-score appears not only to be much greater (0 to 40 vs. 6 to 10 for SLEDAI), but an increase could already be detected at the second time point, 2 months before the worsening of the clinical condition of this patient was detected by SLEDAI. Thus, these data illustrate the potential value of microarray data and the multivariate transcriptional scores derived from it for the longitudinal follow up of disease progression in patients with complex multisystemic diseases like SLE.
Longitudinal disease monitoring with a multivariate disease activity score
Composite transcriptional vectors are stable across laboratories and microarray platforms
To be truly viable as biomarkers, composite transcriptional vectors must prove reliable. Early on, poor reproducibility of microarray results obtained by different laboratories and across platforms raised suspicion about the validity of these results and remains a major concern (Bammler et al., 2005
; Frantz, 2005
; Ioannidis, 2005
; Irizarry et al., 2005
; Larkin et al., 2005
; Michiels et al., 2005
; Shi et al., 2006
). We compared transcriptional profiles obtained using two commercial microarray platforms, Affymetrix and Illumina. PBMCs were isolated from four healthy volunteers and ten liver transplant recipients. Starting from the same source of total RNA, targets were generated independently and analyzed using Affymetrix U133 GeneChips (at the Baylor Institute for Immunology Research) and Illumina Human Ref8 BeadChips (at Illumina Inc.). Fundamental differences exist between the two microarray technologies (see Methods for details). Probe IDs provided by each manufacturer were converted into a common ID that was used for matching gene expression profiles. Overall the Affymetrix and Illumina profiles for M3.1 appear to be similar (). However, correlations comparing both platforms performed for individual genes forming M3.1 resulted in a median R2
value of 0.36 (ranging from 0.17 and 0.55) In other modules such as M1.2 and M3.2 correlations observed at the level of individual genes were also poor (R2
median (range) = 0.13 (0.02-0.5) for genes forming M1.2; and 0.19 (0.06-0.4) for genes forming M3.2 – ). In order to compare overall modular expression pattern across the two platforms we derived for each module a composite transcriptional vector (averaging the values obtained for the genes forming each module). Remarkably, the module-level expression values thus derived from Affymetrix and Illumina data were highly comparable (; transplant group Pearson correlation coefficient R2
= 0.83, 0.98 and 0.93, for M1.2, M3.1 and M3.2 respectively; p<0.0001 – in addition R2
values of M1.1=0.84; M1.3=0.95; M1.4=0.81; M1.5=0.74; M1.8=0.62; M2.1=0.98; M2.2=0.82; M2.3=0.99; M2.6=0.73; M2.8=0.83; M2.10=0.66; M3.3=0.65; M3.8=0.57; R2
values for other modules <0.5). Taken together, these results indicate that module-level composite expression data produce a more stable metric than individual gene expression values, thereby enhancing data reproducibility across microarray platforms. This property may be attributed to the stringent module selection process (transcripts must be co-expressed across many samples) and the fact that composite expression values are derived from multiple measurements (smoothing the imprecision observed at the level of individual probes).
Cross-microarray platform comparison