Microarray-related technology, approaches, and limitations have been extensively reviewed elsewhere [1], and will be summarized below. Notably, there is now an emerging technology, RNA sequencing (RNA-Seq) [6], that has potentially intriguing applications for the field, but will not be further discussed as there are no RNA-Seq data specifically related to sepsis.
The fundamental technical innovation of microarray technology is the ability to simultaneously measure mRNA abundance of thousands of transcripts (transcriptomics). The technique generally involves reverse transcription of RNA into cDNA, with the inclusion of a labeling molecule for detection. The labeled cDNA (targets) is subsequently applied to a support surface arrayed with nucleotide sequences corresponding to specific genes (probes). The probes and targets hybridize via standard nucleic acid interactions, and the amount of hybridization reflects the abundance of a specific mRNA species. The supporting surface is subsequently washed and scanned to provide raw mRNA abundance data. An important limitation of transcriptomics is that it provides only a 'snapshot' of steady-state mRNA abundance. The degree of mRNA abundance is influenced by multiple factors, and does not provide any direct information about gene end products (proteins), nor about post-translational modifications of protein function, such as phosphorylation or glycation.
One major consideration in designing a microarray experiment involves the RNA source. Ideally, the RNA source should be relatively homogeneous and closely represent the disease/condition biology of interest. For example, the discovery of neutrophil gelatinase-associated lipocalin as a biomarker for acute kidney injury included microarray-based analysis of kidneys from rodents subjected to renal ischemia [7]. Most of the studies described below have used the blood compartment as the RNA source. Reliance on the blood compartment has obvious limitations with regard to specific organ perturbations in clinical sepsis, but also reflects the practical limitations of tissue sampling in clinical research and does provide a broad picture of a systemic response. Blood-derived RNA can come either from whole blood (a mixed population of blood cells) or from specific blood cells following their isolation. The whole-blood approach facilitates the procurement of samples from multiple centers, without the requirement for cell separation expertise, and has the potential to provide a comprehensive picture. However, the whole-blood approach has the potential to confound data interpretation due to heterogeneous blood cell populations. The cell-specific RNA approach provides a more homogeneous RNA source, but has the potential to miss biologically relevant expression signatures from cells that are excluded from the experimental approach. For example, a study that focuses exclusively on peripheral blood mononuclear cells will not account for the potentially important response of neutrophils.
Another important consideration in designing a microarray experiment involves the reference (control) group to which gene expression in the population of interest will be compared. For example, if one is interested in studying gene expression patterns in sepsis, relative to a normal state, then comparisons to normal controls are appropriate. In contrast, if one is interested in discovering gene expression patterns that distinguish sepsis from 'sterile inflammation', then a more appropriate control group would consist of patients who are not infected, but meet criteria for systemic inflammatory response syndrome (SIRS).
The heterogeneity and complexity that characterize clinical sepsis present an important challenge to clinical microarray studies. From one perspective, one could say that the comprehensive nature of a microarray approach is ideally suited for studying such a heterogeneous and complex syndrome. From another perspective, the heterogeneity and complexity are potentially profound confounders for data interpretation. Accordingly, it is critical that microarray data be interpreted in the context of robust clinical and biological data on the factors that can influence gene expression patterns. These factors include, but are not limited to, race, gender, age, co-morbidities, infecting pathogen class, state of immune competence, and therapy.
Analysis of microarray data is an evolving and complex field. A universal initial step involves data normalization, which allows valid comparisons across samples by reducing technical variations not directly related to biological variation [5]. A typical next step involves statistical comparisons across groups of interest using either parametric or non-parametric analysis of variance. Unfortunately, there is no clear consensus as to which statistical test is most appropriate for a given data set, and it is particularly troubling that lists of 'differentially regulated genes', from the same data set, can vary substantially based on the statistical test [8]. Regardless of which statistical test one uses, it is imperative that the test incorporates corrections for multiple comparisons to account for the high risk of false positives. One common filter that is applied to microarray data is an expression filter that compares mRNA abundance of specific gene probes in one cohort versus a reference cohort. Expression filters are useful to assess 'magnitude of effect' and to reduce the number of comparisons for a subsequent statistical test, but they are not valid substitutes for formal statistical testing. Finally, there is the issue of statistical power in microarray experiments, which can be calculated, but is dependent on assumptions that can be difficult to derive objectively [10]. In general, a heterogeneous study cohort will require substantially more independent samples, compared to a more homogeneous cohort.
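The expression filter and the correction for multiple comparisons described above can be illustrated with a minimal sketch. It is not the method of any particular study cited here. It assumes linear-scale, strictly positive expression values, a hypothetical 2-fold threshold, and the Benjamini-Hochberg false discovery rate procedure as one common choice of correction. The function names and the example data are invented for illustration.

```python
import math


def fold_change_filter(expr_case, expr_ref, min_fold=2.0):
    """Return gene IDs whose mean expression differs by at least `min_fold`
    (up or down) between the cohort of interest and the reference cohort.

    Assumption: values are linear-scale and strictly positive.
    """
    threshold = math.log2(min_fold)
    kept = []
    for gene, case_values in expr_case.items():
        ref_values = expr_ref[gene]
        case_mean = sum(case_values) / len(case_values)
        ref_mean = sum(ref_values) / len(ref_values)
        # Work on the log2 scale so up- and down-regulation are symmetric.
        if abs(math.log2(case_mean / ref_mean)) >= threshold:
            kept.append(gene)
    return kept


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical data: two genes, two samples per cohort.
case = {"gene1": [8.0, 8.0], "gene2": [1.0, 1.2]}
reference = {"gene1": [2.0, 2.0], "gene2": [1.0, 1.0]}
passing_genes = fold_change_filter(case, reference)

# Hypothetical raw p-values from a per-gene statistical test.
raw_p = [0.01, 0.04, 0.03, 0.005]
adjusted_p = benjamini_hochberg(raw_p)
```

In this sketch the filter only narrows the list of genes passed to the statistical test. The adjusted p-values, not the fold change, determine which genes are reported as differentially regulated.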
The statistical tests described above typically yield large lists of differentially regulated genes, thus leaving one with the challenge of assigning biological meaning to these gene lists. One approach to data interpretation involves the generation of 'heat maps', which statistically cluster genes and samples based on similarity of expression. Heat maps provide a broad picture of gene expression patterns and allow for the discovery of disease 'subclasses' based on differential gene expression [11]. Another approach to viewing large microarray data sets involves the generation of gene expression 'mosaics' based on a 'self-organizing map' algorithm [12]. These gene expression mosaics provide microarray data with a 'face' that is recognizable via intuitive pattern recognition, and were recently applied to allocate patients with septic shock into clinically relevant subclasses [14].
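The clustering step that underlies a heat map can be sketched as follows. This is an illustrative example only. It assumes a Pearson correlation distance and average-linkage agglomerative clustering, which are common choices but are not stated in the text above, and the function names and expression profiles are hypothetical.

```python
import math


def pearson_distance(x, y):
    """Return 1 - Pearson correlation for two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 1.0  # flat profile: treat as uncorrelated
    return 1.0 - cov / math.sqrt(var_x * var_y)


def cluster_order(profiles):
    """Average-linkage agglomerative clustering.

    `profiles` maps a name (gene or sample) to its expression values.
    Returns the names ordered so that similar profiles sit next to
    each other, as the rows or columns of a heat map would.
    """
    names = list(profiles)
    dist = {
        (a, b): pearson_distance(profiles[a], profiles[b])
        for a in names for b in names if a != b
    }
    clusters = [[name] for name in names]
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest mean pairwise distance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [dist[(a, b)] for a in clusters[i] for b in clusters[j]]
                avg = sum(pairs) / len(pairs)
                if best is None or avg < best[0]:
                    best = (avg, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0]


# Hypothetical profiles: two rising genes and two falling genes
# measured across four samples.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [4.0, 3.0, 2.0, 1.0],
    "geneC": [2.0, 4.0, 6.0, 8.5],
    "geneD": [8.0, 6.0, 4.0, 2.5],
}
row_order = cluster_order(profiles)
```

Applying the same function to the transposed data (sample names mapped to their values across genes) would give the column order. Groups of samples that end up adjacent are the candidate 'subclasses'.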
Beyond these global assessments of gene expression patterns, there exist a number of public and proprietary databases allowing for the assignment of biological function to gene lists. These databases examine uploaded gene lists and determine whether the gene list is enriched for genes that are biologically related, based on the established literature. The outputs from these databases range from generic (for example, 'immune response') to specific (for example, 'antigen presentation') biological processes. Furthermore, the outputs from these databases provide an estimate of significance (P-values) indicating how likely it is that a gene list would appear enriched for a given biological function by chance alone. The level of significance increases with the number of genes in the list that correspond to the given biological function, and decreases with the total number of genes in the list. A related approach to assigning biological meaning to gene lists involves the generation of gene networks based on known, direct and indirect, interactions between genes [16