|Home | About | Journals | Submit | Contact Us | Français|
This article describes and illustrates a novel method of microarray data analysis that couples model-based clustering and binary classification to form clusters of `response-relevant' genes; that is, genes that are informative when discriminating between the different values of the response. Predictions are subsequently made using an appropriate statistical summary of each gene cluster, which we call the `meta-covariate' representation of the cluster, in a probit regression model. We first illustrate this method by analysing a leukaemia expression dataset, before focusing closely on the meta-covariate analysis of a renal gene expression dataset in a rat model of salt-sensitive hypertension. We explore the biological insights provided by our analysis of these data. In particular, we identify a highly influential cluster of 13 genes—including three transcription factors (Arntl, Bhlhe41 and Npas2)—that is implicated as being protective against hypertension in response to increased dietary sodium. Functional and canonical pathway analysis of this cluster using Ingenuity Pathway Analysis implicated transcriptional activation and circadian rhythm signalling, respectively. Although we illustrate our method using only expression data, the method is applicable to any high-dimensional datasets. Expression data are available at ArrayExpress (accession number E-MEXP-2514) and code is available at http://www.dcs.gla.ac.uk/inference/metacovariateanalysis/.
Microarray experiments allow the simultaneous expression measurements of tens of thousands of genes in a biological sample and have been employed extensively to investigate human disease since the early 90s (1). Despite almost two decades of research, challenges regarding the analysis of these data remain. Typically, the number of variables (or probes) measured vastly outnumbers the number of replicate experiments: over 30000 probes might be measured in only three or four samples, making good predictive performance possible by chance, irrespective of whether the data contain relevant patterns. In addition, many variables will exhibit similar patterns across the samples; we require methods that identify which of these correlations are the result of genuine functional relationships and/or co-regulation and which are merely observed by chance. Taken together, these features make microarray analysis statistically demanding, prone to variability in model parameter estimates and ultimately susceptible to inaccurate prediction.
Our meta-covariate method is a novel approach to analysing microarray data, which overcomes and, in the case of correlated expression patterns, exploits the statistical properties of gene expression data, with a view to improving prediction and identifying biologically relevant structure in the data (2). It is, however, applicable to any high-dimensional dataset (including proteomics, sequencing and miRNA datasets) where informative correlations exist between the variables. Initially, the D probes are grouped into K clusters, using gene expression similarity across the samples and a standard Gaussian mixture model. An -dimensional meta-covariate vector is then generated from each cluster and predictions are made by weighting these meta-covariates in a probit regression model. We then take the novel step of using the prediction performance to update the clustering structure, the meta-covariates and the regression weights. This iterative procedure is repeated until convergence (Figure 1).
The approach of reducing microarray data dimensionality by forming clusters (independent of predictions) and making subsequent predictions using cluster summaries has been adopted previously by Hanczar et al. (3) and Park et al. (4) amongst others. Where our method improves upon existing methods is that inter-predictor correlations are coupled with predictor-outcome correlations to inform the clustering structure, the cluster summaries (meta-covariates) and the regression weights (indicated by the turquoise arrow in Figure 1). The advantages of our method are three-fold. First, the clustering component of the model identifies response-relevant structure in the data, aiding biological interpretation. Second, the regression coefficients allow the identification of influential clusters: the greater the weight assigned to a cluster in the regression model, the more `informative' that cluster is when discriminating between the outcomes of the response variable. And finally, using the predictor-response correlations to fine-tune the clustering structure in the model potentially improves prediction performance.
In this article, we will first demonstrate how the meta-covariate method works using the well-known leukaemia dataset described by Golub et al. (5). We will then employ the method to analyse gene expression data in the rat kidney to investigate the genetics of salt-sensitive hypertension.
In the Golub et al. (5) dataset, bone marrow or peripheral blood samples were taken from 25 acute myeloid leukaemia (AML) and 47 acute lymphoid leukaemia (ALL) patients. The training data contain 38 samples, of which 11 are AML and 27 are ALL samples. The test data contain 34 samples, of which 14 are AML and 20 are ALL. RNA extracted from these samples was tagged and subsequently hybridized to a high-density Affymetrix oligoneuclotide microarray (Hu6800/HuGeneFL). The expression data were obtained from the Broad Institute Website and pre-processed as recommended in Dudoit et al. (6) (see Supplementary Data for details), leaving 3571 probes for analysis.
Inbred colonies of SHRSP and WKY have been maintained at the University of Glasgow since 1991, as previously described (7). From weaning, all rats were maintained on normal rat chow (rat and mouse No.1 maintenance diet, Special Diet Services) and at 18 weeks of age rats were given a salt challenge (1% NaCl in drinking water) for 3 weeks.
The congenic strain SP.WKYGla2a (D2Rat13-D2Rat157) was generated using a marker-assisted ‘speed’ congenic strategy (8) where a WKY (donor strain) segment was introgressed into the SHRSP (recipient strain) genetic background.
Affymetrix GeneChip renal expression analysis was used to identify differentially expressed probe sets (representing unique gene or expressed sequence tag sequence on the Affymetrix GeneChip) between male, 21-week old, age-matched salt-loaded and non-treated SHRSP, SP.WKYGla2a, and WKY rats. Whole kidneys (harvested between 10 a.m. and 12 noon and snap-frozen in liquid nitrogen) were homogenized and total RNA extracted from three rats from each strain by using the maxi RNeasy kit according to the manufacturers protocol (Qiagen). Biotinylated amplified target cRNA was prepared and hybridized to the Affymetrix Rat Rat230-2 gene chips as described by Affymetrix. After hybridization, microarray chips were washed, stained and scanned. The microarray data were normalized in R using RMA (9). To assess the statistical significance of pairwise intergroup differences, Rank products (RP) (10) was used, corrected for multiple testing using a false discovery rate of 5% (11). The microarray dataset has been submitted to ArrayExpress (Accession Number E-MEXP-2514).
Renal total RNA was extracted from 21-week-old salt-loaded and non-treated male rats using RNeasy kits (Qiagen), treated with DNase-Free RNase (Ambion), and accurately quantified. qRT–PCR was performed using Taqman (Applied Biosystem, UK) Actb (-actin) labelled Vic, as a normalization control and either Arntl (Rn00577590_m1), Npas2 (Rn01438224_m1), Nfil3 (Rn01434874_s1) and Bhlhe41 (Rn00591084_m1) labeled FAM. Arntl, Npas2, Nfil3 and Bhlhe41 were normalized to Actb, expressed relative to SHRSP (non-salt treated) in each sample using the comparative (CT) method.
As described in the ‘Introduction' section, the novelty of this method lies in the coupling of the clustering and prediction components (as depicted by the turquoise arrow in Figure 1). These components are coupled by optimising all the parameters (i.e. the parameters pertaining to both components) simultaneously, rather than optimising the clustering parameters before the prediction parameters. Here, we have chosen a Gaussian mixture model as the clustering model (12, Section 9.2) and probit regression (12, Section 4.3.5) as the prediction method. We optimize the parameters of these models using the Expectation-Maximisation (EM) algorithm (12, Section 9.4), which finds the most likely parameter estimates in a probabilistic model by updating them over a number of iterations. Our model updates are derived by merging the standard EM updates for the clustering and regression parameters.
Intuitively, the meta-covariate model can be thought of as follows: (A) all probes on the array are clustered into K groups, and each group is then represented by some average of its members; (B) these cluster averages (which we call the `meta-covariates') are then used to predict the response by assigning each meta-covariate a weight in a regression model; (C) we update the cluster structure (Step A) and the regression weights (Step B) depending on how well the meta-covariate regression model predicts the response. It is also important to appreciate that this method can be used as an exploratory tool as well as a prediction algorithm.
The significant parameters in this model are , π, , and w. w is a vector, containing the weights assigned to each meta-covariate (and therefore each cluster) in the regression model. Each value in w indicates how much influence each cluster has in determining the value of the response and therefore how informative it is when discriminating between different values of the response (in the hypertension dataset, the response is salt-loaded or non-salt-loaded, while in the leukaemia dataset, the response is AML or ALL). The other four parameters are relevant to the clustering model. is a matrix containing the meta-covariate representations of the clusters and is a matrix that describes the variance within each cluster in the model; i.e. θk and are the mean and covariance of the k-th cluster. π is a vector containing the proportion of probes in the dataset that are assigned to each cluster, which are the `mixing coefficients'. is a matrix containing the `responsibilities' that each cluster takes for explaining each probe; each element of can be interpreted as the probability that a particular probe belongs to a particular cluster (the values for any probe will sum to 1). To generate assignments, a probe is assigned to the cluster to which it has the highest probability of belonging. Using such `soft' clustering (rather than `hard' clustering, where each probe is assigned to a cluster with a probability of 1), aids the interpretation of the model.
Our EM procedure iteratively updates the values of π, , , and w (and others, see Supplementary Data) until the model converges. More specifically, given some number of clusters K, the goal is to maximise the log joint distribution with respect to the parameters, π, , , and w, until the model converges. Here, the convergence criterion is an increase in the log joint distribution of or some maximum number of iterations. Note that the value of K must be set before optimisation, necessitating a model selection step that identifies which K is best for a given dataset.
Full details of our method are given in the Supplementary Data, Sections 1.2–1.3 and MATLAB code is available at http://www.dcs.gla.ac.uk/inference/metacovariateanalysis/.
All probe to gene mappings; gene to pathway mappings and network analysis tools were taken from Ingenuity Pathway Analysis software (IPA, http://www.ingenuity.com/) as of October 2009. Molecular interactions between genes were mapped to a common pathway using the Pathway Explorer function within IPA software.
A well-established leukaemia dataset containing expression data for AML and ALL was used initially to illustrate our method (2). Our method was then applied to a novel dataset of renal gene expression data with a view to providing insight into salt-sensitive hypertension. Throughout this section, clusters will be represented as where gives the ID of that cluster in the dataset () where denotes the Golub et al. dataset and denotes the hypertension dataset.
Leukaemia is a broad term to describe cancer of the blood or bone marrow. Haemopoiesis, the process of blood production, is organized hierarchically with the haemopoietic stem cell at the apex. The first major lineage diversion is between myeloid and lymphoid progenitors. In AML there is a block to differentiation with a rapid accumulation of abnormally proliferating myeloid blasts. This process is mirrored in ALL, but in this case, the blasts are of lymphoid morphology (13, Chapter 12).
In 1999, Golub et al. published work in which previously unseen samples could be classified according to their gene expression profiles; using a weighted vote of 50 probes, they successfully classified all but one of the samples in the test set of 34 samples (14 AML and 20 ALL samples). This dataset has been subject to extensive analysis in the past decade and predictions made from these data are consistently of good quality, regardless of the approach taken: using a sparse Bayesian classification model, Bae and Mallick (14) misclassified two test samples; Lee and Lee (15) used support vector machines to analyse an extended multinomial version of the Golub et al. dataset and achieved a misclassification rate of 1; Tibshirani et al. (16) used the nearest shrunken centroids and misclassified two samples; and using a hierarchical Bayesian model, Lee et al. (17) misclassified only one sample.
Although AML and ALL are both forms of leukaemia, they cause accumulation of different types of cell (5). As such, there will be many differences between the two sets of samples in this dataset that are attributable to cell type, rather than the molecular pathology of the two diseases. These cellular differences may be responsible for the ease with which the AML and ALL samples are discriminated in the literature. It must also be noted that there are subtypes of AML and ALL (18)—in the process of haemopoiesis, myeloid and lymphoid progenitor cells give rise to further cell lineages, where subtypes of AML and ALL describe cancers exhibiting variable levels of differentiation towards mature myeloid and lymphoid cells—and that the Golub et al. dataset pools all AML and ALL subtypes together. In addition to the heterogeneity inherent in the disease, the samples in the dataset vary with respect to the age of the patient and with respect to sample type (e.g. both bone marrow aspirates and peripheral blood mononuclear samples are used). As such, we expect that any biology captured by our model would represent very `general' characterizations of myeloid and lymphoid cells.
The Golub data were pre-filtered as described by Dudoit et al. (6). In our representation, AML samples have been encoded as 1 and ALL samples have been encoded as 0; therefore, positively weighted clusters are predictive of AML samples (these clusters will be described as AML+) and negatively weighted clusters are predictive of ALL (such clusters will be described as ALL+). A model selection step identified as the best model using the criterion of minimum average test error (the model selection step performed 1000 iterations of the EM algorithm, where ).
The maximum a posteriori (MAP; 12, pp. 30) solution for this model discriminates perfectly between AML and ALL samples, in both the training and test set, providing evidence that our meta-covariate model is able to make good predictions and suggesting that the clusters formed are response relevant and, therefore, potentially biologically relevant.
The meta-covariate model algorithm was run to convergence—the criterion being a difference in the joint posterior of or a maximum of 5000 iterations—on the leukaemia data, partitioning the probes into 22 clusters. These clusters and their associated regression coefficients (w), dataset proportion () and mean variance (, the variance in expression of cluster members, averaged over samples) are described in Table 1. There is a marginal trend for to decrease with cluster size (). However, there is a significant correlation between the mean variance in the cluster and its influence (). This is perhaps contrary to expectation. It might be expected that the most influential clusters would identify transcriptionally tight clusters of genes corresponding to specific sub-functionality; however, the opposite is true: the more influential clusters are more variable.
This can be explained by considering how θk is calculated (see Equation 4 in the Supplementary Data). θk is comprised of both a model mismatch component, which describes how well the current classification model matches the response data, and a standard clustering component (12, Section 9.2.2). As the cluster size decreases, that is, as γk becomes more sparse (where γk is the vector of clustering responsibilities for cluster k), the model mismatch terms will dominate the calculation of θk as the standard clustering component, dependent on γk, will diminish. Conversely, as the cluster becomes larger and γk becomes less sparse, the standard mixture modelling component will dominate the calculation. Furthermore, as the cluster becomes more influential and the value of increases, the model mismatch term will dominate further. Therefore, the model will tend to form smaller, influential, more variable clusters and larger, less influential and less variable clusters, thereby automatically inducing sparsity in the model.
The model is capable of capturing large-scale biological functionality that is of relevance to the response. As expected, the biology captured by the most influential clusters in this dataset describes functions characteristic of myeloid and lymphoid cells.
C is the most influential cluster in the model generated from the leukaemia data (Figure 2a). The expression of the genes in this cluster is associated with ALL samples. C is enriched for elements in the `MIF regulation of innate immunity' pathway, due to the inclusion of MIF and its cell surface receptor CD74 (19) in the cluster (Supplementary Figure S2). MIF is a lymphokine, a signalling molecule expressed by lymphocytes (http://www.ncbi.nlm.nih.gov/gene/4282), which has been shown to play a role in T-cell tumourigenesis (20) and lymphocyte proliferation (21,22). CD74 is expressed on malignant B cells (a form of lymphoid cell), but is expressed to a much lesser extent on non-malignant cell surfaces stein (23) (Supplementary Figure S3).
C is the most influential AML+ cluster (Figure 2b). The most over-represented IPA pathway in this cluster is the `triggering receptor expression on myeloid cells 1' (or TREM1) signalling pathway (Supplementary Figure S4). TREM1 activation has various roles in both the adaptive and innate immune response, but critically, it is only expressed in myeloid cells (Supplementary Figure S5). This cluster is also enriched for `acute myeloid leukaemia signalling' proteins; in fact, the top five AML+ clusters (C, C, C, C and C) are all enriched for this pathway (Supplementary Figure S6). This IPA canonical pathway describes the signalling pathways which, when disrupted by abnormalities (e.g. mutations to genes and/or transcription factors), can lead to increased proliferation and apoptosis resistance in AML. These two pathways are myeloid-specific, describing processes that occur exclusively in myeloid cells.
The predictive ability of each cluster only exists within the meta-covariate model. Although some clusters may clearly discriminate between AML and ALL samples, others may be good at predicting subtypes of either disease. ALL samples can be further sub-classified as T- or B-cell ALL. C is an example of one of these ‘subtype specific' clusters. It is ALL+, with a regression coefficient of −1.91. From the expression plot in Figure 2c, it is clear that this cluster is important when classifying specifically T-cell ALL samples: expression in these samples is visibly higher, while expression in the B-cell ALL and AML samples is similarly low. This cluster is enriched for several T-cell lymphocyte-specific canonical pathways, including the ‘Calcium-induced T lymphocyte apoptosis'; ‘iCOS-iCOSL' and ‘CD28 Signalling in T Helper Cells'; ‘cytotoxic T lymphocyte-mediated apoptosis of target cells' and ‘T cell receptor signalling' IPA canonical pathways (Supplementary Figure S7).
Sub-type-specific clusters can arise in our model because complementary clusters, which are able to predict the other subtype(s) within a class, can exist. An example of a complementary cluster to C is C (Supplementary Figure S8); here, the cluster contains genes that are more highly expressed in B-cell ALL samples than T-cell ALL and AML samples.
In the previous section, we illustrated the use of our meta-covariate method by applying it to a well-known leukaemia dataset. We observed that the influential clusters tend to be smaller and more variable than the less influential clusters and that the model is able to capture both large-scale biological characteristics and small-scale, more specific biological characteristics. In the next section, our method is applied to a dataset of renal gene expression profiles in a rat model of salt-sensitive hypertension.
Essential hypertension (chronically elevated blood pressure) is a genetically complex disease, currently affecting one quarter of adults worldwide and projected to affect almost 30% of adults within 15 years (24). One half of hypertensive patients are salt sensitive, exhibiting increased blood pressure with increased dietary sodium (25). Elucidating the genetics of hypertension would have far-reaching implications for global health. Animal models are useful functional models allowing the genetic dissection of complex, polygenic disease; the data described here are derived from a rat model of salt-sensitive hypertension (26).
The SHRSP, Wistar-Kyoto (WKY) and congenic SP.WKYGla2a strains of rat are distinct with respect to phenotype in response to salt, with the SHRSP demonstrating increased systolic blood pressure and circadian amplitude in response to salt, the WKY being largely unaffected by salt, and the SP.WKYGla2a demonstrating an intermediate increase in both systolic blood pressure and circadian amplitude in response to salt (27, Supplementary Figure S9).
Microarray experiments were conducted to measure renal gene expression in male, age-matched, 21-week-old salt-loaded and non-salt-loaded animals. The resulting dataset was analysed using our meta-covariate method. Genes contained in influential clusters will be informative when discriminating between salt-loaded samples and non-salt-loaded samples. Furthermore, identifying gene expression changes between SP.WKYGla2a and SHRSP will highlight chromosome 2-dependent processes involved in blood pressure regulation.
The sample size (n=18) is small; as such, we used all the data to build the model, rather than making predictions on a test set. However, here we can demonstrate the second use of our method, by employing it as a valuable supervised clustering tool to generate response-relevant clusters within the given dataset, rather than using it primarily to build a classifier (as demonstrated in the previous section when analysing the leukaemia data).
The 4562 probes on the array were identified as significant using the Wilcoxon rank-sum test (). was set to 20, following 1000 iterations of the method for each value of where , and subsequent analysis using Bayesian information criterion (BIC); for , BIC=85731. BIC is a regularized model selection technique, which identifies the most likely values of the model parameters, whilst penalising unnecessary model complexity (12, Section 4.4.1). Upon completion, the meta-covariate model successfully partitioned the dataset with respect to salt.
The 20 clusters that are formed are described in Table 2. Here, there is an imbalance of positively and negatively weighted clusters—12 negative to 8 positive—unlike in the leukaemia model (Table 1), where there were equal numbers of positively and negatively weighted clusters.
This model is dominated by heavily, negatively weighted clusters: the three most influential clusters (C, ; C, ; C, ) are all negatively weighted (note also that these three clusters have similar variance). This suggests the dominant biology captured by this model is reduced expression in the salt-loaded animals. That is, the biology that contributes most significantly to the discrimination between the salt-loaded and non-salt-loaded samples is that of lower expression in the salt-loaded samples.
Cluster size is significantly inversely correlated with regression weight (, ) and significantly correlated with average variance (, ). Therefore, as observed in the leukaemia dataset, the method has generated both small, variable (with respect to member gene expression), influential clusters and large, tight, non-influential clusters.
This feature of our model is particularly useful in this dataset, where all of the 4562 probes are significantly correlated with the response. The induced sparsity allows identification of the most relevant features, in a congested dataset where all features are relevant by traditional, univariate methods. Furthermore, there is no correlation between the Wilcoxon P-value and regression coefficient in this dataset (, ), indicating that the most valuable predictors (as defined by the meta-covariate model) would not be selected on the basis of P-value alone.
With a view to establishing (i) how sensitive our method is to variation in the data and (ii) how robust these clusters are, we performed leave one out cross-validation (LOOCV) and compared the models generated from the LOOCV folds to each other and to the model generated from the full dataset using two metrics—adjusted rand index (ARI) (28) and adjusted mutual information (AMI) (29)—both of which measure concordance between clustering structures, while accounting for chance. The results are very encouraging: despite the small sample size, clustering concordance is high (see Supplementary Figures S10–S11 and Supplementary Tables S1–S2). The mean concordance between the clustering structures of the LOOCV-fold models and the clustering structure of the model built from the dataset in its entirety (i.e. the clustering described in Table 2) is 0.96 (=0.011) and 0.96 (=0.0070) for the ARI and AMI metrics, respectively (all values rounded to 2 s.f.). This convincingly demonstrates that a similar clustering structure is observed across LOOCV folds and, therefore, that the method is insensitive to variation in the input data. This is particularly encouraging given that an initial motivation for this method was to avoid such sensitivity. We can now progress with the analysis of these data, with confidence in the clustering structure.
C is the most influential cluster: its regression coefficient is five times larger than the second most influential cluster. Classification using this cluster and its regression coefficient in isolation results in only one misclassification (the SHRSP+salt animal, C3996) using the decision boundary (). Although we should be cautious of reading too much into cluster performance in isolation, given that clusters are only relevant as part of the model as a whole, it is a useful indicator of how informative a cluster is in the model.
The negative regression coefficient indicates that the genes in this cluster are, largely, more highly expressed in the non-salt-loaded samples than the salt-loaded samples, as is evident in the graph of the mean expression values () in Figure 3. Comparing the mean expression values to the values illustrates the effect of incorporating an outcome-specific component in the calculation of θk: the difference between the non-salt-loaded and salt-loaded samples is exaggerated in the graph of θ13 (θk where ) in Figure 3.
Note that the difference between and in the renal dataset is greater than in the leukaemia dataset (Figure 2a–c and Supplementary Figure S8). This suggests that there is greater discriminative power in the unaltered leukaemia data than in the unaltered hypertension data. This is not surprising, given the known heterogeneity in the leukaemia samples and the comparative homogeneity of the inbred rats. It is encouraging that the model is able to use patterns that exist in the mean gene expression data to build the model, but that it is also able to alter the cluster representation (i.e. alter ) to find more complex informative patterns.
Figure 4a–c show the results of a RP analysis (10) within each strain, between the salt-loaded and non-salt-loaded datasets (Supplementary Tables S3–S5, chromosome mappings given in Supplementary Table S6). Most of the genes are significantly differentially expressed between the salt-loaded and non-salt-loaded datasets in both the WKY and SP.WKYGla2a (Figure 4a and b, respectively). However, the same genes are not differentially expressed when comparing the salt-loaded and non-salt-loaded SHRSP animals (Figure 4c) giving rise to the hypothesis that changes in expression levels of the genes in are protective against hypertension in response to an increase in dietary sodium. These results are corroborated by a Rosetta Resolver analysis (http://www.rosettabio.com/products/resolver; data not shown) and the differential expression of the four transcription factors (Arntl, Npas2, Nfil3 and Bhlhe41) have been confirmed by qRT–PCR (Figure 4d–g, respectively).
Eleven of the thirteen probes in this cluster were mapped to genes using IPA. A canonical pathway analysis of these eleven genes shows that the cluster is enriched for circadian rhythm signalling genes (Supplementary Figures S12–S13). This is relevant as all three rat strains demonstrate circadian patterns of systolic blood pressure: these nocturnal animals have a higher blood pressure during the night than during the day and this difference and the circadian amplitude is exacerbated on salt-loading in the SHRSP (Supplementary Figure S9).
Also shown in Figure 4a–c are the relationships between the genes in , as described in the Ingenuity Pathway Knowledge Base. Of note are the four transcription factors, three of which, neuronal PAS domain protein 2 or Npas2 (also known as Bhlhe9); aryl hydrocarbon receptor nuclear translocator-like or Arntl (also known as Bmal1 and Bhlhe5); and basic helix–loop–helix family member e41 Bhlhe41 (also known as Dec2), are known to form a transcriptional network and, as seen in a previous section, are potentially protective against hypertension, being differentially expressed on salt in the SP.WKYGla2a and WKY strains. These three transcription factors are central components of the circadian clock (Supplementary Figure S12). Aryl hydrocarbon receptor nuclear translocator-like (Arntl) forms a heterodimer with Clock and is required for E-box-dependent transactivation activating the transcription of the Per genes from the E-box elements in its promoter region (30,31). Protein products of Per act together with Cry proteins to inhibit Per transcription, thus closing the autoregulatory feedback loop. It has been shown that the basic helix–loop–helix transcription factors (Bhlhe41) can repress Clock/Bmal1-induced transactivation of the mouse Per1 promoter through direct protein-protein interactions with Bmal1 and/or competition for E-box elements. Disruption of the key molecular oscillators (Arntl, Npas2) and autoregulatory feedback loops (Bhlhe41, Per, Dbp, Cry), have recently been shown to be involved in hypertension (32) and salt sensitivity in both mice (33,34) and rats (35).
The IPA network generation algorithm was used to form networks of genes known to be functionally related, as defined by the Ingenuity Pathway Knowledge Base. This algorithm generates small (at most 35 genes), densely connected networks from a set of `focus genes'; IPA is able to `fill in the gaps' with linker genes to maximise connectivity in the networks. Constructing networks around the significantly differentially expressed genes identified by RP (10) in the salt data, we can identify networks of functionally related genes that are relevant to salt-loaded animals.
The same three transcriptional regulators that form the transcriptional network in are present in the most significant networks generated from the SP.WKYGla2a () and WKY () RP gene expression microarray data: Arntl, Npas2 and Bhlhe41 (Figure 5a–b and Supplementary Tables S7–S8). To have arrived at a similar conclusion both by way of IPA network analysis and by our meta-covariate method is encouraging. Further, our meta-covariate method identifies a much smaller set of genes, allowing more concise interpretation of the data.
Further investigation and validation experiments are underway, with the priority being the elucidation of how these genes are linked to chromosome 2 and how they are involved in sodium homeostasis. In addition, a major focus will be to investigate why the genes in vary similarly with the response; this may be due to shared transcriptional regulation.
In this article, we describe and illustrate a novel method of microarray analysis using the Golub et al. (5) leukaemia dataset, before applying the same analysis to a novel dataset of renal gene expression data in rat models of salt-sensitive hypertension. It was demonstrated that the prediction performance of our meta-covariate method is competitive in the Golub et al. dataset. Although we refrain from drawing any additional conclusions from these data, beyond the identification of general patterns, we would like to stress that further analysis of these data could be informative, provided the caveats with respect to the experimental design are kept in mind.
Although we were not able to evaluate prediction performance in an independent test set given the small size of the hypertension dataset, the model generated from the training set was able to perfectly discriminate between salt-loaded and non-salt-loaded samples. However, it must also be noted that it is perfectly valid to use the meta-covariate method as a supervised clustering technique with a view to identifying response-relevant gene clusters, as well as a classification model.
Both datasets demonstrated that the model tends to form small, variable, influential clusters and larger, tighter, less influential clusters. This is particularly useful in a congested, homogeneous dataset, such as the hypertension dataset, where many, if not all, variables are significantly correlated with the response. The flexibility of the model was evident in that discrimination patterns were identified in the mean gene expression data where possible, but where these data were not informative, complex patterns were identified by alternative representations of the clusters.
We are currently developing a fully Bayesian implementation of this meta-covariate method—which will provide a range of clustering structures for a dataset rather than a single clustering scheme—while carrying out further biological validation of our findings.
Supplementary Data available at NAR Online.
Engineering and Physical Sciences Research Council (EP/F009429/1, EP/E052029/1) awarded to M.A.G.; British Heart Foundation Chair and Programme grant funding (CH98001, RG/07/005); Wellcome Trust Cardiovascular Functional Genomics Initiative (066780/Z/01/Z); European Union Sixth Framework Programme Integrated Project (LSHG_CT 2005-019015 EURATools) awarded to A.F.D. Funding for open access charge: The Wellcome Trust.
Conflict of interest statement. None declared.