|Home | About | Journals | Submit | Contact Us | Français|
High-throughput data collection using gene microarrays has great potential as a method for addressing the pharmacogenomics of complex biological systems. Similarly, mechanism-based pharmacokinetic/pharmacodynamic modeling provides a tool for formulating quantitative testable hypotheses concerning the responses of complex biological systems. As the response of such systems to drugs generally entails cascades of molecular events in time, a time series design provides the best approach to capturing the full scope of drug effects. A major problem in using microarrays for high-throughput data collection is sorting through the massive amount of data in order to identify probe sets and genes of interest. Due to its inherent redundancy, a rich time series containing many time points and multiple samples per time point allows for the use of less stringent criteria of expression, expression change and data quality for initial filtering of unwanted probe sets. The remaining probe sets can then become the focus of more intense scrutiny by other methods, including temporal clustering, functional clustering and pharmacokinetic/pharmacodynamic modeling, which provide additional ways of identifying the probes and genes of pharmacological interest.
Synthetic glucocorticoids, corticosteroids, are widely used to suppress inflammatory and immune responses. However, this class of drugs has a low therapeutic index due to a multiplicity of adverse effects on many tissues, including skeletal muscle [1-9]. The adverse effects on skeletal muscle derive from the role of this tissue in glucose homeostasis. One aspect of the broad systemic function of glucocorticoids is to increase gluconeogenesis in the liver and kidney [4,5,10,11]. A primary substrate for gluconeogenesis is amino acid carbon derived from the net degradation of muscle protein. Glucocorticoids also cause muscle to become insulin resistant, thereby preventing the large bulk of the musculature from taking up the glucose produced by the liver and kidney [4,5,12]. When corticosteroids are used therapeutically, their effects on the musculature are accentuated, which results in muscle wasting and insulin-resistant diabetes.
Corticosteroids produce their effects on skeletal muscle by altering the transcription of specific genes. These transcriptional effects take two fundamental forms, a direct and indirect method. Many regulated genes contain glucocorticoid responsive elements (GREs) in their regulatory sequences and their transcription is influenced directly [2,3,10,13-17]. However, there are a large number of genes whose transcription is altered indirectly by glucocorticoids. In these cases, glucocorticoids alter the expression or function of other transcription factors, which in turn alter the transcription of other genes [13,18-22]. Although still not entirely understood, the phenomena of muscle wasting and insulin resistance clearly involve temporal cascades of changes in the expression of a multiplicity of genes [2-5,19,23-26]. Many genes involved in the phenomena of muscle wasting and insulin resistance have been identified in a piecemeal fashion using diverse in vitro and in vivo experimental systems. However, understanding such phenomena requires that the temporal cascade of gene expression events be viewed as a whole.
Previously we have used pharmacokinetic/pharmacodynamic (PK/PD) modeling in studies to describe the relationship between bolus dosing with methylprednisolone (MPL) and the change in the expression of a few genes in liver and skeletal muscle [3,10,15-17]. For those experiments, a single bolus dose of MPL was given intravenously to groups of adrenalectomized animals. Animals were sacrificed at 16 time points over a 72-h period. The PK/PD models describe the deviations from and return to baseline (defined by vehicle-treated controls) of gene expression responses. The livers and muscles used for both of these studies were derived from the same animals. Data were analyzed as if samples were taken from a single animal. The data for the change in the expression of mRNA for the PK/PD models was generated using quantitative northern hybridization. Although, more recently, we have converted such measurements to quantitative real-time reverse transcriptase polymerase chain reaction (RT-PCR), even this method does not allow the scope of data collection necessary for developing models for the type of polygenic phenomena initiated by corticosteroids. We previously described the availability of data sets developed by using the Affymetrix GeneChips® Rat Genome (R_U34A) (Affymetrix, Inc., Santa Clara, CA, USA) microarray chip available online, which allows for single gene queries . Those data sets were developed using the same rich time series employed in our earlier studies. The intent was to use gene arrays as a method of high-throughput data collection in order to obtain the scope of data necessary for applying PK/PD modeling to describe broad polygenic phenomena, such as insulin resistance caused by corticosteroids.
Mining such data sets presents uniquely different problems from those encountered when microarrays are used to distinguish one group from another (e.g., cancerous versus non-cancerous tissues). For those applications, one attempts to define a pattern or fingerprint that distinguishes, with very high probability, one group from another [28-33]. In many cases it is the pattern of gene expression rather than the relationship between the genes that is the important focus. In the present application of microarrays, the difficulty lies with sorting through the vast amount of data to identify probe sets with temporal patterns of change in expression, which indicate that the gene is regulated in response to the drug. In this case, the causal relationship between the genes whose expression is changing in response to the drug is of paramount importance. For example, the drug may change the expression of a particular transcription factor, which in turn alters the expression of downstream genes. For this reason the most important aspect of the mining approach is to avoid discarding valuable data. This is of particular importance because each differentially expressed gene becomes the subject of extensive literature searches in order that it can be placed into a temporal context of all other transcriptionally altered genes. The purpose of the endeavor is to use PK/PD modeling to develop a ‘motion picture’ of the polygenic response to the drug.
In the present report we describe a filtering approach to mining the skeletal muscle data set, which is designed to eliminate probe sets that do not meet criteria expected of transcriptionally altered genes. These criteria are based on our extensive prior knowledge of data for individual genes and their use in PK/PD modeling. This report, therefore, details the small percentage of probe sets in the skeletal muscle data set that meet a specific criteria for further and more intense scrutiny. That same skeletal muscle data set was initially described and its online availability has been detailed in a previous report .
Muscle samples (gastrocnemius) were obtained from a previously performed animal study in our laboratory [2,3,10]. Male adrenalectomized (ADX) Wistar rats (Rattus rattus) weighing 225–250 g were obtained from Harlan Sprague-Dawley (Indianapolis, IN, USA). Animals were allowed free access to rat chow (Agway, RMH 1000) and 0.9% NaCl drinking water. They were housed in a room with a 12 h light/12 h dark cycle, a constant temperature of 22°C, and were allowed to acclimatize to this environment for at least 1 week. All rats were subjected to right external jugular vein cannulation under light ether anesthesia 1 day prior to the study. Cannula patency was maintained with sterile 0.9% NaCl solution. Four animals were designated as controls (i.e., zero time samples) and received vehicle only. All remaining animals received a single 50 mg/kg dose of MPL sodium succinate (Pharmacia-Upjohn Company, Kalamazoo, MI, USA) via the cannula over 30 s. Rats (3) were sacrificed by exsanguinations under anesthesia at 0.25, 0.5, 0.75, 1, 2, 4, 5, 5.5, 6, 7, 8, 12, 18, 30, 48 and 72 h after dosing. The sampling time points were selected based on previous studies describing glucocorticoid response dynamics and enzyme induction in skeletal muscle and liver. Four cannulated vehicle-treated rats were sacrificed as controls. Gastrocnemius muscles were rapidly excised, flash frozen in liquid nitrogen, and stored at -80°C. Frozen muscle tissues were ground into powder using a liquid-nitrogen-chilled mortar and pestle.
Muscle powder (100 mg) from each individual animal was added to 1 ml of prechilled Trizol® reagent (Invitrogen, Carlsbad, CA, USA) and total RNA extractions were carried out according to the manufacturer's directions. Extracted RNAs were further purified by passage through RNAeasy™ mini-columns (QIAGEN, Valencia, CA, USA) according to the manufacturer's protocols for RNA clean up. Final RNA preparations were resuspended in RNase-free water and stored at -80°C. The RNAs were quantified spectrophotometrically, and purity and integrity were assessed by agarose gel electrophoresis.
Isolated RNA from each individual muscle was used to prepare the target according to the manufacturer's protocols. The biotinylated cRNAs were hybridized to 51 individual R_U34A, which contained 8799 probe sets (in one case, at 30 h, only two samples were available). The Affymetrix arrays used have the advantage of having multiple measurements per gene (≥ 11 probe pairs per transcript). However, this same redundancy leads to many different interpretations of probe sets to determine a signal for each gene (probe set algorithms), and controversy regarding the sensitivity and specificity of resulting signals, as well as appropriate normalization methods . We have used the Affymetrix Microarray Suite (MAS) 5.0® (Affymetrix, Inc., Santa Clara, CA, USA) algorithm, and this places a relatively high penalty for mismatch hybridization, favoring specificity over sensitivity . This entire data set has been submitted to the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus database (GSE490) and is also available online [27,101].
The approach to data mining was developed based on our use of gene arrays as a technique for high-throughput data collection within the context of a rigidly controlled time series paradigm. The Affymetrix oligonucleotide microarrays use sequence information and photolithography-directed combinatorial chemical synthesis to develop probe sets for the genes of interest. Each probe set consisted of a series of short oligonucleotide sequences and an identical partner sequence, except for a single base mismatch in the center. The mismatch sequence provided a unique background for each sequence in the series. MAS 5.0 was used for initial data acquisition and basic analysis. In this first step, a ‘call’ of present (P), absent (A) or marginal (M) was determined for each probe set on each chip based on the comparison of the matched and mismatched pairs for the gene sequence. The results were normalized for each chip using a distribution of all genes around the 50th percentile. The results from the first step were entered into the program GeneSpring 6.1® (Silicon Genetics, Redwood City, CA, USA). This robust software provides a number of tools for visualization and analysis of time series data. One such tool for both hierarchical clustering and visualization is the gene tree approach described by Eisen  as modified by the Gene-Spring software. This algorithm can be used to construct a dendrogram of genes with similar patterns. A negative aspect of this tool, and most time series data-mining tools, is the assumption that the points in the time series are equally spaced. However, to design our 72 h time series in this manner would have ignored the richness of biological information during the early period following dosing of the drug. Notwithstanding this drawback, gene trees provide an excellent method of visualizing the progression of the data analysis. In order for this and other tools to be used it was also necessary to transform the data so that the values for all probe sets were within the same range. To accomplish this, values for each individual probe set on each chip were expressed as a ratio of the mean of the four control values for that gene, which we refer to as ‘normalized intensity’. Thus, the average of each probe set has a value of 1 at zero time and either increases, decreases or remains unchanged relative to controls over the time series.
Figure 1 (top left) shows the gene tree derived from the Genespring program for the entire data set (8799 probe sets at 17 time points). This tree was constructed with the program's default method, which uses Pearson correlation around zero. The x-axis presents the 17 time points studied in rank order from left to right. Vehicle controls are nominally referred to as time zero. As pointed out above, each time point is equally spaced and, therefore, does not represent the true temporal relationship between points. The y-axis presents the mean of the normalized value at each time point for each of the individual probe sets, represented by color and clustered by similarity. In this view, the color yellow represents a value of ‘1’, progression toward red represents values that exceed ‘1’, and progression toward blue represents values that decline toward 0. The intensity of the color reflects the intensity of the original signal. To the immediate left of the gene tree is a schematic tree of the relationship of all probe sets to each other based on expression pattern similarity (represented in green). On the left side of the gene tree, but spatially separated, is what is referred to as a ‘marquee view’ in the GeneSpring software. This marquee view can be used to navigate the view of the main gene tree. Although the gene tree representation of the entire data set is of limited value for examining individual gene patterns of regulation, it does illustrate two points. First, within the entire data set there are a vast number of genes represented by black (no expression in skeletal muscle regardless of treatment) or by the color yellow across the entire time frame studied. This latter group of genes exhibits no temporal regulation by the drug (i.e., their expression does not deviate from control value following drug dosing) and represents the probe sets that we wish to filter from the data set. This gene tree does reflect segregation of similarly regulated genes and demonstrates that similar patterns of regulation do exist. For example, groups of intense red or blue represent clusters of genes with similar up- or downregulation, respectively. Figure 1 (top right) provides a magnified view of one such clustering of upregulated probe sets. The location of this group of probe sets on the entire gene tree is indicated by the white bracket. Figure 1 (bottom) shows an even closer zoom in on five probe sets in this grouping with similar patterns. In this case, two probe sets for the gene regenerating liver inhibitory factor-1 (RL/IF-1) (X63594) segregate next to each other, which demonstrates the optimal function of gene tree clustering. The location of these probe sets in the top right panel is indicated by the white rectangle.
A series of ‘filtering’ steps were applied to the data in an attempt to identify genes and eliminate probe sets that either are not expressed in skeletal muscle or are not responsive to MPL treatment. The first level of filtering was designed to eliminate probe sets not expressed in skeletal muscle regardless of drug treatment (represented by a very low intensity on the gene tree), and utilized the Affymetrix ‘call’ feature. For this step we explored two different levels of stringency. The first approach that we employed required the probe set for the gene to have a call of P on at least 25 of the 51 chips. This filter reduced the number of remaining probe sets from the original 8799 to 3785. We compared this result with a less stringent filter that only required a call of P on 4 of the 51 chips. This filter reduced the number of remaining probe sets from the original 8799 to 4636. This endeavor identified two types of probe sets that would be lost by greater stringency at this step. The first was a group of probe sets that started below the sensitivity of the chip (background) then appeared on a contiguous sets of chips in which the call was P that then returned to a series of A chip calls. The second was a group of probe sets that initially had calls of P but, following dosing, dropped to A for a period of time before returning to a call of P. We chose not to exclude these probe sets with this initial filter because it appeared that the drug treatment was influencing the quality call of the probe set by the software, thus suggesting possible regulation. A final observation concerning the background is that there is considerable variation in the probe set backgrounds on the R_U34A chip. Some probe sets can obtain a call of P with a signal as low as 10 units while others require a signal > 350 units. As our major concern was to prevent discarding valuable data, we chose the less stringent 4P filter for this initial filtering step. Figure 2 provides a gene tree of the 4636 probe sets not eliminated by this filter. Thus, requiring that a signal be present on only four chips eliminated > 4000 probe sets.
The second level of filtering was designed to eliminate probe sets that could not meet the basic criteria of a regulated probe. Specifically, this filter was designed to eliminate probe sets whose average did not deviate from baseline by a certain value for a reasonable number of time points. After exploring a variety of filtering values and number of conditions using gene trees in the manner described in the previous paragraph, we developed two filters that were designed to eliminate probe sets that were neither down- nor upregulated. The first filter eliminated probe sets that could not meet a minimal criteria for down-regulation. Starting with the 4P filtered list we eliminated all probe sets that did not have average values < 0.65 in at least four conditions (time points). Figure 3 (top left) shows a gene tree of the 354 probe sets that were not eliminated by this filter. Most of these probe sets clearly contain a sustained run of time points represented by the color blue, as expected of downregulated probe sets. The next filter was designed to eliminate probe sets that could not meet a minimal criterion for upregulation. Starting with the 4P filtered list, we eliminated all probe sets that did not have average values > 1.5 in at least four conditions (time points). Figure 3 (top right) shows a gene tree of the 349 probe sets that were not eliminated by this filter. Most of these probe sets clearly contain a sustained run of red time points as expected of upregulated probe sets. There were a small number of probe sets that were not eliminated by either filter indicating both up- and downregulation. These probe sets (which were present on both lists) suggest biphasic regulation, a phenomenon we have previously described . Figure 3 (bottom) shows a gene tree of the 690 probe sets that were not eliminated by any of the filters applied. Thus, using four straightforward filters, we were able to eliminate all but 8% of the probe sets present in the original data set.
The last non-statistical filter we applied addressed the quality of the data. For this ‘quality control’ filter we eliminated probe sets that did not meet one of two conditions. The first condition focused on the control chips. As indicated above, our initial operation was to divide the value of each individual probe set on each chip by the mean of the values for that probe set on the four control chips. Therefore, the quality of the control data for each particular probe set is of paramount importance in defining regulation by the drug. The second condition focused on the remaining 16 time points. Using the standard deviation of the mean, this filter excluded all probe sets whose coefficient of variation (CV) for the control chips exceeded 50% or whose CV for > 8 of the remaining 16 time points exceeded 50%. For the one time point in which only two samples were available (30 h), a quasi-CV was calculated by dividing the difference between the two values by the average. This ‘quality control’ filter eliminated an additional 37 probe sets. Figure 4 (left) provides a gene tree of the 653 probe sets that were not eliminated by the entire series of filters. Figure 4 (right) shows the 8146 probe sets that were filtered out by the entire set of filters. Comparing Figure 1 (top left) with Figure 4 (right) demonstrates that probes with apparent regulation are no longer present in the total data set. The 653 probe sets remaining after filtering the total data set (Figure 4, left) are the product of this approach to data mining and, thus, become the focus of temporal clustering, functional clustering, and PK/PD modeling.
Our previous work demonstrated that three basic temporal signatures of regulation can be expected following single bolus dosing by MPL of a population of ADX rats that have a stable baseline. These three signatures indicate upregulation, downregulation, and biphasic regulation. Figure 5 top left (313 probe sets), top right (321 probe sets), and bottom (19 probe sets) provide gene trees for probe sets that fall into these three categories respectively. As a final evaluation of the results we performed a 1-way analysis of variance (ANOVA) with a Tukey post-hoc test (p < 0.05) on each of the three groups. Tables 1, ,22 and and33 list the probe sets in these three groups, respectively. The ANOVA identified for exclusion 62 of 313 probe sets from Table 1, 95 of 321 probe sets from Table 2 and 2 of 19 probe sets from Table 3. Those probe sets identified for exclusion by the ANOVA are highlighted in the tables.
A population of ADX male Wistar rats was injected with a single bolus dose of MPL, groups of three animals were sacrificed at 16 time points over a 72-h period, and MPL-treated samples were compared with vehicle-treated controls. ADX animals were used to eliminate the circadian oscillation of corticosterone and provide a stable baseline [2,3,10,15-17]. This allowed us to identify gene transcripts that deviate from the baseline in response to MPL, and determine the duration of time it takes to return to that baseline. The times of sacrifice over the 72-h period were chosen based on previous experiments, which indicated that the effect of the drug was most significant soon after dosing but, in some cases, full recovery required as long as 72 h.
R_U34A chips were used to examine the temporal profile of changes in global gene expression in skeletal muscle in response to this single bolus dose of MPL. RNA samples from each individual animal were applied to a separate chip to preserve inter-animal variation. As this chip contains 8799 probe sets, the major problem was identifying the small percentage of the probe sets that are regulated by corticosteroids. In previous work, we have used cluster analysis tools to concurrently address the problems of data mining and temporal clustering with a similar data set developed using the livers from these animals . Those tools did identify several subgroups of regulated probe sets. However, while examining genes in the pathways we were prompted to visually inspect the results for genes that ‘should’ have been regulated based on the literature. The results of the visual inspection of individual genes demonstrated to us that the initial very stringent approach that we employed eliminated probe sets that were clearly regulated. There are two reasons for this deficit. First, we approached data mining (identifying regulated probe sets) and clustering as a single process using clustering algorithms based on Euclidian distance and correlation coefficients. Neither Euclidian distance nor correlation coefficients incorporate time interval. In our time series design, 9 of the 16 points are within the first 6 h and 12 of the 16 points are within the first 12 h. The interval between time domains ranged from 0.25 h in the beginning to 24 h at the end. Our previous use of Self Organizing Maps (SOM) and K-means, which are based on Euclidian distance, did not consider varying time interval in the analysis. Our subsequent application of correlation coefficients did not rectify the problem. We, therefore, developed a new approach to data mining that is based on a series of filters designed to eliminate probe sets that do not meet certain explicit criteria. This series of filters produces a small percentage of the total probe sets, which can then become the focus of temporal and functional clustering. The latter will then provide an additional filter prior to applying the ultimate objective, mechanism-based PK/PD modeling. Since we have previously published data on expression changes of small groups of genes measured individually by other methods in conjunction with PK/PD modeling of that data, the drug kinetics and receptor dynamics for this data set have been published [3,10]. When such models are developed, one aspect is to evaluate the goodness of fit of the data to the model.
The basis of this new approach is to filter the database based on specific characteristics of the probe sets that we wished to be eliminated. The first filter we applied was designed to eliminate all probe sets that were not expressed in skeletal muscle. This filter reduced the number of probe sets under consideration from 8799 to 4636. The second filter we applied was designed to identify and eliminate a group of probe sets that do not meet the criteria of downregulation. This filter eliminated all but 354 probe sets. Similarly, we filtered the remaining 4282 probe sets for those that did not meet a minimum criteria for upregulation. This filter eliminated all but 349 probe sets.
We then combined the two lists of probe sets that had not been eliminated and filtered that list of 690 probe sets on data quality. The results of the entire set of filters is that 8146 probe sets of the original 8799 probe sets were eliminated from further consideration. This left a remainder of 653 probe sets for further consideration with respect to temporal clustering, functional clustering, and PK/PD modeling.
The use of a rich time series of expression microarrays as a high-throughput method of data collection provides a means for obtaining the patterns of differential expression necessary for developing PK/PD models for complex phenomena, such as steroid-induced insulin resistance or muscle wasting. The initial problem presented by this approach is the necessity to focus attention on a small percentage of regulated genes out of the thousands measured by gene arrays. Probes that fit this category should meet minimal requirements, such as being expressed in the tissue and deviating from baseline for a period of time. They should also meet certain data quality criteria. The series of filters applied to the present data set eliminated > 92% of the probe sets in a simple, straightforward manner. It should be pointed out that this filtration approach takes advantage of the richness of the time series. As this entire data set is available online in a single gene query format , the present report makes the subset of probes that warrant further consideration generally available, along with their criteria of selection.
In the future, the focus in pharmacogenomics will increasingly shift away from simply identifying genes whose expression is altered by a particular drug toward modeling the dynamic relationship between the elements in the cascade of events in the complex biological system. In the case of skeletal muscle and the response to corticosteroids, such models will allow us to begin to address questions such as dosing regimen and the development of polygenic adverse effects, including insulin resistance and atrophy. These transcriptional models will also provide the scaffolding onto which translational and post-translational data can be placed.
This work was supported by grants GM 24211 and 1 P20 GM67650 from the National Institute of General Medical Sciences, NIH. This data set was developed under the auspices of a grant from the National Heart, Lung, and Blood Institute (NHLBI)/NIH Programs in Genomic Applications U01 HL66614.
Papers of special note have been highlighted as either of interest (•) or of considerable interest (••) to readers.