|Home | About | Journals | Submit | Contact Us | Français|
Whole genome microarrays are increasingly becoming the method of choice to study responses in model organisms to disease, stressors or other stimuli. However, whole genome sequences are available for only some model organisms, and there are still many species whose genome sequences are not yet available. Cross-species studies, where arrays developed for one species are used to study gene expression in a closely related species, have been used to address this gap, with some promising results. Current analytical methods have included filtration of some probes or genes that showed low hybridization activities. But consensus filtration schemes are still not available.
A novel masking procedure is proposed based on currently available target species sequences to filter out probes and study a cross-species data set using this masking procedure and gene-set analysis. Gene-set analysis evaluates the association of some priori defined gene groups with a phenotype of interest. Two methods, Gene Set Enrichment Analysis (GSEA) and Test of Test Statistics (ToTS) were investigated. The results showed that masking procedure together with ToTS method worked well in our data set. The results from an alternative way to study cross-species hybridization experiments without masking are also presented. We hypothesize that the multi-probes structure of Affymetrix microarrays makes it possible to aggregate the effects of both well-hybridized and poorly-hybridized probes to study a group of genes. The principles of gene-set analysis were applied to the probe-level data instead of gene-level data. The results showed that ToTS can give valuable information and thus can be used as a powerful technique for analyzing cross-species hybridization experiments.
Software in the form of R code is available at http://anson.ucdavis.edu/~ychen/cross-species.html
Microarrays have become standard tools in biomedical and genomic research nowadays. Modulation of the expressions of thousands of genes simultaneously provides important insights into the molecular mechanisms of biological processes. However, in spite of the exponential growth in available whole genome sequences, there are still many organisms of interest, whose genomes have not been sequenced and therefore whose gene sequences are not all known. To study the gene expressions of these organisms, there are mainly two ways. The first way is to use cross-species hybridization, which is to hybridize the RNA samples of one species to the microarrays designed for a closely related species. The second way is to make customer-designed microarrays based on the currently available sequences of the organism being studied. Presently many biotech companies such as Affymetrix can provide such customer-designed services to fulfill the needs of genomic study. But this way may not be practical when the budget for the project is tight or the time is short or the sequences available are insufficient to be representative of the whole organism. In such cases, researchers often turn to the first way to solve the practical problem. Cross-species hybridization experiments usually cost less than making customer-designed arrays.
The cross-species approach has been employed in several studies in nonhuman primates, using human microarrays to analyze closely related species, such as chimpanzees, rhesus macaques and orangutans,1,2 as well as more distantly related species, such as cattle, dogs, pigs or canines.3,4 These studies assume that nucleotide sequence conservation within mammals is high enough to generate detectable signals. Despite the potential usefulness of cross-species hybridization studies, the quality of the gene expression measures obtained in this way is in question. Two important aspects of measurement quality of cross-species hybridizations have been examined: accuracy and reproducibility. Several studies have reported that the cross-species results are reproducible.5,6 However, reproducibility does not assure that cross-species results can provide valid biological information. The accuracy aspect is more important.
Since the RNA samples of one species are hybridized to the arrays designed for another species, the sequence dissimilarity between the two species will cause the hybridization signals to be low compared to same-species hybridization.3,6 The question has been raised whether cross-species hybridization studies are able to generate valid biological results similar to those obtained by same-species hybridization studies. It has been shown that cross-species hybridization can be used to detect within-species expression differences without discernible loss of information, as long as the two species are not too highly diverged.7 This provides an evidence of the validity of cross-species hybridization studies. However, it is generally agreed that the array sensitivity in cross-species studies decreases compared to that of same-species studies. This means cross-species analysis gives more false negatives, and thus the accuracy of the analysis decreases. To improve the array sensitivity, some researchers suggested a filtration approach called masking procedure.3,4 That is, to install a mask to screen off poorly hybridized probes in Affymetrix arrays. The approach did improve the array sensitivity to some extents. But the problem is there are no consensus filtration schemes available. Thus, it is needed to provide guidelines for the selection of masks or investigate if there is a better alternative.
Previous cross-species studies have used different microarray platforms, such as short oligonucleotide arrays, long oligonucleotide arrays, and two-color cDNA arrays. Affymetrix high-density oligonucleotide GeneChips (Affymetrix, Santa Clara, CA) is a kind of short oligonucleotide microarrays. In Affymetrix system, an mRNA molecule transcribed from a gene is represented by a probe set composed of 11–20 probe pairs. Each probe pair consists of a 25 bases long perfect match probe (PM) and a 25 base long mismatch probe (MM). The multi-probes structure of Affymetrix microarrays may have an advantage for cross-species analysis compared to other platforms such as cDNA microarrays.3 The reason is that the presence of multiple probes of each probe set may increase its probability to match with the target sequence, and thus make it possible to produce a good measure of its expression. In this study, novel statistical methods to improve the sensitivity of cross-species analysis using Affymetrix GeneChips are investigated, and the results of analysis of a cross-species data set are discussed.
Data used in this study was generated at Lawrence Livermore National Lab from a cross-species hybridization experiment to study the heterocyclic amine food mutagen, 2-amino-1-methyl-6-phenylimidazo[4,5-b]pyridine (PhIP). PhIP is the most abundant of the carcinogenic heterocyclic amines found in well-cooked or over-cooked meat and fish. This compound has been shown to be a potent initiator and promoter of prostate and other cancers.8 In this study, a DNA repair-proficient Chinese hamster ovary (CHO) cell line (5P3R2) was exposed to PhIP, at a dose level of 0.4 micro ml, and RNA was harvested at different time points (2 hours, 4 hours, and 8 hours) after exposure. The PhIP-treated RNA samples and untreated reference samples were hybridized to Affymetrix’s mouse GeneChip, MG U74Av2, respectively. Two technical replicates were available at each time point. Thus a total of 12 arrays with 197993 probes were available, that correspond to 12488 probe sets on each array. The aim of this study is to identify genes that respond to PhIP in cell line 5P3R2.
Using the same approach as other similar studies,3,9,10 the current study also assumes the divergence between mammals to be in the order of less than 100 million years. This also suggests that the conservation of protein function might ensure that sequence identity is also sufficiently conserved. The cumulative effect of such conservation across related proteins and sequences of interest would leverage the effort in this study to use gene sets to analyze gene expression data rather than the use of single gene data points.
Purified total RNA was analyzed to ensure quality and quantity using the NanoDrop spectrophotometer and gel microelectrophoresis using Agilent’s Bioanalyzer. Total RNA (5ugms) from all samples was labeled using a modified version of the Ebervine method. RNA was reverse transcribed using an oligo-dT T7 promoter primer and double stranded cDNA generated using the RiboAmp RNA Amplification Kit (Cat # KIT0209) according to manufacturer’s directions as follows: a primer containing T7 RNA polymerase promoter sequence-oligodT was annealed to the total RNA molecules by heating at 65 °C for 5 minutes. A master mix including dNTPs was prepared using the RiboAmp kit and added and first strand cDNA generated by incubation at 42 °C for 45 minutes. The RNA is then degraded using a proprietary nuclease enzyme at 37 °C for 20 minutes followed by inactivation of nuclease at 95 °C, 5 minutes and cooled to 4°C. Thereafter, the second strand cDNA synthesis was initiated by annealing another proprietary primer (Primer B) supplied with the RiboAmp kit to the first strand cDNA fragments (95 °C for 2 minutes, chill to 4 °C). Post addition of the second strand cDNA synthesis mix, the reaction is incubated at 25 °C for 5 minutes, 37 °C for 10 minutes and then finally at 70 °C for 5 minutes. The reaction mixture was then chilled to 4 °C and the double stranded cDNA thus generated was purified using columns and proprietary buffers supplied along with the RiboAmp kit with vendor provided purification protocol. The Enzo BioArray HighYield Transcript Labeling Kit (Cat # 42655–20) was used to generate labeled antisense RNA from the double stranded cDNA using T7 RNA polymerase and biotinylated UTP for amplification mediated labeling by incubation for 5 hours at 37 °C in a shaking water bath. Post labeling, the antisense RNA (ranging from 500–1800 nts) was first purified using the RNEasy procedure (Qiagen), fragmented as described in the Affymetrix Gene Expression Analysis Technical Manual (Affymetrix, Santa Clara, CA) and evaluated for quality and quantity by microchannel electrophoresis on the Agilent 2100 Bioanalyzer.
Affymetrix MG U74Av2 gene chips were hybridized using 15 μg of fragmented complementary DNA followed by washing and staining in an Affymetrix Fluidics Workstation as described in the Expression Analysis Technical Manual (Affymetrix, Santa Clara, CA). Hybridized chips were scanned and signals were detected using an argon-ion laser scanner (Agilent Technologies, Palo Alto, CA). Microarray reports were generated to assess the hybridization quality and individual CEL files were used for data preprocessing.
The resulting images were processed to give a raw intensity value for each probe. Then probe level data need to be converted to expression values. There have been quite a few methods which achieve this goal such as MAS5.0,11 MBEI,12,13 RMA14 and GLA.15 All these methods contain mainly three steps: 1) background correction, which refers to the adjustment intended to remove background noise; 2) normalization, which is a technique to reduce non-biological variation in different arrays; 3) summarization, which gives an expression measure for each gene or probe set.
Among these, RMA expression measure has become a widely accepted method for Affymetrix GeneChips. In Zhou and Rocke,15 they compared the performances of these preprocessing methods through two real data sets and concluded that GLA and RMA outperform the other methods. For our cross-species data set, both GLA and RMA methods to background-correct and normalize the whole data set were applied before any further analysis.
In Ji et al3 and Grigoryev et al,4 a masking procedure was applied to filter out the poorly hybridized probes in the data preprocessing step thus improving the sensitivity of cross-species analysis. However, the masking procedure is very empirical and both studies lack details explaining the reasoning of mask selection. The currently available CHO sequences were used to choose masks. GenBank was searched for all the available CHO sequences and then BLAST program was used to match those CHO sequences to the probe sequences of Affymetrix’s mouse gene chip. 907 mouse probes were identified that were 100% matched to currently available CHO sequences. We hypothesized that if a known CHO sequence is 100% matched to a mouse probe, there is less chance of cross-hybridization in this probe. Therefore all such probes were retained in the data preprocessing step. Following this hypothesis, a ratio was created,
where nm represents the number of remaining probes that are 100% matched to available CHO sequences and nT represents the number of total remaining probes after applying the mask. Intuitively thinking, a strict mask will lead to small value of nm and also small value of nm. The goal is to mask off as many unmatched probes as possible while keeping most of the matched probes. Thus, a larger value of rm results in a better mask based on our hypothesis. Three groups of masks were selected for use: PM only, PM-MM, and PM/MM. In each group, three masking thresholds together with five masking stringencies were tested. The three masking thresholds are 25th, 50th and 75th percentile of the data set. The five masking stringencies are 8.33%, 25%, 50%, 75%, and 100%. The masking stringency 8.33% means that the probe is masked off if it does not meet the masking threshold in at least 8.33% of the 12 arrays, that is 1 array. Similarly, the masking stringency 100% means that the probe is masked off if it does not meet the masking threshold in all of the 12 arrays. 45 masks were tested and Supplementary Table 1 shows the details and results of all these masks. If the masking threshold is the same, a smaller masking stringency will lead to larger rm value. In addition, if the masking stringency is the same, a larger masking threshold will lead to larger rm value. Based on these findings, three best masks were selected using the rm value in the three groups: PM only, PM-MM, and PM/MM and these were used for further analysis.
In DNA microarray studies, single-gene analysis has some limitations. A successful microarray experiment can result in a long list of differentially expressed genes which may not be easy to be interpreted by biologists. On the other hand, no single gene may be detected if the change of expression is very moderate. Gene-set analysis can generally overcome these limitations to some extents. Quite a few statistical methods have been proposed in recent years to study gene sets.16–19 The basic idea of gene-set analysis is to look at the expression patterns in a group of genes to find out if they are associated with a class label or differentially expressed under different experimental conditions. Usually the genes in a predefined gene set have some biological themes, such as coming from the same biological pathway or having similar cellular functions. Thus the results of gene-set analysis are much easier to interpret and can help biologists understand some fundamental biological mechanisms.
In the cross-species data set, preprocessing of the probe-level data using a standard method such as MAS5.0 or RMA and fitting a linear model for each gene resulted in no single gene that met the threshold for statistical significance after adjusting for multiple hypothesis testing. This is due to the low sensitivity of cross-species data analysis. In Affymetrix systems, since each gene has multiple probes on the chips, it is very unlikely to see that all the probes match well with the target even if this gene is truly differentially expressed. The standard summarization method, such as MAS5.0 or RMA, which gives an expression measure of each gene based on the intensity of all its probes, will often lead to low expression measures for truly differentially expressed genes.3,4 This makes it harder to distinguish between genes that are truly differentially expressed and genes that have minimal or no change at all.
Thus, gene-set analysis was chosen as the method of choice for our data set. Masking procedure and gene-set analysis were combined in the first step. Since there is no consensus on the selection of a good mask, we propose a new way to apply gene-set analysis to cross-species data without using masking procedure, that is, to investigate the probe-level data instead of gene-level data. By applying the general ideas of gene-set analysis to probe-level data, the effects of multiple probes are aggregated and statistically tested to identify any general trend of expression changes in a group of genes under different experimental conditions.
Through Sections 2.4–2.5, the basic algorithms of two gene-set analysis methods and the use of the following notations are introduced: S is a predefined gene set, N is the number of genes in a gene set S, L is a rank list of all the genes based on its association with class phenotypes, and r is the Pearson correlation.
Gene Set Enrichment Analysis (GSEA) is a highly developed version of gene-set analysis,17 which utilizes the Kolmogorov-Smirnov statistic to measure the degree of differential gene expression in a gene set across binary phenotypes. It ranks all the genes based on their association with a class phenotype and test whether the members of a predefined gene set are uniformly distributed throughout this list. The main steps of GSEA are:
If an entire database of gene sets is tested, a final step is added to adjust for multiple hypothesis testing. GSEA has been successfully applied to several cancer date sets in detecting significant biological pathways.17
Test of test-statistics (ToTS) method is another kind of gene-set analysis which also aims to detect significant biological pathways or any priori defined gene sets instead of individual genes. It has been successfully applied to an ionizing radiation study of prostate cancer where the patient variability in response to low doses of ionizing radiation (LDIR) creates substantial difficulties in detecting any differential gene expression.19 The method consists of three steps:
In the ionizing radiation study, the effects of each gene in different patients are aggregated to test the significance of a gene set. In the cross-species hybridization experiment using Affymetrix GeneChips, the effects of multiple genes or multiple probes are aggregated to test the significance of a gene set.
The website of GSEA provides a molecular signature database (MSigDB) that has a collection of gene groups and pathways from online pathway databases, publications in PubMed, knowledge of domain experts, and so on. Most of the gene group and pathway information was collected in human. Annotation files for the human chip HG-U133A and the mouse chip MG U74Av2 were downloaded from the website of Affymetrix. In addition a linking table that has Orthologs/Homologs information and which therefore links probesets from mouse chip to human chip, was also downloaded from website of Affymetrix. Using this linking table which provides 1-to-1 relationship of human to mouse probe sets, a collection of mouse gene groups and pathways was created. This resulted in 522 gene sets.
A subset of gene groups and pathways were further selected from the collection of 522 gene sets, on the basis of their relevance to cancer, signaling, oxidative stress which is believed to be one of the outcomes of exposure to PhIP, mitochondrial pathways, pathways involving p53, pathways pertaining to androgen and estrogen, and electron transport pathways. This resulted in 100 gene sets, which will be investigated in our analysis. The list of these 100 gene sets can be found in Supplementary Table 2.
Our fist step for statistical analysis was to get loess-normalized PM, PM-MM, and PM/MM intensities, respectively. As described in Section 2.2, three different masking thresholds and five different masking stringencies with PM, PM-MM, and PM/MM data were applied. The details are listed in Supplementary Table 1. The value of rm was used to select best masks, which resulted in, PM > 5.83, PM-MM > 34.58, and PM/MM > 1.27. These masks were named mask1, mask2 and mask3 respectively. After application of these three masks, single-gene analysis still did not yield any significant result. Gene-set analysis was performed next using GSEA and ToTS methods. Whereas GSEA failed to identify any significant gene set at the threshold fdr < 0.05, ToTS identified some significant ones at the threshold fdr < 0.05.
Of the 100 gene sets that were selected, 9 were statistically significant between treatment and control after applying all the three masks, as shown in Table 1. The last 3 columns of Table 1 lists the fdr-adjusted p-values from using ToTS and the 3 masks, respectively. Supplementary Figure 1 shows the overlap of significant gene sets identified by applying mask1, mask2, and mask3. The complete list of significant gene sets by applying the three masks are shown in Supplementary Table 3. Thus, the application of a masking procedure together with the ToTS method helped identify some significant pathways. We then posed the question whether meaningful results could be achieved without applying masking procedure. The next several sections show the results of applying gene-set analysis in probe-level data.
Our first step for statistical analysis is to background-correct and normalize the whole data set using GLA algorithm.15 Through Sections 3.4, the same GLA algorithm was used. In Section 3.5, the results from the use of a different preprocessing method, Robust Multi-array Analysis or RMA, are discussed.
The GSEA method was developed to test the association of a pre-defined gene set or pathway with binary phenotypes. The first step of GSEA is to rank all the genes based on their association with the class labels. There are several metrics that can measure the association, such as Pearson’s correlation, signal-to-noise ratio, two-sample t-test statistics and so on. Thus GSEA can be generalized to data other than binary phenotypes, as long as a metric can be applied, that measures the association of each gene with class labels in the data set. For each probe in our data set, a two-way ANOVA model was applied with treatment and time as the two factors, and tested for whether the expression of the probe is statistically different between treatment and control samples. Thus a t-score was obtained for each probe out of 197993 probes. Rather than using Pearson correlation or other statistics, the absolute values of these t-scores were used to rank all the probes. For each pathway, an enrichment score was calculated based on Equation (1). Gene permutation was used in place of sample permutation here to calculate the empirical p-values of each pathway. That is, to randomly select genes as the sampling units and generate a distribution for the ES test statistic. It is criticized that gene permutation tends to give anti-conservative results because the independence assumption between genes are usually unrealistic.20 However, in the cases where there are only a few arrays available, it is impossible to do a large number of sample permutations. Such cases of insufficient data points are not as uncommon as one may assume, given the typical monetary constraints faced by the average research laboratory. It therefore becomes imperative to find novel alternatives to deal with this case. Gene permutation could be the alternative in such cases. In this cross-species data set, only 4 arrays were available at each time point. Thus gene permutation was used to test the significance of pathways. In order not to worsen the violation of gene-gene independence in a gene set, genes rather than probes were randomly selected in order to keep each gene’s probe structure intact.
It was found that none of the pathways met the statistical significance level at 0.05. Figure 1 shows the histogram of fdr adjusted p-values for all the 100 pathways tested. Most of the pathways have p-values greater than 0.50.
Similar to GSEA, our first step using ToTS was to get a t-score for each probe by fitting a two-way ANOVA model. To test whether a pathway was associated with the treatment PhIP, the t-scores of all the probes corresponding to the genes in this pathway were aggregated, and a one-sample t-test and a one-sample Wilcoxon test performed. We hypothesize that there are effects in at least some genes of a pathway, but it may be reflected in only a small number of probes that correspond to the genes because of the poor hybridization qualities of cross-species data set. We would therefore expect to see that the t-scores are biased in a positive or negative direction if there is in fact, any such kind of diffuse response. Figure 2 shows the histogram of the t-scores of all the probes corresponding to pathway “PENG GLUCOSE DN” which showed a slight upward trend of all the t-scores. In order to verify that this trend is not seen by chance, the one-sample t-test or one-sample Wilcoxon test was used to test whether the mean or median of the collection of these t-scores are different from 0. In order to reduce the bias introduced here, we also performed gene permutation to generate a null distribution for the t-test statistic or Wilcoxon-test statistic and thus obtained empirical p-values for each pathway tested. The standard p-values by t-test or Wilcoxon-test correspond well to the empirical p-values and were therefore omitted here. In addition, the Wilcoxon-test with gene permutation tended to be more sensitive than the t-test in this case. The results of empirical p-values by Wilcoxon-test are showed in Table 2. Six pathways were found to be statistically significant at the threshold 0.05. In order to compare GSEA and ToTS, the p-values of these six pathways by GSEA are also listed in Table 2. None of them met the statistical significance threshold.
The influence of the preprocessing algorithm used on cross-species studies was assessed, since several preprocessing algorithms are currently available and in use widely. In Sections 3.3–3.4, GSEA and ToTS method were applied to the probe-level data preprocessed by GLA algorithm, and our results showed ToTS to be more sensitive and providing better results than GSEA. Since RMA has been a standard and quite accepted preprocessing method for Affymetrix GeneChips, we also used RMA to preprocess the raw data and applied similar gene-set analysis methods. In Zhou and Rocke,15 they showed that GLA has comparable performance with RMA. In the current study too, the 6 six pathways that were identified to have fdr adjusted p-values less than 0.05 were the same set of pathways identified using GLA algorithm. Comparison of Table 3 and Table 2 shows that the conclusion based on using GLA and RMA methods are the same, providing evidence that preprocessing algorithms do not have great effects on the results here.
In this study, a novel masking procedure was proposed based on currently available target species sequences, to filter out probes, and this masking procedure was combining with gene-set analysis. An alternative method is proposed, for analysis of cross-species studies using Affymetrix GeneChips, wherein the methods of gene-set analysis are applied to the probe-level data. Two gene-set analysis methods and their application to a cross-species data set were investigated. The results showed that ToTS has the better performance than GSEA.
The results from ToTS with the masking procedure gave 42 pathways with significant p-values by at least one mask, and of these, 9 had significant p-values with all three masks used (Supplemental Table 3). A particularly relevant biological aspect of this result is that there is a significant overlap between these results and pathways contributing to the Cancer pathways (subway map of cancer pathways). These include the WNT signaling pathways, GSK3, B-catenin, eIF4, TERT, PI3K, and p53 among others. Although some of these are lost upon correcting for multiple hypothesis testing, the results are interesting, nevertheless, and deserve further analysis.
The results from ToTS analysis at the probe-level, done without masking, gave 6 significant pathways with p-values less than 0.05. An important detail from this analysis is that 4, of the 6, pathways are also shared by the 9 pathways identified by applying masking procedure: PENG GLUCOSE DN, PENG GLUTAMINE DN, PROTEASOME DEGRADATION, and PROTEASOMEPATHWAY. This consistency may be evidence of improved sensitivity of ToTS methods in analysis of cross-species data either at probe-level or gene-level.
The application of GSEA either at probe-level or gene-level of our cross-species data set didn’t give any significant result. A problematic issue of GSEA is that the calculation of enrichment score of a particular gene set not only depends on the expressions of genes in this gene set, but also those outside of the gene set. When a particular gene set is biologically meaningful, the expressions levels of other genes should not affect the inference on this gene set. This problem was also reported in Dinu et al21 and it may cause the power of GSEA to be low in some situations. In contrast to GSEA, ToTS uses t-statistics or Wilcoxon-statistics, which does not depend on the expressions of other genes outside of a particular gene set. ToTS may be expected to have more power than GSEA. Further study that includes a comprehensive comparison between the two methods may provide fresh insights.
Through the study of this cross-species data set, we have seen that ToTS successfully improved the sensitivity of cross-species analysis while GSEA failed to do so. ToTS had better performance than GSEA here and thus is potentially a useful tool for cross-species studies. However, a comprehensive guideline for its usage in cross-species hybridization experiments is still necessary in order for it to gain widespread success.
Funding This work is supported by grants from the Department of Energy (DE-FG02-07ER64341, DE-SC0001099), Air Force Office of Scientific Research (FA9550-07-1-0146), National Human Genome Research Institute (R01-HG003352), National Institute of Environmental Health Sciences Superfund (P42-ES04699), National Institute of Allergy and infectious Diseases (R21-A1080604) and NIH-NCI (CA5586112S1). It was performed in part under the auspices of the U.S. DOE by University of California, Lawrence Livermore National Laboratory under Contract W-7405-Eng-48.LLNL-JRNL-421545.
Supplementary Data Supplementary data are available at http://anson.ucdavis.edu/~ychen/cross-species.html
Disclosures This manuscript has been read and approved by all authors. This paper is unique and is not under consideration by any other publication and has not been published elsewhere. The authors report no conflicts of interest.