Expression analysis of mouse XM gene sequences
In order to generate an extensive survey of mammalian gene expression, we analyzed mRNA abundance in 55 mouse tissues using custom-designed microarrays of 60-mer oligonucleotides [13
] corresponding to 41,699 known and predicted mRNAs identified in the draft mouse genome sequence using gene-finding programs [12
] (NCBI 'XM' sequences; approximately 39,309 are unique; for further details, see the Materials and methods section). Tissue collection was a collaborative effort among several labs in the Toronto area, each with expertise in distinct areas of physiology; consequently, the mouse tissues we analyzed were obtained from several different strains of mice which are typically used to study specific organs and cell types of interest (additional cell lines and fractionated cells from animals were also analyzed, but the results are not included here because the data appear to bear little relationship to the tissue of origin of the cells examined). Since it has previously been established that there is a high correlation in expression of orthologous genes between mice and humans [15
], large variations in tissue-specific expression should not occur between individuals within the same species, although we cannot rule out subtle strain-specific differences. To maximize the fidelity of measurements, unamplified cDNA from at least 1 μg of polyA-purified mRNA was hybridized to each array, with fluor-reversed duplicates performed in each case. For most organs this required pooling RNA from multiple animals; for example, more than 50 mice were required to obtain sufficient prostate mRNA. Consequently, potential variations due to parameters such as circadian rhythms or individual dissections should have been minimized by averaging over multiple animals.
All hybridizations were performed in duplicate. Data processing and normalization are described in detail in the Materials and methods. The data were processed so that each measurement reflects the abundance of each transcript in each tissue relative to the median expression across all 55 tissues; although the microarray spot intensities were used to determine which genes were detected as expressed (see below), the figures herein show the normalized, arcsinh-transformed and median-subtracted data, which for convenience we refer to as ratios. All of the data, together with tables detailing correspondence to genes in other cDNA and EST databases, annotations and other features of the encoded proteins, probe sequences, and other files used in our analyses below, are available as Additional data files and without restriction on our website [16
Validation of expression data
Four lines of evidence support the quality of our data and its consistency with existing knowledge of mammalian physiology and gene expression. First, we detected the expected patterns of expression for genes previously shown to be expressed specifically in each of the 55 tissues surveyed (Figure ). This validates the accuracy of our dissections, and indicates that there was little cross-contamination between tissue samples.
Figure 1 Expression of previously characterized tissue-specific genes. Genes were identified manually by searching MEDLINE abstracts  and XM sequence description fields (see Additional data file 1) for keywords corresponding to the appropriate tissues. Rows (more ...)
Second, there is a clear correspondence, albeit not absolute, between our data and two other mouse microarray data sets [15
], which surveyed a subset of the genes and tissues that we have examined. Thirteen tissues and 1,109 genes were unambiguously shared among the three studies (Figure ). Our data are more highly correlated with those of Su et al.
], who also employed oligonucleotide array technology, whereas Bono et al.
] used spotted cDNAs (Figure ). Furthermore, our data are more highly correlated with either of the two other studies than the two other studies are to one another. (It should be noted that these previous studies did not examine the use of transcriptional co-expression to predict gene function, which is the focus of the present study.)
Figure 2 Validation of expression data by independent confirmation. (a) The P value of Spearman's Rank correlations (see Materials and methods) is shown for all possible comparisons among the 13 tissues common to all three studies (ours and those by Su et al. (more ...)
Third, our array data are consistent with RT-PCR analysis. We tested for expected tissue-specific expression of 107 genes (a mixture of characterized and uncharacterized) in 18 selected tissues. In this analysis a single primer pair was tested for each gene. (It is possible that the predicted exon structures for many of the poorly characterized XM genes are incorrect: there was a clear correspondence between whether a product was obtained and whether there was an EST or cDNA in the public databases, which would indicate correct gene structures – see Materials and methods.) Among the 55 primer pairs that could result in amplification, 53 (96%) gave a correct-product size in the tissue(s) expected on the basis of our array data, and 47 (85%) produced amplification most strongly or exclusively in the expected tissue(s) (Figure and data not shown). Although RT-PCR is semi-quantitative, there is an obvious correspondence between the left and right panels in Figure , confirming that our microarray measurements are largely consistent with a more conventional expression analysis method.
Fourth, in the analyses detailed in the following sections, we show that the annotations of genes expressed preferentially in each tissue correspond in many cases to known physiological functions of the tissue, further confirming the accuracy of the dissections and the microarray measurements. Moreover, sets of functionally related genes were often observed to display uniform expression profiles, a result that is highly unlikely to occur by chance.
Definition of 21,622 confidently detected transcripts
In order to establish rigorously which genes are expressed in each tissue sample, we used the 66 negative-control spots on our arrays (corresponding to 30 randomly generated sequences, 31 mouse intergenic or intronic regions, and five yeast genes). We considered the XM genes to be 'expressed' only if their intensity exceeded the 99th percentile (that is, all but 1%) of intensities from the negative controls (Figure ). 21,622 transcripts satisfied this criterion in at least one sample. There were 1,790 transcripts that were detected in every sample, and manual inspection verified that many of these have traditional 'housekeeping' functions (for example, ribosomal proteins, actin and tubulin). There were 4,475 transcripts detected in only one of the 55 samples (Figure ). Most of the 21,622 genes, however, were expressed in multiple tissues (Figure ). Each of the tissues expressed fewer than half of the 21,622 genes (Figure ). The number of genes detected in each sample was slightly lower than the conventional estimate of 10,000 genes expressed per cell (for example, we detected 6,094 different transcripts in embryonic stem (ES) cells, the only pure cell population examined, whereas a recent study using sequence tags indicated approximately 8,400 different transcripts in human ES cells [18
]). This level of detection is not unexpected, for several reasons. First, tissues are mixtures of cell types, such that low-abundance, cell-type-specific transcripts may be diluted below the array detection limits of 1 in 1,000,000 [13
]; second, the arrays did not include every single mouse gene; and third, our threshold for expression was conservative. The full 21,622 × 55 data matrix is found in the Additional data files. Figure shows a clustering analysis of the 21,622 expressed genes in the 55 surveyed tissues, which illustrates that distinct tissues with related physiological roles also tend to have similar overall gene expression profiles. For example, all components of the nervous system featured higher expression of a common subset of transcripts, as did all components of the lower digestive tract.
Figure 3 Defining whether a gene is expressed, and how many genes are detected as expressed per sample. (a) The curves show the cumulative distribution for negative-control probes (cyan line) and for probes on the array (blue line), over all arrays, to illustrate (more ...)
Figure 4 Correspondence between gene expression patterns and GO-BP annotations. (a) Ratios for the 21,622 expressed genes were grouped by two-dimensional hierarchical agglomerative clustering and diagonalization, using the Pearson correlation coefficient. (b) (more ...)
Correspondence between gene and tissue function
To examine the relationships among tissues and gene functions, we asked whether genes carrying specific Gene Ontology 'Biological Process' (GO-BP) categories, which reflect the physiological function of a gene, were preferentially expressed in each of the tissue samples, using a statistical test (Wilcoxon-Mann-Whitney; WMW). A selection of the WMW scores are shown in Figure , and expression patterns of all genes in all GO-BP categories can be seen in the Additional data files and at the Toronto gene expressions website [19
]. This analysis revealed that the preferentially expressed GO-BP categories typically reflected known functions of the tissue, sometimes with surprising resolution. For example, while the category 'synaptic transmission' scored highly in all neuronal tissues, 'learning and memory' was highest in cortex and striatum; 'locomotor behavior' was highest in cortex, midbrain, and spinal cord; 'response to temperature', in the trigeminal nucleus of the brainstem; and 'neurogenesis', in both adult central nervous system and embryonic heads (Figure ). While the WMW test may not have captured all of the categories relevant to each brain tissue, this finding does illustrate that our data contain differential expression of genes involved in distinct high-level neural functions. Further investigation of several tissue-associated GO-BP categories that were initially unanticipated revealed that they are easily rationalized; for instance, lung, bladder, skin, and intestines all express immune-related categories, presumably because they are exposed to the environment and infiltrated by immune cells (see for example, [20
Correspondence between gene function and transcriptional co-expression
An alternative way to ask whether gene regulation corresponds to gene function is to examine the correlations among the transcript levels of genes, independent of the tissue-source information. An initial confirmation that patterns of transcript abundance correspond to gene functions comes from simply examining the behavior of all genes within distinct functional categories. For example, Figure shows the expression of individual genes in 17 categories that exemplify ways in which gene expression relates to gene function (similar diagrams for all GO-BP categories can be seen in the Additional data files and at the Toronto gene expressions website [19
]). There are prominent patterns that are distinctive of a subset of genes in each category. The fact that not all of the genes within each annotation category conformed to a single pattern could result from imperfections in the annotations or the measurements, or could be due to the correspondence between gene function and gene expression being less than absolute. While highly tissue-specific expression of genes in a category was observed in some cases (such as 'pregnancy' genes in placenta or 'fertilization' genes in testis), it was much more common that genes within a category were expressed across multiple functionally related tissues (for example, 'bone remodeling' in all bone tissues), consistent with the results shown in Figure . In other instances, genes within a single annotation category were subdivided into multiple expression patterns: for example, 'cell-cell adhesion' contains three distinct groups of genes with elevated expression in skin-containing samples, neural tissues, and digestive tract, respectively. Consistent with a previous study [21
], we observed coordinate regulation of genes within distinct biochemical pathways; Figure includes the examples 'polyamine biosynthesis' and 'serine biosynthesis'. Moreover, a number of functional categories corresponding to basic cellular or biochemical functions which are traditionally thought of as 'housekeeping' (since they are required for cell viability) were in fact coordinately regulated across tissues: Figure shows genes in the category 'RNA splicing', which are expressed most highly in neural and embryonic tissues, perhaps reflecting the higher levels of gene expression and alternative mRNA splicing known to occur in these tissues. Interestingly, subsets of genes in the categories 'cytokinesis', 'microtubule-based movement', 'oxidative phosphorylation', and 'M phase', all of which might be considered as central to cellular physiology, were also expressed in distinctive patterns among mouse tissues.
Expression of genes in 17 different functional categories. The categories were ordered manually. The genes within each category were clustered separately from those in other categories. The order of tissues is preserved from previous figures.
We also asked more generally whether groups of co-expressed transcripts were associated with specific GO-BP categories. Figure shows that this is indeed the case: any given 'cluster' of genes with correlated expression levels is more likely than not to be associated with a local enrichment of one or a few annotation categories, and manual analysis suggests that tissue-specific expression often reflects the known physiological role(s) of the tissues in which the genes are expressed (examples are shown in Figure ). False-discovery rate analysis (see the Materials and methods section) confirmed that over 58% of the 21,622 genes were co-regulated with a set of genes significantly enriched for at least one GO-BP category. For the 7,387 GO-BP annotated genes, over 66% were co-expressed with a set of genes significantly enriched for at least one GO-BP category; in over 25% of these instances, the most significant category was one of its existing annotations. Random permutation analysis (that is, repeating the analysis with randomized gene identities) established a false discovery rate [22
] of less than 1% for these analyses (see Materials and methods for details). Hence, quantitative co-expression of functionally related genes appears to be a general phenomenon in mammals.
Using transcriptional co-expression to predict mouse gene functions
It stands to reason that a gene expressed in a specific tissue is likely to be functioning in that tissue. Therefore, we next asked how accurately mammalian gene functions can be predicted on the basis of gene expression profiles. There are many anecdotal examples in which the tissue-specific or cell-type-specific expression of a gene has been used to aid in discovering its function, and this approach has been advocated in previous analyses of mouse tissue expression data (see for example, [15
]). Our data indicate that the expression of most mouse genes shows some degree of tissue restriction, but most of the genes are not expressed in a highly tissue-specific manner (Figure ). Furthermore, most tissues express genes from multiple functional categories (Figure ), and genes from many functional categories are expressed across many tissues (Figure ), which could make it difficult to distinguish genes in these categories on the basis of expression in one or a few tissues. In addition, defining tissue specificity involves drawing thresholds to form lists, rather than using the quantitative expression information directly to draw functional inferences.
An alternative strategy is to generate functional predictions on the basis of transcriptional co-expression [23
], which we show (above) often reflects gene function (Figure ). This approach utilizes quantitative measurements and places no restriction on tissue-specificity, allowing all expressed genes to be treated equally in the analysis. Furthermore, the use of quantitative co-expression allows the application of sophisticated computational tools that have been optimized for the general problem of classification on the basis of features within a data matrix [25
]. We examined the extent to which this approach is effective for our data, and we show (below) that it yields almost universally superior predictions of gene function in comparison to using information regarding simple tissue specificity or tissue restriction.
In this analysis, we used support vector machines (SVMs) [26
]. An SVM is a machine-learning algorithm (a computer program) that has previously been shown to work well for the prediction of gene functions in yeast on the basis of microarray expression data [25
] but which has not, to our knowledge, been used extensively to predict gene functions from mammalian expression-profiling data. The theory and implementation of SVMs have been described elsewhere in detail [25
]. Briefly, an SVM outputs a 'discriminant value' for each gene in each category, and this value reflects relative confidence that the gene is in the category in question. The SVM considers each functional category separately, and the discriminant value is assigned on the basis of where the gene lies relative to other genes within the 'gene expression space' (for example, analysis of 55 samples results in 55 different coordinates). If the gene lies in a region where there is a high proportion of genes that are known to be in the category in question, this will lead to a high discriminant value. SVMs are conceptually related to clustering analysis in the sense that the discriminant values are derived from similarity among expression profiles. But in clustering analysis, genes are grouped solely on the basis of their expression levels; in contrast, SVMs use the known classifications (that is, knowledge regarding which genes are in the category and which are not) in order to map the initial gene expression space into a one-dimensional space (the discriminant values) in which the two classes are optimally distinguished.
Importantly, the discriminant values output by an SVM can be processed to obtain an estimate of the probability that the prediction for each gene in each category is correct (that is, an estimate of precision), on the basis of how well previously annotated genes in the given category can be distinguished from previously annotated genes that are not in the category. This is accomplished by a three-fold cross-validation strategy, in which the analysis is run three times, each time with a different one-third of the annotations masked so that the SVM algorithm does not know whether or not they are in the category when it is assigning discriminant values. Any given discriminant value is then converted to a precision value by simply asking what proportion of the masked genes with discriminant values above the given discriminant value really are in the category in question. The proportion of known genes in the category that are identified by the SVM as being in the category is also obtained at each discriminant value, and is referred to as recall. For all subsequent analyses we used precision and recall as our primary measures of success.
We trained separate SVMs for each of the 992 GO-BP categories. This revealed that genes in hundreds of categories could be recognized with precision greater than 50% (Figure ). Typically, not all of the genes in a category could be recognized (the curves in Figure correspond to recall of 10% through 40%); this is due to the fact that not all genes within any given category display the characteristic expression pattern (Figure ). As a control, when the gene labels were randomized, only zero to fifteen categories (depending on the randomization run) achieved 10% precision and 10% recall simultaneously (black dotted line at the bottom of Figure ). Therefore, this analysis demonstrates that, in a blind test, the known genes in many functional categories can be distinguished on the basis of the expression profiles of other genes that are members of the same functional category. This implies that there are distinct regulatory mechanisms that control these pathways, and indicates that correlation-based methods can be used to predict the functions of uncharacterized genes in mammals.
Figure 6 Predicting GO-BP categories of mouse genes using microarray data and SVMs. (a) The number of the 992 initial GO-BP categories exceeding the indicated precision value, with recall fixed for each line; for example at 40% recall (green line), around 100 (more ...)
Predicted functions for unannotated genes are supported by sequence features
We next used these trained SVMs (Figure ) to predict functions for the 12,123 unannotated genes for which we detected expression in our data. The number of genes with at least one predicted function (that is, one GO-BP category) is shown in Figure at varying precision thresholds (blue line). All of the predictions with precision above 15% are listed in the Additional data files. To make the outputs easier to peruse manually, we grouped 587 GO categories into 231 'superGO' categories, by combining categories that resulted in the same set of predicted genes and that were manually verified to be physiologically related. Figure (red line) confirms that the number of unannotated genes that are predicted to have some function by an SVM with 'superGO' categories are similar to those with the original GO categories, although the number of categories has been compressed.
In order to provide a set of 'highest priority' predictions, we singled out those with the highest estimated precision. Among the unannotated genes (that is, those carrying no annotation in GO-BP), 1,092 (representing 117 superGO categories) were associated with precision values of 50% or greater; thus, on the basis of the analysis above, each of these genes is more than 50% likely to be involved in the given biological process. Figure shows the original microarray data for these 1,092 genes, sorted by the predicted categories. Predictions were made for genes expressed in all of the tissues analyzed, and represent a wide spectrum of biological processes.
Figure 7 Expression patterns of 1,092 unannotated genes predicted to belong to any of 117 'superGO' categories with 50% confidence or higher. The vertical axis was clustered and diagonalized as in Figure 4. The height of each predicted category has been normalized (more ...)
While some predictions correspond to expression in a single tissue (for example, the 56 genes predicted in 'vision' were predominantly expressed in the eye), such cases were unusual. Rather, most of the predictions were based on expression in multiple functionally related tissues (for example, the five genes predicted in 'regulation of cell migration' were characterized primarily by high expression in colon, large intestine, and small intestine) or more complex patterns (for example, genes predicted in 'CNS/brain development' were preferentially expressed in all adult neural tissues as well as in embryonic heads). Many predictions were found to be in categories related to the cell cycle and RNA processing. These genes tended to be expressed constitutively, but were most highly expressed in embryonic tissues, presumably because of rapid cell growth during development. However, many other predictions relate to neural functions, the immune response, muscle contraction, small-molecule metabolism, and other aspects of adult physiology. All of the individual predictions are provided in a table in the Additional data files, together with the expected precision and other information regarding the gene and the encoded protein, and these can be sorted by gene or by functional category.
Among the 1,092 unannotated genes, 488 (45%) have no overt sequence features suggesting physiological or biochemical function (that is, they have no similarity to previously characterized proteins or known functional domains; they are listed in Additional data files; and also see Materials and methods). Examination of the remaining 55% provided evidence that many of the predictions are likely to be correct. First, a handful of genes that were not annotated in our version of GO have in fact been characterized in the literature. For example, SVMs correctly predicted that phospholamban, the regulator of the Ca2+
-ATPase in cardiac sarcoplasmic reticulum [27
] is involved in 'muscle contraction or development'. Other genes are similar to characterized genes in other species: for example, the mouse homolog of the yeast 'Extra Spindle Poles' (ESP1
) gene was predicted by SVM to function in 'mitotic cell cycle', 'cytokinesis', and 'microtubule based process', consistent with the function of its yeast counterpart [28
A more comprehensive and objective analysis was enabled by the fact that, in an independent sequence-based analysis we conducted (see Materials and methods), known protein domain structures were encoded by 461 (42%) of these 1,092 unannotated genes (listed in the Additional data; see also the Materials and methods section) [29
]. These provided further independent support for many of the predictions, since neither the primary sequences nor the domain features of the unannotated genes played a part in the predictions. In many cases, the domains also augment the predicted physiological function with a potential biochemical mechanism. For example, 3 of the 11 genes predicted in the category 'acyl-CoA/fatty acid/peroxisome' encode a short-chain dehydrogenase motif, suggesting that they are metabolic enzymes. Among the 86 unannotated genes predicted to function in 'microtubule-based process' are 4 with chromosome-segregation ATPase domains, one with an intermediate filament protein domain, one with a kinesin-motor domain, one with a myosin heavy-chain domain, and one with a tropomodulin domain, all of which are consistent with microtubule- and/or cytoskeleton-related functions. Of the four proteins predicted in 'skeletal development', one encodes a fibrillar collagen carboxy-terminal domain, and one encodes a collagen triple-helix repeat. Some of the relationships between predictions and domains are striking on the simple basis of their numbers: 7 of the 95 genes predicted in 'humoral immune response' encode an immunoglobulin domain; 13 of the 87 genes predicted in 'chromosome organization/DNA packaging' have high mobility group (HMG) domains, and 23 of the 149 genes predicted in 'RNA processing/ribosome biogenesis' encode helicase domains, RNA-binding domains, or RNA-modifying motifs. Table lists a selection of statistically significant associations between the different prediction classes shown in Figure and protein domains.
Domains associated with genes predicted to function in specific biological processes
Comparisons among data sets for predicting gene functions
Although there was a significant correlation among the three different mouse tissue-specific data sets compared in Figure , there were also many cases in which the three data sets disagreed in their assessment of relative abundance of individual genes in different tissues (Figure and data not shown). We reasoned that the SVM cross-validation analysis could provide an objective measure of the quality of the different data sets: since random measurements lead to very poor predictions (Figure ), any errors in the data would tend to degrade the precision and recall values. While our manuscript was in preparation, an additional data set was released by Su et al.
, the authors of reference [15
]. Their newer data [30
] include measurements of 36,182 known and predicted genes over 61 tissues, measured in duplicate using custom-built Affymetrix arrays, and are thus similar in scope to our data set. Figure shows a comparison between cross-validation results from running SVMs on the three data sets: ours, that of Su et al.
], and that of Bono et al.
], with each restricted to the 13 tissues and 1,800 genes common to all three, and the same GO-BP annotations used for all three data sets. Figure shows that, although our data fare slightly better, the power of our data set and that of Su et al.
] for predicting GO-BP categories are comparable. This confirms that distinct and coordinate regulation of many mammalian functional pathways is authentic because it is observed in two independent data sets.
Comparison of tissue-specificity with co-expression for predicting gene functions
We used two different approaches to ask how well tissue specificity can predict the functional classes of genes, in comparison to co-expression. First, from our data we compiled three sets of lists: genes that are expressed in each of the 55 individual samples; genes that are expressed highest in each of the individual 55 samples and in groups of functionally related samples (for example, treating all neural tissues as a single group); and also genes that are expressed uniquely in individual samples. All of these lists (175 in total) are compiled in the Additional data files. For each of the 992 GO-BP categories, we assessed the precision and recall for each of these lists (that is, whether these lists can distinguish genes in the category from those not in the category), and then identified the best precision value and its associated recall value for that category. Figure shows a histogram of the difference between SVM precision and tissue-specificity precision, at the same recall value, for each GO-BP category. The vast majority of data points are greater than zero (P < 10-76; two-sided pairwise t test), indicating that co-expression patterns can be used (by SVMs) to predict gene functions significantly better than tissue-specificity alone.
It is possible that improved results might be obtained by other ad hoc procedures for sorting the genes in different ways, or by more automated procedures for generating large numbers of lists. However, an alternative analysis suggests that this is unlikely: when we re-ran SVMs with the matrix of 1s and 0s indicating which gene is expressed (or not) in each tissue, rather than the matrix of quantitative expression values, the resulting predictions were inferior (dotted magenta line in Figure ). In theory, if any combination of on/off information about gene expression in different tissues was informative for identifying genes in any category, it would have been identified by the SVMs. The result we obtained indicates that the quantitative measurements contain critical information reflecting functions of genes that is not, for the most part, contained in the binary (expressed/not expressed) information.
Validation of predictions by de novo functional analysis
Finally, we asked whether new functional predictions could be confirmed by directed experimentation. Among the genes we predicted to function in RNA processing and ribosome biogenesis was PWP1
, which encodes a protein that includes WD40 repeats and which has previously been found to be up-regulated in pancreatic cancer tissue [31
]. In our data, PWP1
was most highly expressed in embryonic tissues, as is characteristic of most genes annotated as 'RNA processing' by GO-BP (Figure ). The encoded protein Pwp1p is highly conserved across eukarya (Figure ) but to our knowledge it has not been functionally characterized in any species, although it has been found in the human nucleolus [32
], and in yeast it is essential for cell growth [33
]. We created a titratable-promoter allele of yeast PWP1
, and found that cells depleted for Pwp1p displayed a striking reduction in 25S rRNA (Figure ), confirming the involvement of this gene in RNA processing and ribosome biogenesis. Given that WD40 repeats are thought to be protein interaction domains, we also asked whether Pwp1p physically associates with other proteins. We found that epitope-tagged yeast Pwp1 protein co-purified with known trans
-acting ribosome biogenesis factors, as well as with several ribosomal protein subunits (Figure ), consistent with a direct role in ribosome biosynthesis.
Figure 8 PWP1 functions in ribosomal large-subunit biogenesis. (a) The expression pattern of mouse Pwp1 is similar to that of most known RNA-processing proteins. (b) The domain structures of Pwp1 homologs identified by BLASTP searches. Accession number and amino-acid (more ...)