|Home | About | Journals | Submit | Contact Us | Français|
Conceived and designed the experiments: MDC CTM OGT. Performed the experiments: MDC. Analyzed the data: MDC. Contributed reagents/materials/analysis tools: CH. Wrote the paper: MDC CTM OGT.
Tissue-specific gene expression plays a fundamental role in metazoan biology and is an important aspect of many complex diseases. Nevertheless, an organism-wide map of tissue-specific expression remains elusive due to difficulty in obtaining these data experimentally. Here, we leveraged existing whole-animal Caenorhabditis elegans microarray data representing diverse conditions and developmental stages to generate accurate predictions of tissue-specific gene expression and experimentally validated these predictions. These patterns of tissue-specific expression are more accurate than existing high-throughput experimental studies for nearly all tissues; they also complement existing experiments by addressing tissue-specific expression present at particular developmental stages and in small tissues. We used these predictions to address several experimentally challenging questions, including the identification of tissue-specific transcriptional motifs and the discovery of potential miRNA regulation specific to particular tissues. We also investigate the role of tissue context in gene function through tissue-specific functional interaction networks. To our knowledge, this is the first study producing high-accuracy predictions of tissue-specific expression and interactions for a metazoan organism based on whole-animal data.
In animals, a crucial facet of any gene's function is the tissue or cell type in which that gene is expressed and the proteins that it interacts with in that cell. However, genome-wide identification of expression across the multitude of tissues of varying size and complexity is difficult to achieve experimentally. In this paper, we show that microararray data collected from whole animals can be analyzed to yield high-quality predictions of tissue-specific expression. These predictions are of better or comparable accuracy to tissue-specific expression determined from high-throughput experiments. Our results provide a global view of tissue-specific expression in Caenorhabditis elegans, allowing us to address the question of how expression patterns are regulated and to analyze how the functions of genes that are expressed in several tissues are influenced by the cellular context.
Tissue-specific gene expression is a fundamental aspect of multicellular biology, underlying the development, function, and maintenance of diverse cell types within an organism. Accounting for tissue-specific expression is a precursor to any systems-level understanding of metazoan organismal development and function and large-scale studies of spatio-temporal gene expression both at the single-gene and whole-genome level have been performed in several organisms –. Additionally, tissue specificity is an important aspect of many complex diseases; notable examples of tissue interactions associated with disease include stroma-tumor interactions in cancer  and tissue-specific effects of insulin signaling in diabetes . Although several experimental techniques have been developed to identify tissue-specific gene expression signatures, both at the single-gene and whole-genome level, our current knowledge of tissue-specific expression is incomplete.
The model organism Caenorhabditis elegans provides a good framework for the study of tissue-specific expression. Its invariant cell lineage allows single-cell resolution of tissue-specific expression patterns through a variety of experimental techniques ,. In situ hybridizations of the entire transcriptome are in progress , and GFP-promoter tagging has been applied on a large scale ,,; as a result, the expression of approximately 3500 genes has been studied at the single-gene level , providing a “gold standard” for gene expression. Additionally, several methods have been developed to isolate mRNA samples enriched for a specific tissue or cell type, allowing global analysis using microarrays or SAGE –.
Despite the variety of techniques available and the number of studies performed thus far, our understanding of tissue-specific expression in C. elegans is not yet complete; most genes have not been analyzed at the single-gene level, nor under diverse conditions and developmental stages. Additionally, each of the individual techniques for measuring tissue-specific expression suffers from drawbacks. GFP-promoter constructs, though they present the most accurate method amenable to high-throughput analysis, may incompletely capture endogenous expression or may fail to express well, a problem that is particularly severe in the germ line due to silencing . Directed microarray studies, while powerful, depend on the ability to isolate mRNA from a particular tissue, since dissection is not possible in most cases, and methods to achieve this each have disadvantages: studies using mutants may report non-endogenous expression; embryonic cell sorting misses expression that only occurs in later stages of development, as post-embryonic cell sorting is not yet feasible; and poly-A binding studies depend on the ability to introduce the binding protein construct into and extract the protein out of the tissue of interest . Thus, the ability to directly study the expression specificity of each gene across tissues, especially small tissues, and ideally to also account for the effects of development and environmental conditions, remains challenging.
Here we present a computational method that leverages existing experimental information to expand and improve our knowledge of tissue-specific expression. Using data from whole-animal microarrays, we accurately predict tissue-specific expression in all major tissues and even for several tissues that comprise only a few cells. Our approach not only outperforms directed high-throughput studies in all but one case, but also captures information that complements existing experiments, for example, by uncovering tissue-specific expression that is only seen under specific conditions. To confirm our predictions, we experimentally verified the expression of several genes. We have made our predictions available through a dynamic web-based interface at http://function.princeton.edu/worm_tissue to enable hypothesis generation and further experimental follow up by the community.
Using this accurate large-scale, tissue-specific information, we perform further computational analyses, such as prediction of transcriptional regulatory motifs specific to understudied tissues as well as tissue-specific miRNA target regulation. In addition, we extended our algorithm to produce tissue-specific functional interaction networks that provide a framework for discovering protein function specific to particular tissues. Our ability to uncover tissue-specific information should allow higher-detail analysis of expression and further hypothesis testing to identify expression changes that are important for biological function.
We compiled a large compendium of C. elegans microarray data (comprised of 916 experiments from 53 datasets). A few (16) of these microarray studies address tissue-specific expression, but most studies examined changes in gene expression in the animal as a whole (see supplementary website at http://function.princeton.edu/worm_tissue for a list of microarray experiments used). Using a rank-based statistic, we evaluated the level of under- or over-expression of genes associated with each tissue in a given microarray experiment against a “gold standard” of 2872 genes known to be expressed in a particular tissue. Our gold standard is composed of information derived from single gene studies such as promoter-GFP tagging, antibody staining, and in situ hybridizations (WormBase), which we hand curated to account for tissue naming synonyms. The gold standard also includes the 1872 promoter –GFP fusions from the C. elegans Tissue Expression Consortium ,,. Importantly, the gold standard is completely independent from the microarray or SAGE gene expression data in our compendium. This gold standard of tissue-specific gene expression allowed us to identify substantial tissue bias in the transcriptional responses of microarray experiments. We quantified over or under-expression of tissue-specific gene sets using a rank-based statistic (Figure 1A). Despite the fact that only a small number of studies isolated specific tissues, we found that tissue-specific signals can be observed in many whole-animal experiments. For example, analysis of two developmental time courses  revealed dramatic tissue-specific temporal patterns that reflect developmental timing; as might be expected because neurons are born in early larval stages, earlier developmental stages are enriched for neuronal transcripts, while later stages are enriched for germ line transcripts, correlating with the development of reproductive tissues and the onset of reproduction (Figure 1B). We can also quantify a number of previously uncharacterized tissue-specific responses. For example, motor and sensory neurons have distinct developmental profiles (Figure 1B) and, in contrast to other non-reproductive somatic tissues, intestinal expression steadily increases with developmental stage.
Tissue-specific responses can also be observed when experimental treatments are applied to animals in the same developmental stage. For example, our analysis of tissue-specific signals in a whole-animal microarray study of unfolded protein response  revealed that various mutations in UPR pathway genes have different effects on tissue-specific expression (Figure 1C). Consistent with previous studies ,, we observed that an ire-1 mutation has a strong effect on epithelial tissues such as the intestine and the excretory cell, as genes expressed in those tissues are significantly down-regulated as a result of ire-1 mutation. On the other hand, an atf-6 mutation causes a decrease in neuronal transcripts suggesting greater reliance on the atf-6 branch of the UPR in neurons. Distinct tissue-specific profiles can be observed for other treatments as well. Thus, our analysis demonstrated that we can identify both known and novel tissue-specific expression information from existing gene expression microarray experiments.
The previous examples suggest that substantial information about tissue-specific expression can be gained by a directed analysis of whole-animal microarray data. As such, we applied a state-of-the-art machine learning algorithm, support vector machines (SVM) , to build a predictive model of tissue-specific microarray profiles. Intuitively, SVM automatically identifies expression patterns in our compendium whose combination maximally separates genes expressed in a particular tissue (e.g., neurons) from other (e.g., non-neuronal) genes. This classifier can locate hidden tissue-specific expression patterns that are scattered through only a few experiments in the compendium and might come from diverse types of studies. By contrast, clustering methods (e.g. standard hierarchical clustering  or the C. elegans TopoMap ), while clearly important for functional data exploration, cannot detect these signals at resolution sufficient for prediction of tissue-specific expression (see Table S1 for comparison between correlation and SVM). Using the SVM classifier to predict tissue-specific gene expression based on the microarray compendium, we achieved a high degree of accuracy, outperforming directed microarray-based studies of tissue-specific expression in most cases. Our evaluation is based on the standard cross-validation technique, where only a fraction of the genes with known expression is used for building the classifier while the rest is held out for evaluation. Our predictions reach a precision of 90% for all of the major tissues of the worm (intestine, hypodermis, muscle, neurons, and pharynx) except germ line (Figure 2A). It is likely that germ-line performance is substantially underestimated, since the expression of many of the genes in the gold standard was investigated using promoter-GFP fusions, which are often germ-line silenced .
We also evaluated performance of other tissue-enriched gene lists acquired from directed microarray experiments against the same gold standard, using the processed lists from each publication. In all but one case, our approach outperforms these studies, predicting more genes at higher accuracy (Figure 2A). The single exception is the neuronal gene list published by Von Stetina et al.  that correctly recalls 384 of our gold standard neuronal genes with 89% accuracy, somewhat above what we are able to predict at the same recall.
Our method accurately predicts tissue-specific gene expression even from whole-animal microarray data alone. When we exclude the 16 studies that directly address tissue-specific expression, our prediction accuracy remains high; in some cases it is even unchanged (Figure 2A, “tissue data excluded”). In particular, even for the intestine, for which there are a number of high-quality directed studies ,, our prediction accuracy is not decreased when we use only whole-animal data.
Functional analysis of the top tissue-specific predictions (GO enrichment analysis) demonstrates that many of the genes we predict to express in specific tissues have functions consistent with that tissue. For example, predictions for germ line expression were enriched for cell cycle-related GO terms, those for muscle included “muscle contraction” and “respiration”, intestine included terms related to digestion and metabolism such as “fatty acid biosynthetic process”, neuron predictions were associated with “synaptic transmission” and “memory”, and hypodermis-expressed predictions included enrichment for terms related to molting and cuticle components. The pharynx is a complex organ that is comprised of muscle, structural and gland cells and genes predicted to express in the pharynx are enriched for diverse functions related to cytoskeleton, cuticle components, and secretion. (See supplementary website for all GO enrichment results.)
While techniques for isolating tissue-specific mRNA are steadily improving, it remains a particular challenge to examine the expression of genes in smaller tissues. Therefore, it is of particular interest to be able to predict expression in tissues that are comprised of only a few cells. Using our approach, we were able make high-quality predictions for many tissues where biochemical methods have yet not been successfully applied. While we do not achieve the high level of precision we observe in major tissues (which is expected, as far fewer genes are reported to express in the smaller tissues, making new candidates significantly more difficult to identify), we were able to identify genes that are significantly enriched for expression in the small tissue of interest when compared to the genomic background (Figure 2B). For example, among the genes in our gold standard, only 1 in 10 express in the vulva. However, we were able to correctly recall 30% of all vulval genes with a precision of 20% percent, a two-fold improvement above the genomic background rate, and likely an under-estimate as our GFP-based gold standard is far more incomplete for these small tissues than for larger tissues. Among the genes that scored highly in vulval predictions is dgn-1, a homolog of human dystroglycan. The gene was not a top prediction for any major tissue except pharynx, suggesting that it is not widely expressed. Among small tissues, dgn-1 was predicted to express in the uterus, distal tip cells, and the excretory cell in addition to the vulva. While dgn-1 was not included in our gold standard, expression in these tissues, including expression in pharyngeal epithelia, has been confirmed recently . Additionally, this gene has been shown to be functionally important for the development of the vulva and the excretory cell ,, in contrast to its vertebrate homolog, which functions in muscle.
Among other small tissues, we were also able to make reasonable predictions for the excretory cell, the spermatheca, the uterus, ceolomocytes, and distal tip cells. In many cases the predicted genes have annotations that are consistent with the function of the tissue. For example, our distal tip cell predictions are enriched for many GO terms including “cell migration,” “protein localization,” and several “cellular component” terms associated with exocytosis. These GO associations appear reasonable, as distal tip cells are two highly polarized cells that lead gonad migration during development. Secretion from these cells is known to play an active role in gonad migration , and the cells' morphology (as visualized by EM) is indicative of active endo/exocytosis . In addition, the top 200 distal-tip cell predictions significantly (p<10−2, hyper-geometric test) overlap with the list of genes associated with distal tip cell migration phenotypes compiled in a recent RNAi study . Thus, our results demonstrate that even small tissues that are challenging to isolate experimentally have distinct expression profiles within whole-animal microarray data. Our ability to make such predictions will likely improve as new gene expression experiments are added to the compendium.
We experimentally verified tissue-specific expression of six top genes with previously unreported tissue-specific predictions by creating transgenic lines carrying promoter-GFP constructs (Figure 3). Three of these genes were predicted to express in hypodermis. We chose to focus on hypodermis since, to our knowledge, no large-scale study investigating hypodermal expression has been reported. Promoter-GFP constructs of two of the predicted hypodermal genes, K08B12.1 and F58H1.2, were most prominently expressed in the hypodermis at earlier stages (Figure 3A and 3B and Figure S4). The third gene, F55H12.4, showed strongest hypodermal expression during L4 and adult stages (Figure 3C). We also verified the expression of genes that we predicted to be expressed in muscle (C29F5.1, Figure 3D), intestine (F13D12.6, Figure 3E), and neurons (gnrr-1, Figure 3F). The tissue specific expression of gnrr-1, a homolog of the human gonadotropin releasing receptor, was previously studied using antibody staining . While our algorithm predicted with high confidence that gnrr-1 expresses in neurons, neuronal expression was not reported in that study, and the gene was not included in the Von Stetina et al. list of neuronally-enriched genes . Nevertheless, our promoter-GFP (Pgnnr-1::gfp) construct expressed primarily in head neurons and ventral cord neurons (Figure 3D), validating our prediction. It is likely that the protein product of gnrr-1 is heavily post-translationally modified, as species of multiple molecular weights are observed . Thus, it is possible that differences in such modification explain the discrepancy between our gene expression results and the previous antibody staining experiment, due to epitope differences. Furthermore, gnrr-1 is strongly over-expressed in L1 and L2 larval stages in multiple developmental microarray time courses, which is the pattern observed for many neuronal genes
Our ability to make high-quality predictions also provided potential insights regarding the transcriptional regulation associated with the tissue-specific expression signal in whole animal data. We used a motif-finding program, FIRE , to identify motifs that are overrepresented in the upstream regions of our top-scoring predictions for each of the major tissues (Figure 4). While no genome-wide study of hypodermal expression has been published thus far, we were able to use our predictions to uncover motifs that are promising candidates for regulators of hypodermal transcription. A GATA-like motif was enriched among our top hypodermal predictions. This is consistent with previous studies showing that GATA transcription factors are essential for hypodermal cell specification, and that a GATA consensus sequence is required for hypodermal expression ,. In addition, we have identified a motif that is similar to the binding site for the CF1/USP-like nuclear hormone receptor that affects molting and developmental transitions in insects . An intriguing possibility is that this motif and a functional USP homolog are involved in the nematode molting process as well, despite the fact that no direct USP homologs have been detected in the genome .
Using our germ-line predictions, we recovered an E2F-like motif (ETF). The C. elegans homolog of mammalian E2F, efl-1, is expressed exclusively in the germ line and is involved in oogenesis, regulating the expression of genes whose promoters contain the E2F binding motif . Another motif, TAC.GTA, was also strongly represented among germ-line predictions. We could not detect a clear match to any known transcription factor consensus sequence, but a similar motif was previously discovered in a C. elegans-C. briggsae sequence comparison .
A GATA-like motif was also overrepresented among intestine predictions. GATA transcription factors are known to regulate expression of intestinal genes , and this motif is very similar to those reported by previous whole genome intestinal expression studies , and aging studies ,. Our pharynx prediction yielded the largest number of motifs of any tissue. One of the motifs represents a possible match to the pha-4 consensus ([T[AG]TT[TG][AG][TC] ) though other motifs did not resemble any known binding sites (see Table S2 for a complete list of motifs). Surprisingly, there was a shortage of neuronally-overrepresented motifs. In fact, the most significant result for neurons was instead motif avoidance. This is consistent with the hypothesis, supported by many experimental observations in C. elegans (see for example ,) that neuronal differentiation is a “ground state” that is superseded in non-neuronal cells.
The identification of global tissue-specific expression patterns allows us to address biological questions that are difficult to address experimentally, such as the question of tissue bias in microRNA targets. Non-coding microRNAs have emerged as critical developmental regulators, and are predicted to regulate the expression of a large fraction of all mammalian genes ,. Specific miRNAs direct development in particular tissues ,, yet experimental identification of miRNA targets in individual tissues remains difficult. This is in part because expression of miRNA targets may be unchanged if translational inhibition, as opposed to mRNA degradation, is involved. Moreover, the ability to identify all targets for all miRNAs simultaneously is still more challenging.
Previous studies using human data have detected cell type-specific signatures among miRNA targets . To address this problem in C. elegans, we leveraged our predictions of tissue-specific expression to investigate tissue bias, as measured by a rank-based statistic, among a list of likely C. elegans miRNA targets predicted by Miranda , TargetScan , and PicTar . While many miRNAs had no detectable tissue bias among their targets, a subset showed significant tissue preference or tissue avoidance (see Figure S3 for all microRNAs-tissue interactions). In particular, robust tissue avoidance for three microRNAs was detected in all three sets of target predictions (Figure 5). The miR-124 mammalian homolog is known to induce neuronal differentiation . Our analysis demonstrates that its predicted targets are depleted for neuronal genes, while enriched for genes specific to other somatic tissues; these results suggest that its function is conserved in C. elegans. miR-2 showed a pattern of neuronal depletion similar to mir-124's pattern, implying that it is also involved in neuronal differentiation; this is consistent with the exclusively neuronal pattern of GFP expressed from the miR-2 promoter . The mir-71 target set, on the other hand, is significantly depleted for intestinal genes but enriched for genes expressed in muscle, hypodermis and pharynx. In contrast to miR-2, the anatomical expression of miR-71 appears to be ubiquitous, suggesting that tissue-centric target analysis provides complementary information that is not captured by expression studies.
We have been able to leverage diverse microarray data to predict tissue-specific expression, including for genes expressed in more than one tissue. However, many genes that are expressed in several tissues (or ubiquitously) perform different functions in different cellular contexts. A natural way to explore such functional roles is through functional interaction networks, which connect genes that participate in the same biological process, an approach that has been used by us and others to examine functional roles of proteins on whole-genome scale ,. In contrast to previous approaches, in the case of tissue-specific functional networks, a network for a given set of genes may vary depending on the tissue of interest, as the same set of gene products may not perform the same function or share the same physical or other interactions in different tissues.
We have developed an SVM-based algorithm to predict tissue-specific functional networks from our compendium of C. elegans transcriptional data. Although simple expression correlation has often been used to investigate gene function on a global (non tissue-specific) level, our analysis above (and in Figure S1) demonstrates that a single global correlation computation is unable to distinguish between tissue-specific effects. On the other hand, the observation that whole-animal microarrays may contain a strong tissue specific signal suggests that it is possible to assess the tissue-dependent functional roles of genes given the right analytic approach. Thus, we have developed a network generation algorithm in which certain experiments are trusted more or less depending on the extent to which they reflect a particular tissue-specific functional signal.
Similarly to previous network integrations, we define a gold standard of functional interactions that is then used to determine how data is combined into a network. However, in contrast to previous studies ,, we define several tissue-specific gold standards, one for each tissue, and we use an SVM rather than a Bayesian formulation to combine microarray data. An advantage of the SVM for this problem is that SVMs have the ability to adjust weights of individual experiments while Bayesian integration typically assigns weights to whole datasets. In the case of the C. elegans compendium, the ability to treat each experiment individually is crucial for prediction of tissue-specific networks, as a single dataset can contain experiments that are informative for different tissues. For example, within a single developmental time course (see Figure 1B), early larval stages are informative of neurons, when neuronal cells are overrepresented, while the adult stage is highly informative of germ-line.
Using an SVM-based approach, we are able to integrate microarray data into different tissue-specific functional interaction networks. Such networks link genes that are likely to participate in the same process within a specific tissue context and contain information that may otherwise be overwhelmed in a global view of co-expression. As an example, we considered exc-7, an RNA-binding protein that is involved in the formation of the excretory canal, but that also plays a role in neuronal development, affecting cholinergic synaptic transmission . Several of the interaction partners present in its neuron-specific interaction network support our understanding of exc-7 neuronal function (Figure 6A): hmr-1 is required for the outgrowth of some motor neurons ; unc-38 is an acetylcholine receptor ; and the mammalian homolog of abl-1 is involved in post-synaptic acetylcholine receptor clustering . Another partner, rhgf-1, a RhoGEF, is known to regulate neurotransmitter release at the neuromuscular junction . Our network results also suggest an interaction between exc-7 and smg-1, a key component of the nonsense-mediated mRNA decay pathway, and spk-1, which is involved in mRNA splicing . The presence of RNA processing genes among the interaction partners is potentially related to exc-7's RNA-binding function. A standard correlation computation produces an entirely different, non-neuron-specific set of genes associated with exc-7, including aquaporins and a gene involved in excretory cell formation (Figure S2). Our technique, on the other hand, automatically identifies a subset of microarray experiments with strong neuronal signals, and thus we are able to uncover neuron-specific functional interactions that are not immediately visible in a global correlation network.
Apart from finding tissue-specific interactions that can be lost in a global view, as in the above example, tissue-specific networks have the potential to tease out how the same gene may perform different functions within different tissue contexts. The C. elegans homolog of Ras, let-60, is a canonical example of a ubiquitously-expressed gene that participates in diverse processes. For example, let-60 promotes progression through meiosis during oogenesis  and affects olfaction in neurons . To explore these tissue-specific functions of this gene, we queried our germ line and neuronal networks with let-60. Two of the genes in the neuronal network are involved in chemosensation (Figure 6B): rgs-3 is a regulator of G-protein signaling required for normal response to a variety of sensory stimuli , and ckk-1 is a CaM kinase kinase that regulates the expression of chemosensory receptor genes . Other neighbors in the network are involved in further aspects of neuronal function: zaf-1, syd-9, and sad-1 function in synapse development, and egl-19 is a calcium channel that contributes to fate specification in olfactory neurons .
By contrast, the germ line let-60 network is comprised of an entirely different set of genes that are consistent with let-60's function in meiosis: cej-1(cpg-1) is required for proper meiotic chromosome segregation , and zyg-11 is part of a ubiquitin-ligase complex that promotes meiotic anaphase II . Other interactors are likely to participate in related processes: zen-4 is a kinesin protein that localizes to midzone microtubules ; kbp-1 localizes to kinetochores ; and both rfc-4 and pos-1 affect a large number of events in the oocyte to embryo transitions . Our networks focus on interaction information within a tissue-specific context, providing a framework for generating precise hypotheses about tissue-specific gene functions that can help direct follow-up experiments.
We have developed a computational method that accurately predicts tissue-specific expression based on expression profiles of primarily whole-animal microarrays. We show that strong tissue biases can be observed in data from microarray experiments, despite the fact that most C. elegans microarray experiments isolate mRNA from the whole animal, with the resulting expression values representing a population average of many cell types. With our SVM classifier, we were able to leverage these signals in existing whole-animal microarrays to produce predictions of tissue-specific gene expression and generate networks of tissue-specific functional interactions.
In addition to achieving accuracy higher than most directed microarray studies, our algorithm captures information about tissue-specific expression that is complementary to standard approaches. Microarray experiments analyzing tissue-specific expression are able to discover tissue-specific genes based on the difference in mRNA levels, a method that is ultimately sensitive to total mRNA abundance. Our method instead relies on co-expression with known tissue-specific genes in some informative condition, and thus identifies tissue-specific expression even for genes that have very low levels of expression in any one experiment. As we analyze microarray experiments from a variety of conditions, our approach can uncover genes expressed in a particular tissue in a condition-dependent manner which may be difficult to directly detect experimentally. For example, a promoter-GFP tagging study reported expression of ins-7 exclusively in neurons , while our method predicts expression in both neuron and intestine. In fact, a recent study has shown that ins-7 is indeed expressed in the intestine at a low level, with expression increasing significantly in aging animals and under conditions of high insulin signaling . The earlier GFP study focused on young wild-type adults and thus did not identify this age-related expression. Thus, our method provides a valuable tool for study of tissue-specific expression that is relatively unbiased, as it does not rely on mRNA abundance directly and can leverage existing whole-animal compendia that provide a variety of developmental stages and conditions represented in these collections.
From a more general perspective, our method extracts tissue-specific expression and interaction information from large compendia of diverse microarray studies. Even in the case of larger animals where it may be feasible to perform microarray studies on dissected tissues, the underlying samples are nevertheless typically comprised of multiple cell types; a method to predict gene expression in tissue subtypes will be applicable to other organisms, limited only by the existence of an appropriate “gold standard” gene expression set. Our results demonstrate that sample heterogeneity, when appropriately analyzed, can provide valuable information regarding cell-type specific gene expression and function.
Tissue localization data was retrieved from WormBase 170  and parsed in a semi-automated way. Since a variety of terms are used to describe the same tissue and/or organ, we hand-compiled a table of tissue synonyms. In addition we applied some hierarchical propagation to tissue labels, such as assigning specific neurons to their neuron class (sensory, motor, interneurons). A majority of genes were reported to express in multiple tissues and each gene was considered a positive example for all tissue where it was found to express. This data includes all 1,872 genes investigated by the C. elegans Tissue Expression Consortium , as well as expression patterns from smaller scale experiments, for a total of 2872 genes in the gold standard. These data did not include any large-scale expression studies (microarray or SAGE), and was limited to single-gene GFP or in situ experiments.
We collected microarray data from 53 publications (see Supplementary website for complete list). The microarray values from a single publication were considered a coherent dataset and processed together. Data for single-channel platforms was transformed by dividing every gene value by its average over the dataset and taking the log of the result. All missing values were imputed using the KNN impute algorithm  (k=10). For input to SVM learning the gene values within a single dataset were normalized to mean 0 and variance 1 before all datasets were concatenated. Since the SVM algorithm does not accommodate missing values, genes that were present in some datasets but not others were assigned a value of 0 when absent.
For each tissue we used our gold standard to assign genes with known expression into 2 classes (tissue expressed and not tissue expressed). We the used the two classes and the microarray expression values to calculate an AUC score and the associated probability. The probabilities were used to correct the results for multiple hypothesis testing at a false discovery rate of 0.05.
Single gene predictions were made using linear support vector machines (SVM). Given a set of genes known to be expressed in a particular tissue, the SVM identifies specific patterns of gene expression in a subset of experiments that differentiates these genes from those not expressed in the tissue. We performed 5-fold cross validation and optimized the parameters for maximal precision at 30% recall (fraction of genes in the gold standard correctly recalled) for major tissues and 10% recall for small tissues. SVMs are a maximal margin classifier that optimizes classification performance on the training set while maximizing model generalizing power by maximizing the distance of the nearest correctly classified examples to the separating plane. If and define the plane that separates the positive and negative examples, are the vectors of microarray data, are the training labels, and denote the degree of misclassification for each example, the SVM problem is to minimize
subject to . The constant is empirically optimized to achieve the best performance at classifying new examples.
Genes were selected based on the following criteria: top prediction scores that are specific to a single tissue, no previously reported tissue-specific localization to that tissue, and absence of any tissue-bias that could be inferred from sequence information alone. In particular, we avoided all collagen-related genes predicted to express in hypodermis due to ease of prediction of this particular tissue-specific expression from sequence. In addition, we specifically selected gnrr-1 because of the discrepancy between our predictions (made with a top prediction score) and previously published results (). Based on the above criteria we picked 14 genes, for which we obtained 9 lines; 6 of these fluoresced and these 6 are all shown in Figure 3. The GFP-promoter constructs were made using the Gateway system with the unc-119 rescue plasmid pDestDD03 and promoter clones from the C. elegans promoterome . The resulting constructs were bombarded into unc-119(ed3) mutants.
Motif discovery was performed for each tissue separately. For a single tissue, all the genes that were present in our microarray compendium were assigned a cluster number of 1 if they were in the top 500 predicted genes and a cluster number of 0 otherwise. This cluster assignment was used as input to the FIRE algorithm. Kmer length was set to 9 and default values were used for all other parameters.
To generate the tissue-specific interaction standard we first generated a global functional interaction standard using a combination of GO, KEGG, and Textpresso-curated interactions ,. We then defined a set of tissue-specific interactions by cross-referencing with our gold standard of tissue expression used for single gene expression prediction. A tissue-specific interaction was defined as a pair of genes that were co-annotated to a specific GO term (see Supplementary methods) and were also both found to express in a particular tissue in our expression gold standard. The negative set was composed of positive interactions from other major tissues as well as random pairs of GO annotated genes. The classification problem is then to differentiate interactions specific to a particular tissue from interactions in other tissues as well as non-interacting gene pairs.
The algorithm computes a weighted sum of single experiment similarity measures. Since the expression values are normalized to have mean 0 and variance 1, single experiment similarity measures are thus single terms within a per-dataset Pearson correlation. The contribution of expression data to the final value is thus
where and represent the expression values of genes and in experiment and is the weight assigned to that experiment by the SVM classifier. (See Text S1 for a detailed description).
Method flow for SVM predictions. Five-fold cross validation was used to generate precision-recall plots, optimize learning parameters, and calculate estimated precision for novel predictions.
(0.22 MB PDF)
All miRNA target tissue interactions as measured by a ranksum statistic significant at 0.01. Numbers inside the cells represent how many target prediction sets gave a significant result (out of 3, Mirna, Pictar, Targetscan). When multiple target sets give significant results the ranksum statistic is averaged. “D” signifies that multiple target prediction sets gave significant results but disagreed in the direction.
(0.04 MB PDF)
A correlation network for exc-7 computed across the same expression data as was used for SVM learning. In contrast to the neuron-specific network generated by our method, this network is more representative of exc-7's excretory cell function. aqp-3 and aqp-10 are aquaporins, while eor-1 is known to affect excretory system development.
(0.03 MB EPS)
Additional imaged of strains expressing hypodermal GFP.
(0.07 MB PDF)
Comparison of area under precision-recall curve (corrected for base-line) between a correlation-based method (sum of correlations with known tissue-specific genes) and an SVM based method.
(0.01 MB PDF)
All significant motifs. Top 500 predictions for each tissue were used do define a cluster for FIRE motif analysis . The regular expressions corresponding to significantly enriched or depleted motifs are shown.
(0.02 MB PDF)
Network Prediction Methods
(0.03 MB PDF)
We would like to thank members of the Troyanskaya and Murphy labs for reading the manuscript, and members of the Murphy lab for experimental assistance.
The authors have declared that no competing interests exist.
This research is partially supported by NSF CAREER award DBI-0546275 to OGT, NIH grant R01 GM071966, NIH grant T32 HG003284, and NIGMS Center of Excellence grant P50 GM071508. OGT is an Alfred P. Sloan Research Fellow. CTM is a Pew Scholar, a McKnight Scholar, and a Keck Young Scholar. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.