Nearly all C. elegans genes show differential expression over development
In order to investigate C. elegans
development on a genome-wide scale we used Agilent 4×44k C. elegans
oligo microarrays to measure expression of 13,474 genes at 6 different developmental time points (egg, L1, L2, L3, L4, and young adult) in two different strains (N2 and CB4856). We analyzed the data using a two-way ANOVA to identify the relative impact of strain, stage, and strain by stage interaction on the observed transcript levels. 12,390 (91.9%) transcripts were found to have a significant stage effect, 2,797 (20.8%) were found to have a significant strain effect, and 283 (2.1%) were found to have a significant strain by stage interaction term (q<.05, Table S1
). A strain by stage interaction term indicates that the developmental pattern of expression for that gene differs between N2 and CB4856. To determine which stage has the largest number of transcripts whose expression levels differ between N2 and CB4856 we analyzed each stage separately. A one-way ANOVA was used to measure the contribution of strain towards the observed variation. As can be seen in , variation in transcript levels between N2 and CB4856 is the highest during the L4 stage, a finding consistent with the idea that selection is relaxed during the later developmental stages 
. Overall, 2,211 genes are expressed differently between N2 and CB4856 in at least one stage. Most of the strain effect is not due to large-scale deletions of the gene in question, as only 88 genes in the dataset were identified as deleted in CB4856 
The L4 stage displays the largest variation in expression levels between strains.
Much of the observed variation due to strain by stage effects is due to differential expression of innate immunity genes
Four general expression patterns emerged when genes displaying a strain by stage interaction were clustered hierarchically (, Table S2
). One set contains genes that are mostly expressed only during the egg stage of either N2 or CB4856 (A). Two sets of genes (B) are mostly only expressed in either N2 or CB4856. One set of genes turn on roughly one stage earlier in CB4856 than in N2 (C), while another set of genes turn off earlier in CB4856 than in N2 (D).
Genes that display significant strain by stage variation fall into four main categories.
Several gene families are enriched in this set of genes, including: Math-BTB (p
.002), Duf-19 (p
2.4e-5), major sperm proteins (p
.0035), protein tyrosine kinases (p
.000953), protein tyrosine phosphatases (p
6.21e-15) and serine/threonine kinases (p
4.03e-7). Genes containing Math-BTB or DUF-19 domains are preferentially deleted in CB4856 when compared to the N2 genome, relative to other gene families 
. The differentially expressed protein tyrosine kinases and phosphatases are members of cluster C () and are expressed earlier in CB4856 than in N2, as are the major sperm proteins.
Many of these gene families have been implicated in innate immunity 
. Members of the C-type lectin and the pathogenesis related protein families are also present in the genes that display strain by stage variation. In order to test whether genes that have been implicated in the innate immune response are significantly enriched in this set, a list of genes that are upregulated in response to pathogen exposure was curated by hand, mainly from recent microarrays investigating the response of N2 young adults to multiple types of pathogens [Table S6, 29]
. The genes showing a strain by stage interaction display significant enrichment of genes that have been implicated in innate immunity (p
Of these genes, none are highly expressed during the egg stage (). Every other pattern observed in the strain by stage significant genes is seen in this subset, including a set of genes that are highly expressed in one strain, but expressed at a very low level in the other. This could be indicative of the different pathogens that each strain is exposed to in nature, or due to different mechanisms that each strain uses in response to the same pathogen.
Innate immune response genes show several patterns of expression.
Clustering identifies temporal patterns enriched for GO-terms associated with development
The Short Time series Expression Miner (STEM) was used to cluster the microarray data due to its ability to take into account the temporal nature that is inherent in a developmental time course 
. STEM selects a set of model profiles independent of the data, and the algorithm decides which model profile best fits the expression pattern for each gene. The genes are distributed among the clusters without regard to the number of model profiles used, and the algorithm takes into account that the time points are ordered and are not independent measurements. For the rest of the analysis, only the N2 data was used in order to allow for the comparison of these results to previous data and to avoid the complications of having strain effects obscure stage effects. Over 9,000 genes cluster into significant STEM clusters. When the CB4856 data is used, the resulting clusters are similar to those obtained using the N2 data, as the expression of most transcripts mainly vary by stage (data not shown). 13 of the 50 model profiles have a larger than expected number of genes assigned to the cluster and are called significant ().
Thirteen significant expression profiles are found in the N2 data.
The clusters recapitulate known biology. Clusters 2–5 and 7 show significant enrichment for genes that are in the gene ontology (GO) category of multicellular organismal development, which is the most significantly enriched GO-term, as well as embryonic development ending in birth or hatching. Clusters 2 and 4 are also enriched for the GO-terms larval development (sensu nematoda) and post-embryonic development. The general developmental role can be broken down further. Cluster 2 shows enrichment for genes related to hermaphrodite genitalia development and sex differentiation. This cluster shows a peak in the egg stage as well as the L3 and young adult stages. It is known that the cells involved in the development of the gonad are born in the embryo, differentiate during L3 and have matured by the time egg laying begins in the young adult stage 
Combining current knowledge with the clustering results can also help to uncover potentially novel biology. Clusters 3, 4 and 7 are all enriched for cell-cycle response genes and for terms related to development. However, their different profiles may indicate different functions within the larger class of genes relating to the differential timing of developmental programs. Cluster 7 is the most complex of the three clusters, with many GO-terms associated with the genes in the cluster, ranging from M phase to intracellular organelle part. The genes belonging to cluster 3 appear to be mostly responsible for the early embryonic cell divisions and patterning as they are also enriched for P-granule function and pole plasm location, both of which are extremely important in C. elegans
early lineage specification 
. Cluster 4 genes are also expressed mainly in the egg stage; however, their functions appear to be broader, related to differentiation of cells into organs and tissues. In addition to cell-cycle related GO-terms, the terms cellular developmental part, organ development, morphogenesis of an epithelium, anatomical structure regulation, and cell differentiation are all enriched. Genitalia development and regulation of vulval development are also enriched, perhaps suggesting slightly different function for these genes during expression in the L4 and young adult stage. By breaking down a single biological function, the cell cycle, into different sets of genes through clustering, it becomes possible to dissect broader functions into narrower ones.
One surprising observation from this data is the prevalence of GO-terms related to neuronal development or function that appear in the developmentally organized clusters, especially clusters whose expression peaks after the embryonic stage. 80 new neurons are born during the larval stages of development—mainly during the L1 and the L2 larval stages, although the PDA neurons are born during L3 and several neurons that are born in the embryonic stage have been shown to reverse polarity in the larval stages 
. Cluster 8 contains 501 genes whose role appears to be in neuronal activity. Broad aspects of neuronal activity are represented in this cluster. Terms from extra-cellular glutamate-gated ion channel to neurological system process to ion transport to axon function are all enriched. This set of genes is interesting because it is also enriched for terms such as pharyngeal pumping and eating behavior. UNC-73, a neurotransmitter that has been shown to be required for the regulation of pharynx pumping, is a member of this cluster 
. Several of the eat genes 
are also found in this cluster. The genes in this cluster are expected to be important for neural function, rather than creation, as no new pharyngeal neurons are created after hatching 
. Expression of this cluster of genes is highest in the L1 stage, when the worms begin feeding. Lower expression in subsequent stages is presumably due to the creation of stable proteins. Cluster 13, which shows peak expression level during the L3 stage, is also enriched for GO-terms related to neural function, including neurotransmitter activity and ion channel activity. Rather than functioning in neural creation, this cluster of genes appears to be enriched for signal transduction of neuronal responses. GO enrichments for cluster 13 include neurotransmitter receptor activity, postsynaptic membrane and transcription factor activity. Thus these genes appear to act in response to neural signals and perhaps help to create, or maintain, synaptic connections. A complete list of significant GO-terms for all of the significant clusters can be found in Table S3
Genes in clusters contain motifs with a known developmental function, as well as novel motifs
In order to identify motifs important for developmental regulation of expression levels, the genes belonging to the significant STEM clusters were analyzed using FIRE 
. FIRE uses mutual information to look for enrichment of motifs in both the 5′ promoter region and the 3′ untranslated region (UTR) in one cluster compared to the rest of the clusters. Motifs found in the 5′ region are expected to be binding sites for transcription factors or other DNA binding proteins that affect transcription, while motifs in the 3′ UTR are expected to be sites for regulation of mRNA by microRNAs or RNA binding proteins. The FIRE analysis was implemented using the STEM clusters described above (), as well as with clusters that are enriched for GO-terms related to development and computed without the young adult time point (clusters not shown). This was done to ensure that the young adult time point did not drive the clustering.
59 motifs were found in the complete dataset—14 motifs from the 3′ UTR and 45 motifs from the 5′ end. Of these motifs, nearly half are highly conserved (conservation index >.9) between C. elegans and C. briggsae. The conservation score measures the fractions of 7-mers that are better conserved than the motif, i.e. a conservation score of 1 means that the motif is better conserved than all 7-mers. For the data without the young adult stage and using only the significant clusters from STEM that were enriched for GO-terms pertaining to development, 19 motifs were found—11 from the 5′ promoter region and 8 from the 3′ UTR. Of the 19 motifs found in the smaller dataset, 7 were also found in the larger one. Some of the motifs that were seen in the smaller dataset, but not the larger, could have been obscured through the inclusion of many more genes not directly related to development. As much of worm biology is still unknown, the exclusion of clusters not directly relating to development could also obscure relevant biology. This is shown by the fact that several known developmental motifs are found in the larger dataset but not in the smaller.
The FIRE analysis is able to recapitulate known biology. For example, the HNF-4 motif was found in both analyses. HNF-4 is the hepatocyte nuclear factor 4, a nuclear hormone receptor that is the vertebrate homolog to the C. elegans
NR2A4 family 
. Nuclear hormone receptors (NHR) are important for a broad range of developmental phenotypes in multiple species 
. The C. elegans
NHRs that are most homologous to the vertebrate HNF4 are classified as belonging to the supplementary nuclear receptor (supnrs) group I, which consists of 24 members 
. All members of this class share a conserved DNA binding domain and thus may bind the same stretch of DNA while interacting with other proteins to lead to differential patterns of expression both temporally and spatially 
. NHR-40, which is a member of this family, is required for the development of muscle cells 
. NHR-60 appears to be a maternally deposited mRNA and is required for embryogenesis and early larval development 
. Many supnrs are expressed in the gut, possibly demonstrating a conserved function in intestinal development 
. NHR-64 and NHR-69, which show the closest homology to vertebrate HNF-4, have no obvious phenotype in RNAi knockdown experiments, but NHR-64 is expressed in the neurons while NHR-69 is expressed mostly in the gut 
The E2F motif was found in the dataset that did not include the young adult time point. E2F proteins have been shown to be important for cell-cycle progression and development in mammals 
. Kirienko and Fay 
identified this motif as enriched in LIN-35 responsive genes, many of which involve cell-cycle regulated genes. E2F has also been shown to promote programmed cell death, which is important in the maturation of larvae 
and is required for embryonic asymmetry 
. The predicted targets of E2F are enriched for the GO-terms development and embryonic development.
The GATA-1 motif found in the larger data set belongs to the GATA transcription factor ELT-2 that is known to regulate most intestinal development in C. elegans 
. The GATA-1 motif was also observed in LIN-35 regulated genes that do not contain an E2F motif, possibly due to the role of LIN-35 in pharyngeal development 
. Other previously identified transcription factors identified in this analysis include EVI-1 and ADR-1. The EVI-1 homolog in C. elegans
, EGL-43, is necessary for the AC/VU cell fate specification 
, while ADR-1 is highly expressed in the developing nervous system and vulva 
MicroRNAs (miRNA) are known to play a large role in C. elegans
and vertebrate development 
. 31 motifs were found in the 3′ UTR, of which two correspond to previously identified miRNAs. While the functions of the miRNAs identified through FIRE are currently unknown, the miRNA field is just maturing and there is still much that is not known about the roles of various miRNAs. miR-43 displays a stage dependent expression pattern; it is most highly expressed during the egg stage and not expressed at the adult stage 
. miR-238 and 239 show the opposite expression pattern, in that they are expressed at a low level in the egg stage and at higher levels during subsequent stages 
. miR-238 and 239 targets show enrichment for neurotransmitter binding, signal transducer activity, and localization to the membrane. It is possible that these miRNAs are partially responsible for neuronal development or synaptic maintenance.
In addition to the miRNA binding sites, several other motifs were identified in the 3′ UTR. There are over 500 RNA-binding proteins that have been annotated in the C. elegans
genomes, most of whose targets and binding sites remain unknown, and many of which have developmental phenotypes when deleted 
. A complete list of GO-term enrichments for the putative targets associated with the predicted motifs can be found in Table S4
Of the motifs identified by FIRE, there are seven new 5′ motifs and two novel 3′UTR motifs found in sets of genes that are significantly enriched for terms relating to development, including: positive regulation of growth, development, embryonic development, development (sensu metazoa), organ development, reproduction, hermaphrodite genitalia formation, cuticle components, and signaling and ion channels (, , , S4
). Additional new motifs show functions that are probably not related to development, including nucleotide binding, ATP binding and protein modification (Table S4
). One example of a novel motif is [ACT]CAC[AT]C[AC][CT]A which is enriched in clusters 17 and 41. Like the clusters, the targets of these motifs are enriched for GO-terms such a synapse part and neurotransmitter activity.
Motifs identified by FIRE in the promoter region.
Motifs identified by FIRE in the 3′ UTR.
Motifs identified by FIRE using clusters without young adult data.
New GO annotations can be proposed for many genes
When compared to the yeast genome, the worm genome is severely underannotated with regard to gene ontology. Of the roughly 9,000 genes that are placed into significant clusters by the STEM algorithm, over 2,600 genes currently have no GO-annotations of any kind in the most current release of the ontology (4/20/08). Using clustering and motif analysis to divide genes into sets of putatively functionally related genes, and applying prior knowledge regarding the function of the known motifs, it becomes possible to propose annotations for previously unannotated genes. The motif analysis is useful because it can be used to divide clusters into smaller classes with more narrow functions. We propose general annotations for these 1,568 previously unannotated genes (Table S5
For example, cluster 7 is composed of genes responsible for reproduction and development, but also genes responsible for mismatch repair and other cell-cycle functions that require DNA binding. While both functions are closely related, they represent distinct cellular processes. By using motif analysis, we are able to separate the functions in genes with no known function.
We propose that those genes in cluster 7 with the motif [AT]CG[AG]A[AGT]TA[ACT] are involved in nucleotide binding, while those with the motifs [CGT]CG[AC]GA[AT][CG][AGT], .[AG]ATCGAT[AT], .[ACT]A[CT]GCGC and .CGA[AC]G[AC]A[ACT] are involved in processes that are related to reproduction (). It is possible to further narrow the putative functions of unannotated genes. Along with the GO-terms related to reproduction, genes with the motif .[ACT]A[CT]GCGC are also highly enriched for terms related to organ development (Table S4
). As cluster 12 as a whole is highly enriched for organ development, including hermaphrodite genitalia development, we propose that the genes in cluster 7 with this motif likely function in organ development. In addition, the motif .[AG]ATCGAT[AT] is the binding site for the CDP/Cut-like transcription factors. Ceh-44
, which is the C. elegans
homolog, is known to function in neuronal development and is enriched in larval neurons and thus the genes that possess this motif in their promoter regions are likely involved in neuronal development 
. Finally, cluster 7 is also enriched for GO-terms involved in meiosis and gamete generation, as are the genes with the motif [CGT]CG[AC]GA[AT][CG][AGT].
Best motifs for a subset of genes in cluster 7.
An additional 338 genes have a significant motif in their promoter region, but the genes with these motifs have no known coherent function.