Overview of the high-throughput in situ hybridization procedure
The starting material for the production of hybridization probes was the set of cDNA clones that comprise the
Drosophila Gene Collection [
14,
15,
16]. The cDNAs were amplified in 96-well PCR plates using a vector-specific primer set that introduces a promoter for the production of a digoxigenin-labeled antisense RNA probe by
in vitro transcription. PCR products were sized to confirm the identity of each clone and that the PCR reaction was successful (Figure ). The strength of each probe was determined by a dot-blot color reaction (Figure ). These data were entered into a relational database and later used as experimental controls when assessing the outcome of each hybridization experiment.
RNA probes were hybridized to fixed
Drosophila embryos [
19] in 96-well plates (see Materials and methods). Three genes (
engrailed, hunchback, brinker) with well-described expression patterns were included in each 96-well plate and used to monitor hybridization efficiency. After hybridization, each plate was examined to determine the morphology of the embryos, quality of the staining and proportion of wells that showed staining (Figure ). A plate containing embryos of acceptable morphology, relatively free of staining artifacts and with more than 50% of the wells stained was considered successful and passed on to the image-acquisition stage.
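The plate-level acceptance rule described above can be written as a small predicate. This is only a sketch: the record type `PlateInspection` and its field names are hypothetical, and the two boolean flags stand in for the visual judgments (morphology, staining artifacts) that were made by a person at the microscope.

```python
from dataclasses import dataclass


@dataclass
class PlateInspection:
    """Hypothetical record of one 96-well plate after hybridization."""
    wells_total: int
    wells_stained: int
    morphology_ok: bool   # embryos of acceptable morphology (human judgment)
    artifacts_ok: bool    # relatively free of staining artifacts (human judgment)


def plate_passes(plate: PlateInspection) -> bool:
    """Apply the stated rule: good morphology, few artifacts, >50% wells stained."""
    stained_fraction = plate.wells_stained / plate.wells_total
    return plate.morphology_ok and plate.artifacts_ok and stained_fraction > 0.5


if __name__ == "__main__":
    print(plate_passes(PlateInspection(96, 60, True, True)))
    print(plate_passes(PlateInspection(96, 48, True, True)))
```

Note that the threshold is strict: a plate with exactly half of its wells stained does not pass under this reading of "more than 50%".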
Embryos from successful plates were mounted onto microscope slides. Low-magnification digital images of a group of embryos were taken (Figure ) to provide a permanent record of the hybridization in each well. Low-resolution imaging was insufficient to document highly restricted expression patterns or to identify small subsets of cells. For that purpose, each slide was examined under higher magnification using a Zeiss Axiophot optical microscope. At this stage, a human annotator carefully examined the entire slide, taking a large number of high-resolution digital photographs that document that gene's expression pattern (Figure ). All images were submitted to the relational database using a web-based annotation tool. We determined the success of each hybridization experiment by taking into account the results of the agarose gel analysis of PCR products, dot-blot analysis of probes, microarray data (described below), available information from public databases, and the quality of the captured images. Each experiment had two possible outcomes: either the observed expression pattern or the absence thereof was consistent with the available data, or there was a discrepancy that indicated a failure at some point, in which case the experiment needs to be repeated. About 13% of the in situ experiments failed, as a result of either the absence of a PCR product (9.2%) or a poor probe-labeling reaction (3.4%), which resulted in no detectable staining. Additionally, probes that generated expression patterns inconsistent with previously published data (0.3%) or the microarray expression profile (7%) were rejected as being possibly mislabeled or cross-contaminated. Overall, we obtained useful expression data for 2,179 out of 2,721 (80%) of the genes whose transcript distribution we analyzed.
Documenting expression patterns by digital photography
We captured high-resolution photographs of the 1,388 genes (63.7% of the 2,179 successfully assayed genes) that exhibited some level of tissue-specific gene expression. The captured images were ordered according to the developmental stage of the embryos in order to visualize the change of the expression pattern over time. Embryogenesis is traditionally divided into a series of consecutive stages distinguished by morphological markers [
20]. The duration of developmental stages ranges from 15 minutes to more than 2 hours; therefore the stages of development were differentially represented in our embryo collections (see Materials and methods). Some consecutive stages, although morphologically distinguishable, differ very little in terms of changes in gene expression, whereas other stage transitions, such as the onset of zygotic transcription or organogenesis, are accompanied by massive changes in gene expression. We divided the first 16 stages of embryogenesis into six convenient stage ranges (stages 1-3, 4-6, 7-8, 9-10, 11-12 and 13-16). Each captured image is assigned to a stage range, and for each stage range a number of images are taken so that all stages within the range are represented. The groups of images assigned to a stage range are arranged in the web-based annotation tool from left to right so that one can follow the pattern through development (Figure ).
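The assignment of images to the six stage ranges amounts to a simple binning step. The sketch below is illustrative: the stage boundaries are those given in the text, whereas the function names and the `(image_id, stage)` input format are assumptions made for the example.

```python
# Stage ranges as listed in the text (inclusive bounds).
STAGE_RANGES = [(1, 3), (4, 6), (7, 8), (9, 10), (11, 12), (13, 16)]


def stage_range_for(stage: int) -> str:
    """Return the label of the stage range that contains an embryonic stage."""
    for low, high in STAGE_RANGES:
        if low <= stage <= high:
            return f"{low}-{high}"
    raise ValueError(f"stage {stage} is outside the annotated stages 1-16")


def group_images(images):
    """Group (image_id, stage) pairs by stage range, ordered by stage.

    Ordering within a group mimics the left-to-right arrangement that lets
    a viewer follow the pattern through development.
    """
    groups = {f"{low}-{high}": [] for low, high in STAGE_RANGES}
    for image_id, stage in sorted(images, key=lambda item: item[1]):
        groups[stage_range_for(stage)].append(image_id)
    return groups


if __name__ == "__main__":
    # Hypothetical image identifiers.
    print(group_images([("img_b", 14), ("img_a", 2), ("img_c", 13)]))
```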
We took, on average, 16 individual images for each gene; however, the number of images per gene varies from 1 to 80. This variability reflects our strategy to document highly dynamic, complex, novel and otherwise notable patterns extensively, while progressively lowering the number of images documenting common or simple expression patterns. The number and type of images collected for each probe were chosen so as to allow an embryologist to reconstruct the expression pattern as if they were examining the stained embryos under a microscope.
Although the in situ hybridization analysis is performed on fixed tissues, the ability to take many snapshots of developmental processes and to order them allows us to reconstruct dynamic developmental events. For example, we can visualize the progressive segmental proliferation of the fat body (Figure ) or follow the dispersal of blood-cell precursors throughout the embryo (Figure ).
Many genes are either not expressed at all during embryogenesis or their expression is not tissue specific. Several canonical examples of these staining patterns were captured, and then only textual annotation and a low-magnification image were used to document additional occurrences. A total of 791 genes (36.3% of the 2,179 genes successfully assayed) were not documented by high-resolution images. These were assigned to one of the following four classes: 362 (16.6%) do not appear to be expressed during embryogenesis; 21 (1%) are not maternally contributed but are ubiquitously expressed in the developing embryo; 317 (14.5%) are exclusively maternally expressed; and 91 (4.2%) show both maternal and ubiquitous zygotic expression.
Microarray data as independent measurements of RNA expression patterns
We used microarray time-course expression profiles as independent measurements to ascertain the accuracy of the captured expression patterns. For the microarray measurements to be truly independent, it was important that no single clone contamination or misidentification event be able to affect the integrity of both the microarray and
in situ hybridization result. Existing microarray datasets, such as that reported in [
21], were generated using spotted cDNA arrays [
13] derived from the same DGC clone set [
14,
15,
16] that we used for the
in situ probes, and were therefore unsuitable as independent controls. For this reason, we chose to generate a new dataset using Affymetrix GeneChip technology [
3], where the oligonucleotide probes are designed directly from the genome sequence and do not depend on the cDNA clone identity. For the microarray analysis, we divided embryogenesis into 12 1-hour time windows, collected three independent embryo samples for each time window, and hybridized total RNA from each sample to the GeneChip
Drosophila Genome Array. As each embryo sample contained a distribution of different ages, we examined the distribution of morphological stage-specific markers in each sample to correlate the time-course windows with the nonlinear scale of embryonic stages. As described above, the image data are organized into groups of images associated with a range of embryonic stages. Both the image and the microarray datasets are thus linked by a common time scale of developmental stages. This facilitates the direct comparison of the
in situ staining patterns with microarray expression indexes. Figure shows the comparison of array and image data for a highly unusual gene-expression pattern. In the absence of any additional information on the cDNA clone LD43816 (
CG4702), the contrast between the heavy staining seen in late embryos and the earlier highly restricted pattern might suggest the possibility of probe contamination. However, when stage-specific images are compared with the microarray profile, the expression profile is consistent with the
in situ hybridization data. Although high-throughput
in situ hybridization is subject to potential sources of error, such as cross-contamination during PCR amplification or probe synthesis, using microarray expression data in parallel helped us avoid erroneous pattern assignment.
For a number of reasons, correlating staining patterns and microarray values is not straightforward. First, the intensity of
in situ hybridization staining is dependent on the strength of the probe and length of the color reaction. Second, it is not possible to distinguish weak ubiquitous staining from strong staining in a small subset of cells using whole-embryo microarray data. For microarray data, it is broadly accepted that relative comparisons of independent measurements for the same gene are reliable, whereas absolute intensities across different genes are not, especially for low expression values [
12]. For all these reasons, the most useful factor in correlating microarray and image data is the relative fluctuation of signal intensity over the course of development. Figure shows 14 examples of distinct expression profiles that exhibit strong correlations between microarray and image data. The expression patterns of all these genes (except CG8782) have been previously described, providing independent confirmation that each profile is correct. Clear-cut correlations between microarray and
in situ hybridization data as seen in Figure are possible when the expression pattern exhibits changes over time. For monotonically expressed genes, the correlation is rather subjective and relies on absolute intensities of both microarray and
in situ hybridization signals. In 7% of the experiments, we rejected the observed expression pattern because of an obvious mismatch with the microarray expression profile; such cases account for approximately one third of the genes whose analysis will need to be repeated.
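The idea of comparing relative fluctuations rather than absolute intensities can be illustrated with a correlation coefficient. The sketch below is not the procedure used in this study, where the comparison was made by a human annotator. It assumes that both datasets have already been reduced to one value per stage range, and the staining scores are hypothetical numbers chosen for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length profiles.

    Returns None when either profile is flat, because a profile with no
    fluctuation over time carries no relative information to correlate.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical values, one per stage range (1-3, 4-6, ..., 13-16).
    array_profile = [50.0, 400.0, 900.0, 300.0, 80.0, 60.0]
    staining_score = [0, 2, 3, 1, 0, 0]
    print(pearson(array_profile, staining_score))
```

Because the coefficient is invariant to scale and offset, it is insensitive to probe strength or the length of the color reaction, but it cannot help with genes whose expression is constant over time.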
Large-scale production of RNA in situ data is occasionally prone to false-negative results due to failures in probe production and hybridization. Microarray data are useful to identify experiments where the RNA in situ hybridization failed but the given gene is highly expressed. High absolute microarray-derived gene-expression values, coupled with either a continuous profile across the time course or distinct on/off periods during the time course, are a reliable indication that a gene is expressed. Low expression values, and a lack of consistency among replicate experiments, may indicate that a gene is not transcribed during embryogenesis. However, if a given gene is active in only a very small subset of cells, the microarray results from a whole-animal experiment may not be sensitive enough to detect its expression. Therefore, even with the microarray data at hand, it is not possible to completely avoid false negatives for low-abundance transcripts.
Textual annotation of gene-expression patterns and assembly of a public database
To provide searching capability beyond queries for a specific gene, we rigorously annotated the gene-expression profiles using a controlled vocabulary. We used human annotation, rather than automated approaches based on pattern-recognition algorithms, because of the overwhelming complexity of annotation. Variation in morphology and incomplete knowledge of the shape and position of various embryonic structures make computational approaches impracticable at present. Moreover, a human annotator not only takes into account the image data, but also integrates other information such as the microarray profile and previously published data into the final assessment of the expression pattern. In our project, a single person carried out the initial annotation, resulting in a highly consistent dataset.
Annotation of gene-expression patterns that change dynamically over time poses a significant challenge. There is a need to have a specific name not only for the final developed embryonic structures but also for all the developmental intermediates that precede them. Every terminally differentiated structure of the embryo descends from a group of cells within the cellular blastoderm epithelium [
17]. We used this embryological concept to define a set of embryonic structure names that depict a 'path' describing the development of each organ. Four basic categories of developmental structures, called anlage in statu nascendi, anlage, primordium and organ, are distinguished.
At the end of embryogenesis, organs can be distinguished by their unique morphology and function (for examples, see Figure ). Traditionally, two types of developmental intermediates that precede the terminally differentiated organ have been defined: anlage and primordium (for examples, see Figure ). An anlage is defined as a morphologically indistinct group of contiguous cells, established by lineage tracing, that gives rise to an individual organ. Anlagen for most organs can be distinguished at the late cellular blastoderm or gastrula stage (stages 5-9). A primordium can be recognized on the basis of its distinct morphology. A primordium will give rise to one or more differentiated organs. We have included the germ layers in the primordium category. Primordia develop from anlagen but an anlage - for example, a group of cells defined by gene expression that will give rise to a subset of an organ - can also be part of a primordium. Individual names are connected by relationships that define the way the respective tissues develop from one another or encompass one another.
Many genes whose expression is ultimately restricted to, and required for the determination of, a specific anlage initially appear in a larger area. This dynamic expression may reflect the working of an underlying molecular network of activating and inhibitory factors which only gradually succeed in directing expression of a given gene to a specific subset of cells which thereby become defined as a distinct anlage. We propose the term 'anlage in statu nascendi' (in statu nascendi can be loosely translated as 'in the process of being formed') for the larger domain from which a specific anlage originates. Anlagen in statu nascendi can only be visualized by gene-expression analysis. They typically appear at the cellular blastoderm stage and resolve into specific anlagen towards the beginning of gastrulation (examples are given in Figure ).
Using this naming scheme we are able to describe the development of embryonic structures starting from anlage in statu nascendi at the cellular blastoderm stage through a series of developmental intermediates - anlage and primordia - to a differentiated embryonic structure. For example, the transcription factor single minded is expressed in the glial cells of the mature embryo [
22]. The origin of this expression pattern can be traced from the mesectoderm anlage in statu nascendi, to the mesectoderm anlage, to the mesectoderm (or midline primordium) and, finally, to the mature midline glial cells (Figure ).
In the annotation tool (Figure ), only those annotation terms that describe structures present within a specific stage range are displayed beneath the images of embryos from that stage. Annotation terms are organized into a hierarchy according to the developmental relationship between them. Using the annotation tool, one can follow and describe the development of each structure through its intermediates by observing the development of the staining pattern and selecting the appropriate annotation terms. Integration of image data with the predefined developmental hierarchy is one of the prime goals of our annotation effort.
One advantage of storing the data in a database is the ability to query the data and compare results in a rigorous manner. However, this is only possible if the data that are entered into the database are themselves rigorously controlled. Comparisons of biological data are complicated by the lack of standards in both reagents and nomenclature. We provide a standardized set of
in situ expression images prepared using the same hybridization probes, laboratory protocols, and descriptive nomenclature. For the nomenclature we needed an agreed vocabulary of terms to describe the different anatomical features of the
Drosophila embryo and the different stages of embryonic development. This was provided to us by the controlled vocabularies of anatomy and development that have been constructed by FlyBase [
23] over the past few years. A further advantage in using these vocabularies is that our data will be recorded in a way that is wholly consistent with that used by the FlyBase curators, who record gene-expression data from the scientific literature.
The FlyBase controlled vocabularies are organized as a directed acyclic graph (DAG). In a DAG the terms are the nodes in the graph and the relationships between terms are the arcs of the graph. A DAG has two characteristics that make it extremely useful for describing vocabularies. First, the graph is directed, which means that the reciprocal roles of two terms in a relationship are unequal. Thus, the 'parent' term may be less specific and the 'child' term more detailed. Second, unlike strict hierarchies, a child term may have more than one parent term. This data structure is the same as that used by the Gene Ontology (GO) Consortium for terms that are used to annotate gene products [
24].
The FlyBase controlled vocabulary uses three classes of relationship between parent and child terms. The first of these is 'is an instance of'; for example, the 'anterior spiracle' is an instance of its parent term 'spiracle'. The second is 'part of'; for example, the 'gastric caecum' is part of the 'foregut'. The third is 'develops from'; for example, a 'glial cell' develops from a 'glioblast'.
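A vocabulary of this kind can be modeled as a DAG whose arcs carry one of the three relationship types. The following is a minimal sketch and not the FlyBase implementation: the class, the relationship identifiers and the cycle check are assumptions, and the second parent given to 'glial cell' is added purely to show that a child may have more than one parent.

```python
from collections import defaultdict

# Identifiers chosen here for the three relationship classes named in the text.
RELATIONSHIPS = {"instance_of", "part_of", "develops_from"}


class Vocabulary:
    """Terms are nodes; typed child -> parent links are the arcs."""

    def __init__(self):
        self.parents = defaultdict(list)  # child -> [(relationship, parent)]

    def ancestors(self, term):
        """All terms reachable by following parent links upward."""
        seen = set()
        stack = [term]
        while stack:
            for _, parent in self.parents.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def add(self, child, relationship, parent):
        if relationship not in RELATIONSHIPS:
            raise ValueError(f"unknown relationship: {relationship}")
        # Keep the graph acyclic: the child must not already lie above the parent.
        if child == parent or child in self.ancestors(parent):
            raise ValueError("edge would create a cycle")
        self.parents[child].append((relationship, parent))


def example_vocabulary():
    vocab = Vocabulary()
    vocab.add("anterior spiracle", "instance_of", "spiracle")
    vocab.add("gastric caecum", "part_of", "foregut")
    vocab.add("glial cell", "develops_from", "glioblast")
    # Illustrative second parent (an assumption, not taken from the text).
    vocab.add("glial cell", "instance_of", "cell")
    return vocab


if __name__ == "__main__":
    print(sorted(example_vocabulary().ancestors("glial cell")))
```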
The major change that had to be made to the FlyBase controlled vocabulary to support this project was to name the developmental intermediates that arise during embryogenesis, that is, anlagen, primordia, and anlagen in statu nascendi. Our annotation uses a subset of 300 or so of the 5,800 terms in the FlyBase controlled vocabulary, many of which only apply to later stages of development.
We modified the GO database schema [
25] for the storage and searching of our gene-expression data. For the management of the terms and their relationships, the core of the GO database schema remains essentially unchanged. However, the database was extended to support the annotation process itself; that is, assigning terms to describe the expression patterns. Additional tables were also added to describe when and where a given mRNA is expressed, and the images that constitute the evidence for these observations.
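To make the kind of extension described above concrete, the sketch below builds a toy database with a term core plus tables for annotations and image evidence. All table and column names are hypothetical and only loosely modeled on the description; they are not the actual schema. The single example row (gene, stage range, term and file name) is likewise illustrative.

```python
import sqlite3

SCHEMA = """
CREATE TABLE term (
    id   INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL
);
CREATE TABLE term2term (              -- typed parent/child arcs of the DAG
    relationship TEXT NOT NULL,
    parent_id    INTEGER NOT NULL REFERENCES term(id),
    child_id     INTEGER NOT NULL REFERENCES term(id)
);
CREATE TABLE annotation (             -- when and where an mRNA is expressed
    id          INTEGER PRIMARY KEY,
    gene        TEXT NOT NULL,
    stage_range TEXT NOT NULL,
    term_id     INTEGER NOT NULL REFERENCES term(id)
);
CREATE TABLE image (                  -- images serving as evidence
    id            INTEGER PRIMARY KEY,
    annotation_id INTEGER NOT NULL REFERENCES annotation(id),
    file_name     TEXT NOT NULL
);
"""


def build_demo_db():
    """Create an in-memory database holding one illustrative annotation."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.execute("INSERT INTO term (id, name) VALUES (1, 'mesectoderm anlage')")
    db.execute(
        "INSERT INTO annotation (id, gene, stage_range, term_id) "
        "VALUES (1, 'sim', '4-6', 1)"
    )
    db.execute(
        "INSERT INTO image (annotation_id, file_name) "
        "VALUES (1, 'sim_stage4-6_01.jpg')"
    )
    return db


def evidence_for(db, gene):
    """List (stage range, term, image file) rows recorded for one gene."""
    cursor = db.execute(
        "SELECT a.stage_range, t.name, i.file_name "
        "FROM annotation a "
        "JOIN term t ON t.id = a.term_id "
        "JOIN image i ON i.annotation_id = a.id "
        "WHERE a.gene = ? ORDER BY a.stage_range",
        (gene,),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    print(evidence_for(build_demo_db(), "sim"))
```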
We implemented two publicly available tools to search and mine the gene-expression dataset. The advanced search page is a modified version of the Gadfly search interface [
26,
27]. It can be utilized to search for the expression pattern of an individual gene of interest, retrieve a list of genes expressed in a given embryonic structure, or set of structures, or all genes expressed at a certain stage of embryonic development. Sets of genes and their corresponding expression profiles can be grouped on the basis of cytological position in the genome, functional GO assignments, or the presence of protein domains in the gene sequence. The results returned by all types of searches can be formatted to display controlled vocabulary annotations exclusively or also show images and microarray profiles of sets of genes side by side (Figure ). Alternatively, one can explore the dataset by browsing through the controlled vocabulary using a modified version of the Amigo Gene Ontology browser, which we call ImaGO [
28]. Once a term from the vocabulary is selected, ImaGO will return a page that lists all genes expressed in a given structure and also all genes expressed in all structures that descend from it.
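The browsing behavior described for ImaGO, returning genes annotated to a term together with genes annotated to any structure that descends from it, corresponds to a graph traversal followed by a lookup. The sketch below is an assumption about how such a query could be expressed, not the tool's code. The term chain follows the single minded example given earlier, and `hypothetical_gene_1` is an invented entry.

```python
def descendants(children, term):
    """All terms reachable from `term` by following child links."""
    seen = set()
    stack = [term]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def genes_in_structure(children, expressed_in, term):
    """Genes annotated to `term` or to any structure descending from it."""
    structures = {term} | descendants(children, term)
    return sorted(
        gene for gene, terms in expressed_in.items() if terms & structures
    )


if __name__ == "__main__":
    # parent -> children, following the developmental path in the text.
    children = {
        "mesectoderm anlage in statu nascendi": ["mesectoderm anlage"],
        "mesectoderm anlage": ["mesectoderm"],
        "mesectoderm": ["midline glial cell"],
    }
    expressed_in = {
        "sim": {"midline glial cell"},
        "hypothetical_gene_1": {"fat body"},
    }
    print(genes_in_structure(children, expressed_in, "mesectoderm anlage"))
```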
Systematic analysis of the in situ data
The embryonic expression patterns of 80% of the 1,388 genes that display restricted expression, and were therefore annotated in our database, are described by a unique set of annotation terms. This observation illustrates the tremendous diversity of gene-expression patterns during embryogenesis, which range from expression in a single embryonic structure to expression in 36 distinct tissues. Although expression patterns of genes are rarely identical, there is a noticeable similarity among patterns of expression of many genes. Therefore we sought to order these genes on the basis of the similarity of their expression patterns. Furthermore, gene-expression data can be used to quantify the relatedness of the various embryonic structures in terms of similarity in gene expression between them [
29]. We carried out hierarchical clustering of genes and tissues using a method based on binary similarity metrics ([
30]; for details see Materials and methods). Figure shows the result of clustering the 99 differentiated embryonic structures that expressed at least two of the 1,388 genes. A black dash at the intersection of a row and a column of the matrix occurs whenever the gene corresponding to that row is expressed in the embryonic structure corresponding to that column.
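Clustering on binary annotation data can be sketched with a set-based distance and a naive agglomerative procedure. The specific choices below, Jaccard distance and average linkage, are assumptions made for illustration; the metric and algorithm actually used are those given in Materials and methods. Gene names are hypothetical.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b| for two sets of annotated structures."""
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)


def average_linkage(expression):
    """Naive agglomerative clustering of genes by annotation similarity.

    `expression` maps gene -> set of structures. Returns the list of merges
    as (cluster_1, cluster_2, distance), where clusters are tuples of genes.
    """
    def distance(c1, c2):
        pairs = [(x, y) for x in c1 for y in c2]
        total = sum(jaccard_distance(expression[x], expression[y]) for x, y in pairs)
        return total / len(pairs)

    clusters = [(gene,) for gene in sorted(expression)]
    merges = []
    while len(clusters) > 1:
        # Merge the closest pair of clusters at each step.
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: distance(*pair))
        merges.append((c1, c2, distance(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


if __name__ == "__main__":
    annotations = {
        "geneA": {"fat body"},
        "geneB": {"fat body"},
        "geneC": {"trachea", "epidermis"},
        "geneD": {"trachea"},
    }
    for merge in average_linkage(annotations):
        print(merge)
```

In this toy input the two fat-body genes merge first at distance 0, the two tracheal genes merge next, and the two groups join last at the maximum distance, mirroring the separation of unrelated tissues in the matrix. Transposing the input (structure -> set of genes) clusters the embryonic structures instead.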
The clustering organizes the embryonic structures in such a way that related structures are close together whereas unrelated ones are widely separated. For example, components of the nervous system (Figure , green shading) that are ectoderm derivatives, cluster together and away from mesodermal derivatives, such as muscles (Figure , cyan shading). In other words, very few of the genes in our dataset are expressed in both muscles and the nervous system, reflecting the physiological separation and different developmental origins of these two tissues.
By analogy, the genes are organized by the hierarchical clustering so that those with the most similar expression patterns are close together and those with highly divergent patterns are widely separated. The distribution of clusters in the matrix can be used to identify genes that are expressed in a tissue or set of tissues. For example, the red rectangle in Figure highlights a cluster of genes that are expressed exclusively in the embryonic fat body. Genes expressed in the tracheal system are split into at least two clusters; genes expressed exclusively in the trachea form a cluster in the middle of the matrix (Figure , blue rectangle), and a second cluster includes genes expressed in the trachea and a variety of epidermal structures (Figure , magenta rectangle). Interestingly, no genes are expressed in both trachea and the CNS (note the gap in CNS clusters at the level of the tracheal cluster). Genes expressed in the nervous system form the most prominent clusters. The annotation vocabulary subdivides the nervous system into specific subsets based on tissue types (neurons, glia) and anatomical position (brain, ventral nerve cord, lateral cord, sensory nervous system). A cluster of genes that are expressed specifically in the components of the peripheral nervous system and absent from the CNS can be identified (Figure , yellow rectangle). An increase in the depth of annotation will be required to subdivide the large clusters that correspond to complex organ systems.
Figure represents one possible outcome of the clustering analysis of the annotation dataset. Filtering of the genes and anatomical structures, the type of clustering algorithm and the distance metric are variables that need to be optimized to address specific questions about the variations in patterns of gene expression. We used a relatively simple metric to define similarity among the annotation data. In the future it will be interesting to explore more complex similarity metrics that incorporate the distance between annotation terms within the ontology. Clustering of the annotation data and other data-mining approaches will establish sets of co-regulated genes that will provide a starting point for investigating
cis-regulatory sequences that may elucidate novel regulatory relationships in development. Interactive web pages with a complete clustering matrix can be accessed at [
31].
Systematic RNA
in situ hybridization is an alternative to mutagenesis screens [
32] to uncover genes involved in embryonic patterning. Figure illustrates one case where conventional mutagenesis failed to identify genes with highly specific expression patterns at the cellular blastoderm stage, most probably because of genetic redundancy. At map position 66E4, three adjacent and highly homologous genes for Brachyury/T-box-containing transcription factors are expressed with strikingly similar expression patterns. Thus disruption of any one of these genes might not be sufficient to produce an embryonic phenotype. Conversely, in many cases the expression patterns of duplicated genes have diverged and are largely non-overlapping. Expression data can provide a guide as to which multiple gene knockouts are likely to yield specific phenotypes.
A systematic screening approach also reduces bias in the selection of the types of genes chosen for study. Figure shows five examples of metabolic genes that, contrary to naive expectation, are expressed at cellular blastoderm in domains suggestive of roles in embryonic patterning. A similar observation was recently reported for the expression of genes isolated specifically from blastoderm-specific cDNA libraries [
7]. Tight regulation of expression of these genes could be explained by a preferential requirement for certain metabolic pathways in specialized embryonic tissues. That argument may not be plausible for genes expressed within limited regions of the cellular blastoderm embryo. It is possible that these gene products are not translated, are otherwise inactive in the tissues where the transcripts appear abundant, or that their restricted expression is simply the result of their proximity to a transcriptional control element of a neighboring gene. A more interesting possibility is that this expression may reflect yet unknown functions of these genes.
A major advantage of RNA in situ hybridization is that this method has sufficient spatial resolution to uncover subcellular localization of mRNAs. Figure shows an apically localized mRNA in the columnar epithelium of the developing hindgut surrounding the migrating pole cells (Figure ). This staining is strikingly complementary to that of a basally localized mRNA from a different gene that is expressed in the same cells (Figure ). The transcript of yet another gene, whose expression is shown at high magnification at the pre-cellular blastoderm stage, appears to be localized to a sub-compartment of the nucleus (Figure ). We also found novel mRNAs localized asymmetrically along the anterior-posterior axis of the early pre-blastoderm embryo (data not shown). Overall, we find that about 1% of genes exhibit easily discernible subcellular localization, and thus our dataset can also be useful in identifying aspects of RNA localization and trafficking.