Data integration is a major challenge of the project. Besides the differences between the model organisms themselves, databases provide gene expression data in different formats (flat files, SQL dumps, direct database access) and annotation has been done in different ways (screens, literature, curators).
So far we have integrated expression data for zebrafish (1), medaka (4) and mouse (5). The accompanying table gives an overview of the expression pattern annotations that have been integrated into 4DXpress. The best-annotated model species at the moment are Drosophila and zebrafish, with almost 6000 annotated genes each. Mouse follows with 3893 annotated genes; some annotations were done using a 3D virtual embryo (6).
Content of 4DXpress. Annotation status of gene expression patterns at present time
Expression data have also been gathered differently. For medaka and Drosophila, most of the annotation results from a screen: expression has been analysed at distinct time points, covering between 3 and 4 stages per gene on average (see table, stages per gene). Zebrafish expression patterns, in contrast, are additionally annotated from the literature by a team of database curators, and annotation is done for continuous developmental stages.
Anatomy ontologies are often very rich; however, only a limited fraction of the terms is actually used for expression annotation (see table, distinct annotations). Again, ZFIN uses a rich vocabulary with almost 700 distinct terms. The values for mouse and medaka need to be treated with care, as the ontologies used for annotation here are the cross product of anatomy and stage ontologies and therefore overestimate vocabulary richness.
Our database schema can store all information required by the MISFISHIE standard (minimum information specification for in situ hybridization and immunohistochemistry experiments) (7). This will allow us to efficiently add other model species as well as to develop a data exchange format to keep up to date with other resources.
One of the major goals of our project is to be able to compare gene expression patterns between the different model species. To do so, relationships need to be established between genes (orthology), between time windows (developmental stages) and, most challengingly, between anatomical structures (homology/analogy).
EnsEMBL compara (8) provides a reliable source of sequence homology relationships, computed using a tree-based approach. We have chosen to use this resource and update it regularly with new EnsEMBL releases. We assigned each gene to a cluster of orthologues using the EnsEMBL notation: one2one, one2many and many2many orthology relationships. In the web interface (described below), these clusters are visualized as a network, and homology relationships are used to sort the gene list retrieved from a query as well as to provide quick links from one gene to its orthologues in other species.
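As an illustration of the clustering step, the sketch below groups genes into clusters by treating every orthology relationship as an edge and taking connected components. This is one plausible reading of "cluster of orthologues", not necessarily the procedure used for 4DXpress; the function name and the gene identifiers are placeholders chosen for the example.

```python
from collections import defaultdict


def cluster_orthologues(pairs):
    """Group genes into clusters (connected components of the orthology graph).

    `pairs` is an iterable of (gene_a, gene_b, relation) tuples, where
    relation is one of "one2one", "one2many" or "many2many".
    """
    parent = {}

    def find(gene):
        # Union-find lookup with path halving.
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    for gene_a, gene_b, _relation in pairs:
        root_a, root_b = find(gene_a), find(gene_b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = defaultdict(set)
    for gene in parent:
        clusters[find(gene)].add(gene)
    return sorted(sorted(members) for members in clusters.values())


# Placeholder identifiers, prefixed with a species tag for readability.
example_pairs = [
    ("zf:pax6a", "mm:Pax6", "one2many"),
    ("zf:pax6b", "mm:Pax6", "one2many"),
    ("zf:shha", "mm:Shh", "one2one"),
]
clusters = cluster_orthologues(example_pairs)
```

With this representation, the quick links from a gene to its orthologues in other species amount to listing the other members of its cluster.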
Developmental stage mapping
It is very difficult to identify corresponding developmental stages in two species, even when comparing two closely related fish species like medaka and zebrafish. For instance, in medaka the head and brain develop faster, whereas the tail and somites develop more slowly than in zebrafish. Thus, for a given medaka stage, the matching zebrafish stage based on the number of somites (a very popular staging feature) would be earlier than the matching zebrafish stage based on head features.
However, there are key events in development that allow researchers to define a list of eight stages, described in all developmental biology textbooks and common to all bilaterian animals: zygote, cleavage, blastula, gastrula, neurula, organogenesis, juvenile and adult. By mapping each species-specific stage onto one of the bilaterian stages, stages can be linked between species and a combinatorial explosion is avoided. A new species only needs to be mapped to the common stages (figure, top right) and not against all stages of all other species (figure, top left).
Mapping of developmental stages was done via a list of stages common to all bilaterian animals.
Obviously temporal resolution is lost when mapping a list of 40 developmental stages onto a list of only eight common stages, but the eight stages seem to be the largest set shared by all bilaterian species and they represent the key events in the development of an organism. The original species-specific stage annotation is not replaced by the stage mapping terms to keep high temporal resolution. However, the stage mapping establishes temporal relationships that can be used for cross-species queries.
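The idea of linking species through the common stages can be sketched as follows. The stage assignments in `STAGE_MAP` are illustrative placeholders and heavily abbreviated, not the curated 4DXpress mapping, and the function names are hypothetical.

```python
# The eight stages common to all bilaterian animals, as listed in the text.
BILATERIAN_STAGES = [
    "zygote", "cleavage", "blastula", "gastrula",
    "neurula", "organogenesis", "juvenile", "adult",
]

# Illustrative species-specific stage -> common stage assignments (assumed).
STAGE_MAP = {
    "zebrafish": {"1-cell": "zygote", "shield": "gastrula", "prim-5": "organogenesis"},
    "medaka": {"stage 2": "zygote", "stage 15": "gastrula", "stage 25": "organogenesis"},
}


def equivalent_stages(species_a, stage, species_b):
    """Stages of species_b that share a common stage with `stage` of species_a."""
    common = STAGE_MAP[species_a][stage]
    return [s for s, c in STAGE_MAP[species_b].items() if c == common]


def mappings_needed(n_species):
    """Return (mappings via common stages, pairwise species-to-species mappings)."""
    return n_species, n_species * (n_species - 1) // 2


matches = equivalent_stages("zebrafish", "shield", "medaka")
via_common, pairwise = mappings_needed(5)
```

For five species, the common-stage approach needs five mappings, whereas direct species-to-species mapping would need ten; the gap widens with every species added. The species-specific stage names are kept alongside the common stage, so the original temporal resolution remains available.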
Anatomy mapping will be an ongoing process, just as the question of which structures can be defined as homologous is a subject of ongoing debate in the scientific community. We have not yet carried out a complete anatomy mapping, but we have set up the resources and tools for doing so. Evidence from different analyses will need to be integrated to approach this problem. One can use lexical, anatomical-structure and co-expression cues to establish relationships between the anatomical terms. The first two cues can be used by just comparing the anatomy ontologies available for the model species (9). To include co-expression, we are currently examining conserved network patterns in species-specific co-expression networks via orthology relationships. Users can already exploit lexical cues using the term-based expression search (described below).
The common anatomy reference ontology (CARO) is being developed to facilitate interoperability between existing anatomy ontologies for different species. It aims to provide a template for building new anatomy ontologies. We think CARO could serve as a template to build an anatomy ontology shared by all bilaterians. Similar to the stage mapping we then want to map species-specific anatomy terms onto this common ontology.
4D ArrayExpress data warehouse
Expression data acquired through in situ hybridization, antibody or transgenic expression can be complemented by microarray data. The former methods provide high-resolution data in both space and time, which microarray data cannot provide; microarray experiments, however, can quickly give a quantitative overview of the overall expression of all genes in a genome. Time series that provide insight into expression changes during development are especially useful. That is why we have set up a complementary project at ArrayExpress (10), which stores corresponding microarray data. The project is called the 4D ArrayExpress data warehouse (4DDW) and is accessible at http://www.ebi.ac.uk/microarray-as/4DDW_EMBL/. The 4DDW will be described in detail elsewhere.
So far we have established 4737 reciprocal links for mouse, Drosophila and zebrafish. When querying microarray data at the 4DDW users can quickly go to 4DXpress and vice versa. The close linkage of these two resources allows researchers for example to quickly examine the gene expression patterns of a list of genes that cluster together in a microarray experiment.
Expression patterns within a species can easily be compared when representing the expression annotation as a binary vector (1 for expressed, 0 for not expressed). Different methods to calculate the similarity between these vectors can be applied.
As a first similarity measure we have chosen the Jaccard coefficient, which is simple to calculate and was used in the first BDGP release (2) for the same purpose.
The Jaccard distance has been calculated between the expression vectors of gene pairs. The expression binary vector was compiled considering stage and anatomy. If a gene is expressed (has positive annotation) at a given stage in a given anatomical structure the vector value is set to true, otherwise to false.
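The Jaccard similarity coefficient is defined as the size of the intersection divided by the size of the union of the sample vectors A and B:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

The Jaccard distance is its complement, $1 - J(A, B)$.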
The Jaccard distance is meant to estimate how different two expression patterns are. However, this value depends on the extent and quality of the expression annotation. Thus, in cases where annotations are incomplete or have been done inconsistently, this measure might be misleading. Also, this method treats all anatomical structures equally: relations defined in the anatomy ontology are not taken into account. In future we will provide additional similarity measures, e.g. semantic similarity, that take these relations into account.
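A minimal sketch of the computation described above: each gene gets a binary vector with one entry per (stage, anatomical structure) combination, and the Jaccard distance is one minus the ratio of shared to total positive entries. The function names and the example annotations are assumptions made for illustration, as is the convention of returning similarity 0 when neither gene has any annotation.

```python
def expression_vector(annotations, slots):
    """Binary vector: True where the gene has a positive annotation."""
    return [slot in annotations for slot in slots]


def jaccard_similarity(vec_a, vec_b):
    """|A intersect B| / |A union B| over the True entries of two vectors."""
    both = sum(a and b for a, b in zip(vec_a, vec_b))
    either = sum(a or b for a, b in zip(vec_a, vec_b))
    # Assumed convention: two genes without any annotation get similarity 0.
    return both / either if either else 0.0


def jaccard_distance(vec_a, vec_b):
    return 1.0 - jaccard_similarity(vec_a, vec_b)


# Every (stage, structure) combination is one position in the vector.
slots = [
    ("gastrula", "notochord"),
    ("neurula", "notochord"),
    ("neurula", "neural tube"),
    ("organogenesis", "eye"),
]
gene_a = {("gastrula", "notochord"), ("neurula", "notochord"), ("neurula", "neural tube")}
gene_b = {("neurula", "notochord"), ("neurula", "neural tube"), ("organogenesis", "eye")}

vec_a = expression_vector(gene_a, slots)
vec_b = expression_vector(gene_b, slots)
distance = jaccard_distance(vec_a, vec_b)  # 2 shared of 4 annotated slots -> 0.5
```

Note that positions where both genes are unannotated do not contribute, which is why missing annotations can distort the value.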
Still, the Jaccard distance provides a quick and easy way of identifying similarly annotated genes. The values are stored in the database and help users to find genes within a species with similar expression patterns. This measure can also be used to cluster genes with similar gene expression pattern annotations, as shown for Drosophila. We use these similarity relationships to generate co-expression networks and plan to search for conserved network patterns across species using orthology relationships.
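One simple way to derive a co-expression network from pairwise distances is to connect two genes whenever their Jaccard distance falls below a cutoff. The sketch below assumes this thresholding approach; the cutoff value, function names and gene labels are hypothetical, and the text does not specify how the 4DXpress networks are actually built.

```python
from itertools import combinations


def jaccard_distance(set_a, set_b):
    """Jaccard distance between two sets of (stage, structure) annotations."""
    union = set_a | set_b
    if not union:
        return 1.0  # assumed convention for two unannotated genes
    return 1.0 - len(set_a & set_b) / len(union)


def coexpression_edges(annotations, max_distance=0.5):
    """Return (gene, gene, distance) edges for pairs within the cutoff."""
    edges = []
    for gene_1, gene_2 in combinations(sorted(annotations), 2):
        distance = jaccard_distance(annotations[gene_1], annotations[gene_2])
        if distance <= max_distance:
            edges.append((gene_1, gene_2, distance))
    return edges


annotations = {
    "geneA": {("gastrula", "notochord"), ("neurula", "notochord")},
    "geneB": {("gastrula", "notochord"), ("neurula", "notochord"), ("neurula", "somite")},
    "geneC": {("organogenesis", "eye")},
}
edges = coexpression_edges(annotations)
```

In this example only geneA and geneB are linked, since they share two of three annotated slots; geneC has no overlap with either. Networks built this way per species could then be compared across species by following orthology relationships between their nodes.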