Over 3000 genomes have been sequenced, annotated and deposited in the archives maintained by the International Nucleotide Sequence Database Collaboration. While these genomes are mostly bacterial, the number of eukaryotic species sequenced already exceeds 200, and increasing quantities of data are available for most of these species, with resequencing of large populations of individuals (to identify polymorphism) now common in all parts of the taxonomy. These developments have created a need for sophisticated systems [e.g. Ensembl, the Generic Model Organism Database toolset (http://gmod.org), the UCSC genome browser (6)] for the management of genome-scale data; but the growth in data volumes (the nucleotide archives are presently doubling in size every 9 months) naturally raises questions of future scalability. In the case of annotated genomes, such concerns are focused less on absolute data volume (as an annotated genome sequence is itself a compacted, interpreted form of the raw data present in the archives) than on the human resources needed to integrate and interpret the raw data correctly. Pipelines for constructing Ensembl databases are highly automated, but owing to differences in the biology and available data for each species, the production of high-quality annotation still depends on a measure of manual intervention (for parameterization, quality control and biological validation). Expert groups focused on particular organisms exist in many domains, but sometimes lack the resources to develop infrastructure to support the increasing variety of available data types. Moreover, cross-species analysis may be difficult when data are distributed across different sites, each with its own technical implementation.
In Ensembl Genomes, we address these problems by collaborating with community-based groups maintaining specialized resources, to assist in the process of genome annotation and to provide a permanent integrative portal for data from many species. The form of these collaborations depends on the mandate and expertise of the collaborating group, and the fit of their data to the Ensembl toolset. In domains where a well-established resource (such as a model organism database) is already maintaining primary annotation, this information is integrated into the Ensembl data structure, and supplemented with additional high-value data sets where available. In many cases, Ensembl Genomes actively participates in these community-based initiatives, directly contributing to the production of reference annotation. In these partnerships, each project's own portal remains the primary point of access for researchers working within the relevant domain, and will often display a wide range of data types; while Ensembl Genomes places the project's genome-centric data in its broader context. We believe that this model for collaboration—involving specialized resources, which take custodianship of collections of closely related genomes and which are grounded in their own communities; the re-use of technology and expertise; and the provision of a pan-taxonomic integrating portal—is a viable response to increasing data growth.
In the plant domain, we work with Gramene (http://www.gramene.org), a peer resource that also uses Ensembl technology, to maintain a common resource for plant genomics in Europe and the USA. We have also recently established a new collaborative network, transPLANT, to develop a Europe-wide infrastructure for genome-scale data in plants. We are part of the VectorBase (http://www.vectorbase.org) consortium (8), an NIH NIAID resource for the genomes of invertebrate vectors of human pathogens. We have recently joined WormBase (http://www.wormbase.org), which maintains resources for nematode genomes, especially the model species Caenorhabditis elegans. Two recent collaborations have established PomBase (http://www.pombase.org), a new model organism database focused on the fission yeast Schizosaccharomyces pombe; and PhytoPath (http://www.phytopathdb.org), a resource for plant pathogens, with a focus on fungi and oomycetes. Both projects use Ensembl software for data visualization. Additionally, we have been working with the Central Aspergillus Data Repository (CADRE) on genomes of the genus Aspergillus; and with the Broad Institute and the U.S. Agricultural Research Service on the genome of the wheat stem rust pathogen, Puccinia graminis. We are always open to new collaborations to extend the range of species covered, and to analyse and integrate the output of specific scientific projects.