The MIPS yeast genome database is developed and maintained by a group of European databases and yeast laboratories that form a decentralized network of expertise in order to provide detailed information on protein-coding sequences and other genetic elements. The database represents the best investigated eukaryote and is organized by complementary classifiers that allow the functional relations between genes and their corresponding proteins to be interpreted. For instance, the Functional Catalogue (FunCat), which provides a systematic classification of protein function, is used intensively to project functional data such as expression profiles onto known or probable functional units. Manual FunCat classifications are available not only for yeast, but also for other MIPS-curated genomes such as Arabidopsis thaliana and the human genome (1). In addition, the proteins represented in the PEDANT genomes database have been assigned to FunCat classes; the complete list of assignments is accessible (see Table ). Within each functional class, the sequences have been clustered into disjoint homology-based subsets.
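As an illustration of projecting expression data onto functional units, the following minimal sketch averages per-gene expression values over hierarchical category codes. It assumes dotted numeric codes in which each prefix names a parent category; the gene identifiers, the specific codes, the expression values and the function names are all hypothetical and are not taken from the database.

```python
from collections import defaultdict
from statistics import mean


def funcat_ancestors(code):
    """Return the category and all its parents, most general first."""
    parts = code.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]


def project_expression(annotations, expression):
    """Average expression per category; a gene counts once per category."""
    genes_by_category = defaultdict(set)
    for gene, codes in annotations.items():
        for code in codes:
            # A gene annotated to a subcategory also belongs to every parent.
            for category in funcat_ancestors(code):
                genes_by_category[category].add(gene)

    profile = {}
    for category, genes in genes_by_category.items():
        values = [expression[g] for g in genes if g in expression]
        if values:  # skip categories with no measured genes
            profile[category] = mean(values)
    return profile


# Hypothetical annotations and log-ratios, for illustration only.
annotations = {
    "geneA": ["01.01.03", "02.10"],
    "geneB": ["01.01"],
    "geneC": ["02.10"],
}
expression = {"geneA": 2.0, "geneB": 1.0, "geneC": -1.0}

profile = project_expression(annotations, expression)
for category in sorted(profile):
    print(category, profile[category])
```

Propagating each annotation to its parent categories is one possible design choice; it lets a profile be read at any level of the hierarchy, at the cost of counting the same gene in several nested categories.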
URL addresses for MIPS database resources
The Catalogue of Protein–Protein Interactions, the Protein Complex Catalogue and the Protein Localization Catalogues provide information related to the interaction of proteins in yeast. More than 10 600 protein–protein interaction records (~9100 physical, ~1500 genetic) were compiled from published large-scale experiments and the literature. The annotated protein complexes (>1000) can be split into ~87 000 putative binary interactions. The vast majority of the records are documented by PubMed reference IDs and by information on the nature of the experimental evidence, which correlates with the confidence of the assignment used in probabilistic computations.
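The text does not state how complexes are split into binary interactions. One common reading, assumed here, is an all-pairs expansion in which every two members of a complex are treated as a putative interacting pair, so a complex with n members yields n(n-1)/2 pairs. The complex names and protein identifiers below are hypothetical.

```python
from itertools import combinations


def complex_to_binary(members):
    """Return all unordered pairs of distinct members of one complex."""
    unique = sorted(set(members))  # drop duplicates, fix pair orientation
    return list(combinations(unique, 2))


def expand_complexes(complexes):
    """Union of putative binary interactions over all complexes."""
    pairs = set()
    for members in complexes.values():
        pairs.update(complex_to_binary(members))
    return pairs


# Hypothetical complexes, for illustration only.
complexes = {
    "complex_1": ["P1", "P2", "P3"],
    "complex_2": ["P2", "P3", "P4", "P5"],
}

pairs = expand_complexes(complexes)
print(len(pairs))
```

Because pairs are stored in a set with a fixed orientation, a pair shared by two complexes is counted once. Such pairs are putative only: membership in the same complex does not imply direct physical contact.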
Detailed information on transport proteins [Yeast Transport Protein DB (2)], transcription factors and their binding sites [TRANSFAC (3)] and metabolic pathways is either part of the core yeast database or can be retrieved using the BioRS data integration system. To represent the complex data of fungal genomes, we use the Genome Research Environment (GenRE) as our annotation data structure. GenRE allows information on different classes of genetic elements and their relations, such as protein–protein interactions or common regulatory features, to be combined; it provides annotation as well as flexible data retrieval interfaces.
Related proteins from other species can be retrieved using the precomputed SIMAP database (see below) or the integrated SESAM tool [Seed Extraction Sequence Analysis Method (4)]. SESAM was developed to achieve better selectivity and sensitivity in the large-scale characterization of proteins without depending on secondary data collections such as InterPro (5). It particularly addresses the challenging ‘twilight zone’ of <30% overall pairwise sequence identity. SESAM does not require manual adjustment of parameters and copes well with both highly conserved and distantly related homologues. A subsequent clustering step starts from SESAM seed-based alignments and leads to ‘SESAM feature clusters’.
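To make the ‘twilight zone’ criterion concrete, the sketch below computes the percent identity of two already aligned sequences and compares it with the 30% threshold. It is not the SESAM method. The choice of denominator (all alignment columns in which at least one sequence has a residue) is an assumption, since several definitions of pairwise identity are in use, and the function names and example sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identical columns in a pairwise alignment.

    Columns where both sequences have a gap are ignored; columns with a
    gap in only one sequence count as mismatches (an assumed convention).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap and b == gap:
            continue
        compared += 1
        if a == b:  # both are residues here, since double gaps were skipped
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def in_twilight_zone(aligned_a, aligned_b, threshold=30.0):
    """True if the pair falls below the identity threshold."""
    return percent_identity(aligned_a, aligned_b) < threshold


# Hypothetical aligned fragments, for illustration only.
print(percent_identity("AC-E", "ACDE"))
print(in_twilight_zone("ACDEFGHIKL", "AXXXXXXXXL"))
```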
In CYGD, manually annotated genomes are interlinked via BioRS with the PEDANT analysis of recently published full genomes as well as the 13 hemiascomycetous yeasts, generated by the Génolevures I project (6). An up-to-date compilation of the Saccharomyces cerevisiae introns and the analysis of introns in seven related species can be accessed through the ‘Hemiascomycetous Yeast Spliceosomal Introns’ view (7). Comparative analysis of the S.cerevisiae chromosomes is enabled by a graphical display of the fungal orthologues. The integrated complete genomes include: Schizosaccharomyces pombe, Candida albicans, Saccharomyces bayanus, Saccharomyces castellii, Saccharomyces kluyveri, Saccharomyces kudriavzevii, Saccharomyces mikatae, Saccharomyces paradoxus [Whitehead Genome Center (http://www-genome.wi.mit.edu/) and Washington University, St Louis (http://www.genetics.wustl.edu/)], Candida glabrata, Debaryomyces hansenii, Kluyveromyces lactis, Yarrowia lipolytica (Génolevures II, http://cbi.labri.u-bordeaux.fr/Genolevures), as well as the genomes annotated at MIPS: N.crassa (MNCDB), Magnaporthe grisea, Aspergillus nidulans, Fusarium graminearum (FGDB) and Ustilago maydis. Further genomes will be added to build a comprehensive comparative fungal data resource.
The recently annotated genome of the filamentous fungus N.crassa is based on data from the German Neurospora Sequencing Project (Chromosomes II and V) (8) and the whole genome sequence, assembled by the Whitehead Genome Center, Cambridge, MA in 2002 (9). In a collaborative effort with the Whitehead group, the MIPS group has annotated the complete genome, including manually supervised gene modeling and functional classification of the encoded proteins. The genome of ~40 Mb encodes ~10 000 proteins, predicted automatically by the program FGENESH (http://softberry.com), which was specifically trained for Neurospora. The manual inspection of the gene models drew on intrinsic and extrinsic information, such as comparison with known proteins and ESTs as well as splicing consensus signals. Protein sequences were subsequently submitted to a comprehensive analysis of their functional and structural attributes. All information is available at the Neurospora project page (Table ).