Three independent databases of eukaryotic genome size information have been launched or re-released in updated form since 2005: the Plant DNA C-values Database (www.kew.org/genomesize/homepage.html), the Animal Genome Size Database (www.genomesize.com) and the Fungal Genome Size Database (www.zbi.ee/fungal-genomesize/). In total, these databases provide freely accessible genome size data for >10 000 species of eukaryotes assembled from more than 50 years' worth of literature. Such data are of significant importance to the genomics and broader scientific community as fundamental features of genome structure, for genomics-based comparative biodiversity studies, and as direct estimators of the cost of complete sequencing programs.
Eukaryotic genome size data are becoming increasingly important both as the basis for comparative research into genome evolution and as direct estimators of the cost and difficulty of genome sequencing programs for an expanding sphere of non-model organisms (1–3). Nuclear DNA content data for >10 000 species of plants, animals and fungi are made freely available through three independent databases of eukaryotic genome size that have been either launched or re-released since 2005: the Plant DNA C-values Database (http://www.kew.org/genomesize/homepage.html) (4), the Animal Genome Size Database (www.genomesize.com) (5) and the Fungal Genome Size Database (www.zbi.ee/fungal-genomesize) (6).
Genome sizes are typically given as gametic nuclear DNA contents (‘C-values’) either in units of mass (picograms, where 1 pg = 10⁻¹² g) or in number of base pairs (in eukaryotes, most often in megabases, where 1 Mb = 10⁶ bases). These are directly interconvertible as 1 pg = 978 Mb (or 1 Mb = 1.022 × 10⁻³ pg) (7). The majority of modern genome size estimates are based on either Feulgen densitometry (more recently using computerized image analysis) or flow cytometry, although DNA reassociation kinetics, bulk fluorometry, static fluorometry, electrophoretic methods, quantitative real-time PCR and complete genome sequencing have also been used. Data from all such measurements are compiled into the databases along with updated taxonomy, analytical details and other relevant information (e.g. chromosome number) where available.
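The interconversion between mass and base pairs can be illustrated with a minimal Python sketch. It relies only on the factor quoted above (1 pg = 978 Mb); the function and constant names are hypothetical and are not part of any of the databases.

```python
# Conversion factor quoted in the text: 1 pg = 978 Mb.
MB_PER_PG = 978.0


def pg_to_mb(picograms: float) -> float:
    """Convert a C-value from picograms to megabases."""
    return picograms * MB_PER_PG


def mb_to_pg(megabases: float) -> float:
    """Convert a C-value from megabases to picograms."""
    return megabases / MB_PER_PG


# Example: the 795 Mb fungal genome mentioned in the text, expressed in pg.
largest_fungal_pg = mb_to_pg(795.0)
print(f"{largest_fungal_pg:.2f} pg")
```

Keeping the factor in a single named constant makes it simple to swap in a different value if another convention for the mean mass of a base pair is preferred.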
The first genome size estimates were made in the late 1940s, and the earliest attempt at a comprehensive list was provided ~25 years later by Sparrow et al. (8). M.D. Bennett and colleagues carried on this important effort by publishing a series of lists for botanical genome size data beginning in 1976. Unfortunately, zoological and mycological counterparts were not forthcoming for another 30 years, aside from a few taxon-specific compilations based on a small number of sources [e.g. (9)] or online lists of limited scope [e.g. Database of Genome Sizes (http://www.cbs.dtu.dk/databases/DOGS/) and DBA Mammalian Genome Size Database (http://www.unipv.it/webbio/dbagsdb.htm)]. The databases described below, therefore, provide the first truly comprehensive catalogues of eukaryotic genome size data and represent a much-needed resource for members of the genomics community.
By 1995, four major lists of angiosperm genome sizes had been published, which together contained data for 2802 species (10–13). Although the lists were well used (collectively, they had been cited >1500 times as of August 2006), it became increasingly cumbersome to determine whether a particular species was listed. It was therefore decided to pool the values into a single database and release them on the internet. The resulting Angiosperm DNA C-values Database was compiled by M.D. Bennett and I.J. Leitch and was coded by the Information Services Department at the Royal Botanic Gardens, Kew; it went live in April 1997. Between 1997 and 2001, two updates of the Angiosperm DNA C-values Database were released and the Pteridophyte DNA C-values Database was added.
Based on the evident utility and high usage of these two databases, efforts were initiated to construct counterparts for other plant taxa as data became available. Ultimately, this led to the assembly of the overarching Plant DNA C-values Database, which was made available through Java-based queries of a SyBase database and was released in September 2001. Initially, it contained C-values for 3864 species from the four land plant groups [angiosperms, gymnosperms, pteridophytes (comprising monilophytes and lycophytes) and bryophytes], with available genome size estimates for three algal groups (Rhodophyta, Chlorophyta and Phaeophyta) added to Release 3.0 in December 2004.
Release 4.0 of the Plant DNA C-values Database, launched in October 2005, contains genome sizes for 5150 species including 4427 angiosperms, 207 gymnosperms, 87 pteridophytes, 176 bryophytes, and 253 algae compiled from over 550 publications or personal communications (4). Tables 1 and 2 provide a breakdown of absolute and relative coverage of the major plant groups. Table 1 also gives the minimum, maximum and mean C-values for each of the major groups and shows that genome sizes in plants range from 0.01 pg in some unicellular algae (e.g. Cyanidium caldarium) to 127.4 pg in the tetraploid angiosperm Fritillaria assyriaca. Most of these data have been acquired through either Feulgen densitometry of stained root tip squashes (63%) or flow cytometry of freshly chopped leaf material (31%) using a range of plant calibration standards. These data are accessible through a variety of search options that allow users either to analyze C-value data across different groups of plants (by clicking on the Plant DNA C-values Database icon) or to search within taxonomically specific subsections of the database (by clicking on the appropriate plant group icon).
The database contains information where available for the following fields:
At the end of each output, the number of records returned and summary statistics (minimum, maximum, mean and standard deviations) of these records are given.
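The summary statistics reported with each query result can be sketched as follows. This is only an illustration using the Python standard library: the function name and the sample C-values are hypothetical, and the use of the sample (n − 1) standard deviation is an assumption, since the text does not say which form the database reports.

```python
import statistics
from typing import Dict, Sequence


def summarize_c_values(c_values_pg: Sequence[float]) -> Dict[str, float]:
    """Return the record count, minimum, maximum, mean and standard deviation.

    Assumes at least two records; the sample (n - 1) standard deviation is
    used here as an assumption, not as the database's documented choice.
    """
    return {
        "records": len(c_values_pg),
        "min": min(c_values_pg),
        "max": max(c_values_pg),
        "mean": statistics.mean(c_values_pg),
        "sd": statistics.stdev(c_values_pg),
    }


# Hypothetical C-values (pg) standing in for the records returned by a query.
example_records = [0.5, 1.2, 2.8, 5.9, 17.1]
for name, value in summarize_c_values(example_records).items():
    print(name, value)
```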
Additional search options further enhance the flexibility of the database:
In addition to searching the entire database in this way, users can choose to search subsections of the database by selecting the specific plant group of interest (i.e. angiosperms, gymnosperms, pteridophytes, bryophytes or algae) from the homepage. In doing so, the user is provided with additional options for querying and/or outputting that are of unique relevance to each taxon:
Users are required to provide an email address to query the database, which aids in the tracking of usage and in the protection of intellectual property, but otherwise there are no restrictions whatsoever on access.
Besides genome size data, the database includes a summary of the development and release history of the database, instructions on how to search the database, author contact information, links to other databases containing genome size data, and the meeting reports from the international Plant Genome Size meetings, two of which have been held to date (in 1997 and 2003) at the Royal Botanic Gardens, Kew.
The Plant DNA C-values Database has been widely used, with >110 000 hits from over 55 countries since its (re-)launch in 2001. On average, the database receives 2000–3000 hits per month with a mean of >60 queries per day, with each query downloading on average 110 genome size estimates. As of August 2006, the database had been cited in ~130 publications since its initial launch as the Angiosperm DNA C-values Database in 1997.
The first large-scale compilation of animal genome size data was created for an analysis of the correlation between genome size and erythrocyte size in mammals (18), which was later expanded for a similar study in birds (19). Recognizing the severe limitations on the study of animal genome size variation posed by the lack of access to such data, these unpublished datasets were expanded to include data from both vertebrates and invertebrates and were posted online as the Animal Genome Size Database on January 10, 2001. This initial release consisted only of flat text tables and included ~2900 animal species. As data continued to be added over the ensuing 5 years, the flat table format became increasingly cumbersome both for updates and for the growing number of users.
Animal genome size data are accessed through either browse or search functions. The browse function allows users to select an entire group of animals (e.g. mammals, insects), or to select subsections of the database using progressive pull-down menus ranging in specificity from phylum to species. The advanced search feature allows a variety of queries, including genus, species or common name, as well as options to select genome sizes equal to, less than/greater than or between user-specified values. Finally, it is also possible to retrieve all records generated using a given method, standard species or cell type.
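The kind of query described above (by genus, by genome size range and by method) can be sketched in Python. Everything in the example is hypothetical: the `Record` fields, the `search` function and the placeholder records are illustrative only and do not reflect the database's actual schema or interface. Bounds are treated as inclusive, so an "equal to" query corresponds to setting both bounds to the same value.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Record:
    """Hypothetical, simplified genome size record."""
    genus: str
    species: str
    c_value_pg: float
    method: str


def search(
    records: Iterable[Record],
    genus: Optional[str] = None,
    min_pg: Optional[float] = None,
    max_pg: Optional[float] = None,
    method: Optional[str] = None,
) -> List[Record]:
    """Return records matching every criterion that is not None."""
    hits = []
    for rec in records:
        if genus is not None and rec.genus.lower() != genus.lower():
            continue
        if min_pg is not None and rec.c_value_pg < min_pg:
            continue
        if max_pg is not None and rec.c_value_pg > max_pg:
            continue
        if method is not None and rec.method != method:
            continue
        hits.append(rec)
    return hits


# Placeholder records with made-up names and values, for illustration only.
placeholder = [
    Record("Genusa", "alpha", 1.0, "flow cytometry"),
    Record("Genusa", "beta", 3.0, "Feulgen densitometry"),
    Record("Genusb", "gamma", 2.0, "flow cytometry"),
]
# "Between" query: all records with 1.5 <= C-value <= 3.0 pg.
print([r.species for r in search(placeholder, min_pg=1.5, max_pg=3.0)])
```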
Data are returned in customizable dynamic tables, with users specifying the number of records displayed per page (100, 250, 500 or All). The default results page includes taxonomic details (Phylum/Subphylum, Class, Order, Family, Genus, Species, common name), C-value in pg, chromosome number (where available), and the method, cell type and standard species used in the analysis. The source is given as a numbered reference with a hotlink to the full citation. Two courses of action are possible from this results table: (i) the data can be downloaded and can be viewed using Excel (with the spreadsheet following the same customized format as the dynamic tables), or (ii) users can click on species names to enter individual species pages. The latter option provides a detailed record for the species of choice, including taxonomic and methodological details, the C-value estimate from the chosen record as well as links to other available records for the same species, chromosome number, the full source citation and both internal links (e.g. to call up data for all members of the genus, family, order, etc.) and external links [e.g. to NCBI, image searches and both general (e.g. the Integrated Taxonomic Information Service) and specific (e.g. FishBase, AmphibiaWeb) taxonomic databases as applicable]. There are no limitations on browsing or searching the database, but downloading data to Excel requires users to input a name and valid email address as a digital signature of a data sharing agreement. A randomized and limited-duration link to the compiled spreadsheet is then emailed to the input address as a means of protecting intellectual property without hindering access to information.
Release 2.0 of the Animal Genome Size Database also provides users with up-to-the-minute summary statistics for the entire database and for each major taxonomic group and subgroup therein: the number of species covered, the minimum and maximum, the mean ± standard error, and a breakdown of the methods, cell types and standards used for all records in the given group, along with a brief summary of the major patterns and correlates reported to date for the taxon in question. Other features available to users include a real-time Flash-based graphical summary of the total dataset, relevant announcements and a list of the 10 most recently added records on the main page, as well as a fully searchable reference list, an FAQ, author contact information, links to related sites and a genome size discussion forum.
Traffic at the Animal Genome Size Database has increased steadily since its launch in 2001, and the main page now receives 50–100 unique visitors per day. Records regarding individual queries are not kept, but a typical data download includes all data for one or more entire groups of animals (i.e. up to several hundred species for a particular vertebrate group). The database has been cited in ~90 publications since 2001.
In a discussion of the plant and animal genome size databases penned in mid-2004, it was noted that ‘unfortunately, equivalent databases have not yet been compiled for fungi or “protists”, although this would clearly be a worthy project for experts in those groups to undertake’ (3). On March 20, 2005, a major portion of this gap was filled with the launch of the Fungal Genome Size Database (6).
Numerous relative genome sizes (i.e. in arbitrary units) had been estimated in the late 1980s and early 1990s by researchers at the University of Regensburg in Germany using a classical cytophotometry technique, including 287 records for Basidiomycetes (20,21) and 743 for Ascomycetes (22). Using the same method as well as flow cytometry and image cytometry, and by employing an internal standard (Saccharomyces cerevisiae), it became possible to convert these estimates from arbitrary units into far more informative absolute genome sizes in Mb (23–25). These converted data formed the basis of the Fungal Genome Size Database, which has since been expanded to include 1298 records covering 739 species and 335 genera from 40 orders (Table 4) based on the taxonomy of the Index Fungorum Partnership (www.indexfungorum.org) (26).
Data from the Fungal Genome Size Database are made available through queries (PHP, HTML) of a MySQL database. The user and administrative interfaces for the database are generated by a content management system (TTCMS) developed by Trump Trading Ltd. The data can be queried at different taxonomic levels (phylum, order, genus, species epithet, variety) as well as by ploidy level, chromosome number, chromosome size range, method of genome size estimation, standard specimens used, cell type analyzed and source reference. Responses to queries are presented as HTML tables, with detailed information about given records (e.g. herbarium index, original reference and additional remarks) provided in a separate pop-up window accessed by clicking on a given genus or species name in the main table.
Compared with plants and animals, fungi display very small genomes: ~90% of the available fungal data lie within the range of 1C = 10–60 Mb, with an average of ~37 Mb and a median of 28 Mb (Figure 1). The largest fungal genome size reported to date, that of Scutellospora castanea (Diversisporales), is a mere 795 Mb (0.81 pg) (27), whereas the smallest, 6.5 Mb (0.007 pg) in Pneumocystis carinii f. sp. muris (Pneumocystidales), is far smaller than even the most streamlined animal or non-algal plant genomes (www.broad.mit.edu/annotation/fungi/fgi/FGI_01_whitepaper_2002.pdf) (28).
As with plants (and to a far lesser but not insignificant degree with animals), ploidy level variability is an important consideration in fungi. Ploidy level (x) has been estimated for 1036 (80%) of the records in the database, and varies from 1x to 50x. Diploidy (2x) is the single most commonly observed level (36% of records), although haploidy (1x) is also common; a level of 50x has been reported for only one species, Neottiella rutilans (22). Chromosome numbers have been reported for 81 of the species included in the database, ranging from n = 3 in Schizosaccharomyces pombe (Schizosaccharomycetales) (29) to n = 20 in Ustilago hordei (Ustilaginales) and Batrachochytrium dendrobatidis (Chytridiales) (28,30).
In both plants and animals, the majority of variation among estimates for individual species is attributed to experimental error (3,14,15). In fungi, however, it remains unclear to what extent apparent intraspecific variation is non-artifactual as data regarding heteroploidy in this group remain controversial (20,31,32). There is evidence that interspecific hybrids may occur in most fungal phyla, with both sexual and asexual origins evident among the growing list of apparent fungal hybrids (33). Hybrids may be diploid or maintain the dikaryotic state, they may undergo karyogamy and normal meiosis to reconstitute the euploid state, or they may undergo abnormal meiosis to yield a heteroploid hybrid. During vegetative growth, chromosomes and chromosome segments can be lost at random, which would generate legitimate variation in estimated genome sizes.
Electrophoretic karyotyping has shown that variation in chromosome number and size is a rule rather than an exception for many, mostly asexual, species (32). This method indicated that genome size in Pleurotus ostreatus (Agaricales) ranges from 20.8 to 35.1 Mb (0.021–0.036 pg, a relative difference of >60%) and chromosome number ranges from 6 to 11 (34,35). Using flow cytometry, genome size in the same species appears to range from 18.5 to 28.7 Mb (0.019–0.029 pg, a 55% difference) (B. Kullman, unpublished data), whereas microfluorometric measurements resulted in a reported range of 24.0–27.53 Mb (0.025–0.028 pg, a 15% difference) (21). It bears noting, however, that even small absolute differences among estimates that might be considered within the margin of measurement error in plants or animals (e.g. 0.01 pg) translate into substantial relative differences in species with such tiny genomes.
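The relative differences quoted for Pleurotus ostreatus can be reproduced with a short Python sketch. It assumes that the difference is expressed relative to the lower estimate, an interpretation that appears consistent with the three percentages given above but is not stated explicitly; the function name is hypothetical.

```python
def relative_difference_percent(low: float, high: float) -> float:
    """Difference between two estimates as a percentage of the lower one.

    Assumption: the lower estimate is the reference value.
    """
    return (high - low) / low * 100.0


# Genome size ranges (Mb) for Pleurotus ostreatus as reported in the text.
ranges_mb = {
    "electrophoretic karyotyping": (20.8, 35.1),
    "flow cytometry": (18.5, 28.7),
    "microfluorometry": (24.0, 27.53),
}
for method, (low, high) in ranges_mb.items():
    print(f"{method}: {relative_difference_percent(low, high):.1f}%")
```

The same function illustrates the closing point of the paragraph: an absolute difference of 0.01 pg between estimates of 0.02 and 0.03 pg is a 50% relative difference, whereas the same 0.01 pg is negligible against a genome of several picograms.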
At this early stage, the database receives ~10–20 unique hits per day, and at the time of this writing has received >9000 visitors from around the world.
Taken together, the three eukaryotic genome size databases represent some of the broadest genetic datasets available, covering >10 000 species. In relative terms, however, this comprises a very small minority of eukaryotic diversity. It is therefore a primary objective of modern genome size research to greatly increase the coverage of taxa in all three kingdoms. Perhaps the least well studied of all, however, are the members of the extremely diverse (and paraphyletic) assemblage commonly known as ‘protists’. The construction of a database of genome sizes for this group, and subsequent efforts to fill the gaps therein, represents an equivalently high priority. Overall, the release of these databases has proved to be a boon for the advancement of knowledge about eukaryotic genome structure and evolution, and has made it possible for the first time to identify the key areas still in need of intensive study.
The authors wish to thank their many colleagues and collaborators for assistance with various aspects of the construction and maintenance of the genome size databases. Work on the Animal Genome Size Database has been supported by the Natural Sciences and Engineering Research Council of Canada in the form of several scholarships, fellowships and grants to T.R.G. Research leading to the development of the Fungal Genome Size Database was supported by Estonian Science Foundation grant number 4989 to B.K. The Open Access publication charges for this article were waived by Oxford University Press.
Conflict of interest statement. None declared.