The University of California, Santa Cruz, Genome Browser Database (GBD) provides integrated sequence and annotation data for a large collection of vertebrate and model organism genomes. Seventeen new assemblies have been added to the database in the past year, for a total coverage of 19 vertebrate and 21 invertebrate species as of September 2007. For each assembly, the GBD contains a collection of annotation data aligned to the genomic sequence. Highlights of this year's additions include a 28-species human-based vertebrate conservation annotation, an enhanced UCSC Genes set, and more human variation, MGC, and ENCODE data. The database is optimized for fast interactive performance with a set of web-based tools that may be used to view, manipulate, filter and download the annotation data. New toolset features include the Genome Graphs tool for displaying genome-wide data sets, session saving and sharing, better custom track management, expanded Genome Browser configuration options and a Genome Browser wiki site. The downloadable GBD data, the companion Genome Browser toolset and links to documentation and related information can be found at: http://genome.ucsc.edu/.
Fundamental to expanding our knowledge of how the human body works in health and in disease is the capability to access and share data produced through experimentation and computational analysis. The University of California, Santa Cruz (UCSC) Genome Browser Database (GBD) (http://genome.ucsc.edu) (1) provides a common repository for genomic annotation data—including comparative genomics, genes and gene predictions; mRNA and EST alignments; and expression, regulation, variation and assembly data—and robust, flexible tools for viewing, comparing, distributing and analyzing the information. Produced and maintained by the Genome Bioinformatics Group at the UCSC Center for Biomolecular Science and Engineering, the GBD focuses primarily on vertebrate and model organism genomes, with an emphasis on comparative genomics analysis.
As of September 2007 the GBD contains data for 11 mammalian species including human, mouse, rat, chimpanzee, rhesus macaque, horse, cow, cat, dog, opossum and platypus; 8 other vertebrates: chicken, lizard (Anolis carolinensis), frog (Xenopus tropicalis), zebrafish, fugu, tetraodon, medaka and stickleback; and 21 invertebrates including 11 flies, honeybee, Anopheles mosquito, five worms, one yeast (Saccharomyces cerevisiae) and two deuterostomes—purple sea urchin and sea squirt. For many of the organisms, more than one assembly is provided, and several older archived assemblies may be found at: http://genome-archive.cse.ucsc.edu/. The GBD stores a collection of annotation data for each assembly, which can be viewed graphically in the UCSC Genome Browser (2) as a series of ‘tracks’ aligned to the genomic sequence and grouped according to shared characteristics, for example gene predictions, gene expression and variation data. In most instances, each annotation track is represented by a position-oriented table based on genomic sequence coordinates, and may be supplemented by additional non-positional tables that supply related information or link the primary table to other tables in the database. The data are stored in a variety of formats described at: http://genome.ucsc.edu/FAQ/FAQformat.
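To make the idea of a position-oriented table concrete, the following sketch parses a single line in BED format, one of the formats covered on the FAQ page cited above. It is only an illustration: the names `BedFeature` and `parse_bed_line` and the sample feature are hypothetical and are not part of the GBD. The sketch relies on two properties of BED: only the first three fields (chromosome, start, end) are required, and coordinates are zero-based and half-open.

```python
from typing import NamedTuple, Optional


class BedFeature(NamedTuple):
    chrom: str
    start: int  # zero-based, inclusive
    end: int    # exclusive
    name: Optional[str] = None
    score: Optional[int] = None
    strand: Optional[str] = None


def parse_bed_line(line: str) -> BedFeature:
    """Parse up to the first six whitespace-separated BED fields of one line."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("a BED line needs at least chrom, start and end")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    if start > end:
        raise ValueError("start must not exceed end")
    name = fields[3] if len(fields) > 3 else None
    score = int(fields[4]) if len(fields) > 4 else None
    strand = fields[5] if len(fields) > 5 else None
    return BedFeature(chrom, start, end, name, score, strand)


# Hypothetical feature on chromosome 7 (values chosen for illustration only).
feature = parse_bed_line("chr7\t127471196\t127472363\texampleGene\t0\t+")
length = feature.end - feature.start  # half-open interval, so no +1
```

Because the interval is half-open, the feature length is simply the end minus the start.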
Minimally, the GBD provides assembly data, comparative genomics annotations, and mRNA, EST and RefSeq (3) gene alignments (when available) from GenBank (4) for each assembly. When available, links are provided to the complementary annotations in two other major genome browsers, Ensembl (5) and NCBI's MapViewer (6). A large set of additional annotations is available for widely studied genomes such as the human and mouse. Assemblies that lack sufficient native RefSeq data alignments and are of sufficient evolutionary distance from the human genome may also include a human proteins annotation that maps human exons using tBLASTn. The organizations and individuals who contributed to the sequencing, assembly, and annotation of featured organisms are acknowledged at: http://genome.ucsc.edu/goldenPath/credits.html; detailed information about the individual annotation tracks may be found in the Genome Browser by clicking the vertical gray or blue bars to the left of the displayed tracks.
UCSC updates the genome assemblies and annotations in the GBD as new releases become available, with priority given to primate and model organism assemblies and annotations that we feel are of widespread interest to GBD users, based on input from our Scientific Advisory Board and feedback received through our mailing lists and user surveys. (The results from a users’ survey conducted in May 2007 may be reviewed at: http://genome.ucsc.edu/goldenPath/help/GBsurvey507.html.) RefSeq and mRNA data from GenBank are updated daily; EST data are updated weekly.
In addition to the Genome Browser, several other graphical tools for exploring the data are available from the GBD website, including the Table Browser (7), which provides access for downloading and manipulating the GBD tables as text or tracks; the BLAT sequence-mapping tool (8); the In Silico PCR tool, which searches a sequence database with a pair of PCR primers; the Gene Sorter (9) for exploring expression, homology and other gene relationships; the VisiGene in situ image browser; the Proteome Browser (10) for viewing related protein information; and the new Genome Graphs tool for uploading and viewing genome-wide data sets. This toolset is accompanied by a comprehensive set of online documentation and FAQs listed at http://genome.ucsc.edu/FAQ/. Online and hands-on training materials are available via the Training link (http://genome.ucsc.edu/training) on the GBD home page.
The GBD data, tools and source may be downloaded from http://hgdownload.cse.ucsc.edu/downloads.html. Instructions for setting up a local server to mirror all or part of the GBD data can be found at http://genome.ucsc.edu/admin/mirror.html.
During the year ending September 2007 UCSC added eight new organisms to the GBD: lizard (A. carolinensis), horse, platypus, medaka, stickleback and three worms (Caenorhabditis brenneri, C. remanei and Pristionchus pacificus). Nine existing organisms were updated with new assemblies: mouse, cow, cat, fugu, zebrafish, Drosophila melanogaster, two worms (C. elegans, C. briggsae) and sea urchin. As updates are added, older assemblies remain accessible either on the primary website or through the GBD archives.
In April 2007 UCSC released UCSC Genes (W.J. Kent, manuscript in preparation), an improved version of the existing Known Genes annotation (11), on the March 2006 (Build 36, hg18) human assembly. This annotation, which includes putative non-coding genes as well as protein-coding genes and 99.9% of RefSeq genes, is a moderately conservative prediction set based on data from RefSeq, GenBank and UniProt (12). Each entry requires the support of one GenBank RNA sequence and at least one additional line of evidence, with the exception of RefSeq RNAs, which require no additional evidence. Although some of the transcripts labeled as ‘non-coding’ in the set may actually code for protein, typically the evidence for the associated protein is weak. Compared to RefSeq, this gene set generally has about 10% more protein-coding genes, approximately five times as many putative non-coding genes, and about twice as many splice variants. As part of the migration to the UCSC Genes annotation, we now use our own UCSC Genes accession numbers as the primary key into the underlying knownGene table, rather than the GenBank mRNA accessions used in previous Known Genes annotations. The base accession numbers remain stable across iterations of the data set, although the suffixes may change to reflect version updates.
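The distinction between a stable base accession and a changing version suffix can be sketched as follows. The sketch assumes that an accession has the form `base.version` with a numeric suffix after the final dot; that layout, the sample accession `uc001aaa.1` and the function names are illustrative assumptions rather than a specification taken from this article.

```python
from typing import Tuple


def split_accession(accession: str) -> Tuple[str, int]:
    """Split an accession of the assumed form 'base.version' into its parts."""
    base, sep, version = accession.rpartition(".")
    if not sep or not base or not version.isdigit():
        raise ValueError(f"not in base.version form: {accession!r}")
    return base, int(version)


def same_entry(first: str, second: str) -> bool:
    """True when two accessions share a base, whatever their versions."""
    return split_accession(first)[0] == split_accession(second)[0]


# Hypothetical accessions from two iterations of the data set.
old_base, old_version = split_accession("uc001aaa.1")
still_same = same_entry("uc001aaa.1", "uc001aaa.2")
```

Under this assumption, joining tables from different iterations of the data set on the base alone would tolerate version changes.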
A companion annotation to UCSC Genes, the Alt Events track, shows various types of alternative splicing, alternative promoter and other events that result in more than a single transcript from the same gene.
UCSC released a new Conservation (13) annotation track on the March 2006 (Build 36, hg18) human genome in June 2007. This track displays multiz (14) multiple alignments of 27 vertebrate species to the human genome, along with measurements of evolutionary conservation across all 28 species and a separate measurement of conservation across the placental mammal subset of species (18 organisms). Included in the track are 5 new high-quality assemblies (horse, platypus, lizard, stickleback and medaka); 6 new low-coverage mammalian genomes (bushbaby, tree shrew, guinea pig, hedgehog, common shrew and cat); 6 updated assemblies (chimp, cow, chicken, frog, fugu and zebrafish); and 10 assemblies included in the previous version of the Conservation track (rhesus, mouse, rat, rabbit, dog, armadillo, elephant, tenrec, opossum and tetraodon). In addition to the expanded species list, the new Conservation track has been enhanced in two ways: the pairwise alignments for each species undergo additional filtering to reduce paralogous alignments, and the multiple alignment downloads now include information about the quality of the aligning species sequence. A similar Conservation annotation of at least 30 species is scheduled for release on the July 2007 (Build 37, mm9) mouse assembly in the last quarter of 2007.
Within the Variation and Repeats annotation group, UCSC has added several new data sets. The simple nucleotide polymorphism (SNP) data from dbSNP (15) Build 126, already available on the human and mouse assemblies, have been added to the chimp, rat and dog assemblies. Updates to SNP Build 128 will be incorporated pending data release from dbSNP. SNP annotations may be filtered by several attributes, including average heterozygosity and weight, location type, class, validation, function and molecule type. The alignments of each SNP's flanking sequences to the genome are displayed on the details page for that SNP; in addition, the hg18 SNP details pages include the chimp and rhesus macaque orthologous alleles. We have also added a HapMap SNPs annotation (16) to the hg17 and hg18 assemblies containing data for 4 million SNPs (dbSNP 125) from four populations, together with the display of orthologous alleles from chimp and rhesus macaque and several options for filtering the data display. The Structural Variation annotation has been expanded to include structural variation data (17), deletions detected by several techniques (18–20) and numerous copy number polymorphism data sets (21–25). The SNP Arrays track displays SNPs available for genotyping with several different microarrays. The Exapted Repeats annotation displays conserved non-exonic elements that have been deposited by characterized mobile elements (26).
In addition to updating selected existing data sets, we have introduced several new annotations to various assemblies. High-confidence gene annotations from the Consensus Coding DNA Sequence (CCDS) project (http://www.ncbi.nlm.nih.gov/CCDS/) have been added to more human assemblies and to the mouse. The Affymetrix Transcriptome Phase 3 data set (human) (27) displays transcriptome data from tiling Affymetrix GeneChips. The ORegAnno (Open Regulatory Annotation) track (several species) shows literature-curated regulatory regions, transcription factor binding sites and regulatory polymorphisms from the ORegAnno database (28). The ACEScan annotation (human) identifies predicted alternative human-mouse conserved exons from ACEScan (29). The CGAP SAGE track (human, mouse) displays genomic mappings for human LongSAGE tags from the Cancer Genome Anatomy Project (CGAP) (30), using the Serial Analysis of Gene Expression (SAGE) quantitative technique (31). The MGI QTL track (mouse) shows approximate positions of quantitative trait loci based on reported peak LOD scores from the Jackson Laboratory Mouse Genome Informatics group. The zebrafish genome now provides expression data using the Affymetrix Zebrafish GeneChip Genome Array (32).
The track details pages for features in the Mammalian Gene Collection (MGC) (33) Genes annotation track (available for several species) now include extensive information about the MGC clones, including links for ordering the clones.
A new annotation on the hg18 human assembly, ORFeome Clones, shows alignments of human clones from the ORFeome Collaboration (http://www.orfeomecollaboration.org/) (34), a project that aims to be an unrestricted source of fully sequence-validated full-ORF human cDNA clones, with the goal of providing at least one fully sequenced full-ORF clone for each human gene. This annotation is automatically updated daily as new clones become available.
The Genome Browser serves as the data repository for the ENCODE (Encyclopedia of DNA Elements) project (35). The set of human genome annotations available on the UCSC ENCODE portal (http://genome.ucsc.edu/ENCODE/) (36), contributed by members of the ENCODE Consortium, has grown substantially in the past 12 months, from 130 tracks and 950 tables on 2 assemblies in September 2006 to 199 tracks and 1583 tables on 3 assemblies in September 2007 (increases of roughly 53% and 67%, respectively). The ENCODE data sets are now available on the March 2006 (Build 36, hg18) as well as the May 2004 (Build 35, hg17) human assembly.
We have updated the VisiGene image sets from the Jackson Lab Mouse Genome Informatics Database and the Allen Brain Atlas, and have added links from the mouse Known Genes and UCSC Genes details pages to X. laevis images. The VisiGene probe-processing utilities have been updated to interact with the new UCSC Genes set.
The hg18 Gene Sorter has been expanded to include the Wanker (37), Vidal (38) and Human Protein Reference Database (39) data sets showing the neighborhood of protein interactions surrounding selected genes. The neighborhoods are computed from a genome-wide protein–protein interaction network that connects genes if the proteins they encode have been detected to physically interact in high-throughput experiments. The xxBlastTab tables, which display gene ortholog data for human, mouse and rat in the Gene Sorter and UCSC Genes details pages, have now been filtered for synteny.
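One way to picture an interaction neighborhood is as the set of genes reachable from a selected gene within a few interaction steps. The sketch below builds an undirected network from interaction pairs and collects that set with a breadth-first search. The article does not state how the Gene Sorter defines or computes its neighborhoods, so the distance cutoff, the function names and the gene labels here are all assumptions made for illustration.

```python
from collections import deque
from typing import Dict, Iterable, Set, Tuple


def build_network(interactions: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Connect two genes whenever their proteins were reported to interact."""
    network: Dict[str, Set[str]] = {}
    for gene_a, gene_b in interactions:
        network.setdefault(gene_a, set()).add(gene_b)
        network.setdefault(gene_b, set()).add(gene_a)
    return network


def neighborhood(network: Dict[str, Set[str]], gene: str,
                 max_distance: int = 1) -> Dict[str, int]:
    """Map each gene within max_distance steps of `gene` to its distance."""
    distances = {gene: 0}
    queue = deque([gene])
    while queue:
        current = queue.popleft()
        if distances[current] == max_distance:
            continue  # do not expand beyond the cutoff
        for partner in network.get(current, ()):
            if partner not in distances:
                distances[partner] = distances[current] + 1
                queue.append(partner)
    del distances[gene]  # the selected gene is not part of its own neighborhood
    return distances


# Hypothetical interaction pairs forming a chain A-B-C-D.
network = build_network([("A", "B"), ("B", "C"), ("C", "D")])
direct = neighborhood(network, "A", max_distance=1)
two_steps = neighborhood(network, "A", max_distance=2)
```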
The GBD and the Genome Browser toolset are dynamic resources that continually evolve to accommodate new genome assemblies, data types and research requirements. In the past year we have expanded and enhanced many of our Genome Browser tools to improve data browsing and manipulation capabilities for our users and collaborators.
Genome Graphs (http://genome.ucsc.edu/cgi-bin/hgGenome), a new tool accessible via the ‘Genome Graphs’ link on the GBD home page, can be used to display genome-wide data sets, for example, the results of genome-wide SNP association studies, linkage studies and homozygosity mapping. Using the Genome Graphs tool, it is possible to upload or import several sets of genome-wide data and display them simultaneously (Figure 1). The user can then restrict the display to only those regions that exceed a set significance threshold; use the Gene Sorter to display the genes located in regions where the data meet a given significance threshold; display an area of interest in the Genome Browser; and calculate the correlation coefficient (R) among the data sets. Both public and personal data sets may be loaded into the tool, and the display can be configured to suit individual needs.
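Two of the operations described above, thresholding and correlation, can be sketched in a few lines. The example assumes that each data set maps a (chromosome, position) pair to a numeric value and that R is the Pearson correlation computed over the positions shared by two data sets. The article does not specify which correlation coefficient the tool uses, so Pearson is an assumption here, as are the function names and the sample values.

```python
import math
from typing import Dict, List, Tuple

Position = Tuple[str, int]  # (chromosome, base position)


def above_threshold(data: Dict[Position, float], threshold: float) -> List[Position]:
    """Return the positions whose value exceeds the threshold, in sorted order."""
    return sorted(pos for pos, value in data.items() if value > threshold)


def pearson_r(first: Dict[Position, float], second: Dict[Position, float]) -> float:
    """Pearson correlation over the positions present in both data sets."""
    shared = sorted(set(first) & set(second))
    if len(shared) < 2:
        raise ValueError("need at least two shared positions")
    xs = [first[pos] for pos in shared]
    ys = [second[pos] for pos in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for two genome-wide data sets.
study_a = {("chr1", 100): 1.0, ("chr1", 200): 2.0, ("chr2", 50): 3.0}
study_b = {("chr1", 100): 2.0, ("chr1", 200): 4.0, ("chr2", 50): 6.0}
significant = above_threshold(study_a, 1.5)
r = pearson_r(study_a, study_b)
```

Restricting the calculation to shared positions is a design choice of this sketch; data sets sampled at different positions would need binning or interpolation first.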
We have launched a wiki site for sharing information about the UCSC Genome Browser and its data. The wiki—at: http://genomewiki.ucsc.edu—provides an informal forum for our browser users, mirror sites and staff to discuss topics of interest in the genome biology field and exchange usage tips, scripts, programs and notes about mirroring the Genome Browser and working with the Genome Browser source. As with most wiki sites, general users are welcome to edit and add pages after logging in.
Users can now save their favorite Genome Browser sessions for reuse and sharing by using a new session management feature, accessible via the ‘Session’ link (http://genome.ucsc.edu/cgi-bin/hgSession) on the GBD home page and the blue navigation bar at the top of many of the tool web pages. Log-in access to the session features is controlled through the Genome Browser wiki site. Once logged in, the user can save the current Genome Browser session, including the exact position and track combination on display; share the session with another user or keep it private; and load their own saved sessions as well as those shared by others. Saved sessions persist for 1 year after the last access, unless deleted. Custom tracks within sessions persist for at least 48 h after the last time they are viewed.
Custom annotation tracks, a popular Genome Browser feature for several years, allow users to load, display and manipulate personal data in the Genome Browser and Table Browser. The new custom tracks manager (http://genome.ucsc.edu/cgi-bin/hgCustom) makes the use of custom tracks much easier. The management interface can be accessed through the ‘add/manage custom tracks’ button on the Genome Browser gateway (http://genome.ucsc.edu/cgi-bin/hgGateway) or tracks (http://genome.ucsc.edu/cgi-bin/hgTracks) page. In addition to the data upload options supported in previous versions, users can now load and display multiple custom tracks simultaneously, add to, delete and modify the uploaded custom track set, load and manage tracks from multiple assemblies, and upload description pages for custom tracks. The lifespan of a custom track on the UCSC server has been increased from 8 to 48 h after last access, and we have converted the underlying custom track architecture from a file-based system to a database system to improve the performance.
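For readers unfamiliar with custom tracks, the sketch below assembles the text of a small custom track: a `track` definition line followed by BED-format feature lines. The layout follows the Genome Browser custom track documentation rather than anything specified in this article, and the function name, track name and feature values are hypothetical.

```python
from typing import Iterable, Tuple

# (chromosome, zero-based start, exclusive end, feature label)
Feature = Tuple[str, int, int, str]


def format_custom_track(name: str, description: str,
                        features: Iterable[Feature]) -> str:
    """Build custom track text: one track line, then one BED line per feature."""
    lines = [f'track name="{name}" description="{description}"']
    for chrom, start, end, label in features:
        if not 0 <= start < end:
            raise ValueError(f"invalid interval for {label}: {start}-{end}")
        lines.append(f"{chrom}\t{start}\t{end}\t{label}")
    return "\n".join(lines) + "\n"


# Hypothetical regions of interest.
track_text = format_custom_track(
    "myRegions",
    "Example regions",
    [("chr1", 100, 200, "regionA"), ("chr2", 5000, 5600, "regionB")],
)
```

The resulting text could be pasted into the custom track upload form or saved to a file for upload.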
We have added several new user-configurable display options to the Genome Browser that expand navigation within gene, mRNA and EST-based tracks and allow increased manipulation of the tracks image and track control groups. These options are controlled on the browser configuration page, which is accessed through the ‘configure’ button on the Genome Browser gateway or tracks page. When the ‘Next/previous item navigation’ configuration option is toggled on, gray double-headed arrows are displayed in the Genome Browser tracks image on both sides of the track labels of gene, mRNA and EST tracks (or any standard tracks based on BED, PSL or genePred format). The image window may be shifted to display the next track feature towards the 5' or 3' end of the chromosome by clicking the corresponding left or right arrow. Similarly, the ‘Next/previous exon navigation’ configuration option displays white double-headed arrows on both the 5' and 3' ends of each track item that has exons positioned beyond the edges of the current image. Clicking on one of the arrows shifts the image window to the next exon located towards that end of the feature. Another new configuration option, ‘Enable track reordering’, allows the user to change the display order of the track groups as well as the order of annotation tracks within the groups, and to move tracks between track groups. This is particularly useful for customizing the browser display to individual research needs or for creating images for publication. Track groups on the Genome Browser tracks page may now be quickly collapsed and expanded by clicking the + and − icons on the left side of the group label.
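The logic behind next/previous item navigation can be sketched as a search for the nearest feature lying entirely outside the current window, followed by a shift of the window. The sketch below treats features and the window as half-open (start, end) intervals on one chromosome. How the Genome Browser actually selects the item and repositions the image is not described in this article, so the selection rule, the recentering step and all names here are assumptions.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # zero-based start, exclusive end


def next_item(features: List[Interval], window: Interval) -> Optional[Interval]:
    """Nearest feature that starts at or beyond the right edge of the window."""
    _, window_end = window
    candidates = [f for f in features if f[0] >= window_end]
    return min(candidates, default=None)


def previous_item(features: List[Interval], window: Interval) -> Optional[Interval]:
    """Nearest feature that ends at or before the left edge of the window."""
    window_start, _ = window
    candidates = [f for f in features if f[1] <= window_start]
    return max(candidates, key=lambda f: f[1], default=None)


def recenter(window: Interval, feature: Interval) -> Interval:
    """Shift the window, keeping its width, so it is centered on the feature."""
    width = window[1] - window[0]
    midpoint = (feature[0] + feature[1]) // 2
    new_start = max(0, midpoint - width // 2)  # never move past position 0
    return (new_start, new_start + width)


# Hypothetical track items and viewing window.
items = [(100, 200), (500, 600), (900, 1000)]
view = (300, 700)
to_right = next_item(items, view)
to_left = previous_item(items, view)
shifted = recenter(view, to_right)
```

Exon navigation could follow the same pattern with a single item's exons as the feature list.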
We have expanded the coloring and display options for mRNA tracks to include several ways to highlight gaps in alignments of query sequences (usually transcripts) to the genome, which frequently indicate a problem with the query sequence or with the genome assembly. Tracks may be colored by genomic, mRNA or nonsynonymous mRNA codons, mRNA bases, or different mRNA bases. Tracks may then be configured, through options on the description page, to display double horizontal lines at locations where both the genome and query sequence have an insertion and to display vertical lines of different colors to distinguish poly(A) tail insertions and insertions at the beginning, end or middle of the query (Figure 2). The new coloring scheme makes it easier to visually scan a region with hundreds of alignments and pick out regions of interest (Figure 3). The new display options are explained in detail on the mRNA track description pages.
To visually simplify the display of the large number of similar annotation tracks present in the ENCODE track groups, collections of related tracks are now represented by a single ‘super-track’ control that provides a descriptive overview of the group and lets the user control display characteristics of the entire track set on one page. For example, the Yale ChIP-chip super-track control provides information and access for seven related Yale ChIP-chip annotations.
In the upcoming year, UCSC will continue to extend the GBD to include more species and assembly updates—focusing on primates, model organisms and species of critical importance to evolutionary studies—and more annotation data. We may also provide browser access to several of the low-coverage (2×) assemblies currently included in our Conservation tracks. Following the release of UCSC Genes data for the latest mouse assembly in Fall 2007, updates to both the human and the mouse UCSC Genes annotation will be offered ~3–4 times per year. We will continue to expand our collection of human variation, disease-related, expression and genome-wide association data, and plan to explore the incorporation of federated data into the browser. Among the enhancements to our display and data-mining tools, we plan to facilitate the use of custom tracks in the Table Browser, extend display features such as the next/previous item navigation to a larger range of tracks, and add a user-annotated wiki track.
We would like to thank the many collaborators who have contributed data to our project, our Scientific Advisory Board for their valuable advice and recommendations, and our users for their feedback and support. We would also like to acknowledge the dedicated system administrators who have provided an excellent computing environment: Jorge Garcia, Erich Weiler, Chester Manuel and Victoria Lin. This work was funded by the National Human Genome Research Institute (2 P41 HG002371-06 to UCSC Center for Genomic Science, 3 P41 HG002371-06S1 ENCODE supplement to UCSC Center for Genomic Science) and the National Cancer Institute (Contract No. N01-CO-12400 for Mammalian Gene Collection). T.W. is a Helen Hay Whitney fellow. Funding to pay the Open Access publication charges for this article was provided by the Howard Hughes Medical Institute.
Conflict of interest statement. D.K., R.M.K., R.B., G.P.B., H.C., M.D., R.A.H., A.S.H., F.H., A.P., B.J.R., B.R., K.R.R., K.E.S., A.T., H.T., A.S.Z., D.H. and W.J.K. receive royalties from the sale of UCSC Genome Browser source code licenses to commercial entities.