It is not possible to overstate the impact that genome sequencing and assembly have had on biomedical research. While the release of a new genome assembly once spawned worldwide press releases and announcements (in some cases multiple times), there is now a general expectation that if you are to do serious work on a model organism, a genome assembly is a necessary part of the research plan. These genome assemblies serve as the backbone for whole-genome studies, for comparative genomics and for research labs performing locus-specific work. A critically important aspect of the success of the Human Genome Project (HGP) was the decision to immediately release pre-publication primary sequence data [1]. This policy flew in the face of tradition, especially in the community of those researching aspects of the human genome, which held that genome sequence need only be made available upon publication. Although there was some concern that this would jeopardize the genome centers' ability to analyze and publish the data they had produced, most involved felt that the benefit of early release outweighed the risk of an outside group publishing a genome assembly and analysis before the data producers. Guidelines for both the release and use of these data were published in what are commonly referred to as the Bermuda principles and the Fort Lauderdale agreement [2]. While the Bermuda principles have been incredibly valuable to the research community, they were established more than 10 years ago, and it is time to revisit them as sequencing technologies, standards and expectations are evolving at a rapid pace.
The necessity to revisit these guidelines is underscored by the simultaneous publication of two different assemblies of the cow genome: Btau 4.0 as described by the Bovine Genome Sequencing and Analysis Consortium (BGSAC) [3], and UMD 2.0 as described by Zimin et al. [4]. Both these genome assemblies are based on sequence traces generated by the Baylor College of Medicine as a part of the BGSAC. While the Zimin et al. publication does not violate the Fort Lauderdale agreement, as both genomes are being published simultaneously, the availability of two genome assemblies produced from the same dataset raises a series of questions that will need to be addressed by funding agencies, sequence producers and the user community. How many assemblies are necessary and useful? Who has the right to perform the genome assembly? How should the community select reference assemblies? Are genome centers responsible for assembly updates forever?
Many users may be surprised that the same dataset would produce two different assemblies. However, the process of genome assembly is akin to putting together a 3-billion-piece jigsaw puzzle. Of course, in the genome case many of the pieces look almost identical and there may be multiple correct solutions, depending on the data source. In addition to polymorphisms and alternative haplotypes, other complications include the presence of segmental duplications, defined as regions larger than 1 kb that have greater than 90% sequence identity with another region of the genome [5], and large-scale structural variation, meaning that two chromosomes can differ by millions of base pairs or have regional ordering differences [6]. Even the two most complete and best-studied mammalian genomes, human and mouse, which were produced by clone-based rather than whole-genome strategies, contain regions that remain unassembled or that contain errors [7].
Genome centers put a great deal of effort into producing high-quality sequence data and assemblies for the research community, and they deserve to have the chance to assemble and analyze the data they produce. Although the effort involved in producing a genome assembly has not decreased, it is becoming increasingly difficult to get such work published. There is a danger that the analysis needed for publication in a top-tier journal can significantly delay publication of the genome. Whereas the assembly is typically available before publication, the inability of an outside group to publish a genome-wide analysis of an assembly before its publication can hinder the advancement of science. In other cases, there may be a substantial delay between the production of sequence reads and the production of the genome assembly. It is quite clear that the research community is not well served in these cases. It would be useful for the stakeholders to establish timelines by which such assembly and publication milestones should be reached.
A number of assembly programs are currently available, but none produces a base-perfect assembly with data from current technologies. The shift from clone-based sequencing to whole-genome sequencing and assembly (WGSA) means that the most highly duplicated, lineage-specific regions of the genome are poorly represented in the final assembly [8], but the way these regions are handled will vary with the assembly package. Because of complications like those described above, as well as the incomplete and non-uniform representation of the sequence in whole-genome sequencing datasets, even with a single assembly tool there are typically multiple possible solutions to any given assembly that are each completely consistent with the underlying data. Several projects have taken advantage of the fact that multiple assemblers exist and have produced multiple genome assemblies as a part of the project. For example, during the WGSA phase of the mouse genome project, three rounds of assemblies were performed using two different genome assemblers (Arachne [9] and Phusion [10]). Both these assemblies were made available during the early stages of the project, but one was ultimately chosen for analysis and publication. A similar approach was taken for both the chimpanzee genome project [11] and the rhesus macaque genome project [12]. The availability of multiple algorithms and assemblies during the course of these projects improved the final product immensely. In all these projects the final assembly was made better because the different groups performing the assembly worked with the genome center responsible for the sequence data.
Everyone benefits if multiple assemblies are produced and compared. Statistics such as chromosome length and scaffold N50 (a measure of continuity, defined as the scaffold length such that 50% of the bases in an assembly reside in scaffolds of that length or longer), although poor measures of base-level quality or global assembly correctness, are often taken into account when assessing assemblies. More importantly, comparison of the genome sequence to independently derived sequences, such as transcript collections or regions already finished using clone-based sequencing, has also proved an effective way to assess the quality of an assembly. Recently, additional approaches that look for inconsistencies in the assembled data have been described [13].
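To make the N50 definition concrete, here is a minimal Python sketch. The function name and the scaffold lengths are illustrative assumptions, not taken from any assembly discussed here. It sorts scaffolds from longest to shortest and reports the length at which the running total first covers half of the assembled bases.

```python
def n50(lengths):
    """Return the scaffold N50 for a collection of scaffold lengths.

    N50 is the length L such that scaffolds of length >= L together
    contain at least half of all bases in the assembly.
    """
    if not lengths:
        raise ValueError("at least one scaffold length is required")
    total = sum(lengths)
    running = 0
    # Walk scaffolds from longest to shortest, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Made-up scaffold lengths (in bases), for illustration only.
scaffold_lengths = [100, 80, 60, 40, 20]
print(n50(scaffold_lengths))  # prints 80
```

Because the statistic depends only on scaffold lengths, an assembler that joins scaffolds incorrectly can raise N50 while lowering accuracy, which is why it says little about base-level quality or global correctness.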
But despite the ability to perform many levels of analysis, there are typically no set metrics for determining which assembly should be deemed the reference. As different genomes have different biological characteristics and different levels of funding, it is difficult to establish a one-size-fits-all policy. However, at the beginning of each project it would be useful for all stakeholders to specify whether the analysis of multiple assemblies is desired and to define how any assemblies generated for the project will be measured. The development of a third-party group, perhaps consisting of representatives of the major annotator and browser groups, could assist the centers in the quality assessment stage of the project. Making the data from such assessments widely available, perhaps through the browsers, would help the user community understand both the positive aspects and the limitations of a given assembly. While it is generally advantageous to release a single assembly for a given dataset, there may be instances where it is not possible to determine the one best assembly, and in those cases it is better to release both.
There is an additional issue of assembly updates and improvements. Users performing genome-wide analysis want a single, stable coordinate system, whereas users interested in a specific gene or region want the best possible representation of that region. However, not all genome assemblies are updated after the initial publication. In many cases the centers no longer have funding to work on the projects, but the community continues to rely on the data and in many cases adds new data that could be used to improve the assembly. The resources generated by these large projects are too valuable to be allowed to lie fallow, and we must explore mechanisms that do not burden the genome centers but enable the genome assembly to improve as our understanding of the data and genome increases. These may include continued funding to the center for the project or the transfer of the assembly to a third party for management and updates. This would be useful for the community as well as for the centers initially involved [7].
The notion of having multiple assemblies raises additional questions and underscores the need to develop better tools for tracking, comparing and displaying genome-assembly data. As sequencing costs drop, additional datasets and assemblies will inevitably be produced. This is already the case for humans, for whom three different genome assemblies (the HGP public reference, Celera's, and Venter's) are available. The overhead of analyzing, annotating and displaying genome sequences is considerable but manageable. However, the problems of data display, establishing stable coordinates for exchange and assembly tracking are more substantial.
The first problem is assembly management. Although most assemblies are deposited in the International Nucleotide Sequence Database Consortium (INSDC) databases, commonly referred to as GenBank/EMBL/DDBJ, this is not sufficient for tracking the actual assembly, only the individual sequences associated with it. Currently, most assemblies are tracked by name and date, with no formal detailed notation of individual sequence changes. Tools for formally managing and tracking genome assemblies are currently in development, but they will only be the first step toward the suite of tools that need to be developed for managing assemblies. There have been three updates to the human genome since the publication describing the 'finished' genome [14], and simply specifying that a feature is on human chromosome 1 at 10,000 base pairs is not sufficient to uniquely identify that base.
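One way to picture the coordinate problem is to treat the assembly version as part of the coordinate itself. The sketch below is only an illustration: the `Position` class and the labels `build_A` and `build_B` are hypothetical placeholders, not real assembly names or an existing tool's data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Position:
    """A genomic position that is only meaningful within one assembly."""
    assembly: str  # assembly name or version (hypothetical labels below)
    sequence: str  # chromosome or scaffold name
    offset: int    # 1-based position along that sequence


# The same chromosome name and offset in two different assembly versions.
a = Position("build_A", "chr1", 10_000)
b = Position("build_B", "chr1", 10_000)

# Without the assembly field these would look identical; with it they
# are distinct records that cannot be silently confused.
print(a == b)  # prints False
```

Carrying the assembly label with every position does not translate coordinates between versions, but it prevents positions from different updates from being treated as the same base.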
In addition to improved tools for tracking and managing assembly data, additional tools for comparing and displaying multiple assemblies need to be developed. Currently, Ensembl and the University of California Santa Cruz genome browser can only annotate and display a single current assembly within a given view, although archival versions of the reference assemblies are available. The National Center for Biotechnology Information has long supported the ability to annotate and display multiple assemblies for a given organism, but the book-keeping and user interface need improvement. Tools based on aligning assemblies and displaying comparative annotation are necessary to help most users navigate these data. In addition, tools for rapidly identifying assembly differences will be critical for homing in on regions that should be judged skeptically and may need manual intervention for improvement.
The sequencing of the human genome did not mark the end of sequencing, but merely the beginning. Sequence data are now easier to produce, but decisions about timelines for data release, publication and ownership, standards for assembly comparison and quality assessment, and the tools for managing and displaying these data all need considerable attention in order to best serve the entire community.