Two major problems faced the GRC at the outset of this project: the decentralized
nature of the Human Genome Project and the lack of a suitable data model for
representing complex genomes. Much of the data underlying curation decisions had been
neither captured nor standardized. The human reference assembly had never been
submitted to the International Nucleotide Sequence Database Collaboration (INSDC) and
therefore lacked stable, trackable sequence identifiers that could be accessed from any INSDC
database.
Initial efforts at assembling the human genome were guided by the concept of “a
golden path”, a single clone tiling path that could be reduced to one
non-redundant haploid representation of the human genome. While this model fit well
with the prediction that single nucleotide variants (SNVs) would be the predominant
source of variation in the population, it is now clear that structural variation is
a much larger source of genomic diversity than previously recognized. In addition,
this model did not deal robustly with sequences that were not part of chromosome
assemblies. These often represent sequences that cannot be easily ordered or
oriented on the chromosome assembly due to structural complexity but frequently
contain genes that may be of biological interest or represent alternate
haplotypes of regions in the chromosome assembly. Earlier versions of the
reference genome assembly included some of these allelic variants (such as at the
MHC region) but the sequences themselves often were not used because they had no
relation to the chromosome sequence and could not be easily distinguished from
sequences reflecting biological or artificial duplication.
The GRC has addressed these problems by establishing common tools and standard
operating procedures (SOPs) so that the genome assembly is now constructed in a
regularized fashion. We have developed a single database to store all data
underlying the genome assembly. Finally, we have developed a system to track
individual regions that are under review. All of these data are made publicly
available through our Web site (http://genomereference.org/).
Additionally, the GRC has formalized an assembly model (see the assembly
representation figure and Box 1) that provides for improved accounting
for all sequences, including those that are not part of chromosome assemblies, and
facilitates genome annotation by placing additional structure on those sequences.
Structurally complex regions can be represented by more than one tiling path: one
path is integrated into the chromosome assembly, while each of the others is
instantiated as an independent sequence that, by alignment to the chromosome,
provides the chromosome context for the alternate allele.
Figure. Assembly representation for GRCh37.p3.
Box 1. Assembly Definitions
AGP: A file used to describe the instructions for building a contig,
scaffold, or chromosome sequence. This file specifies the order, orientation,
and switch points for each component.
Alternate Locus: A sequence that provides an alternate
representation of a locus found in a largely haploid assembly. These sequences
don’t represent a complete chromosome sequence, although there is no hard
limit on the size of the alternate locus; currently these are less than 5
Assembly: A set of sequences (chromosomes, unlocalized, unplaced,
and alternate loci) used to represent an organism’s genome.
Assembly Unit: Collections of sequences used to define discrete
parts of an assembly.
Component: The basic genomic level sequence used to construct the
genome; typically these are clone sequences, Whole Genome Shotgun sequences, or
PCR fragments. These sequences must be submitted to GenBank/EMBL/DDBJ.
Contig: A contiguous sequence generated from determining the
non-redundant path along an ordered set of component sequences. A contig should
contain no gaps.
Patch: A genome patch is a scaffold sequence that is part of a minor
genome release. These sequences either correct errors in the assembly (a FIX
patch) or add additional alternate loci (a NOVEL patch). These sequences allow
us to update the assembly information without disrupting the chromosome
coordinate system. FIX patches will be removed at the next major assembly
release, as the changes will be rolled into the new assembly. NOVEL patches will
be moved from the PATCHES assembly unit to a proper assembly unit.
Primary Assembly Unit: Represents the collection of sequences that,
when combined, represent a non-redundant haploid genome.
Scaffold: An ordered and oriented set of contigs. A scaffold will
contain gaps, but there is typically some evidence to support the contig order,
orientation, and gap size estimates.
TPF: Tiling Path File; this provides the order of the component
sequences that are used to build a higher-order sequence (contig, scaffold, or
chromosome).
Switch Point: The base at which the contig sequence stops being
generated from one component sequence and switches to using the next component
sequence. There must be at least one switch point between adjacent component
sequences in a contig.
Unlocalized sequence: A sequence found in an assembly that is
associated with a specific chromosome, but that cannot be ordered or oriented on
that chromosome.
Unplaced sequence: A sequence found in an assembly that is not
associated with any chromosome.
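The component, contig, and switch point definitions in Box 1 can be illustrated with a minimal Python sketch. It is not a parser for the real AGP format: the `ComponentSpan` record, the clone names, and the toy sequences are all hypothetical, and coordinates are assumed to be 1-based and inclusive. Each span says which part of a component contributes to the contig, so the `end` of one span acts as the switch point to the next component.

```python
from dataclasses import dataclass

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


@dataclass(frozen=True)
class ComponentSpan:
    """One simplified, AGP-like instruction (hypothetical record layout)."""
    component_id: str  # invented label standing in for an INSDC accession
    start: int         # 1-based, inclusive, on the component
    end: int           # 1-based, inclusive; the switch point to the next span
    orientation: str   # "+" or "-"


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_contig(spans, sequences):
    """Concatenate the selected part of each component in tiling-path order."""
    parts = []
    for span in spans:
        seq = sequences[span.component_id]
        if not (1 <= span.start <= span.end <= len(seq)):
            raise ValueError(f"span out of range for {span.component_id}")
        piece = seq[span.start - 1:span.end]
        if span.orientation == "-":
            piece = reverse_complement(piece)
        elif span.orientation != "+":
            raise ValueError(f"bad orientation for {span.component_id}")
        parts.append(piece)
    return "".join(parts)


# Two toy clones that overlap by four bases ("GTAA").
sequences = {"cloneA": "ACGTACGTAA", "cloneB": "GTAACCGGTT"}
spans = [
    ComponentSpan("cloneA", 1, 8, "+"),   # use cloneA up to base 8
    ComponentSpan("cloneB", 3, 10, "+"),  # switch to cloneB at its base 3
]
contig = build_contig(spans, sequences)
print(contig)
```

Because the switch point falls inside the overlap, the contig contains each base once and has no gaps, which matches the contig definition above.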
We have also introduced the concept of a “minor” assembly update, in the
form of genome patches. This mechanism provides users with timely access to genome
improvements without inducing frequent changes to the coordinate system upon which
assembly annotations are based. Because genome patches take the same form as
alternate loci, the two forms of data can be similarly managed.
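The fate of the two patch types at a major release, as described in Box 1, can be sketched as follows. This is an illustration under stated assumptions, not the GRC's implementation: the `Patch` class, the patch names, and the `ALT_LOCI` unit label are hypothetical, and only the FIX/NOVEL distinction and the PATCHES assembly unit come from the text.

```python
from dataclasses import dataclass, replace
from enum import Enum


class PatchType(Enum):
    FIX = "FIX"      # corrects an error in the assembly
    NOVEL = "NOVEL"  # adds a new alternate locus


@dataclass(frozen=True)
class Patch:
    name: str                       # invented label, not a real accession
    patch_type: PatchType
    assembly_unit: str = "PATCHES"  # minor releases keep patches in this unit


def apply_major_release(patches, target_unit="ALT_LOCI"):
    """Return the patch scaffolds that persist after a major release.

    FIX patches are dropped because their corrections are rolled into the
    new chromosome sequence; NOVEL patches are moved out of PATCHES into
    `target_unit` (a hypothetical name for a proper assembly unit).
    """
    carried = []
    for patch in patches:
        if patch.patch_type is PatchType.NOVEL:
            carried.append(replace(patch, assembly_unit=target_unit))
    return carried


minor_release = [
    Patch("patch_fix_1", PatchType.FIX),
    Patch("patch_novel_1", PatchType.NOVEL),
    Patch("patch_fix_2", PatchType.FIX),
]
after_major = apply_major_release(minor_release)
print([(p.name, p.assembly_unit) for p in after_major])
```

Treating both patch types as scaffolds with a type tag reflects the point made above: because patches take the same form as alternate loci, one code path can manage both.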
Major assembly updates will not follow a fixed schedule. To minimize the need for
frequent re-annotation, they will occur infrequently, once we have produced at
least 100 FIX patches or have affected more than 1% of the euchromatic sequence.
The GRC will announce planned updates on its Web site at least 6 months in
advance of any major assembly release.
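The release criteria stated above amount to a simple either/or test, sketched here in Python. The function name, parameter names, and the way affected bases are counted are assumptions made for illustration; only the two thresholds (at least 100 FIX patches, or more than 1% of the euchromatic sequence affected) come from the text.

```python
def major_release_due(fix_patch_count, affected_bases, euchromatic_bases,
                      min_fix_patches=100, max_affected_fraction=0.01):
    """Return True if either stated criterion for a major release is met."""
    if euchromatic_bases <= 0:
        raise ValueError("euchromatic_bases must be positive")
    affected_fraction = affected_bases / euchromatic_bases
    return (fix_patch_count >= min_fix_patches
            or affected_fraction > max_affected_fraction)
```

For example, `major_release_due(100, 0, 1000)` is true on patch count alone, whereas `major_release_due(99, 10, 1000)` is false because exactly 1% affected does not exceed the threshold.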
Additional, detailed information regarding major releases will be publicly announced
via the Web site as data freeze dates approach. Minor assembly updates will be made