We describe the sequencing, de novo assembly, and preliminary analysis of an individual diploid human genome. In the course of our study, we have developed an experimental framework that can serve as a model for the emerging field of en masse personalized genomics [82
]. The components of our strategy involve: (i) sample consent and assessment, (ii) genome sequencing, (iii) genome assembly, (iv) comparative (one-to-one) mapping, (v) DNA variation detection and filtering, (vi) haplotype assembly, and (vii) annotation and interpretation of the data. We were able to construct a genome-wide representation of all DNA variants and haplotype blocks in the context of gene annotations and repeat structure identified in the HuRef donor. This provides a unique glimpse into the diploid genome of an individual human (Poster S1).
The most significant technical challenge has been to develop an assembly process (points ii–v) that faithfully maintains the integrity of the allelic contribution from an underlying set of reads originating from a diploid DNA source. As far as we know, the approach we developed is unique and is central to the identification of the large number of indels less than 400 bp in length. We also attempted de novo recruitment of sequence reads onto the NCBI human reference genome, using mate pairing and clone insert size to guide the accurate placement of reads [83
]. Although this approach can produce useful results, it does limit variant detection to completed regions of the reference genome and, like genome assembly, can be confounded by segmentally duplicated regions.
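To make the placement criteria concrete, the sketch below checks a candidate placement of a read pair against the expected clone geometry. It is a minimal illustration under stated assumptions: the class, function, and thresholds are hypothetical, not the pipeline actually used.

```python
# Minimal sketch of mate-pair-guided read placement (hypothetical names
# and thresholds; illustrative only, not the actual HuRef pipeline).

from dataclasses import dataclass

@dataclass
class Alignment:
    chrom: str
    start: int       # 0-based leftmost position on the reference
    end: int         # exclusive end position
    is_reverse: bool

def placement_is_consistent(mate1: Alignment, mate2: Alignment,
                            insert_mean: float, insert_sd: float,
                            max_sd: float = 3.0) -> bool:
    """Accept a candidate placement of a read pair only if the mates land
    on the same chromosome, face each other (forward/reverse), and imply
    an insert size within max_sd standard deviations of the clone library
    mean -- the kind of criteria used to disambiguate placements in
    repetitive regions."""
    if mate1.chrom != mate2.chrom:
        return False
    left, right = sorted((mate1, mate2), key=lambda a: a.start)
    # Paired-end clones are sequenced toward each other: the leftmost
    # mate must be on the forward strand, the rightmost on the reverse.
    if left.is_reverse or not right.is_reverse:
        return False
    implied_insert = right.end - left.start
    return abs(implied_insert - insert_mean) <= max_sd * insert_sd

# Example: a ~2 kb clone library.
lib = dict(insert_mean=2000.0, insert_sd=150.0)
a = Alignment("chr7", 10_000, 10_800, is_reverse=False)
b = Alignment("chr7", 11_350, 12_100, is_reverse=True)
print(placement_is_consistent(a, b, **lib))  # True: implied insert ~2,100 bp
```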
The genome assembly approach with allelic separation allows the detection of heterozygous variants present in the individual genome with no further comparison. The one-to-one mapping of our HuRef assembly against a nearly completed reference genome permits the detection of the remaining variants. These variants arise from sequence differences found both within and outside the mapped regions, with the precision of the compared regions provided by the genome-to-genome comparison [59
]. Providing a high-confidence set of DNA variants is challenging, because more than half of the variants are a single base in length but include both SNPs and indels. A filtering approach was used that accounts for the positional error profile in a Sanger-sequenced electropherogram in relation to the called variant. Additional filters imposed minimum requirements for read coverage and for the proportional representation of each allele (sketched below). The filtering approaches were empirical and drew on the large body of previously described human variation data (dbSNP). Paired-end random shotgun reads, together with the variant set defined on those reads via the assembly, enabled the construction of long-range haplotypes. The haplotypes are remarkably well constructed given that the density of the variant map is comparable to those used in other studies [35
], reflecting the utility of the underlying sequence reads beyond genome assembly alone. To understand how an individual genome translates into an individual transcriptome and ultimately a functional proteome, it is important to define the segregation of variants between the two chromosomal copies.
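The coverage and allele-balance filters mentioned above can be sketched as follows; the function name and thresholds are hypothetical placeholders, not the empirically calibrated values used for HuRef.

```python
# Minimal sketch of empirical variant filtering (thresholds are
# hypothetical, not the values calibrated for HuRef).

def passes_filters(depth: int, allele_counts: dict[str, int],
                   min_depth: int = 4,
                   min_allele_fraction: float = 0.25) -> bool:
    """Require minimum read coverage and, for a heterozygous call,
    proportional representation of each observed allele."""
    if depth < min_depth:
        return False
    # Each allele supporting the call must appear in a reasonable
    # fraction of the covering reads.
    return all(count / depth >= min_allele_fraction
               for count in allele_counts.values())

# Example: a putative heterozygote covered by 10 reads.
print(passes_filters(10, {"A": 6, "G": 4}))  # True: both alleles well supported
print(passes_filters(10, {"A": 9, "G": 1}))  # False: allele imbalance
```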
While several new approaches for DNA sequencing are available or being developed [84
], we chose to use proven Sanger sequencing technology for this HuRef project. The choice was obviously motivated in part by historical reasons [7
], but not solely. We attached high importance to generating a de novo assembly, which meant maximizing coverage and sensitivity for detecting variation. We further anticipated that long read lengths (in excess of 800 nucleotides), compatibility with paired-end shotgun clone sequencing, and well-developed parameters for assessing sequencing accuracy would be required. High sequence accuracy is essential to avoid calling large numbers of false-positive variants on a genome-wide scale. Long paired-end reads are especially useful for achieving the best possible assembly characteristics in whole-genome shotgun sequencing and for providing sufficient linkage of variants to determine large haplotypes.
We have been able to categorize a significant amount of DNA variation in the genome of a single human. Of great interest is the fact that 44% of annotated genes have at least one, and often more, alterations within them. The vast majority of the 4.1 million variants detected in the HuRef donor, 3,213,401 events (78%), are SNPs. However, the remaining 22% of non-SNP variants account for the bulk of the variant bases in the donor, about 9 Mb or 74%. Using microarray-based methods, we also detected another 62 copy number variable regions in HuRef, estimated to add some 10 Mb of additional heterogeneity. Given these potential sources of measured DNA variation, we can, for the first time, make a conservative estimate that a minimum of 0.5% variation exists between two haploid genomes, namely those that make up the diploid DNA of the HuRef assembly (all heterozygous bases, i.e., SNPs, multi-nucleotide polymorphisms [MNPs], indels, complex variants, and putative alternate alleles, plus CNV bases, divided by the genome size: [2,894,929 + 939,799 + 10,000,000]/2,809,547,336; the arithmetic is worked through below). We also note that there will be significantly more DNA variation discovered in heterochromatic regions of the genome [87
], which largely escaped our analysis in this study.
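For clarity, the heterozygosity arithmetic above can be written out explicitly; the numerator terms are the heterozygous base counts and the estimated CNV contribution as given in the text, and the sum and quotient are straightforward to verify:

```latex
\frac{2{,}894{,}929 \;+\; 939{,}799 \;+\; 10{,}000{,}000}{2{,}809{,}547{,}336}
  \;=\; \frac{13{,}834{,}728}{2{,}809{,}547{,}336}
  \;\approx\; 0.0049 \;\approx\; 0.5\%
```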
We had mixed success when attempting to find support for the experimentally determined CNVs in the HuRef assembly itself or in the data from which it was derived. More than 50% of the CNVs overlapped segmental duplications, and these regions are underrepresented in HuRef, which complicated the analysis. We attempted to map the sequence reads onto the NCBI human genome and then identify CNVs by detecting regions with significant changes in read depth. However, we found substantial local fluctuations in read depth across the genome, limiting the utility of this comparison and suggesting that higher read coverage may be required to use this approach effectively.
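As an illustration of the read-depth idea, here is a minimal sketch under simplifying assumptions: fixed windows, a hypothetical fold-change cutoff, and none of the normalization a real analysis would need.

```python
# Minimal sketch of read-depth CNV detection (window size and cutoffs
# are hypothetical; local depth fluctuations like those we observed
# would make real calls far noisier than this toy example suggests).

from statistics import median

def flag_cnv_windows(depths: list[int],
                     gain: float = 1.5, loss: float = 0.5) -> list[int]:
    """Flag windows whose read depth, relative to the chromosome-wide
    median, suggests a copy-number gain (ratio > gain) or loss
    (ratio < loss)."""
    baseline = median(depths)
    return [i for i, d in enumerate(depths)
            if d / baseline > gain or d / baseline < loss]

# Example: mostly ~30x coverage with one apparent duplication (~60x).
depths = [29, 31, 30, 28, 32, 61, 63, 60, 30, 29]
print(flag_cnv_windows(depths))  # [5, 6, 7]: the high-depth windows
```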
As we have emphasized throughout, a major distinguishing feature of the genome assembly we have described is our approach of maintaining, wherever possible, the diploid nature of the genome. This is in contrast to both the NCBI and WGSA genomes, which are each consensus sequences and, therefore, a mosaic of haplotypes that does not accurately display the relationships of variants on either of the autosomal pairs. For BAC-based genome assemblies such as the NCBI genome assembly, the mosaic fragments are generally of genomic clone size (e.g., cosmid, PAC, BAC), with each clone providing contiguous sequence for only one of the two haplotypes at any given locus. Moreover, there are substantial differences in the clone composition of different chromosomes due to the historical and hierarchical mapping and sequencing strategies used to generate the NCBI reference assemblies [7].
In contrast, for WGSA, the reads that underlie most of the consensus sequence are derived from both haplotypes. This can result in very short-range mosaicism, where the consensus of clustered allelic differences does not actually exist in any of the underlying reads. To address this issue, the Celera assembler was modified to consider all variable bases within a given window and to group the sequence forms supporting each allele before incorporation into a consensus sequence (see Materials and Methods
). In our experience, this reduces the incidence of local mosaicism, although, between windows, the consensus sequence remains a composite of haplotypes. Efforts to build haplotypes from the genome assembly (Haplotype Assembly) will likely lead to future modification of the assembler, allowing it to output longer consensus sequences for both haplotypes at many loci. Clearly, a single consensus sequence for a diploid genome, whether derived from BACs or WGS, has limitations for describing allelic variants (and specific combinations of variants) within the genome of an individual.
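The contrast between a naive per-column consensus and window-based allele grouping can be sketched as follows; this is illustrative only, and the modified Celera assembler's actual implementation differs.

```python
# Illustrative contrast: per-column consensus vs. allele-aware
# window consensus ('-' marks a column a read does not cover).

from collections import Counter

def column_majority(reads: list[str]) -> str:
    """Naive per-column consensus. Because each column is decided
    independently, it can stitch together alleles from different
    haplotypes into a sequence present in no read."""
    cols = zip(*reads)
    return "".join(Counter(b for b in col if b != "-").most_common(1)[0][0]
                   for col in cols)

def window_consensus(reads: list[str]) -> str:
    """Allele-aware consensus: group the reads that span the whole
    variant window by their full sequence and report the most common
    form, so the output is always an allele actually observed."""
    spanning = [r for r in reads if "-" not in r]
    return Counter(spanning).most_common(1)[0][0]

# Two nearby heterozygous sites; the true haplotypes are "AC" and "GT".
reads = ["AC", "GT", "A-", "A-", "-T", "-T"]
print(column_majority(reads))   # "AT" -- a mosaic found in no read
print(window_consensus(reads))  # "AC" or "GT" -- a real allele
```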
Partial haplotypes can be inferred for an individual from laboratory genotype data (e.g., from SNP microarrays) in conjunction with population data or genotypes of family members. However, at least in the absence of sets of related individuals (e.g., family trios), it is difficult to determine haplotypes from genotype data across regions of low LD. We have shown that a paired-end sequencing strategy can provide highly accurate haplotype reconstruction that does not share these limitations. The assembled haplotypes are substantially larger than the blocks of SNPs in strong LD within the various populations investigated by the HapMap project. In addition to being larger, haplotypes inferred with our approach can link variants even where LD in a population is weak, and they are not restricted to those variants that have been studied in large population samples (e.g., HapMap variants). We note that, in addition to the implications for human genetics, this approach could be applied to separating haplotypes of any organism of interest, without the requirement for a previous reference genome, family data, or population data, so long as polymorphism rates are high enough for an acceptable fraction of reads or mate pairs to link variants.
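To show how mate pairs extend phasing across variants that no single read spans, here is a minimal sketch of haplotype-block construction via fragment links. The function and data layout are illustrative and handle only block connectivity, not the allele assignment within each block; the heuristics actually used are more involved.

```python
# Minimal sketch of read-backed haplotype blocks (illustrative only).

class DisjointSet:
    """Union-find over variant sites: two heterozygous variants fall in
    the same haplotype block if some chain of reads or mate pairs
    links them."""
    def __init__(self):
        self.parent: dict[int, int] = {}

    def find(self, x: int) -> int:
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> None:
        self.parent[self.find(a)] = self.find(b)

def haplotype_blocks(fragments: list[list[int]]) -> dict[int, list[int]]:
    """Each fragment (a read or mate pair) lists the heterozygous
    variant sites it covers; variants sharing a fragment join a block."""
    ds = DisjointSet()
    for frag in fragments:
        ds.find(frag[0])  # register the site even if the fragment is alone
        for site in frag[1:]:
            ds.union(frag[0], site)
    blocks: dict[int, list[int]] = {}
    for site in ds.parent:
        blocks.setdefault(ds.find(site), []).append(site)
    return blocks

# Mate pairs let distant variants (1 and 9) share a block via variant 5.
print(haplotype_blocks([[1, 5], [5, 9], [12, 14]]))
# {9: [1, 5, 9], 14: [12, 14]} -- two haplotype blocks
```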
There are several avenues for extending our inference of haplotypes. As noted, although the naive heuristics used here give highly useful results, other approaches may give even more accurate results, as we have observed with an MCMC algorithm. There are various natural measures of confidence that can be applied to the phasing of two or more variants, including the minimum number of clones that would have to be ignored to unlink two variants, or a measure of the degrees of separation between two variants. The analysis presented here provides phasing only for sites deemed heterozygous, but data from apparently homozygous sites can be phased as well, so that we can tell with confidence whether a given site is truly homozygous (i.e., the same allele is present in both haplotypes) or whether the allele on one or even both haplotypes cannot be determined, as occurs as much as 20% of the time with the current dataset. Lastly, it should be possible to combine our approach with typical genotype phasing approaches to infer even larger haplotypes.
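One of these confidence measures, the degrees of separation between two variants, can be sketched directly on the same fragment-link graph as above (a hypothetical helper, not the paper's implementation; the minimum-clones measure would correspond to a minimum edge cut in this graph, with one edge per linking clone):

```python
# Minimal sketch of a phasing-confidence measure: the number of linking
# hops between two variants (hypothetical helper, illustrative only).

from collections import deque

def degrees_of_separation(links: dict[int, set[int]],
                          a: int, b: int) -> int | None:
    """BFS over the variant graph, where an edge joins two heterozygous
    sites covered by the same read or mate pair. Fewer hops (and more
    parallel paths) mean a phasing less sensitive to any single clone."""
    seen, frontier = {a}, deque([(a, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if node == b:
            return dist
        for nxt in links.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return None  # unlinked: the pair cannot be phased relative to each other

links = {1: {5}, 5: {1, 9}, 9: {5}}
print(degrees_of_separation(links, 1, 9))  # 2: phased indirectly via variant 5
```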
Our project developed over a 10-year period, and the decisions regarding sample selection, techniques used, and methods of analysis were critical to the current and continued success of the project. We anticipated that, beyond mere curiosity, there would be very pragmatic reasons to use a donor sample from a known, consented individual. First and foremost, as we show in a preliminary analysis, genome-based correlations to phenotype can be performed. Given the still rudimentary state of genotype-phenotype databases, it can be argued that, at the present time, DNA sequence comparisons do not reveal much more information than a proper family history. Even when a disease, predisposition, or phenotypically relevant allele is found, further familial sampling will usually be required to determine its relevance. Eventually, however, populations of genomes will be sequenced, and at some point a critical mass will dramatically change the value of any individual initiative, providing the potential for proactive rather than reactive personal health care. In a simple analogy, absent a family history, genealogical studies can now be quite accurate in reconstructing ancestral history based purely on marker-frequency comparisons to databases. Here, with a near-unlimited amount of variation data available from the HuRef assembly, we can reconstruct the chromosome Y ethno-genogeographic lineage, which is not only consistent with, but better defines, the self-reported family tree data (unpublished data).
(Figure: Chromosome Y ethno-genogeographic lineage.)
There are always issues regarding the generation and study of genetic data, and these may be amplified as we move from what are now primarily gene-centric studies to the new era in which genome sequences become a standard form of personal information. For example, there are often concerns that individuals should not be informed of their predisposition (or fate) if there is nothing they can do about it. It is possible, however, that many of the concerns over predictive medical information will fall by the wayside as more prevention strategies, treatment options, and indeed cures become realistic. Indeed, we believe that as more individuals put their genomic profiles into the public realm, effective research will be facilitated, and strategies to mitigate the untoward effects of certain genes will emerge. The cycle, in fact, should become self-propelling, and the reasons to know will soon outweigh the reasons to remain uninformed.
Ultimately, as more entire genome sequences and their associated personal characteristics become available, they will facilitate a new era of research into the basis of individuality. These datasets, composed of many genomes, will enable a better understanding of the complex interactions among genes, and between genes and their host's personal environment. Eventually, there may be true insight into the relationships between nature and nurture, and the individual will then benefit from the contributions of the community as a whole.