We and others recently demonstrated methods for experimentally determining haplotypes for both rare and common variation at a genome-wide scale (5). In the current study, we set out to integrate the haplotype-resolved genome sequencing of a mother, the shotgun genome sequencing of a father, and the deep sequencing of cell-free DNA in maternal plasma to non-invasively predict the whole genome sequence of a fetus. Although results associated with two such mother-father-child trios are described (“I1”, a first trio at 18.5 wks gestation; “G1”, a second trio at 8.2 wks gestation), we focus here primarily on the trio for which considerably more sequence data of each type was generated (“I1”).
Fig. 1 Experimental approach. (A) Schematic of sequenced individuals in a family trio. Maternal plasma sequences were ~13% fetal-derived based on read depth at chrY and alleles specific to each parent. (B) Inheritance of maternally heterozygous alleles inferred ...
Individuals sequenced, type of starting material, and final fold-coverage of the reference genome after discarding PCR or optical duplicate reads. GA, gestational age.
In brief, the haplotype-resolved genome sequence of the mother (“I1-M”) was determined by first performing shotgun sequencing of maternal genomic DNA from blood to 32-fold coverage (coverage = median-fold coverage of mapping reads to the reference genome after discarding duplicates). Next, by sequencing complex haploid subsets of maternal genomic DNA while preserving long-range contiguity (5), we directly phased 91.4% of 1.9 × 10⁶ heterozygous SNPs into long haplotype blocks (N50 of 326 kilobases (kbp)). The shotgun genome sequence of the father (“I1-P”) was determined by sequencing of paternal genomic DNA to 39-fold coverage, yielding 1.8 × 10⁶ heterozygous SNPs. However, paternal haplotypes could not be assessed because only relatively low molecular weight DNA obtained from saliva was available. Shotgun DNA sequencing libraries were also constructed from 5 mL of maternal plasma (obtained at 18.5 wks gestation), and this composite of maternal and fetal genomes was sequenced to 78-fold non-duplicate coverage. The fetus was male, and fetal content in these libraries was estimated at 13%. To properly assess the accuracy of our methods for determining the fetal genome solely from samples obtained non-invasively at 18.5 wks gestation, we also performed shotgun genome sequencing of the child (“I1-C”) to 40-fold coverage via cord blood DNA obtained after birth.
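To illustrate how a fetal fraction of this kind can be derived from parent-specific alleles, the sketch below uses one common approach. This is an assumption for illustration, not necessarily the exact procedure used in this study. At sites where the parents are homozygous for different alleles, the fetus is an obligate heterozygote, so the paternal allele should account for half of the fetal-derived reads. The function name and the read counts are hypothetical.

```python
def fetal_fraction_from_paternal_alleles(paternal_reads, total_reads):
    """Estimate fetal fraction from pooled read counts at sites where the
    parents are homozygous for different alleles (fetus is heterozygous).

    Only fetal DNA carries the paternal allele, and only on one of its two
    chromosomes, so the expected plasma frequency of that allele is f / 2.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 2.0 * paternal_reads / total_reads


# Hypothetical pooled counts, chosen only to illustrate the calculation.
estimate = fetal_fraction_from_paternal_alleles(paternal_reads=6500, total_reads=100000)
print(f"estimated fetal fraction: {estimate:.3f}")
```

For a male fetus, read depth on chrY relative to the autosomes offers an independent check on the same quantity.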
Our analysis comprised four parts: (1) predicting the subset of ‘maternal-only’ heterozygous variants (homozygous in the father) transmitted to the fetus; (2) predicting the subset of ‘paternal-only’ heterozygous variants (homozygous in the mother) transmitted to the fetus; (3) predicting transmission at sites heterozygous in both parents; (4) predicting sites of de novo mutation, that is, variants occurring only in the genome of the fetus. Allelic imbalance in maternal plasma, manifesting across experimentally determined maternal haplotype blocks, was used to predict their maternal transmission. The observation (or lack thereof) of paternal alleles in shotgun libraries derived from maternal plasma was used to predict paternal transmission. Finally, a strict analysis of alleles rarely observed in maternal plasma, but never in maternal or paternal genomic DNA, enabled the genome-wide identification of candidate de novo mutations. Fetal genotypes are trivially predicted at sites where the parents are both homozygous (for the same or different allele).
We first sought to predict transmission at ‘maternal-only’ heterozygous sites. Given the fetal-derived proportion of ~13% in cell-free DNA, the maternal-specific allele is expected in 50% of reads aligned to such a site if it is transmitted, versus 43.5% if the allele shared with the father is transmitted. However, even with 78-fold coverage of the maternal plasma “genome”, the variability of sampling is such that site-by-site prediction results in only 64.4% accuracy. We therefore examined allelic imbalance across blocks of maternally heterozygous sites defined by haplotype-resolved genome sequencing of the mother. As anticipated given the haplotype assembly N50 of 326 kbp, the vast majority of experimentally defined maternal haplotype blocks were wholly transmitted, with partial inheritance in a small minority of blocks (0.6%, n = 72) corresponding to switch errors from haplotype assembly and to sites of recombination. We developed a hidden Markov model (HMM) to identify likely switch sites and thus more accurately infer the inherited alleles at maternally heterozygous sites (SOM Materials and Methods). Using this model, accuracy of the inferred inherited alleles at 1.1 × 10⁶ phased, ‘maternal-only’ heterozygous sites increased from 98.6% to 99.3%. Remaining errors were concentrated among the shortest maternal haplotype blocks (fig. S1), which provide less power to detect allelic imbalance in plasma data as compared with long blocks. Among the top 95% of sites ranked by haplotype block length, prediction accuracy rose to 99.7%, suggesting that remaining inaccuracies can be mitigated by improvements in haplotyping.
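The expected allele fractions quoted above, and the benefit of pooling evidence across a haplotype block, can be illustrated with a short sketch. It assumes independent binomial sampling of reads at each site and no sequencing error. The function names and the example read counts are hypothetical, and the block-level likelihood ratio shown is a simplified stand-in for the model described in the SOM.

```python
import math


def expected_maternal_specific_fraction(fetal_fraction, transmitted):
    """Expected plasma frequency of the maternal-specific allele at a
    'maternal-only' heterozygous site (father homozygous for the shared allele)."""
    if transmitted:
        # Fetus is heterozygous, like the mother: both alleles at 50%.
        return 0.5
    # Fetus is homozygous for the shared allele, so only maternal DNA
    # (fraction 1 - f) carries the maternal-specific allele, on half its copies.
    return (1.0 - fetal_fraction) / 2.0


def block_log_likelihood_ratio(sites, fetal_fraction):
    """Log-likelihood ratio for 'haplotype 1 transmitted' versus
    'haplotype 2 transmitted', summed over the sites of one block.

    sites: iterable of (h1_reads, total_reads, h1_is_maternal_specific),
    where h1_reads counts plasma reads carrying the haplotype 1 allele.
    """
    llr = 0.0
    for h1_reads, total_reads, h1_is_maternal_specific in sites:
        if h1_is_maternal_specific:
            p_h1 = 0.5
            p_h2 = (1.0 - fetal_fraction) / 2.0
        else:
            # Haplotype 1 carries the shared allele at this site.
            p_h1 = (1.0 + fetal_fraction) / 2.0
            p_h2 = 0.5
        other_reads = total_reads - h1_reads
        llr += h1_reads * math.log(p_h1 / p_h2)
        llr += other_reads * math.log((1.0 - p_h1) / (1.0 - p_h2))
    return llr


print(expected_maternal_specific_fraction(0.13, transmitted=True))
print(expected_maternal_specific_fraction(0.13, transmitted=False))

# Hypothetical three-site block at 78-fold depth.
example_block = [(39, 78, True), (44, 78, False), (40, 78, True)]
llr = block_log_likelihood_ratio(example_block, fetal_fraction=0.13)
print("haplotype 1 transmitted" if llr > 0 else "haplotype 2 transmitted")
```

Because the per-site contributions add, a long block accumulates evidence that no single site can provide, which is consistent with the residual errors concentrating in the shortest blocks.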
Fig. 2 Accuracy of fetal genotype inference from maternal plasma sequencing. Accuracy is shown for paternal-only heterozygous sites, and for phased maternal-only heterozygous sites, either using maternal phase information (black) or instead predicting inheritance ...
HMM-based predictions correctly identify maternally transmitted alleles across ~1 Mbp on chromosome 10, despite site-to-site variability of allelic representation among maternal plasma sequences (red).
Fig. 4 HMM-based detection of recombination events and haplotype assembly switch errors. A maternal haplotype block of 917 kbp on chromosome 12q is shown, with red points representing the frequency of haplotype A alleles among plasma reads, and the black line ...
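A two-state HMM of the general kind used for switch detection can be sketched as follows. This is a minimal illustration under stated assumptions, not the model specified in the SOM: the hidden state at each site is which maternal haplotype was transmitted, emissions are binomial read counts, and a single fixed per-site switch probability (`switch_prob`, a hypothetical parameter) covers both recombination and assembly switch errors. All names and the example data are hypothetical.

```python
import math


def site_log_emissions(h1_reads, total_reads, h1_is_maternal_specific, fetal_fraction):
    """Binomial log-likelihoods of the read counts under each hidden state
    (state 0: haplotype 1 transmitted, state 1: haplotype 2 transmitted)."""
    if h1_is_maternal_specific:
        p_state0, p_state1 = 0.5, (1.0 - fetal_fraction) / 2.0
    else:
        p_state0, p_state1 = (1.0 + fetal_fraction) / 2.0, 0.5

    def loglik(p):
        return h1_reads * math.log(p) + (total_reads - h1_reads) * math.log(1.0 - p)

    return loglik(p_state0), loglik(p_state1)


def viterbi_transmitted_haplotype(sites, fetal_fraction, switch_prob=1e-3):
    """Most likely sequence of transmitted haplotypes (0 or 1) along a block."""
    if not sites:
        return []
    log_stay = math.log(1.0 - switch_prob)
    log_switch = math.log(switch_prob)
    emissions = [site_log_emissions(*site, fetal_fraction) for site in sites]

    # Uniform prior over the two states at the first site.
    score = [emissions[0][0] + math.log(0.5), emissions[0][1] + math.log(0.5)]
    backpointers = []
    for emission in emissions[1:]:
        new_score, pointers = [], []
        for state in (0, 1):
            stay = score[state] + log_stay
            switch = score[1 - state] + log_switch
            if stay >= switch:
                best, previous = stay, state
            else:
                best, previous = switch, 1 - state
            new_score.append(best + emission[state])
            pointers.append(previous)
        score = new_score
        backpointers.append(pointers)

    # Trace back from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical block: 20 sites consistent with haplotype 1, then 20 with haplotype 2.
example_sites = [(39, 78, True)] * 20 + [(34, 78, True)] * 20
path = viterbi_transmitted_haplotype(example_sites, fetal_fraction=0.13)
print(path)
```

In this toy block the decoded path changes state once, at the boundary between the two runs of sites. In real data such a change would mark either a recombination event or a switch error in the haplotype assembly.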
Table 2 Number of sites and accuracy of fetal genotype inference from maternal plasma sequencing (percentage of transmitted alleles correct out of all predicted) by parental genotype and phasing status. Sites later determined by trio sequencing (including the ...)
We performed simulations to characterize how the accuracy of haplotype-based fetal genotype inference depended upon haplotype block length, maternal plasma sequencing depth, and the fraction of fetal-derived DNA. To mimic the effect of less successful phasing, we split the maternal haplotype blocks into smaller fragments to create a series of assemblies with decreasing contiguity. We then subsampled a range of sequencing depths from the pool of observed alleles in maternal plasma, and predicted the maternally contributed allele at each site as above. The results suggest that inference of the inherited allele is robust to either decreasing sequencing depth of maternal plasma, or to shorter haplotype blocks, but not both. For example, using only 10% of the plasma sequence data (median depth = 8X) in conjunction with full-length haplotype blocks, we successfully predicted inheritance at 94.9% of ‘maternal-only’ heterozygous sites. We achieved nearly identical accuracy (94.8%) at these sites when highly fragmented haplotype blocks (N50 = 50 kbp) were used with the full set of plasma sequences. We next simulated decreased proportions of fetal DNA in the maternal plasma by spiking in additional depth of both maternal alleles at each site and subsampling from these pools, effectively diluting away the signal of allelic imbalance used as a signature of inheritance. Again, we found the accuracy of the model to be robust to either lower fetal DNA concentrations or shorter haplotype blocks, but not both.
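The two manipulations used in these simulations, subsampling of observed reads and dilution of the fetal signal, can be sketched briefly. The code below is a minimal illustration and not the simulation code used in the study. It assumes that each read is kept independently with a fixed probability, and that spiked-in reads are purely maternal. Function names and example values are hypothetical.

```python
import random


def subsample_counts(allele_a_reads, allele_b_reads, keep_fraction, rng):
    """Keep each observed read independently with probability keep_fraction."""
    kept_a = sum(1 for _ in range(allele_a_reads) if rng.random() < keep_fraction)
    kept_b = sum(1 for _ in range(allele_b_reads) if rng.random() < keep_fraction)
    return kept_a, kept_b


def diluted_fetal_fraction(fetal_fraction, depth, extra_maternal_depth):
    """Effective fetal fraction after adding purely maternal reads.

    Fetal-derived reads stay at fetal_fraction * depth while total depth grows.
    """
    return fetal_fraction * depth / (depth + extra_maternal_depth)


rng = random.Random(0)  # fixed seed so that the example is reproducible
print(subsample_counts(39, 39, keep_fraction=0.10, rng=rng))

# Doubling the depth with maternal-only reads halves the fetal fraction.
print(diluted_fetal_fraction(0.13, depth=78, extra_maternal_depth=78))
```

After dilution, the pooled counts can be subsampled back to the original depth so that only the fetal fraction, and not the coverage, differs between conditions.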
Fig. 5 Simulation of effects of reduced coverage, haplotype length, and fetal DNA concentration on fetal genotype inference accuracy, defined as the percentage of sites at which the inherited allele was correctly identified out of all sites where prediction ...
We next sought to predict transmission at ‘paternal-only’ heterozygous sites. At these sites, when the father transmits the shared allele, the paternal-specific allele should be entirely absent among the fetal-derived sequences. If instead the paternal-specific allele is transmitted, it will on average constitute half the fetal-derived reads within the maternal plasma “genome” (~5 reads given 78-fold coverage, assuming 13% fetal content). To assess these, we performed a site-by-site log-odds test; this amounted to taking the observation of one or more reads matching the paternal-specific allele at a given site as evidence of its transmission, and conversely the lack of such observations as evidence of non-transmission. In contrast to maternal-only heterozygous sites, this simple site-by-site model was sufficient to correctly predict inheritance at 1.1 × 10⁶ paternal-only heterozygous sites with 96.8% accuracy. We anticipate that accuracy could likely be improved upon by deeper sequence coverage of the maternal plasma (fig. S2), or alternatively by taking a haplotype-based approach if high molecular weight genomic DNA from the father is available.
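The expectation of roughly five paternal-specific reads, and the reason that presence or absence of such reads is informative, can be checked with a short calculation. This sketch assumes binomial sampling at a fixed depth with no sequencing error, which is an idealization. The function names are hypothetical.

```python
def expected_paternal_specific_reads(depth, fetal_fraction):
    """Expected read count for a transmitted paternal-specific allele, which
    is present on one of the two fetal chromosomes only."""
    return depth * fetal_fraction / 2.0


def prob_allele_observed(depth, fetal_fraction):
    """Probability that at least one read carries a transmitted
    paternal-specific allele, under binomial sampling at a fixed depth."""
    per_read = fetal_fraction / 2.0
    return 1.0 - (1.0 - per_read) ** depth


print(expected_paternal_specific_reads(78, 0.13))  # about 5 reads
for depth in (20, 40, 78):
    print(depth, round(prob_allele_observed(depth, 0.13), 4))
```

Real data depart from this idealization because depth varies from site to site and sequencing errors can mimic a paternal-specific allele. Both effects are reduced by deeper coverage combined with a requirement for more than one supporting read.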
We next considered transmission at sites heterozygous in both parents. We predicted maternal transmission at such shared sites phased using neighboring ‘maternal-only’ heterozygous sites in the same haplotype block. This yielded predictions at 576,242/631,721 (91.2%) of shared heterozygous sites with an estimated accuracy of 98.7%. Although we did not predict paternal transmission at these sites, we anticipate that analogous to the case of maternal transmission, this could be done with high accuracy given paternal haplotypes. We note that shared heterozygous sites primarily correspond to common alleles (fig. S3), which are less likely to contribute to Mendelian disorders in non-consanguineous populations.
De novo mutations in the fetal genome are expected to appear within the maternal plasma sequences as ‘rare alleles’, similar to transmitted paternal-specific alleles. However, the detection of de novo mutations poses a much greater challenge: unlike the 1.8 × 10⁶ paternally heterozygous sites defined by sequencing the father (of which ~50% are transmitted), the search space for de novo sites is effectively the full genome, throughout which there may be only ~60 sites given a prior mutation rate estimate of ~1 × 10⁻⁸. Indeed, whole genome sequencing of the offspring (“I1-C”) revealed only 44 high-confidence point mutations (‘true de novo sites’; Table S1). Taking all positions in the genome at which at least one plasma-derived read had a high-quality mismatch to the reference sequence, and excluding variants present in the parental whole genome sequencing data, we found 2.5 × 10⁷ candidate de novo sites, including 39 of the 44 true de novo sites. At baseline, this corresponds to sensitivity of 88.6% with a signal-to-noise ratio of 1-to-6.4 × 10⁵.
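The baseline figures in this paragraph follow from simple arithmetic, reproduced in the sketch below. The haploid genome size of 3 × 10⁹ bp is an assumption made here to recover the quoted ~60 expected sites. The function and variable names are hypothetical.

```python
def expected_de_novo_sites(mutation_rate_per_bp, haploid_genome_bp=3.0e9):
    """Expected number of de novo point mutations in a diploid genome.

    haploid_genome_bp is an assumed round figure, not taken from the study.
    """
    return mutation_rate_per_bp * 2 * haploid_genome_bp


true_sites = 44          # high-confidence de novo sites from trio sequencing
recovered_sites = 39     # true sites present among the plasma candidates
candidate_sites = 2.5e7  # all baseline candidates

sensitivity = recovered_sites / true_sites
candidates_per_true_site = candidate_sites / recovered_sites

print(f"expected de novo sites: {expected_de_novo_sites(1e-8):.0f}")
print(f"sensitivity: {sensitivity:.1%}")
print(f"signal-to-noise: 1-to-{candidates_per_true_site:.1e}")
```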
We applied a series of increasingly stringent filters (fig. S4) intended to remove sites prone to sequencing or mapping artifacts. Removing alleles also found in at least one read among any other individual sequenced in this study, known polymorphisms from dbSNP (release 135), and sites adjacent to 1–3mer repeats reduced the number of candidate de novo sites to 1.8 × 10⁷. Further requiring at least 2 independent supporting reads, removing sites with excessively many reads supporting the alternate allele (uncorrected P < 0.05, per-site one-sided binomial test using fetal-derived fraction of 13%), and requiring supporting base quality scores summing to at least 105 brought the total number of candidates to 3,884, including 17 true de novo sites. This candidate set is substantially depleted for sites of systematic error, and is instead likely dominated by errors originating during PCR, as even a single round of amplification with a proofreading DNA polymerase with an error rate of 1 × 10⁻⁷ would introduce over 300 candidate sites. This ~2,800-fold improvement in signal-to-noise ratio reduced the candidate set to a size accessible to validation by targeted methodologies [e.g. an order of magnitude fewer than the number of candidate de novo sites requiring validation in a previous study involving pure genomic DNA from parent-child trios within a nuclear family (14)], particularly if only candidate mutations predicted to be pathogenic were validated [e.g. only 33 of the 3,884 candidate sites were predicted to be protein-altering].
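The excess-read filter and the fold improvement in signal-to-noise ratio can be illustrated as follows. This is a hedged sketch rather than the study's implementation. In particular, it assumes that a heterozygous fetal variant is expected at half the fetal fraction, so the binomial success probability is taken as 0.13 / 2. The text states only that the test used the fetal-derived fraction of 13%. Function names and the example read counts are hypothetical.

```python
import math


def binomial_upper_tail(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


def has_excess_alt_reads(alt_reads, depth, fetal_fraction=0.13, alpha=0.05):
    """Flag a site whose alternate-allele read count is improbably high for a
    variant carried only by the fetus (one-sided test, uncorrected)."""
    # Assumption: a heterozygous fetal variant appears at half the fetal fraction.
    p_alt = fetal_fraction / 2.0
    return binomial_upper_tail(alt_reads, depth, p_alt) < alpha


print(has_excess_alt_reads(5, 78))   # near the expected count for a fetal variant
print(has_excess_alt_reads(30, 78))  # far too many reads to be fetal-only

# Fold improvement in signal-to-noise, from the counts given in the text.
baseline_candidates_per_true_site = 2.5e7 / 39
filtered_candidates_per_true_site = 3884 / 17
fold_improvement = baseline_candidates_per_true_site / filtered_candidates_per_true_site
print(f"fold improvement: {fold_improvement:.0f}")
```

A site with far more alternate reads than a fetal-only variant could supply is more plausibly a maternal variant missed by genomic sequencing or a systematic artifact, which is why the filter is one-sided.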