In 2010, Lo and colleagues reported that the entire fetal genome was represented in short cfDNA fragments in the maternal plasma, and suggested that the reconstruction of the inherited complement was technically attainable [13]. We pursued and recently reported a proof-of-concept study demonstrating, for the first time, the noninvasive determination of a fetal genome sequence [14]. We achieved substantial completeness and over 99% accuracy using only a sample of paternal saliva and a single tube of blood collected from the mother at 18.5 weeks gestation. Subsequently, another group achieved comparable accuracy using a similar maternal sample [15]. Although differing in key technical details, both studies inferred the fetal genotypes by first sequencing the maternal genome in order to identify alleles that could be transmitted from mother to fetus, and then analyzing the mother’s cfDNA to determine which alleles she actually transmitted.
A primary technical obstacle to sequencing fetal genomes from maternal plasma is that only a minority of total cfDNA fragments in maternal plasma are shed from the placenta [16] and thus reflect the fetal inherited complement. For instance, the plasma specimens used in our study from two different pregnancies contained 8% and 13% fetoplacental content, which are representative examples given their collection at weeks 8.1 and 18.5, respectively. The remaining cfDNA is derived from maternal cells and is therefore uninformative in this context. Ideally, one might isolate the fetoplacental cfDNA, allowing a direct read-out of the inherited genome. However, despite attempts to separate these two fractions on the basis of size [17] or methylation profile [18], no technology has been developed to date that can do so with satisfactory yield and specificity.
Instead, efforts by our group and others demonstrate that by deeply sampling this mixture of fetal and maternal genetic material and applying statistical modeling, the fetal genotypes can be accurately inferred (Figure 1). This approach relies on the fact that the fetal genome is necessarily a composite of the parental chromosomes. By determining the parental genotypes, we can constrain the possible fetal genotypes on the basis of Mendelian inheritance, discounting, for the time being, the rare chance of a de novo mutation arising in the maternal or paternal germline. To determine the parental genotypes, we performed whole-genome shotgun sequencing (WGS) of the maternal and paternal genomes. This step could be performed at any time before or during pregnancy. In combination with individual and family medical histories, it would establish a set of recessive conditions for which each parent is a carrier.
Figure 1. Overview of noninvasive fetal whole genome sequencing. (a) Sample collection. Parental blood samples are collected in the first or second trimester. After centrifugation, parental DNA is extracted from peripheral blood mononuclear cells (PBMC) or buffy ...
At the vast majority of sites in the genome (>99.9%), both parents are homozygous for the same allele, and the fetal genotype is therefore unambiguous: homozygous for that allele (Figure 2a). At a much smaller proportion of sites (typically fewer than 1 × 10^6, or 0.03% of sites, depending upon genetic ancestry), each parent will again be homozygous, but for different alleles; at these sites, the fetus is an obligate heterozygote. Uncertainty about fetal inheritance arises only at the remaining sites: those at which one or both parents are heterozygous.
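The three classes of sites described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the pipeline used in the cited studies: the two-letter genotype strings and the function names `possible_fetal_genotypes` and `classify_site` are assumptions made for the example, and de novo mutations are ignored.

```python
from itertools import product


def possible_fetal_genotypes(mother, father):
    """Return the fetal genotypes allowed by Mendelian inheritance.

    Genotypes are two-character strings such as "AA" or "AB" (a hypothetical
    encoding). Each parent transmits exactly one of its two alleles.
    """
    return {"".join(sorted(m + p)) for m, p in product(mother, father)}


def classify_site(mother, father):
    """Label a site by how much the parental genotypes constrain the fetus."""
    fetal = possible_fetal_genotypes(mother, father)
    if len(fetal) > 1:
        # At least one parent is heterozygous: inheritance must be inferred.
        return "uncertain"
    (genotype,) = fetal
    if genotype[0] == genotype[1]:
        return "unambiguous homozygote"
    return "obligate heterozygote"


for mother, father in [("AA", "AA"), ("AA", "BB"), ("AB", "AA"), ("AB", "AB")]:
    print(mother, father, classify_site(mother, father),
          sorted(possible_fetal_genotypes(mother, father)))
```

Only the sites labeled "uncertain" require evidence from the plasma cfDNA; at all others the parental genotypes alone determine the fetal genotype.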
Figure 2. Inference of the fetal genome on a site-by-site basis. (a) Observed parental genotypes at a given site constrain the possible fetal genotypes. At the vast majority of sites, both parents are homozygotes and the fetal genotype is unambiguous. (b) Expected ...
These uncertain cases can be further split into several possibilities. The most straightforward case is a site at which only the father is heterozygous. If the maternal cfDNA is sequenced sufficiently deeply, but the allele specific to the father is never observed, we infer that the father did not transmit that allele, but instead transmitted the shared allele (Figure 2). This process is conceptually similar to determining the fetal sex by the presence of reads derived from the Y chromosome, which appear among the maternal cfDNA sequences only when the fetus is male, while their absence indicates the fetus is female. Noninvasively determining the fetal sex in this manner is straightforward, and only a small number of sequences must be sampled from the cfDNA in order to have a high degree of confidence in the presence or absence of an entire chromosome. By contrast, much deeper sampling is required to carry out the same task for each individual genomic site, and a key question is exactly how deep this sampling must be.
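To give a rough sense of how deep this sampling must be at a site where only the father is heterozygous, the sketch below uses an idealized model that is an assumption of this example rather than something stated above: exactly `depth` error-free reads cover the site, and, if the paternal-specific allele was transmitted, each read independently carries it with probability equal to half the fetal fraction. The function names are hypothetical.

```python
import math


def prob_paternal_allele_missed(depth, fetal_fraction):
    """Chance of seeing zero reads with a transmitted paternal-specific allele.

    Assumes each of `depth` reads independently carries the allele with
    probability fetal_fraction / 2 and that there are no sequencing errors.
    """
    return (1.0 - fetal_fraction / 2.0) ** depth


def min_depth_to_detect(fetal_fraction, max_miss_prob):
    """Smallest depth at which the miss probability is at most max_miss_prob."""
    per_read_miss = 1.0 - fetal_fraction / 2.0
    return math.ceil(math.log(max_miss_prob) / math.log(per_read_miss))


for fraction in (0.05, 0.10, 0.20):
    print(fraction, min_depth_to_detect(fraction, 0.01))
```

Under these simplifying assumptions, a fetal fraction of 10% calls for roughly 90 reads at a site to keep the chance of missing a transmitted paternal allele below 1%, and the required depth grows as the fetal fraction falls. Real data also contain sequencing errors and uneven coverage, which this sketch ignores.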
The answer to this question largely depends upon the proportion of fetal material among the maternal plasma cfDNA fragments. Accurately estimating this fraction is important not only for NIFWGS but also for current aneuploidy tests [19]. To estimate it, we can identify a set of informative genetic markers that would not be observed if the cfDNA were entirely maternal in origin. Alleles for which the father is homozygous and which the mother does not carry make an ideal set of markers. If the fetus is male, these may be supplemented by sequences specific to the Y chromosome. After deep sequencing of the plasma cfDNA, the frequency of these definitively fetal sequences is tallied, doubled to account for the equal contribution inherited from the mother, and used as a direct estimate of the percentage of fetal cfDNA in the maternal plasma.
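The tally-and-double estimate described above can be written out as a short sketch. The input format, a list of `(paternal_specific_reads, total_reads)` pairs for autosomal marker sites, and the function name are assumptions made for illustration; Y-chromosome markers are left out because they would need different scaling.

```python
def estimate_fetal_fraction(marker_sites):
    """Estimate the fetal fraction from paternal-specific marker sites.

    marker_sites: iterable of (paternal_specific_reads, total_reads) pairs at
    sites where the father is homozygous for an allele the mother lacks.
    The pooled paternal-allele frequency is doubled because the fetus also
    carries one maternally inherited allele at each of these sites.
    """
    sites = list(marker_sites)
    paternal_reads = sum(paternal for paternal, _ in sites)
    total_reads = sum(total for _, total in sites)
    if total_reads == 0:
        raise ValueError("no reads cover the marker sites")
    return 2.0 * paternal_reads / total_reads


# Hypothetical counts: 15 paternal-specific reads out of 300 in total.
print(estimate_fetal_fraction([(5, 100), (6, 100), (4, 100)]))
```

Pooling reads across many marker sites, rather than averaging per-site ratios, keeps sparsely covered sites from dominating the estimate.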
Precisely estimating the fetal fraction of cfDNA is important for two reasons. First, as this fraction decreases, inaccuracies in the inferred fetal genotypes accumulate. If the fetal cfDNA level is too low (for example, less than 5%), then the accuracy of the predicted fetal genome may drop below 95% [14], potentially requiring a second plasma sample to be obtained later in pregnancy, when the fetal fraction may be higher. Second, the estimate of fetal concentration is a key parameter, along with the parental genotypes and the cfDNA sequencing reads, in the statistical model used to predict fetal inheritance.
This model is applied to infer the fetal genotypes at the remaining positions of uncertain inheritance: sites at which the mother is heterozygous and could transmit either allele. At these sites, the dosages of the two alleles among the plasma cfDNA sequences provide evidence for the maternal transmission of one or the other. For example, suppose maternal cfDNA is sequenced to a depth of 100X, with an estimated fetal fraction of 10%. At a given site, the homozygous father necessarily contributes the “A” allele, but the heterozygous mother could contribute either “A” or “B” (Figure 2). On average, we will find 100 reads covering this particular site, of which 90% will be derived from the maternal genome and 10% from the fetal genome. The 90 maternal reads should have, again on average, an equal allele balance at this heterozygous site, meaning that 45 of the reads should contain the “A” allele and the other 45 should contain the “B” allele. The 10 fetal reads will consist of approximately five supporting the “A” allele contributed by the father, while the remainder represent the maternal contribution, which could be “A” or “B”. Thus, we expect that if the “A” allele is transmitted by the mother, we should observe this allele in 55 (45 + 5 + 5) of the reads, whereas if the “B” allele is transmitted, we should observe the “A” allele 50 (45 + 5 + 0) times. We can statistically test which of these two competing scenarios is more likely given the number of times we actually observe the “A” allele at this site. We can then repeat this process at all heterozygous sites to yield a set of site-by-site inheritance predictions.
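The arithmetic of this example generalizes to any depth and fetal fraction. The following sketch, with a hypothetical function name and parameters, computes the expected number of “A” reads at a site where the mother is heterozygous (A/B) and the father is homozygous for “A”.

```python
def expected_a_reads(depth, fetal_fraction, mother_transmits_a):
    """Expected count of "A" reads; mother is A/B, father is A/A.

    Maternal reads are split evenly between "A" and "B". Half of the fetal
    reads carry the paternal "A"; the other half carry whichever allele the
    mother transmitted.
    """
    maternal_a = (1.0 - fetal_fraction) * 0.5
    fetal_paternal_a = fetal_fraction * 0.5
    fetal_maternal_a = fetal_fraction * 0.5 if mother_transmits_a else 0.0
    return depth * (maternal_a + fetal_paternal_a + fetal_maternal_a)


# The worked example above: 100X depth and a 10% fetal fraction.
print(round(expected_a_reads(100, 0.10, mother_transmits_a=True), 6))
print(round(expected_a_reads(100, 0.10, mother_transmits_a=False), 6))
```

The gap between the two expectations is depth × fetal fraction / 2, which is only five reads in this example and explains why the signal at a single site is weak.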
Unfortunately, applying this straightforward model to the full genome yields unsatisfactory results. Suppose, from the previous example, we observe the “A” allele 59 times at this site. In this scenario, the hypothesis in which the mother transmits the “A” allele is almost four times as likely as the transmission of the “B” allele, strongly supporting the former possibility. Whole genome shotgun sequencing works by randomly sampling and sequencing fragments, and despite no change in the underlying inheritance or fetal fraction, the “A” allele at the next such site could be observed only 53 times by random fluctuation. In this event, the two hypotheses (“A” vs “B” transmitted) are nearly equally likely, suggesting that any prediction made in this scenario is roughly equivalent to a coin toss.
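The comparison of the two hypotheses can be made concrete with a binomial likelihood ratio. This is a minimal sketch under assumptions that go beyond the text: reads are treated as independent draws, sequencing errors are ignored, and the function name is hypothetical. It is not necessarily the statistical model used in the cited study.

```python
import math


def transmission_likelihood_ratio(a_reads, depth, fetal_fraction):
    """Likelihood of 'mother transmits A' relative to 'mother transmits B'.

    Site model: mother A/B, father A/A. The expected "A" read fraction is
    0.5 + fetal_fraction / 2 if "A" is transmitted and 0.5 if "B" is.
    The binomial coefficient is the same under both hypotheses and cancels.
    """
    p_if_a = 0.5 + fetal_fraction / 2.0
    p_if_b = 0.5
    b_reads = depth - a_reads
    log_ratio = (a_reads * math.log(p_if_a / p_if_b)
                 + b_reads * math.log((1.0 - p_if_a) / (1.0 - p_if_b)))
    return math.exp(log_ratio)


for observed in (59, 53):
    ratio = transmission_likelihood_ratio(observed, 100, 0.10)
    print(observed, round(ratio, 2))
```

With 59 of 100 reads showing “A” the ratio comes out a little under four, and with 53 of 100 it is close to one, matching the two scenarios described above.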
A simple means to overcome this limitation would be to sample the cfDNA more deeply to obtain clearer separation between the competing transmission hypotheses. For example, if we were able to sequence the cfDNA to 10,000X depth, and continued to observe the “A” allele in 53% of the reads, the transmission of the “A” allele would then be roughly 20,000 times more likely than the transmission of the “B” allele. Unfortunately, the expense of sequencing a human genome scales with the depth, such that sequencing to 10,000X would currently cost over $1 million. Even if expense were no object, this sampling depth is not achievable in many cases: a typical plasma specimen may not contain a sufficient number of distinct copies of the genome, even setting aside the technical limitations of the DNA isolation and sequencing library preparation steps.
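The effect of depth on the strength of evidence can be sketched as follows. The sketch assumes, as a simplification not stated in the text, independent error-free reads and a site where the mother is heterozygous (A/B) and the father is homozygous for “A”; the function names are hypothetical. Under that model the log-likelihood ratio grows linearly with depth when the observed “A” fraction is held fixed.

```python
import math


def log_lr_per_read(observed_a_fraction, fetal_fraction):
    """Average log-likelihood ratio ('A' vs 'B' transmitted) per read."""
    p_if_a = 0.5 + fetal_fraction / 2.0
    p_if_b = 0.5
    return (observed_a_fraction * math.log(p_if_a / p_if_b)
            + (1.0 - observed_a_fraction)
            * math.log((1.0 - p_if_a) / (1.0 - p_if_b)))


def likelihood_ratio_at_depth(depth, observed_a_fraction, fetal_fraction):
    """Likelihood ratio when the same observed fraction holds at `depth`."""
    return math.exp(depth * log_lr_per_read(observed_a_fraction, fetal_fraction))


for depth in (100, 1000, 10000):
    print(depth, round(likelihood_ratio_at_depth(depth, 0.53, 0.10), 1))
```

At 53% “A” reads and a 10% fetal fraction the ratio is close to one at 100X and on the order of 20,000 at 10,000X, in line with the figures quoted above.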
Rather than sampling to an impractical depth at each genomic site in isolation, we employ an experimental technique to group together alleles from each parent, thereby realizing greater statistical power. This approach exploits the fact that the parental genomes are not inherited as a series of independent sites, but rather as haplotypes, or sets of variants jointly present on one of a given pair of homologous chromosomes. If we knew the haplotypes of the parental chromosomes, then we could search for evidence of joint transmission of large contiguous groups of genetic variants, allowing for a small number of crossover events during meiosis. However, long-range haplotypes that span all variants across the full length of a chromosome arm have to date remained largely recalcitrant to experimental methods, except in the context of multi-generation family studies where haplotypes can be inferred post-hoc by transmission patterns.
We recently developed a technique to ascertain smaller subsections of haplotypes, or “haplotype blocks,” each containing dozens or hundreds of heterozygous sites and covering tens to hundreds of kilobases [20]. At a given locus, we define two haplotype blocks, arbitrarily labeled “A” and “B”, representing the grouping, or “phase,” of genetic variants present on the two homologs (Figure 3a). Applying this technique to the parental genomes allows us to search for evidence of transmission of whole blocks “A” or “B”, instead of individual alleles “A” or “B”, by aggregating evidence of overrepresentation of each phased allele along the length of a haplotype block (Figure 3). The signal generated by jointly considering large blocks of sites helps to mitigate the site-by-site noise described above. Moreover, sites at which both parents are heterozygous, where inheritance is particularly difficult to individually predict owing to the addition of a third possible fetal genotype, benefit from their inclusion in haplotype blocks with stronger evidence of inheritance.
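One simple way to picture the aggregation of evidence along a block is to sum per-site log-likelihood ratios. The sketch below is an illustration under strong assumptions that are not taken from the text: no recombination within the block, independent error-free reads, a father who is homozygous at every site, and hypothetical function names and input format. It should not be read as the model used in the cited study.

```python
import math


def site_log_lr(hap_a_reads, hap_b_reads, father_matches_hap_a, fetal_fraction):
    """Log-likelihood ratio (block A vs block B transmitted) at one site.

    hap_a_reads / hap_b_reads count reads carrying the allele phased onto
    maternal block A or B. The homozygous father carries either the block-A
    allele (father_matches_hap_a=True) or the block-B allele.
    """
    paternal = 1.0 if father_matches_hap_a else 0.0
    maternal_share = (1.0 - fetal_fraction) / 2.0
    # Expected fraction of block-A-allele reads under each hypothesis.
    p_if_a = maternal_share + fetal_fraction * (1.0 + paternal) / 2.0
    p_if_b = maternal_share + fetal_fraction * (0.0 + paternal) / 2.0
    return (hap_a_reads * math.log(p_if_a / p_if_b)
            + hap_b_reads * math.log((1.0 - p_if_a) / (1.0 - p_if_b)))


def block_log_lr(sites, fetal_fraction):
    """Sum per-site evidence over (hap_a_reads, hap_b_reads, father_matches_hap_a)."""
    return sum(site_log_lr(a, b, match, fetal_fraction) for a, b, match in sites)


# Hypothetical block: 50 sites, each individually as weak as the 53/100 example.
sites = [(53, 47, True)] * 50
print(round(math.exp(site_log_lr(53, 47, True, 0.10)), 2))
print(round(math.exp(block_log_lr(sites, 0.10)), 1))
```

Evidence that is nearly uninformative at any one site compounds across the block, which is the intuition behind the gain in statistical power from phasing.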
Figure 3. Inference of the fetal genome from haplotype blocks. (a) Phasing of maternal heterozygous sites into haplotype blocks (red bars). Haplotype blocks contain dozens or hundreds of such sites and cover over 300 kilobases on average. A single chromosome may ...
The inferred fetal genome, then, consists of a set of predictions about inheritance of one or the other haplotype block from each of the parental genomes (Figure 3). This composite picture of the fetal genome is substantially complete and highly accurate. However, several clear avenues for technical improvement remain. Intuitively, increasing the length of the haplotype blocks and ensuring they encompass every heterozygous site carried by each parent allows more evidence to be accumulated and yields more accurate predictions of inheritance. At the time of our study, we had determined haplotype blocks for only the maternal genome, and predicted paternal inheritance on a site-by-site basis. We subsequently phased the paternal genome in this same family, which increased the accuracy of prediction for paternal sites from 96.8% to 99.95%. Currently, the process of obtaining haplotype blocks is laborious, although streamlined techniques [21] promise to shorten the processing time required and improve the scalability of the method. These methods could also be combined with other approaches that define longer but sparser blocks (e.g. phasing incomplete sets of heterozygous sites across entire chromosomes [22–24]). Leveraging even longer haplotype blocks while maintaining completeness in terms of the fraction of sites that are phased would improve prediction accuracy and additionally allow mapping of sites of meiotic recombination.
We now return to the question of de novo mutations, or mutations newly arising in the maternal or paternal germline. In principle, de novo mutations are easily identified as variants in the sequenced maternal cfDNA that are not found in either parent. In practice, despite ongoing improvement, WGS technology remains imperfect, and errors introduced during PCR or sequencing far outnumber the approximately 50 to 100 true de novo mutations that we would expect in any given fetus [25]. At a sequencing depth of 100X and fetal fraction of 10%, the two types of events yield signatures that are, on the whole, nearly indistinguishable: at a given site, a small handful of reads suggests the spontaneous emergence of a fetal genotype incompatible with Mendelian inheritance. Separating the true mutations from the spurious errors introduced during the sequencing process remains a challenge and a major area for improvement in both technology and analysis.
One way to address the large number of candidate de novo mutations is to apply an increasingly aggressive set of filters designed to improve the signal-to-noise ratio in the candidate set. For example, we might exclude any candidate with only one or two supporting reads. We might remove sites that are inside or adjacent to specific sequence motifs known to generate elevated error rates. We might discard any site also identified as a candidate in other samples within the same cohort. At each step, we may trade a small decrease in sensitivity for a suitably large gain in specificity. Even after extensive filtering, we are likely to be left with several thousand candidates – still too many for follow-up. However, only a very small percentage of these candidates are likely to fall within protein coding or regulatory regions, suggesting that manual review and/or validation of high-impact candidates may be plausible in a clinical setting.
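The successive filters described above might be sketched as follows. The `Candidate` record, its field names, and the default threshold of three supporting reads (chosen to drop candidates with only one or two) are assumptions made for illustration, not a description of any published pipeline.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A candidate de novo site (hypothetical record layout)."""
    site: str
    supporting_reads: int
    near_error_prone_motif: bool
    seen_in_other_samples: bool


def filter_candidates(candidates, min_supporting_reads=3):
    """Apply the three example filters in turn and return the survivors."""
    kept = []
    for candidate in candidates:
        if candidate.supporting_reads < min_supporting_reads:
            continue  # only one or two supporting reads
        if candidate.near_error_prone_motif:
            continue  # inside or adjacent to an error-prone sequence motif
        if candidate.seen_in_other_samples:
            continue  # recurrent across the cohort, so likely an artifact
        kept.append(candidate)
    return kept


candidates = [
    Candidate("chr1:1000", 2, False, False),
    Candidate("chr2:2000", 5, True, False),
    Candidate("chr3:3000", 4, False, True),
    Candidate("chr4:4000", 6, False, False),
]
print([c.site for c in filter_candidates(candidates)])
```

Each filter removes some true mutations along with many errors, so thresholds such as `min_supporting_reads` would need tuning against the sequencing depth and fetal fraction of a given sample.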
Ideally, in order to systematically map de novo mutations, a sample must be collected from the father. Without knowledge of the paternal genotypes, any paternally transmitted alleles not shared with the mother are indistinguishable from de novo mutations in the maternal germline. However, even without a paternal sample, it may still be possible to identify likely de novo mutations by searching a predefined panel of genes known to be inherited in a dominant fashion with high penetrance; paternal transmission of a mutation in one of these genes could be ruled unlikely given the father’s health status. Nevertheless, for all but the most stereotyped disorders, definitively separating deleterious mutations from benign ones remains an elusive goal, even for single-gene disorders.