Even though groups such as The SNP Consortium [1] and the International HapMap Consortium [2,3] have identified millions of polymorphic markers and stimulated the development of high-throughput genotyping techniques [4–6], genotyping of polymorphic markers remains a labor-intensive and costly step in genetic mapping studies. To decrease the cost of family-based genetic studies, we developed a computational approach that uses high-density genotype data for a subset of individuals in a pedigree to infer genotypes for the remaining relatives (see http://genomics.med.upenn.edu/genotypeinference for the software). This approach greatly reduces the amount of conventional ‘wet-lab’ experimentation required to carry out association analysis in pedigrees.
Many gene mapping projects use a tiered approach: first, genome-wide linkage analysis is carried out using widely spaced markers across the genome; then, genotypes are determined for many more markers near observed linkage peaks and are tested by association analysis. Our approach reduces work in the second stage because experimental genotyping is required for only a subset of individuals. Genotypes for the remaining individuals are obtained in two steps. First, low-resolution genotypes from linkage analysis are used to identify regions of shared identity-by-descent (IBD) between relatives. Then, with information on IBD sharing between individuals and high-density genotype data on some members of the family, we infer most of the unobserved high-density genotypes for the remaining individuals.
To illustrate this procedure, we used it to infer genotypes for the children in ten Centre d’Etude du Polymorphisme Humain (CEPH)–HapMap pedigrees. All the grandparents and parents of these pedigrees have been genotyped at about 1 million SNP markers in Phase I of the International HapMap Project [3]. First, we used genotypes of 6,564 genetic markers obtained previously on all individuals to determine the grandparental origin for every chromosomal segment in each child. Specifically, for each child and at every marker, we considered the allele from the mother and determined whether that allele was inherited from a transmitted chromosome that originated in the maternal grandfather or grandmother; we did the same for the paternal side. Results from adjacent markers allow us to confirm the grandparental origins of each genomic region. This step can be accomplished with existing pedigree analysis packages [7–11]. In the second step, we inferred unobserved genotypes in the children by combining information from the first step, which describes the genome of each child as a mosaic of the grandparental chromosomes, with high-density genotypes of the grandparents and parents. For example, at a particular SNP, suppose that the low-resolution genotypes show that the child inherited the chromosomal segment containing this SNP from the paternal and maternal grandfathers, and the high-resolution genotypes show the haplotypes transmitted from these grandparents carry alleles A and C, respectively; then the child’s genotype must be AC.
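The two steps described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the released software: genotypes are two-letter strings, the function names (`phase_parent`, `origin_at_marker`, `fill_segments`, `infer_child`) are hypothetical, genotyping error is ignored, and double recombinants between adjacent informative markers are assumed not to occur. In practice the first step would be delegated to a pedigree analysis package.

```python
from typing import List, Optional, Tuple

Genotype = str  # two alleles, e.g. "AC"; allele order carries no meaning


def phase_parent(parent: Genotype, gf: Genotype, gm: Genotype) -> Optional[Tuple[str, str]]:
    """Return (allele from grandfather, allele from grandmother) carried by the
    parent, or None if the assignment is ambiguous or inconsistent."""
    options = {(a, b) for a in gf for b in gm if sorted(a + b) == sorted(parent)}
    return next(iter(options)) if len(options) == 1 else None


def origin_at_marker(transmitted: str, parent: Genotype, gf: Genotype, gm: Genotype) -> Optional[str]:
    """Step 1, single marker: grandparental origin of the allele the parent transmitted."""
    phase = phase_parent(parent, gf, gm)
    if phase is None or phase[0] == phase[1]:
        return None  # parent unphased or homozygous: marker is uninformative
    return "GF" if transmitted == phase[0] else "GM"


def fill_segments(calls: List[Optional[str]]) -> List[Optional[str]]:
    """Step 1, along a chromosome: extend a call across uninformative markers
    only when the nearest informative markers on both sides agree."""
    filled = list(calls)
    informative = [i for i, c in enumerate(calls) if c is not None]
    for left, right in zip(informative, informative[1:]):
        if calls[left] == calls[right]:
            for i in range(left + 1, right):
                filled[i] = calls[left]
    return filled


def transmitted_allele(origin: Optional[str], parent: Genotype, gf: Genotype, gm: Genotype) -> Optional[str]:
    """Step 2, one parental side: allele on the grandparental haplotype the child received."""
    if parent[0] == parent[1]:
        return parent[0]  # homozygous parent: the allele is known whatever the origin
    phase = phase_parent(parent, gf, gm)
    if origin is None or phase is None:
        return None
    return phase[0] if origin == "GF" else phase[1]


def infer_child(pat_origin, father, pgf, pgm, mat_origin, mother, mgf, mgm) -> Optional[Genotype]:
    """Step 2: combine the paternal and maternal transmitted alleles at a dense SNP."""
    pat = transmitted_allele(pat_origin, father, pgf, pgm)
    mat = transmitted_allele(mat_origin, mother, mgf, mgm)
    if pat is None or mat is None:
        return None
    return "".join(sorted(pat + mat))


# The example from the text: both segments come from the grandfathers, whose
# transmitted haplotypes carry A (paternal side) and C (maternal side).
child = infer_child("GF", "AC", "AA", "CC", "GF", "AC", "CC", "AC")
```

Here `child` evaluates to `"AC"`. When a parent and both of that parent's own parents are heterozygous, `phase_parent` returns `None` and the genotype is left uninferred.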
Figure 1. Genotype inference. (a) Inferred genotypes for eight SNPs. The inferred genotypes for each child are shown in italics. To determine the inferred genotypes, we identified regions of shared IBD (color-coded) between the child and her parents and grandparents ...
When we applied this procedure to infer genotypes for children in ten CEPH-HapMap pedigrees, we obtained 53,666,501 genotypes, an average of 688,032 marker genotypes for each of 78 children (range: 629,731 to 698,165). This average corresponds to ~83% of all the genotypes that can be obtained (the average number of genotypes available on each grandparent and parent in release 16 of the HapMap data is 832,703). Some genotypes were not inferred because the markers were located in regions where IBD sharing information was uncertain. In other cases, even though fully informative IBD information was available, the two grandparents on the maternal or paternal side (or both) and the corresponding parent were all heterozygous at a SNP, so it was impossible to determine which alleles were transmitted. These results closely match analytical expectations: theoretically, we would expect to be able to infer ~97%, 83% and 77% of genotypes for SNPs with minor allele frequencies of 0.10, 0.30 and 0.50, respectively (Supplementary Methods).
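A simple model reproduces these expected rates. It is a sketch under stated assumptions and not necessarily the derivation given in the Supplementary Methods: genotypes follow Hardy-Weinberg proportions, grandparents are unrelated, IBD information is fully informative, and one parental side is unresolved only when both grandparents and the parent are heterozygous. The function name is hypothetical.

```python
def expected_inferable_fraction(maf: float) -> float:
    """Expected fraction of child genotypes that can be inferred at a SNP
    with the given minor allele frequency, under the assumptions above."""
    het = 2 * maf * (1 - maf)         # P(a grandparent is heterozygous) under HWE
    ambiguous_side = het * het * 0.5  # both grandparents het, and the parent is het too
    return (1 - ambiguous_side) ** 2  # paternal and maternal sides must both be resolved


for maf in (0.10, 0.30, 0.50):
    print(f"MAF {maf:.2f}: {expected_inferable_fraction(maf):.2f}")
# prints 0.97, 0.83 and 0.77 for the three frequencies
```

The factor 0.5 is the probability that the child of two heterozygous grandparents is itself heterozygous; squaring the per-side success probability treats the two sides of the pedigree as independent.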
To determine the accuracy of the method, we compared the inferred genotypes with those generated experimentally by PCR-based SNP genotyping. Among the 3,210 genotypes in which both inferred and experimental genotypes were available, seven (0.2%) were discordant. Even if the inferred genotypes were incorrect in all seven discrepant cases, the error rate from inference would still be very low and comparable to the error rate obtained by experimental genotyping in the HapMap Project.
Next, we used the inferred genotypes to test for evidence of linkage and association of candidate transcriptional regulators with gene expression phenotypes. Previously, we had performed genome-wide linkage analyses to determine the chromosomal locations linked to the expression levels of genes [12]. With the inferred genotypes, we performed family-based association analysis using the quantitative transmission disequilibrium test (QTDT) [13,14] with markers within the significant linkage peaks. As the linkage peaks are quite broad, we would otherwise have needed to perform millions of genotyping reactions. The inferred genotypes, however, allowed us to analyze a large number of parent-offspring transmissions without having to carry out any additional genotyping reactions. We illustrate this with ten expression phenotypes for which we have previously found highly significant linkage evidence for cis-acting regulators. We identified markers located under each significant linkage peak (pointwise P < 4 × 10⁻⁷) and carried out QTDT analysis with genotypes for (i) 30 genotyped CEPH-HapMap trios and (ii) the same 30 trios augmented with inferred genotypes of children in ten CEPH families. In each case, QTDT results confirmed the linkage findings and narrowed the candidate regions. However, results that included the inferred genotypes were more significant than those from the 30 HapMap trios alone. With just the 30 HapMap trios, for many phenotypes, there were not enough informative offspring to carry out the analysis. In the remaining cases, the evidence of cis association was modest. With the inferred genotypes, we observed several-fold increases in χ² values (and therefore in effective sample size).
Comparison of QTDT results without and with inferred genotypes
Simulations (summarized in the table comparing simulation results) show that a substantial increase in power is expected whether analyzing a variant that has a strong effect (such as the cis-acting variants for gene expression phenotypes examined above) or a weaker effect (as would be expected for most complex traits). The simulated data also show that genotyping one offspring per family with high-density markers further increases the power to very near what would be achieved if all the children in each family were genotyped (see rows 3 and 5 of that table).
Comparison of simulation results
Although the examples above focus on three-generation families, our method can be extended to other settings. For example, in nuclear families in which low-resolution linkage data are available, most of the unobserved genotypes in offspring can be inferred by genotyping the parents and one of the offspring with high-density markers. We applied our procedure to two-generation CEPH families (we omitted information from the grandparents) and obtained 93.7% of the missing genotypes (Supplementary Note and Supplementary Tables 1 online). We confirmed these findings using simulated data (rows 6–8 of the table comparing simulation results).
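The nuclear-family case can be illustrated with a short sketch. It assumes that the low-resolution linkage data have already been summarized as two hypothetical flags per region, `shares_paternal` and `shares_maternal`, which state whether the ungenotyped sibling carries the same paternal and maternal haplotypes as the genotyped sibling. The function names and the two-letter genotype encoding are illustrative only.

```python
from typing import Optional


def other_allele(parent: str, allele: str) -> str:
    """Allele on the parental haplotype that was not transmitted."""
    return parent[1] if parent[0] == allele else parent[0]


def infer_sibling(sib1: str, father: str, mother: str,
                  shares_paternal: bool, shares_maternal: bool) -> Optional[str]:
    """Infer an ungenotyped sibling's genotype at one SNP from the parents, one
    genotyped sibling, and IBD sharing with that sibling. Returns None when the
    data do not determine a single genotype."""
    results = set()
    for pat in father:
        for mat in mother:
            if sorted(pat + mat) != sorted(sib1):
                continue  # this transmission cannot have produced sibling 1
            p = pat if shares_paternal else other_allele(father, pat)
            m = mat if shares_maternal else other_allele(mother, mat)
            results.add("".join(sorted(p + m)))
    return results.pop() if len(results) == 1 else None


# Sibling 1 is AA with both parents AC; sibling 2 shares only the paternal haplotype.
sib2 = infer_sibling("AA", "AC", "AC", shares_paternal=True, shares_maternal=False)
```

Here `sib2` evaluates to `"AC"`. Enumerating every transmission consistent with the genotyped sibling, and accepting the result only when all of them agree, leaves the genotype uninferred exactly when phase is ambiguous and matters, for example when both parents and the genotyped sibling are heterozygous and the siblings share only one haplotype.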
Gene mapping projects often begin with a linkage study with relatively sparse markers. When candidate regions are found, they are further investigated by association analysis. Because association studies require a dense set of markers, the cost of conventional genotyping can be very high. Here, we show that high-density genotypes can be inferred for the relatives of genotyped individuals with greatly reduced ‘wet lab’ experimentation. Of course, in some cases not all unobserved genotypes can be obtained, as haplotype phase may remain uncertain, or genotypes from a previous scan may not be available. In these cases, it is still possible to estimate a probability distribution for each of the unobserved genotypes conditional on the observed genotype data for the pedigree. It is then possible to carry out association tests that use these probability distributions in place of observed genotypes; these tests can extract information even from individuals whose genotype is uncertain (W.C. and G.R.A., unpublished data).
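As a minimal illustration of a genotype probability distribution, consider the simplest case, in which only the two parental genotypes are observed and no IBD information is available. The unpublished method cited above is not described here, so this sketch simply applies Mendelian transmission probabilities; the function names and the use of an expected allele count as the input to an association test are assumptions made for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product
from typing import Dict


def offspring_distribution(father: str, mother: str) -> Dict[str, Fraction]:
    """P(child genotype | parental genotypes) under Mendelian transmission."""
    counts = Counter("".join(sorted(a + b)) for a, b in product(father, mother))
    return {genotype: Fraction(n, 4) for genotype, n in counts.items()}


def expected_dosage(dist: Dict[str, Fraction], allele: str) -> Fraction:
    """Expected number of copies of `allele`, usable in place of an observed genotype."""
    return sum(p * genotype.count(allele) for genotype, p in dist.items())


dist = offspring_distribution("AC", "AC")  # AA: 1/4, AC: 1/2, CC: 1/4
dosage = expected_dosage(dist, "A")        # 1 copy of A expected
```

Conditioning on further relatives or on IBD sharing would sharpen this distribution, in the limit recovering a single inferred genotype.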
In silico genotype inference provides a cost-effective way to scan many existing family collections for association, either genome-wide or within candidate genes or regions. All that is required is to genotype several well-chosen individuals in each family at very high density. This approach will facilitate genome-scale family-based association studies and, thus, the identification of susceptibility genes for complex diseases.