At least four distinct populations of chimpanzees have been defined based on morphological and geographic criteria, including bonobos (Pan paniscus) and three common chimpanzee populations: eastern (Pan troglodytes schweinfurthii), central (Pan troglodytes troglodytes), and western (Pan troglodytes verus) [1]. Genetic studies have confirmed the distinctiveness of the chimpanzee populations [2], [3], [4], and have also documented striking differences among them; for example, central chimpanzees harbor ~2.5 times as much genetic variation as western chimpanzees, more than is observed in any human population [3], [5], [6], [7], [8], [9]. Allele frequency differentiation among some pairs of chimpanzee populations (for example, western and central chimpanzees) is also known to be higher than between any pair of human populations [9].
In contrast with studies of human history, for which there is a rich fossil record that can complement and inform genetic studies, the dearth of chimpanzee fossils [10] means that nearly all information about chimpanzee demographic history must come from genetic data. The best current understanding of chimpanzee history comes from small collections of genomic loci amplified by polymerase chain reaction (PCR). The two largest data sets of this type were collected by Yu et al. [8], who studied ~23 kilobases in 9 bonobos, 2 eastern, 5 central, and 6 western chimpanzees, and Fischer et al. [9], who studied ~22 kilobases in 9 bonobos, 10 eastern, 10 central, and 10 western chimpanzees. Analyses that fit these data sets to an Isolation and Migration (IM) model have resulted in important inferences about chimpanzee history [11], [12]: that bonobos and common chimpanzees separated ~1 million years ago (Mya); that western and central chimpanzees separated ~0.5 Mya; that there has been a ~3-fold expansion in the central chimpanzee population size since the western-central population separation; and that there has been migration between western and central chimpanzees since they separated. While these analyses provide a baseline set of parameter estimates that can be used to understand the relationships among the chimpanzee populations, the estimates also have substantial uncertainty. We aimed to generate a new kind of data, and a model for analyzing the data, that would increase the precision of previous estimates and be sensitive to different features of demographic history.
We sequenced 26,495 reads from a bonobo (B) and 36,083 from an eastern chimpanzee (E), using a standard plasmid end-sequencing technique that obtains pairs of reads each about 800 base pairs in length (up to 1,600 base pairs when both ends of the clone are considered together) and separated by about 4 kilobases. We then combined these data with publicly available data from the chimpanzee and macaque genome projects: 1,193,115 reads from three central chimpanzees (C), 20,632,928 from three western chimpanzees (W), and 13,810,571 from macaque (M) [5], [13]. By aligning all reads to the human reference sequence (H) [14], we generated nine different data sets, defined by the combinations of samples in the alignments. The five-sequence data sets are designated C1C2WHM, W1W2CHM, CWBHM, and ECWHM, where letters are used to indicate the species that are included (for example, C1C2WHM denotes two central chimpanzees, a western chimpanzee, a human, and a macaque). The four-sequence alignments are designated C1C2WH, W1W2CH, CWBH, and ECWH, and the three-species alignment is designated WBH. (We used the alignments of smaller numbers of individuals to obtain more precise estimates of certain demographic parameters.) These data sets contain much more alignment of chimpanzee sequence than has previously been available. For example, the CWBHM data set (598,814 bp) includes >20 times more alignment of central, western, and bonobo DNA than any population genetic data set studied to date [6], [8], [9].
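The labeling scheme and the size comparison above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the `decode` helper and the variable names are hypothetical, the label strings are written without subscripts, and the size check uses the approximate totals quoted in the text for refs. [8] and [9] (~23 kb and ~22 kb), since the size of the data set in ref. [6] is not given here.

```python
# Hypothetical helper for reading the data set labels described in the text.
# Letter codes are taken from the text; everything else is illustrative.
CODES = {
    "E": "eastern chimpanzee",
    "C": "central chimpanzee",
    "W": "western chimpanzee",
    "B": "bonobo",
    "H": "human",
    "M": "macaque",
}

# The nine data sets named in the text (subscripts written inline).
DATA_SETS = [
    "C1C2WHM", "W1W2CHM", "CWBHM", "ECWHM",  # five-sequence
    "C1C2WH", "W1W2CH", "CWBH", "ECWH",      # four-sequence
    "WBH",                                   # three-sequence
]


def decode(label):
    """Return one entry per aligned sequence named in the label."""
    samples = []
    for ch in label:
        if ch.isdigit():
            # A digit numbers the individual named by the preceding letter.
            samples[-1] = f"{samples[-1]} {ch}"
        else:
            samples.append(CODES[ch])
    return samples


# Group labels by the number of sequences in the alignment.
by_size = {}
for label in DATA_SETS:
    by_size.setdefault(len(decode(label)), []).append(label)

# Rough size comparison, assuming the approximate totals quoted in the text.
CWBHM_BP = 598_814
ratio_vs_yu = CWBHM_BP / 23_000       # ref. [8], ~23 kb
ratio_vs_fischer = CWBHM_BP / 22_000  # ref. [9], ~22 kb

if __name__ == "__main__":
    for size in sorted(by_size, reverse=True):
        print(size, by_size[size])
    print(round(ratio_vs_yu, 1), round(ratio_vs_fischer, 1))
```

Under these assumptions both ratios come out above 20, consistent with the ">20 times" statement, and the labels group into four five-sequence, four four-sequence, and one three-sequence data set.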
Table 1. Genetic divergence between pairs of chimpanzee populations.
The genome sequence alignments we used in our analysis are not only large, but also different in nature from traditional population genetics data sets. Genome sequence alignments have the advantage that they include orders of magnitude more aligned sequence than traditional population genetic data sets (and are becoming increasingly practical to generate with new sequencing technologies). They have the disadvantage that only a few individuals are available for each region of alignment, so there is limited information about allele frequencies. A methodological question in population genetics is whether this different type of data can provide new information about history. Here, we demonstrate that genome sequence alignments can be used to provide insights about population history that are not accessible from the analysis of traditional, smaller data sets.