Our first test was to assess how accurately individuals with full Jewish ancestry (all four grandparents) could be distinguished from those with no Jewish ancestry using the score on the first principal component axis (PC1). We found that the individuals with full Jewish ancestry formed a clearly distinct cluster from those with no Jewish ancestry (Figure ). Strikingly, if we look only at the position on PC1, every individual in this dataset with self-reported full Jewish ancestry has a higher score than any individual with no Jewish ancestry. Interestingly, of the two subjects that appear intermediate between the clear 'Jewish' and 'non-Jewish' clusters, one reports two Jewish grandparents of Sephardic origin, and the other declares full Jewish ancestry but gives no country of origin for their grandparents. These analyses imply the possibility of perfect or near-perfect resolution of full Jewish ancestry using only genetic data. We should note, however, that any attempt at inference about Jewish ancestry would require a 'training set' such as that described here to determine the appropriate division between individuals with and without Jewish ancestry, since the clusters fall next to each other. In practice, therefore, resolution of full Jewish ancestry would likely be less than perfect, but the non-overlapping distributions that we observed on the first principal component imply that both specificity and sensitivity would be high.
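The role of the training set can be illustrated with a minimal sketch. It assumes a simple midpoint rule for placing the cutoff between two non-overlapping groups, which is not a rule stated in this study, and every score below is invented for illustration; the function names are hypothetical.

```python
def midpoint_threshold(group_low, group_high):
    """Return the midpoint of the gap between two non-overlapping score sets."""
    top_of_low = max(group_low)
    bottom_of_high = min(group_high)
    if top_of_low >= bottom_of_high:
        raise ValueError("training groups overlap on PC1; no clean cutoff exists")
    return (top_of_low + bottom_of_high) / 2.0


def sensitivity_specificity(scores, labels, threshold):
    """Labels are True for the high-scoring group; scores above the threshold are called positive."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if y and s > threshold)
    fn = sum(1 for s, y in pairs if y and s <= threshold)
    tn = sum(1 for s, y in pairs if not y and s <= threshold)
    fp = sum(1 for s, y in pairs if not y and s > threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical training scores (not data from this study).
train_none = [-0.02, -0.01, 0.00, 0.01, 0.02]
train_full = [0.10, 0.11, 0.12, 0.13]
cutoff = midpoint_threshold(train_none, train_full)

# Hypothetical held-out subjects; one full-ancestry subject falls below the cutoff.
test_scores = [-0.015, 0.005, 0.025, 0.055, 0.115, 0.125]
test_labels = [False, False, False, True, True, True]
sens, spec = sensitivity_specificity(test_scores, test_labels, cutoff)
```

In this toy case the training groups separate perfectly, yet one held-out subject is misclassified, which is the sense in which resolution in practice may be less than perfect.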
PC1 scores for Jewish and non-Jewish subjects. The score on PC1 plotted against the score on PC2 for Jewish (blue) and non-Jewish (red) subjects.
We went on to assess whether participants with one, two or three Jewish grandparents could be statistically distinguished from one another and from individuals with either full or no Jewish ancestry. As expected, most of these subjects were positioned in between the non-Jewish and the full-Jewish subjects on PC1 (Figure ).
PC1 versus PC2 for people with or without Jewish ancestry. The score on PC1 plotted against the score on PC2 for people with four, three, two, one and no Jewish grandparents.
Nearly all (36/37) of the subjects with two Jewish grandparents scored between 0.03 and 0.08 on PC1, all four subjects with three Jewish grandparents scored between 0.08 and 0.1 on PC1, and 496/507 subjects declaring no Jewish ancestry scored below 0.03. The subjects with only one Jewish grandparent were not distinguishable based on PC1 position. The subjects that did not score within the expected range for their self-declared ancestry are shown in Table , along with their ancestral information where known. The majority of informative subjects with no Jewish ancestry that scored most highly on PC1 were of either Italian or Eastern Mediterranean descent. This indicates that in a mixed American context, these populations may not be easily distinguishable from subjects with a single Jewish grandparent.
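Flagging subjects that fall outside the range expected for their group can be sketched as follows. The two- and three-grandparent ranges are those quoted above; the cutoff of 0.03 for the no-ancestry group is taken here as an assumption (the lower bound of the two-grandparent range), interval bounds are treated as inclusive for simplicity, and the subject records are hypothetical.

```python
# Expected PC1 ranges keyed by number of Jewish grandparents.
# The 0.03 upper bound for the no-ancestry group is an assumption.
EXPECTED_RANGE = {
    0: (float("-inf"), 0.03),
    2: (0.03, 0.08),
    3: (0.08, 0.10),
}


def out_of_range(subjects):
    """Return IDs of subjects whose PC1 score lies outside their group's expected range."""
    flagged = []
    for subject_id, n_grandparents, pc1 in subjects:
        bounds = EXPECTED_RANGE.get(n_grandparents)
        if bounds is None:
            continue  # no expected range is listed here for this group
        low, high = bounds
        if not (low <= pc1 <= high):
            flagged.append(subject_id)
    return flagged


# Hypothetical records: (subject ID, Jewish grandparents, PC1 score).
subjects = [
    ("S1", 0, -0.01),
    ("S2", 0, 0.05),
    ("S3", 2, 0.05),
    ("S4", 2, 0.09),
    ("S5", 3, 0.09),
    ("S6", 1, 0.02),
]
flagged = out_of_range(subjects)
```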
Subjects that did not score within the PC1 range expected for their self-declared Jewish ancestry group
Finally, we used one-way ANOVA to determine which groups were significantly different by PC1 score from non-Jewish subjects. We found that all four groups with Jewish ancestry were significantly different, on average, from those with no Jewish ancestry: 4 versus 0 grandparents, p = 8.5 × 10⁻²⁵⁶; 3 versus 0, p = 4.77 × 10⁻⁴¹; 2 versus 0, p = 6.8 × 10⁻⁹⁶; 1 versus 0, p = 7.8 × 10⁻¹⁰. This shows that even with only a single Jewish grandparent there remains a statistically definable signature of Jewish ancestry amongst Americans of European ancestry, although the perfect genetic discrimination of Jewish versus non-Jewish ancestry present in comparing full Jewish to no Jewish ancestry is lost at an individual level.
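For readers unfamiliar with the test, the F statistic behind a one-way ANOVA can be computed as in the sketch below. This is a plain-Python illustration on invented scores, not the software used in this study; converting F to a p-value requires the F distribution, for which a library routine such as `scipy.stats.f_oneway` would normally be used.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of score lists."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical PC1 scores for two groups (not data from this study).
no_ancestry = [0.01, 0.02, 0.03]
one_grandparent = [0.03, 0.04, 0.05]
f_stat, df_b, df_w = one_way_anova_f([no_ancestry, one_grandparent])
```

With only two groups, as in each pairwise comparison above, the F statistic equals the square of the two-sample t statistic with pooled variance.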
To address the question of whether this axis may be predicting not Jewish but rather (contemporary) Middle Eastern ancestry, we used the genome-wide single nucleotide polymorphism (SNP) data from the CEPH Human Genome Diversity Panel [9]. We added to our European-Americans eight other populations reflecting Northern Europe (Orcadian, n = 15); Central Europe (French, n = 28; French-Basque, n = 24; Northern Italian, n = 12); Southern Europe (Tuscan, n = 8; Sardinian, n = 28); and Eastern Europe (Russian, n = 25; Adygei, n = 17, an ethnic group of the Russian Caucasus). We also included Palestinian (n = 46), Druze (n = 42) and Bedouin (n = 45) samples as groups that might be similar to ancestral Jewish 'source' populations [10]. We found that the Middle Eastern populations clustered separately from the European and European-American populations, as expected, and the subjects with four Jewish grandparents clustered close to (but separate from) the Adygei and lay between the Middle Eastern and the European and European-American populations (Figure ). This is an important finding for a number of reasons. Firstly, the Jewish subjects remain in a separate cluster when mixed with both European and Middle Eastern populations, suggesting that the original principal component axis seen in the European-Americans is indeed a Jewish-specific axis, at least in the context of the populations considered here. Secondly, the Jewish cluster lies approximately midway between the European and the Middle Eastern clusters, implying that the Ashkenazi Jews may contain mixed ancestry from these two regions. This is consistent with the Y chromosome and mitochondrial DNA genetic evidence that has been interpreted by some to suggest a stronger paternal genetic heritage of Jewish populations from the Middle East and a stronger maternal genetic heritage from the host populations of the Diaspora [10]. Finally, the proximity of the Jewish cluster to the Adygei is of interest, but the small sample size of the Adygei and the sparse availability of Central Asian populations make interpretation of this proximity difficult.
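One way to put a number on "approximately midway" is to project a cluster centroid onto the line joining two reference centroids. The sketch below does this in the PC1/PC2 plane; the coordinates are invented for illustration, the function names are hypothetical, and this calculation is not reported in the study.

```python
def centroid(points):
    """Mean position of a list of (pc1, pc2) points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def fraction_along(a, b, c):
    """Project c onto the line from a to b; 0.0 means at a, 1.0 means at b."""
    abx, aby = b[0] - a[0], b[1] - a[1]
    acx, acy = c[0] - a[0], c[1] - a[1]
    return (acx * abx + acy * aby) / (abx * abx + aby * aby)


# Hypothetical (PC1, PC2) coordinates for three clusters.
european = [(0.00, 0.00), (0.02, 0.00)]
middle_eastern = [(0.10, 0.00), (0.12, 0.00)]
jewish = [(0.05, 0.03), (0.07, 0.03)]

position = fraction_along(
    centroid(european), centroid(middle_eastern), centroid(jewish)
)
```

A value near 0.5 corresponds to a centroid lying about halfway between the two reference clusters along that axis, regardless of its offset perpendicular to the axis.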
Figure 3 PC1 versus PC2 of Eigenstrat analysis including European and Middle Eastern subjects from the CEPH Diversity Panel. Subjects with one, two or three Jewish grandparents were excluded. Four subjects with outlying scores were excluded for better visualization.