Several recent studies have reported global statistics on the numbers and locations of SNVs [31–33]
within both healthy and disease genomes. The work reported here extends these studies to clinically relevant trends in SNVs within healthy genomes, focusing on the impact of sequencing platform and ethnicity on genome-based prognosis. As makes clear, ethnicity has a powerful impact on the distributions of SNVs within personal genome sequences. It also provides a means to assess the global impact of sequencing technology on variant identification. Although six different sequencing platforms were used to produce the dataset, in no case is the basic trend of ethnic similarity disrupted. For example, although five different platforms were used to produce the white genomes in our dataset, all five of these genomes form a clade. This indicates that the accuracy of every platform is high enough to reveal ethnic relationships. Thus, differences in base-calling accuracy among the platforms do not invalidate cross-platform genome comparisons for purposes of basic anthropological investigation. Moreover, the ABI SOLiD and Illumina versions of the NA18507 genome have 575,099 and 526,836 unique positions genome wide relative to each other, sharing a total of 77% of their variants relative to the union of their sets; at locations corresponding to known disease-causing alleles, however, agreement is much better: 90% of variants are shared. These results make it clear that cross-platform analyses requiring great accuracy will remain problematic until sound probability models are available for base calling on every platform or, at the very least, until a standard set of equivalences is established between the quality values produced by the different sequencing platforms. It should also be noted that these genomes include the first genomes to be published on each platform, and most were sequenced to relatively low coverage. Although much progress is being made in this area [34, 35],
our results show that clinical prognoses cannot yet be made in a platform-neutral fashion. However, because our knowledge about disease-causing mutations is heavily biased toward loci in coding regions, and because these same regions tend to produce higher-quality variant calls than the genome as a whole, the current technologies are clearly sufficient for a wide array of nondiagnostic cross-platform analyses.
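The cross-platform comparison above can be sketched as a set operation on variant calls. The following is a minimal illustration, not the pipeline used in this study: it assumes that "shared relative to the union" means the number of shared calls divided by the size of the union of the two call sets, and the function names and toy variant keys are hypothetical. Under that reading, and treating the rounded 77% as exact, the two unique-position counts quoted above would imply roughly 3.7 million shared calls.

```python
def concordance(calls_a, calls_b):
    """Compare two sets of variant calls.

    Returns (unique to A, unique to B, shared, shared fraction of the union).
    """
    shared = calls_a & calls_b
    union = calls_a | calls_b
    fraction = len(shared) / len(union) if union else 0.0
    return len(calls_a - calls_b), len(calls_b - calls_a), len(shared), fraction


def implied_shared(unique_a, unique_b, shared_fraction):
    """Solve shared / (shared + unique_a + unique_b) = shared_fraction for shared."""
    return shared_fraction * (unique_a + unique_b) / (1.0 - shared_fraction)


# Toy call sets keyed by (chromosome, position, alternate allele); hypothetical data.
platform_a = {("chr1", 100, "A"), ("chr1", 200, "T"), ("chr2", 50, "G"), ("chr3", 75, "C")}
platform_b = {("chr1", 100, "A"), ("chr1", 200, "T"), ("chr2", 50, "G"), ("chr4", 10, "T")}
result = concordance(platform_a, platform_b)

# Shared-call count implied by the figures quoted in the text (77% is a rounded value).
approx_shared = implied_shared(575_099, 526_836, 0.77)
```

Restricting both call sets to positions of known disease-causing alleles before calling `concordance` would give the corresponding locus-restricted agreement.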
We find that disease-causing alleles, like SNVs in general, are distributed along ethnic lines, with Africans almost twice as likely as Eurasians to be homozygous for disease-causing or predisposing alleles. One likely explanation for this trend is background effects [36, 37],
i.e., alleles with deleterious consequences in one ethnic background may well prove harmless in another. This unequal distribution of variants relative to ethnicity is likely compounded by an ascertainment bias in existing databases of variation, which arises from the overrepresentation of Eurasian populations in current studies of disease. The approach taken here of looking at variation across broad classes of genes provides a key insight into human genetic variation: the forces affecting mutational load seem to be largely constant across human populations. In contrast, we see a strong signal separating ethnicities when viewing these genomes in light of known, clinically relevant variation as cataloged in OMIM. These findings indicate that failure to adequately control for ethnicity will jeopardize the prognostic accuracy of sequence-based diagnoses; unfortunately, the impact of ethnicity on the penetrance and phenotypic severity of many disease alleles is still unknown. The need to extend studies in other clinical areas to include women and diverse ethnic populations is well established [38].
Our data demonstrate that similarly inclusive studies of personal genome sequences will be needed to ensure equitable prognostic accuracy across ethnicities. Interestingly, given a larger set of personal genomes, the analysis techniques used here would provide a means to identify and quantify background effects. Our results also suggest that further substructure in personal genome SNV distributions still awaits analysis. The structure of the white clade in , for example, suggests deeper population substructure within the whites.
In this study, we introduce the concept of personal genomic load by disease. Our disease-gene classification system has made it possible to measure the genomic load of variants within different disease categories and to carry out cross-category comparisons. Genomic load varies between disease categories (). Furthermore, because genes are assigned to categories independently of their genomic location, these deviations are unlikely to be due to shared haplotypes acting to modulate variant numbers within a category in a coordinated fashion, especially in light of the diverse ethnicities in our dataset. One possible explanation for this observation, then, is a kind of ascertainment bias: all the genomes in our dataset are from healthy adults. Simply surviving to adulthood may be correlated with restricted variation for some disease categories but less so for others.
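As a rough illustration of how a per-category genomic load might be tallied, the sketch below counts each variant once for every disease category to which its gene is assigned. The gene names, category assignments, and the `genomic_load` function are hypothetical placeholders, and the measure used in this study may be weighted or normalized differently.

```python
from collections import Counter

# Hypothetical gene-to-category assignments. A gene may belong to several
# categories, and the assignment does not depend on genomic location.
GENE_CATEGORIES = {
    "GENE_A": {"cardiovascular"},
    "GENE_B": {"cancer", "aging"},
    "GENE_C": {"aging"},
}


def genomic_load(variant_genes, gene_categories=GENE_CATEGORIES):
    """Count variants per disease category for one genome.

    variant_genes lists the gene hit by each variant, one entry per variant.
    Variants in genes without a category assignment are ignored.
    """
    load = Counter()
    for gene in variant_genes:
        for category in gene_categories.get(gene, ()):
            load[category] += 1
    return load


# One toy genome: two variants in GENE_B, one each in GENE_A, GENE_C, and an
# unclassified gene.
genome_variants = ["GENE_A", "GENE_B", "GENE_B", "GENE_C", "GENE_X"]
load = genomic_load(genome_variants)
```

Because a gene can sit in more than one category, the per-category counts need not sum to the number of variants in the genome.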
Still unknown is the relationship between the magnitude of an individual’s deviation from the average observed load within a given category and the prognostic impact of that deviation. More genomes will be required to establish these baselines for personal prognosis. The answer should be forthcoming, as additional genomes will allow significance thresholds to be set for each category and for each ethnicity. One genome in our dataset, however, does show a very large deviation. Genome 9 has a much increased genomic load within the aging category due to stop codons in the CDC27 gene. On further inspection (data not shown), we believe that this is a false positive. Several CDC27 pseudogenes exist, and it seems that for this genome there may have been a systematic error during the read-alignment phase of the variant-calling procedure. That such events occur is itself an important point: not every gene or region of a genome will be equally accessible for prognosis by genome resequencing. Such regions will need to be cataloged for accurate prognoses. Nonetheless, the fact that this individual stands out so clearly demonstrates the utility of the high-level summaries of genomic load made possible by an ontology-based approach, indicating that tools such as our disease-gene classification system will play a useful role in the future of whole-genome data management, analysis, and clinical prognosis and diagnostics.
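One simple way to surface a genome that stands out within a category, in the spirit of the Genome 9 observation, is to compare each genome's load with the mean and standard deviation of the remaining genomes. The sketch below is an assumption-laden illustration rather than the method of this study: the load values are invented, the leave-one-out z-score is a swapped-in statistic, and the cutoff of 3.0 is arbitrary, since no significance thresholds have yet been established.

```python
from statistics import mean, stdev


def leave_one_out_z(loads):
    """Score each genome against the mean and spread of all other genomes.

    loads maps a genome label to its load in one disease category.
    Leaving the scored genome out keeps a single extreme value from
    inflating the spread it is compared against.
    """
    scores = {}
    for name, value in loads.items():
        others = [v for k, v in loads.items() if k != name]
        spread = stdev(others)  # sample standard deviation; needs at least 2 values
        scores[name] = (value - mean(others)) / spread if spread > 0 else float("nan")
    return scores


# Invented loads for a single category; "genome_9" is deliberately extreme.
aging_load = {"genome_1": 4, "genome_2": 5, "genome_3": 6, "genome_4": 5, "genome_9": 20}
z = leave_one_out_z(aging_load)

# Arbitrary illustrative cutoff, not an established significance threshold.
flagged = [genome for genome, score in z.items() if abs(score) > 3.0]
```

A flagged genome is a prompt for inspection rather than a finding: as the CDC27 case shows, an extreme load can reflect an alignment artifact instead of true variation.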