We identified and validated more than 3 million unique SNPs in rhesus macaques through pairwise comparisons of SNP data sets (Table ). Our starting point for these analyses was a list of 4.3 million inferred heterozygous positions in the original Sanger sequencing data for the reference animal [23]. Comparison of these Sanger data with corona_lite SNP calls from SOLiD re-sequencing of the reference animal resulted in validation of 1,056,266 heterozygous positions (Figure ). Similarly, re-sequencing of two additional Indian-origin animals, followed by comparison to the reference animal SNP list, resulted in the validation of an additional 873,552 SNPs.
Bioinformatically validated SNPs
Figure 1 Validation by category, direct comparison examples. Only the largest data sets are shown. (A) Validation in a single animal with multiple chemistries. All data for the reference animal 17573 are displayed by chemistry. SNPs falling into the overlap region ...
We were able to validate 1,058,581 additional SNPs among just three rhesus macaques using e-genotyping (http://is04607.com/~drio/egenotype/). E-genotyping is a novel in silico method to survey mapped read sequence data using 31 bp probe sequences created from a list of known or suspected SNPs. Unvalidated potential heterozygous positions from the original Sanger reference animal data were used to create uniquely mapping e-genotyping probes. Using these probes, we screened each of the SOLiD read sets (i.e. the reference animal and the two additional animals). This screen validated 527,425 SNPs that had not been validated by comparing the Sanger data with the corona_lite analyses of SOLiD data. Probes created from the corona_lite analyses of SOLiD data were then used to identify and validate 531,156 more SNPs.
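The probe-based screen described above can be illustrated with a minimal Python sketch. This is not the authors' e-genotyping tool: it assumes base-space reads held as plain strings (SOLiD data are natively in color space), assumes the SNP sits at the centre of each 31 bp probe, and uses exact substring matching on both strands. The function names `make_probes` and `egenotype` and the toy reference sequence are hypothetical.

```python
from typing import Dict, Iterable

PROBE_LEN = 31            # probe length given in the text
FLANK = PROBE_LEN // 2    # 15 bp per side; centring the SNP is an assumption

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def make_probes(reference: str, pos: int, alleles: Iterable[str]) -> Dict[str, str]:
    """Build one 31 bp probe per allele for a SNP at 0-based position `pos`."""
    if pos < FLANK or pos + FLANK >= len(reference):
        raise ValueError("SNP too close to the sequence end for a full probe")
    left = reference[pos - FLANK:pos]
    right = reference[pos + 1:pos + 1 + FLANK]
    return {allele: left + allele + right for allele in alleles}


def egenotype(probes: Dict[str, str], reads: Iterable[str]) -> Dict[str, int]:
    """Count the reads that contain each allele probe exactly, on either strand."""
    counts = {allele: 0 for allele in probes}
    for read in reads:
        for allele, probe in probes.items():
            if probe in read or revcomp(probe) in read:
                counts[allele] += 1
    return counts


# Toy data (hypothetical): a 41 bp reference with a G/A SNP at position 20.
reference = "GATTACAGGCTTAACCGTATGCCATAGGCTAACGTTAGCCA"
probes = make_probes(reference, pos=20, alleles=["G", "A"])
reads = [reference[2:40], revcomp(probes["A"])]  # one read per allele
counts = egenotype(probes, reads)
print(counts)
```

Under this sketch, a position would be treated as supported when reads match the probes for both alleles. The actual decision rule and any read-count thresholds used in the study are not stated in this section.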
Finally, we used the smaller published datasets to complete a comprehensive list of validated SNPs. We validated 49,747 SNPs by comparing the results from the corona_lite re-sequencing analyses with previously published datasets. A small number of SNPs (n = 20) were validated through comparisons among the previously published data. These 20 SNPs were not found in our new SOLiD re-sequencing data (Figure S1, Additional file 1).
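As a consistency check on the counts reported so far, the short Python tally below sums the validation categories. It assumes the categories are mutually exclusive, as the wording "additional" suggests. The dictionary keys are descriptive labels chosen here, not names from the study.

```python
# Validated SNP counts per category, as reported in the text.
validated = {
    "Sanger vs corona_lite (reference animal)": 1_056_266,
    "two additional animals vs reference SNP list": 873_552,
    "e-genotyping, Sanger-derived probes": 527_425,
    "e-genotyping, corona_lite-derived probes": 531_156,
    "corona_lite vs published datasets": 49_747,
    "published datasets only": 20,
}

# The two e-genotyping categories together.
egenotyping_total = (
    validated["e-genotyping, Sanger-derived probes"]
    + validated["e-genotyping, corona_lite-derived probes"]
)

total = sum(validated.values())

print(f"e-genotyping total: {egenotyping_total:,}")
print(f"all validated SNPs: {total:,}")
```

The e-genotyping subtotal equals the 1,058,581 SNPs quoted above, and the grand total equals the 3,038,166 validated SNPs used later in the comparison with human dbSNP.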
The genomic distribution of the validated SNPs appears to be random across the autosomes (Figure ). SNP distribution by chromosome was consistent with chromosome length, except for the X chromosome. X chromosome SNP discovery is limited in this study because one of the three study animals was male (r02120). Across the autosomes, the density of SNPs per Mb falls within one standard deviation (195.3 SNPs/Mb) of the overall mean (1,057.1 SNPs/Mb). We retained lists of the unvalidated SNPs in order to facilitate future efforts at SNP discovery in genomic/genic regions where coverage is relatively sparse. These data are provided on our Genboree website (http://genboree.org/java-bin/project.jsp?projectName=Rhesus%20SNPs%20using%20Next-Gen%20Sequencing&isPublic=Yes) in the "Rhesus SNPs" database in a track called "Unvalidated:SNPs".
Figure 2 Genomic distribution of SNPs. (A) The total number of SNPs identified in each chromosome. (B) The second bar graph shows the relative concentrations of SNPs validated in SNPs/Mb by chromosome. Only chromosome X (**) displays a SNP concentration that deviates ...
Unidentified nearly identical duplications of sequence (e.g. unrecognized segmental duplications) in the rhesus whole genome assembly are one potential cause of false positive SNPs. If nearly identical duplications were mistakenly collapsed in the genome assembly, then wherever there are single base differences between two copies of a sequence, the reference animal would appear heterozygous for a SNP. The other two animals would also appear heterozygous for the same SNP, thus providing false validation. We addressed this issue in several ways. First, we note that this problem would produce false heterozygote calls across all three animals, and we found that only ~10% of all of our validated SNPs are scored as heterozygous in all three animals (note, however, that the animal with the lowest coverage, r02120, limits the number of SNPs identified as heterozygous in all sequenced animals). Among the 306,782 SNPs that are heterozygous in all subjects, we can evaluate the impact of cryptic sequence duplication in two ways. We compared Sanger read coverage for these 306,782 SNPs against an equivalent number of SNPs that were scored as heterozygous in two animals and homozygous in one animal. Unrecognized duplications should have higher than average read coverage in the original 5.2× Sanger whole genome read data. We found (Figure S2, Additional file 2) that the mean coverage and the read coverage distributions for the two sets of SNPs are virtually identical. If a substantial fraction of the "three heterozygote" SNPs were false positives due to unrecognized duplications in the genome assembly, we would expect a higher mean read coverage for that set compared with the control SNPs that are not scored as heterozygous in all three animals. The percentage of SNPs with read coverage of 11 or greater (i.e. twice the genome-wide average) was only 3.8% for SNPs detected in all three individuals and 3.5% for SNPs detected in only two individuals (0.38% and 0.35%, respectively, of all of our validated SNPs).
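The arithmetic behind the coverage argument can be retraced with a few lines of Python. The sketch uses only figures quoted in the text (5.2× mean Sanger coverage, 306,782 three-heterozygote SNPs, and the 3,038,166 validated SNPs reported for the full set). The helper `fraction_high_coverage` is hypothetical and simply shows how the "11 or greater" percentage would be computed from per-SNP coverage values, which are not reproduced here.

```python
import math
from typing import Iterable

MEAN_SANGER_COVERAGE = 5.2     # genome-wide average stated in the text
TOTAL_VALIDATED = 3_038_166    # all validated SNPs in the study
HET_IN_ALL_THREE = 306_782     # SNPs heterozygous in all three animals

# "Twice the genome-wide average" is 10.4, so the integer cutoff is 11 reads.
threshold = math.ceil(2 * MEAN_SANGER_COVERAGE)

# Share of validated SNPs that are heterozygous in all three animals (~10%).
share_all_three = HET_IN_ALL_THREE / TOTAL_VALIDATED

# 3.8% of that subset, expressed as a share of all validated SNPs (~0.38%).
high_cov_share_of_all = 0.038 * share_all_three


def fraction_high_coverage(coverages: Iterable[int], cutoff: int) -> float:
    """Fraction of SNPs whose read coverage is at or above the cutoff."""
    values = list(coverages)
    return sum(c >= cutoff for c in values) / len(values)


print(threshold)
print(f"{share_all_three:.1%}")
print(f"{high_cov_share_of_all:.2%}")
```

Because the control set was chosen to be the same size, its 3.5% scales to about 0.35% of all validated SNPs in the same way.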
An alternative hypothesis is that duplicated sequences within the rhesus assembly were not assembled in the chromosome scaffolds and were instead placed in the "chromosome unknown" or chrUr bin. The tendency for duplications to be included in unknown chromosomes has been noted in other whole genome shotgun assemblies [31]. We therefore wished to determine whether our validated SNPs had homology to assembled contigs placed in chrUr, which might represent unrecognized duplications. To test this, we split chrUr into 150 bp "reads" and scanned this simulated "read" data using e-genotype probes created from all of our validated rhesus SNPs (3,271,622 probes, representing the reference position at all locations as well as the one or two non-reference validated alleles). None of the probes matched sequence from the unmapped reference sequence provided in chrUr. Given these various results, we conclude that there is no clear evidence for a significant number of false positive SNP calls due to duplicated genomic regions in our SNP data.
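The chrUr test can be sketched as follows, again in Python and under stated assumptions. The text does not say whether the 150 bp simulated reads overlapped. The sketch exposes a `step` parameter and shows why it matters: with non-overlapping windows, a 31 bp probe that straddles a window boundary would be missed, whereas a step of 150 - 30 = 120 bp guarantees that every 31-mer lies wholly inside some window. The function names and toy sequence are hypothetical.

```python
from typing import Iterable, Iterator, Set

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def simulated_reads(sequence: str, read_len: int = 150, step: int = 150) -> Iterator[str]:
    """Cut a sequence into windows of `read_len`, advancing by `step` bases."""
    for start in range(0, len(sequence), step):
        yield sequence[start:start + read_len]


def matching_probes(probes: Iterable[str], reads: Iterable[str]) -> Set[str]:
    """Return the probes found exactly, on either strand, in at least one read."""
    read_list = list(reads)
    return {
        p for p in probes
        if any(p in r or revcomp(p) in r for r in read_list)
    }


# Toy sequence (hypothetical): a 31 bp probe placed across the 150 bp boundary.
probe = ("ACGT" * 8)[:31]
sequence = "A" * 135 + probe + "A" * 134

hits_no_overlap = matching_probes([probe], simulated_reads(sequence, step=150))
hits_overlap = matching_probes([probe], simulated_reads(sequence, step=120))
print(len(hits_no_overlap), len(hits_overlap))
```

A negative result such as the one reported (no probe matching chrUr) is most convincing when the windows overlap by at least the probe length minus one, or when the probes are matched against the unsplit sequence directly.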
SNP annotation and PolyPhen Analysis
All the validated SNPs were further examined to characterize their position relative to known coding regions and possible functionality (Figure ). Due to the incomplete annotation of the rhesus reference genome, the annotations of rhesus SNPs are based in part on mapping to homologous human transcripts. As expected, most validated SNPs map to intergenic regions (84.2% of validated SNPs). Intronic SNPs were the second most prevalent at 15.3%. Very few (601 or 0.02%) of the SNPs fell into known or predicted splice sites. Among coding SNPs, about equal numbers are synonymous changes (4,605, 0.15% of total) and non-synonymous (4,472, 0.15% of total). A number of SNPs fell into regions of alternative splicing such that the genomic placement of the SNP varied by transcript. All such SNPs (n = 418) are associated with a coding sequence of at least one transcript, and with an alternative transcript where the placement differed: synonymous coding (21.5%), non-synonymous coding (19.1%), 5'UTR (28.9%), 3'UTR (30.4%).
Figure 3 SNP annotation. Genomic context of annotated validated SNPs. All annotations were obtained from Ensembl release 57 (released March 2010). The number of SNPs that fit into each category represented by a bar is listed over the respective bar. The bar for ...
Potentially functional polymorphisms in rhesus macaques are of particular interest, especially those in genes that may correlate with human disease. The rhesus genome annotation is incomplete and based primarily upon gene prediction algorithms that rely upon human-to-rhesus homology [32]. Therefore, we relied upon a homology-based method (see Methods) to generate hypotheses about the functionality of non-synonymous SNPs (nsSNPs) (Table ). A total of 4,439 SNPs from the full list of nsSNPs (4,472) were successfully converted from rheMac2 coordinates to hg18 coordinates. After removing all SNPs where the hg18 reference allele matched the new variant macaque allele, 4,177 nsSNPs were submitted to PolyPhen-2 (http://genetics.bwh.harvard.edu/pph2/). Of those, 411 were scored as probably damaging and 325 as possibly damaging. SNPs that affect binding sites or are located in transmembrane regions can often be expected to alter protein function. Twelve of the probably damaging mutations are annotated as falling in transmembrane regions, two affect a metal binding site, and one affects a modified residue. Of the possibly damaging mutations, six are in transmembrane regions, one is in a propeptide, two affect modified residues, and one affects a carbohydrate binding site. We performed GeneGo (http://www.genego.com/) pathway analysis to determine which diseases were associated with the genes containing predicted deleterious rhesus SNPs from the PolyPhen-2 analysis. These genes were involved in one or more critical organismal processes, such as apoptotic pathways, DNA repair, development or inflammation. Table displays a non-exhaustive list of diseases associated with the relevant pathways, many of which are due to a few well-known genes: Brca2, and Casp7.
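The filtering steps that precede the PolyPhen-2 submission can be sketched in Python. This is an illustration under assumptions, not the authors' pipeline: the `NsSnp` record and `select_for_polyphen` function are hypothetical, alleles are assumed to be reported on the same strand in both genomes, and a failed coordinate conversion is represented by `human_ref = None`. The counts at the end are those given in the text.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class NsSnp:
    snp_id: str
    macaque_ref: str
    macaque_alt: str              # the new variant macaque allele
    human_ref: Optional[str]      # hg18 reference allele; None if conversion failed


def select_for_polyphen(snps: Iterable[NsSnp]) -> List[NsSnp]:
    """Keep converted nsSNPs whose variant allele differs from the human reference."""
    converted = [s for s in snps if s.human_ref is not None]
    return [s for s in converted if s.human_ref != s.macaque_alt]


# Toy records (hypothetical) covering the three possible outcomes.
toy = [
    NsSnp("snp1", "G", "A", "G"),    # kept: variant differs from human reference
    NsSnp("snp2", "C", "T", "T"),    # dropped: variant equals human reference
    NsSnp("snp3", "A", "C", None),   # dropped: coordinate conversion failed
]
kept = [s.snp_id for s in select_for_polyphen(toy)]

# Counts reported in the text.
all_ns, converted_n, submitted_n = 4_472, 4_439, 4_177
failed_conversion = all_ns - converted_n          # not converted to hg18
matched_human_ref = converted_n - submitted_n     # variant equals human reference
damaging = 411 + 325                              # probably + possibly damaging
damaging_share = damaging / submitted_n

print(kept, failed_conversion, matched_human_ref, f"{damaging_share:.1%}")
```

Dropping variants that equal the human reference allele is a reasonable design choice for a homology-based screen, since a tool built around human proteins has no substitution to score when the macaque variant is the human wild-type residue.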
Diseases associated with genes containing nsSNPs
Finally, we compared the list of validated Indian-origin rhesus macaque SNPs to the locations of known SNPs in the human genome (dbSNP, build 132). The evolutionary lineage leading to rhesus macaques diverged from the human lineage about 22-26 million years ago, and the two genomes have diverged more than 6% in overall DNA sequence since that time [23]. Consequently, except in unusual loci such as HLA, homologous basepair positions that are polymorphic in both species are highly unlikely to represent shared ancestral polymorphism that has been retained for more than 20 million years in both lineages. Rather, most shared SNP positions will reflect parallel mutational events that created variation at the same site in both species. We attempted to convert all 3,038,166 validated rhesus SNPs from this study to hg19 human coordinates using the Galaxy-hosted UCSC liftOver tool, and 2,775,850 of them converted successfully. Of these converted SNPs, 90,086 exactly matched the location of human SNPs in dbSNP build 132, representing a 3.2% overlap. Build 132 contains 29,100,846 human SNPs (chromosomes 1-22, X), covering about 1% of the human genome. If the specific locations of SNPs in the human and rhesus genomes were entirely uncorrelated, we would therefore expect only about one-third as many polymorphic basepair positions to be shared between the two species. This correlation between the locations of SNPs in rhesus monkeys and humans is almost certainly influenced in part by constraints on which sites within each genome can tolerate polymorphism. But other mechanisms, such as correlations in mutation rates at homologous sites across primate genomes, may also be involved.
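The "about one-third" expectation can be reproduced with a back-of-the-envelope calculation in Python. It takes the text's own approximation that human dbSNP positions cover about 1% of the genome and treats each converted rhesus SNP as landing on a human SNP position independently with that probability. This is a simplified null model, not the authors' stated procedure.

```python
TOTAL_VALIDATED = 3_038_166     # rhesus SNPs submitted to liftOver
CONVERTED = 2_775_850           # successfully converted to hg19
OBSERVED_SHARED = 90_086        # positions matching human dbSNP build 132
HUMAN_SNP_FRACTION = 0.01       # "about 1% of the human genome" (from the text)

conversion_rate = CONVERTED / TOTAL_VALIDATED

# Observed overlap as a share of the converted SNPs (~3.2%).
observed_rate = OBSERVED_SHARED / CONVERTED

# Expected number of shared positions if SNP locations were uncorrelated.
expected_shared = CONVERTED * HUMAN_SNP_FRACTION

# Expected count relative to observed count (~0.31, i.e. about one-third).
expected_to_observed = expected_shared / OBSERVED_SHARED

print(f"converted: {conversion_rate:.1%}")
print(f"observed overlap: {observed_rate:.1%}")
print(f"expected shared positions: {expected_shared:,.0f}")
print(f"expected / observed: {expected_to_observed:.2f}")
```

Put the other way round, the observed overlap is roughly three times what the uncorrelated model predicts, which is the excess the paragraph attributes to shared constraints and possibly correlated mutation rates.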