We propose the term “Single Nucleotide Difference” (SND) for these artifactual polymorphisms and suggest that up to 8.32% of the biallelic coding SNPs in the NCBI dbSNP database are likely to be SNDs (Supp. Table S2).
Summary of the SNPs analyzed using bioinformatics and those found to be SNDs
We used a comprehensive bioinformatic approach to identify potential SNDs in the NCBI dbSNP database (Build 129). The 14,319,123 reported SNPs were parsed to select a total of 119,932 biallelic coding SNPs for SND analysis. The sequence surrounding each SNP, as reported in the rs record of dbSNP, was aligned against the human genome. Only alignments that included the SNP position and had at least 90% sequence identity (SID) over at least 20% of the full length of the SNP sequence (the sequence coverage, SC) were selected as mapped loci.
A small number of SNPs (0.22% of the biallelic, coding SNPs) could not be mapped to the genome for reasons discussed in the Material and Methods section. SNPs producing only one alignment (82.12% of the biallelic, coding SNPs) with 90% SID and 20% SC were considered correct SNPs since they could be mapped unambiguously to only one chromosomal position. SNPs with two or more alignments (17.67% of the biallelic, coding SNPs) were considered to be potential SNDs and examined further.
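The mapping rule described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `Alignment` record, its field names, and the function names are hypothetical, and only the 90% SID and 20% SC thresholds come from the text.

```python
from dataclasses import dataclass

MIN_SID = 0.90  # minimum sequence identity (from the text)
MIN_SC = 0.20   # minimum sequence coverage (from the text)


@dataclass
class Alignment:
    """Hypothetical record for one genome alignment of a SNP flanking sequence."""
    identity: float    # fraction of identical bases in the aligned region
    coverage: float    # aligned length / full length of the SNP sequence
    covers_snp: bool   # True if the alignment includes the SNP position


def mapped_loci(alignments):
    """Keep only alignments that satisfy the SID/SC thresholds and span the SNP."""
    return [
        a for a in alignments
        if a.covers_snp and a.identity >= MIN_SID and a.coverage >= MIN_SC
    ]


def classify_by_mapping(alignments):
    """Classify a SNP by the number of loci it maps to."""
    n = len(mapped_loci(alignments))
    if n == 0:
        return "unmapped"
    if n == 1:
        return "correct SNP"
    return "potential SND"  # two or more loci: examined further
```

For example, a SNP with one alignment at 99% identity and a second at 85% identity would still count as a correct SNP under this sketch, because the second alignment fails the SID threshold.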
Of the 21,184 SNPs aligning with multiple loci, 9,979 were SNDs, 10,357 were correct SNPs, and 848 were undetermined. In total, 9,979 SNPs were determined to be SNDs (8.32% of the biallelic, coding SNPs), 108,842 SNPs were determined to be accurate (90.75%), and the remaining 1,111 SNPs were undetermined (0.93%).
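As a quick consistency check, the counts reported above can be recomputed directly. The variable names are illustrative; all numbers are taken from the text.

```python
TOTAL_BIALLELIC_CODING = 119_932

# Breakdown of the SNPs that aligned with multiple loci
multi_locus = {"SND": 9_979, "correct": 10_357, "undetermined": 848}
assert sum(multi_locus.values()) == 21_184

# Overall breakdown of all biallelic coding SNPs
overall = {"SND": 9_979, "accurate": 108_842, "undetermined": 1_111}
assert sum(overall.values()) == TOTAL_BIALLELIC_CODING

# Percentages of the biallelic coding SNPs, rounded to two decimals
percentages = {
    label: round(100 * count / TOTAL_BIALLELIC_CODING, 2)
    for label, count in overall.items()
}
print(percentages)  # {'SND': 8.32, 'accurate': 90.75, 'undetermined': 0.93}
```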
The SNDs were examined further for additional evidence by comparison with heterozygosity (H) data. In particular, when the heterozygosity value of the SNP is between 0.4 and 0.5, the SNP is more likely to be a SND. In fact, in many SNP discovery procedures the co-amplification by polymerase chain reaction (PCR) of two paralogous genes can be misinterpreted as a SNP with H>0.4. In our analysis there were 3,121 SNDs with exactly two genomic mappings; of these, 456 (14.61%) had heterozygosity scores in excess of 0.4. We further classified these 456 SNPs into two groups depending on the validation code associated with the SNP (Supp. Table S2). The first group of 189 SNDs are classified as “very strong SNDs” as they have validation code <4, while the second group of 267 SNDs are “strong SNDs” with validation code >4. A lower validation code indicates less evidence for the existence of the SNP, and validation code 4 indicates that at least one of the cluster of submitted SNPs (ssSNPs) associated with the reference SNP (rsSNP) has been experimentally validated.
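The link between co-amplification and H values near 0.5 can be illustrated with the standard expected heterozygosity of a biallelic locus, H = 2p(1 − p). This is a sketch under the assumption that the reported H values follow that formula. If two co-amplified paralogues make every sample look heterozygous, the apparent allele frequency is p = 0.5 and H reaches its maximum of 0.5. The function names below are hypothetical. The sketch leaves validation code 4 unassigned because the text only states the <4 and >4 groups.

```python
def expected_heterozygosity(p):
    """Expected heterozygosity of a biallelic locus with allele frequency p."""
    return 2 * p * (1 - p)


def snd_strength(n_mappings, heterozygosity, validation_code):
    """Grade a SND using the criteria stated in the text.

    Returns None when the SND does not qualify, or when the validation
    code equals 4 (the text as written does not assign that case).
    """
    if n_mappings != 2 or heterozygosity <= 0.4:
        return None
    if validation_code < 4:
        return "very strong SND"
    if validation_code > 4:
        return "strong SND"
    return None


# Apparent 50:50 allele mix from co-amplified paralogues
print(expected_heterozygosity(0.5))  # 0.5

# Counts from the text: 189 + 267 = 456, which is 14.61% of 3,121
assert 189 + 267 == 456
assert round(100 * 456 / 3_121, 2) == 14.61
```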
The list includes all the reported SNPs in the NCBI database for the AKR1C1 and AKR1C2 genes. We classified as SNDs all the SNPs reported in the SNP database that could not be found in a population of 100 non-related individuals and whose variant allele was the same nucleotide as that of the paralogue contig at the corresponding position (i.e. a SND corresponds to a mismatch in the alignment of the two genes). Of the 31 SNPs experimentally analyzed, we found 22 to be SNDs and 9 to be non-SNDs.
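The experimental criterion above can be expressed as a small check. This is a sketch only: the function names are hypothetical and the two sequences are made-up toy strings, not AKR1C1 or AKR1C2 sequence.

```python
def paralogue_mismatches(seq_a, seq_b):
    """Return {position: (base_a, base_b)} for mismatches in two pre-aligned,
    equal-length paralogue sequences (0-based positions)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return {
        i: (a, b)
        for i, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper()))
        if a != b
    }


def is_experimental_snd(variant_allele, paralogue_base, observed_in_population):
    """A reported SNP is called a SND when the variant allele was not found in
    the genotyped population and equals the paralogue base at that position."""
    return (not observed_in_population) and (
        variant_allele.upper() == paralogue_base.upper()
    )


# Toy aligned fragments (hypothetical)
gene = "ATGGATTCC"
paralogue = "ATGCATTCA"
mismatches = paralogue_mismatches(gene, paralogue)
print(mismatches)  # {3: ('G', 'C'), 8: ('C', 'A')}

# A reported G>C SNP at position 3, never seen in the population sample
print(is_experimental_snd("C", mismatches[3][1], observed_in_population=False))  # True
```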
The experimental and bioinformatic analyses were compared using the 31 SNPs experimentally analyzed. If the experimental data above are used as a reference for the classification of SNPs as SNDs, the bioinformatic predictions have a low false positive rate (2 false positives among 9 experimentally confirmed non-SNDs) and a very low false negative rate (1 false negative among 22 experimentally confirmed SNDs).
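The error rates implied by these counts can be worked out explicitly. The variable names are illustrative; the four input counts come from the text, and the experimental classification is treated as the reference.

```python
confirmed_snds = 22       # experimentally confirmed SNDs
confirmed_non_snds = 9    # experimentally confirmed non-SNDs
false_negatives = 1       # confirmed SNDs the bioinformatic analysis missed
false_positives = 2       # confirmed non-SNDs the analysis called SNDs

true_positives = confirmed_snds - false_negatives        # 21
true_negatives = confirmed_non_snds - false_positives    # 7

false_positive_rate = false_positives / confirmed_non_snds
false_negative_rate = false_negatives / confirmed_snds
accuracy = (true_positives + true_negatives) / (confirmed_snds + confirmed_non_snds)

print(round(false_positive_rate, 3))  # 0.222
print(round(false_negative_rate, 3))  # 0.045
print(round(accuracy, 3))             # 0.903
```

With only 9 non-SNDs in the reference set, the false positive rate rests on a very small denominator.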
PCR amplifications of exons containing the reported SNPs were performed using gene-specific primers (Supp. Table S1). PCR conditions used to produce gene-specific target regions were obtained through optimization of PCR reactions. The parameters adjusted were annealing temperature, MgCl2 concentration, number of cycles, and annealing time.
In general, specific amplification is complicated when the target gene has one or more highly similar genes (paralogues) in the genome. Non-specific products, commonly known as bias, are usually due to annealing of the primers to regions in the genome different from the target. Promiscuous primers anneal non-specifically when their sequence has 100% identity with two or more chromosomal locations and/or when PCR conditions favor non-specific primer annealing (e.g. low annealing temperatures). The success of gene-specific amplification relies on primer design. We show how SNDs could originate when non-specific or “promiscuous” primers are designed (Supp. Table S1), that is, primers that do not discriminate between the two paralogous genes. In another experiment, we used specific primers for the AKR1C1 gene and showed that even with good primer design one could obtain co-amplification (mixed PCR products) if PCR conditions are sub-optimal.
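The notion of a promiscuous primer (100% identity with two or more locations) can be illustrated with a simple exact-match check. This is a deliberately simplified sketch: real primer specificity also depends on near-matches and on PCR conditions, which it ignores. The function names and the two template strings are hypothetical toy data, not the primers of Supp. Table S1.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def binding_targets(primer, templates):
    """Names of templates containing an exact match to the primer on either strand."""
    primer = primer.upper()
    rc = reverse_complement(primer)
    return [
        name for name, seq in templates.items()
        if primer in seq.upper() or rc in seq.upper()
    ]


def is_promiscuous(primer, templates):
    """True if the primer matches two or more templates exactly."""
    return len(binding_targets(primer, templates)) >= 2


# Toy paralogous fragments differing at two positions (hypothetical)
templates = {
    "paralogue_A": "GGATCCTTAGCAGTACGT",
    "paralogue_B": "GGATCCTTAGCTGTACGA",
}

print(is_promiscuous("GGATCCTTAGC", templates))   # True: shared region
print(binding_targets("TAGCAGTACGT", templates))  # ['paralogue_A']
```

Placing the primer over positions where the paralogues differ, as in the second call, is what makes it gene-specific in this toy setting.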
SNDs in exon 2 of the AKR1C1 gene and exon 4 of the AKR1C2 gene.
Figure 5. AKR1C1 and AKR1C2 co-amplification of exon 2 and exon 4, respectively, in suboptimal PCR conditions using AKR1C1-specific primers for AKR1C1 exon 2 (Supp. Table S1). Panels A and B show electropherograms of mixed AKR1C1/AKR1C2 PCR products using sub-optimal ...
To assess the potential practical implications of SNDs, we also examined the presence of SNDs among the Illumina and Affymetrix SNP submissions to NCBI. The list of SNPs and their flanking sequences was downloaded from the dbSNP database. We found 35,785 SNPs from our set of biallelic coding SNPs on the Illumina array; of these, 582 (1.6%) were SNDs. Similarly, we found 8,647 SNPs from our set on the Affymetrix array and, of these, 215 (2.5%) were SNDs. There are thus real, practical implications associated with the presence of SNDs in the database.
Moreover, to identify specific examples of SNDs that have been associated with disease, we examined several potential sources of information associating SNPs with disease: Online Mendelian Inheritance in Man (OMIM; Amberger et al., 2009), the database of Genotypes and Phenotypes (dbGaP, http://www.ncbi.nlm.nih.gov/sites/entrez?db=gap), MedRedSNP (Rhee and Lee, 2009), the Office of Population Genetics (OPG) catalog of genome wide association studies (Hindorff et al., 2009), and SNPedia (www.snpedia.com). Of these, only the latter two provided information associating individual SNPs with disease in a format suitable for automated analysis. The OPG catalog contains SNP-trait associations with p-values < 1 × 10⁻⁵ drawn from PubMed literature searches and other sources. SNPedia provides information on associations between SNPs and disease drawn from a range of sources, including literature reports and manual extracts from other databases such as OMIM.
Of the 6,344 SNPs available for download from SNPedia (www.snpedia.com/files/gbrowse/snpedia) and the 1,979 SNPs available for download from the OPG catalog, there were 6,486 unique SNPs, indicating a significant overlap between the two sources of data. Of these 6,486 SNPs, 1,849 were among the biallelic, coding SNPs examined in this study. Of these, 50 were SNDs and 1,799 were not SNDs. Thus 0.5% of SNPs identified as SNDs have a disease association in SNPedia compared with 1.7% of those SNPs identified as not SNDs.
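The overlap and the two percentages follow from the counts given in the text. In this sketch the variable names are illustrative, and the denominators are an assumption: the 9,979 SNDs and the 108,842 SNPs determined to be accurate, which reproduce the reported 0.5% and 1.7%.

```python
snpedia_snps = 6_344
opg_snps = 1_979
unique_snps = 6_486

# SNPs present in both sources (inclusion-exclusion)
overlap = snpedia_snps + opg_snps - unique_snps
print(overlap)  # 1837

# Disease-associated SNPs within the study set
disease_snds = 50
disease_non_snds = 1_799
assert disease_snds + disease_non_snds == 1_849

# Assumed denominators: all SNDs and all accurate SNPs in the study
all_snds = 9_979
all_accurate = 108_842

pct_snds_with_disease = round(100 * disease_snds / all_snds, 1)
pct_non_snds_with_disease = round(100 * disease_non_snds / all_accurate, 1)
print(pct_snds_with_disease, pct_non_snds_with_disease)  # 0.5 1.7
```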
The fifty SNDs were found only in the SNPedia database. Manual examination of the SNPedia records for the fifty SNDs showed 32 with either cross-references to OMIM or direct references to the scientific literature for a range of “disorders” varying from the benign (e.g. blue eye color) to severe, highly prevalent disorders such as melanoma, schizophrenia, age-related macular degeneration, and diabetes. Thus, even in this small set of disease-associated SNPs, practical issues resulting from the presence of SNDs in the database do occur.