SNPedia is a Semantic MediaWiki site (1) that is edited and updated by both automatic and manual means. It is intended to be formatted in a manner supporting automated report analysis and generation by associated software (such as Promethease, discussed below), while still retaining a level of legibility for both casual and frequent users.
The NCBI dbSNP database (2) catalogs over 10 million non-redundant Reference SNPs. Each is assigned an identifier which begins with the two-letter code rs and then a unique number. This identifier is in widespread use, such as on large-scale microarrays, and is now common in scientific literature. It also became accessible to non-scientists via Direct To Consumer (DTC) genotyping services. The rs# identifier provides a precisely defined location in the genome and is easily parsed from the scientific literature. It is stable across genome builds, so it does not require researchers to periodically remap old coordinates onto newer builds. While most commonly representing a single nucleotide polymorphism (SNP), dbSNP is not limited to single nucleotide variants and thus an rs# may also represent indels of varying size. For example, rs333 is an indel of 32 nt. In contrast, dbVAR (2) covers larger variants. In terms of species coverage, while dbSNP is species agnostic, SNPedia focuses on human data at this time but does contain some non-human rs#s to ensure future support.
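The claim that rs identifiers are easily parsed from text can be illustrated with a short sketch. It assumes the simple lowercase form "rs" followed by digits; the function name `extract_rs_ids` and the sample sentence are hypothetical and are not part of SNPedia or dbSNP.

```python
import re

# Assumed identifier shape: lowercase "rs" followed by one or more digits.
# The word boundaries avoid matches inside longer tokens such as "hrs123".
RS_PATTERN = re.compile(r"\brs(\d+)\b")

def extract_rs_ids(text):
    """Return the distinct rs identifiers in text, in order of first appearance."""
    seen = []
    for match in RS_PATTERN.finditer(text):
        rs_id = "rs" + match.group(1)
        if rs_id not in seen:
            seen.append(rs_id)
    return seen

sentence = "The indel rs333 and the variant rs113993960 are in dbSNP; rs333 spans 32 nt."
print(extract_rs_ids(sentence))
```

A pattern this simple works because the identifier carries no build-specific coordinates that would need to be interpreted alongside it.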
Types of content
The basic units of information in SNPedia are single or multiple nucleotide positions in the human genome known to vary in either germline or somatic contexts. SNPedia currently collects information on single nucleotide variants such as SNPs and mutations, or more specifically, on genotypes composed of one or more variant loci. A page of content is created based on a single rs-identifier (as maintained in dbSNP), and then typically three associated pages are created, reflecting the three possible genotypes for that SNP (homozygous for the major allele, heterozygous and homozygous for the minor allele at that locus). To the extent possible, a summary of the odds ratios for one or more associated medical conditions is reported for each genotype. Sets of genotypes from unlinked loci, known as genosets, are also defined in SNPedia. Additional content types include the genes, phenotypes, medical conditions, and drug interactions reported to be associated with these variants.
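As a minimal sketch of the page structure described above, the three genotype labels for a biallelic SNP can be enumerated as follows. The label format `rsN(A;B)`, the function name `genotype_labels`, and the identifier `rs1234` with alleles G and A are illustrative assumptions, not data taken from SNPedia.

```python
from itertools import combinations_with_replacement

def genotype_labels(rs_id, major, minor):
    """List the three unordered genotypes of a biallelic SNP.

    The order is: homozygous major, heterozygous, homozygous minor.
    """
    pairs = combinations_with_replacement([major, minor], 2)
    return [f"{rs_id}({a};{b})" for a, b in pairs]

# Hypothetical SNP used only for illustration.
print(genotype_labels("rs1234", "G", "A"))
```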
Users contribute both structured data and free text comments using a combination of standard MediaWiki syntax and Semantic Forms. Example properties include ‘Magnitude’, which gives a qualitative summary of significance on a 0–10 scale, and ‘Repute’ for consequences that can be classified as clearly ‘Good’ or ‘Bad’. Unlimited wiki text with hyperlinks, images and formatting provides the ability to communicate subtlety lacking from more structured data formats.
Sources of content
Data is collected from both bulk and individual sources. As a wiki, SNPedia receives user additions on a continuous basis, and these additions are augmented by periodic updates text-mined from public data sources. Content cites its source publications, in particular by PubMed PMID or DOI identifiers. SNPedia is committed to maintaining free access for personal users to all contributed information.
Criteria for inclusion
While SNPedia casts a broad net with regard to the creation of ‘rs-pages’ defining individual SNPs, genotype-specific pages are primarily only created for variants that have significant medical or genealogical consequences based on published meta-analyses, studies of at least 500 patients or two or more independent studies (i.e. replicated findings), or other historic, statistical or medical significance. This allows software creating personal genome reports, which are based on the genotypes carried by an individual, to create more robust reports. With some exceptions, SNPedia's genotype-specific pages generally do not include variants that are unreplicated or from studies with less statistical power. However, variants with high penetrance, such as ones that might also be reported in OMIM (3) or in LSDBs, are increasingly being added to SNPedia. Often these variants are so infrequent that they have not been observed in any populations sampled for variation, and therefore they are not present in dbSNP. In such cases, SNPedia submits their genomic data directly to dbSNP in order to have rs numbers assigned. After release in dbSNP, these variants are then added to SNPedia. An example of such a variant is the del-F508 mutation representing the most common cystic fibrosis-causing variant (4), now represented in dbSNP (and therefore SNPedia) as rs113993960. Anyone interested in having specific variants added to dbSNP in this fashion is encouraged to contact SNPedia.
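The inclusion criteria above can be read as a simple disjunction. The following sketch encodes that reading; it is not SNPedia's actual tooling, and the `Evidence` fields and function name are hypothetical. The final criterion (historic, statistical or medical significance) is an editorial judgment and is represented only as a flag.

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    """Hypothetical summary of the published evidence for one variant."""
    has_meta_analysis: bool = False
    largest_study_size: int = 0       # patients in the largest single study
    independent_studies: int = 0      # number of independent reports
    otherwise_significant: bool = False  # editorial judgment, not computed

def qualifies_for_genotype_pages(ev):
    """Apply the stated thresholds: any one criterion is sufficient."""
    return (
        ev.has_meta_analysis
        or ev.largest_study_size >= 500
        or ev.independent_studies >= 2
        or ev.otherwise_significant
    )

print(qualifies_for_genotype_pages(Evidence(largest_study_size=120, independent_studies=1)))
print(qualifies_for_genotype_pages(Evidence(independent_studies=2)))
```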
There are at least four levels of data curation. First, all additions and changes to SNPedia are reviewed by wiki users (including the editors). Second, Semantic MediaWiki templates flag certain warnings (such as SNPs with no assigned chromosome, or known by older/expired names) to bring them prominently to the readers' and editors' attention. Third, at least two independently developed software bots frequently crawl the entire site from the outside. They add supplementary information such as chromosome, position, gene and allelic data. While designed to be cautious and not replace any information entered by a human, these bots are often the first to detect irregularities. Fourth, SNPedia content is used in Promethease personal genome reports read by a diverse audience able to recognize problematic assertions. These readers are uniquely able and motivated to detect and report errors or nuances missed in the original research.
The SNPedia wiki has been accessible online since 2006, pre-dating the advent of the DTC genomic testing companies. Starting from under 1000 SNPs, it now has approximately 25 000 SNPs, 10 000 genotypes and 200 genosets. SNPs are associated with 200 or more medical conditions and 150 or more drugs. More importantly, the growth in both new entries per week and edits per page over the last 5 years has been steady and consistent. Additional statistics that are tracked include the number of SNPs associated with each disease category (for the 10 with the highest number of associated SNPs, the range is from 114 to 229; for all diseases, the average is 27 SNPs/condition), and the number of PubMed PMIDs per SNP (currently ranging from 29 to 60 for the 10 SNPs with the highest number of associated references).
SNPedia Growth. The total number of individual SNPs described in SNPedia for the last 4 years. Note that the y-axis is logarithmic.
Many interesting phenotypes are dependent on more than a single variant. As mentioned above, in order to accurately model this, SNPedia has introduced a notation for such sets of genomic variants, known as ‘genosets’, and implemented a parser in Promethease. For example, to recognize the two SNPs which define homozygosity for APOE4 the genoset criteria are noted exactly as follows:
and the possibility of having at least one APOE4 allele is represented as:
Genosets can refer to other genosets, and cyclical references are resolved. The boolean operators (‘and’, ‘or’, ‘not’) as well as ‘atleast(N, list)’ are also supported by Promethease. The present nomenclature makes it difficult to indicate that linked SNPs lie on the same strand (i.e. in cis). This is appropriate for data from microarrays and from unlinked loci, but it will require enhancing given the advent of full genome sequencing and phased data.
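To make the operator semantics concrete, the following sketch evaluates genoset criteria against a set of genotype calls. It does not reproduce the SNPedia notation or the Promethease parser: criteria are written as nested Python tuples, the genoset names are hypothetical, and treating a genoset that refers back to itself as false is only one possible way to resolve a cycle. The example assumes the two APOE4-defining SNPs are rs429358 and rs7412 with the C allele at each, and the carrier definition is deliberately simplified.

```python
def matches(calls, rs_id, a1, a2):
    """True if the call for rs_id equals the unordered allele pair (a1, a2)."""
    call = calls.get(rs_id)
    return call is not None and sorted(call) == sorted((a1, a2))

def evaluate(expr, calls, genosets, active=None):
    """Evaluate a nested-tuple criterion; 'active' tracks genosets in progress."""
    active = set() if active is None else active
    op, *args = expr
    if op == "geno":
        return matches(calls, *args)
    if op == "and":
        return all(evaluate(e, calls, genosets, active) for e in args)
    if op == "or":
        return any(evaluate(e, calls, genosets, active) for e in args)
    if op == "not":
        return not evaluate(args[0], calls, genosets, active)
    if op == "atleast":
        n, items = args
        return sum(evaluate(e, calls, genosets, active) for e in items) >= n
    if op == "gs":
        name = args[0]
        if name in active:   # assumption: a cyclic reference counts as false
            return False
        active.add(name)
        try:
            return evaluate(genosets[name], calls, genosets, active)
        finally:
            active.discard(name)
    raise ValueError(f"unknown operator: {op}")

# Hypothetical genoset definitions, for illustration only.
GENOSETS = {
    "apoe4_homozygous": ("and", ("geno", "rs429358", "C", "C"),
                                ("geno", "rs7412", "C", "C")),
    "apoe4_carrier": ("or", ("gs", "apoe4_homozygous"),
                            ("and", ("geno", "rs429358", "C", "T"),
                                    ("geno", "rs7412", "C", "C"))),
}

calls = {"rs429358": ("T", "C"), "rs7412": ("C", "C")}
print(evaluate(("gs", "apoe4_homozygous"), calls, GENOSETS))
print(evaluate(("gs", "apoe4_carrier"), calls, GENOSETS))
```

Because each genotype is compared as an unordered pair, this representation shares the limitation noted above: it cannot express whether two alleles lie in cis.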