Recent years have seen near exponential growth in knowledge regarding genetic and genomic variation as more genomes have been sequenced, and corresponding advances and economies of scale in sequencing and genotyping technologies have reduced their relative costs. In parallel with these developments, discoveries of genes contributing to monogenic and complex diseases have rapidly advanced, and bioinformatics databases and software relating to the collection and analysis of genetic data have increased in number, size and scope. Single nucleotide polymorphisms (SNPs), comprising the most abundant type of genetic variation, are now the principal raw material underlying most genetic studies and databases. While other types of variation including indels, microsatellites, copy number variants, and epigenetic markers remain important to consider and can impact disease, SNPs are largely the easiest to ascertain, and the most useful and widely applied markers in genetic studies in the modern age.
Researchers and clinician-researchers are confronted with a dizzying array of software choices and increasingly large and complex datasets and databases relating to SNPs, sometimes working without assistance from a geneticist or bioinformatician to help guide them. The principal aim of this review is to provide a comprehensive overview of available bioinformatics resources relating to human genetics research, with an emphasis on SNP-centered resources. The review also provides a resource for students seeking an introduction to SNP genetics resources and for wet lab molecular biologists conducting SNP-centered research who want to expand their knowledge on ways to apply SNP tools and databases. A number of important issues that affect users and developers of SNP bioinformatics resources are discussed throughout along with practical examples.
While many of the resources described have relevance and origins in the study of non-human species, this review focuses on human clinical applications. The review discusses basic SNP bioinformatics issues, critical databases and their uses, basic strategies and queries using APOE examples, software and tools relating to association studies, the prediction and validation of functional SNPs and miscellaneous SNP resources. The focus is primarily on academic resources that are widely available. Supplemental Table 4 provides URL links for all resources described in the text sections in order of their appearance. Key abbreviations and definitions often encountered in this paper, and other SNP-related papers, databases and informatics tools are given in Table 1.
There was a time when the existence of a reliable, comprehensive, centralized and public resource on genetic variation was uncertain. That time passed with the progressive development of NCBI's dbSNP into the definitive resource for this purpose, and its integration with other popular resources1. However, even with the establishment of reliable databases there are a number of central issues to SNP bioinformatics that still exist. These issues can create problems for both users and designers of SNP tools and databases. A core issue is the need for updates to SNP databases or tools to keep up with current information and discoveries of new variants, which dbSNP addresses through periodic, sequentially numbered releases (called “builds”). Any user of SNP resources should be aware of when those resources were developed and last updated. In some cases it will be important to tailor input information to a particular SNP database build.
A desirable feature in tracking SNP-related information would be to have a unique identifier associated with each SNP which does not change over time and would be universally applied in all databases and publications. Identifiers known as reference SNP identifiers (SNPids), or rsIDs, exist in dbSNP and partially address the issue of unique identifiers, also including identifiers for indels and repeat polymorphisms. However, as dbSNP grew due to additional submissions of SNP discoveries and improved mapping of SNPs to a more complete reference sequence it was realized that in some cases multiple rsIDs referred redundantly to the same SNP, resulting in the need to merge alias rsIDs. The result is that depending on when a paper was published or software designed, a query using a particular rsID may be unsuccessful if aliases for that SNP are not taken into account. Tables which detail such historic merges are available from dbSNP. A web-based tool, SNAP, takes aliasing into account and also has a feature that allows users to translate lists of rsIDs between current and historic dbSNP builds2.
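The alias problem described above can be illustrated with a minimal sketch. It assumes the merge history has already been loaded into a simple mapping from each retired identifier to the identifier it was merged into; the function name `resolve_rsid`, the table layout and the placeholder identifiers are hypothetical and do not reflect the actual format of the dbSNP merge tables or the behavior of SNAP.

```python
def resolve_rsid(rsid, merge_table):
    """Follow merge records until reaching an identifier with no further merge."""
    seen = set()
    current = rsid
    while current in merge_table:
        if current in seen:
            # A well-formed merge history should never loop back on itself.
            raise ValueError(f"cycle in merge table at {current}")
        seen.add(current)
        current = merge_table[current]
    return current


# Placeholder merge records (retired identifier -> identifier it was merged into).
# These are invented for illustration and are not real dbSNP entries.
example_merges = {
    "rsA": "rsB",
    "rsB": "rsC",
}

print(resolve_rsid("rsA", example_merges))  # follows rsA -> rsB -> rsC
print(resolve_rsid("rsD", example_merges))  # never merged, returned unchanged
```

Resolving every identifier in a published SNP list this way before querying a current database is one way to avoid silent misses caused by retired aliases.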
Some SNP bioinformatics tools and databases are only queryable via gene identifiers. Gene identifiers also suffer from potential aliasing and versioning problems. Users can consult the HUGO Gene Nomenclature Committee's online resource (http://genenames.org/) to translate their queries if necessary3. Finally, SNP databases and bioinformatics tools do grow obsolete, and sometimes are no longer stably maintained at the original URL. This can be due to lack of utility, interest, or additional funding support, or simply because the resource migrated to a different URL or was re-released under a new name or version4. The next sections discuss specific SNP databases and strategies and considerations for their use and navigation.
There are >800 databases of human genetic variation, but only a few central databases that are most widely used. These data sources can be split into a few categories, including: 1) common genetic variation, 2) rare genetic variation (discussed in the Supplement), and 3) databases of variation with additional functional or curated information added or integrated.
The largest database of common genetic variation is NCBI's dbSNP1, created after the Human Genome project discovered a significant number of common variants. dbSNP has grown exponentially in its lifetime, at the time of this submission encompassing information on ~18.4 million human variants, and ~34.9 million variants in >30 other species. With few exceptions the other databases, bioinformatics tools and experiments described in this review rely heavily on the underlying information from dbSNP. The database provides a central, freely available resource for tasks including but not limited to: 1) mapping known variation to the human genome, 2) providing identifiers for known and novel variants, 3) ascertaining known variation within or around one or more genes, and estimating the functional effects of variants, 4) designing assays to measure specific variants, 5) estimating prior support and validity that a variant truly exists, and 6) estimating population allele frequencies of a variant in a variety of populations. The dbSNP variants are mapped to the genome and included in genome browsers (NCBI, UCSC, EMBL), allowing users to integrate SNP information relatively easily with other features of genome annotation. dbSNP also features haplotype predictions, snpBLAST, which allows users to query sequences against dbSNP, and targeted databases including dbMHC, dbLRC and dbRBC. Information on individual SNPs can be retrieved including gene-related annotation, information on sample assay types and validation, the SNP submitters, and allele frequencies in measured populations. Batch querying of many SNPs and download of all information in dbSNP is also available.
Another database of central importance to SNP bioinformatics is the International HapMap project. The HapMap project began as a collaborative effort to comprehensively survey allele frequency and linkage disequilibrium (LD) patterns among common human genetic variants across worldwide populations, and now provides a critical platform of information for large-scale genetic association projects. The project has now progressed through 3 phases: Phase I5, Phase II6 and Phase III. The Phase III data release of HapMap currently contains information for ~1.6 million SNPs in 1,115 samples from 11 worldwide populations, assayed on DNA derived from immortalized lymphoblastoid cell lines. This genotype information is available for download and can be viewed through the HapMap browser, other genome browsers and within dbSNP records. The HapMap information is valuable in a range of uses including but not limited to validating the presence and relative allele frequency of many SNPs in the genome, estimating and delineating haplotype blocks, estimating recombination rates and hotspots, providing the basis for genotype imputation, guiding selection of SNPs for genome-wide arrays, and providing reference samples for genotype assay design and in vitro experiments. An initiative underway, the 1000 Genomes Project, aims to sequence >1,000 individual human genomes including many HapMap samples. This project began releasing data in 2009 and will provide an even deeper resource on human genetic variation, capturing common variation but also discovering more rare variation than ascertained in earlier HapMap phases.
The HapMap and dbSNP provide a view of worldwide similarities and differences in allele frequency of human variation. There are a number of databases aimed at characterizing variation within or across human populations, including the Japanese SNP database (JSNP)7, the ThaiSNP database, the Taiwan-Han Chinese SNP database, SNP@ethnos8, the CEPH genotype database and ALFRED9. Many of these databases rely completely on or extend upon SNP information from dbSNP and HapMap. ALFRED is notable because although it contains information for only ~18,000 variants, it has the most diverse sample encompassing >680 populations. Databases reporting on diverse samples have a variety of potential uses including estimating expected population control frequencies for SNPs of interest, deriving power calculations for SNP studies, and estimating population ancestry measures.
Users of the common SNP databases above have multiple options to retrieve and organize SNP-centered information, often starting with simple downloads or tools available at the source websites. A number of “marts” allow relatively fast retrieval of SNPs that meet user-defined criteria (e.g., population MAF thresholds), including 1) HapMart at the HapMap website, 2) BioMart developed by the OiCR and EBI, 3) SPSmart10, and 4) Genome Variation Server (GVS) at NHLBI. Given a list of SNPs a user can also conduct a “batch retrieval” from dbSNP to retrieve information available there. The UCSC Genome Browser also features easy viewing and downloading of SNP-centered information. For more complex SNP queries, BioMart or the Table Browser at the UCSC Genome Browser provide potential solutions. While BioMart is currently limited to an older dbSNP version, it can provide filtering based on SNP validation status and SNP function (e.g., all stop codon SNPs in the genome). The UCSC Table Browser allows users to construct open-ended queries based on UCSC annotations, for instance: retrieving all SNPs that are found in human microRNAs, all SNPs in conserved transcription factor binding sites, or all SNPs found in Affymetrix U133 gene expression array probes. These queries are carried out based on the relational data tables that underlie the UCSC Genome Browser annotation tracks. For individuals with interest in deeper analyses, most of the major databases of common variation (e.g., dbSNP, HapMap, UCSC) include an option to download all data with minimal restrictions on use. An important consideration in any SNP informatics project is that each SNP data source contains potential ascertainment biases.
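The kind of criterion-based retrieval described above (e.g., a population MAF threshold) can be sketched in a few lines once allele counts have been downloaded. The example below is a minimal illustration for biallelic SNPs only; the function names, the input layout and the SNP labels are hypothetical and do not correspond to the interface of HapMart, BioMart, SPSmart or GVS.

```python
def minor_allele_frequency(count_a, count_b):
    """Minor allele frequency of a biallelic SNP from its two allele counts."""
    total = count_a + count_b
    if total == 0:
        raise ValueError("no observed alleles")
    return min(count_a, count_b) / total


def filter_by_maf(allele_counts, threshold):
    """Return the SNP labels whose minor allele frequency is >= threshold."""
    return [
        snp
        for snp, (count_a, count_b) in allele_counts.items()
        if minor_allele_frequency(count_a, count_b) >= threshold
    ]


# Invented allele counts for one population sample (100 individuals, 200 alleles).
example_counts = {
    "snp1": (90, 110),  # MAF 0.45
    "snp2": (198, 2),   # MAF 0.01
    "snp3": (170, 30),  # MAF 0.15
}

print(filter_by_maf(example_counts, 0.05))  # ['snp1', 'snp3']
```

Because allele frequencies differ between populations, the same threshold applied to counts from a different sample can return a different SNP set, which is one practical face of the ascertainment issue noted above.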
There are a range of resources that provide useful information relating to specific variants, often integrating information from the literature, or multiple databases or datasets. The OMIM database is an excellent example, combining expert curated summaries of the literature with information on allelic variants, and searchable by SNP identifier. The Human Genome Epidemiology (HuGE) Navigator provides flexible mechanisms to query for genetic associations in the literature based on phenotypes (Phenopedia) or genes (Genopedia)11. Likewise, the Genetic Association Database (GAD) at the NIH also provides a resource to search over 40,000 association studies via many mechanisms, providing information for some studies on populations studied, statistical associations with specific variants and study conclusions.
Genome-wide association studies (GWAS) based on large-scale SNP genotyping have resulted in the generation and analysis of an unprecedented scale of data in the genetics literature, with more than 350 estimated GWAS published at this time and billions of genotypes analyzed. A number of recent efforts have made an extended, albeit rather incomplete, proportion of GWAS results accessible. The most extensive and centralized resource to date is NCBI's database of genotypes and phenotypes (dbGAP), although access to many results requires formal application. Separately, a list of top results from GWAS studies is currently maintained by the NHGRI. Another resource, HGVbaseG2P, an expansion of the prior HGVbase12, provides an informatics structure and “mart” for querying GWAS results with some restrictions. A number of catalogs of GWAS scans for association with gene expression are publicly available and are highlighted in the Supplement.
We recently published a survey of the characteristics of top GWAS results and created an open access database of more than 56,000 SNP associations based on available results from 118 GWAS representing scans for more than 400 phenotypes13. The survey indicated that there may be significant insights to be made by open sharing of such genomic results, particularly by allowing them to be annotated in a standardized fashion to allow for additional analyses, but also showed many investigators have chosen not to share extensive results. At the same time a recent effort revealed that it may be possible to identify the presence of individual participants in a cohort given availability of GWAS results, which has prompted caution and even retraction of the release of such results, particularly where the results included information on population allele frequencies14.
This raises important issues not always at the forefront in bioinformatics practice, namely ethical, social and legal obligations to protect participants who have contributed data. SNPedia is an online tool that gives people information on risk based on their individual genotypes and an algorithm run over information from the literature. However, such initiatives are likely premature and possibly misleading given our current level of understanding and the modest known risk contribution of most SNPs present on genotyping arrays15. Given the continued growth in large scale genetic association studies, the pending release of 1000 Genomes Project data and the expected imminent wave of cheaper personal sequencing, researchers will have to struggle with ethical questions of when and how to inform participants of genetic results, as well as ways to protect personal information while enabling the appropriate storage and access of large amounts of data for research purposes.
The bulk of SNP-related software relates to genetic study design, collection and management of genetic information, and the statistical conduct, analysis and interpretation of genetic studies. It is beyond the scope of this review to address the complement of software available in this area. An excellent online list is regularly updated, currently containing links and information for more than 480 programs (http://www.nslij-genetics.org/soft/). Tools to conduct statistical genetic analyses and to analyze LD patterns among markers and predict haplotypes are among the most frequently developed software areas. Here I summarize popular and useful software and recent developments with a focus on SNP association software rather than linkage software. Many statistical geneticists and bioinformaticians also implement their own code for analyses, often relying on packages in the R programming language (e.g., haplo.stats_R for haplotype association analysis). An extended version of this section highlighting additional software is found in the Supplement.
A first, and sometimes last, consideration in genetic analyses is a power calculation. While such calculations are implemented in some genetic analysis software, an excellent stand alone site exists for this purpose: http://pngu.mgh.harvard.edu/~purcell/gpc/16. When samples and markers are determined, if pre-built genotyping strategies are not applied the next step is often assay design and validation. Careful use of software to assist in assay design as well as lab information management systems (LIMS) can help reduce errors and cut genotyping costs. Many programs exist to aid in assay design for various genotyping approaches, including some with specific components for SNP design: PrimerBatch3 which includes multiple SNP assay types17 and a popular general tool Primer3 18. Good genotyping assay design principles should be applied when possible including consideration of potential confounding effects from repetitive regions, SNPs that may hybridize to probe sequences, GC-rich regions, poly nucleic acid stretches, and potential tri-allelic variants. For labs handling high volumes of genotyping results a LIMS may be a desirable informatics capability.
For those undertaking GWAS analyses an early concern is the careful application of genotype calling algorithms. These algorithms have progressed over the years, with the original algorithms largely displaced by algorithms which demonstrate improved accuracy. The major algorithms and software are largely platform-specific (e.g., Affymetrix versus Illumina) and in some cases array-specific. Birdsuite19 supports SNP, CNV and CNP calling for the Affymetrix 6.0 array. Current genotyping algorithms applicable to Illumina arrays include Illuminus20 and GenoSNP21. Those conducting a DNA pooling approach to conserve samples and funds may apply pooling-specific calling algorithms including GenePool22. For groups interested in integrating and managing genotype calls across Affymetrix and Illumina platforms, IGG is specifically designed for this purpose23. Identifying overlapping SNPs and LD proxies across commercial arrays can also be done easily with SNAP2.
Once genotypes have been collected, cleaned and called, depending on the scope of the project (e.g., GWAS, candidate gene or replication) a number of additional steps may be taken. Calculation of straightforward population genetic measures such as Hardy-Weinberg equilibrium (HWE) may be informative. Such statistics are included in many programs, or separate routines like SNP-HWE24 are also available. With genotypes fixed another step can be to examine and potentially adjust for population structure and stratification, which can be a source of confounding in association analyses. Implementations are available for parametric approaches (STRUCTURE25) and non-parametric approaches (EIGENSTRAT26) which have gained favor in recent years. The PLINK whole genome association toolkit also includes a module for correction based on identical by state calculations for whole genome genotyping data27. Another approach often applied in whole genome level analysis is the use of genomic control calculations for adjustment28.
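As a concrete illustration of the HWE calculation mentioned above, the following minimal sketch applies the textbook one-degree-of-freedom chi-square goodness-of-fit test to the three genotype counts of a biallelic SNP. This is an assumption made for illustration, not the method of any specific program named in the text; exact tests are generally preferable when genotype counts are small. The function name `hwe_chi_square` is hypothetical.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test (1 df) for Hardy-Weinberg equilibrium at a biallelic SNP.

    Returns the chi-square statistic and its p-value.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes observed")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p                      # frequency of allele B
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    if min(expected) == 0:
        raise ValueError("monomorphic SNP: HWE test is undefined")
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Invented counts: the first set matches HWE proportions exactly,
# the second has no heterozygotes at all.
print(hwe_chi_square(25, 50, 25))
print(hwe_chi_square(50, 0, 50))
```

In practice a strong departure from HWE in controls is often treated as a flag for genotyping error rather than as a biological finding, so such a check usually comes before association testing.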
The use of inference based on measured SNP genotypes to estimate untyped SNPs, or allele dosages, has been an active area of development and application in recent years. While also applicable to local and regional contexts, imputation is generally applied on a genome-wide scale. The methods for imputation generally take similar approaches, relying on LD relationships between SNPs in the HapMap, and are relatively computer intensive. Popular imputation programs include MACH29, IMPUTE30, PLINK27, BEAGLE31, BimBam32 and TUNA33. Application of these programs to most genome-wide genotyping datasets currently results in estimates for more than 2 million SNPs, increasing genomic coverage and allowing groups with distinct starting genotyping platforms to compare results or conduct meta-analysis. A review of imputation-driven meta-analysis gives a more detailed overview of important considerations34. Two recent empirical comparisons of imputation software have favored the use of MACH, IMPUTE and BEAGLE35,36.
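The LD relationships that imputation relies on are commonly summarized by the pairwise r² statistic. The sketch below computes r² for two biallelic SNPs from phased haplotype frequencies using the standard definition; it assumes the frequencies are already known (for example, from a reference panel), and it is not the algorithm used by any of the imputation programs listed above. The function name and the input values are hypothetical.

```python
def ld_r_squared(p_ab, p_a, p_b):
    """Pairwise LD r^2 between two biallelic SNPs.

    p_ab: frequency of the haplotype carrying allele A at SNP 1 and allele B at SNP 2
    p_a:  frequency of allele A at SNP 1
    p_b:  frequency of allele B at SNP 2
    """
    if not (0 < p_a < 1 and 0 < p_b < 1):
        raise ValueError("both SNPs must be polymorphic")
    d = p_ab - p_a * p_b  # disequilibrium coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


# Invented frequencies: the first pair is in complete LD (allele A always
# travels with allele B), the second pair is statistically independent.
print(ld_r_squared(0.30, 0.30, 0.30))
print(ld_r_squared(0.09, 0.30, 0.30))
```

An r² near 1 means a typed SNP is a near-perfect proxy for an untyped one, which is why well-tagged SNPs tend to be imputed more accurately than poorly tagged ones.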
When a final genotyped or imputed set of SNPs is ready, the selection of appropriate tools for statistical association is a critical step. The selection of software and routines is influenced by many factors including the nature of the phenotype(s) studied, the availability and selection of covariates, the extent of missing data, family structure and pedigree availability, cohort or case-control design, population stratification, the level of expertise of researchers involved, and the extent to which information can be harmonized if multiple populations or studies are combined. The implementation and sharing of association test routines in the R programming language is popular, with many available via Bioconductor (http://www.bioconductor.org/). Many specialized genetic analysis tools exist; I highlight only a few. The PLINK toolkit is arguably the most comprehensive and well documented freely available system for conducting large scale genetic analysis, including options for population-based tests under different models, family-based testing, haplotype tests, conditional tests, imputation, stratification, and annotation34. Additional comprehensive linkage and association software packages include Mendel37, MERLIN38, Genomizer39, GHOST40 (family-based), and GenAbel (genotype-based) and ProbAbel (imputation-based) for GWAS analysis. Family-based association tests are implemented as stand-alone software or as part of larger packages, including FBAT41 and QTDT42. Two association software packages aimed at being relatively user-friendly with Windows GUI implementations are PowerMarker43 and FamHap44. Combining evidence for association across multiple studies can provide evidence for replication of genetic effects. Considerations of power and design in the studies, nature and harmonization of the phenotype measurements and statistical tests, and matching of the genetic alleles modeled and direction of effect are all important meta-analysis considerations34.
METAL (unpublished) is widely used to conduct genetic meta-analysis including on the genome-wide scale.
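To make the idea of combining evidence across studies concrete, the sketch below shows a textbook inverse-variance weighted fixed-effects meta-analysis for a single SNP. It is offered as an illustration under stated assumptions and should not be read as METAL's implementation: it assumes each study reports an effect estimate and standard error for the same coded allele on the same phenotype scale, and the function name and input values are hypothetical.

```python
import math


def fixed_effects_meta(betas, standard_errors):
    """Inverse-variance weighted fixed-effects meta-analysis for one SNP.

    All betas must refer to the same coded allele; a study that reports the
    other allele needs its beta sign flipped before being passed in.
    Returns (pooled beta, pooled standard error, z-score, two-sided p-value).
    """
    if len(betas) != len(standard_errors) or not betas:
        raise ValueError("need one standard error per effect estimate")
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled_beta / pooled_se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return pooled_beta, pooled_se, z, p_value


# Invented per-study results for one SNP (two studies of equal precision).
beta, se, z, p = fixed_effects_meta([0.1, 0.3], [0.1, 0.1])
print(beta, se, z, p)
```

With equal standard errors the pooled estimate is simply the mean of the study estimates, and the pooled standard error shrinks by a factor of the square root of the number of studies, which is the source of the power gain from meta-analysis.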
Particularly in the conduct of GWAS where the scope of results handled is large, there is often more informatics to do after the primary analysis or meta-analysis is complete. One of the critical questions that arises when a significant association signal is detected in a GWAS or other study is what are the responsible, functional gene(s) and variant(s)? The peak marker associated is likely not the functional explanation, and may even be located in or near a gene that does not have a role in the phenotype studied. A likely scenario is that the associated markers are in LD with one or more other markers, known or yet unknown, that are the functional explanation for the association signal. An immediate task is plotting results (e.g., WGAViewer45, GWAS GUI46, AssociationViewer47), particularly regional LD and association plots that can be generated with SNAP via a web interface2 or the popular tool Haploview48. Consideration of such plots can be helpful in evaluating the approximate genomic boundaries likely to contain functional variants. Identifying strongly associated variants and those in LD informs further efforts like re-sequencing, molecular experiments on candidate genes in the region, and the prediction and validation of potential functional variants. The prediction of “functional SNPs” is an active and evolving area of SNP bioinformatics. Readers are encouraged to read the Supplement, which details bioinformatics tools and servers aimed at predicting functional protein and regulatory polymorphisms, along with important considerations for their use and interpretation. Functional prediction tools are described in detail in Supplemental Tables 1–3. Practical bioinformatics examples are also discussed in relation to APOE variants in the Supplement along with additional areas of SNP bioinformatics including tools relevant to sequence data, pathway mining and literature searching.
Bioinformatics has been an integral part of genetics and genomics since relatively early studies on the effects of protein coding SNPs, and the challenge of assembling and annotating early genome sequences. The growth in size and scope of databases has been met with a growth in bioinformatics resources, and as a result new opportunities for data analysis and integration have followed. The future impact of bioinformatics on SNP-related research is likely to continue to be great as decreased sequencing costs, technological advances and large bio-bank projects lead to further insights and opportunities but also present difficult data management challenges.
Funding sources: Dr. Johnson is supported by an NIH IRTA position within the NHLBI (National Heart, Lung and Blood Institute).