SPOT (http://spot.cgsmd.isi.edu), the SNP prioritization online tool, is a web site for integrating biological databases into the prioritization of single nucleotide polymorphisms (SNPs) for further study after a genome-wide association study (GWAS). Typically, the next step after a GWAS is to genotype the top signals in an independent replication sample. Investigators will often incorporate information from biological databases so that biologically relevant SNPs, such as those in genes related to the phenotype or with potentially non-neutral effects on gene expression such as splice sites, are given higher priority. We recently introduced the genomic information network (GIN) method for systematically implementing this kind of strategy. The SPOT web site allows users to upload a list of SNPs and GWAS P-values and returns a prioritized list of SNPs using the GIN method. Users can specify candidate genes or genomic regions with custom levels of prioritization. The results can be downloaded or viewed in the browser where users can interactively explore the details of each SNP, including graphical representations of the GIN method. For investigators interested in incorporating biological databases into a post-GWAS SNP selection strategy, the SPOT web tool is an easily implemented and flexible solution.
Due to corrections for multiple testing and limited sample sizes, genome-wide association studies (GWAS) often lack the statistical power to discover statistically significant associations between phenotype and genotype (1). Therefore when a single nucleotide polymorphism (SNP) shows relatively strong evidence of genetic association, that is, is among the top signals from the study, the next step is to genotype the SNP in additional independent samples in order to prove the association is not simply due to chance. The strategy for selecting SNPs for additional genotyping could be simple, such as ranking the SNPs by their P-values from statistical tests for association, or somewhat complex if certain biological considerations are taken into account (2). Once a set of SNPs has been confirmed to be associated with the phenotype, the next logical steps include functional experiments that attempt to isolate the precise molecular genetic mechanism, such as the effect of the genetic variant on transcription, which may act by a direct modification to the amino acid sequence or structure of the protein product, or by an effect on a regulatory mechanism. Functional experiments can be very costly, raising the question of how to prioritize SNPs to maximize returns.
Even when there is only a single confirmed SNP, linkage disequilibrium (LD) with other SNPs can make it very difficult to isolate the true causal polymorphism. When a single SNP is confirmed to be associated with a phenotype and is in strong LD with several other SNPs, the genotypes at these so-called ‘LD proxies’ will be very similar. While this property is often exploited by manufacturers of commercial SNP microarrays (3) to reduce the number of SNPs required for genotyping, in this case it proves to be a serious problem because all the LD proxies show the same evidence for association and no amount of further genotyping will resolve this ambiguity. Instead, expensive and time-consuming functional experiments, possibly involving model organisms, must be conducted in an effort to identify the true causal variant.
In these scenarios, the known biological properties of the variants can prove to be useful in formulating a prioritization strategy. When selecting SNPs for further study after a GWAS, such as genotyping in a replication sample, investigators may choose to prioritize solely based on the statistical evidence for genotype–phenotype correlation; in other words use a single P-value threshold. If there are sufficient resources to select all SNPs above a maximum desired P-value, which could be determined by the combination of a minimum desired effect size and specification of statistical power under some reasonable transmission models, then prioritization by P-value alone is logical. When resources are limited, certain genes and genomic regions may be given higher priority, such as selecting all SNPs with P < 10^−4 and SNPs in certain genes with P < 10^−2. Because these ranges of P-values may involve many false positives, care should be used in determining the thresholds, and a careful evaluation of statistical power should be used to determine the goals of the study, such as specifying a desired minimum effect size. When the evidence for association boils down to a single SNP and all its LD proxies, the P-values are all nearly identical and so biological prioritization becomes particularly attractive. This kind of prioritization strategy allows investigators to maximize the return on resources while implementing their specific biological priorities.
We recently introduced the genomic information network (GIN) model for systematically incorporating biological databases into the prioritization of SNPs after a GWAS (4). SPOT implements the GIN prioritization method using a secure, interactive web-based approach. The user uploads a list of SNPs and may optionally include P-values from statistical tests of genotype–phenotype correlation. The user may also upload a list of genes, genomic regions or specific SNPs with custom prioritization scores that determine how SPOT uses the GIN method to prioritize the SNPs. The results may be viewed directly in the web browser or downloaded in various formats. The GIN prioritization method is not designed to predict causal variants, and the prioritization results are not intended to be used to interpret the statistical significance of the GWAS results. Rather, it is intended to assist users in incorporating a broad range of biological hypotheses into the prioritization process using a transparent method of specifying their specific biological priorities and to provide results that are easily interpreted in terms of what went into the model and exactly how that information was used to prioritize the SNPs.
Given a list of SNPs and P-values from a statistical test for genetic association, the goal of the GIN prioritization method (4) is to combine biological information with evidence for genetic association to prioritize SNPs for further study so that SNPs with biologically relevant properties receive higher priority. This is done by first specifying a non-negative prioritization score for each SNP. If S is the GIN prioritization score, the weighted P-value P_w is defined by P_w = P/10^S (4–5). SNPs are ranked for further study by P_w, where smaller values of P_w have higher priority.
A GIN prioritization score represents an order of magnitude change in P-value from a test for association. For example, a SNP with an overall score of 1 and a P-value of 0.01 has the same priority in the GIN model as a SNP with a score of 0 and P-value of 0.001. This allows the user to specify their priority for testing certain biological hypotheses by weighing that priority against evidence for genotype–phenotype correlation. Typically there are few P-values from a GWAS < 10^−8, so a score of ≥ 8 essentially guarantees a SNP will have extremely high priority after GIN prioritization. The effect of a score on rank, however, depends on the overall distribution of scores the user has specified for their set of SNPs. For example, there is no effect on rank if all SNPs receive a score of 8; the rankings are the same as by P-value alone [for more information see the discussion of normalized weights in (4)]. When deciding on a score, it is helpful to use a frame of reference. By default, we prioritize SNPs in genes one order of magnitude higher than those not in or near genes, and missense SNPs one order of magnitude higher than those in introns. In Saccone et al. (4), we conducted a sensitivity analysis to demonstrate that the GIN prioritization results are not sensitive to small changes in prioritization scores.
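The weighting rule above can be illustrated with a minimal Python sketch. The function names and the SNP identifiers are hypothetical and chosen only for illustration; the sketch applies P_w = P/10^S exactly as defined in the text and is not SPOT's actual implementation.

```python
import math


def weighted_p(p, score):
    """Return the GIN weighted P-value P_w = P / 10**S for one SNP."""
    if score < 0:
        raise ValueError("GIN prioritization scores are non-negative")
    return p / 10 ** score


def rank_snps(snps):
    """Rank SNPs by ascending weighted P-value.

    snps: list of (snp_id, p_value, score) tuples (illustrative layout).
    Returns the SNP ids, highest priority first.
    """
    ordered = sorted(snps, key=lambda t: weighted_p(t[1], t[2]))
    return [snp_id for snp_id, _, _ in ordered]


# Hypothetical input: snpB has the weakest P-value but a score of 2.
snps = [("snpA", 0.001, 0), ("snpB", 0.01, 2), ("snpC", 0.0005, 0)]
print(rank_snps(snps))  # ['snpB', 'snpC', 'snpA']

# A score of 1 with P = 0.01 matches a score of 0 with P = 0.001.
print(math.isclose(weighted_p(0.01, 1), weighted_p(0.001, 0)))  # True
```

Giving every SNP the same score divides all P-values by the same constant, so the ranking reduces to the ranking by P-value alone, as noted above.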
Figure 1 shows a screenshot from SPOT displaying the GIN diagram for the SNP rs3762611, which is in the example data provided on SPOT’s main page. The diagram, created dynamically by SPOT using GraphViz (http://www.graphviz.org), shows how the overall GIN prioritization score was determined for rs3762611. When determining the score, SPOT takes into account all possible LD proxies, that is, SNPs with r^2 above a certain threshold in a specific HapMap (6) sample. We used genotype data from HapMap Public Release 27 (http://hapmap.ncbi.nlm.nih.gov) and the program Haploview (7) to estimate the r^2 LD coefficients in each of the 11 populations from this release. Figure 1 shows the GIN calculations for the LD proxy rs16859826, which was determined using the HapMap European American sample. The overall prioritization score is computed by traversing the network diagram from left to right and adding up scores from the different nodes as described in (4). SNPs with higher biological relevance, such as those in genes, particularly user-specified priority genes, and in conserved regions, receive higher scores.
The process is repeated for all LD proxies. The highest scoring LD proxy is used to determine the overall score of the original SNP, rs3762611, provided the score of the proxy is greater than the score of the original. In this case, the LD proxy rs16859826 was used to determine the overall score of rs3762611. This computation takes into account the strength of the LD proxy, so that proxies with smaller values of r^2 have smaller scores and are less likely to be used. Also, a fixed penalty is applied to the score of the proxy in order to ensure LD proxies are only used when they have better scores than the original, even when r^2 = 1. The default LD penalty of 0.9 is moderate so that LD proxies will be given significant attention when prioritizing SNPs. This parameter can be configured on SPOT’s main page.
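The proxy selection logic can be sketched as follows. The text does not give the exact formula by which r^2 and the LD penalty reduce a proxy's score [it is described in (4)], so the multiplicative form used here, penalty × r^2 × proxy score, is purely an assumption made for illustration, as are the function and parameter names.

```python
def effective_score(own_score, proxies, ld_penalty=0.9, r2_threshold=0.8):
    """Pick the score used for a SNP after considering its LD proxies.

    proxies: list of (proxy_id, r2, proxy_score) tuples (illustrative layout).
    Returns (score, proxy_id), where proxy_id is None if the SNP's own
    score is kept.
    """
    best_score, best_source = own_score, None
    for proxy_id, r2, proxy_score in proxies:
        if r2 < r2_threshold:
            continue  # not an LD proxy under the chosen threshold
        # ASSUMPTION: the penalty and r^2 act multiplicatively on the score.
        adjusted = ld_penalty * r2 * proxy_score
        if adjusted > best_score:
            best_score, best_source = adjusted, proxy_id
    return best_score, best_source


# A perfect proxy (r^2 = 1) with the same raw score is not used,
# because the penalty pulls it below the original SNP's score.
print(effective_score(1.0, [("proxy1", 1.0, 1.0)]))  # (1.0, None)
```

Under this assumed form, a proxy replaces the original SNP only when its biological score is high enough to survive both the r^2 weighting and the penalty, which matches the behavior described in the text.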
Following the convention introduced in (4), the ‘Gene’ node has a score of 1 for all genes and its contribution to the overall score takes into account SNP/gene functional properties using a mechanism called the ‘link index’. In general, the link index is used to describe the strength or the manner in which a connection is made to a node. Missense mutations, for example, have a default link index of 2 and so the contribution of the ‘Gene’ node becomes 2 × 1 = 2. The overall score is the sum of the contributions, S = ∑ L_i S_i, over the link indices L_i and scores S_i of the nodes. The link index of the gene node is completely determined by the user for SNP/gene transcript functional properties. These include nonsense, frameshift, missense and 5′- and 3′-UTR designations. The SNP/gene transcript functional properties we used are from dbSNP build 130 (8). The 5′- and 3′-UTR specifications imply only that SNPs are located in these ends of the transcribed region; there is currently no additional information on putative promoters, transcription factor binding sites or other regulatory data.
We have incorporated information from the PolyPhen method of predicting the effect of an amino acid substitution on the properties of the protein product (9–10). PolyPhen predictions for SNPs in dbSNP build 126 were downloaded from the PolyPhen web site (http://genetics.bwh.harvard.edu/pph/data; the PolyPhen 2 tool is now available at http://genetics.bwh.harvard.edu/pph2 but at the time of writing did not yet appear to provide a comprehensive set of predictions for download). When PolyPhen predictions are enabled by the user on the main page, the SNP/gene functional property will be replaced by the PolyPhen prediction when a prediction for that SNP exists. The prediction can be ‘benign’, ‘possibly damaging’ or ‘probably damaging’, and users may specify a different prioritization score for each prediction. By default, these are 1, 2 and 3, respectively, which correspond to the default values for intron, missense and frameshift, so that the PolyPhen prediction may increase or decrease the prioritization score compared to the default value of 2 for missense SNPs.
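The link index mechanism and the PolyPhen substitution can be sketched in a few lines of Python. The default values for intron, missense and frameshift and for the three PolyPhen predictions are taken from the text; the dictionary and function names are hypothetical, and treating the PolyPhen value as a direct replacement for the link index is an interpretation of the description above rather than a statement of SPOT's internals.

```python
# Default values stated in the text; other functional classes are omitted
# because their defaults are not given here.
DEFAULT_LINK_INDEX = {"intron": 1, "missense": 2, "frameshift": 3}
DEFAULT_POLYPHEN_VALUE = {
    "benign": 1,
    "possibly damaging": 2,
    "probably damaging": 3,
}
GENE_NODE_SCORE = 1  # the 'Gene' node has a score of 1 for all genes


def gene_contribution(functional_class, polyphen=None, use_polyphen=True):
    """Contribution of the 'Gene' node: link index times node score."""
    if use_polyphen and polyphen is not None:
        # ASSUMPTION: the prediction's value stands in for the link index.
        link = DEFAULT_POLYPHEN_VALUE[polyphen]
    else:
        link = DEFAULT_LINK_INDEX[functional_class]
    return link * GENE_NODE_SCORE


def overall_score(contributions):
    """S = sum of L_i * S_i over (link_index, node_score) pairs."""
    return sum(link * score for link, score in contributions)


print(gene_contribution("missense"))                       # 2
print(gene_contribution("missense", polyphen="benign"))    # 1
print(gene_contribution("missense", "probably damaging"))  # 3
```

With these defaults, a missense SNP predicted to be benign drops from 2 to 1, and one predicted to be probably damaging rises to 3, which is the increase or decrease relative to the missense default described above.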
The user may prioritize specific genes by providing gene symbols and prioritization scores as input to SPOT. This information is represented by the ‘User Gene’ node. Similarly, the user may prioritize genomic regions and single SNPs in the ‘User Region’ and ‘User Special SNP’ nodes, respectively. When a gene is specified two different ways, or when two user-specified regions overlap, SPOT will combine the prioritization scores according to the user-specified option ‘Multiple Query Method’, which may be maximum (the default), minimum, sum or average. When a SNP is in an evolutionary conserved region (ECR) with fractional conservation P (0 ≤ P ≤ 1), the corresponding ECR node contributes F*P to the overall score where F is a factor currently set to 0.75. We used ECRBase for data on human/mouse ECRs (11).
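As a small illustration of the two rules in this paragraph, the sketch below combines overlapping user-specified scores with the four ‘Multiple Query Method’ options and computes the ECR node contribution F × P with F = 0.75. The function and variable names are hypothetical and the code is not taken from SPOT.

```python
from statistics import mean

# The four options named in the text; 'maximum' is the stated default.
COMBINE_METHODS = {"maximum": max, "minimum": min, "sum": sum, "average": mean}

ECR_FACTOR = 0.75  # the factor F given in the text


def combine_scores(scores, method="maximum"):
    """Combine scores from overlapping user queries for one gene or region."""
    return COMBINE_METHODS[method](scores)


def ecr_contribution(conservation):
    """ECR node contribution F * P for fractional conservation P in [0, 1]."""
    if not 0.0 <= conservation <= 1.0:
        raise ValueError("fractional conservation must lie in [0, 1]")
    return ECR_FACTOR * conservation


# A gene matched by two queries with scores 1.0 and 1.25.
print(combine_scores([1.0, 1.25]))             # 1.25
print(combine_scores([1.0, 1.25], "sum"))      # 2.25
print(combine_scores([1.0, 1.25], "average"))  # 1.125
```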
The final results of the GIN prioritization process may be viewed in an interactive table within the web browser (see the SPOT User’s Guide, https://spot.cgsmd.isi.edu/doc/user_guide.pdf, for screenshots and additional details). The table shows the original P-values and their ranks in the column ‘Rank: p-value’, and the SPOT rankings from the GIN prioritization method in ‘Rank: SPOT’, which are determined by the ‘GIN weighted p-value’ column. The ‘Rank: SPOT’ column is the most important item in the table as it reflects the priority of the SNP from the GIN prioritization method. A graphical column selection tool can be used to configure the output tables. The user may select which columns are displayed and their order. This allows the user to view additional information such as detailed gene/mapping properties, HapMap allele frequencies and commercial SNP microarray LD tagging properties. Columns labeled with an asterisk refer to the LD proxy when one is being used, and information about the original or ‘source’ SNP may be viewed using the column selection tool.
The main page consists of a web form for the primary user input. Users must first either upload a list of numeric SNP identification numbers as a file or enter the list directly into a web form. The user may optionally provide P-values from statistical tests for genotype–phenotype correlation. In the section labeled ‘Prioritization of specific genes and other genomic regions’, the user can specify a list of genes and prioritization scores to be used in the ‘User Gene’ node of the GIN model.
A number of methods can be used to specify genes and genomic regions. For example, the query ‘ENTREZ_GENE_QUERY = Dopamine and Receptor,1.0’ retrieves all genes from the Entrez Gene database (http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene) that match the terms ‘dopamine’ and ‘receptor’ and assigns them all a prioritization score of 1.0. SNPs from an entire region can be prioritized with a query of the form ‘REGION = Chr15:76600000.76700000, 1.25’, which would add a score of 1.25 to all SNPs in this 100 Kb region on chromosome 15 via the ‘User Region’ node of the GIN model. More information on the different kinds of queries is provided in the ‘User’s Guide’ (https://spot.cgsmd.isi.edu/doc/user_guide.pdf).
MySQL (http://www.mysql.com) is used via the Perl DBI module (http://dbi.perl.org) to process and store user input and implement the GIN prioritization method. During this execution, biological information is integrated from a local MySQL relational database. All results are stored as tables in a MySQL relational database which is then accessed by the web interface.
In the following runs of SPOT, the default LD settings were used so that LD proxies from the HapMap CEU (European–American) sample are used with a threshold of r^2 ≥ 0.8. The default setting of a maximum of 1000 SNPs limits the number of SNPs sent to the output tables, but does not limit SNPs used as input, and the P-value threshold is set to 0.05 so that only SNPs with P ≤ 0.05 are used in the GIN model. The example data provided on the main page (10 SNPs and 4 queries retrieving 10 genes) took 3 s to run, and a more typical run of 1 million SNPs with randomly selected, and therefore uniformly distributed, P-values and 1000 prioritized genes took 57 s (in this case, about 50 000 SNPs were used for prioritization given the P ≤ 0.05 filter). A more ambitious run of 1 million SNPs and 10 000 prioritized genes took 3 min and 13 s.
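The figure of about 50 000 SNPs follows from the filter alone: with uniformly distributed P-values, the expected number passing P ≤ 0.05 out of 1 million is 1 000 000 × 0.05 = 50 000. A minimal Python sketch of this check, using hypothetical names and a fixed random seed for repeatability, is:

```python
import random


def filter_by_p(pvalues, threshold=0.05):
    """Keep only P-values at or below the threshold, as in the P <= 0.05 filter."""
    return [p for p in pvalues if p <= threshold]


rng = random.Random(0)  # fixed seed so the simulation is repeatable
pvalues = [rng.random() for _ in range(1_000_000)]  # uniform on [0, 1)
kept = filter_by_p(pvalues)

# The count varies slightly with the seed but stays close to 50 000.
print(len(kept))
```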
The results of a GWAS are sensitive information and SPOT takes several steps to ensure this information is protected during a session and is destroyed when the session is complete. SPOT uses a DigiCert encryption certificate (http://www.digicert.com) so that all communication between the user and the server is secure. Any information the user uploads is destroyed after 3 h or immediately at the user’s request. Since the results depend only on the relative order of the SNPs, the user may scale the P-values by a fixed factor prior to uploading them so that the true values are never transmitted. Nevertheless, if the user does not wish to upload P-values they may select the option ‘SNPs only, no p-values’ on the main page. In this case, SPOT provides an Excel file with the prioritization scores into which the P-values can be pasted. SPOT includes a calculated column in the Excel file that computes the weighted P-values (P_w = P/10^S, where S is the overall prioritization score), which are used to determine the prioritized rankings from the GIN method. This process takes place on the user’s computer so that the results are obtained without ever transferring P-values to the SPOT server.
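The scaling argument can be checked directly: multiplying every P-value by the same positive constant multiplies every weighted P-value by that constant, so the order is unchanged. The Python sketch below uses hypothetical data and names, and also mirrors the offline calculation of P_w = P/10^S that the text describes for the Excel file.

```python
def rank_by_weighted_p(pvalues, scores):
    """Return SNP indices ordered by ascending P_w = P / 10**S."""
    weighted = [p / 10 ** s for p, s in zip(pvalues, scores)]
    return sorted(range(len(weighted)), key=weighted.__getitem__)


# Hypothetical P-values and overall prioritization scores for four SNPs.
pvalues = [0.001, 0.01, 0.0005, 0.03]
scores = [0, 2, 0, 1.5]

scale = 0.37  # any fixed positive factor chosen by the user
scaled = [p * scale for p in pvalues]

original_rank = rank_by_weighted_p(pvalues, scores)
scaled_rank = rank_by_weighted_p(scaled, scores)

print(original_rank)                 # [1, 2, 3, 0]
print(original_rank == scaled_rank)  # True
```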
The GIN prioritization method is designed to be flexible and allow a variety of biological databases to be incorporated while remaining viable and interpretable. Future iterations of SPOT will include additional biological information such as expression quantitative trait loci (12), transcription factor binding sites, micro RNA target sites and other GWAS results [see (13) and (14) for some proposed methods on implementing these kinds of data]. A useful feature would be to add predefined gene sets to SPOT’s query tool. For example, the NeuroSNP database (3) (http://nidagenetics.org/neurosnp/index.htm) consists of genes related to addiction-related diseases. Future iterations will allow the user to specify such a database, either with predefined prioritization scores or a user-defined score for the entire set. This feature would be added for convenience only as users can currently accomplish this by directly entering these gene lists into SPOT. Similar to the idea behind disease-related gene databases is the Human Variome Project (15) that aims to develop novel methods of cataloging human genetic variation and its relation to disease, and like gene databases the results of this project could be incorporated directly into SPOT. As next generation sequencing becomes the standard in following up on GWAS, the discovery and analysis of numerous and possibly very important rare variants (16) may require biological prioritization to design follow-up studies, and we will be studying ways of modifying SPOT to deal with this challenging problem.
SPOT is designed to provide investigators with a tool for systematically incorporating their specific biological hypotheses into the post-GWAS prioritization process. The testing of biological hypotheses is a reasonable study design, even if there is no clear evidence of a predictive mechanism. While a GWAS is often touted as being ‘hypothesis free’, this is not exactly the case. Typically a GWAS tests only a particular subset of variants, and designs differ from array to array with some arrays taking into account biological information (3,17). Prior to the availability of GWAS technology, researchers tested candidate genes—a logical and reasonable experiment given the available resources whose success or failure moves the field closer to the goal of testing all known variants with acceptable statistical power. Other examples of biological study designs are GWAS using only non-synonymous SNPs (18,19) and exome sequencing (20). One of the advantages of studying variants with clear biological, although possibly not phenotypically causal, consequences, such as missense SNPs as compared to SNPs not in or near genes, is the potential for conducting functional studies such as knockouts in animal models (21). SPOT mimics these study designs by allowing researchers to specify the particular biological hypotheses they wish to test so that SNPs related to these hypotheses receive additional priority when resources are limited. As has been submitted elsewhere in the literature (1,5,22), this strategy is reasonable with the stipulation that biological priorities should be established a priori in order to avoid post hoc arguments for biological plausibility, because given a gene it is not difficult to use Internet databases to mine connections between that gene and a phenotype. 
As we described in (4), the GIN prioritization method is well suited for establishing a specific, quantitative a priori plan for implementing biological hypotheses into the post-GWAS prioritization process, and this method is now implemented in SPOT.
SPOT is a useful tool for GWAS investigators because it allows them to test these reasonable biological hypotheses by defining specific biological priorities as a way of pursuing the ‘high hanging fruit’ in a GWAS. A GWAS often has limited statistical power and must rely on replication genotyping to establish the statistical significance of any remaining non-significant associations that appear promising. Even when GWAS experiments successfully replicate SNP associations, the totality of these variants often explains only a fraction of the genetic variance (23). We then look to the ‘high hanging fruit’ for answers: variants with smaller effect sizes that require greater power to be confirmed as a statistically significant association (23). These may require substantial further investigation such as large meta-analyses conducted by consortia with sample sizes often in the hundreds of thousands (24–26). It has also been argued that even with missing variance, one of the benefits of GWAS is, in addition to predicting individual risk, their ability to expose biological pathways that underlie human disease (2,27). Given the extremely high resource-consuming nature of these follow-up studies, a clear and precise plan for the prioritization of variants is critical. Clearly those variants with the strongest statistical evidence of association will be pursued first, but as that evidence dwindles, some signals with no evidence of biological relevance may be traded for those meeting the biological priorities of the study when the difference in evidence for genotype–phenotype correlation is modest. As shown by Saccone et al. (4), this is what occurs when the GIN prioritization method is implemented after a GWAS: the difference between the set of SNPs selected by the GIN prioritization and straight P-value methods is roughly only one order of magnitude in association P-value.
The difference applies mainly to SNPs with moderate evidence for association—the potential ‘high hanging fruit’.
An example of biological prioritization that could be interpreted as a ‘success’ is the following discovery of a novel genetic association with nicotine dependence. In Saccone et al. (28) the investigators conducted a large candidate gene study of nicotine dependence. Although there were no statistically significant results after correcting for multiple testing, the fifth smallest P-value was the missense SNP rs16969968 in the nicotinic receptor gene CHRNA5 and was highlighted as the most promising signal. A GWAS of nicotine dependence (29) was conducted in conjunction with the candidate gene study, and although the missense SNP ranked 199 in the combined GWAS/candidate gene results, it was the top priority for further study in the overall project. It has since been replicated in numerous studies of nicotine dependence, heavy smoking, lung cancer and COPD (30–36), including three very large meta-analytic studies (24–26). Furthermore, there have been very few replicated associations other than the original SNP and some SNPs in nearby genes. Clearly this example alone is not evidence of the general predictive power of biological information, in this case a missense SNP in a gene whose protein product binds to the drug of interest, to predict true genetic association. Nevertheless, the results from a study by Saccone et al. (3) show that commercial SNP microarrays may miss a significant amount of coverage in some genes. The fact that an overall GWAS could be negative due to the omission of a single SNP that could be discovered by another study targeting a small number of highly biologically relevant SNPs [rs16969968 was in the top 10 over all of dbSNP when we used the GIN method to prioritize for nicotine dependence (4)] is something for GWAS researchers to consider, both for the post-GWAS selection of SNPs for further study and for the pre-GWAS supplementation of commercial arrays (3).
SPOT is not intended to be used to predict true causal variants or to statistically interpret the results of GWAS. Furthermore, the problem of establishing such predictive properties for SPOT, such as through assessments of false-positive rates and receiver operating characteristics, is ill posed due to the fact that one of the core features of SPOT is that it allows the investigator to prioritize specific genes and genomic regions. Since these parameters depend on the particular biological priorities of the investigator, the general predictive properties of SPOT cannot be established. Another issue is the extreme diversity of phenotype and disease. For example, few genetic associations for psychiatric disease have been validated by replication in independent samples (37). Therefore, in order to conclude that any evidence of correlation between the biological information used by SPOT and existing confirmed genotype–phenotype associations would transfer to the prediction of true genetic associations for psychiatric disease, one would have to make the unlikely assumption that the underlying genetic structure of psychiatric disease shares substantial common elements with other types of human disease in general.
A third problem in assessing the predictive properties of SPOT is the extraordinary challenge of assembling a sufficiently sized collection of validated ‘true’ causal variants for common complex disease on which an assessment of prediction must be based. Even after ‘true’ genotype–phenotype correlations have been validated using the rigorous standards initiated by the onset of GWAS, including statistical significance after correcting for genome-wide multiple testing (1), proper adjustment for population stratification verified by an acceptable genomic inflation factor (18) and replication in independent studies (1), a critical issue that remains is distinguishing the actual causal variants from a potentially large number of LD proxies (1,38). Association statistics are virtually indistinguishable for SNPs in strong LD so that the actual causal variants must be identified by other means such as functional studies. The problem is that the result of a prediction algorithm could be ‘positive’ for an associated non-causal SNP and ‘negative’ for a causal LD proxy. This could artificially inflate the estimate of prediction. The confirmation of pathogenesis may be more straightforward for highly penetrant mutations causing rare Mendelian disorders. Tools that predict the effect of amino acid substitutions, such as PolyPhen (10) and SIFT (9,39), have been shown to have predictive power for Mendelian disorders. PolyPhen predictions have been incorporated into SPOT and their influence on priority can be configured by the user. GWAS, however, are more directly aimed at common complex disease (40,41), and it is an enormous challenge to assemble a collection of causal variants for common complex disease that meets these rigorous criteria for validation, in particular the disambiguation of LD proxies, that is of sufficient size to assess predictive power. 
While projects such as GEN2PHEN (http://www.gen2phen.org) and the Human Variome Project (15) aim to solve this problem [see (42) for a general review], it would appear that currently the resource closest to meeting these validation criteria is the database maintained by the National Human Genome Research Institute containing published GWAS associations with P < 10^−5 from the statistical test for association (http://www.genome.gov/gwastudies/) (38), although this information does not resolve the LD disambiguation issue.
Since the SPOT prioritization results are not intended to predict causal variants, we recommend that the statistical significance of GWAS results be evaluated based on genotype–phenotype correlation data alone and be corrected for genome-wide multiple testing in accordance with current standards (1). In particular, this practice will help to guard against bias from reports of positive association in the literature, as well as bias from the biological priorities of the investigators, which may lead to a misinterpretation of the results of the GWAS.
At the time of writing, there are very few publicly accessible web-based tools that perform the same function as SPOT, namely taking GWAS results as input and as output providing a table with rankings that take into account user-defined measures of biological relevance. A number of web tools dealing with SNP biological properties are shown in Table 1. GenePipe (43) is the closest to SPOT in purpose and functionality. It takes association results as input, as well as user-defined weights for various forms of genomic annotation, and provides an annotated table as output. Currently, GenePipe incorporates more databases while SPOT offers greater transparency in conveying the prioritization method by providing side by side GIN versus P-value rankings, graphical representations of the GIN calculations and tables showing the details of the prioritization process step by step, all presented interactively in the web browser.
In addition to GenePipe, there are a number of other web tools such as F-SNP (44), FastSNP (45), Panther (46), PolyPhen2 (9–10), SIFT (39) and SNPs3D (47) which assess the biological relevance of a SNP independently of genotype–phenotype correlation results (Table 1). Half of the eight tools shown in Table 1 deal exclusively with non-synonymous SNPs. Of the remaining four, only SPOT and FastSNP attempt to combine evidence from multiple sources of information and return a single measure of biological relevance suitable for systematically prioritizing GWAS results. FastSNP, while providing a great deal of information in very informative diagrammatic format, does not account for LD proxies. The evidence that the tools in Table 1 can be used to predict causal variants that influence general common complex disease, subject to the aforementioned validation criteria for causal variants, is limited. PolyPhen (10) and SIFT (39), which deal exclusively with non-synonymous SNPs, have found evidence of prediction using information on disease-related variants from the UniProt database (60) for Mendelian disorders flagged by terms such as ‘lethal’ and ‘complete loss of function’ (PolyPhen is incorporated into SPOT). FastSNP and GenePipe, which can be used for arbitrary SNPs, each tested the predictive properties based on a single disease study. A feature of some of these other tools that may be significantly appealing to investigators when compared to SPOT is the fact that they offer substantially more biological information, as shown in Tables 1 and 2. However, with the exception of GenePipe, due to limited input features these tools appear to be more geared towards assessing the biological plausibility of a small number of SNPs and perhaps their LD proxies (although these would have to be obtained from a different source, such as the LD web tool SNAP (61), and submitted manually).
This could be applied to a small number of ‘top hits’ from a GWAS, as opposed to incorporating biological information into a full-scale GWAS for the purpose of prioritizing a large number of replication experiments, which is the purpose of SPOT. We are currently exploring ways of integrating additional biological information, such as from the sources in Tables 1 and and2,2, in order to provide users with a more comprehensive palette for establishing biological hypotheses.
In Saccone et al. (4), we conducted a sensitivity analysis of how changes in the user’s prioritization scores affect the rankings, and we performed simulations to assess the difference between selecting SNPs after a GWAS using just P-value rankings and using the GIN method. Neither of these analyses was designed to use a ‘training set’ to establish predictive properties. We found that the rankings are not very sensitive to changes in the scores and that, in general, with default scoring parameters there is about a one order of magnitude difference between SNPs selected by P-value and those selected by the GIN method.
The SNP/gene transcript properties used by our algorithm are currently limited to a sophisticated prediction provided by the PolyPhen (9–10) algorithm on the impact of an amino acid substitution, and those that can be observed directly from DNA and RNA sequence such as coding regions, untranslated regions, missense and nonsense amino acid substitutions and frameshifts. Given the amount of experimental human genomic data available, this is a relatively limited amount of information. However, when integrating biological information into a GWAS, even with this relatively limited set, there are several aspects to consider. First, a SNP may be associated with many genes: it may lie in one gene and near another, in the intersection of multiple genes, or in a gene with several known transcripts due to alternative splicing, with the SNP having different functional consequences on each transcript. SPOT considers all known SNP/gene transcript associations and selects the one with the highest priority to ensure that no biologically promising association signal is missed. Furthermore, when genes overlap, SPOT will take into account specific genes prioritized by the user. Finally, checking for LD proxies while taking into account all of the previous elements not only requires a complex algorithm, but also involves processing an enormous amount of information (our LD database alone contains data on 343 million pairs of SNPs) on a genome-wide scale when applied to GWAS data. While we are planning to incorporate additional biological information into future implementations of SPOT, researchers should find SPOT useful even with this relatively limited amount of information.
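As a rough illustration of the selection step described above, the following Python sketch picks the highest-priority SNP/gene transcript association for a SNP, considering both its own annotations and those of its LD proxies, and adding a bonus for user-prioritized genes. It is a minimal sketch under stated assumptions and is not the GIN method or SPOT's implementation: the names `Annotation`, `CONSEQUENCE_SCORE`, `annotation_score` and `best_annotation`, the numeric scores, and the additive gene bonus are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Annotation:
    """One SNP/gene transcript association (hypothetical structure)."""
    gene: str
    transcript: str
    consequence: str  # e.g. "nonsense", "missense", "utr", "intronic"


# Hypothetical priority scores per consequence type; not SPOT's defaults.
CONSEQUENCE_SCORE: Dict[str, int] = {
    "nonsense": 4,
    "frameshift": 4,
    "missense": 3,
    "utr": 2,
    "coding": 1,
    "intronic": 0,
}


def annotation_score(ann: Annotation, gene_bonus: Dict[str, int]) -> int:
    """Score one association: consequence score plus any user gene bonus."""
    return CONSEQUENCE_SCORE.get(ann.consequence, 0) + gene_bonus.get(ann.gene, 0)


def best_annotation(
    snp: str,
    annotations: Dict[str, List[Annotation]],
    ld_proxies: Dict[str, List[str]],
    gene_bonus: Dict[str, int],
) -> Optional[Tuple[int, str, Annotation]]:
    """Return (score, source SNP, annotation) with the highest score over
    the SNP itself and its LD proxies, or None if nothing is annotated."""
    best: Optional[Tuple[int, str, Annotation]] = None
    for candidate in [snp] + ld_proxies.get(snp, []):
        for ann in annotations.get(candidate, []):
            score = annotation_score(ann, gene_bonus)
            # Strict comparison: on ties, the first association seen is kept.
            if best is None or score > best[0]:
                best = (score, candidate, ann)
    return best
```

In this sketch, a SNP that is intronic in one gene, lies in the untranslated region of a second gene, and has an LD proxy that is a missense variant in a third gene would be represented by the proxy's missense association, unless the user assigns a large enough bonus to the second gene. How ties are broken and how the LD strength of a proxy is weighted are left out here and would need to follow the published method.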
The problem of interpreting the results of a GWAS and planning follow-up experiments is formidable. Integrating information from biological databases can aid the decision-making process so that limited resources are put to best use. SPOT is an easily implemented and flexible tool that will aid researchers in applying a biological prioritization strategy when selecting SNPs for further study after a GWAS, and it is designed to remain an interpretable and viable solution as additional sources of biological information are integrated.
National Institutes of Health (DA024722 to S.F.S., MH068457 to J.A.T., AA008401 to J.P.R., NIDA Contract HHSN271200900012C to J.A.T.); American Cancer Society (IRG5801050 to S.F.S.). Funding for open access charge: National Institutes of Health Grant DA024722.
Conflict of interest statement. None declared.
We are very grateful to the following individuals for testing SPOT: William Howells, Chun-Nan Hsu, Peng Lin, Richard C. McEachin, Nanette Rochberg, Sharon Ryan, Nancy L. Saccone and Andrew Schrage. We would also like to thank Gary Stormo for valuable advice concerning certain aspects of the design. OGPET was developed by Rafael Torres Jr., Yash Dayal, Ming-Ying Leung, and Igor C. Almeida, Border Biomedical Research Center (BBRC), Departments of Biological and Mathematical Sciences, University of Texas at El Paso. Finally, we are very appreciative of the time and effort invested by the reviewers, whose comments and suggestions resulted in numerous improvements to the manuscript and enhancements to the web tool itself.