The system described above was used to generate a list of genes, which was then used to select SNPs for a study of childhood-onset Systemic Lupus Erythematosus (SLE). SLE is a debilitating multi-system autoimmune disorder affecting ≈ 0.1% of the North American population. An initial search using a set of 31 keywords (biological functions and chromosomal regions) chosen by expert knowledge returned 6798 genes, with varying contributions from the three databases used (Table ). The results are time-sensitive: as the databases are updated, different sets of genes will be returned. No single database retrieved all the genes found by the other databases, demonstrating the need to query multiple databases. The substantial contribution of each database shows that all three are required to maximize the number of candidate genes discovered, although some relevant genes are likely still missed by the set of databases queried. As new databases come into prominence, Function2Gene can be extended to query them as well. The top 1204 genes (of which 836 were returned by GeneCards, 699 by Harvester, and 135 by NCBI) were used to select 9412 SNPs. The number of genes selected was dictated by the capacity of the chip (≈ 10,000 SNPs) and by a decision to have approximately ten SNPs per gene on average. The choice of SNPs to genotype within the selected genes was based on available information from databases including the Human Haplotype Mapping Project (HapMap), with priority given to SNPs with high heterozygosity in two or more relevant ethnicities and to SNPs representing amino acid coding variants. The selected SNPs were then cross-checked against the accumulated SNP validation test results available at our industrial collaborator (ParAllele Biosciences), an active participant in the International HapMap project.
Results of gene selection process
Using the selected SNPs, 251 nuclear families, each consisting of both parents and the affected child (full trios), were genotyped. Analysis of the genotypes of the 251 trios using the Transmission Disequilibrium Test (TDT), followed by False Discovery Rate (FDR) correction for multiple testing, yielded 9 noteworthy genes associated with SLE at an FDR of less than 0.5; two of these genes were highly significant, with an FDR of less than 0.05 [10].
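The two-step analysis described here can be illustrated with a minimal sketch. It assumes the classical McNemar-type TDT statistic, (b - c)² / (b + c), computed from the counts of transmitted and untransmitted alleles from heterozygous parents, and it assumes the Benjamini-Hochberg procedure for the FDR step; the text does not state which FDR procedure was used, so that choice, like all function names below, is illustrative only.

```python
import math


def tdt_statistic(transmitted, untransmitted):
    """McNemar-type TDT chi-square statistic (1 degree of freedom).

    Counts are transmissions of the tested allele from heterozygous
    parents to affected children.
    """
    total = transmitted + untransmitted
    if total == 0:
        return 0.0
    return (transmitted - untransmitted) ** 2 / total


def chi2_1df_pvalue(stat):
    """Upper-tail p-value of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order.

    Assumption: BH is used here only as a common example of an FDR
    procedure; the study's actual procedure is not specified in the text.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

A SNP would then be reported as noteworthy when its adjusted value falls below the chosen FDR threshold (0.5 or 0.05 in the text).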
Using Bayesian methodologies, pre-existing knowledge of a disease can be given greater weight in the discovery of genes associated with it, because the posterior probability that a gene is associated with the disease can be adjusted according to its prior probability as reported by Function2Gene. The False Positive Report Probability (FPRP) is one such method: it uses the prior probability of association, which can be calculated from the results of the keyword-based gene selection, to modify the posterior probability of association. Using Bayes' theorem, FPRP determines the probability that the null hypothesis (no association) is true given a test statistic greater than Zα (that is to say p ≤ α), knowing the power (1 - β), the prior probability of association (π), and the probability of the measured data given that the null hypothesis is true (p).
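As a minimal sketch of this calculation, the function below uses the form in which FPRP is commonly stated, FPRP = α(1 - π) / (α(1 - π) + (1 - β)π). The function and parameter names are illustrative; in practice the observed p-value is often substituted for α.

```python
def fprp(alpha, power, prior):
    """False Positive Report Probability.

    alpha: significance level (or observed p-value used in its place)
    power: statistical power, 1 - beta
    prior: prior probability of a true association, pi
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    if not 0.0 < power <= 1.0:
        raise ValueError("power must lie in (0, 1]")
    false_positive_mass = alpha * (1.0 - prior)  # null true, test significant
    true_positive_mass = power * prior           # association real, test significant
    return false_positive_mass / (false_positive_mass + true_positive_mass)
```

For a fixed α and power, a larger prior gives a smaller FPRP, which is how a high keyword-based ranking can make a SNP easier to report as noteworthy.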
One method of calculating the prior probability from the keyword-based gene selection is to order the SNPs by the number of times they were returned by different keywords, taking into account the biological relevance of each SNP, and then to apply a continuous function such that higher-ranked SNPs have a greater prior probability of association than lower-ranked SNPs, with the priors summing to the prior estimate of the total number of SNPs in the search believed to be associated with the disease. An alternative method is to order the SNPs in the same manner and then place them into groups, assigning the same prior probability to every SNP in a group while controlling the sum of the prior probabilities assigned. For example, assuming 10,000 SNPs, 10 of which are believed to be associated, assign priors of π = 0.025 to the top 1%, 6.25 × 10⁻³ to the next 4%, 1.25 × 10⁻³ to the next 20%, and 3.33 × 10⁻⁴ to the remaining 75%, so that each group contributes approximately 2.5 expected associated SNPs. In this manner the multiple testing effect is controlled while maximizing the effect of the available prior information. Applying FPRP [11] to the results of the TDT with a prior assumption of 8 associated SNPs yielded 12 noteworthy genes, including all 9 obtained with the FDR correction and the same two significant genes [10].
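The grouped assignment in the example above can be checked with a short sketch. The tier fractions and prior values are taken from the text; the function name and the assumption that the SNPs are already sorted from highest to lowest rank are illustrative.

```python
import math


def tiered_priors(n_snps, tiers):
    """Assign one prior per SNP, with SNPs assumed sorted best-ranked first.

    tiers: list of (fraction_of_snps, prior) pairs, highest tier first.
    """
    priors = []
    for fraction, prior in tiers:
        count = round(n_snps * fraction)
        priors.extend([prior] * count)
    if len(priors) != n_snps:
        raise ValueError("tier fractions do not cover every SNP exactly once")
    return priors


# Tiers from the worked example: top 1%, next 4%, next 20%, remaining 75%.
TIERS = [(0.01, 0.025), (0.04, 6.25e-3), (0.20, 1.25e-3), (0.75, 3.33e-4)]

priors = tiered_priors(10_000, TIERS)

# Sum of priors = expected number of associated SNPs.
# Each tier contributes about 2.5, giving about 9.9975 in total (roughly 10).
expected_associated = math.fsum(priors)
```

Because the sum of the priors is fixed in advance, raising the prior of the top-ranked SNPs necessarily lowers it elsewhere, which is what keeps the multiple testing burden under control.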
An existing web-based program that is functionally similar to the methodology presented here is the Disease Candidate Gene search of SNPs3d. Using the keywords chosen by SNPs3d for three diseases (diabetes, pancreatic cancer, and Alzheimer disease), we have compared the results obtained by SNPs3d and Function2Gene in Table . The majority of high-ranking genes returned by SNPs3d are also returned by Function2Gene, but Function2Gene returns a far greater number of genes.
Comparison of Function2Gene and SNPs3d
Future advancements of the approach presented here could come from more powerful literature mining techniques, which would reduce (or even eliminate) the need for expert information on the nature, pathology, and biology of the disease to generate a list of keywords and discard spurious results. Such techniques would also reduce the reliance of this approach on the contents of stewarded fields in the databases, enabling novel associations as well as incorrect associations to be discerned. For example, Named Entity Recognition (NER) and Relationship Extraction (RE) could be used in tandem to elucidate connections between diseases and genes directly. NER identifies biologically relevant entities (such as genes and proteins) in the literature using techniques such as hidden Markov models and dictionaries. Once entities have been identified, RE can identify the relationships between them using the proximity of entities (and the recurrence of entities in close proximity), along with rule-based systems and full predicate/subject grammars [12]. It would then be possible to walk the relationship tree, using the probabilities between each node of the tree connecting specific genes and a disease (with intervening genes, proteins, and biological pathways in between), and then to order the resultant genes by the probability of their connection, which should be directly proportional to the prior probability of association.
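One possible reading of this relationship walk is sketched below. It assumes, purely for illustration, that the extracted relationships form a weighted graph, that the probabilities along a path combine by multiplication, and that a gene is scored by its single most probable path from the disease; none of these choices is specified in the text. The node names in the usage example are hypothetical.

```python
import heapq
import math


def best_path_probabilities(graph, source):
    """Most probable path from source to every reachable node.

    graph: dict mapping node -> {neighbor: probability in (0, 1]}.
    Uses Dijkstra's algorithm on -log(probability), so the cheapest
    path is the one with the largest product of probabilities.
    """
    cost = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        current_cost, node = heapq.heappop(heap)
        if current_cost > cost.get(node, math.inf):
            continue  # stale heap entry
        for neighbor, prob in graph.get(node, {}).items():
            if not 0.0 < prob <= 1.0:
                raise ValueError("edge probabilities must lie in (0, 1]")
            new_cost = current_cost - math.log(prob)
            if new_cost < cost.get(neighbor, math.inf):
                cost[neighbor] = new_cost
                heapq.heappush(heap, (new_cost, neighbor))
    return {node: math.exp(-c) for node, c in cost.items()}


def rank_genes(graph, disease, genes):
    """Order candidate genes by connection probability, highest first."""
    probabilities = best_path_probabilities(graph, disease)
    scored = [(gene, probabilities.get(gene, 0.0)) for gene in genes]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy graph: a disease linked to genes directly or through
# an intervening pathway and protein.
toy_graph = {
    "disease": {"pathway_A": 0.9, "GENE1": 0.3},
    "pathway_A": {"protein_X": 0.8, "GENE2": 0.5},
    "protein_X": {"GENE1": 0.7},
}
ranking = rank_genes(toy_graph, "disease", ["GENE1", "GENE2", "GENE3"])
```

In this toy graph GENE1 is reached more strongly through the pathway and protein (0.9 × 0.8 × 0.7) than through its direct link (0.3), and an unconnected gene such as GENE3 receives a score of zero. The resulting scores would still need to be rescaled before being used as prior probabilities of association.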