We have implemented, on a public web server, a method that ranks the genes in a region of the human genome according to their possible relation to a disease. Both the region and the disease can be defined by the user. Since the method is computationally very intensive, mostly due to the number of genes used for the sequence similarity analysis, we limited the maximum size of the genomic region to scan and the number of candidates to report. Still, the analysis might take a few minutes depending on the load of the server.
We have updated the method and its benchmark with respect to the original version (G2D, [2]) using newer database versions. We observed an improvement in performance that is almost certainly due both to the increased accuracy of the human genome sequence and to the continuous functional annotation efforts on human sequences and their homologues in other organisms.
In the current version, a test with 100 diseases of known genetic cause indicated that G2D finds the responsible gene in 87 cases out of a pool of 300 genes (on average), the target gene being among the 8 best-scoring genes in 47 of the 87 successful cases.
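To make these figures easier to compare, the following minimal sketch derives the corresponding rates from the numbers quoted above. The comparison against a random ranking is our own illustrative assumption (a fixed pool of exactly 300 genes ranked uniformly at random); it is not part of the benchmark itself, and all names in the code are hypothetical.

```python
# Figures quoted in the text for the current G2D benchmark.
DISEASES_TESTED = 100   # diseases of known genetic cause
GENE_FOUND = 87         # cases in which the responsible gene was found
IN_TOP_K = 47           # successful cases with the gene among the best-scoring genes
TOP_K = 8               # size of the "best-scoring" list
POOL_SIZE = 300         # ASSUMPTION: the average pool size is treated as fixed


def benchmark_summary():
    """Return rates derived from the quoted benchmark counts."""
    top_k_overall = IN_TOP_K / DISEASES_TESTED
    # ASSUMPTION: baseline is a uniformly random ranking of the pool,
    # under which the target gene lands in the top K with probability K / pool size.
    random_top_k = TOP_K / POOL_SIZE
    return {
        "detection_rate": GENE_FOUND / DISEASES_TESTED,
        "top_k_given_found": IN_TOP_K / GENE_FOUND,
        "top_k_overall": top_k_overall,
        "random_top_k": random_top_k,
        "enrichment_over_random": top_k_overall / random_top_k,
    }


if __name__ == "__main__":
    for name, value in benchmark_summary().items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions, the responsible gene is placed among the 8 best-scoring genes in 47% of all tested diseases, against roughly 2.7% expected from a random ranking of 300 genes.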
It must be noted that the identification of candidate genes by G2D relies partly on the sequence similarity comparison of (query) proteins to parts of the genome, and that it is advisable to examine the extent and position of this similarity. For example, the region of similarity could be restricted to a fragment of the query protein that is not responsible for the function that might be associated with the disease. Moreover, the method could be pointing to a pseudogene. To support the human examination of the results, we indicate the positions of the sequence similarity match both in the query protein and in the genome. We also took advantage of existing information on pseudogene prediction, and we added links to the UCSC genome browser. This allows the candidate genes to be placed in the context of the latest genetic knowledge, which has been shown to be of great help when identifying genes involved in disease [5].
Although G2D was originally devised for the analysis of single genetic regions, we encourage researchers working on complex diseases to apply the system independently to each of the genomic regions associated with a complex disease (as illustrated in the Results section), provided that the quality of the linkage analysis is good. Indeed, it has been shown that when multiple associations of genetic variation with a disease are demonstrated, there is a great likelihood that each of them separately constitutes a risk factor contributing to the disease [29]. Accordingly, it is gaining wider acceptance that the classification of a disease as monogenic might reflect our lack of knowledge of all the genes involved in that disease more than reality [1, 30]. The conceptual separation between monogenic and complex diseases might be illusory.
As far as we know, we have created a unique resource. Other efforts applying data mining to the study of genes associated with diseases have a different focus. Most importantly, none of them uses sequence similarity searches to assess candidate genes. In principle, therefore, a disease-related gene that lacks functional or protein domain annotation, or that is not even predicted to be a gene, will not be detected by such methods.
For example, Freudenberg and Propping [31] also use phenotype information associated with diseases and the GO annotations of genes, but they cluster diseases with similar phenotypes and pool the GO terms of the associated genes. This means that their method cannot tell whether a gene is related to one particular disease, only whether it is related to a pool of diseases. Although the method was tested with a leave-one-out cross-validation on a set of 878 diseases from OMIM with an already known associated gene, their results are not as good as ours: in 1/3 of the cases the disease-related gene is among 160 candidates, and in the remaining 2/3 it is among 1600. It should be noted, however, that their method does not take positional information into account. The system is not accessible through a web server.
Turner et al. [32] developed POCUS, a method for the prediction of genes related to diseases linked to more than one genomic region. It is based exclusively on the GO terms and protein domains shared among the genes from multiple loci. The performance of the method is shown on a set of 29 diseases and the genes associated with each of them (between three and eleven), using artificial loci of variable size generated around the target genes. Even with the smallest size (an average of 20 genes in each locus), the method identifies 60 out of 163 disease genes, with 56 false positives. For the two larger sizes assayed, they find a candidate for only 5 and 4 of the 29 test diseases (using an average of 94 and 187 genes per locus, respectively), with an increase in the number of false positives. This poor performance in comparison to the methods above is surely due to the fact that they do not use phenotypic information. This, together with the requirement for multiple loci, makes this method complementary to the one above and hard to compare. The method is not available as a web server, though some computer programs are given as supplementary material [32].
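For orientation, the counts reported for the smallest POCUS locus size can be expressed as precision and recall. The sketch below is only a reading of the quoted numbers, not a reanalysis: it assumes that the 56 false positives are predicted genes that are not disease genes, and the function name is hypothetical.

```python
def precision_recall(true_positives, false_positives, total_positives):
    """Return (precision, recall) from raw prediction counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / total_positives
    return precision, recall


# Counts quoted in the text for the smallest locus size (about 20 genes per locus).
# ASSUMPTION: the 56 false positives are predicted genes that are not disease genes.
POCUS_IDENTIFIED = 60
POCUS_FALSE_POSITIVES = 56
POCUS_DISEASE_GENES = 163

if __name__ == "__main__":
    p, r = precision_recall(POCUS_IDENTIFIED, POCUS_FALSE_POSITIVES, POCUS_DISEASE_GENES)
    print(f"precision: {p:.3f}, recall: {r:.3f}")
```

Read this way, about 37% of the disease genes are recovered, and roughly half of the genes proposed are correct.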
Finally, we mention the work of van Driel et al. (GeneSeeker, [5]), which is more focused on linking information related to genes and disease from multiple public databases. GeneSeeker relies on positional information from genetic linkage (to one region), and includes gene expression information that is extracted for the genes in the region from their entries in sequence databases and the MEDLINE references linked therein (but not from ESTs). The method is tested on only ten human malformation syndromes for which the associated gene is known, using an ad hoc list of organ terms for each one. Consequently, the method would not be able to find a disease-related gene lacking a link indicating expression in an organ. The average number of genes in the loci examined was 165. The results vary greatly depending on how the list of terms is applied. In the least restrictive test, the gene is found in all ten cases, but among an average of 22 candidate genes. Again, this method does not use any functional information about the genes analyzed or about the disease phenotype, so it is not surprising that its performance is inferior to that of G2D. Contrary to the methods previously discussed, this method is accessible through a web server.