A common aim of many genetically oriented RA studies is to find genes responsible for, or contributing to, one or several RA-related phenotypes. Typically, a genomic region is known to be associated with a phenotype, but the region usually contains many genes that are possible candidates. In particular, when QTL analysis is employed in rats, selecting gene candidates has become a recurrent part of the data analysis. An important part of the search for candidate genes is checking the available bioinformatic resources; most often the written information describing gene function is very informative. The aim of this study was to facilitate this data mining by generating a web-based tool called Candidate Gene Capture (CGC), whose purpose is to identify potential candidate genes associated with experimentally induced arthritis phenotypes in rats.
In brief, the CGC application makes it possible to retrieve a large number of QTL regions previously described in the literature. For each rat QTL, the homologous genomic region in humans is automatically displayed. All genes included in the corresponding human genomic interval can be queried for up to 49 default keywords and up to 10 keywords selected by the user. Each keyword is given a value based on an algorithm that estimates how closely related the keyword is to the term 'arthritis' according to their simultaneous occurrence in PubMed abstracts. OMIM records for human genes in a selected genomic region are ranked by their total keyword values; that is, the sum of the values for all keywords that hit a record. The higher the total keyword sum of a record, the more likely the gene is to be a candidate. The application can be accessed from the RatMap home page [7] or directly at http://ratmap.org/cgc
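The ranking step described above can be illustrated with a minimal Python sketch. It is not the CGC implementation: the function names, the gene identifiers and the record texts are hypothetical, and the sketch assumes that each keyword contributes its value at most once per record and that matching is a simple case-insensitive substring test. The values for 'joint' and 'T cell' are those quoted later in this section; the value for 'inflammation' is invented for illustration.

```python
from typing import Dict, List, Tuple


def keyword_sum(record_text: str, keyword_values: Dict[str, float]) -> float:
    """Sum the values of all keywords that occur at least once in the record.

    Assumption: a keyword counts once per record, however often it occurs.
    """
    text = record_text.lower()
    return sum(value for keyword, value in keyword_values.items()
               if keyword.lower() in text)


def rank_records(records: Dict[str, str],
                 keyword_values: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (gene, keyword sum) pairs sorted from highest to lowest sum."""
    scored = [(gene, keyword_sum(text, keyword_values))
              for gene, text in records.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical keyword values and record texts, for illustration only.
keyword_values = {"joint": 24.2, "T cell": 2.8, "inflammation": 5.0}
records = {
    "GENE_A": "Expressed in synovial joint tissue; regulates T cell activation.",
    "GENE_B": "Involved in inflammation of the gut.",
    "GENE_C": "Encodes a structural protein of the lens.",
}

ranking = rank_records(records, keyword_values)
for gene, total in ranking:
    print(f"{gene}\t{total:.1f}")
```

In this toy example the record that is hit by two keywords is ranked first, and a record without any keyword hit receives a sum of zero and is ranked last.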
Comparison of manual evaluation with CGC ranking
To estimate the ability of the CGC application to rank candidate genes in a fashion similar to human evaluation, an independent manual inspection was made. Four randomly selected collagen-induced arthritis QTLs were used (Cia4, Cia10, Cia14 and Cia17). The OMIM records used in the CGC prediction were surveyed manually and rated on a scale from 1 to 5. Comparing the manual and CGC ratings, it was found that the two highest-ranked candidate genes in the CGC application for all QTLs studied were rated as high in the manual evaluation, with the exception of one gene, CD74 in Cia17. However, CD74 turned out to be a very likely gene candidate when additional literature was surveyed (see below).
In an extended literature search for the two highest CGC-ranked genes of Cia4, Cia10, Cia14 and Cia17, it was confirmed that seven of eight genes were clearly associated with RA. Literature not covered by the OMIM reference lists revealed that three of these genes (IL5, CD74 and HMOX1) had a strong association with RA. Many different keywords fitted each of the OMIM records associated with these three genes. Although none of these keywords had a very high keyword value (ranging from 1.6 to 9.7), the resulting keyword sums (IL15, 27.3; CD74, 22.3; HMOX1, 13.5) still clearly diverged from the keyword sums of other genes within the same QTLs. Thus, the CGC application is able to predict candidate genes from OMIM records even though the association with RA is not explicitly mentioned in the text.
In addition to the two highest-ranked genes in the four QTLs evaluated, we also designated a middle group of candidate genes that were ranked in positions 3 to 6 by the CGC application (except for Cia4, in which the middle group comprised genes ranked in positions 3 and 4). The remaining genes for each investigated QTL formed a separate group (the low group). Comparing the mean values of the CGC ranking with the manual ratings for these three groups (the two highest, the middle group and the low group), a general agreement was found in the ranking of candidate genes (Table). The only exception was the relatively low manually rated 'best two' group for Cia17, which is fully explained by the low manual rating of CD74. As described above, on closer inspection the manual rating of CD74 turned out to be too cautious.
Comparison between manual evaluation and Candidate Gene Capture (CGC) rating
Finally, gene records without any keyword hits at all were not found to be associated with RA in the manual inspection.
Thus, when the CGC prediction is compared with manual inspection, the conclusion is that the application makes a reliable evaluation of the OMIM records for the four QTLs studied in detail. For three genes (IL5, CD74 and HMOX1) the CGC application rated the gene records as more interesting than the manual inspection did, an assessment confirmed by recent papers not yet included in the OMIM reference list. This shows that the CGC application is a very helpful tool for finding gene candidates contributing to RA. Furthermore, the CGC application also seems to follow our manual interpretation for genes that might be of interest (referred to as the 'middle group') as well as for genes with no evident connection to RA.
No clear-cut connection can be made between the absolute sum of keyword values and the relevance of candidate genes. However, our evaluation of the four Cia QTLs implies that the ranking of the genes within each QTL based on the keyword sums provides a good prediction of the best candidate genes. For example, in QTL region Cia12, NCF1 has been shown by Olofsson and colleagues to be involved in the regulation of arthritis severity in rats [29]. As expected, NCF1 also obtains a very high keyword sum (225.6), mainly because of the description of Olofsson's findings in the OMIM text. When this description is excluded from the OMIM record, the NCF1 keyword sum decreases to 10.8. This still makes NCF1 the highest-ranked gene in this QTL region. As exemplified above, the CGC application is able to find candidate genes even though their relatedness to RA is not explicitly mentioned in the text investigated. In the paper describing Olofsson's findings, the authors stated that they found the candidate gene approach distracting, even though they were facing a region that contained a small set of genes. This could very well be so, but when analysing the genes within a QTL it seems reasonable to start with the most likely candidate genes rather than with randomly picked ones, especially if the region contains a large number of genes. The CGC application makes an unbiased evaluation of genes within a region, indicating which are the most favourable ones to start analysing. Looking at the NCF1 example retrospectively, CGC would in fact have suggested NCF1 as the most probable candidate gene, although this might be a fortunate case.
Among the selected keywords, a few occasionally gave false positives. One example is the word 'joint' (keyword value 24.2), which at times occurred in unrelated expressions such as 'joint maximum LOD score'. This caused the gene KEL, for example, to be ranked highest (28.7) for the Aia2 QTL. Another example is 'T cell' (keyword value 2.8), which can match phrases such as 'mutant cell' or 'that cell', as found in the OMIM record for EDG1 (Cia10). In addition, it was found that some keywords can be mistaken for author names. EDG1, for example, was falsely predicted as a candidate gene partly because the term 'HLA' matched an author name (Hla T. Maciag T. J Biol Chem 1990;265:9308-13).
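The false positives described above are consistent with a case-insensitive substring match, although the matching rule used by CGC is not specified here. The following Python sketch, with hypothetical function names, contrasts such a naive match with a stricter, case-sensitive match on word boundaries. It is meant only to illustrate the failure mode and one possible mitigation, not to describe how CGC works.

```python
import re


def naive_hit(keyword: str, text: str) -> bool:
    """Case-insensitive substring match; reproduces the false positives."""
    return keyword.lower() in text.lower()


def strict_hit(keyword: str, text: str) -> bool:
    """Case-sensitive match that must start and end on a word boundary."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, text) is not None


examples = [
    ("T cell", "expression was lost in the mutant cell line"),
    ("T cell", "the receptor activates T cell responses"),
    ("HLA", "Hla T. Maciag T. J Biol Chem 1990;265:9308-13"),
    ("joint", "a joint maximum LOD score was obtained"),
]

for keyword, text in examples:
    print(keyword, "|", naive_hit(keyword, text), "|", strict_hit(keyword, text))
```

The stricter rule rejects 'mutant cell' and the author name 'Hla', but it still accepts 'joint maximum LOD score', because that phrase contains the keyword as a whole word; such cases can only be resolved from context. Case-sensitive matching also has a cost, since it would miss a keyword that is capitalised differently, for instance at the start of a sentence.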
Forty-nine keywords were selected, based on PubMed MeSH terms and other terms frequently found in the literature on RA. However, this might not be a completely exhaustive set of keywords, and a user of the CGC tool might want to extend or exchange parts of this keyword list. To make this possible, the user can add up to 10 keywords of his or her own, for which the corresponding keyword values are calculated automatically. These keywords can be used alone or together with the whole or parts of the default keyword list. It should be emphasised that there is no real harm in using a large number of keywords, because irrelevant keywords, such as 'and' or 'is', will receive keyword values close to zero and will thus not disturb the selection process. In addition, the user is allowed to overrule all keyword values if preferred and enter values of his or her own choice.
Comparison with related databases
To our knowledge there are three databases other than CGC that address the problem of finding candidate genes for complex disorders.
GeneSeeker is a web-based tool that permits the user to search different databases simultaneously, given a known human genetic location and one or more expression or phenotypic patterns [30]. Moreover, data from syntenic regions in mouse can be included in the queries. The tool is a general instrument that has its strength in the range of databases covered. However, GeneSeeker has no means of prioritizing between the genes retrieved. Because the CGC tool is specifically adapted for arthritis models, many more keywords relevant to this phenotype are available here, although both applications permit the user to enter his or her own keywords.
POCUS (Prioritizing Of Candidate genes Using Statistics) is an application that rates genes on the basis of their similarity to a set of genes generally considered to be associated with a given complex trait [31]. The similarity is quantified by measuring the number of functional annotations (Gene Ontology terms or InterPro domain ID) and/or expression pattern terms and IDs in common (Unigene or NCBI). Although POCUS prioritizes between the gene candidates, the strategy is different from that used for CGC. The genes associated with a given trait are not restricted to a specific genomic region. However, the authors claim that the application might be extended to work in such a way. POCUS is not a web-based tool but can be downloaded.
G2D (candidate Genes To inherited Diseases) is another database accessible from the web [32]. G2D is built on a strategy resembling that of CGC. In brief, chemical terms have here been given scores calculated in a similar fashion to that in CGC; that is, from the simultaneous occurrence of chemical terms (MeSH-C) and pathological conditions (MeSH-D) in PubMed. For a given disease, several pathological conditions were selected on the basis of a set of representative papers. These pathological conditions were then related to functional descriptions (Gene Ontology terms) by using RefSeq annotations (RefSeq-NCBI) as mediating links, and the degree of relatedness was represented by 'GO-scores'. A gene can be related to a given disease by calculating the average GO-score annotated for that gene. In many ways this approach resembles that described in this paper, although G2D depends on Gene Ontology terms instead of a full text. Moreover, G2D uses the mean GO-score for rating genes rather than calculating the sum. As a consequence, a gene with a GO-score based on just a single Gene Ontology term is rated higher than a gene that is annotated for the same term together with additional Gene Ontology terms with lower scores. Furthermore, in contrast to CGC, the G2D database is a static database in which no data input from the user is possible, and at present no information on RA is available.
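The consequence of rating by the mean rather than by the sum can be made concrete with a small Python sketch. The scores below are hypothetical and are not taken from G2D or CGC; they only illustrate the arithmetic of the two aggregation rules.

```python
from statistics import mean

# Hypothetical term scores for two genes, for illustration only.
gene_single = [8.0]            # annotated with one high-scoring term
gene_multi = [8.0, 3.0, 1.0]   # the same term plus two lower-scoring terms

mean_single, mean_multi = mean(gene_single), mean(gene_multi)
sum_single, sum_multi = sum(gene_single), sum(gene_multi)

# Mean-based rating: the additional low-scoring terms pull the second gene down.
print("mean:", mean_single, mean_multi)

# Sum-based rating: every additional hit can only raise the total.
print("sum: ", sum_single, sum_multi)
```

With these numbers the mean favours the gene with a single annotation (8.0 against 4.0), whereas the sum favours the gene with the additional annotations (12.0 against 8.0).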
As our next step we plan to extend the CGC application to include other text-based resources, such as PubMed abstracts, Swiss-Prot descriptions and, as a complement, Gene Ontology terms. In addition, we are currently extending the CGC tool to include rat QTLs for metabolic disorders, mainly focused on diabetes mellitus type II. The long-term goal is that the CGC tool will be able to predict candidate genes for rat QTLs of any given type, such as those for multiple sclerosis, blood pressure or obesity. The strategy used in CGC could also be applied to QTLs in other species, such as mouse or human.