Ceaseless advances in biotechnology, along with the growing experience accumulated by researchers in recent years, have allowed an ever faster growth in the number of genomes being sequenced and, most importantly, annotated. The consequent need to store, retrieve, share and, in particular, understand this vast amount of data led to the creation of genome databases, an open source of genetic information for scientists worldwide.
Whereas the genomic era opened the doors to the very existence of such large and comprehensive (omic) data repositories, the strongest urgency of the post-genomic era is now to interrelate various sources of biomedical information.
Several parallel efforts are currently underway to achieve a better understanding of the human genome. These actions are directed at the extraction of high-throughput information from global approaches such as the International HapMap Project (1) for the identification of single nucleotide polymorphisms or the continuation of ENCODE [ENcyclopedia Of DNA Elements (2)] for the identification of all functional elements in the genome sequence. The integration of the various existing and upcoming efforts is therefore going to be a key element in the full comprehension of the cellular machinery.
We focus here on one of the many fields that will strongly benefit from such integration: the study of hereditary diseases. Often more than one gene is involved in life-threatening disruptions of cellular function. To characterize such diseases, the identification of all the responsible genes is ultimately a crucial requirement. This process usually involves the costly, time-consuming and difficult tracing of large family lineages to follow the line of transmission of genes and thus to define the linkage areas where genes responsible for the disease could be located.
Computational technologies can appropriately be employed to integrate the available data and can, in principle, be used to save on the expensive process of candidate gene selection. In this article we describe TOM [Transcriptomics of OMIM (3)], an automated pipeline for the extraction of the best candidate genes for a given genetic disease. The procedure supports two possible starting points. In the first case, the accepted input is a list of one or more genes (called the seed(s) of our search) plus the chromosomal area where the unknown gene is located. This option (One Locus option) is suited to cases where the disease is minimally characterized and at least one responsible gene is known. The algorithm performs the necessary steps to extract, from the linkage area listed in the input, the genes that have the highest chance of being functionally related to the seeds. The second option (Two Loci option) is designed for poorly characterized diseases, when no specific gene is known a priori. At least two bona fide linkage areas need to be present. It is then possible to query the two genome tracts associated with the same pathology. The algorithm extracts the lists of genes annotated on each linkage region and searches for pairs that have similar expression or functional profiles.
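The logic of the two entry points can be sketched as follows. This is an illustrative outline only, not TOM's actual implementation: the function names, gene identifiers and the toy annotation-based relatedness predicate are all hypothetical.

```python
# Sketch of TOM's two query modes (all names and data are hypothetical).

def one_locus(seeds, locus_genes, related):
    """One Locus: keep genes in the linkage area related to at least one seed."""
    return [g for g in locus_genes if any(related(g, s) for s in seeds)]

def two_loci(locus_a_genes, locus_b_genes, related):
    """Two Loci: return cross-locus gene pairs with similar profiles."""
    return [(a, b) for a in locus_a_genes for b in locus_b_genes
            if related(a, b)]

# Toy relatedness predicate based on shared annotation terms.
ann = {"G1": {"apoptosis"}, "G2": {"apoptosis", "transport"},
       "G3": {"transport"}}
related = lambda x, y: bool(ann[x] & ann[y])

print(one_locus(["G1"], ["G2", "G3"], related))  # ['G2']
print(two_loci(["G1"], ["G2", "G3"], related))   # [('G1', 'G2')]
```

In the real pipeline the `related` predicate is replaced by the expression-correlation and GO-based filters described below.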
The scientific rationale behind TOM is rooted in three characteristic gene features: gene mapping, expression profiling and functional annotation. The combination of these three features enables the selection of genes that have the desirable characteristics and, at the same time, the filtering out of possible candidates that do not share them.
The first step, gene mapping, often a bottleneck in the past, is now inherent to the human DNA sequence. Since the decoding of the human genome, it has in fact been possible to select genes matching any specific area of the genome. The definition of one or two genomic regions of interest obviously represents the first selection criterion for candidates for a given disease. TOM then proceeds to retrieve the list of annotated genes located in those regions.
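This mapping step amounts to an interval query over annotated gene coordinates. A minimal sketch, assuming a simple in-memory list of gene records (the coordinates below are approximate and purely illustrative of the idea, not TOM's data model):

```python
# Minimal sketch of the mapping step: select annotated genes whose
# coordinates overlap a user-defined genomic region.
# Gene records (name, chromosome, start, end) are illustrative only.

genes = [
    ("BRCA2", "chr13", 32_315_000, 32_400_000),
    ("RB1",   "chr13", 48_303_000, 48_481_000),
    ("TP53",  "chr17",  7_668_000,  7_687_000),
]

def genes_in_region(genes, chrom, start, end):
    """Return names of genes overlapping [start, end] on chromosome chrom."""
    return [name for name, c, s, e in genes
            if c == chrom and s <= end and e >= start]

print(genes_in_region(genes, "chr13", 30_000_000, 40_000_000))  # ['BRCA2']
```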
Second, genes that have a common transcriptional regulation (resulting in coherent expression profiles) also have a high likelihood of being involved in the same cellular process (4). Expression profiling makes it possible to record the activation/silencing of large numbers of genes across different experiments or conditions, and it is most commonly performed by means of DNA microarrays. Measuring with appropriate metrics the distance between transcriptional profiles (i.e. the mRNA levels of two or more genes across a set of conditions) enables the identification of gene sets sharing similar behaviors. Applying this second selection criterion to the queried genetic interval(s), and extracting only the messengers with significant correlation to the seeds, leads to a strong reduction of the list of candidates.
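As one possible metric for this step, Pearson correlation between expression vectors can rank candidates against a seed. The profiles and the 0.9 cutoff below are invented for illustration; TOM's actual metric and significance threshold may differ.

```python
# Sketch of the expression filter: keep candidates whose mRNA profile
# correlates strongly with the seed's profile across conditions.
# Profiles and the cutoff are illustrative, not TOM's actual values.
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

seed_profile = [1.0, 2.0, 3.0, 4.0]          # mRNA levels over 4 conditions
candidates = {
    "G_up":   [2.1, 4.0, 5.9, 8.2],          # tracks the seed closely
    "G_flat": [3.0, 3.1, 2.9, 3.0],          # essentially uncorrelated
}

kept = [g for g, prof in candidates.items()
        if pearson(seed_profile, prof) > 0.9]
print(kept)  # ['G_up']
```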
Finally, proteins involved in the seeds' cellular processes are likely to be encoded by candidate genes as well. Functional analysis consists of characterizing a gene's cellular role. Functional views of a protein can be obtained using the controlled vocabulary defined by the Gene Ontology Consortium [GO (8)]. GO in fact enables the identification of a gene's molecular function and, in particular, of the biological process(es) in which it is involved. This third and final selection step then focuses once more on the potentially relevant genes, selecting only those that are functionally related to the seeds (One Locus) or that share the same function(s) (Two Loci). To allow the user more flexibility, the second and third steps can be applied independently.
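The functional filter reduces, in its simplest form, to comparing GO annotation sets. The sketch below uses real GO term identifiers (apoptotic process, cell death, translation, membrane) but an invented gene-to-term mapping and a deliberately naive "at least one shared term" criterion; TOM's actual comparison may be more refined.

```python
# Sketch of the GO-based filter: keep candidates sharing at least one
# GO term with the seed. The gene-to-term mapping here is invented.

go = {
    "SEED": {"GO:0006915", "GO:0008219"},  # apoptotic process, cell death
    "G_a":  {"GO:0006915", "GO:0016020"},  # shares apoptotic process
    "G_b":  {"GO:0006412"},                # translation: unrelated
}

def shares_function(gene, seed, annotations):
    """True if gene and seed are annotated with at least one common GO term."""
    return bool(annotations[gene] & annotations[seed])

kept = [g for g in ("G_a", "G_b") if shares_function(g, "SEED", go)]
print(kept)  # ['G_a']
```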
Dealing with the human genome represents a complex challenge, and several efforts have been undertaken with the specific aim of facilitating the understanding of genetic diseases, namely the extraction of candidate gene lists for a given disease. These approaches can be divided into two broad categories: one mainly based on the use of ontologies (structured vocabularies for the classification of items) and another that relies more on structural characteristics of the genes or their products. Several interesting strategies have been implemented for both approaches. In the first category, approaches range from (i) estimating the association between terms from controlled vocabularies that define a disease and genomic sequences associated with them in the literature to (ii) evaluating the similarity among genes based on phenotypic descriptions (9). In the second category, algorithms (11) are based on structural characteristics such as protein size or degree of conservation across evolution. Other approaches rely on information obtained by performing composite queries across several databases, including mouse and human gene expression (13). Despite their sound rationales, both approaches also have their disadvantages. The first class suffers from incompleteness owing to the still ongoing annotation process. The second class makes use of characteristics that are less biased than annotations, and relies on rules that have broad applicability, but may miss the specificity of the gene-by-gene discovery that is conversely available through ontologies.
TOM de facto merges the strategies used by the different approaches described above, and its main advantage lies in its flexibility. On the one hand, it is not strictly designed for queries on disease-related genes, but can be used for any type of gene(s)–locus or locus–locus inquiry. Furthermore, it can be used both by an investigator with a good level of a priori knowledge of the disease (One Locus option) and when the research is still at an early stage (Two Loci option). Finally, the user can modulate the filtering ability, deciding whether to apply unsupervised expression neighborhood analysis, taking advantage of the unbiased transcriptional information, or functional annotation, relying on the careful curation of the GO vocabulary.
We describe here in detail the differences between TOM and two other applications that, among the published works, appear most closely related to our approach. The work of Tiffin et al. makes use of gene expression information and constitutes an interesting approach based on the association (performed by text-mining Medline abstracts) between diseases and anatomy terms, based on frequency of occurrence. The top-ranking associations are then used to mine ENSEMBL (http://www.ensembl.org) in search of the disease genes annotated to the corresponding anatomic sites. This approach lacks three features that are advantageously implemented in TOM: first, it is presented as a method and is not offered as an online service to geneticists; second, gene expression data are obtained as ENSEMBL annotation, thus incurring the drawbacks described above for annotation-based methods; and third, TOM extracts genes coexpressed in any anatomic site or tissue, in diseased and non-diseased samples. This allows a broader search and the identification of genes that may be difficult to capture in the dysregulated processes, but that still share strong synergistic activities in the normal state.
A second program, POCUS, designed by Turner et al., takes advantage of a sophisticated use of both GO and InterPro domains (16), evaluating the enrichment of any annotation in a set of genes localized on two chromosomal areas defined by the user. The enrichment serves as an indicator of commonalities among the genes. This approach is partially related to our Two Loci option, offering the advantage of allowing research with very little a priori knowledge. However, it misses two issues addressed by TOM: (i) our One Locus option allows a more focused and targeted approach for better-known diseases and (ii) Turner's approach may miss information that could be obtained by mining other databases. In fact, the authors state that the quality and number of annotations are biased by the number and quality of previous findings made on the genes (which appear as annotation IDs).
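The enrichment idea underlying this class of methods can be illustrated with a hypergeometric test: given M annotated genes in the genome of which K carry a given term, how surprising is it to see k carriers among the n genes in the candidate loci? The numbers below are invented, and POCUS's actual statistic differs in its details; this is only a sketch of the general rationale.

```python
# Illustrative hypergeometric enrichment test (not POCUS's actual code).
# P(X >= k) when drawing n genes from M, of which K carry the annotation.
from math import comb

def hypergeom_pval(k, n, K, M):
    """Upper-tail hypergeometric probability P(X >= k)."""
    total = comb(M, n)
    return sum(comb(K, i) * comb(M - K, n - i)
               for i in range(k, min(n, K) + 1)) / total

# 4 of 10 locus genes carry a term present in 50 of 20,000 genome genes:
p = hypergeom_pval(4, 10, 50, 20_000)
print(f"{p:.2e}")  # a very small p-value: strong enrichment signal
```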
Finally, a very recent work by Franke et al. approaches the problem of the relationships among genes of interest, which can in some cases overlap with hereditary disease research. This work describes a complete integrated approach, making use of Bayesian networks, based on GO, gene expression and protein–protein interactions, as well as pathways and information from human yeast two-hybrid (Y2H) experiments (18). This tool allows the prediction of genes that are functionally related to candidate chromosomal areas and is thus related to the Two Loci approach described in TOM. The interesting strategy described in that paper could be used to complement our tool.
In conclusion, we believe that TOM, which parallels the successful work of Mootha et al., who devised an integrative approach to gain insight into cytochrome c oxidase deficiency, can be a valuable and efficient resource to help in candidate gene extraction. Certainly, more options can be integrated to take better advantage of the existing but partial information sources, e.g. to detect unpredicted or non-coding genes.