) is a rare lipid-storage disease that leads to a complex combination of neurologic dysfunctions, including cerebellar, pyramidal and extrapyramidal signs, neuropathy, dementia and psychiatric disturbances, as well as extra-neurological manifestations (chronic diarrhea, cataracts, tendon xanthomas, premature arteriosclerosis) [1]. CTX is caused by mutations in the gene CYP27A1, which encodes the mitochondrial enzyme sterol 27-hydroxylase; deficiency of this enzyme causes an excess of intermediate metabolites such as cholestanol to accumulate in virtually every tissue. Like many neurodegenerative disorders, CTX is characterized by an insidious onset, a progressive course and a variable combination of clinical manifestations in each patient, which, together with the rarity of the disease, hampers correct and early diagnosis. Therapeutic delay is especially catastrophic in CTX, since there is a specific treatment (chenodeoxycholic acid) that is effective in reducing the plasma levels of cholestanol but has not been demonstrated to improve established neurological deficits. Mutation analysis of the CYP27A1 gene is a key step in the diagnosis of CTX and is routinely performed. The availability of comprehensive genotype-to-phenotype data sets will be crucial to promote early recognition and optimize the diagnostic process.
In any disease, but most especially in rare diseases, access to detailed patient datasets from research and clinical studies, including genetic variants and phenotypic manifestations, would significantly improve diagnosis and treatment. Electronic patient records can gather diverse types and growing amounts of phenotypic data, while genome-scale DNA sequencing techniques allow the collection of an increasing number of genetic variants per individual. Thus, integrating complex phenotype descriptions with genetic testing records has become one of the main challenges of biomedicine [2]. As the number of openly accessible datasets continues to rise, the integration of research repositories and patient clinical data will become more viable. However, bioinformatics tools are needed to help explore complex genotype-phenotype relationships. Geneticists need software tools able to retrieve and analyze the data produced in diverse clinical settings and associated with a given new genetic variant; that is, tools that answer questions like "what are the phenotype traits that have been identified in patients with this genetic variation?" Clinicians, on the other hand, would see their work greatly facilitated by being able to answer queries like "what genes or genetic variants are associated with this particular combination of observable features?"
The development of locus-specific mutation databases (LSDBs) and of tools to build them, such as the Leiden Open Variation Database (LOVD) [3] and the Universal Mutation Database (UMD) [4], started to pave the way towards solving the problem of collecting genetic datasets produced by diverse experimental methods in different laboratories. However, the phenotype description in most LSDBs is scarce. The Human Variome Project (HVP) [5] is an international initiative aiming ultimately at the worldwide collection and harmonization of all human genetic variations and associated phenotypic data. The GEN2PHEN project also represents an international attempt to address the logistical and technical challenges of joining disparate genotype-phenotype resources in a shared mode [6]. To achieve that goal, communication standards are needed to allow interoperability between clinical and genetic datasets. Standards to represent genetic findings are already available, such as those produced by the HUGO committee (http://www.genenames.org/aboutHGNC.html), gene relationships provided by Gene Ontology [7] or the nomenclature for description of sequence variants proposed by the Human Genome Variation Society (HGVS, http://www.hgvs.org/mutnomen). However, reaching a comparable level of consensus on the best descriptors for phenotypic information is far more complex, and no such consensus has yet been reached in clinical medicine.
Although the term "phenotype" covers an extensive range of information, varying from molecular to organism-level observable characteristics [8], in this work phenotype means only an observable human trait, such as an anatomical abnormality (e.g., juvenile cataracts) or a clinical feature (e.g., tendon xanthomas). Currently, the most useful catalog of human Mendelian disorders is OMIM, the Online Mendelian Inheritance in Man [9], a text-based knowledge source of human phenotypes and related genes. OMIM describes phenotypes using narrative sentences (e.g., "normal to slightly elevated plasma cholesterol"). Although these textual descriptions are highly expressive, capturing phenotype information in free-text database fields hampers computational processing and inference [10]. The use of a standard terminology provides a more appropriate method of expressing unambiguous, computable, and interoperable phenotype descriptions. Standard terminologies organize the concepts of a particular domain into a taxonomy (e.g., epilepsy is a seizure disorder, which is an abnormality of the central nervous system), assigning them identifiers that do not change with new versions. They also address the issue of different synonyms for the same concept (e.g., convulsions vs. epileptic seizures).
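The three properties of a standard terminology described above (an is-a taxonomy, stable identifiers, and synonym handling) can be illustrated with a minimal Python sketch. The identifiers (`EX:0001` and so on), the `Term` class and the helper functions are hypothetical and do not come from any real terminology; attaching both synonyms to the "seizure disorder" term is likewise an assumption made only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    term_id: str                                   # stable identifier (hypothetical)
    label: str                                     # preferred name
    parents: list = field(default_factory=list)    # is-a links, by identifier
    synonyms: list = field(default_factory=list)   # alternative names


# Toy taxonomy built from the example in the text.
TERMS = {
    "EX:0001": Term("EX:0001", "abnormality of the central nervous system"),
    "EX:0002": Term("EX:0002", "seizure disorder", parents=["EX:0001"],
                    synonyms=["convulsions", "epileptic seizures"]),
    "EX:0003": Term("EX:0003", "epilepsy", parents=["EX:0002"]),
}


def resolve(name):
    """Map a label or synonym (case-insensitive) to its identifier, or None."""
    key = name.strip().lower()
    for term in TERMS.values():
        names = [term.label] + term.synonyms
        if key in (n.lower() for n in names):
            return term.term_id
    return None


def ancestors(term_id):
    """Return the identifiers of all broader terms reachable via is-a links."""
    seen = set()
    stack = list(TERMS[term_id].parents)
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(TERMS[current].parents)
    return seen


def is_a(term_id, broader_id):
    """True if term_id is the same as, or a descendant of, broader_id."""
    return term_id == broader_id or broader_id in ancestors(term_id)
```

Because records would store the identifier rather than the wording, two data sets that say "convulsions" and "epileptic seizures" resolve to the same concept, and a query for any central nervous system abnormality also retrieves patients annotated with epilepsy.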
Patient data from clinical and research settings are usually stored in different formats, from simple spreadsheets to relational databases, which makes it extremely difficult to integrate genotype-phenotype data across multiple formats. Semantic web technology provides an adequate instrument for recording phenotypes in a standardized fashion and with a high degree of expressivity. Using this technology to represent data will ensure their compatibility with future knowledge and data resources. Additionally, one of the main challenges of articulating queries on phenotype-genotype relationships is the discrepancy in the level of abstraction between phenotype descriptions and patient clinical data. Semantic web technology provides a layer of abstraction that makes such queries simpler to formulate. Moreover, this technology is based on the open world assumption: everything we do not know is undefined. Hence, unknown relationships are interpreted as not computable instead of false. This approach deals naturally with incomplete information, which is very common in biomedicine, and it is able to refine knowledge when new information comes along.
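The difference between the open world assumption and the closed world reading typical of a database lookup can be shown with a small Python sketch. It is only an illustration under stated assumptions: the patient records are invented, `None` stands in for "unknown", and the function names are hypothetical; it does not reproduce the formal semantics of any semantic web language.

```python
from typing import Optional

# Invented records: each patient has traits explicitly observed as present
# and traits explicitly ruled out. Anything else has simply not been recorded.
PATIENTS = {
    "patient_1": {
        "present": {"tendon xanthomas"},
        "absent": {"juvenile cataracts"},
    },
}


def has_trait_open_world(patient_id: str, trait: str) -> Optional[bool]:
    """Open world: True or False only when asserted, None when unknown."""
    record = PATIENTS[patient_id]
    if trait in record["present"]:
        return True
    if trait in record["absent"]:
        return False
    return None  # no statement either way, so the answer is undefined


def has_trait_closed_world(patient_id: str, trait: str) -> bool:
    """Closed world: whatever is not recorded as present is taken to be false."""
    return trait in PATIENTS[patient_id]["present"]


def record_observation(patient_id: str, trait: str, present: bool) -> None:
    """Refine the record when new information arrives."""
    record = PATIENTS[patient_id]
    target, other = ("present", "absent") if present else ("absent", "present")
    record[other].discard(trait)
    record[target].add(trait)
```

For a trait that was never examined, such as chronic diarrhea in `patient_1`, the closed world function answers `False` while the open world function answers `None`; once `record_observation` adds the finding, both agree.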
One option for dealing with phenotype complexity is to define a minimum set of phenotype template fields [11]. In contrast, an ontology-based technology provides a more open and flexible representation mechanism [10], thus facilitating the continuing incorporation and interpretation of new phenotype characteristics. An ontology is a data model that represents a set of entities in some domain and the relationships among those entities. One of the benefits of using ontologies is the potential to apply reasoners (logical inference tools), which can infer new data that subsequently facilitate query answering and statistical analysis. In the present work, we used patient data from a specific rare genetic disease (CTX) to formally represent phenotype descriptions using the ontological paradigm [15]. We then engineered the patient data into an ontology-based patient model and finally executed queries on genotype-phenotype relationships with a Semantic Web approach.
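As a rough illustration of how inference can support genotype-phenotype queries in both directions, the following Python sketch stores patient data as subject-predicate-object triples, applies one forward-chaining rule, and answers the geneticist's and the clinician's questions. It is a toy under stated assumptions, not the model or tooling used in this work: the patients, the variant names (`variant_A`, `variant_B`), the predicate names and the link from "juvenile cataracts" to "cataracts" are all hypothetical.

```python
# Invented data: (subject, predicate, object) triples.
TRIPLES = {
    ("patient_1", "has_variant", "variant_A"),
    ("patient_1", "has_phenotype", "juvenile cataracts"),
    ("patient_1", "has_phenotype", "tendon xanthomas"),
    ("patient_2", "has_variant", "variant_A"),
    ("patient_2", "has_phenotype", "chronic diarrhea"),
    ("patient_3", "has_variant", "variant_B"),
    ("patient_3", "has_phenotype", "tendon xanthomas"),
    ("juvenile cataracts", "is_a", "cataracts"),  # hypothetical taxonomy link
}


def infer(triples):
    """Apply one rule until nothing changes:
    X has_phenotype P and P is_a Q  =>  X has_phenotype Q."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        for s, p, o in list(inferred):
            if p != "has_phenotype":
                continue
            for s2, p2, o2 in list(inferred):
                new = (s, "has_phenotype", o2)
                if p2 == "is_a" and s2 == o and new not in inferred:
                    inferred.add(new)
                    changed = True
    return inferred


def objects(triples, subject, predicate):
    return {o for s, p, o in triples if s == subject and p == predicate}


def subjects(triples, predicate, obj):
    return {s for s, p, o in triples if p == predicate and o == obj}


def phenotypes_for_variant(triples, variant):
    """Geneticist's question: traits seen in patients carrying this variant."""
    traits = set()
    for patient in subjects(triples, "has_variant", variant):
        traits |= objects(triples, patient, "has_phenotype")
    return traits


def variants_for_phenotypes(triples, phenotypes):
    """Clinician's question: variants found in patients showing all the traits."""
    patients = None
    for trait in phenotypes:
        matching = subjects(triples, "has_phenotype", trait)
        patients = matching if patients is None else patients & matching
    variants = set()
    for patient in patients or set():
        variants |= objects(triples, patient, "has_variant")
    return variants


# Knowledge base after inference; queries can now use the broader term.
KB = infer(TRIPLES)
```

Querying `KB` for the combination "cataracts" plus "tendon xanthomas" finds `patient_1` even though the record only states "juvenile cataracts", whereas the same query against the raw `TRIPLES` finds nothing. In a real setting, a reasoner over an ontology and a standard query language would replace these hand-written functions.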