Visualizing relations among biological information to facilitate understanding is crucial to biological research during the post-genomic era. Although different systems have been developed to view gene-phenotype relations for specific databases, very few have been designed specifically as a general flexible tool for visualizing multidimensional genotypic and phenotypic information together. Our goal is to develop a method for visualizing multidimensional genotypic and phenotypic information and a model that unifies different biological databases in order to present the integrated knowledge using a uniform interface.
We developed a novel, flexible and generalizable visualization tool, called PhenoGenesviewer (PGviewer), which in this paper was used to display gene-phenotype relations from a human-curated database (OMIM) and from an automatic method using a Natural Language Processing tool called BioMedLEE. Data obtained from multiple databases were first integrated into a uniform structure and then organized by PGviewer. PGviewer provides a flexible query interface that allows dynamic selection and ordering of any desired dimension in the databases. Based on users’ queries, results can be visualized using hierarchical expandable trees that present views specified by users according to their research interests. We believe that this method, which allows users to dynamically organize and visualize multiple dimensions, is a potentially powerful and promising tool that should substantially facilitate biological research.
Visualizing relations among biological information to facilitate understanding is crucial to biological research during the post-genomic era, in which the volume and complexity of available biological information are increasing at an accelerating rate. While visualizing molecular networks is intensely pursued by the community, visualizing gene-phenotype relations, the phenome (Freimer and Sabatti 2003), is of equal importance, especially for the approach of systems biology (Tao, Liu et al. 2004). Although some systems have been developed to view gene-phenotype relations for specific databases, to our knowledge, very few have been designed specifically to meet the requirements for a general flexible tool for visualizing multidimensional genotypic and phenotypic information together. A general tool for information visualization over multiple databases is needed in the post-genomic era and should meet the following basic requirements:
In this paper, we present a general visualization tool, called PGviewer, which meets the five basic requirements mentioned previously. Our aim is to develop a general method for visualizing multidimensional genotypic and phenotypic information, and a model to unify interfaces of different databases. Our method uses a tree structure to visualize the clustering relations of the multidimensional biological information across multiple databases according to users’ queries. We demonstrate its flexibility and generalizability over two sets of data.
In the rest of this paper, we will first review existing approaches for browsing, querying and visualizing biological data. Then, we will discuss the details of our system’s components, interfaces, algorithms, and our evaluation process. Next, results from the evaluation will be given. Last, we will discuss the advantages and limitations of our methods and future work.
PGviewer is based on our previous work on 1) organizing phenotypes across genomic databases and on 2) visualizing clinical phenotypes. The former methods infer relationships across heterogeneous phenotypes in distinct databases using structured ontologies or computational terminologies (Cantor and Lussier 2003; Cantor and Lussier 2004; Lussier and Li 2004). The latter method consists of another tree viewer called DynTreeViewer, which was designed to flexibly display associative relations between the components of clinical terms obtained from narrative text (Liu and Friedman 2000; Friedman, Liu et al. 2003). For example, it could display a problem-oriented view of clinical terms occurring in patient reports or a body location-oriented view. Its tree organization is similar to that of PGviewer. However, PGviewer is more flexible than DynTreeViewer. In DynTreeViewer, to modify a tree view users can change the clustering order only by bringing a level or dimension of a tree to the top level of that tree. Users cannot specify the order of dimensions below the first level. PGviewer provides full flexibility by allowing permutation of dimensions’ ordering in all levels of a tree. Another difference is that PGviewer uses a relational database to manage data instead of native XML in order to improve efficiency and scalability and to take advantage of standard database query functions.
The implementation concept of PGviewer derives from the n-dimensional data cube (Gray, Bosworth et al. 1996), an established method for organizing multi-dimensional databases, and from the Pivot Table (Graefe, Goetz et al. 1998), an important interface to the data cube. The Pivot Table allows the data cube to be rotated, or pivoted, so that different dimensions of the dataset can be arranged into a two-dimensional table. PGviewer inherits the Pivot Table’s feature of flexible data definition. PGviewer differs from the Pivot Table in that it displays results as a hierarchical expandable tree instead of a table. Another difference is that the Pivot Table is better suited to the analysis of numeric values, whereas PGviewer is designed to show associative relations among nominal data.
There are currently a number of systems aimed at browsing, querying and visualizing biological entities and their relations. The differences between our system and these existing systems are summarized in the following.
This group of systems returns output in predefined views according to users’ search criteria. Search results are formatted in predefined tables. Information from different databases is reached through hyperlinks embedded in the search results. This approach is taken by most databases, such as National Center for Biotechnology Information (NCBI) (Wheeler, Chappey et al. 2000), Mouse Genome Informatics (MGI) (Bult, Blake et al. 2004), Flybase (FlyBase_Consortium 2003) and GeneCards (Rebhan, Chalifa-Caspi et al. 1998). Technically, this browsing approach is very flexible and can be extended to any number of dimensions just by selecting available hyperlinks. Different databases are easily coordinated by URL links. However, the disadvantage is that the search interfaces focus on one fixed dimension and the returned information is organized according to a predefined view. Users who need related information must integrate it manually by laboriously following every relevant hyperlink. The associative relations of objects across multiple databases are not easily seen. This process is likely to be inefficient because of the excessive number of branches that must be followed to obtain the complete data. Our system differs from these systems because it allows users to define their information needs in one step without multi-screen browsing. Furthermore, relations among dimensions from different databases are visualized in a tree structure within the same view so that patterns are easily perceived.
This group of systems attempts to avoid the disadvantages of predefined visualization by providing a centralized platform and allowing flexible queries using special querying languages, such as TAMBIS (Baker, Brass et al. 1998), Kleisli (Wong 2000), and TINet (Eckman, Kosky et al. 2001), or using query generation interfaces (Chen, Kosky et al. 1998; Kasprzyk, Keefe et al. 2004). Because of the flexibility of query scripts and query generation interfaces, in this approach a user can freely define informational dimensions in queries across different databases. Thus, this group of systems meets the requirements for dealing with a large number of dimensions, allowing flexible queries, and coordinating heterogeneous databases. However, these systems concentrate on flexible queries rather than on flexibility in visualizing the resulting relations among biological entities, because most of them return query results as flat tables. Therefore, when a result is large, associative relations are hard to discover within a large table. In addition, in the approach of using special querying languages, the need to understand special syntax as well as database schemas may limit broad use. Our method retains the use of a graphical query generation interface to generate flexible queries. The major difference from these systems is that our system visualizes retrieved results in an organized manner in order to facilitate better understanding.
Graphic visualization systems for molecular networks have been extensively investigated (Kolpakov, Ananko et al. 1998; Koike and Rzhetsky 2000; Jenssen, Laegreid et al. 2001; Karp 2001) but they are not designed to visualize gene-phenotype relationships. A few systems do display the relations of genes and phenotypes graphically, e.g. SemGen (Rindflesch, Libbus et al. 2003) and g2p (Bodenreider and Mitchell 2003). However, these systems visualize only two dimensions of entities, namely genes and phenotypes, and no other related information. There is a general tool, called BITOLA, for exploring user-specified classes of biomedical terms from MEDLINE on a large scale (Hristovski, Peterlin et al. 2003). It is flexible in that users can specify the classes of dimensions they are interested in. However, the input is focused on one database, the output is tabular by design, and no more than three dimensions can be displayed at one time. There is another group of graphic tools for visualizing Gene Ontology (GO) (Ashburner, Ball et al. 2000) annotation information based on a large number of input genes (Zeeberg, Feng et al. 2003; Zhong, Li et al. 2003; Al-Shahrour, Diaz-Uriarte et al. 2004; Zhang, Schmoyer et al. 2004). These tools concentrate on visualizing collective profiles of phenotypic annotation based on a group of genes rather than individual genes. Their purpose is different from the one discussed in this paper.
The proposed visualization methods (PGviewer) are described below.
The basic idea behind our system is the following: databases contain objects and objects are described by attributes. All attributes within all the objects in all the databases constitute the dimensions in the whole data space. Users’ queries can be formed by selecting an ordering of these dimensions with filtering criteria on each dimension. To be presented to users, the results of a query are arranged in a tree structure so that users can explore the result space clustered through associative relations according to their needs. It is important to note that the tree structure we use represents an ordering or clustering of information, and should not be associated with a hierarchical classification, which is a typical use of a tree when specifying an ontology or taxonomy. Based on these methods, the PGviewer user interface consists of two parts, namely, 1) a query definition interface, and 2) a presentation interface of the query result.
The architecture overview of our system is illustrated in Figure 1.
PGviewer operates over a denormalized database (Figure 1). In order to generate this denormalized database, we integrate in a semi-automated way independent databases using PERL scripts, cross-indexes and “SQL join” commands. We then denormalize the relevant fields. Two datasets (human genomics, mouse genomics) are used to demonstrate that our method is generalizable. The human genomics dataset shows gene-phenotype relations collected in OMIM, which were obtained by human manual curation. The mouse genomics dataset shows gene-phenotype relations extracted from a subset of MEDLINE related to the mouse model organism. This collection consists of information extracted from Medline citations using a revised version of a natural language processing (NLP) extraction and encoding system called BioMedLEE (Chen and Friedman 2004). BioMedLEE was developed based on components of two established NLP systems, the components of MedLEE (Friedman, Alderson et al. 1994) enhanced with a small number of additional grammar components from GENIES (Friedman, Kra et al. 2001; Krauthammer, Kra et al. 2002). MedLEE has been used operationally in the clinical domain to encode information in textual patient reports since 1995, and has been shown to actually improve patient care. GENIES, which is an adaptation of MedLEE, extracts biomolecular interactions from the literature. It is a component of the GeneWays system (Rzhetsky, Koike et al. 2000; Rzhetsky, Iossifov et al. 2004), and has been used to process over 100,000 full journal articles, in order to populate the GeneWays knowledge base.
The human genomics dataset was obtained from the entire OMIM Gene Map table downloaded from the OMIM website, which contains 9,042 entries of gene-disorder relations. For this dataset, we extracted gene name, gene location, disorder and OMIM ID from this table. We also obtained the bibliographic information for each OMIM entry using a script to read OMIM’s website. To disclose the molecular mechanism of human hereditary diseases, we added GO terms for each OMIM entry via LocusLink (Maglott, Katz et al. 2000). The files we used are mim2loc and loc2go downloaded from the OMIM website. We have nine dimensions in our human genomics dataset: 1) OMIM_ID (including OMIM title), 2) gene location, 3) gene, 4) GO_term, 5) disorder, 6) PubMed_ID (including article titles), 7) year, 8) journal and 9) authors.
The mouse genomics dataset comes from three databases: 1) a subset of MEDLINE citations related to the mouse model organism, 2) gene and phenotype relations extracted from these articles using BioMedLEE, where the phenotypes are encoded using identifiers of the Unified Medical Language System (UMLS), and 3) a UMLS-GO mapping database (Sarkar, Cantor et al. 2003), which maps terms from UMLS (Lindberg 1990) to GO terms.
We collected bibliographic information, including PubMed ID, article title, journal, publication year, and authors, from the MEDLINE subset. There are over 1,200 citations in this subset. Because the original files from MEDLINE were in XML format, an XML parser written in PERL was used to flatten the files before they were imported into our database.
Genotypic and phenotypic information was extracted from the titles and abstracts of the MEDLINE subset. BioMedLEE was used to process the titles and abstracts and to extract the relevant information. Extracted information includes gene names, phenotypes and phenotype-related biological structures. BioMedLEE can encode phenotypes and biological structures into various terminology codes; in this paper we used UMLS codes. The output is structured XML. A simplified version of the output from BioMedLEE is shown below for the sentence from a MEDLINE abstract “Tsc2 heterozygote display 100% incidence of multiple bilateral renal cystadenomas, 50% incidence of liver hemangiomas, and 32% incidence of lung adenomas by 15 months of age”. Each tag represents a type of information, whereas the attribute v holds the value. Note that the value of a gene tag gives the full form of the gene name. The outermost tags represent the primary type of information (e.g. gene, phenotype); nested tags represent modifiers of that information (e.g. genemod, anatomy, and region). The tag phenotype is a semantic type associated with diseases and other abnormalities. The tag sid identifies a sentence. For example, the last phenotype tag in the example below has the value “adenoma”, which is modified by a body organ “lung”, measurement information “32 %” and a sentence ID “s1.1.1”:
<gene v="tuberous sclerosis 2"><genemod v="heterozygote"></genemod><sid idref="s1.1.1"></sid></gene>
<phenotype v="cystadenoma"><anatomy v="kidney"><region v="bilateral"></region></anatomy><measure v="100 %"></measure><sid idref="s1.1.1"></sid></phenotype>
<phenotype v="hemangioma"><anatomy v="liver"></anatomy><measure v="50 %"></measure><sid idref="s1.1.1"></sid></phenotype>
<phenotype v="adenoma"><anatomy v="lung"></anatomy><measure v="32 %"></measure><sid idref="s1.1.1"></sid></phenotype>
Similarly, a PERL script was written to parse the XML into a flat file so that it could be imported into our database. In the above example output, the gene tuberous sclerosis 2, the three phenotypes (cystadenoma, hemangioma and adenoma), and their anatomy modifiers are imported into the mouse genomics dataset.
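The flattening step can be illustrated with a minimal sketch. It is written in Python rather than PERL and is not the script used in this work. It assumes that the fragment is wrapped in a synthetic root element so that it parses as a single document, and that a gene is paired with every phenotype sharing its sentence ID; both the function name `flatten` and this pairing rule are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Simplified BioMedLEE-style output for one sentence (values taken from the example above).
SAMPLE = """
<gene v="tuberous sclerosis 2"><genemod v="heterozygote"></genemod><sid idref="s1.1.1"></sid></gene>
<phenotype v="cystadenoma"><anatomy v="kidney"><region v="bilateral"></region></anatomy><measure v="100 %"></measure><sid idref="s1.1.1"></sid></phenotype>
<phenotype v="hemangioma"><anatomy v="liver"></anatomy><measure v="50 %"></measure><sid idref="s1.1.1"></sid></phenotype>
<phenotype v="adenoma"><anatomy v="lung"></anatomy><measure v="32 %"></measure><sid idref="s1.1.1"></sid></phenotype>
"""


def flatten(fragment):
    """Return one (sentence_id, gene, phenotype, anatomy) row per gene-phenotype pair."""
    # The fragment has several top-level elements, so wrap it in a synthetic root.
    root = ET.fromstring("<doc>" + fragment + "</doc>")
    genes, phenotypes = {}, {}
    for element in root:
        sid_element = element.find("sid")
        sid = sid_element.get("idref") if sid_element is not None else None
        if element.tag == "gene":
            genes.setdefault(sid, []).append(element.get("v"))
        elif element.tag == "phenotype":
            anatomy = element.find("anatomy")
            anatomy_value = anatomy.get("v") if anatomy is not None else None
            phenotypes.setdefault(sid, []).append((element.get("v"), anatomy_value))
    rows = []
    # Assumed pairing rule: genes and phenotypes from the same sentence are related.
    for sid, gene_list in genes.items():
        for gene in gene_list:
            for phenotype, anatomy_value in phenotypes.get(sid, []):
                rows.append((sid, gene, phenotype, anatomy_value))
    return rows


ROWS = flatten(SAMPLE)
for row in ROWS:
    print("\t".join(str(field) for field in row))  # tab-separated flat-file lines
```

Each printed line corresponds to one row of the flat file; the measure and region modifiers are dropped here because only gene, phenotype and anatomy are listed among the imported dimensions.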
To demonstrate the possibility that our method could be used to find GO annotation terms using our NLP system’s output, we incorporated a terminology mapping database previously developed, which maps 17,256 UMLS terms to GO terms (Sarkar, Cantor et al. 2003). For example, the UMLS term “apoptosis” (C0162638) is mapped to two GO terms: “apoptosis” (GO:0006915) and “cytolysis” (GO:0019835) in the UMLS-GO database.
In summary, we obtained eight dimensions in our mouse genomics dataset, namely, 1) PubMed_ID, 2) journal, 3) year, 4) authors, 5) gene name, 6) anatomy, 7) phenotype and 8) GO terms.
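How separate sources might be joined into one denormalized table can be sketched as follows. This is an illustration only: it uses Python with an in-memory SQLite database in place of PERL and MySQL, the table names `citation`, `extraction` and `umls_go` are hypothetical, and the citation and gene values are invented placeholders. Only the apoptosis mapping comes from the text. The left join, which keeps phenotypes that have no GO mapping, is also an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE citation (pubmed_id TEXT, journal TEXT, year INTEGER);
CREATE TABLE extraction (pubmed_id TEXT, gene TEXT, phenotype TEXT, umls_code TEXT);
CREATE TABLE umls_go (umls_code TEXT, go_term TEXT);
""")

# Placeholder citation and extraction rows (invented for illustration).
conn.execute("INSERT INTO citation VALUES ('PMID_A', 'Journal A', 2003)")
conn.executemany("INSERT INTO extraction VALUES (?, ?, ?, ?)", [
    ("PMID_A", "gene_x", "apoptosis", "C0162638"),
    ("PMID_A", "gene_x", "adenoma", "C_UNMAPPED"),
])
# The one-to-many UMLS-to-GO mapping described in the text.
conn.executemany("INSERT INTO umls_go VALUES (?, ?)", [
    ("C0162638", "apoptosis (GO:0006915)"),
    ("C0162638", "cytolysis (GO:0019835)"),
])

# Denormalize: one wide row per combination of citation, relation and GO term.
conn.execute("""
CREATE TABLE crosstable AS
SELECT c.pubmed_id AS PubMed_ID, c.journal AS Journal, c.year AS Year,
       e.gene AS Gene, e.phenotype AS Phenotype, g.go_term AS GO_term
FROM extraction AS e
JOIN citation AS c ON c.pubmed_id = e.pubmed_id
LEFT JOIN umls_go AS g ON g.umls_code = e.umls_code
""")

CROSSTABLE = conn.execute(
    "SELECT Phenotype, GO_term FROM crosstable ORDER BY Phenotype, GO_term"
).fetchall()
for row in CROSSTABLE:
    print(row)
```

Because one UMLS term can map to several GO terms, the denormalized table repeats the citation and gene values once per GO term, which is why such a table can hold many more rows than any of its sources.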
Two user interfaces, a querying interface and a presentation interface, were developed using JAVA to interact with the database. These interfaces are general in that they interact with any database that has been created. The querying interface shows users the dimensions of the databases, and allows users to specify the desired dimensions and the desired clustering order. In addition, the interface generates the appropriate database queries and sends them to the database. The presentation interface processes the returned dataset using a tree generation algorithm and displays the generated tree reflecting relations among dimensions according to user’s specifications.
A screen-shot of the querying interface applied to the mouse genomics dataset is shown in Figure 2. Users employ this querying interface to select and arrange the dimensions of interest, and to apply filtering criteria to them, before they are shown in the presentation interface. The candidate dimensions are listed in the leftmost column automatically, based on the columns of the denormalized database table. Users can select each desired dimension by pressing one of the four selection buttons to the right of the column. Selected dimensions are removed from the leftmost column and added to the second column, where users can arrange their order arbitrarily using the “Up” and “Down” buttons. This list determines the clustering order, that is, the level at which each dimension will appear in the tree view. In Figure 2, the user chose the dimensions GO term, Phenotype, Gene, and PubMed ID and requested that they be clustered in this order. In this interface, users can also specify the sorting order (i.e. ascending or descending) and apply a filter to every dimension (i.e. “=”, “>”, “<” and “like”). In the example in Figure 2, the GO term dimension will be filtered by “like cell” to display GO terms containing the string “cell”. The user could easily obtain a different view, e.g. a Phenotype-GO_term-Gene-PubMed_ID view, by re-arranging the order of selected dimensions in the second column.
The generation of a tree view makes use of the sorting and grouping functions of a DBMS SQL query. After users choose the dimensions and filtering criteria, a SQL query representing these choices is constructed automatically to fetch the data from the database. If the denormalized table in the database is called “crosstable”, the query generated automatically for the selection in Figure 2 is: “select GO_term, Phenotype, Gene, PubMed_ID from crosstable where GO_term like '%cell%' group by GO_term, Phenotype, Gene, PubMed_ID order by GO_term asc, Phenotype asc, Gene asc, PubMed_ID asc”. The DBMS sorts the dataset and applies filters according to the user’s definition. For example, if the original dataset in the database is as shown in Table 1, then the dataset retrieved by the SQL query will be as shown in Table 2.
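The translation from a user's selection to a query can be sketched as below. This is a Python illustration, not the JAVA code of PGviewer. The function name `build_query` and the dictionary layout of a selection are hypothetical, and the use of bound parameters (the `?` placeholders) instead of literal values is a design choice of the sketch. The table rows are invented.

```python
import sqlite3

ALLOWED_OPS = {"=", ">", "<", "like"}


def build_query(table, available, dims):
    """Build (sql, params) from an ordered list of selected dimensions.

    Each entry in dims is a dict with "name" and optional "direction", "op", "value".
    The table name is assumed to be a trusted constant.
    """
    names, where, params, order = [], [], [], []
    for dim in dims:
        name = dim["name"]
        if name not in available:  # only known column names reach the SQL text
            raise ValueError(f"unknown dimension: {name}")
        direction = dim.get("direction", "asc")
        if direction not in ("asc", "desc"):
            raise ValueError(f"bad sort direction: {direction}")
        names.append(name)
        order.append(f"{name} {direction}")
        if "op" in dim:
            if dim["op"] not in ALLOWED_OPS:
                raise ValueError(f"bad operator: {dim['op']}")
            where.append(f"{name} {dim['op']} ?")
            value = dim["value"]
            # "like cell" is interpreted as a substring match, as in the text.
            params.append(f"%{value}%" if dim["op"] == "like" else value)
    cols = ", ".join(names)
    sql = f"select {cols} from {table}"
    if where:
        sql += " where " + " and ".join(where)
    sql += f" group by {cols} order by " + ", ".join(order)
    return sql, params


AVAILABLE = ["GO_term", "Phenotype", "Gene", "PubMed_ID"]
SQL, PARAMS = build_query("crosstable", AVAILABLE, [
    {"name": "GO_term", "op": "like", "value": "cell"},
    {"name": "Phenotype"},
    {"name": "Gene"},
    {"name": "PubMed_ID"},
])

# Run the generated query against a tiny invented table.
conn = sqlite3.connect(":memory:")
conn.execute("create table crosstable (GO_term, Phenotype, Gene, PubMed_ID)")
conn.executemany("insert into crosstable values (?, ?, ?, ?)", [
    ("cell death", "adenoma", "gene_x", "P1"),
    ("cell death", "adenoma", "gene_x", "P1"),  # duplicate, removed by group by
    ("transport", "adenoma", "gene_y", "P2"),   # filtered out by the like clause
])
RESULT = conn.execute(SQL, PARAMS).fetchall()
print(SQL)
print(RESULT)
```

Reordering the entries of the selection list is all that is needed to obtain a different clustering order, which mirrors the “Up” and “Down” buttons of the querying interface.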
After retrieving the sorted dataset, an algorithm is used to merge the adjacent duplicated values of a particular dimension if the values of the previous dimension are also duplicated. In the case of Table 2, the merged data is displayed in Table 3. The corresponding tree view generated from data in Table 3 is shown in Figure 3.
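The merging step can be sketched as a single pass over the sorted rows. The following Python illustration is one possible reading of the algorithm, not the published implementation: a value is merged with the one above it only when every dimension to its left is also identical, so each row contributes nodes only from the first level at which it differs from the previous row. The rows are invented, and the names `merge_rows` and `render` are hypothetical.

```python
def merge_rows(rows):
    """Turn sorted rows into (depth, value) tree nodes, merging shared prefixes."""
    nodes = []
    previous = None
    for row in rows:
        # First level at which this row differs from the previous row.
        start = 0
        if previous is not None:
            while start < len(row) and row[start] == previous[start]:
                start += 1
        for depth in range(start, len(row)):
            nodes.append((depth, row[depth]))
        previous = row
    return nodes


def render(nodes):
    """Indent each node by its depth and append its number of direct children."""
    lines = []
    for index, (depth, value) in enumerate(nodes):
        children = 0
        for later_depth, _ in nodes[index + 1:]:
            if later_depth <= depth:  # left this node's subtree
                break
            if later_depth == depth + 1:
                children += 1
        label = f"{value} [{children}]" if children else str(value)
        lines.append("  " * depth + label)
    return lines


# Rows sorted by GO_term, Phenotype, Gene, PubMed_ID (invented values).
ROWS = [
    ("cell death", "adenoma", "gene_x", "P1"),
    ("cell death", "adenoma", "gene_x", "P2"),
    ("cell death", "adenoma", "gene_y", "P3"),
    ("cell growth", "adenoma", "gene_x", "P1"),
]
NODES = merge_rows(ROWS)
LINES = render(NODES)
print("\n".join(LINES))
```

Note that “adenoma” appears twice in the output, once under each GO term: equal values are merged only within the same parent, which is what makes the result an ordering of the data rather than a classification.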
After the generation of a tree, the presentation interface presents the tree and related information using the JTree class of the JAVA Swing package. Figure 4 shows a screen-shot of the presentation interface based on the query illustrated in Figure 2.
To speed up the response time of the presentation interface, we used a database paging technique, in which the database returns 1,000 rows of data at a time. The nodes of the tree view represent the selected dimensions, including their values and the numbers of their direct child nodes (within square brackets). Users can expand and collapse a node by clicking the “+” or “−” sign at the beginning of the node. This presentation interface also provides a detailed view of some special dimensions. For example, if a node associated with a PubMed ID is selected, the corresponding article’s title and abstract will be displayed in a text area or in a popup window (the popup window is not shown in Figure 4). For the mouse genomics dataset, the relations between gene and phenotype information extracted by BioMedLEE are shown in the table below the text area. The table contains two columns. The first column shows paired gene-phenotype relations based on text and the second column shows the corresponding paired terminology codes. When a row representing a relation between gene and phenotype is selected, the corresponding original words from the titles and abstracts are highlighted in different colors (red for gene and blue for phenotype). Thus, users can easily map the captured phenotypic information back to the original text and read the context.
The process of defining a query and viewing the constructed tree can be repeated. After obtaining a tree view on a first attempt, a user may want to explore whether another view would be more helpful. The user can then alter the tree definition and see the modified tree view immediately, or compare it to the previous tree view.
The viewer adapts to an increase in the number of dimensions automatically. No modifications to the query interface are needed to display the altered schema. Whenever the centralized table is modified to contain more dimensions (table columns), the interface reads all the columns and populates the list of available dimensions. This list of dimensions (metadata) is read by the viewer so that their names are displayed in the user interface without any intervention by the user.
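The paging technique can be sketched as a generator that requests one block of rows at a time. The text does not say how paging was implemented, so the use of `limit` and `offset` clauses here is an assumption, as is the function name `fetch_pages`; the sketch uses Python with an in-memory SQLite table of invented rows.

```python
import sqlite3


def fetch_pages(conn, sql, params=(), page_size=1000):
    """Yield successive pages of at most page_size rows for an ordered query."""
    offset = 0
    while True:
        page = conn.execute(
            f"{sql} limit ? offset ?", (*params, page_size, offset)
        ).fetchall()
        if not page:
            return
        yield page
        offset += page_size


conn = sqlite3.connect(":memory:")
conn.execute("create table crosstable (Gene)")
conn.executemany(
    "insert into crosstable values (?)",
    ((f"gene_{i:04d}",) for i in range(2500)),
)

# The query must have an order by clause so that page boundaries are stable.
PAGE_SIZES = [
    len(page)
    for page in fetch_pages(conn, "select Gene from crosstable order by Gene")
]
print(PAGE_SIZES)
```

A viewer built this way could populate the first levels of the tree from the first page and request further pages only as the user scrolls or expands nodes.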
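Reading the list of dimensions from the table's own metadata can be sketched as follows. The example uses SQLite's `PRAGMA table_info` through Python as a stand-in for the metadata facilities of the DBMS actually used; the function name `list_dimensions` is hypothetical, and the table name is assumed to be a trusted constant.

```python
import sqlite3


def list_dimensions(conn, table):
    """Return the column names of a table, in schema order."""
    # Each row of table_info is (cid, name, type, notnull, dflt_value, pk).
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


conn = sqlite3.connect(":memory:")
conn.execute("create table crosstable (Gene TEXT, Phenotype TEXT)")
BEFORE = list_dimensions(conn, "crosstable")

# Adding a column to the denormalized table adds a dimension to the interface.
conn.execute("alter table crosstable add column GO_term TEXT")
AFTER = list_dimensions(conn, "crosstable")

print(BEFORE)
print(AFTER)
```

Because the list is derived from the schema at run time, no dimension name needs to be hard-coded in the interface.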
In order to evaluate our method, we inspected our system based on the five requirements for visualizing multidimensional genotypic and phenotypic information we presented in the introduction section.
The integrated denormalized table associated with the human genomics dataset contains 739,985 rows and the table in the mouse genomics dataset contains 22,271 rows. Since the query definition interface allows selecting any dimension in any order, it allows for 623,529 and 109,600 distinct dimension permutations for the human genomics dataset and the mouse genomics dataset respectively. We used the MySQL database management system (DBMS) to sort the datasets. The database schema is straightforward and consists of a single denormalized table. Therefore, the scalability of the database mainly depends on the DBMS’s capability of sorting and querying a single large table. Having only one table improves performance by eliminating the need to join different tables, and simplifies integration of the DBMS with the viewer component. This strategy does not limit flexibility or scalability: new dimensions are easily accommodated by adding new fields to the denormalized table. We acknowledge a maintenance drawback, as new dimensions would require recompiling the complete denormalized database; additionally, as many more dimensions are added, database queries are likely to become less efficient. With no changes in dimensions, updating the current databases may require as much as fifteen man-hours.
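The number of possible views can be checked with a short calculation. The sketch below assumes that a view is any ordered selection of one or more of the n available dimensions, so the count is the sum of n!/(n-k)! over k = 1..n; this counting convention is an assumption, since the text does not state how the figures were derived.

```python
from math import perm


def count_orderings(n):
    """Number of ordered selections of 1..n dimensions out of n."""
    return sum(perm(n, k) for k in range(1, n + 1))


MOUSE = count_orderings(8)  # eight dimensions in the mouse genomics dataset
HUMAN = count_orderings(9)  # nine dimensions in the human genomics dataset
print(MOUSE, HUMAN)
```

Under this convention, eight dimensions give 109,600 orderings, which agrees with the figure reported for the mouse genomics dataset. Nine dimensions give 986,409, whereas the reported human figure of 623,529 equals that total with one term of 9! = 362,880 left out, so the human figure presumably reflects a slightly different counting convention.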
These results show that our method meets the five requirements for a flexible and generalizable information visualization tool for phenomic data as described in the introduction. Therefore, it could serve as a standard interface model for designing any model organism database, such as MGI and Flybase, because these databases contain similar types of information. Users would not need to spend additional time learning different user interfaces for different databases. Furthermore, our method provides advantages that are absent in existing databases, and could be a possible solution for database unification at the interface level in the post-genomic era.
The advantages of our method reside in two major aspects. First, its query interface allows users to select and order the desired dimensions arbitrarily and visually. This maximizes the flexibility of users’ queries and makes constructing an intended query more efficient. The deceptively simple user interface of PGviewer conceals a powerful capability for requesting and presenting any selected permutation of dimensions. For example, the view of OMIM disorders organized according to the Gene Ontology illustrated in Figure 6 is a useful presentation of the phenome, which is analogous to those presentations available in MGI and Flybase. Since, to our knowledge, no browsers currently provide a view of OMIM disorders using a GO query, the human genomics database viewed through PGviewer offers an original and useful functional genomic approach to organizing human phenotypes. Second, it visualizes the relations among the informational dimensions using a hierarchical expandable tree based on user-defined queries. In a tree view, duplicate information is reduced to one node and similar information is arranged close together. Thus, patterns and structures of genotypic and phenotypic information can be easily perceived. In contrast, in a tabular list containing gene and phenotype relations, the relations would not be obvious if the table contains many entries and is not ordered. Our tool orders the list by gene and phenotype and constructs a tree, so that associative relations between genes and phenotypes are clear. Other advantages include the ability to handle multiple dimensions from different databases. Our method is general and can be used for any type of multi-dimensional data, although in this paper we focused on genotype-phenotype relations.
It should be noted, however, that our visualization method assumes that the data have already been integrated into one database; it is not aimed at a general solution for integrating heterogeneous biological databases at the level of the data source.
We realize that there are also limitations in our method. First, the new relations found in our viewer are suggestive rather than confirmatory, because relations derived by transitivity may not always be correct. For example, relations between GO terms and disorders in Figure 6 are suggestive. LocusLink provides relations from genes to GO terms and OMIM specifies the genes associated with disorders. It is possible that only some of the GO terms define the real molecular mechanisms for breast cancer and that the others are merely possible mechanisms. Second, the tree view design cannot show an overview of the whole tree on one screen due to size limitations. Visualization using graphs with small nodes, such as in some molecular networks (Koike and Rzhetsky 2000), has been shown to solve this issue. Third, a tree view is not good for showing all the information related to a single object (node) as a graph can, because a node in a tree can have only one parent, while a node in a graph can have many different parents as well as types of relations other than parent-child.
Our future work will involve further refinement and development of PGviewer. The many research issues we will work on will involve 1) developing a more generalizable structure for facilitating the integration of diverse databases and dimensions, and 2) advancing graphical representation of the data so that many different kinds of graphs and views can be obtained.
In this paper, we presented a novel flexible visualization tool, called Phenogenes Viewer, in response to the five basic requirements for displaying multi-dimensional genotypic and phenotypic information. Our work is novel in several ways. First, it allows users to dynamically specify the clustering order of data presentation so that they can focus on a view of the data that is relevant to their research interests. Second, it shows the ability to visualize structured data across different databases and ontologies, including coded gene-phenotype relations extracted from text data. Third, it provides a scalable and generalizable interface across both structured and textual databases, and could be used as a standard unified interface model for designing any model organism database, such as MGI and Flybase. Additionally, the proposed viewer provides a seamless user interface experience across heterogeneous genomic and post-genomic databases. We believe that this method, which integrates data from multiple sources and allows users to dynamically visualize the multiple dimensions, is a powerful and promising tool that should substantially facilitate biological research.
The authors thank Judith A. Blake, Janan T. Eppig and Joanna Amberger for providing assistance in understanding the MGI and OMIM genomics databases. We also acknowledge the contribution of tools or datasets provided by Jianrong Li, Hua Xu, and Lyudmila Shagina. This study is partially supported by the National Institute for Allergy and Infectious Disease Grant #1U54 AI 57159-01, by the National Library of Medicine Grants # R01 LM007659-01 and 1K22 LM008308-01, and by the NYSTAR grant # 5-67674.
Availability: PhenogenesViewer as well as its support and tutorial are available at http://www.dbmi.columbia.edu/pgviewer/