The integrated denormalized table for the human genomics dataset contains 739,985 rows, and the table for the mouse genomics dataset contains 22,271 rows. Since the query definition interface allows selecting any dimension in any order, it allows for 623,529 and 109,600 distinct dimension permutations for the human genomics dataset and the mouse genomics dataset, respectively. We used the MySQL database management system (DBMS) to sort the datasets. The database schema is straightforward and consists of a single denormalized table. Therefore, the scalability of the database depends mainly on the DBMS's capability of sorting and querying a single large table. Having only one table improves performance by eliminating the need to join different tables, and by simplifying integration of the DBMS with the viewer component. This strategy does not limit flexibility or scalability: new dimensions are easily accommodated by adding new fields to the denormalized table. We acknowledge a maintenance drawback, as new dimensions would require recompiling the complete denormalized database; additionally, as many more dimensions are added, it is likely that the database queries would become less efficient. With no changes in dimensions, updating the current databases may require as much as fifteen man-hours.
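The reported permutation counts are consistent with counting every ordered selection of one or more distinct dimensions. The following Python sketch reproduces that arithmetic. The function name is hypothetical, and the dimension counts passed to it (8 and 9) are inferred from the reported totals; they are not stated in the text.

```python
from math import factorial


def ordered_selections(n_dims: int, max_selected: int) -> int:
    """Count ordered selections of 1..max_selected distinct dimensions
    drawn from n_dims available dimensions."""
    return sum(
        factorial(n_dims) // factorial(n_dims - k)
        for k in range(1, max_selected + 1)
    )


# Assumed dimension counts, chosen because they reproduce the reported totals.
mouse_total = ordered_selections(8, 8)   # 8 dimensions, up to 8 selected
human_total = ordered_selections(9, 8)   # 9 dimensions, up to 8 selected

print(mouse_total)  # 109600
print(human_total)  # 623529
```

Under these assumptions the totals match the figures given above, which suggests that each query is an ordered list of distinct dimensions of any length up to the maximum.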
- When the maximum number of dimensions was used without any filters, PGviewer returned and displayed the results in approximately 5 seconds for the mouse genomics dataset and 80 seconds for the human genomics dataset. Under other conditions, the response time of the system varies with the number of rows in the table, the number of selected dimensions, and the specific filters, and may reach as much as 160 seconds. Generally, larger numbers of rows and dimensions slow down the system, whereas the use of filters speeds up the response.
- The query interface successfully executed all the queries formed by arbitrarily selecting dimensions, ordering them and applying various filters. In both datasets, the presentation interface displayed the corresponding structured tree views correctly.
- For three important dimensions in OMIM (gene symbol, location, and disorder), “Search Gene Map” provides a predefined table in its output, which aligns the three dimensions in the order of location, gene symbol, and disorder and is sorted alphabetically by location. “Search Morbid Map” provides a predefined table in the order of disorder, gene symbol, and location and is sorted by disorder. However, other dimension orders could be of importance to biologists. For example, the order of location, disorder, and gene symbol clusters disorders under specific locations, so that hotspots of certain diseases can be discovered easily. This ordering cannot be obtained from the OMIM interfaces directly. In contrast, PGviewer can visualize it gracefully in expandable trees. The tree view of gene location, disorder, and gene illustrates the clustering of the disorder “breast cancer” under specific locations found in OMIM and presented in PGviewer. It is clear that chromosome 17 is a hotspot location for breast cancer. In addition, PGviewer can help discover unnoticed knowledge buried across multiple databases. In this paper, the inclusion of GO provides possible molecular mechanisms for disorders. Figure 6 shows that “ATP binding” might be an important molecular mechanism for breast cancer.
Tree view of gene location, disorder and gene (Human Genomics Dataset)
Figure 6 Tree view of GO term, disorder, gene and PubMed ID (Human Genomics Dataset). The tree structure we use represents an ordering or clustering of information, and should not be associated with a hierarchical classification, which is a typical use of a tree.
- PGviewer could successfully retrieve data from its component databases and properly visualize the result in a tree view.
- Regarding the efficiency of the user interface, we observed that typical queries took approximately one minute or less to perform.
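The kind of query evaluated above (selected dimensions, a chosen order, and optional filters against a single denormalized table) can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` module as a stand-in for MySQL, and the table name `integrated`, the column names, and the placeholder rows are all hypothetical rather than taken from PGviewer.

```python
import sqlite3

# Hypothetical column names standing in for the real dimensions.
DIMENSIONS = ("location", "gene_symbol", "disorder")


def build_query(selected, filters=None):
    """Return (sql, params) for the selected dimensions in the given order.

    Column names are checked against a whitelist because they are
    interpolated into the SQL text; filter values are bound as parameters.
    """
    filters = filters or {}
    for name in list(selected) + list(filters):
        if name not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {name}")
    cols = ", ".join(selected)
    sql = f"SELECT DISTINCT {cols} FROM integrated"
    if filters:
        sql += " WHERE " + " AND ".join(f"{name} = ?" for name in filters)
    sql += f" ORDER BY {cols}"
    return sql, list(filters.values())


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE integrated (location TEXT, gene_symbol TEXT, disorder TEXT)"
)
conn.executemany(
    "INSERT INTO integrated VALUES (?, ?, ?)",
    [
        ("locus_2", "GENE_B", "disorder_X"),  # placeholder data only
        ("locus_1", "GENE_A", "disorder_X"),
        ("locus_1", "GENE_C", "disorder_Y"),
    ],
)

sql, params = build_query(
    ["location", "disorder", "gene_symbol"], {"disorder": "disorder_X"}
)
rows = conn.execute(sql, params).fetchall()
print(rows)
```

Sorting by the selected columns in the selected order is what lets a viewer group consecutive rows into tree nodes, and a filter shrinks the set of rows to be sorted, which is consistent with the observation that filters speed up the response.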
These results show that our method meets the five requirements for a flexible and generalizable information visualization tool for phenomic data described in the introduction. It could therefore serve as a standard interface model for designing any model organism database, such as MGI and Flybase, because these databases contain similar types of information. Users would not need to spend additional time learning a different user interface for each database. Furthermore, our method provides advantages that are absent in existing databases, and could be a possible solution for database unification at the interface level in the postgenomic era.
The advantages of our method reside in two major aspects. First, its query interface allows users to select and order the desired dimensions arbitrarily and visually. This maximizes the flexibility of users’ queries and improves the efficiency of constructing an intended query. The deceptively simple user interface of PGviewer conceals a powerful capability for requesting and presenting any selected permutation of dimensions. For example, the view of OMIM disorders organized according to the Gene Ontology, illustrated in Figure 6, is a useful presentation of the phenome, analogous to the presentations available in MGI and Flybase. Since, to our knowledge, no browser currently provides a view of OMIM disorders using a GO query, the human genomics database viewed through PGviewer offers an original and useful functional genomic approach to organizing human phenotypes. Second, it visualizes the relations among the informational dimensions using a hierarchical expandable tree based on user-defined queries. In a tree view, duplicate information is reduced to one node and similar information is arranged close together. Thus, patterns and structures of genotypic and phenotypic information can be easily perceived. In contrast, in a tabular list of gene and phenotype relations, the relations would not be obvious if the table contains many entries and is not ordered. Our tool orders the list by gene and phenotype and constructs a tree, so that associative relations between genes and phenotypes become clear. Other advantages include the ability to handle multiple dimensions from different databases. Our method is general and can be used for any type of multi-dimensional data, although in this paper we focused on genotype-phenotype relations.
It should be noted, however, that our visualization method assumes that the data have already been integrated into one database; it is not aimed at a general solution for integrating heterogeneous biology databases at the level of the data source.
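The tree construction described above, in which duplicate values collapse into a single node and rows are nested according to the user-chosen dimension order, can be sketched in a few lines of Python. This is a minimal illustration, not PGviewer's implementation: the function names, the nested-dictionary representation, and the placeholder rows are all assumptions.

```python
def build_tree(rows, order):
    """Nest rows into dictionaries, one level per dimension in `order`.

    Repeated values at the same level map to the same key, so duplicate
    information collapses into a single node.
    """
    tree = {}
    for row in rows:
        node = tree
        for dim in order:
            node = node.setdefault(row[dim], {})
    return tree


def render(tree, depth=0):
    """Return the tree as indented text lines, siblings sorted by label."""
    lines = []
    for label in sorted(tree):
        lines.append("  " * depth + label)
        lines.extend(render(tree[label], depth + 1))
    return lines


# Placeholder rows standing in for the integrated table.
rows = [
    {"location": "locus_1", "disorder": "disorder_X", "gene": "GENE_A"},
    {"location": "locus_1", "disorder": "disorder_X", "gene": "GENE_B"},
    {"location": "locus_2", "disorder": "disorder_X", "gene": "GENE_C"},
    {"location": "locus_1", "disorder": "disorder_Y", "gene": "GENE_A"},
]

tree = build_tree(rows, ["location", "disorder", "gene"])
print("\n".join(render(tree)))
```

Calling `build_tree` with a different `order`, for example `["disorder", "location", "gene"]`, regroups the same rows under disorders instead of locations, which is the sense in which one table supports many views.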
We realize that our method also has limitations. First, the new relations found in our viewer are suggestive rather than confirmative, because relations inferred by transitivity may not always be correct. For example, the relations between GO terms and disorders in Figure 6 are suggestive. Locuslink provides relations from genes to GO terms, and OMIM specifies the genes associated with disorders. It is possible that only some of the GO terms define the real molecular mechanisms for breast cancer and that the others are merely possible mechanisms. Second, the tree view design cannot show an overview of the whole tree on one screen due to size limitations. Visualization using graphs with small nodes, such as in some molecular networks (Koike and Rzhetsky 2000), has been shown to solve this issue. Third, a tree view is less suitable than a graph for showing all the information related to a single object (node), because a node in a tree can have only one parent, while a node in a graph can have many different parents as well as types of relations other than parent-child.
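The first limitation above can be made concrete with a small Python sketch of how a GO term becomes linked to a disorder through a shared gene. The mappings, identifiers, and function name below are hypothetical placeholders, not data from Locuslink or OMIM.

```python
from collections import defaultdict

# Hypothetical placeholder mappings (not real annotations).
gene_to_go = {
    "GENE_A": {"go_term_1", "go_term_2"},
    "GENE_B": {"go_term_1"},
}
gene_to_disorder = {
    "GENE_A": {"disorder_X"},
    "GENE_B": {"disorder_X", "disorder_Y"},
}


def suggest_links(gene_to_go, gene_to_disorder):
    """Pair each GO term with each disorder that shares a gene with it.

    The result records which genes support each pair. Every pair is only
    a candidate: the gene may be tied to the disorder through a function
    other than the one the GO term describes.
    """
    links = defaultdict(set)
    for gene, terms in gene_to_go.items():
        for disorder in gene_to_disorder.get(gene, ()):
            for term in terms:
                links[(term, disorder)].add(gene)
    return dict(links)


links = suggest_links(gene_to_go, gene_to_disorder)
for (term, disorder), genes in sorted(links.items()):
    print(term, disorder, sorted(genes))
```

In this sketch `go_term_2` is paired with `disorder_X` solely because `GENE_A` carries both annotations, which is exactly the kind of relation that needs independent confirmation.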
Our future work will involve further refinement and development of PGviewer. The research issues we plan to address include 1) developing a more generalizable structure for facilitating the integration of diverse databases and dimensions, and 2) advancing the graphical representation of the data so that many different kinds of graphs and views can be obtained.