Coexpressed gene databases are valuable resources for identifying new gene functions or functional modules in metabolic pathways and signaling pathways. Although coexpressed gene databases are a fundamental platform in the field of plant biology, their use in animal studies is relatively limited. The COXPRESdb (http://coxpresdb.jp) provides coexpression relationships for multiple animal species, as comparisons of coexpressed gene lists can enhance the reliability of gene coexpression determinations. Here, we report the updates of the database, mainly focusing on the following two points. First, we updated our coexpression data by including recent microarray data for the previous seven species (human, mouse, rat, chicken, fly, zebrafish and nematode) and adding four new species (monkey, dog, budding yeast and fission yeast), along with a new human microarray platform. A reliability scoring function was also implemented, based on coexpression conservation to filter out coexpression with low reliability. Second, the network drawing function was updated, to implement automatic cluster analyses with enrichment analyses in Gene Ontology and in cis elements, along with interactive network analyses with Cytoscape Web. With these updates, COXPRESdb will become a more powerful tool for analyses of functional and regulatory networks of genes in a variety of animal species.
The construction of a gene network is a fundamental step toward understanding global cellular processes. In addition, recent genome-wide association studies, using high-throughput sequencing technology, have revealed many uncharacterized genotypes associated with a particular phenotype (1,2). To investigate the molecular mechanisms underlying the connections between genotype and phenotype, networks of mRNAs or proteins are useful. Several databases, such as IntAct (3) and STRING (4), have focused on protein-protein interaction network construction. For mRNA network analysis, networks are constructed from similarities of gene expression profiles (gene coexpression) across a vast amount of microarray data. Databases for gene coexpression have achieved great success in the field of plant biology (5–8). In mammalian fields, however, their use is still limited to a few exceptional reports (9,10), although several coexpression databases, such as Genevestigator (11), STARNET2 (12), SNPxGE2 (2) and ours, COXPRESdb, have been developed.
To promote the use of coexpression analyses in animals, we have been developing a gene coexpression database named COXPRESdb (coexpression database). We have especially focused on the reliability of coexpression data, by providing comparisons of coexpression among the different species, along with a network view of the relationships between coexpressed genes (13,14). Although the gene network view can provide an overview of the system of interest, the construction of a large-scale gene network is not easy because such a network tends to be too complicated to fully comprehend. Several approaches have been developed to visualize large-scale gene networks and make them easier to understand, by controlling the cluster size (15) or combining biological-property–based clustering (16). Another weakness of coexpressed gene network analysis is the quality of the coexpression data. The quality of the coexpression data for animals is generally worse than that for Arabidopsis in an assessment using Gene Ontology (GO) annotation (17), probably due to the increased complexity of animal systems (18).
To enhance the performance of gene coexpression analyses, we updated two aspects of COXPRESdb. First, we increased the number of samples for each species and the number of species from 7 to 11, and added an alternative microarray platform for human, as summarized in Table 1. In addition, a reliability scoring system was implemented, based on the similarity of coexpression patterns among the species. Second, the network drawing tool was improved. The new tool automatically divides a large complex network into smaller compact clusters. Each compact cluster is then characterized by GO and cis element enrichment analyses. In addition, users can select the Cytoscape Web system (19) to interactively modify the network layout and to work as a bridge to stand-alone Cytoscape (20) for more complex analyses. Furthermore, all of the coexpression data are now available in SPARQL for the semantic web communities, using the Virtuoso Universal Server at http://coxpresdb.jp/sparql, which will promote the building of mashup applications with various omics data sets.
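As an illustration of how the SPARQL endpoint mentioned above might be reached from a script, the following is a minimal sketch using only the Python standard library. It assumes the endpoint accepts a standard SPARQL protocol GET request with a `query` parameter and can return JSON results. The helper names `build_request` and `run_query` are hypothetical, and the probe query deliberately uses no COXPRESdb-specific predicates, because the data schema is not described here.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint URL as given in the text; availability is not guaranteed.
ENDPOINT = "http://coxpresdb.jp/sparql"

# Schema-agnostic probe: list a few arbitrary triples to discover the vocabulary.
PROBE = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"


def build_request(query, endpoint=ENDPOINT):
    """Build a SPARQL protocol GET request asking for JSON results (hypothetical helper)."""
    url = f"{endpoint}?{urlencode({'query': query})}"
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def run_query(query, endpoint=ENDPOINT, timeout=30):
    """Send the query and return the list of result bindings (requires network access)."""
    with urlopen(build_request(query, endpoint), timeout=timeout) as response:
        return json.load(response)["results"]["bindings"]
```

Calling `run_query(PROBE)` would return a handful of triples whose predicates can then be used to write more specific queries.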
The calculation procedure for the coexpression data is the same as in our previous report (18). Briefly, GeneChip raw data were obtained from ArrayExpress (21) and normalized by the RMA method (22) for each compressed file, by assuming that each compressed file corresponds to one experimental set. Then the weighted Pearson's correlation coefficient of expression profiles was calculated for every pair of genes in each species. Finally, the correlation coefficient was transformed into the mutual rank (MR) (18). A network node corresponds to a gene, and edges are drawn from each gene to its three most strongly coexpressed genes. The evolutionary relationships were determined by using HomoloGene (23), and edges that were also present between the homologous gene pairs in another species, if any, were considered as common edges among the species.
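The rank transformation and edge selection described above can be sketched as follows. This is a simplified illustration, not the production pipeline: it uses an unweighted Pearson correlation in place of the weighted one, and it assumes that MR is the geometric mean of the two directional ranks, as in the cited report (18). The function names and the toy gene identifiers are hypothetical.

```python
import math


def pearson(x, y):
    """Unweighted Pearson correlation (a simplification of the weighted version)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rank_table(profiles):
    """ranks[a][b] = position of gene b in gene a's list, 1 = most strongly correlated."""
    ranks = {}
    for a in profiles:
        others = [b for b in profiles if b != a]
        others.sort(key=lambda b: pearson(profiles[a], profiles[b]), reverse=True)
        ranks[a] = {b: i + 1 for i, b in enumerate(others)}
    return ranks


def mutual_rank(ranks, a, b):
    """Geometric mean of the rank of b from a and the rank of a from b."""
    return math.sqrt(ranks[a][b] * ranks[b][a])


def top_edges(ranks, k=3):
    """Undirected edges from each gene to its k genes with the smallest MR."""
    edges = set()
    for a in ranks:
        nearest = sorted(ranks[a], key=lambda b: mutual_rank(ranks, a, b))[:k]
        edges.update(frozenset((a, b)) for b in nearest)
    return edges


# Toy expression profiles (invented values, four samples per gene).
profiles = {
    "g1": [1, 2, 3, 4],
    "g2": [1, 2, 3, 5],
    "g3": [4, 3, 2, 1],
    "g4": [1, 3, 2, 4],
}
ranks = rank_table(profiles)
edges = top_edges(ranks, k=1)
```

Because MR is symmetric, an edge drawn from either gene of a pair refers to the same undirected relationship, which is why the edges are stored as unordered pairs.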
To assess the difference between the previous and new versions, we counted the numbers of common edges (Nc) for all pairs of seven species for each version. These numbers provide a quick measure to evaluate the quality of the coexpression data, because coexpression that is reproduced on independent microarray platforms is unlikely to be an experimental artifact. As a result, all pairs of species, except for the human–nematode pair, showed an increase in Nc (Figure 1). The average increase rate of Nc was 1.5, and large increases of Nc were observed for the human–mouse, mouse–rat and mouse–chicken pairs, which may correspond to the large increase in the number of mouse samples. In addition to the data renewal of the previous seven species, we added four new species, monkey, dog and two yeast species, as well as human coexpression from another microarray platform. The values of Nc against the human data are summarized in Table 2.
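Counting common edges between two species can be sketched as below. This is an illustrative sketch under stated assumptions: each network is a set of unordered gene pairs without self-loops, and the ortholog table (derived from HomoloGene in the text) is reduced to a one-to-one dictionary, which ignores one-to-many homology groups. The function name and gene identifiers are hypothetical.

```python
def count_common_edges(edges_a, edges_b, orthologs):
    """Count edges of species A whose ortholog-mapped pair is also an edge in species B.

    edges_a, edges_b: sets of frozenset({gene1, gene2}) without self-loops.
    orthologs: dict mapping species A gene IDs to species B gene IDs.
    """
    common = 0
    for edge in edges_a:
        g1, g2 = tuple(edge)
        if g1 in orthologs and g2 in orthologs:
            mapped = frozenset((orthologs[g1], orthologs[g2]))
            if mapped in edges_b:
                common += 1
    return common


# Toy data with invented identifiers.
edges_human = {frozenset(("H1", "H2")), frozenset(("H2", "H3")), frozenset(("H3", "H4"))}
edges_mouse = {frozenset(("M1", "M2")), frozenset(("M3", "M4"))}
orthologs = {"H1": "M1", "H2": "M2", "H3": "M3"}  # H4 has no ortholog here

nc = count_common_edges(edges_human, edges_mouse, orthologs)
```

In this toy case only the H1/H2 edge is conserved; the H3/H4 edge is skipped because H4 has no ortholog in the table.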
The coexpressed gene list in COXPRESdb provides a comparative view among orthologous genes in other species (14). This comparative view shows the evolutionary conservation of the coexpression pattern of the guide gene, which can be a measure of the reliability of the coexpression data (24,25). Figure 2 shows the coexpressed gene list for the human CHEK1 gene. The alternative human platform (Hsa2) and mouse (Mmu) show similar coexpression degrees with the human (Hsa) coexpression, reflecting the high quality of the coexpression data for these species, based on the large amount of microarray data. The conservation degrees with monkey (Mcc), rat (Rno), dog (Cfa) and zebrafish (Dre) are also good. The low coexpression conservation with fly (Dme), nematode (Cel) and the two yeast species (Sce, Spo) seems to be derived from the greater species distance to human and/or the relatively poor coexpression data based on the small amount of microarray data (Table 1). Notably, the chicken (Gga) coexpression data are different from the human data. This may be due to a defective probe for this gene because when we checked the coexpressed gene list for this gene in chicken, almost no orthologous genes showed coexpression conservation.
As seen in this example, the conservation of coexpression can ensure the quality of the guide gene (14), but users would otherwise have to check all of the coexpressed genes in each species to determine the reliability of each orthologous gene. To remove this burden, we introduced a similarity measure, COXSIM, which is the weighted concordance rate of the coexpressed gene lists.
COXSIM is computed from n(i, g, sp1, sp2), the number of common genes (orthologous genes in the case of a comparison between different species) found in the top i coexpressed genes for a guide gene g in species sp1 and in the top i coexpressed genes for the corresponding gene in species sp2, with i ranging up to k. We set k to 100, meaning that we check the gene correspondence of the top 100 coexpressed genes, which is a reasonable limit for designing biological experiments (7).
Here, defective probes will show noisy expression patterns, which cause unreliable coexpression that does not show any correspondence with other coexpression data. In other words, the maximal value of COXSIM (coexpression similarity) between the coexpressed gene list from an unreliable gene and that from its orthologous genes should be low. Based on this idea, maxCOXSIM is introduced as the reliability score of a guide gene.
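To make the COXSIM and maxCOXSIM definitions concrete, the following sketch computes both from ranked coexpressed gene lists. The normalization is an assumption: the sum of n(i, g, sp1, sp2) over i = 1..k is divided by the sum of i over the same range, so that identical lists score 1 and overlaps near the top of the lists contribute to more terms than overlaps near the bottom. Shortening k for lists with fewer than k genes is also an assumption, and all function names are hypothetical.

```python
def coxsim(list1, list2, orthologs=None, k=100):
    """Weighted concordance of two ranked coexpressed gene lists (assumed normalization).

    list1, list2: gene IDs ordered from the most strongly coexpressed.
    orthologs: dict mapping list1 gene IDs to list2 gene IDs, or None when both
               lists already use the same identifiers.
    """
    if orthologs is not None:
        list1 = [orthologs.get(g) for g in list1]  # None where no ortholog exists
    k = min(k, len(list1), len(list2))
    if k == 0:
        return 0.0
    seen1, seen2, total = set(), set(), 0
    for i in range(k):
        if list1[i] is not None:
            seen1.add(list1[i])
        seen2.add(list2[i])
        total += len(seen1 & seen2)  # n(i, g, sp1, sp2) for the top i + 1 genes
    return total / (k * (k + 1) / 2)


def max_coxsim(guide_list, other_lists, ortholog_maps, k=100):
    """Largest COXSIM between the guide gene's list and each other species' list."""
    return max(
        (coxsim(guide_list, other_lists[sp], ortholog_maps[sp], k) for sp in other_lists),
        default=0.0,
    )


# Toy example with invented identifiers.
score_same = coxsim(["a", "b", "c"], ["a", "b", "c"])
score_partial = coxsim(["a", "b", "c"], ["b", "a", "d"])
best = max_coxsim(
    ["h1", "h2"],
    {"Mmu": ["m1", "m2"], "Gga": ["c9", "c8"]},
    {"Mmu": {"h1": "m1", "h2": "m2"}, "Gga": {"h1": "c1", "h2": "c2"}},
)
```

Under this sketch, a guide gene whose list matches no other species, as would be expected for a defective probe, receives a maxCOXSIM of 0.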
The significance of the maxCOXSIM value is assessed from the null distribution for 10 species comparisons, each containing 10,000 genes. Note that this assumption gives a rather severe evaluation, and thus the significance is underestimated (the reported P-value is conservative) for most guide genes, because both a larger number of species in the comparison and a smaller number of genes in a genome cause higher maxCOXSIM values by chance. We show this significance degree by stars on the gene list in COXPRESdb, where single, double and triple stars correspond to P-values below 1E-4, 1E-12 and 1E-20, respectively. Genes with lower reliability can be filtered out by the Row and Column filters (Figure 2). The numbers and ratios of genes at each significance level are shown in Figure 3.
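The star annotation can be expressed as a simple threshold lookup. The sketch below takes the three cut-offs from the text and assumes strict inequalities; the function name is hypothetical, and the computation of the P-value itself from the null distribution is not shown.

```python
def reliability_stars(p_value):
    """Map a maxCOXSIM P-value to the number of stars shown on the gene list."""
    if p_value < 1e-20:
        return 3
    if p_value < 1e-12:
        return 2
    if p_value < 1e-4:
        return 1
    return 0
```

A filter that keeps only genes with at least one star then corresponds to requiring a P-value below 1E-4.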
The coexpressed gene network is especially useful to analyze the large number of genes generated by transcriptome or proteome analyses because the network representation can draw all of the pair-wise gene relationships for the query genes at one time. NetworkDrawer in COXPRESdb is the tool to draw the gene network for the query genes specified by users, by searching for coexpression along with protein–protein interactions among the genes or gene products (Figure 4). In this example, three groups of genes can be identified by visual inspection. To characterize these groups, two new network analysis flows are provided in the new NetworkDrawer, in addition to the marks for KEGG annotation (27) in the previous version of COXPRESdb.
The first analysis flow is composed of automatic cluster detection and characterization (Figure 4A–C). The cluster detection step has two parameters, a clique detection parameter and a clique merge parameter, which are both set to 0.5 by default but can be changed through the text box on the web page; a smaller clique parameter and a larger merge parameter produce larger sub-graphs. The clustering algorithm was newly developed to provide both a rapid response and the detection of clique-like sub-graphs, by iteratively merging the node with a higher PageRank value (28). The details of the clustering algorithm will be described elsewhere. After the clustering, users can easily select a cluster by using the radio button in the cluster summary table, to mark the nodes in the selected cluster with balloon icons (the orange balloons in Figure 4A). The results of the enrichment analyses for each cluster are available from the links in the table (Figure 4B). In addition to the GO enrichment analysis, we also provide a cis element motif enrichment analysis. Gene coexpression is mainly driven by cis elements in the promoter regions, especially the proximal promoter region (29). In Arabidopsis, large-scale cis element discovery was performed, based on gene coexpression (30). Therefore, we performed enrichment analyses by a hypergeometric test for heptamer motifs on the proximal promoter regions (−200 to +100) around transcription start sites downloaded from DBTSS (31). The enriched heptamers are compared against the reported cis elements in JASPAR (32) (Figure 4C). To further characterize the heptamers, the enriched GO annotations of the genes having the heptamer motif are calculated (Figure 4C).
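The heptamer enrichment step described above can be illustrated with a one-sided hypergeometric test. The sketch below is written under several assumptions not stated in the text: the background is every gene with a promoter sequence, a motif is counted once per promoter, only the given strand is scanned, and no multiple-testing correction is applied. The function names, the `alpha` cut-off and the toy sequences are hypothetical.

```python
from math import comb


def heptamers(seq, k=7):
    """Set of distinct k-mers (heptamers by default) in a promoter sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hypergeom_pvalue(N, K, n, x):
    """P(X >= x) when n promoters are drawn from N, of which K contain the motif."""
    upper = min(K, n)
    return sum(comb(K, j) * comb(N - K, n - j) for j in range(x, upper + 1)) / comb(N, n)


def enriched_motifs(promoters, cluster, alpha=1e-3):
    """Return {heptamer: P-value} for motifs over-represented in the cluster."""
    motif_sets = {gene: heptamers(seq) for gene, seq in promoters.items()}
    members = [g for g in set(cluster) if g in motif_sets]
    N, n = len(motif_sets), len(members)
    results = {}
    for motif in set().union(*(motif_sets[g] for g in members)):
        K = sum(motif in s for s in motif_sets.values())  # promoters with the motif
        x = sum(motif in motif_sets[g] for g in members)  # cluster promoters with it
        p = hypergeom_pvalue(N, K, n, x)
        if p <= alpha:
            results[motif] = p
    return results


# Toy promoters (invented, far shorter than a real -200 to +100 region).
promoters = {
    "g1": "TTTTTTT", "g2": "TTTTTTT", "g3": "TTTTTTT",
    "g4": "ACGTACG", "g5": "CCCCCCC", "g6": "GGGGGGG",
}
hits = enriched_motifs(promoters, ["g1", "g2", "g3"], alpha=0.1)
```

With real data, each enriched heptamer would then be looked up among known cis elements, and the P-values would normally be corrected for the large number of heptamers tested.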
The second flow of the gene network analysis is the use of the Cytoscape Web system (19) (Figure 4D). This system enables users to interactively modify the network layout, export the network as an image (SVG, PNG, PDF formats) and save it in the XGMML format. The XGMML file can be uploaded to the same Cytoscape Web system and also used in stand-alone Cytoscape (20) for advanced analyses. This system is also available for gene networks in the locus page and the GO network page in COXPRESdb.
CREST research project of the Japan Science and Technology Corporation [11102558 to T.O.]; Grants-in-Aid for Innovative Area ‘HD Physiology’, for Scientific Research and for Publication of Scientific Research Results [228063 to K.K.]. Funding for open access charge: Grants-in-Aid for Innovative Area ‘HD Physiology’.
Conflict of interest statement. None declared.
The super-computing resource was provided by the Human Genome Center, Institute of Medical Science, The University of Tokyo.