To maintain biological activity, an appropriate set of genes essential at the cell, tissue, organ and whole-organism levels is selected from the genome, and their expression is enhanced or suppressed through regulatory mechanisms of gene expression. Gene sets controlled by the same biological or physiological process frequently show similar temporal and/or spatial expression profiles (Gibson et al. 2004, Al-Ghazi et al. 2009, Matsumoto et al. 2009, Swanson-Wagner et al. 2009). Therefore, similarity of gene expression profiles provides important clues for understanding the biological functions of genes and the biological processes and metabolic pathways in which they take part (Chen et al. 2008, Endo et al. 2009, Yamagishi et al. 2009, Hsu et al. 2010). Microarray technology is a powerful and effective tool for genome-wide gene expression analysis within species (Matsuura et al. 2010, Sakuraba et al. 2010) and between species (Tsaparas 2006, Miller et al. 2010), and is used to identify gene sets showing similar expression profiles across various biological conditions (samples). Recently, large-scale microarray data have been accumulated in publicly available databases such as NCBI GEO (Barrett et al. 2009), EBI ArrayExpress (Parkinson et al. 2009) and DDBJ CIBEX (Ikeo et al. 2003). In addition, laser microdissection combined with microarray analysis has been applied to isolate specific cells from complex plant tissues and to obtain their transcriptomes separately (Hobo et al. 2008, Suwabe et al. 2008, Watanabe 2008). With this wealth of microarray data and a high-resolution method for transcriptome analysis, a comprehensive classification of gene expression profiles is possible, allowing us to elucidate the functions of genes and gene families (homologs).
A gene expression network (GEN) is well suited to grasping similarities of expression profiles among many genes simultaneously. For GEN construction, the Pearson correlation coefficient (PCC) has been widely used as an index to evaluate the similarity of expression profiles for gene pairs (Aoki et al. 2007, Fujii et al. 2010, Matsuura et al. 2010, Soeno et al. 2010). Obayashi and Kinoshita (2009) suggested that PCCs are significantly overestimated when a microarray data set contains many replicates of a sample (sample redundancies). Thus, instead of the PCC, the mutual rank (MR), which is based on rank transformations of the weighted PCC (wPCC), has been used as an index that takes sample redundancies into account. However, this algorithm needs a cut-off threshold value for estimating sample redundancy, and that threshold should be statistically tested so that similarities of gene expression profiles are evaluated appropriately. On the other hand, correspondence analysis (CA) (Greenacre 2007), a multivariate analysis method for profile data, permits concise interpretation of the correspondence between genes and samples in microarray analysis (Yano et al. 2006). CA for microarray data summarizes an originally high-dimensional data matrix [rows (genes) and columns (samples)] into a low-dimensional projection (space). Scores (coordinates) in the low-dimensional space are given to each gene and sample. With these coordinates, genes and samples can be plotted in a two- or three-dimensional subspace. The distance between points (genes) in the low-dimensional space, which is calculated from all dimensions or from the statistically significant ones, depends on the degree of similarity of the gene expression profiles: a short distance means similar expression profiles and a long distance means different expression profiles. Thus, distances can be used as an index for similarity of gene expression profiles. In addition, CA does not require any prior parameters to evaluate similarity, because it only calculates distances between points. Moreover, the effect of sample redundancies in a data set can be mathematically eliminated by reducing dimensions with the CA algorithm. Furthermore, CA takes little time even for a large-scale microarray data set (within 30 min for approximately 50,000 probes × 600 samples on a personal computer, such as a MacBook3,1 running OS X 10.6.5 with a 2.2 GHz Intel Core 2 Duo and 4 GB of memory). The new index presented here, distances obtained from CA (DCA), is suitable for appropriate and quick evaluation of the similarities of gene expression profiles and for construction of GENs.
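As a rough illustration of how distances obtained from CA could serve as a similarity index, the sketch below computes CA row (gene) coordinates with NumPy through a singular value decomposition and then takes Euclidean distances between genes. It is a minimal sketch under stated assumptions, not the procedure used to build the GENs described here: the function names (`ca_gene_coordinates`, `dca_matrix`), the `n_dims` parameter and the toy expression values are hypothetical, the input is assumed to contain strictly positive signal intensities, and the statistical selection of significant dimensions is not shown.

```python
import numpy as np


def ca_gene_coordinates(expr, n_dims=2):
    """Principal coordinates of rows (genes) from correspondence analysis.

    expr: 2-D array, rows = genes, columns = samples. Every row and column
    sum must be positive (assumption of this sketch).
    n_dims: number of leading CA dimensions to keep (hypothetical parameter).
    """
    expr = np.asarray(expr, dtype=float)
    P = expr / expr.sum()          # correspondence matrix
    r = P.sum(axis=1)              # row (gene) masses
    c = P.sum(axis=0)              # column (sample) masses
    # Standardized residuals from the independence model r * c^T
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sigma, _ = np.linalg.svd(S, full_matrices=False)
    # Row principal coordinates: D_r^{-1/2} U Sigma
    F = (U * sigma) / np.sqrt(r)[:, None]
    return F[:, :n_dims]


def dca_matrix(coords):
    """Pairwise Euclidean distances between genes in the CA subspace."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


# Toy data, invented for illustration only (4 genes x 4 samples)
expr = np.array([
    [10.0, 20.0, 30.0, 40.0],  # geneA
    [20.0, 40.0, 60.0, 80.0],  # geneB: geneA doubled, same profile
    [40.0, 30.0, 20.0, 10.0],  # geneC: opposite trend to geneA
    [10.0, 40.0, 40.0, 10.0],  # geneD: a different pattern
])
coords = ca_gene_coordinates(expr, n_dims=2)
dist = dca_matrix(coords)
print(np.round(dist, 3))
```

In this toy matrix geneB is geneA doubled, so the two share the same profile and their distance is numerically zero, whereas geneA and geneC, which show opposite trends, lie far apart. Because CA compares profiles rather than absolute levels, no threshold parameter enters the distance calculation itself.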
Rice omics data, including genome annotations, are available from major public databases. Current genome annotations are provided by the Michigan State University (MSU) Rice Genome Annotation Project (Ouyang et al. 2007) and the Rice Annotation Project (RAP-DB) (Rice Annotation Project 2008). Because the two projects take different bioinformatics approaches to genome assembly and annotation, the predicted sequences and the identifiers (IDs) of loci and mRNAs differ between MSU and RAP-DB. For example, the locus Os06g0103700 (Os06t0103700-01) in RAP-DB is named LOC_Os06g01410 in MSU. Other rice omics databases employ the IDs of either MSU or RAP-DB. Information on metabolic pathways in RiceCyc (Liang et al. 2008) is described by MSU locus IDs, while the KEGG PATHWAY database (Okuda et al. 2008) provides information on metabolic pathways with RAP-DB locus IDs. For annotations of microarray probes, MSU IDs are used for the Affymetrix platform and RAP-DB IDs for the Agilent platform. Because of these inconsistent IDs, users cannot directly compare omics information from distinct databases, such as RiceCyc and KEGG PATHWAY. The same problem applies to rice GENs. Information on rice GENs is available from public databases such as ATTED-II (Obayashi et al. 2007), RiceArrayNet (Lee et al. 2009), GeneCAT (Mutwil et al. 2008) and Rice Array Database (Jung et al. 2008). ATTED-II and GeneCAT provide information on GENs constructed from expression data of the Affymetrix GeneChip Rice Genome Array. GENs in RiceArrayNet are constructed from the Rice60k Microarray. Rice Array Database provides only a list of gene pairs with PCCs, and does not provide a GEN viewer for grasping similarities among multiple genes simultaneously. Differences in microarray platforms and GEN information formats prevent analyses and comparisons of GEN information across distinct databases. Moreover, inferring the biological features hidden in GENs requires various annotations from omics databases. To overcome these issues, we constructed rice GENs by CA and integrated principal omics information. The information is available from our database OryzaExpress (http://riceball.lab.nig.ac.jp/oryzaexpress/). OryzaExpress enables users to trace gene IDs across different databases/projects, browse GENs and refer to principal omics data stored in public databases. GENs and annotation data integrated in OryzaExpress thus provide more detailed and comprehensive information.
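To make the ID inconsistency concrete, the following sketch shows one way records keyed by MSU IDs, RAP-DB locus IDs or RAP-DB transcript IDs might be normalized to a single RAP-DB locus ID before comparison. It is only an illustration and does not describe how OryzaExpress is implemented: the names `RAP_TO_MSU`, `invert`, `rap_transcript_to_locus` and `to_rap_loci` are hypothetical, the lookup table holds only the single pair quoted above, and the transcript-to-locus pattern is inferred from that one example (Os06t0103700-01 and Os06g0103700), so it is an assumption rather than a documented rule.

```python
import re

# The only ID pair given in the text; a real table would be loaded from an
# ID conversion resource (not shown, and not assumed to have any format).
RAP_TO_MSU = {"Os06g0103700": ["LOC_Os06g01410"]}


def invert(mapping):
    """Build the reverse lookup, allowing one-to-many relationships."""
    inverse = {}
    for rap_id, msu_ids in mapping.items():
        for msu_id in msu_ids:
            inverse.setdefault(msu_id, []).append(rap_id)
    return inverse


MSU_TO_RAP = invert(RAP_TO_MSU)

# Assumed pattern for RAP-DB transcript IDs, inferred from the example above.
_RAP_TRANSCRIPT = re.compile(r"^Os(\d{2})t(\d{7})-\d{2}$")


def rap_transcript_to_locus(transcript_id):
    """Convert e.g. 'Os06t0103700-01' to the locus ID 'Os06g0103700'."""
    match = _RAP_TRANSCRIPT.match(transcript_id)
    if match is None:
        raise ValueError(f"not a RAP-DB transcript ID: {transcript_id}")
    return f"Os{match.group(1)}g{match.group(2)}"


def to_rap_loci(any_id):
    """Return the RAP-DB locus IDs corresponding to an MSU or RAP-DB ID."""
    if any_id.startswith("LOC_"):
        # MSU locus: look up in the table; unknown IDs yield an empty list
        return MSU_TO_RAP.get(any_id, [])
    if _RAP_TRANSCRIPT.match(any_id):
        return [rap_transcript_to_locus(any_id)]
    # Otherwise assume the ID is already a RAP-DB locus ID
    return [any_id]


for query in ("LOC_Os06g01410", "Os06t0103700-01", "Os06g0103700"):
    print(query, "->", to_rap_loci(query))
```

Returning a list rather than a single ID is a deliberate choice in this sketch, since loci predicted by the two projects need not correspond one-to-one.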