Plant cell walls (PCWs) are mainly composed of polysaccharides and lignins, and form the major component of plant biomass. Knowing which genes are involved in the formation and remodeling of PCWs is of great importance because PCWs play many critical roles during plant growth, including regulation of cell differentiation, intercellular adhesion and communication, control of water movement, and defense against invasion by pests and pathogens [1], and because they are the focal point of cellulosic biofuel studies. It is estimated that genes involved in PCW synthesis, remodeling and turnover may account for about 15% of all ~26,500 protein-encoding genes in Arabidopsis, i.e., ~4,000 genes. To date, only ~1,000 Arabidopsis genes have been characterized or predicted to be PCW related according to the Purdue Cell Wall Gene Families database (the Purdue database hereafter) [6]. Hence, the vast majority of the PCW related genes in Arabidopsis are yet to be identified.
Experimental elucidation of PCW related genes has been mainly done through forward genetic screening [7], which is time consuming and expensive. The rapid accumulation of genome-scale gene-expression data allows computational prediction of PCW related genes through co-expression analyses. The basic idea is that genes that are co-expressed under multiple conditions tend to be functionally related [9]; hence genes that are co-expressed with known PCW genes may also be PCW related. A number of studies have inferred PCW related genes using this or similar ideas. For example, Brown et al. and Persson et al. published the first two studies on prediction of new PCW related genes through microarray data analyses [12], in which the cellulose synthase (CESA) genes CESA4, CESA7, and CESA8 were used as the ‘seeds’ to identify additional genes with similar expression patterns.
A high percentage of the genes predicted to be PCW related in the two studies were later experimentally verified to be indeed involved in PCW biosynthesis [14], which demonstrated the power of co-expression analyses in identifying potential PCW genes, providing good candidates for further experimental validation.
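The seed-based idea described above can be sketched in a few lines. The following is only an illustrative toy, not the procedure used in the cited studies: the gene names, expression values, the use of the Pearson correlation, and the 0.9 threshold are all assumptions made for the example.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

# Hypothetical expression matrix: gene -> expression level in six conditions.
expression = {
    "SEED_A": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "SEED_B": [1.5, 2.5, 3.5, 4.5, 5.5, 6.5],
    "GENE_X": [2.0, 4.1, 5.9, 8.0, 10.1, 12.0],  # rises and falls with the seeds
    "GENE_Y": [6.0, 1.0, 4.0, 2.0, 5.0, 3.0],    # no clear relation to the seeds
}
seeds = ["SEED_A", "SEED_B"]

def coexpressed_with_seeds(expression, seeds, threshold=0.9):
    """Return non-seed genes whose mean correlation with the seeds is >= threshold."""
    hits = []
    for gene, profile in expression.items():
        if gene in seeds:
            continue
        mean_r = sum(pearson(profile, expression[s]) for s in seeds) / len(seeds)
        if mean_r >= threshold:
            hits.append(gene)
    return hits

candidates = coexpressed_with_seeds(expression, seeds)
print(candidates)
```

In this toy data set only GENE_X passes the threshold. Note that the correlation here is computed over all conditions at once, which is the limitation the present study sets out to relax.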
We present here a study on prediction of novel PCW related genes in Arabidopsis at a genome scale based on published gene-expression data collected under 351 conditions [17]. A unique feature of our study, compared to previous similar studies, is that we aim to find genes co-expressed with the known PCW related genes under multiple but not necessarily all conditions. This makes our strategy substantially more sensitive and specific in detecting PCW related genes than the published studies [12]. But this also raises a very challenging technical problem: how to determine which subsets of the 351 conditions should be considered? Clearly it is unrealistic to exhaustively go through all subsets of at least a certain size, out of 2^351 subsets in total, to search for such co-expressed genes.
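To give a sense of the scale involved, the number of condition subsets can be computed exactly with Python's arbitrary-precision integers. This is a small sketch; the minimum subset size of 20 is a hypothetical value chosen only for illustration, since the text does not specify one.

```python
from math import comb

N_CONDITIONS = 351

def subsets_at_least(n, k):
    """Number of subsets of an n-element set that contain at least k elements."""
    return sum(comb(n, i) for i in range(k, n + 1))

total_subsets = 2 ** N_CONDITIONS
large_subsets = subsets_at_least(N_CONDITIONS, 20)  # 20 is an assumed minimum size

print(f"all subsets: a {len(str(total_subsets))}-digit number")
print(f"subsets with >= 20 conditions: a {len(str(large_subsets))}-digit number")
```

Requiring a minimum subset size barely reduces the count, because subsets of intermediate size dominate the total, so exhaustive enumeration stays out of reach.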
To overcome this issue, we have applied a new and generalized clustering technique, called bi-clustering, to search for gene groups co-expressed under some (to-be-identified) subset of the 351 conditions. We specifically employed QUBIC, a bi-clustering algorithm that we recently developed for solving this type of generalized clustering problem [21].
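The motivation for searching over condition subsets can be seen in a toy example. The sketch below is not QUBIC and does not reflect how QUBIC works internally; it is a brute-force search that is feasible only because the example has eight conditions. The expression values, the Pearson correlation, the 0.99 threshold, and the minimum subset size of 3 are all assumptions for illustration.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either profile has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

# Hypothetical profiles over eight conditions: the gene follows the seed in
# conditions 0-3 and behaves differently in conditions 4-7.
seed = [1.0, 2.0, 3.0, 4.0, 2.0, 3.0, 1.0, 4.0]
gene = [2.0, 4.0, 6.0, 8.0, 7.0, 1.0, 6.0, 2.0]

def corr_on(conditions):
    """Correlation between seed and gene restricted to the given condition indices."""
    return pearson([seed[c] for c in conditions], [gene[c] for c in conditions])

def largest_coexpressed_subset(n_conditions, min_size=3, threshold=0.99):
    """Try subsets from largest to smallest; return the first one passing the threshold."""
    for size in range(n_conditions, min_size - 1, -1):
        for conditions in combinations(range(n_conditions), size):
            if corr_on(conditions) >= threshold:
                return conditions
    return ()

overall = pearson(seed, gene)
best = largest_coexpressed_subset(len(seed))
print(f"correlation over all conditions: {overall:.3f}")
print(f"best subset: {best}, correlation there: {corr_on(best):.3f}")
```

Over all eight conditions the two profiles look almost uncorrelated, yet on a suitable subset they are tightly co-expressed. A whole-profile correlation would miss this pair, and the brute-force loop used here would not scale to 351 conditions, which is why a dedicated bi-clustering algorithm is needed.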
We have implemented a computational pipeline based on QUBIC to perform bi-clustering analyses of the 351 transcriptomic datasets, using the known/annotated PCW related genes (the known PCW genes hereafter) as seeds to generate co-expressed gene modules in Arabidopsis. The predicted co-expressed gene modules were then computationally validated as transcriptionally co-regulated through identification of conserved cis regulatory motifs in the promoters of genes in the same module. Using this approach, we identified 2,438 candidate genes that are co-expressed with 349 known PCW genes under some conditions with high statistical significance. Functional analyses of the candidate genes revealed more detailed functional roles of these genes in PCW synthesis and remodeling. We have also carried out detailed functional analyses of the co-expression modules containing genes related to four major PCW synthesis components; these modules are likely to encode biological pathways that have similar functions but are expressed under distinct conditions. We believe that our overall analysis procedure will be useful for gene expression data analysis in elucidating other biological pathways in plants.