|Home | About | Journals | Submit | Contact Us | Français|
Gene-to-gene coexpression analysis is a powerful approach to infer the function of uncharacterized genes. Here, we report comprehensive identification of coexpression gene modules of tomato (Solanum lycopersicum) and experimental verification of coordinated expression of module member genes. On the basis of the gene-to-gene correlation coefficient calculated from 67 microarray hybridization data points, we performed a network-based analysis. This facilitated the identification of 199 coexpression modules. A gene ontology annotation search revealed that 75 out of the 199 modules are enriched with genes associated with common functional categories. To verify the coexpression relationships between module member genes, we focused on one module enriched with genes associated with the flavonoid biosynthetic pathway. A non-enzyme, non-transcription factor gene encoding a zinc finger protein in this module was overexpressed in S. lycopersicum cultivar Micro-Tom, and expression levels of flavonoid pathway genes were investigated. Flavonoid pathway genes included in the module were up-regulated in the plant overexpressing the zinc finger gene. This result demonstrates that coexpression modules, at least the ones identified in this study, represent actual transcriptional coordination between genes, and can facilitate the inference of tomato gene function.
To elucidate functional relationships between genes, coexpression analysis has proven to be a powerful approach. From a practical point of view, coexpression analysis requires two technical bases. The first is transcriptome data. Several model organisms, including Escherichia coli, yeast, and Arabidopsis, have been regarded as excellent targets for coexpression analysis, since a large amount of microarray data are publicly available. The second technical basis is the development of analytical methods. Generally, once coexpression measures between genes (e.g. correlation coefficients) have been estimated, subsequent coexpression analysis steps include visualization of the coexpression relationships, identification of densely correlated gene groups, and interpretation of biological relevance.1 Among several possible visualization methods, network representation provides an efficient way to depict complex relationships between many genes. To identify densely correlated gene groups, several algorithms have been developed based on connectivity of the network.2–4 For biologically relevant interpretation, data from other types of ‘omics’ analyses (e.g. metabolomics and interactomics) often help greatly. With these analytical methods, coexpression analysis at last allows the function of an unknown gene to be inferred.
Excellent technical bases for coexpression analysis have been established for the model plant Arabidopsis thaliana. To accumulate transcriptome data, various data repositories are now available, including NASCArrays,5 GEO,6 SMD,7 ArrayExpress,8 and AtGenExpress,9–11 which collectively provide more than 4753 microarray data points (3 September 2009). The results of coexpression analysis using this huge data set were combined with comprehensive gene annotations and data from metabolomics analysis, and then functions of several unknown genes associated with glucosinolate and flavonoid biosynthesis were elucidated.12–14 In parallel with the elucidation of function of individual genes, the genome-wide coexpression profile was also investigated and the results are available in databases such as ATTED-II,15 GENEVESTIGATOR,16 and BAR.17
Recently, several attempts have been made for large-scale gene expression analysis of tomato (Solanum lycopersicum). For example, the gene expression profile during fruit development was investigated in detail using a tomato microarray.18,19 Another example is the investigation of tissue-dependent gene expression. In fruit peel, coordinated expression of genes associated with metabolism of cuticular components, metabolism of hormones, and metabolism of cell wall components was demonstrated by hierarchical clustering analysis.20 Hierarchical clustering of gene expression patterns also demonstrated that acquisition of the fleshy fruit trait depends on tight regulation of gene expression.21 These studies suggest that large-scale coexpression analysis can shed light on the molecular mechanisms that control fruit development. However, these studies focused on few specific biological processes, and comprehensive identification of groups of highly correlated genes has not been reported.
In this study, we report on the comprehensive identification of groups of highly correlated genes, or coexpression modules, in tomato. We performed a gene-to-gene coexpression analysis using a network-based approach. We evaluated developmental changes in gene expression profiles in various tissues of tomato plants using an Affymetrix GeneChip Tomato Genome Array. Using gene expression data from 67 hybridizations, gene-to-gene Pearson's correlation coefficients (PCCs) were estimated, and then 199 coexpression modules associated with various biological processes were identified based on an analysis of network topology. Gene ontology (GO) annotation analysis revealed enrichment of genes belonging to common functional categories in 75 modules. We then experimentally verified the coordinated expression of module member genes using tomato plants overexpressing a non-enzymatic module member gene that is a strong candidate for a regulatory gene in flavonoid biosynthesis. This result demonstrates the facilitation by coexpression analysis of the identification of the function of uncharacterized tomato genes.
Miniature tomato, S. lycopersicum cultivar Micro-Tom, was grown as described previously.22 Roots were harvested 5 weeks after germination. Hypocotyl and cotyledon were harvested 3 weeks after germination. Third leaves were harvested 3 weeks after germination. All leaves of tomato plants were harvested 3 and 5 weeks after germination. Fruits were harvested at four developmental stages: mature green (MG, ~30 days after anthesis), yellow (Y, ~35 days after anthesis), orange (O, ~38–40 days after anthesis), and red (R, ~45–48 days after anthesis). S. lycopersicum cultivar Momotaro 8® (Takii & Co., Ltd, Kyoto, Japan) was grown in a greenhouse under natural photoperiod conditions from March to July 2006 in Chiba Prefecture. S. lycopersicum line 27859 was grown under field conditions from March to July 2006 in Gunma Prefecture. Monogenic mutant tomato, Anthocyanin fruit (Aft, LA1996), was provided by the C. M. Rick Tomato Genetic Resource Center (University of California, Davis, CA, USA), and was grown in a greenhouse under natural photoperiod conditions from March to July 2006 in Chiba Prefecture. Fruits of Momotaro 8®, line 27859, and Aft were harvested at MG and R stages. The peel and the flesh of fruits of Micro-Tom, Momotaro 8®, line 27 859, and Aft were separated using a razor blade. Harvested tissues were immediately frozen in liquid nitrogen and stored at −80°C.
Target for hybridization experiments was prepared using GeneChip One-Cycle Labeling and Control Reagents (Affymetrix, URL: http://www.affymetrix.com/) according to the manufacturer's instructions. GeneChip Tomato Genome Arrays (Affymetrix) were used for hybridization. Hybridization, washing, and staining were performed according to the manufacturer's instructions. Scanned GeneChip images were analysed using Microarray Suite version 5.0.1 software (Affymetrix). Normalization and analysis of microarray data were performed using GeneSpring GX 7.3 software (Agilent Technologies, URL: http://www.home.agilent.com/). The data were normalized per chip and per gene to the median value. CEL files of these experiments are available in Gene Expression Omnibus6 (GEO) DataSets (http://www.ncbi.nlm.nih.gov/gds) series record GSE19326.
Before performing coexpression analysis, the probes used for the analysis were screened as follows. First, probes for which flags were ‘A’ (absent) in all of the samples were excluded. Second, the coefficient of variance between biological replicates of a tissue was calculated, and probes were selected if they showed a coefficient of variance <1 in all of the samples. This probe screening procedure left 7644 probes for the following coexpression analysis. Normalized values of the selected probes were used to estimate the pairwise PCC. The data set for the PCC was then analysed using a network-based module-finding algorithm described previously by Ogata et al.4 This algorithm generates coexpression modules from a given ‘seed’ gene in six steps.4 In the first step, a seed gene was arbitrary chosen. In the second step, genes that directly connect to the seed gene with PCC higher than cutoff value (0.6) were selected, and referred to as a highly correlated gene group. In the third step, VB index was defined as VB(i) = e(i)/d(i), where VB(i) is a VB value of ith gene in the group, e(i) the number of edges between ith gene and other group member genes, and d(i) the number of edges between ith gene and all genes irrespective of group membership. VB value was calculated for all group member genes, and a gene that has the lowest VB value was excluded from the group. In the fourth step, from the highly correlated gene group, a subgroup that had the highest NB value4 was selected. NB value is defined as NB = Σe(i)/Σd(i), where definitions of e(i) and d(i) are the same as above. In short, NB represents a ratio of the number of edges within the subgroup to the number of all edges associated with subgroup members. The selected subgroup was referred to as ‘the best kernel gene group’. In the fifth step, VB value was calculated for all non-member genes. If a non-member gene had the ratio higher than threshold value, that gene was incorporated into the group. Finally, the best kernel group genes and genes incorporated in the fifth step were selected as members of a coexpression module. NB values of coexpression modules were calculated again, and coexpression modules with NB values >0.5 were selected. Threshold values were as follows: 0.6 for PCC, 0.333 for VB value, and 0.5 for NB value. For GO annotation of tomato genes, similarity search of the Affymetrix tomato consensus sequences that were used to design GeneChip probes (Tomato Consensus Sequences, downloaded from http://www.affymetrix.com/products_services/arrays/specific/tomato.affx#1_4) was performed against Arabidopsis genes (TAIR8_cdna_20080412, downloaded from the TAIR FTP site, http://www.arabidopsis.org/download/index.jsp) using the BLASTN algorithm. GO annotations of tomato genes were retrieved from TAIR GO Annotation Search (http://www.arabidopsis.org/tools/bulk/go/index.jsp) according to the best match to Arabidopsis genes.
A full-length cDNA clone of zinc finger protein (clone ID: LEFL2003DB10, GenBank accession number AK326277) was provided by National Bio-Resource Project Tomato25 (http://tomato.nbrp.jp/indexEn.html). Protein coding region of LEFL2003DB10 was amplified by PCR using a gene-specific primer set (5′-GGGGGGATCCATGGCAGTTGAGGCAAGACATC and 5′-GGGGGAGTCTTCAAGAAGACATGTTAACATGCAC). PCR product was cloned in between BamHI and SacI sites of pBE2113-GUS.26 Transformation of S. lycopersicum cv. Micro-Tom was performed essentially as described by Sun et al.27 with slight modification. Cotyledon and hypocotyl segments from 7-day-old seedling were used as explants. Explants were dipped in Agrobacterium tumefaciens (strain EHA105) suspension for 10 min and blotted dry on a sterilized paper towel. The explants were then placed on co-cultivation medium [MS salts, 3% (w/v) sucrose, 0.8% (w/v) agar, 1.75 mg/l zeatin, pH 5.8], and the plate was incubated for 48 h in the dark at 25°C. The explants were then cultured and selected on a callus induction plate containing MS salts, 3% (w/v) sucrose, 0.8% (w/v) agar, 1.5 mg/l zeatin, 50 mg/l kanamycin, 125 mg/l carbenicillin, 50 mg/l Meropen (Dainippon Sumitomo Pharma, Osaka, Japan) (pH 5.8). Every 2 weeks, calli were subcultured to a fresh callus induction plate. Subculture was repeated three times and zeatin concentration in the medium was gradually decreased (1.5, 1.0, and then 0.75 mg/l). Regenerated shoots were then rooted on a rooting plate containing half-strength MS salts, 3% (w/v) sucrose, 0.8% (w/v) agar, 50 mg/l Meropen (pH 5.8). Rooted plants were transferred to rock fibre (Nittobo, Tokyo, Japan, URL: http://www.nittobo.co.jp/english/index.htm), and then to a mixture of vermiculite and Powersoil (mix ratio 1:1, Kureha Chemical Ind., Tokyo, Japan, and Kanto Hiryou Ind., Saitama, Japan).
RT–PCR experiments were performed to confirm gene expression patterns observed in microarray experiments. The total RNA samples used as templates in microarray analysis were reverse transcribed using SuperScript III First-Strand Synthesis System (Invitrogen Corp., URL: http://www.invitrogen.com/) according to the manufacturer's instructions. Following reverse transcription, PCR was carried out using rTaq DNA polymerase (Takara Bio Inc., URL; http://www.takara-bio.com/index.htm). Real-time PCR reactions to confirm gene expression were carried out using a DyNAmo™ HS SYBR® Green qPCR Kit (New England Biolabs Inc., URL: http://www.neb.com/nebecomm/default.asp) by a DNA Engine Opticon 2 system (MJ Research Inc., Waltham, MA, USA). Primers used in this study are shown in Supplementary data 1. Elongation factor 1a gene (GenBank accession number X14449) was used as a control.
Transient expression vectors of the zinc finger protein fused to GFP were produced as described in Supplementary data 2. CaMV35S-sGFP(S65T)-nos3′ vector28 was used for transient expression of free GFP. The vectors were introduced to the epidermis of onion purchased from local market. Particle bombardment was performed by using Helios Gene Gun (Bio-Rad Laboratories, URL: http://www3.bio-rad.com/) according to the manufacturer's instruction. Expression of GFP-fusion proteins was monitored by using a confocal laser scanning microscope LSM700 (Carl Zeiss, URL: http://www.zeiss.com/). Image processing was performed using the ZEN 2008 software (Carl Zeiss).
We obtained gene expression data for tomato from 67 hybridizations using RNA derived from roots, hypocotyls, cotyledons, leaves, and fruits (Table 1, Supplementary data 3). To estimate coexpression profiles, we first calculated PCC values for all pair-wise combinations of the 7644 quality-checked probes (see Materials and methods). To find coexpression modules, we first generated network graphs using different PCC cutoff values. Network density became minimal at a cutoff value of 0.91, suggesting that, at a PCC cutoff value >0.91, a decreasing number of nodes are more tightly connected (Fig. 1A). Indeed, the numbers of nodes and edges decreased at PCC cutoffs >0.91 (Fig. 1B and C). However, even at a PCC cutoff of 0.95, the network was still complex, containing ~1000 nodes and 5000 edges (Fig. 1B and C). Thus, we concluded that the use of a PCC cutoff value alone will not efficiently find coexpression modules to an extent allowing the inference of module functions.
Global topology of the tomato coexpression network. (A) Network density, (B) number of nodes, and (C) number of edges, at varied PCC cutoff values. Inserts are magnified curves within a cutoff range from 0.85 to 1.0. Network density showed the minimal ...
Microarray data used for coexpression analysis
We attempted to identify coexpression modules using an alternative module-finding algorithm developed by Ogata et al.4 This algorithm detects coexpression modules not only by using PCC cutoff, but also by evaluating density and connectivity of networks. Each coexpression module was reconstituted from a given seed gene. Genes directly connected to the seed gene were first selected using PCC cutoff value. From this set of correlated genes, a subgroup that had the highest NB value4 (see Materials and methods) was selected, and referred to as the kernel group. Next, VB value (for definition, see Materials and methods) was calculated for all genes not belonging to the kernel group. If the gene had VB value higher than the threshold, that gene was incorporated into the kernel group. Resulting set of genes was defined as coexpression module. Modules with NB value above threshold were selected for further analysis. As a result, generated modules have dense connections within the module and sparse connections to other modules. When member genes overlapped between multiple modules, non-redundant member genes were bundled into a larger module. It has been reported that this approach can detect coexpression modules with better assignment to biological processes (e.g. metabolic pathway) than other algorithms.4 On the basis of this approach, 199 coexpression modules were identified (Supplementary data 4) using following threshold values: PCC cutoff, 0.6; VB value, 0.333; and NB value, 0.5. The number of member probes per module ranged from 3 to 103, with a median value of 7 member probes per module (Fig. 2A). The distribution of the NB value4 showed that more than 40% of the modules have an NB value >0.8, indicating that the modules have high intra-modular connectivity (Fig. 2B).
Distribution of characteristic parameters of the identified coexpression modules. Distribution of (A) number of member probes per module and (B) NB value. NB value is defined as a ratio of a number of edges within the module and a total number of edges ...
Functions of the modules were inferred using GO annotations. GO annotations to tomato probes were provided according to their similarity to Arabidopsis genes using a TAIR GO annotation search. First, we investigated whether or not specific GO terms were enriched in a given module compared with the GO term distribution in all Affymetrix tomato microarray probes. Enrichment of GO categories with significance at the 1% level was observed in 75 modules (Table 2). Enriched GO categories included chloroplast, plastid, cytosol, ribosome, other enzymatic activity, transferase activity, hydrolase activity, kinase activity, structural molecule activity, protein metabolism, and response to stress (Fig. 3). Ribosome-related genes were expected to be coexpressed, since ribosome is a protein complex. Coexpression modules enriched with chloroplast-related genes appear to be classified into several subgroups according to the sub-plastidal localization (e.g. envelope, thylakoid, and stroma) of proteins encoded by the module member genes.
Distribution of GO category significantly enriched within a coexpression module. Note that GO categories of ‘chloroplast’ and ‘plastid’ are frequently associated with the same genes. GO categories of ‘cytosol’, ...
Coexpression modules in which transcription factor genes are present or GO annotations are enriched
Coexpression analysis can facilitate prediction of functions of regulatory proteins that do not have enzymatic, transporter, or structural molecule activities. Modules containing transcription factor genes are of particular interest, since these transcription factors may have a role in controlling the expression of other module member genes. We identified 37 modules containing transcription factors in the 199 modules. In 16 modules containing transcription factors, significant enrichment of certain GO categories was observed (Table 2). For example, two transcription factor genes are found in module 52 (Table 2, Supplementary data 4). Genes corresponding to Les.3716.1.S1_at (GenBank accession number AJ277944) and Les.3517.2.S1_a_at (GenBank accession number BT012879), respectively, encoding Myb-family and TCP-family transcription factors, are tightly correlated with seven protease inhibitor unigenes (SGN tomato unigenes: SGN-U313509, SGN-U312622, SGN-U312829, SGN-U312623, SGN-U312822, SGN-U313508, and SGN-U312824) (Supplementary data 4). This implies that these transcription factors regulate expression of protease inhibitor genes. Another example is module 3 (Table 2, Supplementary data 4). Gene corresponding to LesAffx.69411.1.S1_at (GenBank accession number AW651000) encoding bHLH-family transcription factor is correlated exclusively with plastid-associated genes, implying that this bHLH-family protein is associated with the regulation of plastid function.
Elucidation of the role of transcription factors in regulating expression of coexpressed genes has been well documented.13 However, the role of regulatory protein genes that are not classified as transcription factors in the regulation of coexpressed genes remains unclear. We tested whether a non-transcription factor-type regulatory gene can control the coordinated expression of genes in a given module. To exemplify this, we performed an experimental analysis of module 64 (NB value 0.850, average PCC value 0.822), in which flavonoid biosynthesis genes are enriched (Table 3). Expression profiles of member genes of this module show that they are highly expressed in fruit peel tissues, and are expressed at a lower level in leaf and fruit flesh tissues, an expression pattern that correlates with the localization of tomato flavonoid compounds22 (Fig. 4A). We found that one of the non-enzymatic genes, corresponding to Les.2294.2.A1_at (GenBank accession number AK326277), encodes RING-finger type zinc finger protein by protein domain search using InterProScan29 (http://www.ebi.ac.uk/Tools/InterProScan/), although description of the best match Arabidopsis gene (At1g79110) and SGN unigene (SGN-U323178) indicates ‘expressed protein’ (Fig. 4B, Supplementary data 4). In a network graph of module 64, enzymatic genes of flavonoid biosynthesis are tightly interconnected. The zinc finger protein gene, hereafter referred to as ZnF, has direct links to genes of 4-coumarate-CoA ligase, cinnamoyl-CoA reductase, chalcone synthase 1, flavanone 3-hydroxylase, flavonol synthase, glycosyltransferases, and malonyl-CoA synthetase (Fig. 4C). To test whether this ZnF gene controls expression of flavonoid biosynthetic genes, we overexpressed a full-length cDNA of this gene (clone ID: LEFL2003DB10) in Micro-Tom. Gene expression analysis was performed using leaf tissues, in which expression of flavonoid biosynthesis genes is low in wild-type plants. Expression of genes of 4-coumarate-CoA ligase, cinnamate 4-hydroxylase, cinnamoyl-CoA reductase, chalcone synthase 1, chalcone synthase 2, chalcone isomerase, flavanone 3-hydroxylase 1, and flavonol synthase was higher in ZnF-overexpressing leaves than in control leaves, although PCC values between ZnF and these up-regulated genes were not very high, mainly because of high expression levels in one of the transformant lines (Fig. 4D). On the other hand, the expression of phenylalanine ammonia-lyase, which is not a member of the module, did not change significantly. The expression of flavonoid 3′-hydroxylase genes correlated negatively with overexpression of the ZnF gene. These results demonstrated that the ZnF gene positively regulates the expression of enzymatic genes in the early part of the flavonoid biosynthetic pathway, which is consistent with the coexpression relationship seen in module 64.
Coexpression modules 64
Analysis of intracellular localization demonstrated that the localization of GFP-ZnF fusion protein was the same as that of free GFP protein (Fig. 5). We obtained the same result using ZnF–GFP fusion protein (data not shown). This result suggests that ZnF protein is localized to cytosol, and that ZnF protein is not a canonical transcription factor protein. The RING-finger type zinc finger domain is reportedly involved in protein–protein interaction.30 Thus, it can be hypothesized that the ZnF gene positively regulates the expression of flavonoid biosynthetic genes through interaction with other transcriptional regulator proteins. This example demonstrates the potential of coexpression analysis in inferring functions of unknown regulatory genes that do not belong to transcription factor families.
Intracellular localization of (A) GFP protein and (B) GFP-ZnF fusion protein. N, nucleus. Scale bar, 50 µm. Localization pattern of GFP-ZnF is the same as free GFP, suggesting that ZnF protein is localized to cytosol.
Recently, coexpression analysis was used to predict the function of a transporter gene involved in Arabidopsis glucosinolate biosynthesis.31 The function of this transporter gene, BASS5, was experimentally demonstrated using BASS5 knockout Arabidopsis plants, in which the accumulation of methionine-derived glucosinolates decreased. The results shown in the present study, together with this previous transporter study, suggest that the validity of gene-to-gene coexpression analysis is not limited to genes involved in protein complex formation or transcriptional regulation, but is also applicable to inferring the function of various types of uncharacterized genes. Experimental verification of the functions of several other candidate genes for regulatory protein is in progress.
This work was partly supported by the Research and Development Program for New Bio-Industry Initiatives, and by a grant from Kazusa DNA Research Institute.
CaMV35S-sGFP(S65T)-nos3′ plasmid was kindly provided by Dr Yasuo Niwa (University of Shizuoka). We thank Dr Hiroshi Masumoto (Kazusa DNA Research Institute) for the technical advice for confocal laser scanning microscopy. We also thank Tomomi Ishizaki, Tsugumi Isozaki, Miyuki Inde, and Tsurue Aoyama (Kazusa DNA Research Institute) for plant care and lab assistance.
Edited by Satoshi Tabata