|Home | About | Journals | Submit | Contact Us | Français|
RNA-seq and microarray are the two popular methods employed for genome-wide transcriptome profiling. Current comparison studies have shown that transcriptome quantified by these two methods correlated well. However, none of them have addressed if they complement each other, considering the strengths and the limitations inherent with them. The pivotal requirement to address this question is the knowledge of a well known data set. In this regard, HrpX regulome from pathogenic bacteria serves as an ideal choice as the target genes of HrpX transcription factor are well studied due to their central role in pathogenicity.
We compared the performance of RNA-seq and microarray in their ability to detect known HrpX target genes by profiling the transcriptome from the wild-type and the hrpX mutant strains of γ-Proteobacterium Xanthomonas citri subsp. citri. Our comparative analysis indicated that gene expression levels quantified by RNA-seq and microarray well-correlated both at absolute as well as relative levels (Spearman correlation-coefficient, rs > 0.76). Further, the expression levels quantified by RNA-seq and microarray for the significantly differentially expressed genes (DEGs) also well-correlated with qRT-PCR based quantification (rs=0.58 to 0.94). Finally, in addition to the 55 newly identified DEGs, 72% of the already known HrpX target genes were detected by both RNA-seq and microarray, while, the remaining 28% could only be detected by either one of the methods.
This study has significantly advanced our understanding of the regulome of the critical transcriptional factor HrpX. RNA-seq and microarray together provide a more comprehensive picture of HrpX regulome by uniquely identifying new DEGs. Our study demonstrated that RNA-seq and microarray complement each other in transcriptome profiling.
Transcriptome of an organism represents the entire repertoire of transcripts encoded by the genes as a phenotypic response to the condition in which they exist. The sheer ability to simultaneously quantify the expression levels for a vast number of genes has revolutionized the biomedical research, facilitating the analysis of global gene expression patterns at the genome-wide scale . In the past decade, there has been a tremendous progress in the development of methods to deduce and quantify the gene expression levels at the whole transcriptome level . Among the several transcriptome profiling methods, RNA-seq and DNA microarray stand out as the two widely used genome-wide gene expression quantification methods [1-17].
RNA-seq method involves the conversion of isolated transcripts into the complementary DNA (cDNA), which is then directly sequenced in a massively parallel deep-sequencing-based approach . By mapping the resulting short sequencing reads onto the reference genome, the expression levels of genes relative to the condition of interest or absolute levels can be quantified [9,11]. This method has been implemented in different platforms like Illumina’s Genome Analyzer, Roche 454 Genome Sequence, and Applied Biosystems’ SOLiD . On the other hand, microarray is based on the hybridization of specimen target strands onto the immobilized complementary probe strands. For example, in a two-color microarray, transcripts extracted from different conditions are labeled with distinct fluorescent dyes while being converted to cDNA. These labeled samples are then hybridized to the immobilized complementary probe strands in an array representing the genes. By measuring the light intensity of the distinct fluorescent dyes, the relative abundance of each transcript in the two different conditions can be measured [8,12,13,17,19,20]. Affymetrix and Agilent are the two prevalent platforms in microarray technology [2,14].
Even though, initially microarray has been instrumental in whole transcriptome analysis, currently RNA-seq is becoming a preferred method of choice, since it is considered to effectively surmount the limitations of microarray [1,21-23]. RNA-seq technology, unlike microarray, does not depend on the prerequisite knowledge of the reference transcriptome . Further, RNA-seq data contains very low background signal, a higher dynamic range of expression levels, and also relatively small amount of total RNA required for quantification, when compared to microarray [1,23]. Despite these advantages, the efficiency of RNA-seq is marred with the problem of overwhelming amount of ribosomal RNA (rRNA) in the data, short reads, less base accuracy, and variation of read density along the length of the transcript, posing a challenge for this high-throughput method [21,25,26]. However, in spite of their strengths and limitations, RNA-seq and microarray have become the default popular methods of choices for genome-wide transcriptome studies [1,2,23].
Currently several studies have been conducted to compare the performance of RNA-seq and microarray in quantifying the expression level of genes, by focusing on various aspects like reproducibility, accuracy, statistical issues, technical and biological variabilities [1,15,21,27-30]. The main conclusion from these studies has been that the expression levels quantified by these two methods correlated to a large extent, and overall favored the RNA-seq because of high reproducibility, accuracy, and dynamic range [27,29]. However, none of these comparison studies have addressed if these two methods complement each other in transcriptome profiling given the strengths and limitations associated with them. In order to address this question, we require an already well characterized dataset. The HrpX regulome from Xanthomonas citri subsp. citri (Xcc) serves as an ideal data model in this regard [31-33]. Xcc is a causal agent of citrus canker, one of the serious and destructive diseases in citrus that is resulting in significant losses to citrus industry worldwide , while HrpX is a key global transcription factor that regulates the expression of hrp (hypersensitive response and pathogenicity) cluster of genes, which are considered as the major pathogenicity factors [31,35]. HrpX contains AraC-type of DNA binding domain, which specifically recognizes the plant-inducible promoter (PIP) box (TTCGC-N15-TTCGC) and imperfect PIP box (TTCGC-N8-TTCGT) present in the cis-regulatory regions of hrp gene cluster [36-38]. Since HrpX has a key role in pathogenicity, tremendous progress has been made in cataloguing the target genes of HrpX [39-45]. We therefore assessed the performance of RNA-seq and microarray in their ability to detect known HrpX target genes. We chose Illumina and Agilent as the corresponding platforms for RNA-seq and microarray, as they are the most popular platforms for these technologies [2,4].
In order to uncover the regulome of HrpX transcription regulator by profiling the wild-type and the hrpX mutant strains transcriptome, we had designed a microarray chip covering the whole genome under Agilent platform in our previous study . Here, we conducted genome-wide transcriptome profiling of these two strains by RNA-seq and compared the results to the previously published microarray data, to assess the performance of these two methods. Further, to avoid technical variation associated with RNA isolation, we used the aliquots from the same total RNA samples used for microarray experiments also for RNA-seq.
We obtained 16,431,283, 17,289,220, 18,124,120 sequence reads for the wild-type and 15,084,955, 17,831,920, and 18,115,115 for the hrpX mutant strain with a median sequence length of 74-base pairs (bp) (Additional file 1: Table S1). Raw reads often have high sequencing errors, especially in the 3′ end where there is a high chance of sequencing errors to occur . We therefore filtered the reads for high quality ones by trimming off the base pairs with low quality score assigned to them during down-line processing of RNA-seq. More than 90% of the reads passed the quality filter, as a result, the median sequence length of quality filtered reads subsequently dropped to 68-bp (Additional file 1: Table S1). We then mapped these high quality trimmed reads on to the Xcc genome. Approximately more than 90% of the reads could be mapped on to the reference genome, indicating good sequence coverage (Additional file 1: Table S1). Overall ~97% of the annotated genes had more than one read mapped, while merely ~3% of the annotated genes had no reads mapped, indicating good sequencing depth. Further, we also observed a difference in the sequence coverage between the chromosome and the two endogenous plasmids of Xcc. Annotated coding genes from the chromosome with a size of 5.18 mega base pairs (Mb) had 98% sequence coverage, whereas, it was 78% for plasmid pXAC64 with a size of 0.06 Mb, and relatively lower with only 62% sequence coverage for plasmid pXAC33 with a size of 0.03 Mb (Additional file 1: Table S2).
RNA-seq had coverage for 4323 genes with one or more reads mapped, while by microarray 4349 genes were assigned the fluorescence intensity values after the background correction. Among these 4312 genes (~99% of the total genes) were common to both methods, while merely 37 (0.8%) and 11 genes (0.2%) were uniquely called by microarray and RNA-seq respectively (Additional file 1: Tables S2 and S3; Additional file 2: Figure FS1). We compared the absolute levels of gene expression in terms of RNA-seq counts and microarray fluorescence intensities for all the listed genes called by both the methods. These two independent measures of transcript abundance associated with each gene for all the biological replicates from the wild-type and the hrpX mutant strains were compared separately. The resulting correlation was mapped as a scatter plot, with an average number of counts from Illumina sequencing against the normalized fluorescence intensities from Agilent arrays for each gene in the wild-type (Figure (Figure1A)1A) as well as in the hrpX mutant (Figure (Figure1B).1B). Absolute levels of gene expression correlated well, when estimated in terms of Spearman’s correlation coefficient (rs) with 0.78 (p-value < 0.0001) for the wild-type and 0.80 (p-value < 0.0001) for the hrpX mutant strain. This is in agreement with the previous reports that expression levels measured by microarray and RNA-seq had correlations ranging between 0.62 and 0.8 for prokaryotic and eukaryotic datasets [18,28,29]. However, there seems to be little or no correlation for the genes with low level of expression. We further estimated the correlation for the subset of genes with fluorescence intensity values ≤100 assigned by microarray (~360 genes) with the corresponding expression levels determined by RNA-seq. This subset of genes revealed a very poor rs of 0.2 (p-value <0.0002) and 0.3 (p-value <0.0001) for the wild-type and the hrpX mutant strains respectively. Although the expression levels of these genes did not change much according to microarray, RNA-seq reported them to have different expression levels. This may be attributed to the high sensitivity of RNA-seq method.
We further estimated the correlation between all the combinations of biological replicates for the wild-type and the hrpX mutant strains independently. The resulting rs values of these comparisons are represented in the form of heat maps, for the wild-type (Figure (Figure1C)1C) and the hrpX mutant strains (Figure (Figure1D),1D), which provide a global view of these correlations. Overall, on an average the wild-type with rs = 0.76 (p-value < 0.0001) and the hrpX mutant with rs = 0.78 (p-value < 0.0001) were observed for the biological replicates from all the correlation combinations. This level of comparison strongly suggested that not only the absolute level of gene expressions determined by RNA-seq and microarray highly correlated, but were also highly reproducible, in spite of the technical as well as the biological variability associated with the quantifications.
We also compared the performance of these two methods at relative level of gene expression. For this purpose, we first computed the relative expression level of genes in terms of fold-change (FC) for the hrpX mutant in relation to the wild-type strain, along with p-values to denote the statistical significance and false discovery rate (FDR), for having a good control over the false positives rate. We compared the relative expression levels for 4312 consensus genes both qualitatively and quantitatively, after transforming the FC values to logarithm base 2 (log2) scale without any statistical cut-off thresholds (Additional file 1: Table S2). For the 2587 (~60% of the consensus) genes, the expression levels agreed qualitatively, while 1725 (~40%) genes disagreed between the two methods (Figure (Figure2A).2A). At this point, our comparison was exclusively focused on whether the gene of interest is up- or down-regulated based on the sign of the log2 transformed FC values, but not necessarily on the FC magnitude. We further illustrated the quantitative relationship of log2FC between RNA-seq and microarray in the form of a scatter plot as shown in Figure Figure2B.2B. Genes with no change in expression levels in the wild-type and the hrpX mutant strains (FC = 1) clustered around log2FC of zero (log2 of one is zero) in the scatter plot (Figure (Figure2B).2B). The rs between the log2FCs determined by RNA-seq and microarray was found to be 0.30 (p-value < 0.0001) (Figure (Figure2B).2B). This lower correlation value indicated that the magnitude of FCs between the two methods differed largely that might be due to the background noise resulting from the many imperfections, which are inherent to the high-throughput technologies [47,48].
The correlation coefficient provides an overall estimate of correlation between the expression levels determined by RNA-seq and microarray methods. However, this does not zoom into the data in a detailed manner. For instance, no information is provided about how much of FC magnitude that actually differs between the two methods for a given gene. In order to get an insight into this aspect, we computed the fraction of genes deviating in their FC magnitude values by dividing the FC magnitude value determined by RNA-seq with that of microarray (Figure (Figure2C).2C). Here, the fold difference of one represents the fraction of genes that are determined to have a FC magnitude of ± 0.5 (bin width) by both RNA-seq and microarray methods. When we plotted this frequency as a histogram for the whole 4312 consensus genes, more than 75% of genes were found to have FC magnitude values ± 0.5 by RNA-seq and microarray methods. Since it is a relative expression comparison, genes whose expression values did not change much in the wild-type and the hrpX mutant strains, tend to have FC values = 1. Subsequently, it is more sensible to consider only differentially expressed set of genes for further comparisons.
We therefore applied FDR ≤ 0.05 (5%) in conjunction with FC (absolute log2FC ≥ 0.6) to filter the whole data set. In total, 87 (2%) genes from RNA-seq and 64 (1.5%) from microarray qualified at this cut-off threshold from the 4312 consensus genes (Additional file 1: Table S4). Together, 106 genes satisfied our selection criterion from both the methods (Additional file 1: Table S4). Among them 84 (79.2%) genes were up-regulated, while 22 (20.8%) genes were found to be down-regulated. Further, 45 (~42.45%) genes were common between both the methods, whereas, 42 (39.63%) and 19 (~17.92%) genes were uniquely detected by RNA-seq and microarray respectively (Additional file 1: Table S4; Additional file 2: Figure FS2). We further compared the FC values of the 45 consensus genes both qualitatively and quantitatively. These genes qualitatively agreed 100% by having the same trend of log2 transformed FC values by both RNA-seq and microarray (Figure (Figure2D).2D). Likewise the quantitative comparison was performed by estimating the correlation between the magnitude of log2FC determined by RNA-seq and microarray for the 45 consensus genes as shown in Figure Figure2E.2E. The magnitude of FC values between the two methods were found to be well correlated (rs = 0.76, p-value < 0.0001), indicating that the same trend of variation was observed in FC values between the two methods without any dispersion. Thereby, the magnitude of FC values determined by RNA-seq and microarray agreed to a large extent for the 45 consensus genes. In order to further pinpoint the deviation in the FC magnitude quantified by the two methods, we plotted the differences in the FC values determined by RNA-seq with respect to microarray, and the percentage of genes with that difference for the 45 consensus genes (Figure (Figure2F).2F). Majority of the genes (~98%) were found to have a magnitude of FC within the range of ≤ 1.5, while for the remaining 2% of the genes, it was 4.7-times higher in RNA-seq than the microarray based quantification. Based on these comparisons, we concluded that the relative gene expression levels quantified by RNA-seq and microarray were consistent to a large extent for the statistically differentially expressed set of consensus genes.
Traditionally, quantitative Reverse Transcription PCR (qRT-PCR) is used to validate the gene expression levels quantified by high-throughput technologies like RNA-seq and microarray . Therefore, we compared the relative expression levels quantified by RNA-seq and microarray by qRT-PCR for a subset of 43 (40.6%) genes (Additional file 1: Table S5) that were randomly selected from the 106 significantly differentially expressed genes. Among them, 19 genes were found to be common between both the methods, 12 genes were unique to RNA-seq, while remaining 12 genes were found to be unique to microarray (Additional file 1: Table S4). The expression levels were found to be highly reliable for genes that are determined to be significantly differentially expressed by RNA-seq (rs = 0.94; p-value < 0.0001) as well as microarray (rs = 0.97; p-value < 0.0001). For the consensus genes, microarray had a slightly higher correlation with qRT-PCR than RNA-seq (Figures (Figures3A3A and and3B3B).
We further plotted the percentage of genes that deviated in the magnitude of FC quantified by RNA-seq and microarray with respect to qRT-PCR (Figures (Figures3C3C and and3D).3D). For most of the genes, the magnitude of FC quantified by RNA-seq and microarray were relatively higher, when compared to qRT-PCR (fold difference >1). Overall, the magnitude of FC quantified by RNA-seq was in consistence with qRT-PCR based quantification (Figure (Figure3C).3C). For microarray, the magnitude of FC was observed to be consistent with qRT-PCR for a majority of genes, however, we also noticed outlier genes with a 9-times higher FC magnitude (Figure (Figure3D3D).
For the subset of 12 genes that were found to be uniquely determined by RNA-seq, the magnitude of FC quantified by RNA-seq correlated moderately with qRT-PCR (rs = 0.58; p-value 0.05) (Figure (Figure4A;4A; Additional file 1: Table S5). The 12 genes found to be uniquely detected by microarray had a correlation of rs = 0.92 (p-value 0.002) with qRT-PCR (Figure (Figure4B;4B; Additional file 1: Table S5). These correlations are slightly lower when compared to the consensus genes (rs ≥ 0.94). This indicated that the expression levels are more reliable for the genes that are determined to be significantly differentially expressed by both RNA-seq and microarray rather than by any one method. Moreover, it also indicated that there is a lot of variation in the magnitude of FC quantified by RNA-seq and qRT-PCR. We further evaluated this variation i.e. deviation from the magnitude of FC, by plotting the frequency histogram for the 12 genes unique to RNA-seq (Figure (Figure4C)4C) and microarray (Figure (Figure4D).4D). For the genes unique to RNA-seq, we observed that none of them had the same magnitude of FC, with 50% genes having 0 to 0.5-time lower and for the remaining 50% of the genes the magnitude of FC was observed to be 2 to 3-time higher, when compared to qRT-PCR (Figure (Figure4C).4C). Because of this inconsistence in the magnitude of FC, the expression levels are moderately correlated. For the genes unique to microarray, we observed a good consistence in the magnitude of FC with qRT-PCR (Figure (Figure4D4D).
Extensive and detailed studies have been carried out since past three decades in cataloguing the target genes of HrpX in the genus Xanthomonas using various genetic and biochemical methods [32,38,39,50-55]. HrpX is known to regulate hrp gene cluster that encodes the type III secretion system (T3SS) and effectors [31,56]. T3SS are specialized macromolecular machinery that act as a nano-injector to translocate the effector proteins into the cytoplasm of host plant cells . These translocated effectors manipulate the host cellular processes by altering signal transduction, transcriptional activities like suppression of basal plant defense responses, and protein turnover in host cells for the benefit of the pathogen . The T3SS machineries are evolutionarily conserved across many Gram-negative animal- and plant-pathogenic bacteria .
Xcc is comprised of 25 hrp genes, including 19 hrp-conserved (hrc) and 6 hrp-associated (hpa) genes that encode the T3SS . These genes are clustered in a ~25 kb region spanning from 462712 to 488334 bp of the genome . We applied statistically significant differentially expressed gene list that were derived from RNA-seq and microarray methods to this cluster. We counted for the number of known hrp cluster genes, which passed the FC and FDR cut-off thresholds from RNA-seq and microarray methods (Table (Table1).1). Among the 25 hrp cluster genes, 16 (64%) were detected by both RNA-seq and microarray methods. Six genes were found to be uniquely detected by microarray, whereas, none uniquely detected by RNA-seq (Table (Table1).1). Three genes namely, hrcC, hpa2, and hpaA could not pass our statistical cut-off criteria by any of the methods, although they followed the same qualitative expression pattern. We further quantified the deviation in the magnitude of FC for the 16 known hrp genes, found in consensus between RNA-seq and microarray (Figure (Figure5A).5A). The magnitude of FC for 5% genes found to be same, while for the remaining 95% genes it was found to be between 1.2 to 1.8-time higher in RNA-seq than in microarray. Even though, microarray overall detected more genes from hrp cluster, RNA-seq reported higher magnitude of FC (Table (Table11).
Xcc also encodes 25 putative effector genes regulated by HrpX, which meditate the interaction with the host plant, hence determine the host specificity . Since XAC2785, XAC1210 and XAC1209 were considered as pseudo or inactive genes, they were excluded from our analysis. We tabulated how many of these genes were detected by RNA-seq and microarray methods with their corresponding log2FC values along with p-value and FDR from the respective methods (Table (Table2).2). In total, 10 (45.5%) genes were detected by both the methods. RNA-seq and microarray uniquely detected one and three genes respectively. The remaining 9 genes (36.4%) were neither detected by RNA-seq nor by microarray, since they could not pass both the FC and FDR cut-offs (Table (Table2).2). For the 10 consensus genes, we calculated the fold differences in the magnitude of FC quantified by RNA-seq with respect to microarray. None of the genes had the same magnitude of FC between the two methods. Microarray estimated higher magnitude of FC for ~64% genes than RNA-seq, while RNA-seq estimated 1.2 to 1.8-time higher magnitude of FC for the remaining ~36% genes (Figure (Figure5B).5B). In contrast to hrp gene cluster, where microarray qualitatively outperformed RNA-seq in its ability to detect more genes, here RNA-seq complemented quantitatively with higher confidence by reporting higher magnitude of FCs. Thereby, for the effector gene data set, RNA-seq and microarray complemented each other both qualitatively as well as quantitatively.
Overall, considering T3SS and effector genes, in total there are 47 genes, from which, 26 genes (55%) were detected by both RNA-seq and microarray (Tables (Tables11 and and2).2). RNA-seq uniquely detected 1 gene (2%), whereas, microarray detected 9 genes (19%). Remaining 11 genes (23%) were not detected by either one of the methods by failing to pass the cut-off threshold (Tables (Tables11 and and2).2). Further, considering only the genes that are detected by at least one method, 72% of the known were detected by both methods, while remaining 28% were detected by either one of the methods.
Among the 87 statistically significant differentially expressed genes from RNA-seq, 42 (39.63%) genes were found to be uniquely detected by this method (Additional file 2: Figure FS2). Of these 42 genes, 17 were found to be down-regulated, while 25 were up-regulated (Additional file 1: Table S4). Nearly 98% of these genes (41 of 42 unique) could not pass the FC cut-off threshold by microarray. The only exception is the gene fliO (XAC1945) that encodes a flagellar protein for flagellum apparatus, which passed the FC cut-off, but failed with FDR threshold. The gene XAC0755 encoding KdpF, a component of an integral membrane potassium-transporting system , is down-regulated by a factor of 3 (log2 FC of 1.6) according to RNA-seq, but, microarray could not capture this, as the probes for this gene were missing on the chip. This shows the limitation of microarray, where probes for all the genes need to be defined while designing the chip. Furthermore, four genes uniquely found by RNA-seq are involved in signal transduction and gene regulation, i.e. XAC4116 encoding a serine/threonine kinase, XAC1819 encoding a tryptophan-rich sensory protein, and two regulatory genes XAC3026, and XAC3363, whose function in citrus canker disease development remain to be explored. Furthermore, 21 genes (24%) are currently annotated as hypothetical proteins (Additional file 1: Table S6). Among them, four hypothetical proteins XAC0854, XAC4131, XAC1203, and XACb0064 were predicted to be T3SS secreted while 7 hypothetical proteins, XAC3275, XAC3680, XAC1943, XAC0527, XAC0599, XAC0239, and XAC0755 were predicted to be Type 2 Secretion System (T2SS) substrates (Additional file 1: Table S6) by Effective database . Gram-negative bacteria employ T2SS to transport proteins to the extracellular milieu, where the T2SS exo-proteins containing N-terminal signal peptides are used for inner-membrane translocation through either the Sec translocon or the Tat complex . Genes encoding proteins secreted by T3SS and T2SS have been experimentally proved to be regulated by HrpX [33,62,63].
Among the 64 statistically significant differentially expressed genes from microarray, 19 (29.7%) genes were found to be uniquely detected by this method (Additional file 2: Figure FS2). 18 were found to be down-regulated, while one gene was up-regulated (Additional file 1: Table S4). Unlike that of RNA-seq, nearly 63% genes (12 of 19 unique) could pass the FC cut-off threshold, but failed to pass the FDR threshold by RNA-seq. The remaining 37% genes (7 of 19 unique) could not pass both FC and FDR cut-off threshold. Furthermore, six genes were found to be hypothetical. Among them XAC2876, XAC1241, and XAC2370 were predicted as T2SS substrates. XAC1241 predicted as a T2SS substrate, shared 73% identity with a putative secreted protein from X. campestris pv. vesicatoria strain 85–10. Another T2SS candidate XAC2370 shared 95% identity with a secreted protein from X. fuscans subsp. aurantifolii str. ICPB 10535. XAC1124 shared 100% identity with MEKHLA domain protein from X. axonopodis pv. punicae str. LMG 859 . This domain is found in bacteria associated with plants. It further shares similarity with the PAS domain and might be involved in light, oxygen, and redox potential sensation .
For comparison based on the biological function for the differentially expressed genes from RNA-seq and microarray, we utilized the ClueGO to integrate the Gene Ontology (GO)  terms and KEGG  pathway terms and create a functionally organised GO/KEGG network. Functional annotation with biological processes category resulted in 13 (14.94%) genes found from cluster for RNA-seq, while for microarray it was 12 (19.35%).
The ClueGO overview pie chart highlighted that significant proportion of the genes differentially regulated are involved in “protein secretion by the T3SS” by both RNA-seq and microarray (Additional file 3: Figure FS3A & D). Additionally, RNA-seq also identified genes involved in “secretion activity by cell” as well as “single organism catabolic process” (Additional file 3: Figure FS3A). On the other hand, microarray highlighted the genes involved in “protein transmembrane transport”, “polycyclic aromatic hydrocarbon degradation” and “establishment of localization in cell” (Additional file 3: Figure FS3D). Majority of the genes are involved in “bacterial secretion system”, as shown by both RNA-seq and microarray. Also the differentially expressed genes are found to be significantly involved in the “transport of monovalent inorganic cation” (Additional file 3: Figure FS3B) and “protein transport” (Additional file 3: FS3E). Genes have also been found uniquely by microarray as significantly involved in “polycyclic aromatic hydrocarbon degradation” (Additional file 3: Figure FS3E). Genes from RNA-seq have been found to be involved in “riboflavin metabolism” as well as “single organism catabolic process” (Additional file 3: Figure FS3B). Further, visualization of the functionally grouped annotation network for the differentially regulated genes derived from RNA-seq (Additional file 3: Figure FS3C) and microarray (Additional file 3: Figure FS3F) methods highlighted the relationships between the terms. RNA-seq highlighted “protein secretion by the T3SS” along with the “small molecule catabolic process”, while microarray reflected “polycyclic aromatic hydrocarbon degradation” and “establishment of localization in cell”, as the most significant terms of the group. This analysis also showed that RNA-seq and microarray together provide more comprehensive functional information than the individual methods.
HrpX is known to regulate the target gene expression by specifically binding to PIP box motif present in the cis-regulatory regions. PIP box consists of direct repeats of “TTCGC” with a spacer of 8 to 26-bps in between the repeats, even though ideally 8-bps and 15-bps are considered as the canonical PIP box . We exploited this feature and looked for PIP boxes in the promoter regions of the 106 significantly differentially expressed genes (Additional file 4). All the 106 differentially expressed genes could be assigned to 90 transcriptional units based on MetaCyc database  (Additional file 1: Table S8). However for simplicity, genes under the control of the same cis-regulatory regions were counted separately. Among the consensus 45 genes, 36 (80%) were shown to have canonical PIP boxes (Figure (Figure6A,6A, Additional file 1: Table S7). Of the 42 genes that are uniquely determined by RNA-seq, 13 (31%) genes were confirmed to have PIP boxes; whereas, among the 19 genes that are uniquely determined by microarray 11 (57.8%) genes were confirmed to have PIP boxes (Figure (Figure6A,6A, Additional file 1: Table S7). In this study, we identified newly PIP box motif in 7 (19.4%) genes among consensus, 13 (100%) genes unique to RNA-seq and 1 (9%) gene unique to microarray (Figure (Figure6B).6B). Overall, 60 of the 106 (~57%) significantly differentially expressed genes were confirmed to have PIP boxes in their cis-regulatory regions (Additional file 1: Table S7, Additional file 4). The presence of PIP box confirmed that these genes may be directly regulated by HrpX, while the remaining 46 that do not have PIP boxes may be indirectly regulated by HrpX via the other transcription factors. In this regard, we looked for genes with sequence specific DNA binding activity in the 106 differentially expressed genes. Six genes namely hrpG, pcaQ, blal, XAC3026, XAC3445, and XAC3446 were known to have sequence specific DNA binding activity according to GO annotation. Among them, XAC3446, XAC3445, and blaI have been newly identified in this study containing PIP box motif (Additional file 1: Table S7, Additional file 4). Thereby these 3 transcription regulators are directly regulated by HrpX, which in turn we assume regulate the 46 genes, which do not contain the PIP box motif and hence indirectly regulated by HrpX. The DNA binding signatures for many of these transcription factors are unknown; hence, obscure the further confirmation of regulation by these transcription factors. Nevertheless, the fact that many of the genes that were uniquely determined by each method showed a clear PIP box in their cis-regulatory regions reiterates that RNA-seq and microarray complement each other.
Currently, RNA-seq is becoming the preferable choice for gene expression profiling in place of microarrays. Although, all the parameters that influence the various aspects of this method are yet to be understood completely, RNA-seq undoubtedly is playing a very important role in deciphering the complexity of the transcriptome by giving a new direction to isoforms, allelic expression, untranslated regions, splice junctions, antisense regulation and intragenic expression [10,16,29,68-74]. Several studies have begun to investigate on the parameters like sequencing depth, precision, GC bias, length bias, lane effects, and processing artifacts [16,29,48,75-77]. On the other hand, microarrays are in usage for more than two decades. Therefore, most of the biases inherent to this method have become more apparent . For instance, biases in the hybridization of the samples labeled with Cyanine5 (Cy5) and Cyanine3 (Cy3) are sufficiently explored, and currently several approaches are practiced to minimize such effects [79-82]. Further, systematic variability like influence of the image scanner settings on the dye intensity measurements have now been robustly handled by applying various normalization techniques [83-86]. Despite these developments, some inherent genes–specific biases like differential hybridization efficiencies of the labeled target transcript to the same probe are still found to be inevitable in microarrays. In RNA-seq as well as microarray, all these known and unknown parameters influence the final outcome. Therefore, in this study, we focused on the assessment of RNA-seq and microarray based on the final outcome .i.e. statistically significant differentially expressed genes.
In comparison with previous RNA-seq studies, with a sequence coverage of 97% we observed for our data set, is in consistence with the reported 89.5% to 95% coverage observed in other bacterial RNA-seq studies [87-89]. In our study, RNA-seq has identified more significantly differentially expressed genes (82%), when compared to microarray (63%) as in previous studies [18,29,30]. The overall correlation (rs 0.76) in the magnitudes of FC for the consensus genes between the two methods was found to be similar or higher than previous studies [18,29,30,72]. Furthermore, our comparison analysis with qRT-PCR suggested that the expression levels were highly reliable for those genes that were determined to be differentially expressed by both RNA-seq and microarray. Hence, confirming the differential expression of genes by multiple methods reduces false positives thereby enhances the biological discovery.
Even though microarray overall outperformed RNA-seq by detecting more known HrpX target genes from the T3SS in hrp cluster by satisfying both FC and FDR cut-off threshold, in principle RNA-seq also detected genes hrpB5, hrcS, hpaP, XAC0395, hrpB7, and hrcT, in terms of FC, but failed to pass FDR threshold. This parameter is more directly influenced by error model considered in the statistical method that is used to infer the differential expression rather than RNA-seq itself. For the same read counts, one can get slightly different FDR values depending on the statistical method . But the implementation of all the statistical methods is not feasible for every dataset. From the T3SS in hrp cluster, three genes namely, hrcC, hpa2, and hpaA were not found to be detected by both RNA-seq and microarray, mainly because they fail to pass FDR threshold. Interestingly, our previous microarray analysis confirmed that all these three genes are regulated by HrpX, but only at a later stage of the growth phase by satisfying both FC and FDR cut-off thresholds . This consolidates the regulation of some of the genes at later stages of the growth phase. Further, in case of Type III effector genes, 8 genes (36.4%) were not detected by both RNA-seq and microarray within considered cut-off threshold limit. However, among them xopL, avrBs2, xopAK and xopZ were found to be regulated by HrpX only at the later stage of the growth phase (OD600 time point 0.5), according to our previous microarray analysis . Further, four genes namely, pthA2, pthA1, pthA3, pthA4 were regulated by another transcription regulator HrpG at early stage of growth phase (OD600 = 0.25 and 0.4) as observed in our previous study, while another undetected gene xopE was found to be also regulated by HrpG, but only at OD600 = 0.25 time point of growth phase . Thereby this study further validated our previous results. Subsequently, both methods detected 100% of the genes known to be regulated by HrpX (at time point OD600 = 0.4) without any false positives. Among them, 72% were detected by both the methods while interestingly 28% of the known target genes were detected by either one of the methods. Hence, both the methods together could complement each other.
In addition 55 genes (~51%) were newly identified as differentially expressed by applying both microarray as well as RNA-seq methods, thereby adding up to the already existing repertoire of HrpX regulated genes. Furthermore, 46 (83.6%) genes among them were uniquely identified by either one of the methods. Overall, 21 newly identified genes were found to have PIP box in their promoter regions, wherein 14 (58.3%) genes were uniquely identified by either RNA-seq or microarray. The presence of the PIP box in the promoter regions of the HrpX-regulated genes uniquely identified by RNA-seq and microarray further not only confirmed that these genes are directly regulated by HrpX, but also that these candidates are not false positives. Consequently, 100% of the known HrpX regulated genes could only be detected together by both the methods, since each method missed out on some of the known genes; hence both the methods together enhance the understanding of HrpX regulome by providing a more comprehensive picture.
This study has significantly advanced our understanding of the regulome of the critical transcriptional factor HrpX and demonstrates that RNA-seq and microarray complement each other in transcriptome profiling. Consequently, our study demonstrates the advantage of applying multiple transcriptome profiling methods to reveal a more comprehensive picture of a transcriptome, rather than relying solely on one method.
The wild-type X. citri subsp. citri, and the hrpX mutant strains used in this study were described in our previous study . Both the strains were grown at 28°C in nutrient broth (NB), on nutrient agar (NA), or in NYG medium . Antibiotics rifamycin and kanamycin were added to the media at 50 μg/ml final concentrations.
Total RNA was extracted from the wild-type and the hrpX mutant strains as described in our previous study . Briefly, strains from NA plates were grown in NB medium at 28°C until mid-exponential phase. Cultures were harvested by centrifugation and inoculated in to nutrient-deficient XVM2 medium, after washing the pellet once with the same medium. Cultures were finally harvested for RNA extraction, when the optical density at 600 nm reached the value of 0.4, and mixed immediately with RNAprotect bacterial reagent (Qiagen, Valencia, CA, and U.S.A.). Total RNA was extracted from each replicate separately using RiboPure bacteria kit (Ambion, Austin, TX, USA), according to manufacturer’s instructions. Genomic DNA contamination from the extracted RNA samples was removed using TURBO DNA-free kit (Ambion). Amount and the quality of the RNA samples was initially determined using NanoDrop™ 1000 spectrophotometer (NanoDrop Technologies, Inc., Wilmington, DE). Samples with absorbency at 260/280 and 260/230 nm ratios > 2 were subjected to further processing. Three biological replicates of the wild-type and the hrpX mutant samples were used for RNA-seq analysis.
The microarray data used in this study was generated during our previous study . Three unique 60-mer oligonucleotide probes were designed for each of the 4,427 protein coding genes of X. citri subsp. citri. 8-by-15-K DNA microarray chips covering the whole genome were implemented under the Agilent platform. These microarrays were processed at the Interdisciplinary Center for Biotechnology Research Microarray Core Facility, University of Florida. The raw data is available at National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) data repository under the accession number GSE24016 .
Total RNA samples were enriched for mRNA, by depleting rRNA using MICROBExpress kit from Ambion following the manufacturer’s instructions. Enriched samples were checked for integrity using Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA). RNA samples that passed the quality control were sequenced using the Illumina Genome Analyzer IIx (GAIIx) system by following the standard protocol at the Center for Genome Analysis at Yale University. Real-time analysis and base calling were performed using the CASAVA v1.6 pipeline. The raw sequence data has been submitted to the NCBI Sequence Read Archive and assigned with an accession number SRA052842.
The X. citri subsp. citri whole genome sequence consisting of one chromosome [GenBank: NC_003919.1], and two plasmids [GenBank: NC_003921.3 and NC_003922.1], along with the annotation information were downloaded from NCBI repository (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/). Quality-filtered reads were aligned on to the genome using CLC Genomics Workbench v4.7.2 (CLC bio, Aarhus, Denmark). Reads uniquely aligned to each gene were tabulated from each replicate separately. Differentially expressed genes were estimated using DESeq package , available under the open-source Bioconductor suite of programs . DESeq is a powerful tool to estimate the variance in RNA-seq data and test for differential expression . As an input, DESeq accepts a table of read counts for each gene from different biological replicates, and estimates the differentially expressed genes using negative binomial distribution . Statistically significant differentially expressed genes from both microarray and RNA-seq data were obtained by applying a cut-off threshold of FDR ≤ 0.05 (5%) and an absolute log2 fold-change ≥ 0.6.
Similarity searches were performed online using position-specific iterative BLAST (PSI-BLAST) at NCBI site against non-redundant protein database . T3SS and T2SS predictions were performed using Effective database . The promoter regions of the significantly differentially expressed genes were retrieved manually using NCBI genome browser to look for the presence of PIP boxes. The differentially expressed genes were assigned to the transcriptional units by referring to the MetaCyc database . Biological interpretation of the differentially expressed genes was carried out using the ClueGO v1.5 , a Cytoscape plug-in .
All the qRT-PCR assays were performed as detailed elsewhere . Briefly, gene-specific primers were designed for the selected genes using PrimerQuestSM from Integrated DNA technologies (IDT), Coralville, Iowa (Additional file 1: Table S6). qRT-PCR experiments were performed in triplicates, at least three times for each gene using 7500 fast real-time PCR system (Applied Biosystems, Foster City, CA, USA), using a QuantiTect SYBR green RT-PCR kit (Qiagen) with similar results, by following the manufacturer’s instructions. The relative fold change of target gene expression was calculated using 16S rRNA as an endogenous control with the formula 2–CT.
The raw RNA-seq data from this study is deposited at the NCBI sequence read archive (http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi), under the accession number SRA052842, while the raw microarray data is available at the NCBI Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo) with the accession number GSE24016.
We declare no competing interests.
NW conceived the idea and initiated the study. SK performed analysis and interpretation of the data and wrote the manuscript. QY carried out the PCR experiments. YG participated in the study design, sample and data collection, and data interpretation. NW participated in interpretation of data and writing the manuscript and supervised the overall work. All authors read and approved the final manuscript.
The following excel format file contains the following 8 additional tables: Table S1: Summary of RNA-seq reads from wild-type and hrpX mutant strains of X. citri subsp. citri. Table S2: List of genes that are called by both RNA-seq and microarray. Table S3: List of genes that are uniquely called by RNA-seq and microarray. Table S4: List of statistically significant differentially expressed genes by RNA-seq and microarray filtered by cut-off thresholds. Table S5: List of randomly selected genes for the comparison with qRT-PCR from the statistically significant differentially expressed genes from RNA-seq and microarray. Table S6: Gene specific primers used in qRT-PCR experiment. Table S7: Summary of bioinformatics analysis of statistically significant differentially expressed genes to be part of Type III Secretion System (T3SS) and Type II Secretion System (T2SS) along with the occurrence of PIP box. Table S8: List of 90 transcriptional units from X. citri subsp. citri to which the 106 differentially regulated genes belong.
Contains the following two additional figures, Figure FS1: Venn diagram summarizing genes called by both technologies, when comparison is carried out between the total currently annotated open reading frames (ORFs) available transcripts from the transcriptome of X. citri subsp. citri. Fold-change values are available from RNA-seq (4323) and microarray (4349). Gene’s called by both technologies are indicated by the overlap between the two circles. 4312 are found in consensus, while 11 and 37 are unique to RNA-seq and microarray respectively. Figure FS2: Venn diagram summarizing genes that are significantly differentially expressed determined by RNA-seq and microarray. Gene’s common to both methods are indicated by the overlap between the two circles.
Figure FS3 - Comparison at the level of the functional annotations of the significantly differentially expressed genes from RNA-seq and microarray. GO term and KEGG pathway information enrichment analysis is shown for the genes from RNA-seq (left panel) and microarray (right panel). The overview of the analysis is shown in the form of pie chart for gene set from RNA-seq (A), and microarray (D). The histogram shows the number of genes associated with terms for the genes from RNA-seq (B) and microarray (E). Significantly enriched terms are indicated with ’*’. The terms that are functionally related are shown as a network with terms as nodes and relatedness is indicated with thickness of the edges that is based on their kappa score. The most significant term per group are shown for genes from RNA-seq (C) and microarray (F).
Figure FS4 - Snapshot of the PIP box motif present in the cis-regulatory region of significantly differentially expressed genes is shown in the context of the whole genome of X. citri subsp. citri. The absolute position of each PIP box motif occurrence is shown on the whole genome map along with the −10 ‘TATA’ regions and the gene start site.
We wish to acknowledge Gabriela Bindea for providing annotation files for ClueGO software. This work was supported by United States Department of Agriculture - NIFA Special Citrus Canker Grant Project 94677.