Search tips
Search criteria 


Logo of nihpaAbout Author manuscriptsSubmit a manuscriptHHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
Genomics. Author manuscript; available in PMC 2008 February 1.
Published in final edited form as:
PMCID: PMC1945183

Evaluating microarrays using a semi-parametric approach: application to the central carbon metabolism of E. coli BL21 and JM109


E. coli K (JM109) and E. coli B (BL21) are strains used routinely for recombinant protein production. These two strains grow and respond differently to environmental factors, such as glucose and oxygen concentration. The differences were attributed to different expression of individual genes that constituted certain metabolic pathways that are part of the central carbon metabolism. By implementing the semi-parametric algorithm which is based on the null hypothesis of equal distribution, it was possible to compare and quantify the expression patterns of groups of genes involved in several central carbon metabolic pathways. The groups comprising the glyoxylate shunt, TCA cycle, fatty acid, and gluconeogenesis & anaplerotic pathways were expressed differently between the two strains while no differences were apparent for the groups comprising either glycolysis or the pentose phosphate pathway. These results further characterized differences between the two E. coli strains and illustrated the potency of the semi-parametric algorithm.

Keywords: microarray, E.coli, glucose metabolism, semi-parametric algorithm


Following gene transcription with microarrays is currently the preferred method for evaluating either gene expression variations between different cells or the effects of various compounds and environmental conditions on gene expression. With the use of commercially available software, such as Partek Genomics Suite or Acuity, microarray data can be organized to highlight specific genes or to identify entire biological pathways [1]. The level of gene organization depends on the algorithms being used to search the data for possible correlations, groupings, and patterns [2]. From a statistical standpoint, a critical aspect of evaluating microarray data is determining whether or not the expression levels of certain genes or groups of genes are truly different from one another [3, 4]. For this type of comparative analysis, a number of hypothesis testing methods are commonly used such as classical t-test, F-test, and ANOVA; each with advantages and limitations depending on the exact application [5, 6].

It was previously established that the growth behavior and metabolite production pattern of E. coli K (JM109) and E. coli B (BL21) are very different [7,8]. Most importantly, E. coli B is insensitive to high glucose concentration and does not produce acetate, which is believed to adversely affect both growth and protein production. In contrast, E. coli K is very sensitive to glucose concentration and produces high levels of acetate. As a result, intensive work was conducted to understand the differences between the two strains by comparing their central carbon metabolism. This was done by analyzing growth characteristics and acetate production [7], performing flux analysis [8], following carbon flow using labeled carbon [9], and following individual gene transcription using northern blot analysis and cDNA microarrays [10, 11]. In the presented work, a statistical semi-parametric algorithm [12] was implemented for evaluating previous results by comparing the transcription of groups of genes belonging to specific metabolic pathways of the central carbon metabolism. This algorithm analyzed microarray results by employing both computational and graphical components. The computational component calculated density functions, estimated data distributions, and then correlated differences between data sets that had distinct distributions with commonly used p-values. The graphical component used calculated log2 numbers that were obtained from the intensity values and plotted these unitless numbers along the x-axis while the estimated density functions were plotted along the y-axis.

A list of the genes involved in the glyoxylate shunt, TCA cycle, glycolysis, pentose phosphate, the combined gluconeogenesis and anaplerotic patyways, and the fatty acid pathways was assembled. Bacterial samples were collected and analyzed using oligonucleotide microarrays. Normalized intensity values for each gene within a specific pathway were then used as input for the semi-parametric algorithm and for evaluating the differences between the pathways of the two strains.


E.coli B (BL21) and E.coli K (JM109) were grown at an initial glucose concentration of 40 g/L (Figure 1) and their gene transcriptions were analyzed using oligonucleotide arrays. Samples for microarray analysis were taken at the end of the logarithmic growth phase, when the glucose concentration was below 1 g/L. Genes were then grouped according to the following pathways: glyoxylate shunt, TCA cycle, glycolysis, pentose phosphate, fatty acids, and gluconenogenesis and anaplerotic; each consisting of 5 to 18 genes. Three separate fermentations were conducted with samples from each run being assayed using microarrays. The resulting data, in the form of gene-specific signal intensity values, are summarized in Table 1. This data was then used as input for the semi-parametric algorithm, the results of which are shown in Table 2 and Figure 2. Table 2 provides the p-values for each pathway while Figure 2 provides plots of the estimated density functions for both E. coli strains vs. log2 of the average intensity values. Gene-specific intensity values from the three different arrays were averaged and converted to log2 values for each E. coli strain. (This log2 conversion is the reason for the difference between the intensity values listed in Table 2 and the values in found along the x-axis in the plots of Figure 2).

Figure 1
Bacterial growth, acetate production and glucose consumption at high initial glucose concentration of E. coli strains, A – E.coli BL21, B – E.coli JM109 the arrows indicate sampling time for microarray analysis
Figure 2
Comparison of the reference density function (E.coli BL21) and the distortion density function (E.coli JM109) vs. log2 of average intensity values for each of the following pathways: (a) Glyoxylate shunt, (b) TCA cycle, (c) Glycolysis, (d) Pentose phosphate, ...
Table 1
Microarray data in the form of normalized signal intensities; used as input for the algorithm
Table 2
Results of the semi-parametric algorithm applied to normalized microarray data from 6 hybridized oligonucleotide arrays

As shown in Table 2, the glyoxylate shunt, TCA cycle, and fatty acid pathways are distributed differently between the two E.coli strains because their p-values are much smaller than 0.05 (0.0068, 0.0005, and 0.0238, respectively) which was set to correspond to the likelihood of occurrence of 5% [13]. Acceptance of the null hypothesis of equal distribution takes place for p-values greater than the limit of 0.05. Conversely, p-values below the limit of 0.05 correspond to rejection of the null hypothesis. In other words, the genes that collectively constitute each of the three pathways listed above are being expressed differently between the two E. coli strains, and these differences are less than 5% likely to occur naturally; taking also into consideration inherent variability between slides, sample preparation, etc. [14]. In fact, the glyoxylate shunt and the TCA cycle have such low p-values that the likelihood of these distributions occurring naturally is a fraction of 1% (0.68% and 0.05%, respectively). Figures 2(a), 2(b), and 2(e) graphically illustrate the differences between the two E. coli strains for the three pathways. No point-specific overlaps or structural similarities are apparent in any of these figures.

The gluconeogenesis and anaplerotic pathway has a p-value only slightly larger than the limit of 0.05 (0.0592) and therefore the genes in this pathway are also being expressed differently between the two strains, but not as significantly as the previously mentioned pathways. Figure 2(f) highlights the differences and similarities between the two strains for this pathway. Despite some common features between the two curves, such as the slope of the initial ascent, there are several important differences, such as the well-defined peak in the E. coli JM109 curve and the rapid descent of the curve for higher values of x when compared with the E. coli BL21 curve.

For the glycolysis and the pentose phosphate pathways, no differences were apparent, evidenced by their relatively large p-values (0.6142 and 0.2964, respectively). In both Figures 2(c) and 2(d) the curve for one strain traces the curve for the other strain. Both figures have points of overlap and nearly identical shapes, therefore, the genes constituting each of these two pathways behave similarly between the two E. coli strains.


Microarrays are an efficient tool for the identification of gene transcription differences between cells, tissues, and microorganisms [15]. They have been used extensively for studying genotypic causes for phenotypic differences, divergent responses to environmental pressures, and evolutionary trends. A typical array experiment generates a large amount of data that requires statistical methods to perform searches to detect and quantify differences between gene expression levels [16, 17]. In most cases, the results of such a search are a list of up-regulated and down-regulated genes, relative to a reference or control gene expression pattern.

E. coli JM109 and E. coli BL21 are two strains commonly used for recombinant protein production. These two strains are different in their response to glucose concentration, especially excess glucose. E. coli JM109 excretes high levels of acetate when the glucose concentration exceeds a few grams per liter, while E. coli BL21 is insensitive to glucose concentrations and excretes low levels of acetate even the glucose concentration is above 30 grams per liter. When careful study of the central carbon metabolism of these strains was done by enzymatic activity, metabolic flux analysis, and cDNA arrays [11] several differences in the metabolic pathways were identified. The cDNA array analysis enabled us to determine in which strain and under what growth conditions would the transcription of a particular gene be higher or lower. However, it was not possible to directly compare and evaluate the expression levels of groups of genes constituting specific pathways such as glycolysis, TCA cycle, or the glyoxylate shunt. This comparison was needed for a comprehensive picture of the strains’ metabolic behavior, and was possible to perform by using a semi-parametric algorithm. Results generated from the implementation of this algorithm indicated that the transcription of the glycolysis and the pentose phosphate pathways were comparable in the two strains, however, clear differences were identified in the transcription of the TCA cycle, glyoxylate shunt, fatty acid pathways and to a lesser extent in the gluconegensis and the anaplerotic pathway. Differences in the growth rate and the metabolic patterns, including acetate formation between the two E.coli strains, can therefore be attributed to the combined effects of the glyoxylate shunt, TCA cycle, gluconeogenesis, and fatty acid pathways. In fact, this difference in gene transcription patterns is very likely the reason for efficient utilization of glucose through both the TCA cycle and the glyoxylate shunt, as well as assimilation of acetate via glucoenogensis and fatty acid biosynthesis in the E. coli BL21. The work presented in this text therefore supports previous information demonstrating that the glyoxylate shunt enzymes were more active in E. coli B than in K and that certain TCA, gluconeogensis and anaplerotic, and fatty acid metabolism genes are transcribed differently between the two strains. However, it was not possible to demonstrate that the transcription patterns of an entire pathway were different.

Although other hypothesis testing methods are available, the semi-parametric algorithm was chosen because it has been shown to be robust, relatively insensitive to outliers, and readily capable of analyzing an assortment of number sets, irrespective of their origin [18]. The work presented focused on using results from hybridized oligonucleotide arrays as input for the semi-parametric algorithm, but the overall process presented here is applicable to cDNA arrays as well. This, however, will require a universal control sample to be used with each dual-channel cDNA slide to allow cross-comparisons, a limitation not encountered with single-channel oligonucleotide arrays since only one sample is hybridized per slide instead of two samples (test and control).

The results of this study demonstrate the benefit of the semi-parametric algorithm to validate and expanded upon information obtained from genomic microarrays. The purpose of this study was to present a comprehensive approach involving the implementation of a statistical method to microarray data in order to validate and expand upon previous results. As such, the method presented offers researchers another way to decipher microarray data and explore genomic differences in the context of entire biological pathways.

Materials and Methods

1. Statistical formulation

The semi-parametric method used in this work generates both numerical values (p-values) and graphical illustrations to highlight distinctions between genes or groups of genes [12, 19, 20, 21]. Suppose that a set of gene-specific intensity values labeled x11, x12, …x1m are distributed such that a probability density function g1(x) can describe the distribution of these m numbers. Another set of gene-specific intensity values labeled x21, x22, …x2n are distributed such that a probably density function g(x) can describe the distribution of these n numbers [22]. As part of this mathematical construct, m and n may be different, however, for some h(x) the following equation is assumed:

Eq. (1)

where α1 and β1 are unknown parameters that can be estimated from the data sets (i.e. x11, x12, …x1m, x21, x22, …x2n), and h(x) is a function that must be specified. This equation expresses g(x) as a baseline or reference density while calculating the deviation or ‘tilt’ associated with g1(x) in terms of the reference density. In other words, Eq. 1 illustrates the mathematical relationship between g(x) and g1(x), the reference density function and the deviation density function, respectively.

This set-up allows for testing the null hypothesis of equal distribution; that is g1(x) = g(x) and (H0) β1=0. Incorporating the idea of a null hypothesis allows insight into subsequent analysis [19]. Accepting the null hypothesis, β1 = 0, signifies g1(x) and g(x) are distributed equally. If the null hypothesis is rejected, then g1(x) ≠ g(x); hence there is a difference between the distributions of these 2 data sets [20]. In order to test the hypothesis, a test statistic, χ1, is designated. It is asymptotically distributed as χ12 with one degree of freedom and adheres to the following equation:

Eq. (2)

where ρ1 = m/n, Var(h(t)) is the estimate of the variance of h(t) with respect to the reference distribution g(x), and [beta]1 is the estimate of β1 [18]. For our microarray data sets, m = n = the number of genes in a particular pathway times the number of arrays spotted with each gene, 3 this case because that was how many arrays were hybridized per E. coli strain [22].

The semi-parametric algorithm makes no assumptions regarding normal distributions for either g(x) or g1(x). The only assumption made is for h(x). The choice of h(x) = x is quite satisfactory for symmetric or nearly symmetric probability distributions whereas h(x) = log x is adequate for skewed distributions. In our analysis we utilized h(x) = x [21].

The algorithm quantifies the level of similarity between g(x) and g1(x) by numerical and graphical means. The numerical approach calculates the p-values resulting from the hypothesis test, discussed earlier. The graphical approach (seen in Figure 2) is an integral part of the analysis and not simply an illustration. The graphical approach: highlights the differences between the two E.coli strains for a specific biological pathway, indirectly correlates p-values with the overall structure and predictability of the density functions, and demonstrates how a series of numbers, in this case the intensity values, can be averaged and combined into a new data set independent of the original source, in this case, the genes. In other words the graphical approach both visualizes and interprets the results generated by the semi-parametric method, as shown in Figure 2. The greater the similarity between the distributions of a given pathway, the closer the plots of the estimated g(x) and g1(x) are one to another; and the higher the corresponding p-value.

2. Bacterial strains

The two E. coli strains studied were BL21(λDE3) (F, ompT, hsdSB (rB−,mB+), dcm, gal, (DE3), Cmr) and JM109(DE3) (endA1, recA1, gyrA96, thi, hsdR17 (rk,mk+), relA1, supE44,Δλ, Δ(lac-proAB), [F', traD36, proAB, lacIqZΔM15], λDE3). Both strains were obtained from Promega Corp. (Madison, WI).

3. Fermentation and sample preparation

Both strains were grown at 37ºC in modified LB medium containing 10 g/L tryptone, 5 g/L yeast extract (15 g/L for JM109), 5 g/L NaCl, and 5 g/L K2HPO4. After sterilization, 10 mM MgSO4, 1 mL/L trace metal solution, and 40 g/L glucose were added. Overnight cultures grown at 37ºC were used to inoculate 4.0 L of medium in a B. Braun fermentor equipped with data acquisition and a control system. The cultures were grown to high cell density, the pH was controlled at 7.0 by the addition of 50% NH4OH, and dissolved oxygen was kept above 30% of saturation at all times.

Samples for total RNA purification were collected at the late logarithmic phase of growth, indicated by arrows in Figure 1. Next, the samples were centrifuged at 14,000 g for 10 min at 4°C; the supernatant was removed and the pellets were quickly frozen with dry ice and stored at −80°C.

4. Total RNA preparation

Total RNA was isolated using a MasterPure RNA Purification Kit (Epicentre Technologies, Madison WI) according to the manufacturer’s protocol (Kit MCR 85102). Isolated RNA was further purified with an RNAeasy Kit 75144 (Qiagen). Overall RNA concentration was determined by measuring absorbance at 260 nm (A260) using a GeneQuant Pro (Amersham Biosciences). Purified RNA samples were determined to have absorbance ratios (A260/A280) of 1.85–1.95 and by running 1% agarose/formadehyde denaturing gel. To further ensure equivalency between individual samples, the 23S and 16S ribosomal RNA (rRNA) from each sample were analyzed by an Agilent 2100 Bioanalyzer (Agilent Technologies). The intensity of each band was calculated and the rRNA ratio (23S/16S) for each sample was calculated to be greater than 1.5.

5. Oligonucleotide microarrays

Standard methods available from Affymetrix (Santa Clara, CA) for cDNA synthesis, fragmentation, and end-terminus biotin labeling starting with a total RNA (10 μg) sample were used. The biotin-labeled cDNA was hybridized to E.coli Affymetrix Antisense Genome Arrays at 45 °C for 16 hours as recommended in the GeneChip technical manual (Affymetrix). Hybridized arrays were stained with streptavidin-phycoerythrin using an Affymetrix Fluidic Station. The GeneChips were scanned using an Affymetrix/Hewlett–Packard GeneArray GC2500 Scanner. The signal intensity was normalized using Affymetrix Microarray Suite Software (version 4.0).


Funding was provided by the National Institute of Diabetes & Digestive & Kidney Diseases (NIDDK), National Institutes of Health (NIH).


Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.


1. Murphy D. Gene expression studies using microarrays: principles, problems, and prospects. Adv Physiol Educ. 2002;26:256–270. [PubMed]
2. Tamayo P, et al. Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci USA. 1999;96:2907–2912. [PubMed]
3. Burke HB. Discovering patterns in microarray data. Mol Diagn. 2000;5:349–357. [PubMed]
4. Dopazo J, Zanders E, Dragoni I, Amphlett G, Falciani F. Methods and approaches in the analysis of gene expression data. J Immunol Methods. 2001;250:93–112. [PubMed]
5. Saeed AI, et al. TM4: a free, open-source system for microarray data management and analysis. Biotechniques. 2003;34:374–378. [PubMed]
6. Yang HH, Hu Y, Buetow KH, Lee MP. A computational approach to measuring coherence of gene expression in pathways. Genomics. 2004;84:211–217. [PubMed]
7. Shiloach J, Kaufman J, Guillard AS, Fass R. Effect of glucose supply strategy on acetate accumulation, growth, and recombinant protein production by Escherichia coli BL21 (lDE3) and Escherichia coli JM109. Biotechnol Bioeng. 1996;49:421–428. [PubMed]
8. Van de Walle M, Shiloach J. Proposed mechanism of acetate accumulation in two recombinant Escherichia coli strains during high density fermentation. Biotechnol Bioeng. 1998;57:71–78. [PubMed]
9. Noronha SB, Yeh HJ, Spande TF, Shiloach J. Investigation of the TCA cycle and the glyoxylate shunt in Escherichia coli BL21 and JM109 using 13C-NMR/MS. Biotechnol Bioeng. 2000;68:316–327. [PubMed]
10. Phue J, Shiloach J. Transcription levels of key metabolic genes are the cause for different glucose utilization pathways in E. coli B (BL21) and E. coli K (JM109) J Biotechnol. 2004;109:21–30. [PubMed]
11. Phue J, Noronha SB, Hattacharyya R, Wolfe AJ, Shiloach J. Glucose metabolism at high density growth of E. coli B and E. coli K: differences in metabolic pathways are responsible for efficient glucose utilization in E. coli B as determined by microarrays and Northern blot analyses. Biotechnol Bioeng. 2005;90:805–820. [PubMed]
12. Qi Y. Classification of Microarray Data, Department of Mathematics. College Park, MD: University of Maryland; 2002.
13. Conway T, Schoolnik GK. Microarray expression profiling: capturing a genome. Mol Microbiol. 2003;47:879–889. [PubMed]
14. Ideker T, Thorsson V, Siegel AF, Hood LE. Testing for differentially-expressed genes by maximum-likelihood analysis of microarray data. J Comput Biol. 2000;7:805–817. [PubMed]
15. Mantripragada KK, Buckley PG, Diaz de Stahl T, Dumanski JP. Genomic microarrays in the spotlight. Trends Genet. 2004;20:87–93. [PubMed]
16. Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA. 1998;95:14863–14868. [PubMed]
17. Quackenbush J. Computational analysis of microarray data. Nat Rev Genet. 2001;2:418–427. [PubMed]
18. Fokianos K, Kedem B, Qin J, Short D. A semiparametric approach to the one way layout. Technometrics. 2001;43:56–65.
19. Qin J, Zhang B. A goodness of fit test for the logistic regression model based on case-control data. Biometrika. 1997;84:609–618.
20. Gilbert PB, Lele SR, Vardi Y. Maximum likelihood estimation in semiparametric selection bias models with application to AIDS vaccine trials. Biometrika. 1999;86:27–43.
21. Kedem B, Wolff DB, Fokianos K. Statistical Comparison of Algorithms. IEEE Trans Instrum Meas. 2004;53:770–776.
22. Gagnon R. Certain Computational Aspects of Power Efficiency and State Space Models, Department of Mathematics. College Park, MD: University of Maryland; 2005.