Quantitative PCR usually relies on the comparison of distinct samples, for instance the comparison of a biological sample with a standard curve of known initial concentration when absolute quantification is required [16], or the comparison of the expression of a gene to an internal standard when relative expression is needed. The equation shown in Figure is used to calculate the ratio of initial target DNA between the two samples (Eq. 2). The error on the normalized ratio depends on the error on the *Ct* and the error on the efficiency, and it can be estimated from Eq. 11. However, the range and relative importance of the various error components, and the origin of the error in practical measurements, remain poorly characterized.
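Eq. 2 and Eq. 11 are not reproduced in this excerpt. The sketch below assumes the common Pfaffl-type form of the normalized ratio and first-order error propagation; the function names and the exact form of both equations are assumptions, not the paper's verbatim formulas.

```python
import math

def normalized_ratio(e_target, dct_target, e_ref, dct_ref):
    """Assumed Pfaffl-type form of Eq. 2:
    R = E_target^dCt_target / E_ref^dCt_ref,
    where dCt = Ct(control) - Ct(sample) for each gene."""
    return e_target ** dct_target / e_ref ** dct_ref

def ratio_sd(r, e_target, dct_target, sd_e_t, sd_ct_t,
             e_ref, dct_ref, sd_e_r, sd_ct_r):
    """Assumed first-order error propagation on R (in the spirit of Eq. 11):
    (SD_R / R)^2 = (dCt * SD_E / E)^2 + (ln(E) * SD_Ct)^2,
    summed over the target and reference genes."""
    rel2 = ((dct_target * sd_e_t / e_target) ** 2
            + (math.log(e_target) * sd_ct_t) ** 2
            + (dct_ref * sd_e_r / e_ref) ** 2
            + (math.log(e_ref) * sd_ct_r) ** 2)
    return r * math.sqrt(rel2)
```

Because the efficiency enters as the base of an exponential, even a small error on E is amplified by the magnitude of the Ct difference, which is why the error terms above scale with dCt.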

To evaluate the reproducibility of *Ct* measurements and their associated error, we generated a set of 144 PCR reaction conditions corresponding to various target DNAs, cDNA samples and dilutions (see Additional file 1 for a description of targeted genes and amplicons). Each of these 144 reaction conditions was replicated by performing 4 or 5 independent PCR amplifications. This yielded a complete dataset of 704 amplification reactions, the raw data of which are given in Additional file 2. Individual *Ct* values corresponding to each reaction condition were averaged, providing a set of 144 *Ct* values and their associated errors. The standard deviation (SD) shows an increase of the error with higher *Ct* values, with SD values smaller than 0.2 for *Ct* up to 30 cycles, and spreading over 0.8 for *Ct* higher than 30 (Additional File 3). Thus, all replicates with SD above 0.4 were excluded, which corresponds to some of the reactions with *Ct* above 30 in this study. We conclude that *Ct* values between 15 and 30 can be reproducibly measured, leading to a dynamic range of 10^{5}, which is within the 4 to 8 logs dynamic range reported in other studies [17]. Under these conditions, *Ct* value determination is unlikely to be a major source of error when calculating normalized expression ratios. We therefore focused on the estimation of efficiency.
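The replicate-filtering step described above can be sketched as follows; the function name and return convention are illustrative, but the 0.4-cycle SD cutoff is the one used in this study.

```python
from statistics import mean, stdev

def summarize_ct(replicate_cts, sd_cutoff=0.4):
    """Average replicate Ct values for one reaction condition and flag
    whether the condition passes the SD cutoff (0.4 cycles in this study).
    Returns (mean Ct, SD, keep?)."""
    avg = mean(replicate_cts)
    sd = stdev(replicate_cts)
    return avg, sd, sd <= sd_cutoff

# Low-Ct replicates are tight and kept; high-Ct replicates tend to
# spread beyond the cutoff and are excluded.
avg, sd, keep = summarize_ct([24.1, 24.3, 24.0, 24.2])
```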

Estimation of the efficiency of a PCR reaction

We compared estimates of the efficiency obtained from two distinct methods: the generally used serial dilution method (Figure ) and the alternative LinReg method (Figure ). With our experimental setup, estimation of the efficiency with the serial dilution method requires a set of 24 PCR reactions for a given sample and a given amplicon, using serially diluted template DNA. The efficiency obtained was compared to the average efficiency estimated from each of the reactions with the LinReg method. Efficiency estimates are comparable when looking at the values given in Figures and , but they differed when comparing the efficiencies obtained from one of the four DNA samples (Figure ). Thus, we asked whether the two methods provide statistically similar measures of efficiency, and whether they display similar reproducibility.
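The two estimators can be sketched as follows, assuming the textbook form of each: the serial dilution method derives E from the slope of a standard curve of Ct against log dilution, while a LinReg-style estimate fits the log of fluorescence against cycle number within the exponential phase of a single reaction (Eq. 3). Function names are illustrative.

```python
import numpy as np

def efficiency_serial_dilution(log10_dilutions, cts):
    """Standard-curve estimate: Ct = a - log10(dilution) / log10(E),
    so the fitted slope gives E = 10^(-1/slope)."""
    slope, _intercept = np.polyfit(log10_dilutions, cts, 1)
    return 10 ** (-1.0 / slope)

def efficiency_linreg(cycles, fluorescence):
    """LinReg-style estimate from a single reaction: in the exponential
    phase, log10(F_n) = log10(F_0) + n * log10(E) (Eq. 3), so the fitted
    slope gives E = 10^slope."""
    slope, _intercept = np.polyfit(cycles, np.log10(fluorescence), 1)
    return 10 ** slope
```

The practical difference is cost: the standard curve needs a full dilution series per sample and amplicon, whereas the LinReg estimate is obtained from every individual amplification curve.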

The statistical equivalence of the LinReg and serial dilution methods was assessed using an analysis of variance (ANOVA, Table ), which indicated that the efficiency averages are not significantly different between the two methods, except for one amplicon, corresponding to the Connective Tissue Growth Factor (CTGF) cDNA (*p* < 0.05). This may be linked to the fact that this gene is expressed at very low levels, and to the reduced size of the dataset, as some data had to be discarded because the signal was undetectable (*Ct* ≥ 40). Also, some estimates were taken from PCR reactions displaying *Ct* values in the 35–40 range. Thus, the statistically significant difference between the two methods likely results from the smaller dataset and/or the use of reactions with *Ct* values outside of the optimal range. The reproducibility of the LinReg method appears to be overall higher than that of the serial dilution method (Figure ). An F-test performed on the averaged variance of each method indicated that, for each primer set, the difference between the variances of the serial dilution and LinReg methods is highly significant, with *p*-values well below 0.001 (Table ).

| **Table 1**Comparison between serial dilution and LinReg for the measurement of efficiency |

Overall, we conclude that the two methods display comparable accuracy in measuring the efficiency values of a set of reactions. Statistically, this implies that both methods provide acceptable estimators of the efficiency. However, LinReg appears to be more robust, as lower variances were obtained. Furthermore, LinReg can be mathematically justified when the PCR amplification is in the exponential phase (see Additional File 4).

Experimental parameters influencing efficiency determination

Next, we wished to determine which of the experimental variables may affect the precision of the estimation of efficiency. This was evaluated on the complete set of quantitative PCR reactions. Figure shows the distribution of the efficiencies measured for all reactions. Efficiencies ranged from 1.4 to 2.15, with a peak value around 1.85. Theoretically, efficiencies can only take values between 1 and 2, and they are therefore expected to deviate from a normal distribution, as indicated by a Kolmogorov-Smirnov test (not shown). However, the distribution appears sufficiently symmetrical to be considered normal, such that classical statistical tests can be validly performed.

First, we determined whether single PCR parameters (amplicon, cDNA sample, *Ct* value, etc.) may influence the efficiency value by performing a multiple ANOVA test on all values. The first four entries of Table indicate that the efficiency depends most on the amplicon and comparatively less on the cDNA sample, as indicated by high F values, both effects being highly significant (*p* < 0.001). However, efficiency was not found to depend on the *Ct* nor on the dilution, showing that efficiency depends solely on the kinetics of the PCR reaction in the exponential phase and not on the initial conditions (i.e. the amount of initial template). Possible interactions between pairs of parameters affecting the efficiency (co-dependence) were also assessed, showing that amplicon-dependent effects on the efficiency are modulated by the type of sample. Interestingly, while the dilution does not significantly influence the efficiency by itself, it can significantly modulate the effects of the primers and the samples. This is consistent with the presence of inhibitor(s) in samples that would affect PCR reactions at the highest concentrations. For instance, salts or competing genomic DNA would be expected to inhibit the interaction of primers and target DNAs differently. None of the other co-dependences were significant. Since the *Ct* and the dilution are not independent parameters, two ANOVA tests were run excluding either one. The *Ct* did not have a significant effect on the efficiency even when the dilution parameter was not taken into account. On the other hand, the dilution did not significantly affect the efficiency by itself but modulated the effect of the amplicon. Either way, these results clearly indicate that the *Ct* has no direct effect on the efficiency.
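As a simplified stand-in for the multi-factor ANOVA used in the study, a one-way ANOVA per factor can test whether a single parameter influences efficiency. The numbers below are illustrative, not the study's data.

```python
from scipy.stats import f_oneway

# Hypothetical efficiency values grouped by amplicon (illustrative only).
# A large F statistic with a small p-value indicates that the factor
# (here, the amplicon) significantly affects the efficiency.
eff_by_amplicon = {
    "CTGF": [1.78, 1.80, 1.76, 1.79],
    "FN":   [1.90, 1.88, 1.91, 1.89],
    "PAI1": [1.84, 1.86, 1.83, 1.85],
}
f_stat, p_value = f_oneway(*eff_by_amplicon.values())
```

The study's actual analysis is a multiple ANOVA with interaction terms (co-dependence), which a one-way test cannot capture; this sketch only shows the main-effect logic.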

| **Table 2**Parameters influencing the efficiency of qPCR reactions. |

Overall, these results indicate that efficiencies are highly variable among PCR reactions and that the main factor that defines the efficiency of a reaction is the amplicon. This is consistent with the empirical knowledge that primer sequences must be carefully designed in quantitative PCR to avoid non-productive hybridization events that decrease efficiency, such as primer-dimers or non-specific hybridizations. Efficiency might also depend upon the dilution for a minority of the cDNA samples, indicating that dilute samples should be preferred to obtain reliable efficiency values.

DNA quantification models

The models we evaluated in this study fall into two groups, derived from either linear or non-linear fitting methods. Comparison of qPCR data using models based on non-linear fitting methods (Eq. 6 and Eq. 8) is done simply by calculating the ratio of the initial amounts of target DNA of each amplicon (Eq. 7 and Eq. 9), as in the first part of Eq. 2. The standard deviation of the ratio over a pool of replicates is calculated using Eq. 10. Note that, in this case, errors resulting from the non-linear fitting itself are not considered in the analysis.

Linear fitting methods also allow the estimation of the initial level of fluorescence induced by the target DNA. For instance, Eq. 3, upon which the LinReg method relies to determine efficiency, can also be used to determine *F*_{0} as the y-intercept of a linear regression of the log of fluorescence. This figure can then be used to calculate relative DNA levels (Eq. 2). This calculation method was termed LR*N*_{0}.
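The LR*N*_{0} idea can be sketched directly from Eq. 3: on noiseless exponential-phase data, the intercept of the log-linear fit recovers *F*_{0} and the slope recovers E. The function name is illustrative.

```python
import numpy as np

def lrn0_f0(cycles, fluorescence):
    """LRN0 sketch: fit log10(F_n) = log10(F_0) + n * log10(E) (Eq. 3)
    over exponential-phase points; the intercept gives F_0 and the
    slope gives the efficiency E. Returns (F0, E)."""
    slope, intercept = np.polyfit(cycles, np.log10(fluorescence), 1)
    return 10 ** intercept, 10 ** slope
```

The ratio of two such *F*_{0} values then gives the relative initial DNA level of the two samples, as in Eq. 2.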

However, even small errors in the determination of the efficiency will lead to a great dispersion of *N*_{0} values, due to the exponential nature of PCR (Eq. 2). Therefore, we considered alternative calculation strategies, whereby the efficiency is averaged over several reactions rather than taken from individual values, which should provide more robust and statistically more coherent estimates. We thus evaluated the use of efficiency values calculated in three different manners.

As the amplicon sequence is the main contributor to the efficiency, we used the efficiency averaged over all cDNA samples, dilutions and replicates of a given amplicon, as a more accurate estimator of the real efficiency than individual values. The error on the efficiency is no longer considered in the calculations of relative DNA concentrations, thus assuming that the estimator is sufficiently precise so that errors become negligible. This model is termed below (*PavrgE*)^{Ct}.

Alternatively, the small influence of the sample upon the efficiency was taken into account by averaging the efficiencies obtained over all dilutions and replicates of a given cDNA sample and a given amplicon. Thus, for a given cDNA sample and amplicon, one efficiency value is obtained from 24 PCR reactions. This value is used in further calculations, again assuming the average to be a sufficiently good estimator of the efficiency so that the relative error need not be taken into account. This model was named (*SavrgE*)^{Ct}.

Finally, we tested a model in which the efficiency is estimated individually for each set of replicated reactions. This was addressed by averaging the efficiencies of the replicates of a given amplicon, cDNA sample and dilution. This model is referred to below as *E*^{Ct}. These three models are summarized in Table .
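The three averaging models share one back-calculation step, sketched below under the common assumption that the initial amount is proportional to F_threshold / E^Ct; only the pool over which E is averaged differs. The function name and the proportionality constant are assumptions.

```python
def n0_from_ct(ct, efficiency, f_threshold=1.0):
    """Assumed back-calculation of the initial amount:
    N0 is proportional to F_threshold / E^Ct.
    Passing a per-amplicon average efficiency gives (PavrgE)^Ct,
    a per-sample-and-amplicon average gives (SavrgE)^Ct, and a
    per-replicate-set average gives E^Ct."""
    return f_threshold / efficiency ** ct

# Relative level of two samples of the same amplicon: the unknown
# threshold constant cancels in the ratio.
ratio = n0_from_ct(20.0, 1.9) / n0_from_ct(23.0, 1.9)
```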

| **Table 3**Models for the use of single reaction efficiencies |

Evaluation of the quantitative PCR calculation models

The dilutions of a given sample form a coherent set of data, with known concentration relationships between dilutions. Each calculation model was therefore applied to each dilution series, using the undiluted sample for normalization. All data can be presented as measured relative concentrations, the undiluted sample taking the relative concentration value of 1, the 10-fold dilution the value of 0.1, the 50-fold dilution the value of 0.02, and so on. The measured relative concentrations for all dilutions, samples and primers, and the associated errors, were calculated with each model from the complete dataset of 704 reactions. For the models giving direct access to the initial *N*_{0} values, the *N*_{0} values were averaged for each amplicon and cDNA sample, and plotted against the expected concentrations relative to the undiluted samples (Figure ). The models were evaluated on three criteria: resolution, precision and robustness.

We defined the resolution as the ability of a model to discriminate between two dilutions. Relative concentrations were compared pair-wise between adjacent dilutions. Typically, it can be seen in Figure that models did not give uniformly coherent results. For instance, models that do not rely on explicit efficiency values, such as the sigmoid or exponential models, are unable to discriminate between the 0.1 and 0.02 relative concentrations, which shows a lack of resolution in this range of dilutions. The Δ*Ct*, (*PavrgE*)^{Ct }and (*SavrgE*)^{Ct }models performed well under this criterion, allowing easy discrimination of the 10-fold and 50-fold dilutions in this example.

The resolution was statistically evaluated with a coupled ANOVA-LSD t-test, a two-step procedure in which an analysis of variance (ANOVA) is coupled to a t-test run under the *Least Significant Difference* (LSD) method [18]. As expected, the ANOVA test indicated that, for all models, at least one of the measured concentrations differed significantly from the others (data not shown). To further assess whether all measured concentrations significantly differ from one another, or whether some are indistinguishable, a coupled t-test was performed on pairs of adjacent dilutions in a given serial dilution series. Results are summarized in Table . All models were able to discriminate the undiluted condition from the 10-fold dilution (highly significant, p < 0.01). The sigmoid and exponential models did not discriminate further dilutions. The Δ*Ct*, *E*^{Ct} and LR*N*_{0} p-values indicate that these models could discriminate the 10-fold from the 50-fold dilution, but not further dilutions. The (*SavrgE*)^{Ct} and (*PavrgE*)^{Ct} models were able to discriminate the 10-fold from the 50-fold dilution and were at the limit of significance when comparing the 50-fold with the 100-fold dilutions (significant, p < 0.1). Finally, none of the models were able to discriminate the 100-fold from the 1000-fold dilution. These comparisons indicated that the (*SavrgE*)^{Ct} and (*PavrgE*)^{Ct} models performed equally well in this assay, followed by Δ*Ct*, *E*^{Ct} and LR*N*_{0}, while the sigmoid and exponential models were of low resolution. These results also illustrate that more dilute samples are generally more difficult to discriminate, as expected from the finding that variance increases with higher *Ct* values (Additional File 3).
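The pairwise step of the ANOVA-LSD procedure can be sketched as a Fisher LSD comparison: after a significant ANOVA, two group means are compared with a t-test that uses the pooled within-group mean square error (MSE) from the ANOVA. The function signature is an illustrative assumption, with the MSE and its degrees of freedom taken as given.

```python
import math
from scipy import stats

def lsd_ttest(group_a, group_b, mse, df_within):
    """Fisher LSD pairwise comparison sketch: test the difference of two
    group means using the pooled within-group MSE from a prior ANOVA."""
    mean_a = sum(group_a) / len(group_a)
    mean_b = sum(group_b) / len(group_b)
    se = math.sqrt(mse * (1.0 / len(group_a) + 1.0 / len(group_b)))
    t = (mean_a - mean_b) / se
    p = 2.0 * stats.t.sf(abs(t), df_within)
    return t, p
```

Using the pooled MSE rather than the two groups' own variances is what distinguishes the LSD test from an ordinary two-sample t-test and makes it sensitive enough to separate adjacent dilutions.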

| **Table 4**Resolution of each calculation model |

The precision of a model is defined by its ability to provide the expected relative concentrations of the known dilutions. Again, Figure shows that the (*PavrgE*)^{Ct} and (*SavrgE*)^{Ct} models provide precise relative concentration values over all dilutions, with the measured relative concentrations matching the expected ones. Estimations obtained by the Δ*Ct* model appear to be less reliable, with a systematic under-estimation of concentrations. This result is expected, since all of our amplicons have efficiencies below 2 (see Additional File 5).

We statistically evaluated the precision of each model by plotting the expected relative concentration against the measured relative concentration averaged over all primers and samples (Additional File 3). A linear regression was performed on the data obtained from each model, and a t-test was used to determine whether the slope is statistically different from 1. A low p-value in Table is associated with a high probability that the slope differs from 1, indicative of a poor correlation between expected and measured values. As before, the (*PavrgE*)^{Ct} and (*SavrgE*)^{Ct} models outperformed all other models, including the sigmoid model, with the exponential, Δ*Ct* and LR*N*_{0} models displaying the lowest precision.
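The slope test described above can be sketched as follows: regress measured on expected relative concentrations, then test the null hypothesis that the slope equals 1 using the slope's standard error. The function name is illustrative.

```python
from scipy import stats

def slope_differs_from_one(expected, measured):
    """Precision check sketch: fit measured = slope * expected + b and
    test H0: slope = 1 with a t statistic of (slope - 1) / SE(slope)
    on n - 2 degrees of freedom. Returns (slope, p-value)."""
    res = stats.linregress(expected, measured)
    n = len(expected)
    t = (res.slope - 1.0) / res.stderr
    p = 2.0 * stats.t.sf(abs(t), n - 2)
    return res.slope, p
```

A large p-value here means the data are consistent with a slope of 1, i.e. the model reproduces the expected dilutions without systematic bias.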

| **Table 5**Precision of each calculation model |

Finally, the robustness relates to the variability of the results obtained from a given model, and indicates whether trustworthy results may be obtained from a small collection of data. For instance, a model could be very precise (e.g. providing a slope of 1) with a large data set, while the distribution of the points around the regression line remains very dispersed. Such a model would not be robust, as a small data set would not allow precise measurements. Thus, the robustness of a model was estimated from the standard deviation of the slope and the related correlation coefficient of the linear regression (*r*^{2}), with higher *r*^{2} values indicating more robust models. Three models showed high robustness, namely Δ*Ct*, (*PavrgE*)^{Ct} and (*SavrgE*)^{Ct}, followed by *E*^{Ct} (Table ). Overall, only two calculation models combine high resolution, precision and robustness, namely the (*PavrgE*)^{Ct} and (*SavrgE*)^{Ct} methods. However, only the slope of the (*SavrgE*)^{Ct} model did not statistically differ from 1.

Model evaluation on a biological assay of gene expression regulation

Usually, experimenters are interested in the difference between two conditions (with versus without a drug, healthy versus metastatic tissue, etc.) [19-21], for instance to determine whether the expression of the gene of interest is induced or repressed upon treatment or between samples. The useful figure is therefore the normalized induction ratio (Eq. 13). We set out to apply the most promising approaches to samples of biological interest. NIH-3T3 fibroblastic cells were incubated with the TGF-β growth factor for 4 hours, as it is known to induce the expression of a number of extracellular matrix protein genes. For this experiment, the CTGF, FN and PAI-1 genes were chosen, as they were shown to be induced at various levels by the growth factor in fibroblasts [22,23]. The total mRNAs of three independent biological samples from the induced and the non-induced conditions were mixed and processed as before. The expression levels of these genes were normalized to that of the ribosomal L27 protein gene, used as an invariant mRNA, so as to correct for differences in mRNA recovery or reverse transcription yield. Following the results of the previous section, only the (*SavrgE*)^{Ct}, (*PavrgE*)^{Ct} and Δ*Ct* methods were used.

Ten replicate PCR reactions were performed for each condition (induced or non-induced), and the normalized expression values obtained with (*SavrgE*)^{Ct}, (*PavrgE*)^{Ct} or Δ*Ct* are shown in Figure . Fibronectin is expressed at high levels but is only moderately induced by the growth factor (Additional File 1), while PAI-1 and CTGF have much lower expression levels but higher induction ratios (Figure , top panels). The three methods yielded consistent results overall. However, the low induction ratio of the fibronectin gene was statistically significant with (*SavrgE*)^{Ct} (p < 0.05) but not with (*PavrgE*)^{Ct} (p = 0.38) or Δ*Ct* (p = 0.39).

To assess whether the relative performance of the three models depends critically on the number of replicate assays, the analysis was repeated, but taking into account only the first three values obtained from the set of 10 replicates. Similar results were obtained (Figure , bottom panels), and the small induction of the expression of the FN gene was again only detected using the (*SavrgE*)^{Ct }model. Thus, small differences in gene expression are also more reliably estimated from this model with a low number of replicates commensurate with usual experimental procedures.

Dataset size required to achieve statistical significance

In the above example, independent biological samples were mixed so as to decrease the variability associated with cell culture and mRNA isolation. Therefore, this study provides the statistical significance that may be expected from the intra-assay variability of the qPCR process alone. However, statistical significance will also depend on the inter-assay, or biological, variability. To assess the statistical significance associated with particular conclusions on gene expression regulation, replicates of induction experiments are usually generated; in most experimental studies, the number of biological replicates is low, typically 3–6 independent biological samples.

Thus, we wished to determine how many biological replicates may be necessary to obtain statistically reliable results, depending upon the variability of the assay (Eq. 15). Using the data from the 10 replicates to estimate the intra-assay variability, we found that the standard deviation is proportional to the induction ratio value (Additional File 1). This is shown by coefficient of variation (CV) values that remain just below 15% for all induction ratios, irrespective of the calculation method. Using the set of three replicate assays resulted in more variable but comparable CV values, around or below 15%, which is in agreement with other published data [17]. However, inter-experiment biological variability is specific to each experimental system. The true variability of the PCR assay (intra-assay plus biological inter-assay variability) is higher, typically with overall CV values ranging from around 30% to 50% (our unpublished results and [24]). Another parameter influencing the number of replicates needed to assess statistical reliability is the domain (range) of confidence of the measure. This value is defined as the largest acceptable error on the measure and is set arbitrarily by the experimenter. Thus, setting a domain of 20% indicates that the estimated induction should fall within 20% of the real value. It follows that the larger the domain of confidence, the lower the number of replicates needed.

Table provides the number of independent measurements required to achieve a statistically significant measurement of any given induction ratio. Taking the minimal theoretical CV of 15%, obtaining an induction ratio within a domain of 10% of the real value would require 9 independent induction measurements. If one accepts a range of 30%, only one value is expected to be required. However, considering a biological variation with a CV value of 50%, and setting a 10% range of confidence, 97 independent measurements would be needed to achieve statistically valid conclusions, which is impractical in most cases. One therefore needs to accept a range of confidence of 50% to bring this value down to a feasible experimental set of 4 independent biological samples. Thus, with such settings, qPCR may reasonably be used to detect induction ratios around 2-fold or higher.
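Eq. 15 is not reproduced in this excerpt, but the replicate counts quoted above (9, 97 and 4) are consistent with the standard normal-approximation sample-size formula, sketched here under that assumption.

```python
import math

def replicates_needed(cv, domain, z=1.96):
    """Assumed form of Eq. 15: n = ceil((z * CV / D)^2), where CV is the
    assay coefficient of variation, D the accepted confidence range
    (both as fractions), and z the normal quantile (1.96 for 95%
    confidence)."""
    return math.ceil((z * cv / domain) ** 2)

# Intra-assay variability only (CV = 15%), 10% confidence range:
n_intra = replicates_needed(0.15, 0.10)   # 9 measurements
# Full biological variability (CV = 50%), 10% confidence range:
n_bio = replicates_needed(0.50, 0.10)     # 97 measurements
# Same CV, relaxed 50% range:
n_relaxed = replicates_needed(0.50, 0.50) # 4 measurements
```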

| **Table 6**Number of measurement replicates needed to reach statistical significance |