Normalization and quality of fit
The normalized data range in value from −1.07 to 0.63, with one high outlier at a value of 1.35. Figure 3 illustrates how the normalization removes the general trend across the plate for the same data used in . The raw luminescence data (3-A) vary over a range of nearly twofold, with a very strong row effect. After normalization (3-B), the row effect has been almost completely removed. A measure of the variability of the data is the ratio of the fitted value σ to the mean of the raw data on the control plate. That ratio ranges from 0.028 to 0.0574 over all 15 assays, with a mean of 0.0389. The normalization by removal of row and column effects greatly reduces the variability of the data: before normalization, the ratio of the standard deviation of the raw data on the control plate to its mean ranges from 0.043 to 0.143, with a mean of 0.081 over all assays. shows the raw and fitted values for columns 5 to 48 on the control plates. The fitted values for the control plates, with zero concentration of the study chemicals, are given by equation (4). For most points, the fit is good. Examination of the residuals by plate location (not shown) reveals that the residuals mostly show little to no trend with respect to row or column. However, there is a tendency for the absolute value of the residuals to be higher near the edges of the plate. This could be due to higher measurement error at the edges or to non-error effects that are not fully modeled by equation (4).
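As a rough illustration of why dividing out row and column effects reduces the relative variability on a control plate, the following Python sketch may help. It is not equation (4): it assumes a simplified estimate in which each row and column effect is the corresponding marginal mean divided by the grand mean, and the plate values, the trend sizes, and the names `normalize_plate` and `coefficient_of_variation` are hypothetical.

```python
from statistics import mean, pstdev


def normalize_plate(raw):
    """Divide out multiplicative row and column effects (simplified estimate).

    raw is a list of rows. The result is centered on 0, where 0 means
    "equal to the fitted no-effect value" for that well.
    """
    grand = mean(v for row in raw for v in row)
    row_eff = [mean(row) / grand for row in raw]        # one effect per row
    col_eff = [mean(col) / grand for col in zip(*raw)]  # one effect per column
    return [
        [raw[i][j] / (grand * row_eff[i] * col_eff[j]) - 1.0
         for j in range(len(row))]
        for i, row in enumerate(raw)
    ]


def coefficient_of_variation(values):
    """Standard deviation divided by the mean."""
    values = list(values)
    return pstdev(values) / mean(values)


# Hypothetical control plate: strong row trend, mild column trend, and a
# deterministic +/-3% checkerboard perturbation standing in for noise.
plate = [
    [1000.0 * (1 + 0.2 * i) * (1 + 0.01 * j) * (1 + 0.03 * (-1) ** (i + j))
     for j in range(6)]
    for i in range(4)
]

raw_cv = coefficient_of_variation(v for row in plate for v in row)
normalized = normalize_plate(plate)
residual_sd = pstdev(v for row in normalized for v in row)
```

In this toy plate the spread of the normalized values is close to the size of the injected perturbation, while the raw coefficient of variation is dominated by the row trend. This mirrors, qualitatively, the drop from a mean ratio of 0.081 to 0.0389 reported above.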
The frequency at which fitted values of parameters were at the limits of their ranges from the optimization algorithm depended on the p-cutoff used in the parameter reduction step, decreasing as the cutoff became stricter. The parameter values most likely to reach their limits were v and n, which reached their upper limits 7.5% and 6.6% of the time, respectively. There is more discussion in the supplemental material.
The significance of cytotoxic effects was evaluated using two statistical tests. In the first, all of the experimental concentrations were used; the second removed the highest concentration when trying to determine whether there was a significant concentration-response. Doing the test without the high-concentration data point helps to show whether a response is significant only because of a single data point. shows the p-values for the two tests (p15 and p14 for the 15 and 14 point tests respectively). As the modeled response at the highest concentration drops (−1 is complete loss of viability), the chances of seeing a significant response increase until finally every evaluation is positive for both 14 and 15 concentrations. Some of the results are significant when all concentration points are taken into account but not when the highest concentration is excluded (blue points). Over all 15 assays (including the triplicate HepG2 assays), 14% of concentration-response curves have simultaneously significant (with positive response) values of both p15 and p14; 6% have significant values for p15 but not p14; and 80.0% have nonsignificant values of both p15 and p14. When the value of p15 is below the significance level but p14 is not, the significance of the result depends on the one data point at the highest concentration; that data point has a small but real chance of being an outlier. If it is not, the p values suggest that response data at concentration levels between the two highest levels may be needed to better characterize the concentration-response curve.
As mentioned above, the likelihood ratio test used here overestimates the significance of the response. A better measure of concentration-response strength might be one similar to that used in Inglese et al. [3]. In this article, curves are classified into three categories: active, inactive, and marginal. An active concentration-response curve is one satisfying these criteria: 1) the response is significant (p<0.05/1408) with or without the high-concentration data point; 2) the response is cytotoxic (i.e., the Hill parameter v>0); 3) the AC50 is below the highest concentration (i.e., the Hill function parameter k<0.092 mM); 4) the normalized response at the highest concentration is less than −0.1 (i.e., there is at least a 10% loss of viability). An inactive curve is one with v≤0 or p>0.05/1408. All other concentration-response curves are marginal. With this classification, 11.4% of all the concentration-response curves are active, 80.0% are inactive, and 8.6% are marginal.
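The classification rules can be written as a short decision function. The Python sketch below is one possible reading of the criteria, under two assumptions: the p-value in the inactive rule is taken to be the all-concentration test (p15), and the function and constant names are hypothetical.

```python
# Thresholds quoted in the text.
ALPHA = 0.05 / 1408          # Bonferroni-style significance level
HIGHEST_CONC_MM = 0.092      # highest tested concentration (mM)
MIN_LOSS = -0.1              # normalized response for a 10% loss of viability


def classify_curve(p15, p14, v, k, response_at_high):
    """Return 'active', 'inactive', or 'marginal' for one fitted curve.

    p15 / p14: p-values with and without the highest concentration.
    v, k: Hill parameters (maximum response and AC50 in mM).
    response_at_high: normalized modeled response at the highest concentration.
    """
    # Inactive: no cytotoxic response, or not significant (assumed: p15).
    if v <= 0 or p15 > ALPHA:
        return "inactive"
    is_active = (
        p15 < ALPHA and p14 < ALPHA      # 1) significant with and without top point
        and v > 0                        # 2) cytotoxic
        and k < HIGHEST_CONC_MM          # 3) AC50 below the highest concentration
        and response_at_high < MIN_LOSS  # 4) at least 10% loss of viability
    )
    return "active" if is_active else "marginal"
```

For example, a curve with p15 = 1e-10 and p14 = 0.01 fails only the first criterion and is therefore marginal: its significance depends on the single highest-concentration point.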
In any testing method, there is a tradeoff between sensitivity (the ability of the test to detect actual effects) and specificity (the ability of the test to detect lack of effect). Changing the criterion by which the test classifies substances as having or lacking an effect will affect both the sensitivity and the specificity. The classification criterion for this method can be adjusted at two stages in the calculation. The p-value cutoff in the parameter reduction step can be changed, or the nominal significance level in the final test of significance can be changed. In this model, the cutoff for the parameter reduction step and the nominal significance level for the final test of significance are both set at p=0.05/1408. Using higher p-value cutoffs at either stage (or omitting the parameter reduction step entirely by setting the p-cutoff to 1) will increase the number of substances classified as having a significant concentration-response. The supplemental material gives more detail on this effect. The main effect of increasing the p-cutoff for the parameter reduction step was to change the classification of substances from “no loss of viability” to “weak response, marginal or inactive”. The number of substances classified as active changed less when the p-cutoff was changed, probably because the criteria for an active classification are strict enough that they include mostly stronger responses, which will be detected under any reasonable analysis method. Under the assumptions of the simulation study, the criteria used here were found to give a very high specificity (0.98) but a much lower sensitivity (0.53). The simulation found that using triplicated study data would give a great improvement in sensitivity (or, if less strict selection criteria were used, an improvement in specificity and a smaller improvement in sensitivity). Details are in the supplemental material.
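The tradeoff described above can be made concrete with a small calculation. The following Python sketch uses entirely hypothetical p-values and truth labels (they are not results from the simulation study) to show how loosening a p-value cutoff raises sensitivity while lowering specificity.

```python
def sensitivity_specificity(p_values, has_effect, cutoff):
    """Sensitivity and specificity when 'p < cutoff' is called an effect."""
    called = [p < cutoff for p in p_values]
    tp = sum(c and e for c, e in zip(called, has_effect))          # true positives
    fn = sum((not c) and e for c, e in zip(called, has_effect))    # missed effects
    tn = sum((not c) and (not e) for c, e in zip(called, has_effect))
    fp = sum(c and (not e) for c, e in zip(called, has_effect))    # false alarms
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical test results: five substances with a true effect, five without.
p_values = [1e-8, 1e-6, 1e-4, 0.01, 0.2, 1e-3, 0.03, 0.4, 0.6, 0.9]
has_effect = [True] * 5 + [False] * 5

strict = sensitivity_specificity(p_values, has_effect, 0.05 / 1408)
loose = sensitivity_specificity(p_values, has_effect, 0.05)
```

With these made-up inputs, the strict cutoff gives sensitivity 0.4 and specificity 1.0, and the loose cutoff gives 0.8 and 0.6. This is the same direction of change as described in the text, although the magnitudes are purely illustrative.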
The plate effects in this model are represented as the product of a row effect and a column effect, with the row and column effects varying by plate. Another possibility for removing the plate effects would be to use the control plates at the beginning and end of the run to generate corrections, as was done, for example, in a previous analysis of these data [11]. One way to do such a correction would be to calculate the plate parameters α and γ for the initial and final plates and then interpolate to find the values for intermediate plates. This approach was also tried, and the model was found to give a significantly worse fit for all but one of the assays, although examination of the values of α and γ from the current analysis shows that there is a rough trend in their values across plates. Nevertheless, fitting the Hill function model to the normalized data used in the previous analysis gives very similar results to those from the algorithm used in this paper. This is discussed further in the supplemental material.
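The interpolation alternative mentioned above could look like the following Python sketch. The text does not state the form of interpolation, so linear interpolation in plate index is an assumption, as are the function name and the example values. The parameters α and γ are treated here simply as lists of numbers.

```python
def interpolate_plate_params(first, last, n_plates):
    """Linearly interpolate a parameter vector across a run of plates.

    first / last: parameter values (e.g., row effects) estimated on the
    initial and final control plates. Returns one vector per plate,
    including the two end plates.
    """
    if n_plates < 2:
        raise ValueError("need at least the initial and final plates")
    vectors = []
    for plate in range(n_plates):
        w = plate / (n_plates - 1)  # 0 at the first plate, 1 at the last
        vectors.append([(1 - w) * a + w * b for a, b in zip(first, last)])
    return vectors


# Hypothetical row effects on the first and last control plates.
alpha_by_plate = interpolate_plate_params([1.0, 2.0], [3.0, 2.0], n_plates=3)
```

Here the middle plate receives the average of the two end plates. The poorer fit reported above suggests that the actual plate-to-plate variation is not well described by such a smooth trend.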
Duplicate substances
Of the 1353 substances tested, 55 were tested in duplicate. Comparing the data and the model results for the duplicated substances provides an indication of the reproducibility of the test results. compares the two normalized data values from a pair of duplicates. 91% of the duplicate values differ by less than 0.1, but there are some values for which the two duplicates are very different. Examination of individual concentration-response profiles showed varying patterns of difference between duplicates. In many cases, the responses at the highest concentration are roughly similar for the two duplicates, both being near full loss of viability, while the responses at lower concentrations vary. Many of the duplicate compounds were drawn from two lots from different suppliers, which may be reflected in the results. It is difficult to tell from the current data what the sources of variability in the duplicated measurements are. The current model allows only two sources of variability: plate location effects and normally distributed noise. It would be possible to use a more complicated error model. For example, the random error could be dependent on the slope of the concentration-response curve, which would correspond to uncertainty in the concentration of the administered substances. However, the optimization for such a model would be considerably more difficult.
A major factor describing the potency of a compound in these assays, when the Hill model is used for analysis, is the AC50 (i.e., k). Concentration-response curves with identical maximum responses (v) can have very different AC50 values if their responses at low concentrations are different. In general there is considerable variation in the values for k between duplicates (). Fortunately, much of this disagreement is for compounds with non-significant concentration-response curves, since compounds for which both duplicates show significant concentration-response have more consistent values of k (red data points in ). In those cases, the k values seem to match more closely for compounds with low k values, possibly because of the closer spacing of exposures in the low-concentration part of the exposure range. Table 1 shows Pearson correlation coefficients for several variables between duplicates for the duplicate pairs for which both members of the pair were active. Of the 825 (55×15) duplicate pairs, 612 had neither duplicate significant and positive, 67 had only one significant, and 146 had both significant, so 92% of the pairs agreed in significance between duplicates. Using the activity classification, 677/825 had neither duplicate active, 103 had both active, and 45 had only one active, so 94.5% agreed in classification of active vs. marginal or inactive.
| Table 1. Pearson correlation coefficients between parameters in sets of duplicates for all duplicate pairs in which both duplicates are active (103 pairs) or in which both duplicates are active and have at least 50% loss of viability (68 pairs) |
When the concentration-response data are nearly linear, there is considerable dependence among the parameters, because the noise and measurement error in the data produce uncertainty as to the true shape of the concentration-response curve. This results in uncertainty in the value of the parameters, which may partly explain why the correlation between high-concentration responses is greater than that between actual parameters in Table 1 when looking at pairs with both substances active. When the criterion for choosing the pairs is stricter, requiring both activity and a >50% response, the correlations for the variables are more similar. These results resemble those seen in a simulation study (see the supplemental material). A concordance was computed to compare values of the AC50 parameter k between duplicates. Values of k were compared either for the duplicate pairs with at least one active or for pairs with both active. The parameter values k were considered to match if the ratio of the larger k to the smaller was less than a cutoff value, either 2 or 10. They were considered to mismatch if the ratio was larger than the cutoff value or if only one duplicate was active. The concordance equals the fraction of matching k values in the whole set of k values. For a cutoff of 2, the concordance is 0.54 for pairs with at least one active, and 0.78 for pairs with both active; for a cutoff of 10, the concordance is 0.69 for pairs with at least one active and 0.99 for pairs with both active.
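The concordance calculation can be sketched in a few lines of Python. The sketch assumes the fraction is counted per duplicate pair, and the function name and example k values are hypothetical.

```python
def k_concordance(pairs, cutoff, require_both_active=False):
    """Fraction of duplicate pairs whose AC50 (k) values match.

    pairs: list of ((k1, active1), (k2, active2)) tuples.
    A pair matches when both duplicates are active and the ratio of the
    larger k to the smaller is below cutoff. Pairs with neither duplicate
    active are not considered.
    """
    considered = 0
    matches = 0
    for (k1, active1), (k2, active2) in pairs:
        if not (active1 or active2):
            continue  # neither active: outside both comparison sets
        if require_both_active and not (active1 and active2):
            continue
        considered += 1
        if active1 and active2 and max(k1, k2) / min(k1, k2) < cutoff:
            matches += 1
    return matches / considered if considered else float("nan")


# Hypothetical duplicate pairs (k in mM, activity flag).
pairs = [
    ((0.010, True), (0.015, True)),   # ratio 1.5: matches at cutoff 2 and 10
    ((0.010, True), (0.050, True)),   # ratio 5: matches only at cutoff 10
    ((0.010, True), (0.500, False)),  # only one active: always a mismatch
    ((0.200, False), (0.300, False)), # neither active: ignored
]
```

Counting by pair appears consistent with the reported values: with 103 both-active pairs and 45 pairs with only one active, concordances of 0.78 and 0.99 correspond to roughly 80 and 102 matching pairs, and 80/148 ≈ 0.54 and 102/148 ≈ 0.69.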
Triplicate HepG2 data
The assay for HepG2 cells was done in triplicate to test the reproducibility of the assay. The p values varied among the replicates. Out of all 1408 substances (counting duplicates separately), 105 had active concentration-response curves in all 3 replicates, 1187 did not have an active concentration-response in any of the replicates, 69 were significant only for one replicate, and the other 47 were significant for two replicates. This means that 92% of the triplicates had the same active or inactive/marginal classification for all 3 replicates. The correlation coefficient of various parameters between each of the 3 possible pairs of replicates, for the cases in which all 3 replicates were active, was calculated as a measure of similarity. The correlation of v between replicates ranged from 0.89 to 0.91. The AC50 (k) had lower correlation (0.73 to 0.85), as did the shape parameter n (0.59 to 0.81). The normalized data had higher correlations, especially at the highest concentrations. The correlations ranged between 0.93 and 0.95 for the data at the highest concentration and between 0.92 and 0.96 for the second highest concentration.
The results of the likelihood ratio test for parameter equivalence were examined on a pooled basis by considering all 1408 results as a common data set. Duplicated substances were considered as separate data points. The triplicate data were fitted to the Hill function model twice: once with all parameters constrained to be equal across replicates and once with no constraints on the parameters. A likelihood ratio test was used to evaluate the null hypothesis that the triplicates could be described using the same parameters. Of the 105 triplicates where all three responses were active, 26% showed a significant difference in the parameter values (p<0.05/1408). It appears that the main difference between the triplicates is in the value of k. When the likelihood ratio test was carried out for a null hypothesis of all parameters constrained to be equal vs. an alternative hypothesis of only 2 constrained to be equal (and thus with 1 parameter allowed to vary), the null hypothesis was rejected 11% of the time when n was allowed to vary, 17% of the time when v was allowed to vary, and 29% of the time when k was allowed to vary.
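A generic likelihood ratio test of this kind can be sketched in Python as follows. The sketch assumes the usual asymptotic chi-square reference distribution and counts only the three Hill parameters (v, k, n) per replicate, so that the unconstrained fit has 6 more parameters than the fully constrained fit, and letting one parameter vary across three replicates adds 2. These degrees of freedom, the function names, and the log-likelihood values are assumptions rather than details taken from the text. The closed-form tail probability used here is valid only for even degrees of freedom.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable with even df.

    Uses the closed form exp(-x/2) * sum_{i < df/2} (x/2)^i / i!.
    """
    if df <= 0 or df % 2:
        raise ValueError("this closed form needs a positive, even df")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def likelihood_ratio_p(loglik_null, loglik_alt, extra_params):
    """p-value for a nested-model comparison from two maximized log-likelihoods."""
    stat = max(0.0, 2.0 * (loglik_alt - loglik_null))
    return chi2_sf_even_df(stat, extra_params)


# Hypothetical log-likelihoods for one substance's triplicate data.
p_all = likelihood_ratio_p(loglik_null=-120.0, loglik_alt=-95.0, extra_params=6)
p_k_only = likelihood_ratio_p(loglik_null=-120.0, loglik_alt=-101.0, extra_params=2)
reject_all = p_all < 0.05 / 1408
```

As noted earlier in the article, the likelihood ratio test tends to overestimate significance, so p-values from a sketch like this would need the same caution.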
Substance behavior across all assays
Most compounds were inactive or marginal for most of the assays. 936 of the compounds were inactive or marginal in all 15 assays. 36 of the compounds were active in all 15 assays.
The concentration-response curves can also be classified by strength of response. For the following results, a “strong” curve is one with normalized modeled response at the highest concentration which is below −0.8. A “medium” curve is one with response between −0.3 and −0.8, and a “weak” curve is one with modeled response at the high concentration between 0 and −0.3. A “negative” curve is one which shows no loss of viability (i.e., the response is 0 or more [equivalently, the v parameter is 0 or less]). The classification of “negative” is stricter than the “inactive” classification above, which does not require that there be a complete lack of concentration-response. Of all concentration-response curves (15×1408=21,120) evaluated, 80.0% (16,898) are negative. 7.4% (1561) of the responses are weak. The majority of the weak responses (1161 of them) are inactive. The medium responses make up 7.2% (1520) of the results, with most of them (944) being active. 5.4% (1141) of the results are strong, with the vast majority, 1065 of them, being active. Because all of the strong-response curves were significant according to the likelihood ratio test, the only way a strong response could be not active was to be classified as marginal because the response was nonsignificant when tested without the high-concentration data point (i.e., strong non-actives are curves whose significance depends on the strength of the response at the highest concentration). Of the 576 medium-strength curves that are not active, 376 are so because they are nonsignificant without the highest-concentration data point. Table 2 shows a breakdown of this classification by assay. Twelve of the 1408 compounds had strong active responses for all 15 assays.
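The strength categories amount to simple thresholds on the modeled response at the highest concentration. The Python sketch below encodes them and recomputes the reported percentages from the reported counts. How values exactly equal to −0.3 or −0.8 are assigned is an assumption, and the function name is hypothetical.

```python
def classify_strength(response_at_high):
    """Strength category from the normalized modeled response at the top concentration."""
    if response_at_high >= 0:
        return "negative"   # no loss of viability
    if response_at_high > -0.3:
        return "weak"       # between 0 and -0.3
    if response_at_high >= -0.8:
        return "medium"     # between -0.3 and -0.8 (boundaries assumed inclusive)
    return "strong"         # below -0.8


# Counts reported in the text for all 15 x 1408 curves.
counts = {"negative": 16898, "weak": 1561, "medium": 1520, "strong": 1141}
total = sum(counts.values())
percentages = {name: round(100 * n / total, 1) for name, n in counts.items()}
```

The counts sum to 21,120 and reproduce the quoted shares of 80.0%, 7.4%, 7.2%, and 5.4%.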
| Table 2. Classification of concentration-response curves by strength and activity classification for each assay and for all assays combined |