Failure to take proper account of interactions in the statistical model
In a typical study for which the methodology is being proposed, replicated plots of GM plants, a conventional comparator and a number of reference varieties are laid out as a randomised complete block design at each of a number of different sites. Given this design, and adopting the term "Genotype" to represent the complete list of test entries (i.e. GMO, conventional comparator and reference varieties), the basic linear mixed model structure is as follows:
where i, j and k are the indices for site, block within site, and genotype respectively, Yijk is the observed response at site i, block j for genotype k, SitexGenotypeik is the interaction between site and genotype, and eijk is residual plot error. To embrace the authors' proposals it is necessary to partition some of the above terms into separate components but the basic structure still applies. For example, and as the authors explain, Genotype is separated into GenotypeGroup (a three-level fixed factor distinguishing GMO, comparator and the group of reference varieties as a whole) and Genotype-within-GenotypeGroup (a random factor representing genetic variation in the reference population). Partitioning the interaction between site and genotype into the corresponding components is the logical next step, supported by the argument that interaction with site may be different among the test entries than among reference varieties.
Crucially, the authors choose not to include any interaction terms in their model, and in our view omitting such interactions from the model has led to fundamentally flawed procedures.
To illustrate the importance of accounting for interactions, we first focus on the test comparing GMO and comparator means. By neglecting to include any terms for interaction in their model, the difference test proposed by the authors will always be based on an error term that includes all potential sources of interaction, as well as plot effects, with correspondingly large degrees of freedom. The consequence of this is not the same in all cases and instead depends on the relative sizes of the various sub-components of the site × genotype interaction. For example, in the not unrealistic situation in which both the site × test entry interaction and the site × reference variety interaction are non-zero and of broadly similar magnitude, the error term will be underestimated and the associated degrees of freedom will be far too high. As a consequence, the false positive error rate will tend to be higher than the nominal rate indicated and the confidence interval for the difference between means will be too narrow. Certain other arrangements will result in the proposed difference test being unduly conservative. In fact, the only situation for which the proposed difference test will behave as intended is when there are, in reality, no interactions whatsoever with site.
To illustrate the effect of ignoring interactions on the behaviour of the proposed difference test we conducted a small simulation exercise, details of which are given in the Methods section. For simplicity, data were generated under the basic model structure in equation (1) with a single component to represent the entire site × genotype interaction. This implicitly assumes that the magnitude of the site × test entry interaction is the same as the site × reference variety interaction, and while this may be a somewhat special case, the example is sufficient to demonstrate that, in general, the proposed methodology does not perform as intended.
Within the general framework of the difference test, taking proper account of the site × test entry interaction is not difficult conceptually, nor would it be difficult to implement in practice if the reference varieties were simply omitted from the analysis. Within the authors' specified framework, however, we have yet to find a way of adapting their SAS code that will result in the right degrees of freedom in all cases for even the most straightforward of EFSA-inspired experimental designs in which all test entries and reference varieties are included at all locations. (A design, incidentally, that would not be practicable on cost grounds). Conversely, how to take proper account of interactions with site in the authors' proposed tests of equivalence is conceptually far less obvious but clearly no less important. For example, should equivalence intervals reflect the fact that reference varieties are likely to perform differently at different locations? The way interactions are handled will have a direct bearing on the various standard errors and intervals that are central to the equivalence testing procedures and hence on the outcome of such tests. There are other aspects of the proposed tests of equivalence that give cause for concern but it is impossible to evaluate these properly until the issue of interactions has been resolved.
We therefore fail to see how the proposed tests of difference and equivalence can be regarded as being fit for purpose until the issue of interactions with site has been adequately addressed, and our opinion is that this must be clarified before any such methodology becomes a regulatory requirement. We also question the value of trying to evaluate the statistical properties of the proposed methodology prior to this, noting that the authors' simulation studies that purport to establish validity of their methods were performed under a model that implicitly assumes that all potential interactions are non-existent. Whether a valid and workable solution that takes proper account of interactions actually exists within the authors' proposed framework is uncertain.
Whilst the authors make it clear that, in their opinion "The primary objective for an average difference/equivalence approach is neither the identification of possible interactions nor per site (per year) evaluation," this does not in any way lessen the need to partition the sums of squares and degrees of freedom in a way that is consistent with the design structure. The fact that the authors later choose to include site and interaction terms as fixed effects to check for consistency over sites is largely irrelevant here in that it does not address the need to take proper account of interactions in the main difference and equivalence tests. This approach is also questionable given that sites were originally specified as random effects. Furthermore, the way in which interactions with site are handled has the potential to impact on the suitability of different experimental design options, which is of particular concern to companies that are required to implement the proposed methodology. While EFSA's Scientific Opinion document proposes that different reference varieties can be grown at different sites, it is currently unclear which of the possible design options will allow the necessary statistics to be generated when interactions between site and the various genotype subgroups are included in the model.
Irrespective of the specific technical concerns raised above, we see the whole process as extremely convoluted and by no means intuitive. With regard to the authors' proposed equivalence tests, comparisons with traditional equivalence testing are not very helpful in this respect because, with a traditional equivalence test, the focus is on whether the difference between two specific treatments is less than some pre-specified amount whereas here the focus is on the difference between GMO and some hypothetical population of non-GM varieties. This is discussed in more detail by Herman et al (2010) [
3].
Other key concerns
Under the current proposals, the thresholds for the tests of equivalence are entirely study-specific, being totally dependent on the precise set of reference varieties that happened to be included. This being the case, the situation could easily arise in which two different submissions with very similar profiles in terms of GMO, comparator, environments and levels of residual variation could result in very different overall conclusions simply because the respective sets of reference varieties led to very different sets of equivalence limits. Yet this would clearly be inappropriate because both sets of reference varieties are intended to estimate exactly the same thing, i.e. the true spread of responses amongst the entire population of non-GM varieties with a history of safe use grown under identical environments as the test entries. In practice, this problem will be exacerbated by the fact that applicants' access to a diverse range of varieties is often restricted. The need for a level playing field should surely be a key requirement of any regulatory process, and the idea of establishing a common set of equivalence limits should be seriously considered, either based on historical data or, if necessary, based on data from new studies set up specifically for this purpose.
We see no scientific justification for changing the Type I error rate associated with the difference test from the customary 5% level to 10%, as proposed.