Two-phase studies are efficient compared with standard case-control designs. The variant design presented in this paper improves on some aspects of standard two-phase studies. Specifically, data collection requires only a single contact with each subject. At a time when studies are struggling with decreasing response rates, collecting all necessary data at a single contact may improve overall participation rates. Moreover, for rare exposures, a minimum number of exposed subjects can be guaranteed in this design, thus increasing power, even compared with standard balanced two-phase designs. The disadvantage of the flexible two-phase design compared with other designs, including the standard two-phase design, is the additional complexity in design planning. Another possible disadvantage is that the categories that are relatively easy to fill will be filled quickly during recruitment, while the hard-to-fill categories will take longer to reach their sampling targets. This can produce complex relationships between covariates and recruitment times. This could be alleviated by the randomized recruitment approach proposed by Weinberg and Sandler [13], in which the most common Phase One category would be included in Phase Two with a given probability, chosen so that all categories are filled at about the same time.
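One simple way to choose such inclusion probabilities is sketched below. It is an illustration only, not necessarily the rule used in [13]: it assumes the Phase One category proportions among screened subjects are known in advance, and the function name, category labels, proportions, and targets are all hypothetical. Each category is thinned so that its expected filling time matches that of the slowest category.

```python
def inclusion_probabilities(prevalence, targets):
    """Per-category probability of inclusion in Phase Two (illustrative).

    prevalence: category -> assumed proportion among screened subjects
    targets:    category -> desired Phase Two sample size
    """
    # Expected number of screened subjects needed to fill each category
    # if every eligible subject in that category were included.
    needed = {k: targets[k] / prevalence[k] for k in targets}
    slowest = max(needed.values())
    # Thin the faster categories so that all take as long as the slowest.
    return {k: needed[k] / slowest for k in needed}


# Hypothetical numbers: a proxy present in 5% of screened subjects,
# with equal sampling targets in both categories.
probs = inclusion_probabilities(
    prevalence={"proxy_positive": 0.05, "proxy_negative": 0.95},
    targets={"proxy_positive": 200, "proxy_negative": 200},
)
```

Under these assumed numbers every proxy-positive subject is included, while a proxy-negative subject is included with probability of roughly 0.05, so both categories are expected to reach their targets after about the same number of screened subjects.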
In the examples presented, we focused on rare exposures for which one could identify inexpensive proxies. Combined with our proposed heuristic rule, such a proxy allows oversampling of the rare exposure and thus increases power. This approach is efficient provided that the analysis method used is maximum likelihood, which implicitly assumes non-differential misclassification, i.e., that the proxy is not a confounder. In practical terms, this means that the disease risk, given exposure, is the same in all strata. If the disease risk varies across strata, the effect of exposure may have to be assessed separately in each stratum, resulting in reduced power to detect the effect of exposure in the underrepresented strata.
One major consideration for the flexible two-phase design is the availability of an adequate proxy for Phase One screening. The proxy must be easy to obtain from all screened subjects and must also have high sensitivity and specificity. For a study focused on occupational exposures, as in example 1, a question about working in the industry of interest is easily collected and should yield a reasonable proxy for exposure. This binary stratification for the proxy may be extended to increase sensitivity and specificity. For example, one could ask about duration of work in a particular industry, thereby obtaining a proxy of the actual cumulative dose. Similarly, a positive family history was previously shown [14] to be a good proxy for a rare gene with a strong effect. However, as the effect of the allele decreases and its frequency increases (as would be the situation for a low-risk gene), the sensitivity and specificity of family history decrease. In such situations, an alternative proxy for G may need to be considered, such as age at diagnosis, or a quick, inexpensive physiologic test during the in-person interview at Phase One. Of course, the more information obtained at Phase One, the more expensive Phase One becomes.
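To make the role of sensitivity and specificity concrete, the sketch below computes the fraction of truly exposed subjects within the proxy-positive and proxy-negative strata using Bayes' rule. The function name and all numerical values are hypothetical and are not taken from the examples in this paper.

```python
def exposed_fraction_by_stratum(prevalence, sensitivity, specificity):
    """Fraction of truly exposed subjects among proxy-positive and
    proxy-negative subjects, for a binary exposure and a binary proxy."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    among_positive = true_pos / (true_pos + false_pos)
    among_negative = false_neg / (false_neg + true_neg)
    return among_positive, among_negative


# Hypothetical values: a 2% exposure prevalence and a proxy with
# 90% sensitivity and 95% specificity.
among_positive, among_negative = exposed_fraction_by_stratum(
    prevalence=0.02, sensitivity=0.90, specificity=0.95
)
```

With these assumed values, roughly 27% of proxy-positive subjects are truly exposed, compared with 2% of the screened population overall, which is why oversampling the proxy-positive stratum enriches Phase Two for the rare exposure. Lower sensitivity or specificity shrinks this enrichment.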
We acknowledge that a gene-environment interaction odds ratio of 5 may be rather extreme for most diseases, particularly given some recent findings, as in [15]. We are currently working on a more topic-oriented comparison of different study designs for detecting gene-environment interactions, using a wider range of scenarios and including the flexible two-phase design and the case-only design (under the assumption of independence of genetic and environmental factors in the population).
In the present paper, we focused on the estimation of a single odds ratio. However, dose-response estimation is possible, as long as detailed data are available at Phase Two. Similarly, it is possible to adjust for confounders as long as the relevant data are available at Phase Two. However, since the flexible two-phase design is mostly targeted at predefined hypotheses, especially if one oversamples some strata, there may be limited power to test other hypotheses or perform exploratory analyses. For example, exposure to some aromatic amines increases the risk of bladder cancer, but this exposure is rare in the metal industry. Thus, the design we considered would have low power for detecting this risk. Many epidemiologic studies are exploratory in that they assess the effects of a large spectrum of factors without focusing on predefined hypotheses. The flexible two-phase design is not suited to this situation and necessarily focuses on a restricted number of explicitly stated hypotheses. We are, however, convinced that in many circumstances, only studies with predefined hypotheses will allow progress in understanding disease etiology.