As noted above, the NCS is designed to provide data appropriate for analyzing relationships between health outcomes and both exposures and genetic factors. The sampling unit is the birth, and the target population is all births occurring during the birth enrollment period to women who are U.S. residents at the time of the birth. A set of key hypotheses was formulated to make the research objectives more specific and to help guide the design of the NCS. In addition to these hypotheses, the study design must also accommodate many other research questions that will be addressed using the data. These include current research questions of substantive interest, as well as research questions that have not yet been developed (e.g., those based on associations that will emerge from future research).
The study hypotheses had a direct influence on the design of the study. Some hypotheses require preconception and prenatal measures, which means the study must enroll at least some women into the study prior to conception and most women as soon after conception as possible. Most of the hypotheses demand postnatal and childhood measures, so a prospective, longitudinal study of births is necessary. Other hypotheses call for environmental and ecological measurements, implying the data collection areas where the women reside should be as contiguous and as compact as possible to reduce costs. The full set of research hypotheses provided the rationale for a long-term, prospective study of approximately 100,000 children. As described below, all of these requirements are dealt with in the sample design.
The problem of determining the necessary sample size for the NCS was extremely complex. For each hypothesis considered, the required sample size depends on the type of statistic required, the level of confidence needed for point estimates, and the statistical power for analytic comparisons. Because the NCS is an environmental study, the proportion exposed to risk must also be considered. For hypotheses to be investigated over time, the potential for loss to follow-up has to be incorporated; that is, the sample size for a longitudinal objective must be inflated to a larger number for the initial birth cohort to allow for non-retention over time. To ensure sufficient sample sizes for all hypotheses, the required sample size was set for the most demanding analytic hypothesis.
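The inflation for non-retention described above can be illustrated with a short calculation. This is a minimal sketch that assumes a constant annual retention rate; the function name and the numeric inputs are hypothetical and are not NCS planning values.

```python
import math


def initial_cohort_size(n_final: int, annual_retention: float, years: int) -> int:
    """Births to enroll so that about n_final remain after `years` of follow-up.

    Assumes (for illustration only) that the same fraction of the cohort
    is retained in every year of follow-up.
    """
    if not 0.0 < annual_retention <= 1.0:
        raise ValueError("annual_retention must be in (0, 1]")
    retained_fraction = annual_retention ** years
    return math.ceil(n_final / retained_fraction)


# Hypothetical inputs: 1,000 children needed at the end of follow-up,
# 98 percent retained each year, 21 years of follow-up.
print(initial_cohort_size(n_final=1000, annual_retention=0.98, years=21))
```

In practice retention is unlikely to be constant over time, so a year-specific retention schedule would replace the single rate used here.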
As part of the planning process for the NCS, a range of detectable odds ratios was examined under varying prevalence, varying levels of precision, and varying exposure levels. This extensive investigation determined that a sample size of 2,000 births would be required for the most demanding hypotheses. The expected prevalence associated with these hypotheses was as low as 2 percent; thus, a total of 100,000 births is required to obtain the 2,000 births with the expected characteristics of the outcome variable conditional on the exposure level. Once the total sample size of 100,000 was established, potential hypotheses could be examined relative to that sample size to determine whether they should be included in the study. As such, it was the full set of research hypotheses that provided the rationale for a long-term prospective study of 100,000 children.
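The arithmetic behind the 100,000 figure, and the kind of detectable-odds-ratio calculation mentioned above, can be sketched as follows. The power function uses a standard normal approximation for comparing two proportions; it is an illustrative assumption, not the method actually used in NCS planning, and the exposure proportion and odds ratio in the example are hypothetical.

```python
import math
from statistics import NormalDist


def births_needed(target_births: int, prevalence: float) -> int:
    """Total births required to expect `target_births` with the characteristic."""
    # round() guards against floating-point noise before taking the ceiling.
    return math.ceil(round(target_births / prevalence, 6))


def exposed_risk(p0: float, odds_ratio: float) -> float:
    """Outcome risk among the exposed implied by baseline risk p0 and an odds ratio."""
    odds1 = odds_ratio * p0 / (1.0 - p0)
    return odds1 / (1.0 + odds1)


def approx_power(n_total: int, exposure_prop: float, p0: float,
                 odds_ratio: float, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided z-test comparing exposed and unexposed risks.

    Ignores clustering, weighting, and attrition (simple random sampling assumed).
    """
    n1 = n_total * exposure_prop          # exposed
    n0 = n_total - n1                     # unexposed
    p1 = exposed_risk(p0, odds_ratio)
    se = math.sqrt(p1 * (1 - p1) / n1 + p0 * (1 - p0) / n0)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(p1 - p0) / se - z_crit)


print(births_needed(2000, 0.02))
# Hypothetical scenario: 5 percent exposed, 2 percent baseline risk, odds ratio 1.5.
print(round(approx_power(100_000, 0.05, 0.02, 1.5), 3))
```

Repeating the power calculation over a grid of prevalences, exposure proportions, and odds ratios gives a table of detectable effects of the kind described in the text.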
While different analytic methods will be used to address the hypotheses and other research questions that the data will support, we expect generalized regression modeling (e.g., see McCullagh and Nelder [4]) to be the primary approach used with NCS data. Linear regression models for continuous outcome variables and logistic regression models for dichotomous outcome variables are two well-known models in this class that will be used often with NCS data. Other nonlinear models, such as Poisson regression models for count data, polytomous logistic regression models for nominal outcome variables, proportional odds models for ordered categorical outcomes, and survival or time-to-event models, may also be appropriate for some analyses. Structural equation models and latent variable models are other analytic techniques that will be used to examine important research questions, especially those related to causality. Multilevel or hierarchical linear models may also be needed to address hypotheses related to community or neighborhood effects. All of these modeling efforts will be aided by collecting key confounding and mediating variables in the study.
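As a small illustration of the logistic regression case, consider a dichotomous outcome and a single dichotomous exposure. For that model the maximum likelihood estimate of the exposure coefficient is the log of the cross-product odds ratio from the 2x2 table, and its Wald standard error is given by Woolf's formula, so no iterative fitting is needed. The sketch below ignores the sample design (weights and clustering), and the counts are made up for illustration.

```python
import math
from statistics import NormalDist


def odds_ratio_with_ci(a: int, b: int, c: int, d: int, level: float = 0.95):
    """Odds ratio and Wald confidence interval from a 2x2 table.

    a: exposed with outcome      b: exposed without outcome
    c: unexposed with outcome    d: unexposed without outcome
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive")
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # Woolf's formula
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return (math.exp(log_or),
            math.exp(log_or - z * se),
            math.exp(log_or + z * se))


# Made-up counts, not NCS data.
estimate, lower, upper = odds_ratio_with_ci(a=30, b=970, c=20, d=980)
print(f"OR = {estimate:.2f}, 95% CI ({lower:.2f}, {upper:.2f})")
```

With confounders or continuous exposures in the model, the estimate no longer has a closed form and would be obtained with standard regression software.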
Given the breadth of the research questions, disproportional sampling of some groups to improve the precision of estimates for those groups must be weighed against detrimental effects this would have on the research for undersampled groups. The NCS has the advantage that proportional allocation with 100,000 sampled children provides adequate sample sizes for most of the groups that are traditionally oversampled in national studies.
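The trade-off described above can be made concrete with a short calculation of expected group sample sizes under a fixed total sample. This is a sketch under simplifying assumptions: the group names, population shares, and oversampling rate are hypothetical, and groups are treated as if they could be sampled at different rates directly.

```python
def expected_group_sizes(n_total, shares, rates=None):
    """Expected sample size per group for a fixed total sample.

    shares: population share of each group (should sum to 1).
    rates:  relative sampling rate of each group; all equal to 1.0
            corresponds to proportional allocation.
    """
    if rates is None:
        rates = {g: 1.0 for g in shares}
    scale = sum(shares[g] * rates[g] for g in shares)
    return {g: n_total * shares[g] * rates[g] / scale for g in shares}


# Hypothetical population shares, not NCS figures.
shares = {"group_a": 0.05, "group_b": 0.15, "group_c": 0.80}
proportional = expected_group_sizes(100_000, shares)
oversampled = expected_group_sizes(
    100_000, shares, {"group_a": 3.0, "group_b": 1.0, "group_c": 1.0}
)
for g in shares:
    print(g, round(proportional[g]), round(oversampled[g]))
```

Because the total is fixed, raising the rate for one group necessarily lowers the expected sample sizes of the others, and the sampling weights (inversely proportional to the rates) become unequal.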
Another advantage of proportional allocation is that the sampling weights are approximately equal for the units in the sample. These weights are used to estimate population parameters in design-based analysis, in which statistical inferences are based on the distribution induced by the sample design and therefore incorporate aspects of that design such as probabilities of selection, sample unit stratification, sample unit clustering, and adjustments for differential nonresponse and population coverage. An alternative is the model-based approach to inference, in which the analyst assumes a statistical model and draws inferences based on that model. The sampling weights are not used in model-based analysis. When the weights are approximately equal, design-based and model-based parameter estimates are generally very similar, and most of the differences between the approaches relate to the precision of the estimates rather than their level. Design-based estimates generally have larger variances because of the clustering of the sample, but with many analytic methods, such as linear and logistic regression analysis, even these differences are not as pronounced as they are with descriptive statistics such as means and totals. (See Korn and Graubard [5] for a more complete discussion of these approaches to inference and their effect on the estimates.)
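The point about approximately equal weights can be illustrated numerically. The sketch below compares a weighted (design-based) mean with an unweighted mean and computes Kish's approximate design effect due to unequal weighting, n Σw² / (Σw)². This quantity reflects only the variability of the weights and does not capture the effect of clustering; the values and weights are made up for illustration.

```python
from statistics import fmean


def weighted_mean(values, weights):
    """Weighted mean of the sampled values, using the sampling weights."""
    return sum(w * y for y, w in zip(values, weights)) / sum(weights)


def kish_design_effect(weights):
    """Kish's approximate variance inflation from unequal weights (1.0 if equal)."""
    n = len(weights)
    return n * sum(w * w for w in weights) / sum(weights) ** 2


values = [3.1, 2.8, 3.6, 3.3]            # made-up outcome values
equal = [40.0, 40.0, 40.0, 40.0]         # proportional allocation: equal weights
unequal = [10.0, 10.0, 70.0, 70.0]       # disproportionate allocation

print(fmean(values), weighted_mean(values, equal), kish_design_effect(equal))
print(weighted_mean(values, unequal), kish_design_effect(unequal))
```

With equal weights the weighted and unweighted means coincide and the weighting design effect is 1; with unequal weights the point estimate shifts toward the heavily weighted units and the design effect exceeds 1.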
As noted above, both household probability samples and center-based sampling approaches have been used in longitudinal surveys of health in the U.S. and internationally, and both have advantages and disadvantages. The rationale for choosing a structure depends on a number of factors that are often specific to the goals and resources of the study.
The household probability sample approach was chosen for the NCS, and is being used for the Vanguard Study, after much deliberation because it seemed most likely to support key objectives of the study. The primary advantage of the household probability structure is that it has greater potential to sample women early (prior to conception or soon after conception), thus providing the preconception and prenatal data needed for some key hypotheses. Another advantage of the household structure relative to the center model is the more complete coverage of women, particularly those who are outside the traditional medical system. The children of these women could have health outcomes that differ from those of children born to mothers within the medical system, and the relationships of interest may vary as a result of the differences. The household approach may be less susceptible to these selection biases. A related advantage is that probability sampling may be more robust for making population-level inferences, as well as for making individual-level estimates (see Curtin and Feinleib [6]). Additionally, the multi-stage area probability sample design facilitates linkage of the sampled address to external data sources; e.g., merging census block-level data (environmental, socio-economic, crime statistics, etc.) to the NCS data for analytic purposes. Although this linkage to external data sources is also possible in the center-based and office models, the household model allows for much greater control over the degree of clustering of the sample within areas such as the census block; this control of clustering is beneficial in the estimation of neighborhood effects.
While there are good reasons for choosing the household structure, the choice was far from simple. The center-based structure has distinct advantages of its own. A blue-ribbon panel consisting of national experts in sampling, study design, and epidemiology was asked to consider this question and to make a recommendation on the design. The final report of the panel (see www.nationalchildrensstudy.gov/research/workshops/Pages/samplingdesign032004.aspx, last accessed January 18, 2010) contains a complete accounting of the issues. The panel's recommendation was accepted and led to the adoption of the household sampling approach that is discussed in detail below.
The NCS organizational structure consists of a Program Office, a Coordinating Center, and a set of Study Centers. The Program Office oversees the operations of the study. The Coordinating Center is responsible for information management, sampling, data collection and analysis, and quality control. The Study Centers are the organizations responsible for recruitment of participants and data collection within the primary sampling units (PSUs). The Study Centers collaborate with community representatives to tailor their outreach and data collection approaches, within the guidelines of the study protocol, to the needs, characteristics, and interests of their communities. Additionally, the Study Centers provide the community perspective that is critical to the segment formation process described in section 4.