The data for this study were North Carolina birth and fetal death records for 1988–1997. Maternal race/ethnicity subgroups were defined by the data contained in the vital records. Race was listed as white, black, American Indian, or Asian. A separate variable indicated whether the mother was of Hispanic ethnicity. Hispanic ethnicity was subcategorized as Cuban, Puerto Rican, Central/South American, or Mexican. This study was approved by the Institutional Review Board of the University of North Carolina at Chapel Hill.
Conceptualizing maternal linkage error
Our approach to maternal linkage error was guided by the purpose behind linking the data, namely, statistical analysis of the linked records. This approach suggests that maternal record linkage should be considered analogous to the data collection phase of a study. In this view, the goal of maternal record linkage should be 1) to produce a sample of maternal sets that is representative of the true maternal sets in the population (as opposed to reproducing the true maternal sets themselves) and 2) to enable computation of statistical measures of the representativeness of the maternal sets that can be incorporated into analysis of the data, as is customary with other types of epidemiologic samples. It follows that the assignment of records to maternal sets should be based on the underlying probability distributions of the relevant population parameters.
A second notion that follows from considering the purpose of maternally linked data is that maternal linkage error should be conceptualized as misclassification of maternal sets (17). This contrasts with previous approaches that defined maternal linkage error as misclassification of pairs of records (21).
Integrating the above concepts, we used maternal set probability as our measure of potential linkage error. Maternal set probability was defined as the probability that all of the records in a set refer to the same woman. It was calculated as the mean of the pairwise match probabilities of all possible pairs of records in the set. This is explained in detail below.
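As a minimal sketch of this definition, the maternal set probability of a three-record set can be computed as the mean over its three pairwise match probabilities. The record labels and probability values here are hypothetical and serve only to illustrate the arithmetic.

```python
from itertools import combinations
from statistics import mean

def maternal_set_probability(records, pair_prob):
    """Mean of the pairwise match probabilities over all pairs in the set.

    records: iterable of record identifiers (at least two).
    pair_prob: dict mapping frozenset({id1, id2}) -> pairwise match probability.
    """
    pairs = combinations(sorted(records), 2)
    return mean(pair_prob[frozenset(pair)] for pair in pairs)

# Hypothetical pairwise match probabilities for records A, W, and L.
pair_prob = {
    frozenset({"A", "W"}): 0.98,
    frozenset({"A", "L"}): 0.90,
    frozenset({"W", "L"}): 0.94,
}

set_prob = maternal_set_probability({"A", "W", "L"}, pair_prob)
print(round(set_prob, 4))
```

For a two-record set there is only one pair, so the set probability equals the pairwise match probability.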
Calculating maternal set probability and selecting maternal sets
The Fellegi and Sunter (22) formulas for probabilistic record linkage satisfy the above criterion of being based on underlying probability distributions. However, for linking multiple records within a single file (as opposed to linking records between files), all of the available applications of the Fellegi and Sunter formulas employ practical adaptations that depart from the underlying probability distributions. We therefore used a linkage method that combined the Fellegi and Sunter formulas with probability sampling. Specifically, we calculated the maternal set probability for all possible maternal sets (with restrictions, as described below), and then we selected from the possible maternal sets based on the maternal set probability.
This process had three stages. First, we calculated match probabilities for pairs of records using the Fellegi and Sunter (22) formulas for probabilistic record linkage (using LinkSolv 4.0, Strategic Matching, Inc., Morrisonville, NY). The Fellegi and Sunter approach is based on the assumption that two records that have the same identifying information are more likely to refer to the same individual if that combination of names, birth date, etc. is rare (relative to the data) than if it is common, and similarly, two records that have differing identifying information are less likely to refer to the same individual if the information is rare than if it is common. To calculate pairwise match probabilities, the identifying variables are compared across each pair of records. Variables that have the same value on the two records contribute positively to the pairwise probability, and variables that have differing values contribute negatively to the pairwise probability, both weighted such that rare values have greater weight.
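The weighting logic can be sketched with the log-likelihood-ratio form commonly associated with the Fellegi and Sunter approach. This is an illustration only and is not a description of how LinkSolv computes its probabilities. The names `field_weight`, `match_probability`, and `PRIOR_ODDS`, the conditional-independence assumption, and all m and u values are hypothetical; m is taken to be the probability that a field agrees when two records refer to the same woman, and u the probability that it agrees when they do not (smaller for rare values).

```python
import math

def field_weight(agrees, m, u):
    """Log2 likelihood ratio contributed by one compared field."""
    if agrees:
        return math.log2(m / u)            # positive when m > u
    return math.log2((1 - m) / (1 - u))    # negative when m > u

def match_probability(comparisons, prior_odds):
    """Combine field weights into a posterior match probability.

    comparisons: list of (agrees, m, u) tuples, one per identifying field,
    treated as conditionally independent (an assumption of this sketch).
    """
    total_weight = sum(field_weight(a, m, u) for a, m, u in comparisons)
    posterior_odds = prior_odds * 2 ** total_weight
    return posterior_odds / (1 + posterior_odds)

PRIOR_ODDS = 1 / 1000  # hypothetical prior odds that a random pair is a match

# Field 1 is a surname (rare: u = 0.0005, common: u = 0.05); field 2 is a birth date.
rare_agree = [(True, 0.95, 0.0005), (True, 0.98, 0.003)]
common_agree = [(True, 0.95, 0.05), (True, 0.98, 0.003)]
rare_disagree = [(False, 0.95, 0.0005), (True, 0.98, 0.003)]

p_rare = match_probability(rare_agree, PRIOR_ODDS)
p_common = match_probability(common_agree, PRIOR_ODDS)
p_disagree = match_probability(rare_disagree, PRIOR_ODDS)
print(p_rare > p_common > p_disagree)
```

Under these assumed values, agreement on a rare surname yields a higher match probability than agreement on a common one, and disagreement on the surname lowers the probability sharply.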
As stated above, the available linkage software departs from the underlying probability distributions when applying the Fellegi and Sunter formulas to linking records within a single file. Therefore, to obtain the pairwise match probabilities, we divided the records into separate files and calculated the pairwise probabilities across files; pairwise probabilities were undefined for records within the same file. This effectively made records in the same file ineligible to be linked to each other, because the linkages were based on the pairwise probabilities (see below). Accordingly, the assignment of records to the separate files was based on the assumption that no two records for a particular woman would have the same birth order (or fetal death order) and substantially different personal identifying information (names and birth dates). This assumption is conservative in that it allows for errors or changes in the data across a woman’s events (e.g., change in name after divorce or marriage or two records erroneously indicating the same birth order) except for gross errors that would be unlikely to result in linkage in any case. In practice, records were initially organized into separate files based on birth or fetal death order. Records within each file that had similar identifying information were then separated among the files so that they were eligible to be linked with each other. This approach further implied that a particular record could be correctly linked to at most one other record in any one file. We refer to these files below as linkage files.
Using the above approach, we calculated pairwise match probabilities for all possible pairs of records between all possible pairs of linkage files. This made each record eligible to be linked to each of the other records in all of the other files. Our matching strategy for the pairwise probabilities is given in Appendix B.
The second stage of the linkage process involved creating a universe of all possible maternal sets and calculating their maternal set probabilities. To reduce errors as well as computation time, potential sets whose probabilities were 10% or less were excluded from the universe of possible sets.
The third stage involved selecting maternal sets at random proportional to the maternal set probability from the universe of possible sets. In the universe of possible sets, any one record was contained in numerous sets. In stage 3, when a particular set was selected, all other possible sets containing any of those records became ineligible for subsequent selection. Thus, in the selected sample, any particular record was included in one maternal set only.
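One possible reading of stages 2 and 3 for a single iteration is sketched here. The text does not specify the exact sampling scheme or when selection stops, so two points are assumptions of this sketch: sets are drawn one at a time with probability proportional to their maternal set probability, and drawing continues until no eligible set remains. The function name `select_sets` and the example values are hypothetical.

```python
import random

def select_sets(candidates, min_prob=0.10, seed=0):
    """Select non-overlapping sets at random, proportional to set probability.

    candidates: dict mapping frozenset of record ids -> maternal set probability.
    """
    rng = random.Random(seed)
    # Stage 2 restriction: exclude sets whose probability is 10% or less.
    eligible = {s: p for s, p in candidates.items() if p > min_prob}
    selected = []
    while eligible:  # assumed stopping rule: draw until nothing is eligible
        sets = list(eligible)
        weights = [eligible[s] for s in sets]
        chosen = rng.choices(sets, weights=weights, k=1)[0]
        selected.append(chosen)
        # Every set sharing a record with the chosen set becomes ineligible.
        eligible = {s: p for s, p in eligible.items() if not (s & chosen)}
    return selected

# Hypothetical universe of two-record sets between two linkage files.
candidates = {
    frozenset({"A", "W"}): 0.99,
    frozenset({"A", "X"}): 0.40,
    frozenset({"B", "X"}): 0.95,
    frozenset({"C", "Y"}): 0.05,  # excluded by the 10% restriction
}
selected = select_sets(candidates, seed=1)
print(selected)
```

Because overlapping sets are removed after each draw, every record ends up in at most one selected set, as described above.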
Stages 2 and 3 were applied, in sequence, to the linkage files in a stepwise fashion; that is, first two files were processed through stage 2 and then stage 3, then a third file was processed through stage 2 and stage 3, then a fourth file, and so on for all of the files. Once a particular set was selected, those records remained together as a set throughout subsequent iterations, although additional records could be added to the set. In this way, the final maternal sets were built up one file at a time. This process is explained further below and illustrated in the accompanying figure.
Illustration of the first three iterations of the maternal record linkage process using hypothetical linkage files containing records A–E, W–Z, L–N, and P–R, respectively.
In the initial iteration of this process, the universe of possible maternal sets (stage 2) consisted of all pairs of records between two of the linkage files. In this initial iteration only, the pairwise probabilities calculated in stage 1 constituted the maternal set probabilities for the possible maternal sets. Implementing stage 3, some of these pairs of records were selected into the sample of maternal sets for that iteration.
In the second iteration, the universe of possible maternal sets included the sets selected in the previous iteration as well as the individual records from the two initial files that were not included in the selected sets. The universe of possible maternal sets for the second iteration was constructed by combining each of these units (i.e., two-record sets selected in the previous iteration and single, unselected records) with each of the records in a third linkage file. Thus, some of the possible maternal sets in the second iteration contained three records and some contained two records. The maternal set probability was calculated for each possible set as the mean of the pairwise match probabilities (from stage 1) of all possible pairs of records in the set. Implementing stage 3, some of the three-record sets and some of the two-record sets were selected into the sample of maternal sets for that iteration.
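The construction of the universe for the second iteration can be sketched as follows, assuming one previously selected two-record set, one unselected single record, and a third linkage file with two records. The function name `build_universe`, the record labels, and the probability values are hypothetical.

```python
from itertools import combinations
from statistics import mean

def build_universe(units, new_file, pair_prob):
    """Combine each unit with each record of the next linkage file.

    units: previously selected sets and unselected single records,
           each given as a frozenset of record ids.
    new_file: record ids in the linkage file being added.
    pair_prob: dict mapping frozenset({id1, id2}) -> stage 1 match probability.
    Returns a dict mapping each possible set -> maternal set probability.
    """
    universe = {}
    for unit in units:
        for record in new_file:
            candidate = unit | {record}
            pairs = combinations(sorted(candidate), 2)
            universe[candidate] = mean(pair_prob[frozenset(p)] for p in pairs)
    return universe

units = [frozenset({"A", "W"}), frozenset({"B"})]  # selected set, single record
third_file = ["L", "M"]
pair_prob = {
    frozenset({"A", "W"}): 0.98,
    frozenset({"A", "L"}): 0.96,
    frozenset({"W", "L"}): 0.94,
    frozenset({"A", "M"}): 0.10,
    frozenset({"W", "M"}): 0.06,
    frozenset({"B", "L"}): 0.20,
    frozenset({"B", "M"}): 0.90,
}

universe = build_universe(units, third_file, pair_prob)
for candidate, prob in universe.items():
    print(sorted(candidate), round(prob, 2))
```

In this example the universe contains two three-record sets and two two-record sets, and the three-record set probabilities reuse the stage 1 probability of the pair already selected in the first iteration.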
The third and subsequent iterations followed the pattern of the second iteration. The universe of possible sets was constructed by combining the selected maternal sets from previous iterations and the single records that were not included in any selected maternal set with the records in another linkage file. For example, the universe of possible sets for the fourth iteration included the selected sample from the third iteration that contained four-record, three-record, and two-record sets; three-record and two-record sets that were selected in the second iteration but not in the third iteration; two-record sets that were selected in the first iteration but not in the second or third iterations; and the single records from all of the previous linkage files that were not included in any of the previously selected maternal sets. Each of these sets and single records brought forward from the third iteration was combined with each of the records from another linkage file to form the universe of possible sets for the fourth iteration. The maternal set probabilities were calculated, and the sample of maternal sets was reselected.
The linkage files were entered into the above process according to their nominal birth order. The final maternally linked dataset consisted of the sets selected in the final iteration and the individual records that were not included in the selected sets. Thus, the final dataset included all of the records from the original ten-year birth and fetal death cohort.
Assessing differential linkage error across population subgroups
To assess differential linkage error, we compared the distributions of maternal set probability among selected population groups. Because the distributions were highly skewed to the right, we based the comparisons on the proportion of sets in each group for which the set probability was ≥ 0.99. Lower proportions were taken as indicating more linkage error.
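The comparison measure can be illustrated with a short sketch that computes, for each group, the proportion of maternal sets with a set probability of at least 0.99. The group labels and probability values are hypothetical and are not study results.

```python
from collections import defaultdict

def high_probability_proportion(sets, threshold=0.99):
    """Proportion of sets per group with set probability >= threshold.

    sets: iterable of (group_label, maternal_set_probability) tuples.
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [n at or above threshold, n total]
    for group, prob in sets:
        counts[group][1] += 1
        if prob >= threshold:
            counts[group][0] += 1
    return {group: high / total for group, (high, total) in counts.items()}

# Hypothetical maternal sets labeled by whether they contain a fetal death record.
example_sets = [
    ("fetal_death", 0.995),
    ("fetal_death", 0.80),
    ("no_fetal_death", 0.999),
    ("no_fetal_death", 0.99),
    ("no_fetal_death", 0.97),
    ("no_fetal_death", 1.0),
]

proportions = high_probability_proportion(example_sets)
print(proportions)
```

In this made-up example the group containing fetal death records has the lower proportion, which would be read as indicating more linkage error in that group.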
We postulated that maternal linkage error was more likely to occur for records representing fetal deaths or teenage mothers (maternal age < 20 years) and for large maternal sets. Consequently, we expected maternal set probabilities to be lower in sets that contained at least one fetal death record, at least one teen birth or teen fetal death record, or five or more records. We also compared set probabilities across race/ethnic groups.