Purpose: To advance understanding of linkage error in U.S. maternally linked datasets, and how the error may affect results of studies based on the linked data.
Methods: North Carolina birth and fetal death records for 1988-1997 were maternally linked (n=1,030,029). The maternal set probability, defined as the probability that all records assigned to the same maternal set do in fact represent events to the same woman, was used to assess differential maternal linkage error across race/ethnic groups.
Results: Maternal set probabilities were lower for records specifying Asian or Hispanic race/ethnicity, suggesting greater maternal linkage error. The lower probabilities for Hispanics were concentrated in women of Mexican origin who were not born in the United States.
Conclusions: Differential maternal linkage error may be a source of bias in studies using U.S. maternally linked datasets to make comparisons between Hispanics and other groups or among Hispanic subgroups. Methods to quantify and adjust for this potential bias are needed.
With the recent emphasis on understanding and preventing the recurrence of adverse birth outcomes (1), improving epidemiologic and statistical methods in perinatal epidemiology (2-4), and taking a life-course approach to maternal and child health (5), investigators are increasingly using maternally linked datasets to study perinatal health (6). These datasets consist of state birth and fetal death records covering several consecutive years. They include a variable (the maternal set number; see Appendix A for a glossary of terms that appear in italics) that groups the records into maternal sets, each set representing the events (births and fetal deaths) that occurred to a particular woman during those years (7, 8). Analysis of the data is based on the assumption that 1) all of the records in a maternal set are for the same woman and 2) all of her records are in that one maternal set.
Assignment of records to maternal sets is the special feature of maternally linked datasets. It allows the investigator to study successive pregnancies to the same woman (9), which otherwise would not be possible using U.S. vital records data. At least 35 studies using maternally linked datasets from eight U.S. states have been published since development of these datasets began in the mid-1990s. These studies have advanced understanding in several areas of perinatal health, including adverse birth outcomes (6, 10-12), maternal behaviors (13), the quality of vital records data (14, 15), and the quality of maternally linked data (16, 17).
In some European countries, maternally linked data sets have been created using the national registration number to identify records that represent different births to the same woman (18-20). This approach works because the national registration number is a unique identifier that is included in the birth records. In the United States, however, birth and fetal death records do not include a unique identifier. Instead, assignment of records to maternal sets is based on similarities and differences across records in names, birth dates, and other personal identifying information. This process, called maternal record linkage, is subject to error (maternal linkage error) in that records that do not refer to the same woman may nevertheless be assigned to the same maternal set, and records that do refer to the same woman may be assigned to different maternal sets (17).
Despite the increasing reliance on maternally linked datasets in perinatal health research, very little is known about maternal linkage error and how it affects results of analysis of the linked data. This is due in part to a lack of methods, including quantitative measures, for assessing and evaluating maternal linkage error (21). The purpose of this study was to advance understanding of maternal linkage error and how it affects results of studies based on maternally linked data. Our specific objective was to examine differential maternal linkage error across population subgroups. We first adapted the Fellegi and Sunter (22) method of probabilistic record linkage to obtain a quantitative indicator of potential linkage error. We then compared distributions of this indicator across selected population subgroups. Finally, we discuss the implications of our findings for studies based on maternally linked datasets. This continues a line of research whose overall goal is to develop quantitative techniques for adjusting analysis of maternally linked records for potential bias from maternal linkage error (17).
The data for this study were North Carolina birth and fetal death records for 1988–1997. Maternal race/ethnicity subgroups were defined by the data contained in the vital records. Race was listed as white, black, American Indian, or Asian. A separate variable indicated whether the mother was of Hispanic ethnicity. Hispanic ethnicity was subcategorized as Cuban, Puerto Rican, Central/South American, or Mexican. This study was approved by the Institutional Review Board of the University of North Carolina at Chapel Hill.
Our approach to maternal linkage error was guided by the purpose behind linking the data, namely, statistical analysis of the linked records. This approach suggests that maternal record linkage should be considered analogous to the data collection phase of a study. In this view, the goal of maternal record linkage should be 1) to produce a sample of maternal sets that is representative of the true maternal sets in the population (as opposed to reproducing the true maternal sets themselves) and 2) to enable computation of statistical measures of the representativeness of the maternal sets that can be incorporated into analysis of the data, as is customary with other types of epidemiologic samples. It follows that the assignment of records to maternal sets should be based on the underlying probability distributions of the relevant population parameters.
A second notion that follows from considering the purpose of maternally linked data is that maternal linkage error should be conceptualized as misclassification of maternal sets (17). This contrasts with previous approaches that defined maternal linkage error as misclassification of pairs of records (21).
Integrating the above concepts, we used maternal set probability as our measure of potential linkage error. Maternal set probability was defined as the probability that all of the records in a set refer to the same woman. It was calculated as the mean of the pairwise match probabilities of all possible pairs of records in the set. This is explained in detail below.
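The definition above can be illustrated with a minimal Python sketch. The record labels, the `pair_prob` lookup, and the probability values are hypothetical and serve only to show the arithmetic; they are not taken from the study data.

```python
from itertools import combinations
from statistics import mean


def maternal_set_probability(records, pair_prob):
    """Mean pairwise match probability over all pairs of records in a set.

    records   -- iterable of record identifiers (hypothetical labels here)
    pair_prob -- dict mapping frozenset({rec_a, rec_b}) to a match probability
    """
    pairs = list(combinations(sorted(records), 2))
    if not pairs:
        raise ValueError("a maternal set needs at least two records")
    return mean(pair_prob[frozenset(pair)] for pair in pairs)


# Illustrative pairwise match probabilities for three records.
pair_prob = {
    frozenset({"A", "B"}): 0.99,
    frozenset({"A", "C"}): 0.90,
    frozenset({"B", "C"}): 0.81,
}

two_record = maternal_set_probability(["A", "B"], pair_prob)
three_record = maternal_set_probability(["A", "B", "C"], pair_prob)
```

For a two-record set the set probability is simply the pairwise probability; for the three-record set it is the mean of the three pairwise values.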
The Fellegi and Sunter (22) formulas for probabilistic record linkage satisfy the above criterion of being based on underlying probability distributions. However, for linking multiple records within a single file (as opposed to linking records between files), all of the available applications of the Fellegi and Sunter formulas employ practical adaptations that depart from the underlying probability distributions. We therefore used a linkage method that combined the Fellegi and Sunter formulas with probability sampling. Specifically, we calculated the maternal set probability for all possible maternal sets (with restrictions, as described below), and then we selected from the possible maternal sets based on the maternal set probability.
This process had three stages. First, we calculated match probabilities for pairs of records using the Fellegi and Sunter (22) formulas for probabilistic record linkage (using LinkSolv 4.0, Strategic Matching, Inc., Morrisonville, NY). The Fellegi and Sunter approach is based on the assumption that two records that have the same identifying information are more likely to refer to the same individual if that combination of names, birth date, etc. is rare (relative to the data) than if it is common, and similarly, two records that have differing identifying information are less likely to refer to the same individual if the information is rare than if it is common. To calculate pairwise match probabilities, the identifying variables are compared across each pair of records. Variables that have the same value on the two records contribute positively to the pairwise probability, and variables that have differing values contribute negatively to the pairwise probability, both weighted such that rare values have greater weight.
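The weighting logic described above can be sketched as follows. This is a textbook-style illustration, not the computation performed by LinkSolv: it assumes conditional independence of the compared fields, a hypothetical prior match probability, and hypothetical values of m (probability that a field agrees when two records refer to the same woman) and u (probability that it agrees by chance, here standing in for how common the value is).

```python
import math


def field_weight(agree, m, u):
    """Log2 likelihood ratio contributed by one compared field.

    Agreement contributes log2(m / u), which is larger when the value is
    rare (small u). Disagreement contributes log2((1 - m) / (1 - u)),
    which is negative whenever m > u.
    """
    if agree:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def match_probability(comparisons, prior):
    """Posterior match probability from (agree, m, u) field comparisons."""
    total_weight = sum(field_weight(a, m, u) for a, m, u in comparisons)
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * 2 ** total_weight
    return posterior_odds / (1 + posterior_odds)


# Hypothetical values: agreement on a rare surname versus a common one.
rare_agree = match_probability([(True, 0.99, 0.0001)], prior=0.001)
common_agree = match_probability([(True, 0.99, 0.01)], prior=0.001)
disagree_weight = field_weight(False, 0.99, 0.01)
```

Under these assumed inputs, agreement on the rare value yields a higher match probability than agreement on the common value, and disagreement carries a negative weight.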
As stated above, the available linkage software departs from the underlying probability distributions when applying the Fellegi and Sunter formulas to linking records within a single file. Therefore, to obtain the pairwise match probabilities, we divided the records into separate files and calculated the pairwise probabilities across files; pairwise probabilities were undefined for records within the same file. This effectively made records in the same file ineligible to be linked to each other, because the linkages were based on the pairwise probabilities (see below). Accordingly, the assignment of records to the separate files was based on the assumption that no two records for a particular woman would have the same birth order (or fetal death order) and substantially different personal identifying information (names and birth dates). This assumption is conservative in that it allows for errors or changes in the data across a woman’s events (e.g., change in name after divorce or marriage or two records erroneously indicating the same birth order) except for gross errors that would be unlikely to result in linkage in any case. In practice, records were initially organized into separate files based on birth or fetal death order. Records within each file that had similar identifying information were then separated among the files so that they were eligible to be linked with each other. This approach further implied that a particular record could be correctly linked to at most one other record in any one file. We refer to these files below as linkage files.
Using the above approach, we calculated pairwise match probabilities for all possible pairs of records between all possible pairs of linkage files. This made each record eligible to be linked to each of the other records in all of the other files. Our matching strategy for the pairwise probabilities is given in Appendix B.
The second stage of the linkage process involved creating a universe of all possible maternal sets and calculating their maternal set probabilities. To reduce errors as well as computation time, potential sets whose probabilities were 10% or less were excluded from the universe of possible sets.
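For the simplest case of two linkage files, the construction of the universe and the 10% exclusion rule might look like the sketch below. The file contents, record labels, and probabilities are hypothetical, and treating a missing pairwise probability as "no candidate set" is an assumption of this illustration.

```python
from itertools import product


def candidate_pairs(file_a, file_b, pair_prob, threshold=0.10):
    """Universe of two-record candidate sets between two linkage files.

    Candidate sets whose probability is at or below the threshold are
    excluded, as are pairs with no defined pairwise probability.
    """
    universe = {}
    for rec_a, rec_b in product(file_a, file_b):
        key = frozenset({rec_a, rec_b})
        prob = pair_prob.get(key)
        if prob is not None and prob > threshold:
            universe[key] = prob
    return universe


# Hypothetical records: file 1 holds first births, file 2 second births.
file_1 = ["a1", "a2"]
file_2 = ["b1", "b2"]
pair_prob = {
    frozenset({"a1", "b1"}): 0.98,
    frozenset({"a1", "b2"}): 0.05,   # dropped: at or below 10%
    frozenset({"a2", "b1"}): 0.10,   # dropped: exactly 10%
    frozenset({"a2", "b2"}): 0.60,
}

universe = candidate_pairs(file_1, file_2, pair_prob)
```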
The third stage involved selecting maternal sets at random proportional to the maternal set probability from the universe of possible sets. In the universe of possible sets, any one record was contained in numerous sets. In stage 3, when a particular set was selected, all other possible sets containing any of those records became ineligible for subsequent selection. Thus, in the selected sample, any particular record was included in one maternal set only.
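A minimal sketch of this kind of selection is given below. It is not the authors' random selection program: the use of sequential weighted draws and the stopping rule (drawing until no eligible set remains) are assumptions made for illustration, and the record labels and probabilities are hypothetical.

```python
import random


def select_sets(universe, rng):
    """Draw candidate sets with probability proportional to set probability.

    universe -- dict mapping frozenset of record ids to set probability
    rng      -- a random.Random instance (seeded for reproducibility)

    After each draw, every candidate sharing a record with the chosen set
    (including the chosen set itself) becomes ineligible.
    """
    eligible = dict(universe)
    selected = {}
    while eligible:
        sets = list(eligible)
        weights = [eligible[s] for s in sets]
        chosen = rng.choices(sets, weights=weights, k=1)[0]
        selected[chosen] = eligible[chosen]
        eligible = {s: p for s, p in eligible.items() if not (s & chosen)}
    return selected


# Hypothetical universe in which record "b1" appears in two candidate sets.
universe = {
    frozenset({"a1", "b1"}): 0.98,
    frozenset({"a2", "b1"}): 0.30,
    frozenset({"a2", "b2"}): 0.60,
}
selected = select_sets(universe, random.Random(0))
```

Whatever the random draws, no record can appear in more than one selected set, which is the property the paragraph above describes.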
Stages 2 and 3 were applied, in sequence, to the linkage files in a stepwise fashion; that is, first two files were processed through stage 2 and then stage 3, then a third file was processed through stage 2 and stage 3, then a fourth file, and so on for all of the files. Once a particular set was selected, those records remained together as a set throughout subsequent iterations, although additional records could be added to the set. In this way, the final maternal sets were built up one file at a time. This process is explained further below and illustrated in Table 1.
In the initial iteration of this process, the universe of possible maternal sets (stage 2) consisted of all pairs of records between two of the linkage files. In this initial iteration only, the pairwise probabilities calculated in stage 1 constituted the maternal set probabilities for the possible maternal sets. Implementing stage 3, some of these pairs of records were selected into the sample of maternal sets for that iteration.
In the second iteration, the universe of possible maternal sets included the sets selected in the previous iteration as well as the individual records from the two initial files that were not included in the selected sets. The universe of possible maternal sets for the second iteration was constructed by combining each of these units (i.e., two-record sets selected in the previous iteration and single, unselected records) with each of the records in a third linkage file. Thus, some of the possible maternal sets in the second iteration contained three records and some contained two records. The maternal set probability was calculated for each possible set as the mean of the pairwise match probabilities (from stage 1) of all possible pairs of records in the set. Implementing stage 3, some of the three-record sets and some of the two-record sets were selected into the sample of maternal sets for that iteration.
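The construction of the second-iteration universe can be sketched as follows, combining each unit carried forward (a previously selected set or an unselected single record) with each record in the next linkage file. The labels and probabilities are hypothetical, and skipping candidates that contain a pair with no defined pairwise probability is an assumption of this sketch.

```python
from itertools import combinations, product
from statistics import mean


def extend_universe(units, new_file, pair_prob, threshold=0.10):
    """Combine carried-forward units with each record of a new linkage file.

    units     -- iterable of frozensets (selected sets or single records)
    new_file  -- iterable of record ids from the next linkage file
    pair_prob -- dict mapping frozenset({rec_a, rec_b}) to match probability
    """
    universe = {}
    for unit, rec in product(units, new_file):
        candidate = frozenset(unit) | {rec}
        pairs = [frozenset(p) for p in combinations(sorted(candidate), 2)]
        if any(p not in pair_prob for p in pairs):
            continue  # a pairwise probability is undefined; skip (assumption)
        prob = mean(pair_prob[p] for p in pairs)
        if prob > threshold:
            universe[candidate] = prob
    return universe


# Hypothetical state after the first iteration: one selected pair and one
# unselected single record, to be combined with a third linkage file.
units = [frozenset({"a1", "b1"}), frozenset({"a2"})]
file_3 = ["c1"]
pair_prob = {
    frozenset({"a1", "b1"}): 0.99,
    frozenset({"a1", "c1"}): 0.97,
    frozenset({"b1", "c1"}): 0.95,
    frozenset({"a2", "c1"}): 0.20,
}

second_universe = extend_universe(units, file_3, pair_prob)
```

Here the universe contains one three-record candidate, whose probability is the mean of its three pairwise values, and one two-record candidate.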
The third and subsequent iterations followed the pattern of the second iteration. The universe of possible sets was constructed by combining the selected maternal sets from previous iterations and the single records that were not included in any selected maternal set with the records in another linkage file. For example, the universe of possible sets for the fourth iteration included the selected sample from the third iteration that contained four-record, three-record, and two-record sets; three-record and two-record sets that were selected in the second iteration but not in the third iteration; two-record sets that were selected in the first iteration but not in the second or third iterations; and the single records from all of the previous linkage files that were not included in any of the previously selected maternal sets. Each of these sets and single records brought forward from the third iteration was combined with each of the records from another linkage file to form the universe of possible sets for the fourth iteration. The maternal set probabilities were calculated, and the sample of maternal sets was reselected.
The linkage files were entered into the above process according to their nominal birth order. The final maternally linked dataset consisted of the sets selected in the final iteration and the individual records that were not included in the selected sets. Thus, the final dataset included all of the records from the original ten-year birth and fetal death cohort.
To assess differential linkage error, we compared the distributions of maternal set probability among selected population groups. Because the distributions were highly skewed, with most values concentrated near 1, we based the comparisons on the proportion of sets in each group for which the set probability was ≥ 0.99. Lower proportions were taken as indicating more linkage error.
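The comparison measure can be computed as in the short sketch below. The group labels and set probabilities are hypothetical and do not reproduce the study's results.

```python
from collections import defaultdict


def share_high_probability(sets, cutoff=0.99):
    """Proportion of maternal sets per group with probability >= cutoff.

    sets -- iterable of (group_label, set_probability) tuples
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [n_high, n_total]
    for group, prob in sets:
        counts[group][1] += 1
        if prob >= cutoff:
            counts[group][0] += 1
    return {group: high / total for group, (high, total) in counts.items()}


# Hypothetical maternal sets labeled with two illustrative groups.
example_sets = [
    ("group_x", 0.995), ("group_x", 0.999), ("group_x", 0.99), ("group_x", 0.80),
    ("group_y", 0.999), ("group_y", 0.95),
]
shares = share_high_probability(example_sets)
```

In this illustration the lower share for the second group would be read as indicating more potential linkage error.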
We postulated that maternal linkage error was more likely to occur for records representing fetal deaths or teenage mothers (maternal age < 20 years) and for large maternal sets. Consequently, we expected maternal set probabilities to be lower in sets that contained at least one fetal death record, at least one teen birth or teen fetal death record, or five or more records. We also compared set probabilities across race/ethnic groups.
The dataset consisted of 1,021,008 birth records and 9,021 fetal death records from events that occurred in North Carolina in 1988–1997. The linkage produced 241,783 maternal sets of two or more records (Table 2), 80 percent of which consisted of two records only (Table 3). The linked dataset was almost identical with one produced previously by a traditional Fellegi and Sunter (22) probabilistic record linkage (using AutoMatch) conducted on the same data (17); 96% of the maternal sets in the earlier study were reproduced exactly in the present study.
Overall, 97% of sets had a maternal set probability ≥ 0.99. Set probabilities were somewhat reduced for large sets and for sets that included a fetal death record. However, contrary to our prediction, sets that included a teen birth had only slightly reduced probabilities. Among the major race/ethnic groups that were indicated on the records, set probabilities were markedly lower among sets that included a record indicating Asian or Hispanic race/ethnicity (see Table 4).
To further elucidate the lower probabilities among sets containing Hispanic records, we compared the four Hispanic subgroups indicated on the records. The reduced probabilities were concentrated in sets that contained a record indicating Mexican origin of the mother. Within this group, sets that contained a record indicating non-U.S. birth of the mother had reduced probabilities, whereas sets that contained a record indicating U.S. birth of the mother did not (see Table 4).
Evaluations of maternal linkage error in the literature are sparse. Nitsch et al. (23) reported the number of maternally linked birth records that contradicted maternal histories obtained by questionnaire. Leiss (17) used measures of logical consistency between pairs of linked records as indicators of whether the two records referred to the same woman. These approaches are inadequate because they measure maternal linkage error as misclassification of individual records. In contrast, the purpose of maternally linked data is to study the clustering of phenomena within women (1) or to adjust for that clustering when it is not of interest in itself (3). It follows that maternal linkage error should be measured as misclassification of maternal sets. Leiss (17) proposed a method for measuring maternal linkage error in this way; however, it requires a gold standard comparison file, which is rarely available for maternally linked datasets.
Two previous studies assessed maternal linkage error as misclassification of maternal sets. Adams et al. (16) reported the proportion of maternal sets for which the number of records in the set equaled the number of events indicated on the records in the set. This proportion (86% overall) was greater for white race, more education, being married, smaller sets, having the same father listed on all records in the set, and having no fetal deaths or infant deaths in the set. Most of the differences were small. Croft et al. (24) identified maternal sets within which there were discrepancies in the mother’s names, birth date, age, time between births, parity, and other data.
In the present method, assignment of records to maternal sets is based on the probability that all of the records in the set refer to the same woman, i.e., the probability that the set is correctly classified. Differential distributions of this probability were used as indicators of differential maternal linkage error. We found that the distribution of maternal set probability was shifted to the left for maternal sets that included a record indicating Hispanic ethnicity compared to sets that did not include any Hispanic record. By successively examining constituent subgroups of the group with reduced probabilities, we determined that it was one particular subgroup among Hispanics—Mexican immigrants—for whom maternal set probabilities were lower compared to other Hispanics and to the overall population.
Analogous to other applications of probability theory in epidemiology, maternal set probabilities quantify the degree of certainty that can be ascribed to the composition of the sets. Our results suggest that a high degree of certainty characterizes the maternal sets for the overall population, for women who had an event as a teen, and for whites, blacks, and American Indians. The reduced probabilities for Mexican immigrants indicate greater uncertainty for the composition of these sets. This may reflect greater maternal linkage error for this population. It is possible, however, that it reflects lower quality information for determining the linkages (e.g., inconsistency in the spelling of a woman’s name across records for her different births) but that the correct linkages were nevertheless achieved. Further research is needed to disaggregate the contributions of linkage error and correct low quality linkages to reduced maternal set probabilities.
The probabilistic sampling method used in this study is intended to yield a sample of maternal sets that is representative of the true maternal sets in the population. Within this framework, linkage error is properly conceptualized as departure from representativeness. The relation of error as departure from representativeness to error as departure from the true linkages, which is the approach customarily taken in the record linkage literature (16, 17), requires further study. The high congruence of the maternal sets produced in the present study with those produced by a traditional Fellegi and Sunter linkage of the same data (17) suggests that the present method produces the true maternal sets as well as a method designed for that purpose.
Potential bias as indicated by reduced maternal set probabilities is of concern in at least three types of studies (21). One is direct comparisons with the subgroup that has excess error. Several studies, not using maternally linked data, have found differences in perinatal outcomes between U.S.- and non-U.S.-born Mexican women in the United States (25). Use of maternally linked data could potentially advance understanding of these differences (26-28), but differential linkage error could bias the intergroup comparisons.
Second, potential bias is a concern when a subgroup with excess linkage error comprises a substantial proportion of a larger group in the analysis. Births to non-U.S.-born Mexican women comprised 80% of all Mexican births and 53% of all Hispanic births in the maternally linked dataset. Several studies, not using maternally linked data, have compared perinatal outcomes among Mexican and other Hispanic subgroups (29, 30) and among Hispanics and other population groups (31, 32). Similarly, teen births to non-U.S.-born Mexicans accounted for 60% of Hispanic teen births in the maternally linked dataset, and differences in teen birth outcomes among Hispanics and other population groups are an active area of research (33, 34). Again, use of maternally linked data to advance understanding of these issues (26-28) would potentially be biased by differential linkage error.
Finally, maternally linked data are increasingly being used for follow-up studies, including studying recurrence of perinatal outcomes (6, 9, 10, 12, 13, 35, 36). Differences in recurrence by Hispanic ethnicity have not been studied to date, but this is a logical next step in this emerging area of research (37). Differential maternal linkage error would be of particular concern as a source of bias in these types of studies because the maternal set is the unit of analysis.
An important question following from the present study is whether our finding of greater uncertainty for maternal sets of Mexican immigrants is true for the other U.S. maternally linked datasets. The several existing datasets were developed using a variety of methods, all of them different from the method used in the present study. None of the previous methods produced statistics for quantifying maternal linkage error. Nevertheless, it is possible to make tentative generalizations from our findings to the other datasets because of a fundamental similarity shared by all of the methods used to create them. In all of the methods, the composition of the maternal sets is determined by agreement across records on a limited number of variables (7, 8, 12, 13, 17, 38, 39). Because maternal names and birth date predominate in determining the linkages, the various methods can be expected to produce similar maternal sets. Therefore, in datasets that contain a substantial number of births to Mexican immigrants, bias of the type we found seems likely.
More generally, our results suggest that some population groups may be characterized by elevated maternal linkage error. Because the race/ethnic composition of the maternal population varies by state (40), the particular population groups affected may be different for different maternally linked datasets. It would be useful to determine which other population groups (if any) are likely to have reduced maternal set probabilities, whether this is a general characteristic of immigrant populations, or whether it is limited to Mexican immigrants.
Maternal record linkage has generally been conducted with the “competing goals of linking as many records as possible and avoiding incorrect linkages” (16, p. 91). The linkage method used in this paper resolves this conflict into the unified goal of creating a study population of maternal sets that represents the birth and fetal death experience of the source population. As with other methods of assembling data for epidemiologic research, the primary goal in creating a maternally linked dataset for research should be a sample in which the error is nondifferential and can be quantified using the underlying probability distributions; avoiding linkage errors should be a secondary goal.
The primary outstanding methodological question for epidemiologic studies using maternally linked data is whether, as our results suggest, there is important differential maternal linkage error across population groups. Future research should focus on developing methods to quantify differential error in existing maternally linked datasets, account for bias from differential error in studies using maternally linked data, and reduce differential error in the linkage process.
Enthusiasm for the potential of maternally linked data to lead to improvements in perinatal health has not been matched by methodological rigor in developing maternally linked datasets. Our findings support the validity of extant studies that were based on maternally linked data, primarily because those studies did not examine ethnic differences. Looking to a future in which maternally linked data will play an increasingly prominent role in perinatal health research, it will be important to understand weaknesses in the data and how these might be influencing results. This will allow realization of the full potential of this powerful tool for improving the health of mothers and infants (1, 27).
This research was supported in part by contract GS-10F-0351K from the Centers for Disease Control and Prevention and grants HD35785 from the National Institute of Child Health and Human Development, CA88757 from the National Cancer Institute, and UR6/CCU417428 from the National Center for Health Statistics. The authors thank Shawn Harris for developing the random selection algorithm program, Abenah Vanderpuije for programming assistance, Dr. Patrick Crockett for statistical advice, and Dr. Russell Kirby for comments on an earlier version of the manuscript. Dr. Ming Yin contributed significantly to the development of the linkage method.
Maternal linkage error: Erroneous assignment of records to maternal sets such that records that do not refer to the same woman are assigned to the same set, or records that do refer to the same woman are assigned to different sets.
Maternal record linkage: The process of assigning birth and fetal death records to maternal sets. Assignment is based on similarities and differences across records in names, birth dates, and other personal identifying information.
Maternal set number: A variable indicating which maternal set a particular record belongs to. If there is no maternal linkage error, then each value of this variable represents events to one woman only and no two values represent events to the same woman.
Maternal set probability: The probability that all of the records in a set are from the same woman. Maternal set probability is calculated as the mean of the Fellegi and Sunter pairwise match probabilities of all possible pairs of records in the set.
Maternal set: A group of birth and fetal death records designated as representing all of the events that occurred to one woman. If there is maternal linkage error, then a maternal set may represent events to more than one woman and one woman’s events may be divided among two or more maternal sets. The term maternal set is preferred to sibship because if there is maternal linkage error, the offspring are not siblings and because siblings who have the same father and different mothers would, in the absence of maternal linkage error, be assigned to different maternal sets.
Software: LinkSolv version 4.0. Join variables (blocking variables)—four passes: mother’s maiden name; mother’s birth date; mother’s state of birth combined with mother’s first name; father’s last name.
|Probability of error||0.001||0.01||0.01||0.01||0.01||0.01||0.01||0.01|
|Weighting factor for|
|Weighting factor for|