We evaluated Asian racial/ethnic group identification inferred from surname and given name lists for 6 Asian American racial/ethnic groups (Asian Indian, Chinese, Filipino, Japanese, Korean, Vietnamese) against self-identified race/ethnicity, using electronic medical records from a large healthcare organization. We have 2 key findings. First, name identification for Asian subgroups is more complete and no less accurate when given names are considered in addition to surnames. Second, health characteristics are very similar between self-identified and name-classified Asian racial/ethnic subgroups.
Comparing different name-based algorithms using surnames alone and a combination of surnames and given names, we found that the completeness of the name lists as measured by sensitivity was moderate and the accuracy as measured by specificity was very high. The racial/ethnic subgroups had a spectrum of sensitivities (range: 0.45–0.78), with higher sensitivity for Vietnamese, Japanese, and Chinese. Filipinos had the lowest sensitivity. Specificities ranged from 0.97 to 1.00, meaning that names not associated with Asian individuals were not on the name lists, which was expected, given how the name lists were initially built. The Spanish surname list, a validated name list that has been frequently used in health research, had intermediate sensitivity compared with the Asian name lists, but lower specificity.
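As an illustration of how the sensitivity and specificity reported above are defined, the following minimal sketch computes both for a single subgroup, treating self-identification as the reference standard. The function name and the toy data are hypothetical and are not taken from the study.

```python
def sensitivity_specificity(self_identified, name_classified):
    """Return (sensitivity, specificity) for one subgroup.

    Both arguments are equal-length sequences of booleans, one entry per
    person: whether the person self-identifies with the subgroup, and
    whether the name list assigns the person to it.
    Assumes at least one person in each reference class.
    """
    pairs = list(zip(self_identified, name_classified))
    tp = sum(1 for truth, pred in pairs if truth and pred)
    fn = sum(1 for truth, pred in pairs if truth and not pred)
    tn = sum(1 for truth, pred in pairs if not truth and not pred)
    fp = sum(1 for truth, pred in pairs if not truth and pred)
    return tp / (tp + fn), tn / (tn + fp)


# Made-up toy data: 4 self-identified members, 6 non-members.
truth = [True, True, True, True, False, False, False, False, False, False]
pred = [True, True, True, False, False, False, False, False, False, True]
sens, spec = sensitivity_specificity(truth, pred)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Sensitivity here is the share of self-identified members that the name list captures (completeness); specificity is the share of non-members that the list correctly leaves out.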
We found that much was gained by combining given name and surname lists. For example, the sensitivities of some subgroups (eg, Asian Indians) nearly doubled when given names were added. As we expected, the name lists had higher sensitivities and specificities for men than for women, likely because of name changes following marriage. We found that sensitivities were also higher for persons aged 65 and older, suggesting more within-group marriage among older women and more distinctive given names among both men and women because of foreign birth.
Since the predictive values depend on the concentration of Asians in the population, we compared PPV and NPV for locations with different prevalences of Asian racial/ethnic subgroups. For locations with low Asian prevalence, the lists had moderate PPV (range: 0.31–0.56). However, as the concentration of Asians increased, the PPV rapidly rose. Locations with racial/ethnic subgroup prevalence similar to those of the state of California or the San Francisco Bay Area may be reasonably confident that persons identified as Asian by the name lists are highly likely to be Asian (SF Bay Area PPV: 0.70–0.87). When name identification can be conditioned on the broad race category for Asians, the PPV is very high for all Asian subgroups.
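The dependence of predictive values on prevalence follows from Bayes' rule. The sketch below shows the standard relationship; the sensitivity, specificity, and prevalence values are illustrative assumptions chosen to fall near the ranges discussed here, not estimates from this study.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for given test characteristics and prevalence.

    Standard Bayes' rule identities:
      PPV = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
      NPV = spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)
    """
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Hypothetical list with sensitivity 0.60 and specificity 0.99,
# applied to populations with increasing subgroup prevalence.
for prev in (0.01, 0.05, 0.20):
    ppv, npv = predictive_values(0.60, 0.99, prev)
    print(f"prevalence={prev:.2f}  PPV={ppv:.2f}  NPV={npv:.2f}")
```

Even with specificity held at 0.99, the PPV in this sketch climbs steeply as prevalence rises, because the small false-positive rate is applied to a shrinking share of non-members.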
Most importantly, we found that health characteristics for specific racial/ethnic groups were very similar when identification was by name list and by self-identification. This was true for both health outcomes we tested: obesity and hypertension.
Few attempts have been made to evaluate the performance of Asian name lists. Lauderdale and Kestenbaum compared the performance of their surname lists, built from US Social Security files, on a separate evaluation data set from the Census.11 The sensitivity of the Asian (unconditional) surname lists ranged from 0.24 to 0.68 with a PPV of 0.83 to 0.93. Quan et al validated Lauderdale’s Chinese surname list against a Canadian national health survey and found a sensitivity of 0.53 and PPV of 0.92 in a Canadian population with 1.6% Chinese.18 Our validation of the same Chinese surname list in a US population had higher sensitivity, but lower PPV. There are a few commercial hybrid methods that combine geocoding and surname analysis that have been validated on UK populations, but not on a US population,19 and new Bayesian surname and geocode methods demonstrate potential.20 To our knowledge, there has been no prior evaluation of combining given name and surname classification, and no prior study that compared health characteristics of name-identified and self-identified groups.
Our study has both strengths and limitations. The PAMF’s service area spans many of the San Francisco Bay Area counties. The San Francisco Bay Area, one of the most ethnically diverse metropolitan areas, is home to the largest population of Asians in the nation. The large underlying Asian population guarantees adequate representation of each Asian racial/ethnic subgroup, and makes it an ideal place to determine sensitivity and specificity of name lists in a real-world application. However, few areas of the country have as high an Asian prevalence as the San Francisco Bay Area, and name identification will be less accurate in communities with more typical racial/ethnic distributions. Additionally, this study capitalized on PAMF’s electronic medical record system. While all clinical data sources include patient names, access to identifiable records is often difficult or complicated for researchers. Another limitation of name list identification for Asian subgroups is that sensitivity varies across racial/ethnic groups, for reasons related to name characteristics. Asian Indian name identification is less complete because there is a large universe of names used in the linguistically heterogeneous South Asian subcontinent. The original name lists omitted names that had fewer than 5 occurrences in the derivation files at the Social Security Administration (for confidentiality reasons), and that reduced the sensitivity of the Asian Indian list. Finally, Korean surnames have a unique problem. A small number of surnames are extremely common in Korea, but some of them also occur among non-Asian populations (eg, Lee) and others are Chinese in origin (eg, Chang). These names are not specific enough to Koreans and cannot be used for name identification.
However, our comparisons of clinical outcomes demonstrate that even though some Asian subgroups are less completely identifiable by name, they appear to be just as representative of the entire group as those groups with more complete name-identification, such as Vietnamese.
This paper has shown that when clinical data sources have names but limited or no Asian race/ethnicity data, name lists may be used to infer specific Asian racial/ethnic subgroups. In these situations, clinicians and decision makers could use name lists to identify potential racial/ethnic disparities in disease or in healthcare receipt, or target specific populations to provide more culturally competent care. Using the inverse of the sensitivity estimates from this study as sample weights, the number of people identified by surname and given name can be adjusted to estimate the actual racial/ethnic population size.
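The inverse-sensitivity adjustment described above can be sketched as follows. This is a minimal illustration that assumes false positives are negligible (reasonable only where specificity is very high relative to prevalence); the function name and the counts are hypothetical.

```python
def estimate_subgroup_size(n_name_identified, sensitivity):
    """Estimate the true subgroup size from a name-identified count.

    Each name-identified person is given a weight of 1 / sensitivity,
    so the weighted count approximates the number of people who would
    self-identify with the subgroup. False positives are ignored here,
    which is an assumption of this sketch.
    """
    if not 0 < sensitivity <= 1:
        raise ValueError("sensitivity must be in (0, 1]")
    weight = 1.0 / sensitivity
    return n_name_identified * weight


# Hypothetical example: 450 patients matched by a list whose
# sensitivity is assumed to be 0.45.
estimated = estimate_subgroup_size(450, 0.45)
print(f"estimated subgroup size: {estimated:.0f}")
```

Because sensitivity differs by subgroup, sex, and age, a careful application would use the stratum-specific estimate that best matches the target population rather than a single overall value.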
We make several recommendations to users of these Asian subgroup name lists. First, organizations planning to use name lists to infer Asian subgroups should consider using the given name list together with the surname lists. For situations when the broad category of “Asian” is known, the combination of conditional surname and given name should be used with the known race information. Second, one should be aware of the differences in sensitivity across subgroups and sex when applying name lists to a target population. Name list identification is also more accurate and complete for older populations. Third, one should be attentive to the prevalence of Asian subgroups in the target population. Even though the specificities of the lists are high, the accuracy of name lists as measured through the PPV and NPV varies dramatically with the concentration of Asians in the target population. Finally, we hope our findings will lead to new studies of racial/ethnic health and healthcare disparities in areas with substantial concentrations of Asians, such as California, the San Francisco Bay Area, Los Angeles, New York, and Hawaii, and in data sources where there is Asian race information.