The large proportion of authors whose name appears in more than one manner and the direct correlation between productivity and variability suggest that the origin of the problem is in the lack of consistency on the part of authors themselves in signing their research articles. This behavior leads to serious inconsistencies in the DBs, whose managers and technicians apparently do nothing to curtail or correct the problem.
One of the principles used to improve the quality of bibliographic databases is to establish a single structure for each name that appears associated with different source documents. Our findings strongly suggest that the DBs we sampled do not apply any type of control measure to keep variants of the same name from proliferating. As a result, the quality of information retrieval based on searches by author is poor. The consequences for users of these DBs are clear: if users wish to locate all items published by a certain author by searching on the author field, they must perform at least two separate searches for half of the authors (at least, for the population of authors affiliated with the University of Granada medical school) and often many more, depending on how many variants there are of the authors' names. With browsing techniques, users would need to discover by trial and error which variants have been used for entries, among the many possible combinations (some clearly incorrect and unlikely to be guessed by a typical DB user) of first and second surnames and first and middle name initials.
We were surprised to find the highest rate of variability in the Spanish database, IME. In theory, familiarity with Spanish name structures on the part of the persons who manage this DB should have led to less variability in IME than in the two international DBs created and managed in the United States. The withdrawal in 1990 of the public funding on which IME depended for its maintenance and day-to-day operations may explain the drop in the quality of bibliographic control and may account for the poor performance we observed for items published between 1987 and 1996. In addition, IME enters information into its records just as it appears in the source documents and produces indexes only of Spanish publications. This process may be construed as evidence that Spanish authors—at least those who publish in biomedicine—are less careful about using the same name for all their publications when they submit articles to Spanish journals than when they submit them to “international” journals. Studies designed to investigate authors' behavior in signing different articles will be needed to shed light on this hypothesis.
Our analysis of the relationship between productivity and variability revealed a direct correlation between these two variables. In MEDLINE, the increase in variants as a given author published more items was less than in the other two DBs, a finding that might be related to the use of measures to control author names or at least to invert the order of first names and surnames correctly to respect the original name structure in Spanish.
Hence, greater productivity implies lower effectiveness of information retrieval, with a tendency for the number of items theoretically retrievable to approach one for highly productive authors. The consequences of this tendency are obvious: as an author produces more publications, the likelihood that all of them will be located by a single search with a single variant of the author's name—regardless of whether the variant tried is the correct one according to Spanish language usage or one of the several possible incorrect permutations—decreases sharply, as illustrated in .
For comparatively unproductive authors who have published only a few articles or perhaps only one, retrievability is better, but only if the search or browsing session is based on the same name structure used to index the item or items in the DB. This means that, regardless of productivity, retrievability cannot be assured even for Spanish authors indexed under only one name, which may or may not be correct according to Spanish usage. As several studies have already pointed out [48–50], information retrieval based on searching by author name will become optimal only when authority control measures are used to standardize entries, ensure their consistency across records (unification), and guarantee that “see” and “see also” cross-references are appropriately linked.
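The effect of authority control on author searches can be illustrated with a minimal sketch. It is not a description of how any of the three DBs is implemented. The authority file, the record identifiers, and the name variants below are hypothetical and are chosen only to show how a “see” reference from each variant to a single authorized heading lets one search retrieve every item.

```python
from collections import defaultdict

# Hypothetical authority file: every known variant is a "see" reference
# pointing to one authorized heading (all values are illustrative).
AUTHORITY = {
    "Ruiz Extremera A": "Ruiz Extremera A",
    "Extremera AR": "Ruiz Extremera A",
    "Ruiz A": "Ruiz Extremera A",
    "Ruizextremera A": "Ruiz Extremera A",
}

# Hypothetical records: (record id, author heading as indexed).
RECORDS = [
    (1, "Ruiz Extremera A"),
    (2, "Extremera AR"),
    (3, "Ruiz A"),
    (4, "Ruizextremera A"),
]


def authorized_heading(heading):
    """Return the authorized form, or the heading itself if it is unknown."""
    return AUTHORITY.get(heading, heading)


def build_index(records, unify):
    """Build an author index, optionally unifying variants on entry."""
    index = defaultdict(list)
    for record_id, heading in records:
        key = authorized_heading(heading) if unify else heading
        index[key].append(record_id)
    return dict(index)


def search(index, query, unify):
    """Look up a query, following the "see" reference when unify is True."""
    key = authorized_heading(query) if unify else query
    return index.get(key, [])


# Without authority control, one variant retrieves one record.
print(search(build_index(RECORDS, unify=False), "Extremera AR", unify=False))
# With authority control, any known variant retrieves all four records.
print(search(build_index(RECORDS, unify=True), "Extremera AR", unify=True))
```

In this sketch the same mapping is applied both when records are indexed and when the user's query is interpreted, which is what allows a search on any variant to reach the full set of items.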
Despite the trend toward loss of reliability in retrieval as the number of publications increases, , which is based on results obtained with SCI, shows that high productivity is not always associated with high variability in author name structures. Some authors have standardized their name structures throughout their publishing career by signing all their articles with the same “pen name.” (For some examples, see authors 11, 60, 116, and 167 in the appendix.) These authors, along with those for whom only one item is indexed, account for the subgroup of authors with no variants. At the opposite extreme are authors with complex names who have not adopted a permanent pen name; under these circumstances many variants can arise. (For examples, see authors 18, 56, 68, 77, 85, and 149 in the appendix.) Interestingly, some highly productive authors in the present sample appear to have used a pen name more systematically for some journal submissions than for others. This transitory consistency is reflected in the numbers of variants found in SCI for some cases. (For examples, see authors 20, 92, 128, 132, 150, and 172 in the appendix.) This may reflect the fact that most source items indexed by SCI are from journals published in English-speaking countries, and many journals may impose English-language conventions on the structure of foreign authors' names. In any case, it appears that some Spanish authors with publications in international journals consider the country where the journal is published when they place their name on the title page of the manuscript.
An analysis of the most frequent variants in each DB suggests some answers to some of the questions raised above. In overall terms, four main variants account for a large percentage of occurrences of author names (). These four variants are all derived from the two name structures that are currently the most common in Spain and that together represent the full legal names of 83.7% of the authors in our sample (). However, to understand these data, it is necessary to backtrack and deduce how these variants came about as a result of the indexing rules or criteria used by each DB.
In IME and MEDLINE, the most frequent and fourth-most frequent variants are the correct, standardized forms of authors' names according to national (RCE) and international (IFLA) cataloging rules; the only departure from these guidelines is that in both cases initials are used for the first and middle names. IME uses the RCE criteria (summarized in ), and MEDLINE, produced by the National Library of Medicine (NLM), uses the NLM Cataloging System [51]. This latter system is compliant with the Program for Cooperative Cataloging (PCC) developed by the Library of Congress and the Name Authority Cooperative Organization (NACO) [52], whose base standards are the criteria established by the second edition of the Anglo-American Cataloging Rules (AACR2) [53] for the formation of headings for persons and the specifications of the MARC format for type of personal name entry element [54].
The second-most and third-most frequently seen variants in IME and MEDLINE, as well as the remaining variants shown in , can be derived from the base standards cited above, although these variants do not represent correct names. We can therefore deduce that the indexing criteria used by IME and MEDLINE follow Spanish linguistic practices, although they seemingly fail to apply mechanisms such as checking against an authority file to ensure that the same form of the author's name is indexed consistently and continuously throughout the life of the DB. Such systematization measures would greatly improve the retrievability of information by ensuring that all works linked to the same author are located in a single search.
In SCI, however, the most frequent variants are the result of a specific indexing criterion used by this DB: “the general rule is that the final name presented is taken as the surname—this applies to all languages. All other names presented are processed as initials” [55]. This general rule is compatible with the basic criterion of the AACR2 for the standard structure (name and surname) of names in English. On the basis of this general rule, ISI always considers the last part of the name given in the source document to be the only indexable surname for that author and thus uses this part as the entry element. The remaining parts of the name are reduced to initials; for example, José María Bermúdez García becomes García, JMB. ISI makes one exception, which applies to all languages: particles that link the first name with the surname are treated as part of the surname: “Particles are included as part of the surname. There is a list of accepted particles that is applied to all languages” [56]. For example, Juan Luis del Arbol becomes Delarbol, JL. A specific rule used by SCI for Spanish names further confirms that these names are often mutilated by its indexing policy: “Compound names joined by ‘y’ or ‘e’ are split so that the last name presented is processed as the surname, and the conjunction is taken as an initial” [57]. For example, María González y Rodriguez becomes Rodriguez, MGY.
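The three SCI rules quoted above can be sketched as a short function. This is only an illustration of the rules as they are quoted here, not ISI's actual software. The function name `sci_entry` is hypothetical, and the particle list is an assumption, because the accepted list used by ISI is not reproduced in the source.

```python
# Assumed particle list (hypothetical): the accepted list used by ISI
# is not given in the source.
PARTICLES = {"de", "del", "la", "las", "los"}


def sci_entry(name):
    """Derive an SCI-style entry from a name as printed in the source item."""
    tokens = name.split()
    # General rule: the final name presented is taken as the surname.
    surname = tokens[-1]
    rest = tokens[:-1]
    # Exception: particles directly before the surname become part of it.
    particles = []
    while rest and rest[-1].lower() in PARTICLES:
        particles.insert(0, rest.pop())
    if particles:
        # Run particle and surname together, as in "del Arbol" -> "Delarbol".
        surname = ("".join(particles) + surname).capitalize()
    # All other names, including the conjunctions "y" and "e",
    # are processed as initials.
    initials = "".join(token[0].upper() for token in rest)
    return f"{surname}, {initials}"


for printed_name in (
    "José María Bermúdez García",
    "Juan Luis del Arbol",
    "María González y Rodriguez",
):
    print(printed_name, "->", sci_entry(printed_name))
```

Applied to the three examples in the text, the sketch yields García, JMB; Delarbol, JL; and Rodriguez, MGY. In each case the first surname, which is the main entry element in Spanish usage, survives only as an initial or is lost.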
According to these indexing criteria, and considering that the most common structure of Spanish names is “first name (middle name if present) first surname second surname,” the most common variants in SCI would be expected to be “second surname, first name initial first surname initial” (e.g., Angeles Ruiz Extremera becomes Extremera, AR) and “second surname, first name initial (middle name initial) first surname initial” (e.g., María Estrella Ruiz Requena would be expected to be indexed in SCI as Requena, MER). However, these variants actually occupy the fourth and sixth positions in descending order of frequency.
We found that the variants produced by authors who presumably adapted their pen name to English-language conventions were more frequent in this DB than the “standard” entry structure derived from applying SCI's indexing criteria to Spanish names published according to normal Spanish-language conventions. For example, authors whose name appeared as “first name first surname” were indexed under entries structured as “first surname, first name initial,” the most frequent variant in SCI. Authors who signed their articles as “first name middle name first surname” accordingly were indexed under “first surname, first name initial middle name initial”—the third most frequent variant in SCI. If authors joined their two surnames with a hyphen (first name, first surname-second surname), they were indexed as first surname run together with second surname, first name initial—the second most frequent variant in SCI.
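The way different signing habits multiply SCI entries for a single author can be shown with a small sketch. The signatures below are hypothetical forms of one name used as an example earlier in the text, not signatures observed in the sample. The function `sci_entry` is a simplified, hypothetical rendering of the SCI general rule, and the treatment of hyphenated surnames (run together, with only the first letter capitalized) is an assumption based on the description above.

```python
def sci_entry(signature):
    """Simplified SCI-style entry: last element is the surname, rest are initials."""
    *given, last = signature.split()
    # Assumption: a hyphenated surname is run together into a single element.
    surname = last.replace("-", "").capitalize()
    initials = "".join(part[0].upper() for part in given)
    return f"{surname}, {initials}"


# Hypothetical signatures that one author might use in different journals.
signatures = [
    "Angeles Ruiz Extremera",   # full Spanish form
    "Angeles Ruiz",             # adapted form: first name and first surname
    "Angeles Ruiz-Extremera",   # surnames joined with a hyphen
]

entries = {signature: sci_entry(signature) for signature in signatures}
for signature, entry in entries.items():
    print(f"{signature} -> {entry}")

# Three ways of signing yield three distinct index entries.
print(len(set(entries.values())))
```

Under these assumptions the three signatures are indexed as Extremera, AR; Ruiz, A; and Ruizextremera, A, so a user would need three separate searches to retrieve all items by the same person.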
Why are these variants found in SCI but not in MEDLINE? Both DBs are produced in the United States and may be assumed to adapt Spanish names in a similar manner. The explanation may lie in the fact that most of the items by Spanish authors in MEDLINE (about 70%) are from the thirty-four Spanish journals that this DB indexes, whereas nearly all items by Spanish authors in SCI are from journals published in English-speaking countries. This difference appears to be the result of two factors. First, author names in MEDLINE are spared any attempt to adapt them to English linguistic conventions and are indexed correctly. Second, for articles in journals covered in SCI, authors may have been more careful to adapt their names to English linguistic conventions, either spontaneously or to comply with the journal's instructions to authors.