EDGAR begins by assigning an underspecified syntactic parse to each sentence in the abstract under consideration. All subsequent analysis depends on this structure. The natural language processing tools include a stochastic tagger [Cutting, et al., 1992], which resolves part-of-speech ambiguities in support of the underspecified parser [Aronson, et al., 1994]. As shown in the analysis (4) for the sentence in (3), the syntactic structure is underspecified in the sense that, although low-level constituents (notably noun phrases) are identified, they are not attached in a fully-specified parse tree.
3) “This effect of cyclosporin A or herbimycin A on the down-regulation of ERCC-1 correlates with enhanced cytotoxicity of cisplatin in this system.”
4) [this effect]NP [of [cyclosporin A]NP]PrepP [or]CONJ [herbimycin A]NP [on [the down-regulation]NP]PrepP [of [ERCC-1]NP]PrepP [correlates]V [with [enhanced cytotoxicity]NP]PrepP [of [cisplatin]NP]PrepP [in [this system]NP]PrepP
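As an illustration only, the flat analysis in (4) can be held as a simple sequence of labeled chunks with no attachment between them. The sketch below is not EDGAR's internal representation; the `Chunk` class, its field names, and the `noun_phrases` helper are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Chunk:
    """One low-level constituent; the class and field names are illustrative."""
    label: str                  # "NP", "PrepP", "CONJ", or "V"
    words: List[str]            # tokens of the chunk, excluding any preposition
    prep: Optional[str] = None  # introducing preposition, for PrepP chunks only


# Analysis (4) as a flat sequence: chunks are identified but not attached
# to one another in a tree.
PARSE_4 = [
    Chunk("NP", ["this", "effect"]),
    Chunk("PrepP", ["cyclosporin", "A"], prep="of"),
    Chunk("CONJ", ["or"]),
    Chunk("NP", ["herbimycin", "A"]),
    Chunk("PrepP", ["the", "down-regulation"], prep="on"),
    Chunk("PrepP", ["ERCC-1"], prep="of"),
    Chunk("V", ["correlates"]),
    Chunk("PrepP", ["enhanced", "cytotoxicity"], prep="with"),
    Chunk("PrepP", ["cisplatin"], prep="of"),
    Chunk("PrepP", ["this", "system"], prep="in"),
]


def noun_phrases(parse):
    """Return the word list of every noun phrase, bare or inside a PrepP."""
    return [chunk.words for chunk in parse if chunk.label in ("NP", "PrepP")]
```

Because the structure is flat, later steps can iterate over the noun phrases directly without first deciding where each prepositional phrase attaches.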
To identify those noun phrases that function as arguments in the predications representing drug and gene interactions, EDGAR relies primarily on the Metathesaurus, with support from the ancillary gene and cell lists. Given the clinical orientation of the UMLS, the Metathesaurus has wide coverage of the drugs that appear in the relevant abstracts. However, since none of the constituent vocabularies of the Metathesaurus has extensive coverage in molecular biology, genes and cells are not as well represented. Furthermore, the ancillary lists are incomplete, particularly for cell lines. Therefore, EDGAR uses contextual information to identify gene and cell names when these do not appear in any of the available knowledge sources.
The general strategy for harvesting contextually-determined gene and cell names depends on the fact that the structure of noun phrases referring to cells and genes in the abstracts in this domain is quite regular. The phrases in (5), all from a single abstract, are typical.
5) human ovarian carcinoma cells, a2780/cp70 human ovarian carcinoma cells, a2780/cp70 cells
Each noun phrase in (5) has cells as its head. Furthermore, if the word that appears immediately to the left of the head is not a normal English word, it is the name of a cell. These generalizations are paradigmatic of the general approach taken to identifying both gene and cell names by context.
A small set of characteristic signal words (such as cell, clone, line, and cultured for cells and activated, expression, gene, and mutated for genes) mark certain noun phrases as referring to either cells or genes. In such phrases, the characteristic words occur in a regular pattern with respect to the names of genes and cells. Words such as cell, line, and gene function as heads of the phrase, and the target name is likely to occur immediately to their left. Cultured, activated, and mutated are modifiers that precede the target name. A few signal words, such as expression (and related forms), may serve as the head of a gene noun phrase but may also indicate that their complement (introduced by of) is almost certainly a gene name.
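The positional patterns just described can be sketched as follows. This is a minimal illustration under stated assumptions, not EDGAR's implementation: the word sets contain only the signal words named above (the plural forms are added here as an assumption), "related forms" of expression are approximated by a prefix test, and `candidate_names` is a hypothetical helper.

```python
# Signal words named in the text; plural forms are an added assumption.
HEAD_SIGNALS = {"cell", "cells", "line", "lines", "clone", "clones", "gene", "genes"}
MODIFIER_SIGNALS = {"cultured", "activated", "mutated"}


def candidate_names(np_tokens):
    """Return tokens in positions where a gene or cell name is expected."""
    candidates = []
    for i, token in enumerate(np_tokens):
        word = token.lower()
        if word in HEAD_SIGNALS and i > 0:
            # Head signal: the target name is immediately to the left.
            candidates.append(np_tokens[i - 1])
        elif word in MODIFIER_SIGNALS and i + 1 < len(np_tokens):
            # Modifier signal: the target name follows the modifier.
            candidates.append(np_tokens[i + 1])
        elif (word.startswith("express")            # crude test for related forms
              and np_tokens[i + 1:i + 2] == ["of"]
              and i + 2 < len(np_tokens)):
            # "expression of X": the complement introduced by "of".
            candidates.append(np_tokens[i + 2])
    return candidates
```

The function only proposes candidates by position; a token such as carcinoma in human ovarian carcinoma cells is still returned and must be rejected by a later check.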
Once gene and cell noun phrases have been identified, the potential target name is scrutinized in order to eliminate carcinoma, for example, as the name of a cell type. If the text token immediately to the left of the word cell does not occur in the SPECIALIST Lexicon and does not have the orthographic characteristics of a normal English word (normal words contain at least one vowel and no digits), then it is likely to be the name of a cell. Similar rules apply to other characteristic signal words and the corresponding gene or cell names.
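The scrutiny rule for the token to the left of cell can be written as a small predicate. The sketch below assumes that "vowel" means one of a, e, i, o, u, and it uses a tiny hypothetical word set, `TOY_LEXICON`, in place of the SPECIALIST Lexicon; neither choice comes from EDGAR itself.

```python
VOWELS = set("aeiou")  # assumption: the text does not say how "y" is treated

# Hypothetical stand-in for the SPECIALIST Lexicon.
TOY_LEXICON = {"human", "ovarian", "carcinoma"}


def looks_like_normal_word(token):
    """Normal words contain at least one vowel and no digits."""
    lowered = token.lower()
    has_vowel = any(ch in VOWELS for ch in lowered)
    has_digit = any(ch.isdigit() for ch in lowered)
    return has_vowel and not has_digit


def is_likely_name(token, lexicon=TOY_LEXICON):
    """True if the token is absent from the lexicon and is orthographically odd."""
    return token.lower() not in lexicon and not looks_like_normal_word(token)
```

With these definitions, a2780/cp70 is accepted as a likely cell name because it contains digits and is not in the lexicon, whereas carcinoma is rejected.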
Although these generalizations have been found useful, they are not always correct. Hyphenated expressions, in particular, produce false positives. For example, upon encountering the noun phrase c-myc-overexpressing cells, EDGAR concludes that c-myc-overexpressing is the name of a cell because this string is not in the SPECIALIST Lexicon. Similarly, apoptosis-related is identified as a gene name on the basis of the noun phrase apoptosis-related gene expression. Because of the many hyphens in gene and cell names, additional work in this area is necessary.
Contextually-identified gene and cell names are harvested in an initial pass through the entire abstract before the identification of all drugs, genes, and cells is attempted. This separate pass is necessary because a gene or cell name may occur only once in a context in which it can easily be identified. For example, in (6), the appearance of c-fos and c-jun as modifiers in the noun phrase whose head is expressions provides strong evidence that these are gene names. This evidence can be used with confidence when the same names appear in another sentence in the same abstract (7) but in a context that less reliably identifies them as gene names.
6) Cyclosporin A and herbimycin A, which suppress c-fos and c-jun gene expressions, respectively, blocked the cisplatin-induced increase in ERCC-1 mRNA.
7) The products of c-fos and c-jun are components of the transcription factor AP-1 (activator protein 1).
Gene and cell names identified by context constitute an internal knowledge source local to the current abstract. This local source is used to supplement the Metathesaurus and ancillary lists when each sentence is processed to identify arguments in the predications representing drug and gene interactions in cells.
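The two-pass arrangement can be illustrated with sentences (6) and (7). In this sketch the tokenizer and `toy_gene_finder` are hypothetical simplifications introduced for the example (the finder recognizes only names directly before the word gene, plus a name coordinated with and); they stand in for the contextual rules and are not EDGAR's own.

```python
import re

S6 = ("Cyclosporin A and herbimycin A, which suppress c-fos and c-jun gene "
      "expressions, respectively, blocked the cisplatin-induced increase in "
      "ERCC-1 mRNA.")
S7 = ("The products of c-fos and c-jun are components of the transcription "
      "factor AP-1 (activator protein 1).")


def tokenize(sentence):
    """Lowercase tokens; hyphens and slashes are kept inside tokens."""
    return re.findall(r"[a-z0-9/-]+", sentence.lower())


def toy_gene_finder(tokens):
    """Hypothetical stand-in for the contextual rules: 'X gene', 'X and Y gene'."""
    found = set()
    for i, token in enumerate(tokens):
        if token == "gene" and i > 0:
            found.add(tokens[i - 1])
            if i >= 3 and tokens[i - 2] == "and":
                found.add(tokens[i - 3])
    return found


def harvest_local_names(sentences):
    """First pass: collect names from every sentence of the abstract."""
    local = set()
    for sentence in sentences:
        local.update(toy_gene_finder(tokenize(sentence)))
    return local


def mentions(sentence, local_names):
    """Second pass: find harvested names in a sentence, whatever the context."""
    return [token for token in tokenize(sentence) if token in local_names]
```

Sentence (7) contains no signal word, so the finder alone returns nothing for it; the names are recovered there only because the first pass over (6) has already placed them in the local list.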
Argument identification proceeds by examining each noun phrase in the underspecified syntactic parse for each sentence and determining whether it matches a Metathesaurus concept, an entry in one of the ancillary lists of genes and cells, or an item in the local, contextually-determined list. For access to UMLS, EDGAR calls on MetaMap [Aronson, et al., 1994], a program that examines the syntactic structure of noun phrases and determines the best match between the input phrase and concepts in the Metathesaurus. A noun phrase that maps to a Metathesaurus concept and that has one of the UMLS semantic types “Pharmacologic Substance,” “Gene or Genome,” or “Cell” is considered accordingly to be a drug, gene or cell. For example, when the sentence in (3) above is submitted to MetaMap, EDGAR determines that the noun phrases in (8) refer to drugs. A search in the ancillary lists finds that (9), another noun phrase in (3), is a gene name.
8) [of cyclosporin A] - Cyclosporine (Pharmacologic Substance) UMLS
[herbimycin A] - herbimycin (Pharmacologic Substance) UMLS
[of cisplatin] - Cisplatin (Pharmacologic Substance) UMLS
9) [of ERCC-1] - ERCC1 (Gene) Ancillary list
As suggested in the discussion of (6) and (7), during this phase of the processing, contextually-determined items are also used whenever possible to identify arguments as either genes or cells.
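The lookup across the three knowledge sources can be sketched as a simple cascade. Several things here are assumptions made for illustration: MetaMap is replaced by a plain dictionary lookup, the toy tables contain only the entries shown in (8) and (9), the order in which the sources are consulted is not stated in the text, and `classify` is a hypothetical name.

```python
# Mapping from the three UMLS semantic types named in the text to argument roles.
SEMTYPE_TO_ROLE = {
    "Pharmacologic Substance": "drug",
    "Gene or Genome": "gene",
    "Cell": "cell",
}

# Toy stand-in for MetaMap plus the Metathesaurus: phrase -> (concept, semantic type).
TOY_METATHESAURUS = {
    "cyclosporin a": ("Cyclosporine", "Pharmacologic Substance"),
    "herbimycin a": ("herbimycin", "Pharmacologic Substance"),
    "cisplatin": ("Cisplatin", "Pharmacologic Substance"),
}

# Toy ancillary list: phrase -> (role, preferred name).
TOY_ANCILLARY = {"ercc-1": ("gene", "ERCC1")}


def classify(noun_phrase, local_names=None):
    """Return (role, name, source) for a noun phrase, or None if nothing matches."""
    key = noun_phrase.lower()
    hit = TOY_METATHESAURUS.get(key)
    if hit is not None:
        concept, semtype = hit
        role = SEMTYPE_TO_ROLE.get(semtype)
        if role is not None:
            return (role, concept, "UMLS")
    if key in TOY_ANCILLARY:
        role, name = TOY_ANCILLARY[key]
        return (role, name, "Ancillary list")
    # local_names maps contextually harvested names to "gene" or "cell".
    if local_names and key in local_names:
        return (local_names[key], noun_phrase, "Local")
    return None
```

For instance, `classify("c-fos", {"c-fos": "gene"})` falls through the first two sources and is resolved by the local, contextually-determined list.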
EDGAR retrieves cell features other than the name, including organ type, cancer type, organism, and several domain-specific features, the most important of which refer to transfection and resistance. EDGAR harvests this information using techniques similar to those described for the contextual identification of gene and cell names: specific signals (notably transfected and resistant) provide guidance, and the Metathesaurus semantic types are consulted for organisms, body parts, and neoplastic processes.
The algorithm for identifying the referential vocabulary that represents the interaction of genes and drugs in cells is recapitulated schematically in .
To further illustrate the processes in EDGAR, we show here the analysis of a MEDLINE abstract (UI 99140404) with the title “V-src induces cisplatin resistance by increasing the repair of cisplatin-DNA interstrand cross-links in human gallbladder adenocarcinoma cells.” All of the gene and cell noun phrases discovered by EDGAR in this abstract are given in (10) and (11), respectively.
10) gene_np([activation, of, src]).
gene_np([activated, 'h-ras']).
gene_np(['v-src', transfected, 'hag1', human, gallbladder, …adenocarcinoma, cells]).
gene_np(['v-src', transfected, 'hag/src3-1', cells]).
gene_np(['v-src', transfected, cells]).
gene_np([activated, src]).
gene_np([mrna, expression, of, topoisomerase, ii]).
11) cell_np([human, gallbladder, adenocarcinoma, cells]).
cell_np(['v-src', transfected, 'hag-1', human, gallbladder, …adenocarcinoma, cells]).
cell_np(['v-src', transfected, 'hag/src3-1', cells]).
cell_np(['hag/src3-1', cells]).
cell_np([cell, lines]).
The drugs, genes and cells identified as arguments are listed in (12), (13), and (14). Note that, because of word-sense ambiguity, mapping to the UMLS occasionally produces errors such as “Link” in (12). There is a drug with this name in the Metathesaurus, and MetaMap erroneously matched the text cross-links from the title to this concept. Also note that appropriate characteristics have been added to the cell predications in (14) (e.g., the “tfw” label to indicate transfection with v-src).
12) drug('99140404', 'Doxorubicin').
drug('99140404', 'Etoposide').
drug('99140404', 'Fluorouracil').
drug('99140404', wortmannin).
drug('99140404', 'Link').
drug('99140404', herbimycin).
drug('99140404', radicicol).
drug('99140404', 'Cisplatin').
13) gene('99140404', 'h-ras').
gene('99140404', 'v-src').
14) cell('99140404', 'HAG-1', 'Gallbladder', 'Adenocarcinoma', tfw('v-…src'), 'Human').