As previously stated, the major objective of this work was to compare the information, defined as keyword content, carried by different sections of a paper, especially the differences between the Abstract and the rest. Therefore, as the source for our analysis we used a set of full-text articles with a regular section structure, in our study a defined Abstract, Introduction, Methods, Results, and Discussion (A, I, M, R, D). Another requirement was a certain homogeneity of style across the articles (for example, a similar length of the Methods section) and, since there is great interest in the data mining field in the detection of gene names, a subject related to Genetics. Thus, we chose the 104 articles published in Nature Genetics from June 1998 (volume 19, issue 2) to June 2001 (volume 28, issue 2) that comply with the AIMRD structure. Note that other journals, or even the Letters of the very same Nature Genetics, might have a different structure (for example, lacking separate I, M, R, D sections).
Selection of Keywords
To simplify matters, and following our previous work [4], we focused on the extraction of relevant words (keywords) regarding objects, detected as nouns from natural text by a standard grammatical tagger (TreeTagger, Helmut Schmid, IMS, Stuttgart University, http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/). In order to derive keywords from the section of an article, we first compute the associations between the words in the section. Here, we take the sentence as the unit of text to look for associations, that is, two words are associated in the context of a section if they co-occur repeatedly in sentences within that section (see METHODS).
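The sentence-level association step can be pictured with a minimal sketch. It only counts how many sentences contain each pair of nouns; the actual association measure is the one defined in METHODS and is not reproduced here. The function name, the toy sentences, and the cutoff of two sentences for "repeatedly" are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def sentence_cooccurrences(sentences):
    """Count, for each unordered pair of nouns, the sentences containing both.

    `sentences` is a list of noun lists, one list per sentence. Tagging and
    noun filtering are assumed to have been done already.
    """
    pair_counts = Counter()
    for nouns in sentences:
        # Use a set so a noun repeated within one sentence counts once.
        for a, b in combinations(sorted(set(nouns)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


# Hypothetical toy section: each inner list holds the nouns of one sentence.
toy_section = [
    ["mouse", "mutation", "gene"],
    ["mouse", "gene", "maze"],
    ["mutation", "gene"],
]
counts = sentence_cooccurrences(toy_section)

# Assumed cutoff: a pair co-occurs "repeatedly" if it shares >= 2 sentences.
repeated = {pair: n for pair, n in counts.items() if n >= 2}
print(repeated)
```

Sorting the nouns before pairing keeps each pair in a single canonical order, so (gene, mouse) and (mouse, gene) are never counted separately.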
Since words strongly associated with many other words are relevant to the matter that is dealt with in the article [5], we use a score (K) that is higher for words with many and strong relations to other words (see METHODS). This measure is used to select words as keywords, in this case, related to objects such as proteins, genes, organisms, etc.
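As an illustration only, the idea of a score that grows with the number and strength of a word's relations can be sketched as follows. This is a stand-in and not the K formula from METHODS: it sums the association strengths of each word and rescales by the maximum. The function name, the toy strengths, and the normalization are all assumptions.

```python
from collections import defaultdict


def relation_score(associations):
    """Toy stand-in for K (NOT the formula from METHODS).

    `associations` maps an unordered word pair to a positive strength.
    Each word's strengths are summed, then divided by the largest sum,
    so the best-connected word scores 1.0.
    """
    totals = defaultdict(float)
    for (a, b), strength in associations.items():
        totals[a] += strength
        totals[b] += strength
    top = max(totals.values(), default=0.0)
    if not top:
        return {}
    return {word: total / top for word, total in totals.items()}


# Hypothetical association strengths between nouns of one section.
toy_associations = {
    ("gene", "mouse"): 2.0,
    ("gene", "mutation"): 2.0,
    ("maze", "mouse"): 1.0,
}
scores = relation_score(toy_associations)

# Select keywords above a threshold, as done in the text with K >= 0.5.
keywords = sorted(w for w, k in scores.items() if k >= 0.5)
print(scores)
print(keywords)
```

Any score with the same monotonic behaviour would serve the illustration equally well; the point is only that thresholding such a score yields a short keyword list.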
In order to evaluate the performance of the keyword detection, we observed how the selected keywords matched the MeSH (Medical Subject Headings, http://www.nlm.nih.gov/mesh/) terms attached by indexers at the National Library of Medicine to these 104 articles (18.6 on average). Since MeSH terms can be composed of several words (for example, "Learning Disorders"), we selected those composed of a single word (6.80 terms on average). We noted that the most unspecific terms (for example, animal) were often not present in the text and thus could not be matched by a keyword, as opposed to species names (mouse, mycobacterium, human) or anatomical terms (hippocampus, cerebellum, breast). Of those single-word MeSH terms, 4.91 were found on average in the article (as nouns), and 2.22 were among the set of selected keywords (K >= 0.3). Obviously, a more accurate comparison to MeSH terms would require the detection of bigrams and trigrams (keywords composed of multiple words), but this is outside the scope of our work. The recall when matching the single-word MeSH terms (6.80 on average) went down from 4.91 / 6.80 = 0.72 in the dictionary of 470.6 different nouns present in an article to 2.22 / 6.80 = 0.33 in the 66.6 keywords selected. However, since the list of all nouns found in an article (470.6) is much larger than the list of keywords (66.6), the precision in matching the MeSH terms of an article increased from 4.91 / 470.6 = 0.010 to 2.22 / 66.6 = 0.033.
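The recall and precision figures above follow directly from the reported averages. The short sketch below reproduces the arithmetic; the variable and function names are chosen for illustration, and the values are the per-article averages quoted in the text.

```python
def recall(matched, relevant):
    """Fraction of the relevant (MeSH) terms that were matched."""
    return matched / relevant


def precision(matched, retrieved):
    """Fraction of the retrieved words that match a MeSH term."""
    return matched / retrieved


# Per-article averages quoted in the text.
single_word_mesh = 6.80   # single-word MeSH terms per article
all_nouns = 470.6         # different nouns per article
selected = 66.6           # keywords per article at K >= 0.3
hits_in_nouns = 4.91      # MeSH terms found among all nouns
hits_in_keywords = 2.22   # MeSH terms found among the keywords

recall_nouns = recall(hits_in_nouns, single_word_mesh)
recall_keywords = recall(hits_in_keywords, single_word_mesh)
precision_nouns = precision(hits_in_nouns, all_nouns)
precision_keywords = precision(hits_in_keywords, selected)

print(f"recall:    {recall_nouns:.2f} -> {recall_keywords:.2f}")
print(f"precision: {precision_nouns:.3f} -> {precision_keywords:.3f}")
```

Because the inputs are averages over articles, these are ratios of averages, exactly as in the text, rather than averages of per-article ratios.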
Keyword Selection by Section
The number of words selected for a given threshold on the K value varies between sections (see Figure ). The first observation is that a small number of words have much better K scores than the rest. This means that the organization of words makes it possible to extract keywords for all five sections considered.
Average number of keywords versus K for A, I, M, R, and D sections. The average number of nouns per section is A = 52, I = 171, M = 404, R = 600, D = 331.
The number of selected words is very similar for all sections for very high values of K (above 0.8). Above a lower threshold (K >= 0.5; see Table ) the resulting number of keywords is quite similar for Introduction and Methods (around 15 for each), with the other three sections producing around nine keywords. However, if one accounts for the size of the sections, the frequency of keywords (selected with K >= 0.5) per noun is clearly highest in the Abstract (0.18), followed by the Introduction (0.08), with Methods, Results, and Discussion lagging behind. This justifies data mining strategies that focus on the analysis of Abstracts in order to minimize computational resources. However, this result already indicates that not all keywords are in the Abstract, and that mining the rest of the article may therefore be worthwhile.
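The size correction described above is a simple ratio of keywords to nouns. The sketch below uses the average noun counts given in the figure caption and the rounded keyword counts mentioned in the prose ("around 15", "around nine"), so it only approximates the reported frequencies (0.18 and 0.08), which come from the exact table values. The variable names are illustrative.

```python
# Average nouns per section, as reported in the text.
nouns_per_section = {"A": 52, "I": 171, "M": 404, "R": 600, "D": 331}

# Approximate keyword counts at K >= 0.5, rounded from the prose;
# these are NOT the exact values of the table.
approx_keywords = {"A": 9, "I": 15, "M": 15, "R": 9, "D": 9}

# Keywords per noun for each section.
density = {
    section: approx_keywords[section] / nouns_per_section[section]
    for section in nouns_per_section
}

# Sections ordered from highest to lowest keyword density.
ranking = sorted(density, key=density.get, reverse=True)
for section in ranking:
    print(f"{section}: {density[section]:.3f}")
```

Even with rounded inputs, the Abstract and then the Introduction come out ahead of the three longer sections.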
Keyword selection per section.
Sections Display Heterogeneous Information
As a way to show that the keyword content in different sections is heterogeneous, we examined which keywords (if any) were selected in all the sections of an article. Our results indicate that, as could be expected, not many keywords are present in every section and those that are present are not very relevant. Even for a low threshold of K >= 0.3, there is on average only one such general keyword per article. Those are often non-informative words such as "gene" or "protein". This indicates that the information is unevenly distributed across the sections of the article, that is, different sections contain different kinds of information.
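Finding the keywords selected in every section amounts to a set intersection. A minimal sketch follows; the function name and the toy keyword sets are hypothetical and are not the keyword lists of any article in the corpus.

```python
def keywords_in_all_sections(section_keywords):
    """Return the keywords that were selected in every section."""
    sets = [set(words) for words in section_keywords.values()]
    return set.intersection(*sets) if sets else set()


# Hypothetical keyword sets for one article (illustrative only).
toy_article = {
    "A": {"gene", "mutation", "mouse"},
    "I": {"gene", "disease", "mouse"},
    "M": {"gene", "primer", "antibody"},
    "R": {"gene", "mutation", "maze"},
    "D": {"gene", "disease", "mutation"},
}
general = keywords_in_all_sections(toy_article)
print(general)
```

In this toy case only a generic word survives the intersection, which mirrors the kind of outcome described above.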
We illustrate the heterogeneity of the information by section with the keywords selected (for K >= 0.5) for a particular article [6] (Figure ). This work deals with a mutation of the Nf1 gene of mouse (an exon loss) that produces learning deficits. The only keyword present in every section is the organism under study, the mouse. If the Methods section is excluded, only one more keyword (mutation) is selected. Other three-section overlaps give more interesting keywords such as the name of the gene under study (Nf1), a domain contained in the resulting protein (GAP), the method for testing learning performance of mice (maze), or the resulting phenotype (impairment, lethality). Keywords unique to different sections tend to correspond to the different information contained in each section. For example, the keywords unique to the Methods section deal with reagents and techniques (antibody, amersham, tris, primer).
Figure 2 The keywords selected for an article with a K >= 0.5 are represented as they appear in the different sections of the article.
In order to quantify the differences and similarities of content across the article, we used the number of keywords that are shared between different sections (Table ). The values indicate that the Methods section is the most different of all. In Methods, the content is usually focused on the techniques and protocols used, and not so much on the biological phenomena that are the main subject of the article. This alone explains why those keywords present in every section (for example, protein, gene) are scarce and uninteresting.
Average number of keywords (K >= 0.5) shared by two sections for the corpus of 104 articles.
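The pairwise measure can be sketched as the size of the intersection of two sections' keyword sets, averaged over articles. The function names and the two toy articles below are hypothetical; only three sections are used to keep the example short.

```python
from itertools import combinations


def shared_keyword_counts(section_keywords):
    """Number of keywords shared by each pair of sections of one article."""
    return {
        (a, b): len(set(section_keywords[a]) & set(section_keywords[b]))
        for a, b in combinations(section_keywords, 2)
    }


def average_shared(articles):
    """Average the pairwise shared-keyword counts over a list of articles."""
    totals = {}
    for article in articles:
        for pair, n in shared_keyword_counts(article).items():
            totals[pair] = totals.get(pair, 0) + n
    return {pair: total / len(articles) for pair, total in totals.items()}


# Two hypothetical articles, restricted to three sections for brevity.
toy_articles = [
    {"A": {"mouse", "mutation"}, "M": {"mouse", "primer"},
     "R": {"mouse", "mutation", "maze"}},
    {"A": {"gene"}, "M": {"antibody"}, "R": {"gene", "antibody"}},
]
averages = average_shared(toy_articles)
print(averages)
```

This sketch assumes every article provides the same sections in the same order, which holds for a corpus selected for a common AIMRD structure.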
Regarding similarities between sections, A, I, and D are about equally similar to one another, and R is the closest to M, as shown when the distance matrix of Table is plotted as a dendrogram (see Figure ). This is probably due to the fact that the Results section deals with the protocols used, although not as explicitly as the Methods section. The Discussion focuses again on the biological results (stressing their relation to current knowledge) without detailing the techniques that have already been explained in Methods and justified in Results.
Figure 3 In order to display graphically the similarity between sections regarding keyword content, we took the inverse of the average number of shared keywords (Table ) as a measure of dissimilarity between sections, and we plotted it as a dendrogram ...
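The construction described in the caption (inverse of the average number of shared keywords as a dissimilarity, followed by hierarchical clustering) can be sketched in plain Python. The linkage method is not stated in this section, so average linkage is an assumption here. The section labels S1 to S4 and the shared-keyword values are invented placeholders and are not the values of the table.

```python
def dissimilarity(shared):
    """Inverse of the average number of shared keywords, per section pair."""
    return {pair: 1.0 / n for pair, n in shared.items()}


def average_linkage(labels, dist):
    """Naive agglomerative clustering (average linkage is an assumption).

    Returns the merges in order as (cluster_1, cluster_2, distance).
    """
    def d(x, y):
        return dist[(x, y)] if (x, y) in dist else dist[(y, x)]

    clusters = [(label,) for label in labels]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ci, cj = clusters[i], clusters[j]
                # Mean distance over all cross-cluster pairs.
                avg = sum(d(x, y) for x in ci for y in cj) / (len(ci) * len(cj))
                if best is None or avg < best[0]:
                    best = (avg, i, j)
        avg, i, j = best
        merges.append((clusters[i], clusters[j], avg))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Invented placeholder values for four hypothetical sections.
toy_shared = {
    ("S1", "S2"): 4.0, ("S1", "S3"): 2.0, ("S1", "S4"): 1.0,
    ("S2", "S3"): 2.0, ("S2", "S4"): 1.0, ("S3", "S4"): 2.5,
}
merges = average_linkage(["S1", "S2", "S3", "S4"], dissimilarity(toy_shared))
for left, right, height in merges:
    print(left, right, round(height, 2))
```

The merge order and heights are what a dendrogram draws; a plotting library would be needed to render the tree itself. Note that a pair sharing no keywords would need special handling, since its inverse is undefined.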
This result indicates that each section contains certain keywords that are unique to it. In the following, we try to characterize the differences in content between sections.
Qualitative Analysis of Subjects per Section
To analyze in more depth the kind of information present in each of the sections, we classified into seven categories a set of words present in our corpus of 104 articles (among the most frequent nouns). In order to do so as unambiguously as possible, we selected words that matched MeSH descriptors consisting of that single word and belonging to only one major MeSH category (see METHODS). We added another category not present in MeSH, that of "Units, Dimensions, & Parts", in order to account for many terms that are currently not MeSH terms but are of interest to us.
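The selection rule (keep a word only if it maps to exactly one major category) can be sketched as a simple filter. The function name is hypothetical, and the word-to-category assignments below are illustrative placeholders that have not been checked against MeSH.

```python
def unambiguous_words(word_to_categories):
    """Keep only words that belong to exactly one major category."""
    return {
        word: next(iter(categories))
        for word, categories in word_to_categories.items()
        if len(categories) == 1
    }


# Illustrative placeholder assignments, NOT checked against MeSH.
toy_lookup = {
    "hippocampus": {"A"},
    "antibody": {"D"},
    "culture": {"E", "G"},  # ambiguous in this toy example, so dropped
}
classified = unambiguous_words(toy_lookup)
print(classified)
```

Dropping ambiguous words trades coverage for a cleaner per-category count, which is the trade-off the selection rule above makes.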
The results (see Figure ) indicate that the large sections are a good source of keywords, with Methods, unsurprisingly, gathering many terms related to techniques. Introduction, Results, and Discussion contain a good deal of information regarding diseases. However, the Abstract is again the best source for most subjects in terms of keyword frequency (Figure ), except for those typical of the Methods section (Techniques & Equipment; Chemicals & Drugs).
Figure 4 Word categories present in the five sections under analysis. Classes according to MeSH are A (Anatomy), B (Organisms), C (Diseases), D (Chemicals & Drugs), E (Techniques & Equipment), G (Biological Sciences). An additional class X was ...
Distribution of Gene Names
Since the detection of gene and protein names is a very important subject, broadly used for the detection of macromolecular interactions (see, for example, [7]), and because, as stated in the introduction, we are concerned about the relevance of matching gene names in different sections of an article, we examined the distribution of gene names across sections.
From a long list of gene names derived from the SWISSPROT database [8], we selected a very restricted set of 539 genes whose names are composed of three letters followed by a single digit, and which are therefore very unlikely to be mistaken for words that are not gene names. By contrast, some gene names coincide with common words (for example, Not), and shorter names (e.g., A6) can also be a problem. A total of 224 gene names out of the 539 were matched in 76 of the 104 articles. The Results section was the one with the greatest number of unique gene names (Figure ). Again, the Abstract, and then the Introduction, are the sections with the highest frequency of these names (Figure ).
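The restriction to names made of three letters followed by one digit can be expressed as a regular expression. The sketch below filters a candidate list and then looks up tokens of a sentence in the restricted set. Case-sensitive exact matching and whitespace tokenization are assumptions, and the candidate list and sentence are illustrative.

```python
import re

# Three letters followed by exactly one digit, e.g. "Sac1".
GENE_PATTERN = re.compile(r"[A-Za-z]{3}[0-9]")


def restricted_names(names):
    """Keep only names made of three letters followed by one digit."""
    return {name for name in names if GENE_PATTERN.fullmatch(name)}


def match_gene_names(tokens, names):
    """Return the tokens that exactly match a name in the restricted set."""
    return {token for token in tokens if token in names}


# Illustrative candidates: two fit the pattern, three do not.
candidates = ["Pbp2", "Sac1", "Not", "A6", "Nf1"]
restricted = restricted_names(candidates)

# Hypothetical sentence, split on whitespace (an assumed tokenization).
tokens = "The Sac1 transcript was not detected".split()
found = match_gene_names(tokens, restricted)
print(restricted, found)
```

Using `fullmatch` rather than `search` ensures that longer tokens merely containing such a substring are not accepted.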
Distribution of matches to a set of 224 gene names across sections. (a) Average number of unique gene names per section. (b) Frequency of different gene names per total of nouns for each section.
In order to illustrate the problems that affect gene-name identification if context is ignored, even when using gene names that are apparently easy to recognize (discussed, for example, in [9]), we manually checked the context of gene names that were mentioned exclusively in the Methods section. Of the 224 genes, just 24 were mentioned in the Methods section of the corresponding 14 articles and not elsewhere (see Table ). In five of the 14 articles, the name referred to a non-gene object (three restriction endonucleases, a vector name, and a fibroblast cell strain). In five articles, the gene was mentioned in a technical context (usually, the gene mRNA level was used for analysis of cell state) and no biological process involving the gene was described. In only five articles did we find the mention of the gene name relevant (see Table ). Additionally, we noted that of these 24 gene names, at least two (Pbp2) could refer to two non-homologous (unrelated) genes, and another one (Sac1) to four; such polysemous gene names complicate gene identification from text. Biologists are aware of such problems (see, for example, [10]). In summary, extreme caution should be applied to gene names appearing only in the Methods section, because the context of gene names there is very different from that seen in the rest of the article. If automated methods to extract gene names from text are applied to the Methods section, those that explore the context of gene names using part-of-speech tagging (for example, [11]) or Hidden Markov Models (for example, [7]) should perform better than those that just take co-occurrences of gene names [12].
Detection of gene names appearing only in the Methods section.
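Identifying gene names that occur only in the Methods section is a set difference between the names matched there and the names matched anywhere else. A minimal sketch follows; the function name and the per-section matches are hypothetical.

```python
def only_in_section(section_genes, target="M"):
    """Gene names matched in `target` and in no other section."""
    others = set().union(
        *(genes for name, genes in section_genes.items() if name != target)
    )
    return set(section_genes.get(target, ())) - others


# Hypothetical matches per section for one article (illustrative only).
toy_matches = {
    "A": {"Sac1"},
    "I": {"Sac1"},
    "M": {"Sac1", "Pbp2"},
    "R": set(),
    "D": {"Sac1"},
}
methods_only = only_in_section(toy_matches)
print(methods_only)
```

Names flagged this way are the ones whose context would then be checked manually, as described above.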