We use a corpus that is specifically designed for summarisation for EBM [8]. The corpus consists of real-life clinical queries, human-generated summaries for each query, and the abstracts of the articles referenced to generate the summaries. We divide the abstracts of the corpus into two sets: one for deriving statistics associated with good-quality summaries (training set: 1388 abstracts) and another for evaluating our approach (test set: 1319 abstracts). The goal of the task is therefore to use the statistics derived from the first set to select the three most informative sentences from each abstract in the second set. To evaluate the extracted summaries, we use ROUGE [9], a popular tool for evaluating the performance of summarisation systems.
We incorporate domain knowledge into our system by using a sentence classifier tailored for the EBM domain [10]. The classifier labels each sentence in a medical abstract as one of Population, Intervention, Background, Outcome, Study or Other (PIBOSO). Classifying sentences into these categories enables us to analyse the type of content that is generally present in medical summaries. We also identify the medical concepts, or semantic types, present in the text of our corpus. For this we use the Unified Medical Language System (UMLS) and identify the concepts using the publicly available MetaMap [11] tool. Similar to the PIBOSO information, this information enables us to identify important medical concepts that are generally present in summaries.
We commence our work by generating ideal extractive summaries from the abstracts in our training set using the popular summary evaluation tool ROUGE. We do this by generating all three-sentence combinations from each abstract and then calculating the ROUGE-L [1] f-score of each combination to identify the best three-sentence combination for that abstract. The ROUGE-L score measures the similarity of an extract to the associated human-generated summary in our corpus; the highest-scoring three-sentence combination can therefore be considered the best extractive summary. We then use these best combinations to derive the statistics on which our system bases the summarisation task.
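The search for the best three-sentence combination can be sketched as follows. This is a simplified illustration, not the authors' implementation: it scores the concatenated candidate against the reference with a single LCS-based ROUGE-L f-score, whereas the official ROUGE toolkit handles multi-sentence references with a union-LCS variant; all function names here are our own.

```python
from itertools import combinations

def lcs_len(a, b):
    # dynamic-programming longest common subsequence length of two token lists
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f(candidate, reference):
    # LCS-based ROUGE-L f-score between a candidate and a reference token list
    lcs = lcs_len(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)

def best_three_sentence_combo(sentences, reference_summary):
    # sentences: list of token lists (one per abstract sentence)
    # returns the indices of the highest-scoring three-sentence combination
    return max(
        combinations(range(len(sentences)), 3),
        key=lambda idx: rouge_l_f(
            [tok for i in idx for tok in sentences[i]], reference_summary
        ),
    )
```

Enumerating all combinations is feasible here because abstracts are short; for n sentences there are only n-choose-3 candidates.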
We treat the selection of the three summary sentences from a source text as three separate problems and derive statistics for each sentence position using the best three-sentence combinations in our training set. The score of each source-text sentence therefore varies across the three target positions: a sentence can receive a different score depending on whether the first, second or third sentence of the summary is being extracted. The statistics are based on factors such as relative sentence position (rps), sentence length (sl) and the PIBOSO classification of the sentence (spib). The following is a brief discussion of each of these factors and how the statistics related to each are generated and used.
Relative sentence position:
From the best sentence combinations of our training set, we create approximate probability distributions of relative sentence positions for each of the three target sentences. Thus, during summarisation, each sentence is given a score based on the probability of its relative position and the target sentence number.
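A minimal sketch of how such position distributions could be built and used, assuming relative positions are coarsened into a fixed number of bins (the bin count and all function names are our illustrative choices, not details from the paper):

```python
from collections import Counter

N_BINS = 10

def relative_position_bin(index, n_sentences, n_bins=N_BINS):
    # map a sentence index to a coarse relative-position bin in [0, n_bins)
    return min(int(index / n_sentences * n_bins), n_bins - 1)

def build_rps_distributions(training_best_combos):
    # training_best_combos: list of (best_indices, n_sentences) per abstract;
    # sorted indices give the 1st, 2nd and 3rd target summary sentence
    counts = [Counter() for _ in range(3)]
    for indices, n in training_best_combos:
        for target, idx in enumerate(sorted(indices)):
            counts[target][relative_position_bin(idx, n)] += 1
    total = len(training_best_combos)
    return [{b: c / total for b, c in cnt.items()} for cnt in counts]

def rps_score(distributions, target, index, n_sentences):
    # probability-based score of a source sentence for a given target slot
    return distributions[target].get(relative_position_bin(index, n_sentences), 0.0)
```

During summarisation, `rps_score` would be queried three times per source sentence, once for each target slot.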
Sentence length:
Our analyses show that longer sentences tend to be more informative and are therefore generally more likely to be present in the final summary. Our summarisation approach consequently rewards longer sentences and penalises shorter ones by assigning positive or negative scores.
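The length-based reward/penalty could be as simple as the following sketch; the threshold of 20 tokens is purely illustrative, as the paper does not specify how the cut-off is chosen:

```python
def length_score(sentence_tokens, threshold=20):
    # positive score for sentences longer than the threshold, negative for
    # shorter ones (threshold=20 is an illustrative value; in practice it
    # would be derived from the training set, e.g. the mean sentence length)
    return 1.0 if len(sentence_tokens) > threshold else -1.0
```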
PIBOSO classification:
From our training set, we derive, for each of the six PIBOSO types, the probability of a sentence of that type appearing in the final summary. While existing research suggests that summaries of medical documents consist of Outcome sentences, there has not been any concrete analysis of this assumption. We therefore use our training set to obtain probability estimates for each type of sentence. The estimate for a specific type is obtained by dividing the proportion of that type among the best sentence combinations by its proportion among all the sentences in the training set. The resulting distributions for the three target sentences show that while the last target sentence is highly likely to be an Outcome sentence, the other two tend to include some Background, Population or generic (Other) information. Incorporating this measure thus enables our summariser to include a number of different topics in the final extracted summaries, similar to the human-generated summaries.
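The ratio estimate described above (proportion of a PIBOSO type among the best combinations divided by its proportion in the whole training set) can be sketched as below. For simplicity this sketch pools all three target sentences together, whereas the paper derives a separate distribution per target sentence; the function name is our own.

```python
from collections import Counter

PIBOSO = ["Population", "Intervention", "Background", "Outcome", "Study", "Other"]

def piboso_scores(summary_labels, corpus_labels):
    # summary_labels: PIBOSO labels of sentences in the best combinations
    # corpus_labels: PIBOSO labels of all sentences in the training set
    in_summary = Counter(summary_labels)
    in_corpus = Counter(corpus_labels)
    scores = {}
    for label in PIBOSO:
        p_summary = in_summary[label] / len(summary_labels)
        p_corpus = in_corpus[label] / len(corpus_labels)
        # ratio > 1 means the type is over-represented in summaries
        scores[label] = p_summary / p_corpus if p_corpus else 0.0
    return scores
```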
For each sentence, each of these factors contributes a score indicating the likelihood of the sentence appearing in the final summary based on that factor. These scores are combined using the following Edmundsonian equation to generate the final score for a sentence:
score = (α × rps) + (β × sl) + (γ × spib)    (1)
To find optimal values for the weights α, β and γ, we perform an exhaustive search over values from 0 to 1 (with a step size of 0.2) and choose the combination that gives the best results on the training set.
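The exhaustive weight search can be sketched as follows, assuming an evaluation function that maps a candidate (α, β, γ) to the mean summary quality (e.g. ROUGE score) on the training set; that callback and the function name are our assumptions:

```python
from itertools import product

def grid_search_weights(evaluate, step=0.2):
    # evaluate(alpha, beta, gamma) -> quality score on the training set;
    # exhaustively try every weight combination in [0, 1] at the given step
    values = [round(i * step, 10) for i in range(int(1 / step) + 1)]
    best_weights, best_score = None, float("-inf")
    for a, b, g in product(values, repeat=3):
        s = evaluate(a, b, g)
        if s > best_score:
            best_weights, best_score = (a, b, g), s
    return best_weights, best_score
```

With a step of 0.2 this evaluates only 6³ = 216 combinations, so the exhaustive search is cheap.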