In this study, we report the development of BioNØT, a publicly available database of 32 million negated sentences taken from three major literature resources: PubMed, PubMed Central, and Elsevier. BioNØT is currently the only available database of negated events reported in the biomedical literature. We found that almost 10% of sentences published in the biomedical literature contain negated information. This indicates that negated events are abundant in the biomedical literature, and BioNØT can therefore be an important resource for biomedical scientists.
After evaluating negated sentences for autism, Alzheimer's disease, and Parkinson's disease, we found that for many genes that experts consider relevant to these diseases, the literature also contains evidence suggesting the opposite.
Despite its utility, BioNØT has several limitations. Although extensive, it is not comprehensive, as several full-text articles were not analyzed by BioNØT. BioNØT relies on NegScope to identify and mark negation scope; hence, errors in NegScope's predictions could cause BioNØT to miss certain negated cases. Moreover, we used the heuristic that an event is negated if all entities in the query are present in the same sentence and at least one of them is within the scope of negation. However, given the nature of discourse, this assumption does not always hold. For example, in the following sentence, the negation scope is marked in boldface, and it can be seen that genes such as FMR1 are not negated; however, BioNØT marked the association between these genes and autism as negative: "To date, genome scans, linkage and association studies, chromosomal rearrangement analyses and mutation screenings have identified: (i) genomic regions likely to contain autism susceptibility loci on human chromosomes 1q, 2q, 5q, 6q, 7q, 13q, 15q, 17q, 22q, Xp and Xq; (ii) genes whose mutations represent a rare cause of non-syndromic autism (NLGN3 and NLGN4) or yield syndromic autism (FMR1, TSC1, TSC2, NF1 and MECP2); and (iii) candidate vulnerability genes, with potential common variants enhancing risk but not causing autism per se" (Table ). Finally, BioNØT is not aware of the semantic category of the target entities, which can lead to false positives. For example, the gene MET appears to be associated with autism because several irrelevant sentences contain the word 'met', even though it is not used there as a gene name.
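The sentence-level heuristic and its two failure modes can be illustrated with a minimal Python sketch. This is not the actual implementation: the `NegatedSentence` type, the helper names, the case-insensitive whole-word matching, and both example sentences are assumptions made up for illustration, and the negation scope is supplied by hand rather than predicted by NegScope.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class NegatedSentence:
    """Hypothetical record: a sentence plus its negation scope."""
    text: str
    scope: Tuple[int, int]  # [start, end) character offsets of the scope


def scope_of(text: str, phrase: str) -> Tuple[int, int]:
    """Hand-mark a scope by locating a phrase (stand-in for a scope detector)."""
    start = text.index(phrase)
    return (start, start + len(phrase))


def mentions(text: str, term: str) -> List[Tuple[int, int]]:
    """Case-insensitive whole-word matches of a query term (an assumption)."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [m.span() for m in pattern.finditer(text)]


def marked_negated(sentence: NegatedSentence, query: List[str]) -> bool:
    """Heuristic: every query entity occurs in the sentence, and at least
    one occurrence lies entirely inside the negation scope."""
    spans = []
    for term in query:
        found = mentions(sentence.text, term)
        if not found:
            return False  # an entity is missing from the sentence
        spans.extend(found)
    lo, hi = sentence.scope
    return any(lo <= s and e <= hi for s, e in spans)


# Invented sentence: FMR1 lies outside the scope, but "autism" lies inside it.
text_a = ("Mutations in FMR1 yield syndromic autism, whereas common variants "
          "enhance risk but do not cause autism per se.")
sent_a = NegatedSentence(text_a, scope_of(text_a, "not cause autism per se"))

# Invented sentence: the verb "met" is mistaken for the gene MET.
text_b = "Children who met the diagnostic criteria did not develop autism."
sent_b = NegatedSentence(text_b, scope_of(text_b, "not develop autism"))

result_a = marked_negated(sent_a, ["FMR1", "autism"])  # True: discourse error
result_b = marked_negated(sent_b, ["MET", "autism"])   # True: false positive
result_c = marked_negated(sent_b, ["FMR1", "autism"])  # False: FMR1 is absent
```

In the first example the heuristic flags the FMR1-autism pair as negated solely because the second mention of "autism" falls inside the scope; in the second, the lack of semantic typing lets an ordinary English word satisfy the gene query.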
Our results show that considerable work remains before negated events can be incorporated into genetic diagnosis. Additional semantic information, including complete or incomplete penetrance, gene expression, and molecular function, may benefit the task.
We plan to address some of the above-mentioned limitations in future work. First, we plan to mark the semantic categories of words in the negated sentences. Specifically, we plan to mark genes, diseases, drugs, cells, chemicals, species, and other biomedical entities within these sentences. This approach would help avoid false positives when a target entity is also a common English word or when an acronym is ambiguous. Marking semantic information would also help to identify cases in which synonyms of entities have been used. We will also explore heuristics that can better identify whether the relationship between two entities is negated.
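As a rough illustration of how semantic-category marking could suppress such false positives, the sketch below matches a query term only against spans that carry the expected category. The `Annotation` format, the `annotate` and `typed_mentions` helpers, and the example sentences are all hypothetical; in practice the annotations would come from an entity tagger, which is assumed here and written by hand.

```python
from typing import List, NamedTuple


class Annotation(NamedTuple):
    """Hypothetical entity annotation over a sentence."""
    start: int
    end: int
    category: str  # e.g. "gene", "disease", "drug"


def annotate(text: str, surface: str, category: str) -> Annotation:
    """Hand-written stand-in for the output of an entity tagger."""
    start = text.index(surface)
    return Annotation(start, start + len(surface), category)


def typed_mentions(text: str, annotations: List[Annotation],
                   term: str, category: str) -> List[Annotation]:
    """Keep only annotated spans whose text equals the query term
    (case-insensitively) and whose category is the expected one."""
    wanted = term.lower()
    return [a for a in annotations
            if a.category == category
            and text[a.start:a.end].lower() == wanted]


# Invented sentence: "met" is a verb, so no gene annotation exists for it.
text_1 = "Children who met the diagnostic criteria did not develop autism."
ann_1 = [annotate(text_1, "autism", "disease")]

# Invented sentence: "MET" is tagged as a gene.
text_2 = "MET variants were not associated with autism in this cohort."
ann_2 = [annotate(text_2, "MET", "gene"), annotate(text_2, "autism", "disease")]

hits_1 = typed_mentions(text_1, ann_1, "MET", "gene")  # no gene mention
hits_2 = typed_mentions(text_2, ann_2, "MET", "gene")  # one gene mention
```

Under this assumption, a query for the gene MET would skip the first sentence entirely, because the verb "met" carries no gene annotation.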