In this study, we are interested in identifying statements declaring the deposition of biological data (such as microarray data, protein structures or gene sequences) in public repositories. In the rest of this article, we will refer to such statements as ‘deposition statements’. We take these statements as a primary means of identifying articles reporting on research that produced the kind of data deposited in public repositories. Examples (1) and (2) show such statements, with varying degrees of specificity. In (1), both the data and the location are referred to in a highly specific manner [i.e. ‘the sequence of labA’ and ‘DDBJ/GenBank/EMBL databases (accession no AB281186)’], whereas in (2) the data and the deposition location are both very general (‘the microarray data’ and ‘MIAMExpress’). While mentions of data, public repositories and accession numbers are strong indicators of deposition statements, (3) and (4) show that these elements can also occur when authors refer to previous work. In the remainder of this article, we will refer to statements that do not report the deposition of data in public repositories, such as (3) and (4), as ‘non-deposition statements’.
- The sequence of labA has been deposited in the DDBJ/GenBank/EMBL databases (accession no AB281186) (PMID 17210789).
- The microarray data were submitted to MIAMExpress at the EMBL-EBI (PMID 18535205).
- Histone TAG Arrays are a repurposing of a microarray design originally created to represent the TAG sequences in the Yeast Knockout collection (Yuan et al., 2005; NCBI GEO Accession Number GPL1444) (PMID 18805098).
- Therefore, the primary sequence of native Acinetobacter CMO is identical to the gene sequence for chnB deposited under accession number AB006902 (PMID 11352635).
Table 1 gives an overview of the annotated datasets used in the training and test phases of this work. These datasets are provided as Supplementary Material and are also freely available to the research community from the NLM/NCBI website. The following sections describe the datasets and experiments in detail. In Section 2.1, we describe the method used to collect the training datasets and the analysis of deposition sentences that we carried out in order to gain an understanding of the variety and common characteristics of these statements. In Sections 2.2 and 2.3, we explain how the training datasets were used to automatically identify deposition elements and perform sentence classification. Finally, in Section 2.4 we present the test set, and in Section 2.5 we describe the experiments performed on it.
Table 1. Overview of annotated datasets used in this work.
2.1 Training corpus collection and analysis
Training corpus collection: to gain a better understanding of the variety of deposition statements across data types, journals and databases, we compiled a corpus of deposition statements, extending previous work by Piwowar and Chapman (2008a) and Ochsner et al. (2008).
Specifically, 112 microarray deposition statements from 105 articles were obtained from the existing corpora. After a manual review of these statements, two strategies were devised to collect additional statements. Our regular expression strategy consisted of two steps. First, the query from Ochsner et al. (2008) was used to retrieve 2008 articles from PubMed Central. Second, the articles were segmented into sentences, and sentences likely to report data deposition were retrieved if they met the following three criteria: (i) the sentence length was between 50 and 500 characters, to avoid section titles and sentence segmentation errors; (ii) the sentence contained a mention of GEO or ArrayExpress, which are the largest databases for microarray data (Stokes et al., 2008), a mention of a GEO or ArrayExpress accession number, or the pattern ‘[micro]?array’ followed by ‘data’, ‘experiment’, ‘analys-’ or ‘analyz-’; and (iii) the sentence contained one of the following deposition action seeds: deposit, found, submit, submission, available, access, uploaded, entered, posted, provided, assigned, archived. A sketch of such a filter is given below.
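As an illustration, here is a minimal Python sketch of this filter. The regular expressions are our own approximations of the criteria above: in particular, the accession number patterns and the exact form of the ‘[micro]?array’ pattern are assumptions, as the original expressions are not reproduced here.

```python
import re

# Approximate re-implementation of the three filtering criteria; the
# patterns below are illustrative, not the authors' exact expressions.
DB_MENTION = re.compile(r"\b(GEO|ArrayExpress)\b", re.IGNORECASE)
ACCESSION = re.compile(r"\b(GSE\d+|GSM\d+|GPL\d+|E-\w+-\d+)\b")  # assumed GEO/ArrayExpress accession formats
ARRAY_DATA = re.compile(r"(micro)?array.*(data|experiment|analys|analyz)", re.IGNORECASE)
ACTION_SEEDS = ("deposit", "found", "submit", "submission", "available",
                "access", "uploaded", "entered", "posted", "provided",
                "assigned", "archived")

def is_candidate(sentence: str) -> bool:
    """Return True if the sentence passes all three criteria."""
    if not 50 <= len(sentence) <= 500:                        # criterion (i): length
        return False
    if not (DB_MENTION.search(sentence) or ACCESSION.search(sentence)
            or ARRAY_DATA.search(sentence)):                  # criterion (ii): data/database mention
        return False
    lowered = sentence.lower()
    return any(seed in lowered for seed in ACTION_SEEDS)      # criterion (iii): action seed
```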
After manual review, 133 of the 243 candidate sentences were added to the pool of deposition statements. The remaining 110 sentences [such as (3) and (4)] were kept as examples of non-deposition statements and used in our machine learning strategy to retrieve deposition statements for data types other than microarray.
In the machine learning strategy, we aimed to enrich our training corpus, as proposed by Yeganova et al. (2011). A simple Naïve Bayes (NB) model (simple in that it used only sentence tokens as features) was built using the 243 microarray data deposition statements as positive examples and ~33,000 sentences as negative examples (the 110 non-deposition statements mentioned above, plus sentences from MEDLINE abstracts that contained the word ‘deposit’ or ‘deposited’).
In spite of our blanket assumption that the sentences extracted from MEDLINE abstracts were non-deposition statements, we did expect a small number of them to be actual deposition statements. Our reasoning was that the proportion of true non-deposition statements would be high enough to train an effective model; when that model was then applied to the set of so-called negatives, it would rank the actual deposition statements among them high enough for us to collect them and adjust our training sets. By applying this method iteratively, we finally obtained a training set of 586 positive (deposition) statements, including the initial 243 microarray deposition statements, and 578 negative (non-deposition) statements that scored high with the model, including the initial 110 non-deposition statements. This set was used as training data for building the NB and SVM data deposition models and will be referred to as Train-D (Table 1).
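A minimal sketch of this bootstrapping loop follows, assuming scikit-learn as the NB implementation (the paper does not specify a toolkit) and illustrative values for the number of iterations and of promoted sentences:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def bootstrap(positives, negatives, n_iterations=3, n_promote=100):
    """Iteratively promote high-scoring 'negatives' to the positive pool.
    n_iterations and n_promote are illustrative, not the authors' settings."""
    for _ in range(n_iterations):
        vectorizer = CountVectorizer()                 # sentence tokens as the only features
        X = vectorizer.fit_transform(positives + negatives)
        y = [1] * len(positives) + [0] * len(negatives)
        model = MultinomialNB().fit(X, y)
        # Score the "negatives": under the blanket assumption, the
        # highest-scoring ones are likely mislabeled deposition statements.
        scores = model.predict_proba(vectorizer.transform(negatives))[:, 1]
        ranked = sorted(zip(scores, negatives), reverse=True)
        promoted = [sentence for _, sentence in ranked[:n_promote]]
        # In the actual workflow, promoted sentences were reviewed manually
        # before the training sets were adjusted.
        promoted_set = set(promoted)
        positives = positives + promoted
        negatives = [s for s in negatives if s not in promoted_set]
    return positives, negatives
```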
Analysis of deposition elements: to better characterize deposition statements, sentences were tagged for components referring to data, deposition action and deposition location, using the following guidelines:
- ‘Data’: a phrase referring to biological data that can be found in public repositories. Patient data and data relevant to ClinicalTrials.gov were not considered. However, generic references to data were marked, when used in the context of biological data. This included expressions such as ‘the data’, ‘the protein’, ‘RNA’, ‘DNA’. In addition, specific references to data such as ‘p53 conditional knockout mouse aCGH data’ were also marked.
- ‘Action’: a phrase describing the action undertaken by authors regarding depositing data. This includes phrases such as: deposit, submit, upload/download, is available, can be found, etc.
- ‘General Location’: reference to the location of data deposition, e.g. public repository name or website URL (e.g. http://www.ncbi.nlm.nih.gov/genbank/). This also includes a reference to an organization hosting a public repository in the context of data deposition.
- ‘Detailed Location’: detailed reference to the location of data deposition. This includes accession numbers and specific URLs allowing direct access to the data deposited (e.g. http://www.ncbi.nlm.nih.gov/nuccore/GQ386843).
Examples (1t)–(4t) show how the statements exemplified in (1)–(4) were tagged.
- (1t) <data>The sequence of labA</data> <action>has been deposited</action> <location="general">in the DDBJ/GenBank/EMBL databases</location> (<location="detail">accession no AB281186</location>).
- (2t) <data>The microarray data</data> <action>were submitted</action> <location="general">to MIAMExpress at the EMBL-EBI</location>
- (3t) <data>Histone TAG Arrays</data> are a repurposing of a microarray design originally created to represent <data>the TAG sequences</data> in the Yeast Knockout collection (Yuan et al., 2005; <location="general">NCBI GEO</location> <location="detail">Accession Number GPL1444</location>)
- (4t) Therefore, <data>the primary sequence of native Acinetobacter CMO</data> is identical to <data>the gene sequence for chnB</data> <action>deposited</action> <location="detail">under accession number AB006902</location>
Based on this tagging effort, Table 2 shows a summary of component occurrences over the corpus of 586 deposition statements. Only 16% of sentences contain information that is not covered by one of the four tags (7% for full-text sentences, 24% for abstract sentences).
Table 2. Overview of component occurrences in data deposition statements
‘Data’ is a category with high variability. While general references to data such as ‘the data reported in this paper’ (25 occurrences), ‘the microarray data’ (22 occurrences) and ‘the sequences’ (20 occurrences) are the most frequent phrases used, they are not prevalent overall.
‘Action’ is the category with the least variability. It is expressed by verbs in most cases. In other (rare) cases, nominalization expresses the action, e.g. ‘the deposition/accession number is …’. In more than two-thirds of cases, the action is expressed using a passive verb form, or a present verb + adjective, which is a similar construct. Future tense was used only once in the corpus. (Note that the variability on ‘actions’ is slightly skewed due to the selection of MEDLINE abstract sentences with the words ‘deposit’ or ‘deposition’—variability for actions is otherwise ~20%.)
‘General Location’ also shows high variability, despite the fact that only a limited number of locations are referenced, such as GenBank or GEO. Variation factors are as follows: (i) the preposition introducing the location: at/from/in/into/through/to/on/via/with/within; (ii) the URL used (e.g. ~5 variants for GEO); (iii) the use of the full name and/or abbreviation for institutes (NCBI, EBI) and databases (GEO); and (iv) typos, spelling errors and other variation (e.g. database versus data bank).
‘Detailed Location’ is a category with relatively low variability if we consider accession numbers as one token type. Variation factors are as follows: (i) the preposition introducing the location: through/under/with; (ii) the reference to ‘accession number’: code/number/no/(super)series; and (iii) a list of numbers versus only one number. In the case of a list, a specific data description may be embedded within the list.
2.2 Automatic identification of deposition components in sentences
Based on the analysis above, the identification of the four deposition components defined (data, deposition action, general location and detailed location) in deposition statements appeared to be important for extracting specific deposition information. To provide a complete description of the sentences, any part not covered by the four tags was considered as belonging to a fifth, default tag, ‘nil’. In addition, we anticipated that these components might be useful features for the classification of deposition statements. For this reason, in addition to the 586 tagged data deposition statements, another 697 non-deposition statements were also tagged manually. The negative sentences tagged here differ from the 578 negative sentences used to train the SVM classifier, in order to provide a good balance of sentences that were partly or entirely covered by the ‘nil’ tag. These tagged sentences were then used as a training set (which we will call Train-C) for training a conditional random field (CRF) model using MALLET.2
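As a sketch, the tagged sentences can be serialized in the token-per-line format expected by MALLET's SimpleTagger, which trains a CRF when invoked with --train true. The label names and the absence of extra feature columns are our simplifications; the actual feature set is not detailed here.

```python
def write_simpletagger(sentences, path):
    """Write sentences in MALLET SimpleTagger format: one token per line,
    features (here just the token) followed by the label, with a blank
    line between sentences. Label names are illustrative:
    data, action, location-general, location-detail, nil."""
    with open(path, "w", encoding="utf-8") as f:
        for sentence in sentences:                  # sentence: list of (token, label)
            for token, label in sentence:
                f.write(f"{token} {label}\n")
            f.write("\n")

# Training would then run along the lines of:
#   java -cp mallet.jar cc.mallet.fst.SimpleTagger \
#        --train true --model-file deposition.crf train.txt
```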
2.3 Automatic identification of deposition sentences
Using the final sets of 586 positive and 578 negative sentences obtained as described in the previous section (Train-D), we built several machine learning models in order to assess the contribution of the following features to the automatic identification of data deposition statements:
- Tokens from the sentences (also used in our simple model above)
- Sentence relative position in article or abstract
- Part-of-Speech (POS) tags obtained with MEDPOST (Smith et al., 2004)
- Component tags obtained with the CRF model (trained using Train-C)
We compared NB and SVM models built using these features. Table 3 presents the performance (in terms of average precision) of each machine learning method and feature set using 5-fold cross-validation.
Table 3. Average precision of SVM and NB models for 5-fold cross-validation with various feature sets
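A minimal sketch of this comparison, again assuming scikit-learn (our choice of toolkit) and token features only; the full experiments also fold in sentence position, POS tags and component tags:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def compare_models(sentences, labels):
    """5-fold cross-validation of NB and SVM on token features;
    labels are 1 for deposition statements, 0 otherwise."""
    for name, clf in (("NB", MultinomialNB()), ("SVM", LinearSVC())):
        pipeline = make_pipeline(CountVectorizer(), clf)
        scores = cross_val_score(pipeline, sentences, labels,
                                 cv=5, scoring="average_precision")
        print(f"{name}: mean average precision = {scores.mean():.3f}")
```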
2.4 Test corpus
We built a test corpus relying on MEDLINE curation of accession numbers. Specifically, we used the following query to retrieve full-text articles indexed with accession numbers and published in 2010 (we selected 2010 as a publication date to avoid any overlap with our training data):
- (GenBank[si] OR GEO[si] OR PDB[si] OR OMIM[si] OR RefSeq[si] OR PubChem-Substance[si] OR GDB[si]) AND pubmed pmc local[sb] AND 2010[dp] (N = 2,029)
These articles were considered as ‘positive’ for data deposition and were therefore expected to contain a data deposition statement.
Based on the use of the MeSH term Molecular Sequence Data for indexing articles containing references to various types of biological data (as per Chapter 28 of the NLM indexer manual, http://www.nlm.nih.gov/mesh/indman/chapter_28.html), we used the following query to retrieve full-text articles containing references to biological data but no deposition information referenced in MEDLINE:
- Molecular Sequence Data[mh] NOT (GenBank[si] OR GEO[si] OR PDB[si] OR OMIM[si] OR RefSeq[si] OR PubChem-Substance[si] OR GDB[si]) AND pubmed pmc local[sb] AND 2010[dp] (N = 4,708)
These articles were considered as ‘negative’ for data deposition and were therefore not expected to contain data deposition statements.
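For reproducibility, such queries can be run programmatically against PubMed via the NCBI E-utilities; the sketch below shows the positive-set query (the retrieval mechanism actually used is not specified here):

```python
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_search(term, retmax=10000):
    """Run a PubMed query via E-utilities esearch and return the PMIDs."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    response = requests.get(EUTILS, params=params)
    response.raise_for_status()
    return response.json()["esearchresult"]["idlist"]

# The positive-set query described above.
positive_pmids = pubmed_search(
    "(GenBank[si] OR GEO[si] OR PDB[si] OR OMIM[si] OR RefSeq[si] "
    "OR PubChem-Substance[si] OR GDB[si]) "
    "AND pubmed pmc local[sb] AND 2010[dp]"
)
```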
All articles (N = 6,737) were downloaded from PubMed Central in XML format and converted to text format for processing. A subset of the corpus comprising 700 articles (210 articles from the positive set and 490 articles from the negative set, reflecting the real-world data balance) was selected for testing. The MEDLINE [si] field for the 210 selected articles contained annotations for GenBank (110 articles), PDB (50 articles), GEO (47 articles), RefSeq (4 articles) and GDB (1 article).3
Sentence-level gold standard: the 700 articles were segmented into sentences that were scored with both the NB and SVM classifiers. For each method, the top-scored sentence of each article was selected, so as not to favor one particular method, forming two sets of 700 sentences that were manually annotated to determine whether they were data deposition statements. The set composed of sentences that were top-ranked according to the SVM model was called Test-SS; the set composed of sentences that were top-ranked according to the NB model was called Test-SB. Across the two sets of 700 sentences, 423 sentences were selected by both methods, so the manual annotation was performed on one whole set of 700 sentences and completed by annotating the remaining 277 sentences of the other set. The three annotators involved in this task (the authors) were not shown the scores assigned to the sentences by either classifier and did not know whether a given sentence came from an article in the positive or the negative set. All three annotators first assessed a common set of 100 sentences (30 from the positive article set and 70 from the negative article set, to preserve the balance) in order to check inter-annotator agreement and allow discussion of potentially ambiguous sentences. The remaining 600 sentences of this set were divided evenly among the annotators in three subsets that preserved the overall data balance. Finally, the 277 diverging sentences from the other set were also processed by one annotator.
Article-level gold standard: the 210 articles with an accession number reported in MEDLINE were considered positive for data deposition in our gold standard. In addition, based on the manual annotation of sentences carried out to build the two sentence-level test sets, articles corresponding to a sentence annotated as positive for data deposition were also considered positive in the article-level gold standard. This allowed us to add to the gold standard 70 articles reporting the deposition of data in repositories that are not currently covered by MEDLINE curation, such as the EMBL/EBI databases. The remaining 420 articles were considered negative for data deposition in our gold standard. The dataset comprising gold standard judgments on these 700 articles will be referred to as Test-A.
2.5 Sentence and article classification
Classification was performed based on the scoring of article sentences. At the sentence level, a classification decision was made by comparing the score assigned to the sentence with a threshold, set at the 25th percentile of the scores of positive sentences in the training set: if the score was above the threshold, the sentence was classified as positive for data deposition; otherwise, it was classified as negative. At the article level, the classification decision was based on the top-scored sentence: if the top-scored sentence was classified as positive for data deposition, so was the article.
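A minimal sketch of this decision rule (variable and function names are ours):

```python
import numpy as np

def make_classifiers(train_positive_scores):
    """Build sentence- and article-level decision rules from the scores
    assigned to positive sentences in the training set."""
    threshold = np.percentile(train_positive_scores, 25)   # 25th percentile cut-off

    def classify_sentence(score):
        return score > threshold        # True = positive for data deposition

    def classify_article(sentence_scores):
        # The article follows the decision on its top-scored sentence.
        return classify_sentence(max(sentence_scores))

    return classify_sentence, classify_article
```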
The performance of sentence classification was assessed using accuracy to allow for indicative comparison with inter-annotator agreement. Specifically, accuracy was computed as the number of sentences that were correctly classified as positive or negative according to our gold standard over the total number of sentences in the test set (N=700). We also computed precision, recall and F-measure to allow for a direct comparison with article classification. The performance of article classification was assessed using precision, recall and F-measure based on positive sentences only, to allow for indicative comparison with related work. Specifically, precision was computed as the number of articles that were positive in our gold standard and also classified as positive by the algorithm over the total number of articles classified as positive. Recall was computed as the number of articles that were positive in our gold standard and also classified as positive by the algorithm over the total number of positive articles in the gold standard. F-measure was then computed as the harmonic mean of precision and recall.
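For reference, these measures correspond to the standard definitions, with TP, FP, TN and FN denoting true positives, false positives, true negatives and false negatives, and N = 700 the size of the test set:

```latex
\mathrm{Accuracy} = \frac{TP + TN}{N}, \quad
\mathrm{Precision} = \frac{TP}{TP + FP}, \quad
\mathrm{Recall} = \frac{TP}{TP + FN}, \quad
F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```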