Originally, curated data were captured in Excel spreadsheets that were then edited, subjected to QC review, loaded into CTD and made public on a monthly basis (10). However, as both the biocuration team and the amount of curated data increased substantially in size, it was necessary to develop an application that would expedite the curation process, centralize all core curation activities, eliminate the bottleneck of spreadsheet integration and enhance the efficiency of editing and QC review.
Interactions are now recorded directly into an online curation tool. Biocurators first enter the PMID of the article to be curated (Figure 4, Step 1). This creates a ‘PubMed Curation Activity’ page for the PMID with a direct link-out to the PubMed abstract (Figure 4, Step 2). Three buttons allow a biocurator to open an ‘Interaction Entry Page’ (Figure 4, Step 3) to generate a new interaction (‘New’), edit a previous interaction (‘Edit’) or duplicate a current interaction (‘Clone’). The ‘Clone’ feature is particularly useful because it allows biocurators to replicate an existing interaction and then modify one or more data fields, or build on the existing interaction, instead of re-entering the same information for each interaction. Finally, biocurators capture the email address of the corresponding author (Figure 4, Step 4), which allows authors to be notified when their data are presented on the public website. This simple step raises awareness of CTD among potential new users and provides a mechanism for feedback from authors regarding the quality of the curation of their data.
Figure 4. Curation tool overview. (1) Biocurators submit a PMID to create a ‘PubMed Curation Activity’ page. (2) This page has a hyperlink to the PubMed abstract, which the biocurators use for curation. (3) Based upon the abstract, biocurators can ...
The curation tool interface has additional features for the convenience of the biocurator (Figure 4, Step 5). The ‘Upload’ button allows the biocurator to upload an Excel spreadsheet of interactions instead of manually entering them into the tool one at a time. This feature is especially useful and time saving when curating extensive tables of microarray data from an article. The biocurator typically first copies and pastes the key features of the table (usually gene symbols and gene accession IDs) into an Excel spreadsheet, then adds the necessary data fields (e.g. coded interactions, species, high-throughput, etc.) in other columns of the spreadsheet, and uploads all the interactions to the curation tool en masse. The ‘Report’ button allows biocurators to retrieve all of their previously submitted curation for review and, if necessary, editing. A ‘Not Curatable’ button allows biocurators to flag a PMID as containing no data relevant to CTD. Given the high volume of papers curated by CTD, tracking such rejected PMIDs along with curated PMIDs is essential: it ensures that newly triaged papers that have already been examined are filtered out of the corpus of articles awaiting curation.
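The PMID tracking described above can be illustrated with a minimal sketch. It assumes that curated and ‘Not Curatable’ PMIDs are available as simple collections of strings; the function name `filter_triaged` and the data layout are hypothetical and are not taken from the CTD software.

```python
def filter_triaged(triaged_pmids, curated, not_curatable):
    """Return the triaged PMIDs that have not yet been examined.

    A PMID counts as examined if it was either curated or flagged
    'Not Curatable'. Order is preserved and duplicates within the
    triaged list are dropped.
    """
    examined = set(curated) | set(not_curatable)
    seen = set()
    fresh = []
    for pmid in triaged_pmids:
        if pmid in examined or pmid in seen:
            continue
        seen.add(pmid)
        fresh.append(pmid)
    return fresh
```

Tracking both sets matters because a rejected paper leaves no curated interactions behind; without the ‘Not Curatable’ record it would reappear in every new triage run.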
For a new interaction, the curation tool provides the biocurator with an ‘Ixn’ field in which the coded interaction can be entered (Figure 5). After the biocurator composes the interaction and tabs out of the cell, the curation tool automatically displays the data fields required to correctly complete the curation.
Figure 5. Detailed view of ‘Interaction Entry Page’. After a biocurator composes a new interaction and tabs out of the cell, the curation tool automatically pops up the required data fields (here, C2, C1, G1 and G2) to correctly complete the interaction.
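As a rough illustration of how the required data fields could be derived from the coded interaction, the sketch below scans a notation string for actor placeholders. It assumes that actors are written as a letter (C, G or D) followed by a number, as in C1, G1 and D1; the regular expression and the function name `required_fields` are assumptions made for this example and do not describe the actual parser used by the tool.

```python
import re

# Assumed placeholder syntax: C, G or D followed by digits (e.g. C1, G2, D1).
ACTOR_PATTERN = re.compile(r"\b[CGD]\d+\b")


def required_fields(ixn):
    """Return the distinct actor placeholders in order of first appearance."""
    fields = []
    for match in ACTOR_PATTERN.finditer(ixn):
        token = match.group(0)
        if token not in fields:
            fields.append(token)
    return fields


# Only the fields named in the notation are requested from the biocurator.
fields = required_fields("C1 +sce G1/p")
```

Deriving the fields from the notation, rather than presenting a fixed set of columns, means the entry page shows only what the specific interaction needs.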
The curation tool is designed with several visual QC features to help prevent errors. Color cues (loosely based on a traffic light paradigm) visually alert the biocurator to a particular type of error and thereby facilitate the curation process. For example, if the interaction field contains a spelling error or the notation does not match any official term, the field turns red, indicating ‘STOP’, and an error message is displayed (Figure 6A). In addition, chemical, gene, disease and organism terms entered by a biocurator are compared against the corresponding controlled vocabularies in real time, and the tool alerts the biocurator if there is a discrepancy (Figure 6B). Red indicates that the entered term does not match any official term or synonym in the respective controlled vocabulary, and that the biocurator must stop and enter a new term. Green indicates that the entered term matches only one official term, and that it is therefore acceptable to continue (‘GO’). Yellow cautions that the entered term does not match any official term but does match a synonym that resolves to only one official term; the tool automatically replaces the entered synonym with the official term but still signals the biocurator to proceed with caution. Finally, purple alerts that the entered term matches a synonym that cannot be resolved to a single official term; since the curation tool cannot determine which official term was intended, the biocurator must resolve the ambiguity and enter a new term to continue. Although this traffic light paradigm facilitates the curation process, it is not essential that CTD biocurators be able to recognize color; the ability to ‘Save’ interactions is not enabled until all of the controlled vocabulary terms have been validated.
Figure 6. Color-coded QC. (A) If an invalid curation code (here, ‘sce’) is entered in the interaction field (Ixn), the tool automatically alerts the biocurator by coloring the window red (‘STOP’) and producing an error report at ...
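The four-color term validation described above can be summarized in a short sketch. It assumes a controlled vocabulary represented as a set of official terms plus a mapping from each synonym to the official terms it may denote. The case-insensitive matching, the function names and the placeholder vocabulary are assumptions made for illustration; they are not details of the CTD implementation.

```python
def classify_term(entered, official_terms, synonyms):
    """Classify an entered term as green, yellow, purple or red.

    official_terms: set of official vocabulary terms.
    synonyms: dict mapping a synonym to the set of official terms it may denote.
    Returns (color, resolved_official_term_or_None).
    """
    key = entered.strip().lower()  # assumption: matching ignores case
    officials = {term.lower(): term for term in official_terms}
    if key in officials:
        return ("green", officials[key])  # matches an official term: GO
    targets = set()
    for synonym, terms in synonyms.items():
        if synonym.lower() == key:
            targets |= set(terms)
    if len(targets) == 1:
        # The synonym resolves uniquely; the official term replaces it.
        return ("yellow", next(iter(targets)))
    if len(targets) > 1:
        return ("purple", None)  # ambiguous synonym: biocurator must resolve
    return ("red", None)  # no match at all: STOP


def can_save(colors):
    """'Save' is enabled only when every term is validated (green or yellow)."""
    return all(color in ("green", "yellow") for color in colors)


# Placeholder vocabulary, invented for this example.
OFFICIAL = {"TermA", "TermB"}
SYNONYMS = {"alias-a": {"TermA"}, "shared-alias": {"TermA", "TermB"}}
```

Because `can_save` depends only on the validation result and not on the displayed color, the check behaves the same for a biocurator who cannot distinguish the colors.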
Software design and engineering
Given the success of the CTD notation and the associated spreadsheet-based curation process, it was important that the tool's software be engineered to closely match the curation workflow, building upon the success of the notation while minimizing the disruption caused by the move from spreadsheets. In addition, the curation tool was designed to resolve inefficiencies of spreadsheet-based curation, including the lack of centralization, the inefficiency inherent in coupling an extremely flexible notation with a fixed-column spreadsheet, and the lack of immediate interactive QC for the biocurator. It was also important to meet the needs of a geographically dispersed team of CTD biocurators, potentially international in scope, with highly variable individual software and hardware configurations. Due to these factors, as well as the technical requirements associated with the tool itself, a web-based solution that integrated the curation notation intact was chosen for the curation tool.
The tool's ‘Interaction Entry Page’ dynamically tailors and displays the actor fields for each chemical (C1), gene (G1) and disease (D1) specific to the interaction notation entry (Figure 5), which is more efficient for the biocurator than having to tab through fixed spreadsheet fields to get to a particular column. The QC process is immediate, and all errors associated with the interaction are displayed in real time without the biocurator having to leave the screen. In fact, many core QC edits are completed before the onscreen ‘Save’ button is ever enabled.
Another key component of software design is the use of passive messaging where possible. Biocurators were concerned about having to mouse-click through endless QC-related error messages or informational message boxes. Instead, we implemented a passive messaging-based traffic light paradigm for term validation. Other passive visual cues are included throughout the application. Active messaging (i.e. requiring the biocurator to mouse-click) is reserved for only serious operations, such as confirming the deletion of previously entered data. The vast majority of the curation tool's messaging is asynchronous in nature, i.e. passive onscreen messages or visual cues.
In terms of QC, as indicated above, basic edits, such as term validation, occur as the biocurator is curating; however, many of the more complex edits occur after the biocurator has pressed the ‘Save’ button. For example, if the biocurator entered the (erroneous) notation C1 +sce G1/p and then pressed ‘Save’, a more complex QC test would be performed on the server side of the tool's software, indicating that +sce was an invalid operator (Figure 6A). Even in these cases, the error message appears without the biocurator having to leave the screen and without the screen being refreshed; here, the screen background turns red (‘STOP’) and the error message is displayed at the bottom (Figure 6A).
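A server-side operator check of this kind might look like the following minimal sketch. The set of valid codes, the optional `+`/`-` prefix, the actor syntax and the shape of the returned result are all hypothetical placeholders chosen for the example; the real CTD notation and its validation rules are richer than what is shown here.

```python
import re

# Hypothetical code list for illustration only; not the CTD vocabulary.
VALID_CODES = {"sec", "exp", "act"}

# Assumed token shapes: an actor such as C1 or G1/p, or an operator such as +sec.
ACTOR_TOKEN = re.compile(r"^[CGD]\d+(/\w+)?$")
OPERATOR_TOKEN = re.compile(r"^[+-]?([a-z]+)$")


def check_operators(ixn):
    """Validate operator tokens and return a result the page can display in place."""
    errors = []
    for token in ixn.split():
        if ACTOR_TOKEN.match(token):
            continue  # actor placeholder, optionally with a qualifier
        match = OPERATOR_TOKEN.match(token)
        if match is None or match.group(1) not in VALID_CODES:
            errors.append(f"invalid operator: {token}")
    return {
        "ok": not errors,
        "background": "red" if errors else None,  # 'STOP' cue for the client
        "errors": errors,
    }


result = check_operators("C1 +sce G1/p")
```

Returning a small structured result, rather than a new page, is one way the client could turn the background red and show the message at the bottom of the screen without a refresh.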
As indicated above, some biocurators prefer to continue to use spreadsheets at times, typically to enter microarray data or because of the unusual nature of an individual PubMed article. In these cases, the spreadsheets are submitted using the ‘Upload’ feature (Figure 4, Step 5) and errors are returned to the biocurator in real time via a summary report. The biocurator may then correct any errors and resubmit the entire spreadsheet, repeating the cycle until all the errors are cleared.
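The upload-and-resubmit cycle can be sketched as a validation pass that returns a summary report for the whole spreadsheet. The example reads CSV text rather than an Excel file to stay self-contained, and the column names in `REQUIRED_COLUMNS` and the single "missing value" rule are hypothetical stand-ins for the tool's actual checks.

```python
import csv
import io

# Hypothetical required columns; the real upload template may differ.
REQUIRED_COLUMNS = ("ixn", "gene", "species")


def validate_upload(csv_text):
    """Check every row and return a summary report for the whole file."""
    reader = csv.DictReader(io.StringIO(csv_text))
    errors = []
    row_count = 0
    # Row 1 is the header, so data rows are numbered from 2 as in a spreadsheet.
    for row_number, row in enumerate(reader, start=2):
        row_count += 1
        for column in REQUIRED_COLUMNS:
            if not (row.get(column) or "").strip():
                errors.append((row_number, f"missing {column}"))
    return {"rows": row_count, "errors": errors, "accepted": not errors}


# First submission has a blank species; the corrected file is then accepted.
first = validate_upload("ixn,gene,species\nC1 +sec G1/p,TermA,\n")
second = validate_upload("ixn,gene,species\nC1 +sec G1/p,TermA,Homo sapiens\n")
```

Reporting all errors in one pass, instead of stopping at the first, lets the biocurator fix the spreadsheet once per cycle before resubmitting the entire file.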