E3Miner is an information extraction system that extracts E3s and their ubiquitination-related data from the published literature. Currently, it focuses on extracting E3-related information from sentences that mention E3s through definition, apposition, example, role or activity. After selecting E3-mentioning sentences, the system identifies E3 names in the text with a three-step approach: (i) assigning part-of-speech (POS) tags to the words in each sentence; (ii) identifying candidate protein names of E3s by using phrasal rules for definition, apposition, example, role and activity; and (iii) grounding (or linking) the protein names to the corresponding entries in the UniProt database.
The accompanying figure shows a sample procedure of the E3Miner system. Given a MEDLINE abstract, the system first selects sentences with enzyme markers for E3s (e.g. E3 ligase, ubiquitin ligase and ubiquitin-protein ligase). It then identifies candidate expressions for E3 protein names in those sentences by using the clausal and phrasal rules for E3s. If the candidate expressions include protein names, the system grounds (or links) the protein names to the corresponding entries of UniProt. It then identifies the protein names of E3-interacting proteins, such as target substrates, ubiquitin-activating enzymes (E1s), ubiquitin-conjugating enzymes (E2s) and deubiquitinating enzymes (DUBs), by using clausal rules for the E3-substrate interactions and other protein interactions. E3Miner then extracts E3-related data, such as GO terms, E3 domains, ubiquitination types and ubiquitination sites, by matching them against the sentences, and integrates the disease information from OMIM and other interacting proteins from IntAct, using the identified UniProt ID (UPID). In the output shown in the figure, IDs beginning with ‘UPID’, ‘GO’ and ‘MIM’ indicate UniProt ID, GO term ID and OMIM ID, respectively. In addition, E3Miner provides statistical information from the precompiled E3 data. We describe each step of the procedure further in the following subsections.
An example procedure for E3 data extraction.
Selecting relevant sentences
E3Miner splits the input text into individual sentences and then selects sentences with mentions of E1, E2, E3 and DUB, using regular expressions of enzyme class names. For example, the E3 class names include markers such as E3, ubiquitin-protein ligase and ubiquitin ligase. The table below shows example patterns for enzyme class markers. E3Miner uses the regular expressions to search for these markers in each sentence, taking into account morphological variations such as plural endings. The details of the regular expressions are shown in Section 1 of ‘Supplementary Material’.
Examples of enzyme class markers
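As an illustration, the sentence selection step can be sketched in Python as follows. The marker patterns, the naive sentence splitter and the function names (`split_sentences`, `select_sentences`) are simplified assumptions made for this sketch; the expressions actually used by E3Miner are those given in the Supplementary Material.

```python
import re

# Simplified, hypothetical marker patterns, one per enzyme class.
# The optional "s" covers plural endings; long names ignore case.
MARKERS = {
    "E1": re.compile(r"\b(?:E1s?|(?i:ubiquitin[- ]activating enzymes?))\b"),
    "E2": re.compile(r"\b(?:E2s?|(?i:ubiquitin[- ]conjugating enzymes?))\b"),
    "E3": re.compile(r"\b(?:E3s?|(?i:ubiquitin(?:[- ]protein)? ligases?))\b"),
    "DUB": re.compile(r"\b(?:DUBs?|(?i:deubiquitinating enzymes?))\b"),
}


def split_sentences(text):
    """Naive splitter: break after ., ! or ? when a capital letter follows."""
    return [s for s in re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip()) if s]


def select_sentences(text):
    """Return (sentence, [enzyme classes]) for sentences with a marker."""
    selected = []
    for sentence in split_sentences(text):
        classes = [cls for cls, pat in MARKERS.items() if pat.search(sentence)]
        if classes:
            selected.append((sentence, classes))
    return selected


abstract = (
    "Mdm2 is an E3 ubiquitin ligase for p53. "
    "The protein is expressed in many tissues. "
    "E2 enzymes such as UbcH5 cooperate with ubiquitin-protein ligases."
)
for sentence, classes in select_sentences(abstract):
    print(classes, sentence)
```

In this sketch the second sentence is dropped because it contains no enzyme class marker. A production splitter would also need to handle abbreviations such as ‘e.g.’, which the naive rule above does not.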
If there are sentences with mentions of E3 markers, E3Miner attaches POS tags, such as NP (noun phrase) and VP (verb phrase), to the words of each such sentence. It then extracts sequential NPs with E3 protein/gene names, using clausal and phrasal rules for definition, apposition, example, role, activity and noun complex. The table below shows example patterns for such rules. In this table, P and E3 indicate the noun phrases for E3 protein/gene names and the E3 class markers, respectively.
Example clausal and phrasal patterns for E3 protein names
Taking into account parentheses (‘(’ and ‘)’), hyphenation (‘-’) and coordination items (‘and’ and ‘or’), E3Miner recognizes E3 protein/gene names from the extracted noun phrases, and discards E3 class names, if present. It then identifies the UPID and synonyms from UniProt for the recognized E3 names.
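A minimal sketch of this name recognition step is given below, assuming that the candidate noun phrase is already available as a string. The regular expressions, the determiner stripping and the function name `extract_names` are assumptions of the sketch; in particular, treating a hyphen that joins a name to a class marker as a separator, while keeping hyphens inside names, is one possible reading of the hyphenation handling.

```python
import re

# Hypothetical E3 class marker pattern (longest alternative first).
MARKER = re.compile(
    r"\b(?:E3[\s-]+)?ubiquitin(?:[\s-]protein)?[\s-]ligases?\b"
    r"|\bE3(?:[\s-]+ligases?)?\b",
    re.IGNORECASE,
)
# Parentheses, commas and coordination items separate individual names.
SEPARATOR = re.compile(r"[(),]|\b(?:and|or)\b", re.IGNORECASE)
DETERMINER = re.compile(r"^(?:the|an|a)\s+", re.IGNORECASE)


def extract_names(noun_phrase):
    """Split a candidate noun phrase into protein/gene names."""
    without_markers = MARKER.sub(" ", noun_phrase)  # discard class names
    names = []
    for piece in SEPARATOR.split(without_markers):
        # Drop spaces and hyphens left dangling after marker removal.
        piece = DETERMINER.sub("", piece.strip(" -")).strip(" -")
        if piece and piece not in names:
            names.append(piece)
    return names


print(extract_names("the E3 ubiquitin ligases Mdm2 (Hdm2) and Parkin"))
# ['Mdm2', 'Hdm2', 'Parkin']
print(extract_names("c-Cbl or Parkin-E3 ligase"))
# ['c-Cbl', 'Parkin']
```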
In this work, we developed (i) a POS tagger that assigns to each word its most frequent POS tag by looking up a manually curated POS dictionary together with domain-specific correction rules; (ii) a noun phrase recognizer that looks for noun phrases that begin or end next to determiners (e.g. ‘a’ and ‘the’), prepositions (e.g. ‘for’ and ‘with’) or verbs, since the noun phrases for ‘P’ and ‘E3’ in the patterns are adjacent to such words; and (iii) a protein name linker that finds UPIDs by using an organism name and a protein/gene name identified in the same sentence. If the procedure fails to identify the organism name in the sentence, it attempts to search for UPIDs with an organism name taken from the title of the abstract or from the prefixes (e.g. ‘h’ for ‘human’ and ‘y’ for ‘yeast’) of the identified protein/gene names. In this process, the procedure performs an exact match of the identified organism and protein/gene names, along with their variations in whitespace (‘ ’), hyphenation (‘-’) and symbols (e.g. ‘I’ and ‘II’). If this match results in a single UPID, the procedure assigns it to the protein mention. If multiple UPIDs are found, the procedure shows all of them without further disambiguation; in this case, it also assigns the first UPID for ‘human’ to the mention as a default, for further use in our precompiled E3 data.
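The linking logic described above can be sketched as follows. The in-memory index, the placeholder identifiers (e.g. ‘UPID:H0001’), the protein names and the function names are all hypothetical and stand in for a real UniProt lookup. Case folding and the behaviour when no human entry exists among multiple hits are additional assumptions of the sketch, and the symbol variations (‘I’ and ‘II’) are not modelled.

```python
import re

# Toy index: normalized name -> [(organism, UPID)]. All entries are
# placeholders for illustration, not real UniProt data.
TOY_INDEX = {
    "abc1": [("human", "UPID:H0001"), ("mouse", "UPID:M0001"),
             ("human", "UPID:H0002")],
    "xyz2": [("yeast", "UPID:Y0001")],
}
PREFIX_ORGANISM = {"h": "human", "y": "yeast"}


def normalize(name):
    """Collapse whitespace and hyphen variations (case folding is assumed)."""
    return re.sub(r"[\s-]+", "", name).casefold()


def candidates(name, organism):
    hits = TOY_INDEX.get(normalize(name), [])
    if organism is not None:
        hits = [hit for hit in hits if hit[0] == organism]
    return hits


def link(name, sentence_org=None, title_org=None):
    """Return (assigned UPID or None, all candidate UPIDs)."""
    # Organism from the sentence first, then from the abstract title.
    organism = sentence_org or title_org
    hits = candidates(name, organism)
    # Last resort: read the organism from a name prefix such as 'h' or 'y'.
    if not hits and organism is None and name[:1] in PREFIX_ORGANISM:
        hits = candidates(name[1:], PREFIX_ORGANISM[name[0]])
    upids = [upid for _, upid in hits]
    if len(upids) == 1:
        return upids[0], upids
    # Multiple hits: report all, default to the first human entry if any.
    human = [upid for org, upid in hits if org == "human"]
    return (human[0] if human else None), upids


print(link("Abc-1", sentence_org="mouse"))  # ('UPID:M0001', ['UPID:M0001'])
print(link("ABC 1"))   # first human UPID assigned, all three reported
print(link("yXyz2"))   # ('UPID:Y0001', ['UPID:Y0001'])
```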
Identifying E3-interacting proteins
If protein names of E3s are identified, the tool finds other E3-mentioning sentences that do not contain enzyme class markers. E3Miner then extracts target substrates from such sentences, using clausal rules encoding the ubiquitinating relation, such as ‘E3 ubiquitinate P’, ‘E3 target P’ and ‘ubiquitination of P by E3’, where P indicates the protein name of a target substrate. The interacting relations used in our system are further elaborated in Section 2 of ‘Supplementary Material’. E3Miner identifies enzymes, such as E1, E2 and DUB, using a method similar to the E3 identification, and checks for co-occurring E3s in the same sentence, in order to ensure that their protein interactions are positively involved in ubiquitination pathways.
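The three example rules for substrates can be approximated with regular expressions, as in the sketch below. Matching on the raw sentence with a single-token protein pattern is a simplification assumed here; E3Miner applies its rules to recognized noun phrases, and the full rule set is given in the Supplementary Material. The function names are hypothetical.

```python
import re

# Assumed single-token pattern for the substrate name P.
PROTEIN = r"(?P<P>[A-Za-z][\w-]*)"


def substrate_patterns(e3_name):
    """Build the three example clausal rules for one known E3 name."""
    e3 = re.escape(e3_name)
    return [
        # 'E3 ubiquitinate P'
        re.compile(rf"\b{e3}\s+ubiquitinate[sd]?\s+{PROTEIN}"),
        # 'E3 target P'
        re.compile(rf"\b{e3}\s+target(?:s|ed)?\s+{PROTEIN}"),
        # 'ubiquitination of P by E3'
        re.compile(rf"\bubiquitination\s+of\s+{PROTEIN}\s+by\s+{e3}\b"),
    ]


def find_substrates(sentence, e3_name):
    """Return substrate names matched by any rule, in order of discovery."""
    found = []
    for pattern in substrate_patterns(e3_name):
        for match in pattern.finditer(sentence):
            if match.group("P") not in found:
                found.append(match.group("P"))
    return found


print(find_substrates("Mdm2 ubiquitinates p53 in unstressed cells.", "Mdm2"))
# ['p53']
print(find_substrates("The ubiquitination of p53 by Mdm2 is regulated.", "Mdm2"))
# ['p53']
```

Because the sketch has no noun phrase recognizer, a sentence such as ‘Mdm2 targets the p53 protein’ would yield ‘the’ rather than ‘p53’, which is one reason the rules are applied to recognized noun phrases in the system itself.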
Extracting GO terms and ubiquitination-related features
E3Miner identifies GO terms that occur in sentences with mentions of E3s. It locates a part of noun phrases in sentences with a complete list of GO terms, by using the longest-first matching method. If a UPID is identified for an E3, this procedure imports GO terms from the corresponding UniProt entry. It then identifies ubiquitin-related molecular features, such as E3 domain, auto-ubiquitination, ubiquitination types of E3 and ubiquitin-binding sites of substrates. E3Miner extracts such features by matching the following patterns: (i) E3 domain names by using a marker ‘domain’ or words ending with ‘-dependent’ and ‘-containing’; (ii) auto-ubiquitination by the word-level patterns, such as auto- or self-ubiquitination; (iii) ubiquitination types by the patterns, such as poly-/multi-/mono-/oligo-ubiquitination and (iv) ubiquitination sites by patterns, such as ‘lysine #-linked’, ‘lys(#)-linked’ or ‘K#-linked’, where ‘#’ indicates the location number of a lysine residue.
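The longest-first matching of GO terms and the four feature patterns can be sketched as follows. The regular expressions are simplified assumptions that follow the pattern descriptions above, the term list passed to `match_terms_longest_first` is a toy vocabulary rather than the complete GO list, and all function names are hypothetical.

```python
import re

AUTO = re.compile(r"\b(?:auto|self)-?ubiquitinat(?:ion|ed|es?)\b", re.I)
TYPE = re.compile(r"\b(poly|multi|mono|oligo)-?ubiquitinat(?:ion|ed|es?)\b", re.I)
# 'lysine #-linked', 'lys(#)-linked' or 'K#-linked'
SITE = re.compile(r"\b(?:lysine\s*|lys\s*\(?|K)(\d+)\)?-linked\b", re.I)
# marker 'domain', or words ending with '-dependent' / '-containing'
DOMAIN = re.compile(r"\b([A-Za-z0-9]+)(?:[ -]domain|-dependent|-containing)\b")


def extract_features(sentence):
    """Collect ubiquitination-related features from one sentence."""
    return {
        "domains": DOMAIN.findall(sentence),
        "auto_ubiquitination": bool(AUTO.search(sentence)),
        "types": [t.lower() for t in TYPE.findall(sentence)],
        "sites": [int(n) for n in SITE.findall(sentence)],
    }


def match_terms_longest_first(text, terms):
    """Match vocabulary terms, letting longer terms claim their span first."""
    lowered = text.lower()
    taken = [False] * len(lowered)
    found = []
    for term in sorted(terms, key=len, reverse=True):
        needle = term.lower()
        start = lowered.find(needle)
        while start != -1:
            end = start + len(needle)
            if not any(taken[start:end]):  # skip spans inside a longer match
                taken[start:end] = [True] * (end - start)
                found.append((start, term))
            start = lowered.find(needle, start + 1)
    return [term for _, term in sorted(found)]


sentence = ("The RING domain of the ligase mediates K48-linked "
            "polyubiquitination and auto-ubiquitination.")
print(extract_features(sentence))
# {'domains': ['RING'], 'auto_ubiquitination': True,
#  'types': ['poly'], 'sites': [48]}
```

In `match_terms_longest_first`, a short term such as ‘polyubiquitination’ is not reported when it lies inside an already matched longer term such as ‘protein polyubiquitination’.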
Integrating data for human diseases and other E3-interacting proteins
The system searches for human disease names for E3s and their substrates from OMIM. It utilizes protein/gene names and their synonyms in the search process. It then integrates other E3-interacting protein names from IntAct, using UPIDs for E3s. Such interacting proteins may include (i) unknown substrate proteins binding to an E3, (ii) unknown enzymes involved in a ubiquitination pathway, (iii) unknown proteins having an important role in the regulation of an E3 and (iv) unknown proteins to be regulated by an E3. We believe that such protein information is useful for in-depth investigation of E3-related proteins.
Providing statistical information
E3Miner generates statistical information from the precompiled E3 data. The system looks up the precompiled data in order to provide the statistics of E3-interacting proteins, together with their references to source articles for further investigation. Using these statistics, users may infer the importance, relevance and research interest of a particular E3.
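One way to derive such statistics is to count, for each E3, the distinct source articles that support each interacting protein. The record layout, the placeholder names and references, and the function name `summarize` in the sketch below are assumptions made for illustration; they do not reflect the actual format of the precompiled data.

```python
from collections import defaultdict


def summarize(records):
    """Group (e3, partner, role, source) records into per-E3 statistics.

    Returns {e3: [(role, partner, n_sources, [sources]), ...]}, with the
    best-supported interactions listed first.
    """
    grouped = defaultdict(lambda: defaultdict(set))
    for e3, partner, role, source in records:
        grouped[e3][(role, partner)].add(source)
    summary = {}
    for e3, pairs in grouped.items():
        rows = [(role, partner, len(sources), sorted(sources))
                for (role, partner), sources in pairs.items()]
        summary[e3] = sorted(rows, key=lambda row: (-row[2], row[0], row[1]))
    return summary


# Placeholder records; names and references are not real data.
records = [
    ("E3_A", "P1", "substrate", "ref-1"),
    ("E3_A", "P1", "substrate", "ref-2"),
    ("E3_A", "Q2", "E2", "ref-1"),
    ("E3_B", "P1", "substrate", "ref-3"),
]
for e3, rows in summarize(records).items():
    print(e3, rows)
```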