Enzymes are mostly protein-based biomolecules that accelerate chemical reactions in living organisms. They are made of amino acids, whose characteristic composition gives each enzyme its particular function and lets it work efficiently under stable conditions, such as an optimum temperature and pH. Thus, any mutation in the amino acid sequence may change the enzyme's 3D structure, catalytic activity, or stability, or even make the enzyme completely non-functional. Knowledge of mutations and their impacts is therefore crucial for fully understanding the mechanisms of enzyme function and stability.
Many experimental studies have focused on the effects of mutations on enzymes. For example, many have aimed to create enzymes with novel properties, such as designing hyperthermophilic enzymes [1] or expanding the substrate specificity of an enzyme [2]. The published results of these projects provide scientific information for researchers investigating the impact of a mutation on an enzyme. Although many databases cover the nomenclature of enzymes [3] or their structure and function [5], to our knowledge only BRENDA (BRaunschweig ENzyme Database) [12], the largest manually curated enzyme-specific information system, contains information on engineered enzymes and the effects of the engineering on catalytic activity, with direct references to the scientific literature. Manual curation, however, is a slow and expensive way to extract information from the literature. There is a need for an efficient automatic extraction method that makes relevant information rapidly accessible at any time.
With recent developments in information extraction, biomedical term recognition has become an important research area. Dictionary-based, rule-based, and machine learning-based approaches are used to extract the names of genes, proteins, and other cellular substances [14]. Several systems have already been developed for the automatic extraction of mutations from the biomedical literature. MuteXt, developed by Horn et al. [17], is one of the earliest works on extracting single point mutations from the scientific literature. Moreover, a gold standard data set [18] has been created for comparing the performance of mutation extraction systems, and systematic evaluations [18] with precisely defined evaluation metrics have been developed for these systems. The next step in mutation informatics is relating mutations to other biological entities such as genes or proteins.
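As a rough illustration of the rule-based approach to mutation extraction, the sketch below matches single point mutations written as wild-type residue, position, and mutant residue (for example A123T or Gly45Asp). The pattern and the helper name `find_point_mutations` are assumptions made for this example; they are not the patterns used by MuteXt or by any other system cited here.

```python
import re

# One-letter codes of the 20 standard amino acids.
AA1 = "ACDEFGHIKLMNPQRSTVWY"
# Three-letter codes, matched in their usual capitalised form only.
AA3 = ("Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|"
       "Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val")

# Illustrative pattern (an assumption, not taken from any cited system):
# wild-type residue, 1-based position, mutant residue.
POINT_MUTATION = re.compile(
    rf"\b(?:(?P<wt1>[{AA1}])(?P<pos1>[1-9]\d*)(?P<mt1>[{AA1}])"
    rf"|(?P<wt3>{AA3})(?P<pos3>[1-9]\d*)(?P<mt3>{AA3}))\b"
)


def find_point_mutations(text):
    """Return (wild_type, position, mutant) tuples in order of appearance."""
    hits = []
    for m in POINT_MUTATION.finditer(text):
        wt = m.group("wt1") or m.group("wt3")
        pos = m.group("pos1") or m.group("pos3")
        mt = m.group("mt1") or m.group("mt3")
        hits.append((wt, int(pos), mt))
    return hits


if __name__ == "__main__":
    print(find_point_mutations("The A123T and Gly45Asp variants were less stable."))
```

A pattern this simple also matches tokens that merely look like mutations, such as some cell line names, so a practical system needs an additional disambiguation step.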
Rebholz-Schuhmann et al. [20] developed MEMA, which extracts disease-related mutation-gene pairs from Medline abstracts. In MEMA, the identification of both gene names and mutations is based on regular expressions compiled into two different finite state automata. If the abstract or sentence from which a mutation is extracted contains only one gene name, the detected mutation is associated with that gene. If there is more than one gene name, however, MEMA uses syntactic rules and proximity parameters as decision criteria. MuGeX (Mutation Gene eXtractor), developed by Erdogmus et al. [21], uses a similar approach to extract disease-related protein mutations. MuGeX uses regular expressions to identify possible mutations, but it also handles ambiguous mutation citations with machine learning techniques. For gene name identification, it uses a dictionary-based approach and then associates the extracted entities according to proximity measures.
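The association step described above can be sketched as follows. This is a simplified illustration, not the implementation of MEMA or MuGeX: it assumes a small hypothetical gene dictionary, uses a basic one-letter mutation pattern, and reduces "proximity" to character distance within a sentence, leaving out the syntactic rules and machine learning components of the real systems.

```python
import re

# Simplified one-letter point-mutation pattern (an assumption for this sketch).
MUTATION = re.compile(
    r"\b[ACDEFGHIKLMNPQRSTVWY][1-9]\d*[ACDEFGHIKLMNPQRSTVWY]\b"
)


def associate(sentence, gene_names):
    """Pair each mutation in `sentence` with a gene name.

    `gene_names` is a hypothetical dictionary (any iterable of strings).
    If only one distinct gene is mentioned, every mutation is assigned to it;
    otherwise each mutation is assigned to the gene mention whose start
    offset is closest to the mutation's start offset.
    """
    mentions = []  # (start offset, gene name)
    for name in gene_names:
        for m in re.finditer(rf"\b{re.escape(name)}\b", sentence):
            mentions.append((m.start(), name))
    if not mentions:
        return []

    distinct = {name for _, name in mentions}
    pairs = []
    for m in MUTATION.finditer(sentence):
        if len(distinct) == 1:
            gene = mentions[0][1]
        else:
            gene = min(mentions, key=lambda g: abs(g[0] - m.start()))[1]
        pairs.append((m.group(), gene))
    return pairs


if __name__ == "__main__":
    genes = ["TP53", "BRCA1"]
    print(associate("The R175H mutation in TP53 was found.", genes))
    print(associate("BRCA1 C61G and TP53 R175H were studied.", genes))
```

Character distance is the crudest possible proximity measure; token distance or syntactic distance would be natural refinements.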
For mutation-protein associations, Lee et al. [22] developed Mutation GraB, which identifies mutations using regular expressions, similar to the previous methods [20]. Protein identification is also performed with regular expressions, which search the text for entries from a dictionary of protein names and synonyms. Lastly, Mutation GraB uses graphs, applying shortest-distance search and word bigram analysis to find the associations between mutations and proteins. MutationMiner, developed by Baker et al. [23], follows a different approach from the previous systems. It mainly focuses on associating mutations with protein structure visualizations using NLP techniques. The system identifies proteins and mutations as named entities and, if they are cited in the same sentence, associates them with one another. Moreover, this system has recently been improved with support for biological ontologies, which make mutation annotations available in a semantically consistent format, and with an OWL ontology, which enables automated access to this knowledge [25].
The information extraction techniques above became necessary because of the growing number of electronic documents. At the same time, the need to classify these documents by content has made document classification an important research field. In particular, since machine learning techniques were integrated into document classification, its accuracy has become comparable to that of human experts, which is itself below 100% [27]. Because of this growing interest and high accuracy, document classification has been used in applications such as document organization [27], word sense disambiguation [30], and web document classification [31].
Although the works above make it possible to associate mutations with other biological entities, they cannot extract the experimental results documented in the scientific literature. Therefore, the next step in mutation informatics is to extract information describing the effects of mutations [17]. We developed EnzyMiner, which automatically extracts protein mutations from PubMed [34] abstracts for a given enzyme and classifies their impact on the enzyme's function and stability. For mutation identification, information extraction and document classification methods are used. For impact analysis, document classification techniques are again used to identify the abstracts that report a change in the stability or catalytic activity of an enzyme resulting from a mutation.