A huge amount of data about genomes and sequence variation is available and continues to grow rapidly, which makes it infeasible to experimentally characterize these mutations in terms of disease association and effects on protein structure and function. Therefore, reliable computational approaches are needed to support the understanding of mutations and their impacts. Here, we present VERMONT 2.0, a visual interactive platform that combines sequence and structural parameters with interactive visualizations to make the impact of protein point mutations more understandable.
We aimed to contribute a novel visual analytics oriented method to analyze and gain insight on the impact of protein point mutations. To assess the ability of VERMONT to do this, we visually examined a set of mutations that were experimentally characterized to determine if VERMONT could identify damaging mutations and why they can be considered so.
VERMONT allowed us to understand mutations by interpreting position-specific structural and physicochemical properties. Additionally, we note some specific positions we believe have an impact on protein function/structure in the case of mutation.
The online version of this article (doi:10.1186/s12859-017-1789-3) contains supplementary material, which is available to authorized users.
According to the International HapMap Project, there are approximately 10 million common single-nucleotide polymorphisms (SNPs), whereas, according to the 1000 Genomes Project Consortium, the genome of an individual selected at random differs from the reference genome at approximately 10,000 non-synonymous SNP (nsSNP) sites. SNPs represent more than half of all the disease-associated variations in the Human Gene Mutation Database (HGMD).
The sequence variation in a genome is a complex phenomenon. A huge amount of data involving genomes and especially sequence variation is available and continues to grow on a large scale. This makes experimentally characterizing these variations in terms of disease association and effects on protein structure and function infeasible. Therefore, reliable computational approaches are needed to support the understanding of mutations and their impacts.
Over the past two decades, several computational methods have been proposed to understand and predict the influence of mutations in protein structure and function based on different evolutionary and physicochemical data. Two recent reviews gave a panorama of such tools by discussing some representative cases, with some overlap [2, 4]. We did not aim to develop an exhaustive list of such methods because we believe this was already done well in the mentioned reviews. In this paper, we comment on some recent strategies that have been proposed to understand and predict the impact of mutations on protein structure and function based on different perspectives.
Worth and colleagues proposed the web server Site Directed Mutator (SDM), which uses a statistical potential energy function to predict the effect of SNPs on the stability of proteins based on environment-specific amino acid substitution frequencies within homologous protein families.
Pires and others introduced mCSM, which represents protein residue environments as graphs, where nodes are the atoms and edges are the physicochemical interactions established among them. From these graphs, distance patterns between atoms are extracted and summarized in a structural signature that is used as evidence to train predictive models.
Also based on graphs, Giollo and colleagues proposed NeEMO, a non-linear neural network model for the prediction of stability changes upon mutations based on residue interaction networks (RINs). RINs are a graph description of protein structures where nodes represent amino acids and edges represent different types of physicochemical interactions.
Laimer and others, in turn, proposed multi-agent stability prediction upon point mutations (MAESTRO). The method combines multiple linear regression, a neural network approach and a support vector machine (SVM) with a multi-agent method to predict protein structure stability, mainly based on ΔΔG. The authors also present MAESTROweb, a web interface for MAESTRO (which is standalone software).
A predictor of the Impact of Non-synonymous variations on Protein Stability (INPS) has also been introduced. This method computes the ΔΔG values of protein variants without relying on the protein structure, taking advantage of the fact that the number of available sequences is much higher than the number of structures. The authors also presented INPS-Multi Descriptor (MD), which complements INPS with a new predictor (INPS-3D) that exploits descriptors derived from the protein structure.
iStable integrates I-Mutant2.0, MUPRO, AUTO-MUTE, PoPMuSiC2.0, and CUPSAT through an SVM to predict protein stability changes upon single amino acid residue mutations, and it performs better than any single method alone.
DUET combines mCSM and SDM to predict the effects of missense mutations by consolidating the results of both methods in an optimized predictor through an SVM trained with Sequential Minimal Optimization.
Although several strategies have been proposed to predict the impact of mutations, none of them alone has been proven to be accurate in all scenarios where mutation impact is investigated. Under these circumstances, a strategy that has gained attention is combining methods based on different paradigms and protein structural properties to reach a consensus on the understanding of mutation impacts. iStable and DUET are examples of such methods. Another drawback of the methods that are widely used to study the impact of mutations is their lack of interpretability.
Authors of the mentioned works on protein mutations and of the reviews [2, 4] note common directions that can be explored to develop strategies with more accurate predictions. Some notable guidelines are the use of consensus approaches that integrate various methods; the development of user-friendly tools; and the use of relevant features to better describe the properties of mutations, such as those based on sequence, structure and database annotation. In line with these directions, this article proposes ViewER MutatiON Tool (VERMONT) 2.0, a visual interactive platform that integrates sequence and structural parameters such as intramolecular interactions, solvent accessibility, and topological properties, coupled with powerful interactive visualizations to make the impact of protein point mutations more understandable. VERMONT is visual analytics oriented, so it allows domain specialists to analyze and make sense of many structural properties to gain insights into the impact of point mutations.
The first version of VERMONT was presented at the BioVis Contest in 2013 to analyze data from a functionally defective triosephosphate isomerase (dTIM) and its S. cerevisiae parent (scTIM) based on a dataset of proteins of the same family. The main goal was to point out mutations that have an impact on function and suggest how the function could be rescued. At that time, VERMONT was populated only with the contest data, and it was not possible to analyze mutations in proteins other than dTIM.
Due to the positive feedback on VERMONT, which received the Biology Experts Pick award, we decided to implement VERMONT 2.0 from scratch. The tool now takes as input any protein structure in PDB file format. The input module automatically searches the Protein Data Bank for structures similar to an entry provided by the user, according to a desired similarity threshold; alternatively, the user can enter a list of PDB entries. VERMONT then performs the necessary computations and notifies the user when the analysis has been completed. Furthermore, new interaction graphs and protein molecular structure visualizations were included to enhance the analysis by specialists. We also integrated the FoldX tool into our platform; FoldX predicts the impact of a mutation through the calculation of the Gibbs free energy change (ΔΔG), complementing the visual parameters displayed in VERMONT and supporting users in the selection of harmful mutations.
In this section, we detail the VERMONT platform by describing problem modeling, its functionalities and interactive visualizations organized by modules. A summary of the VERMONT analysis process is presented in Fig. 1.
Given a dataset, we compute a variety of sequence and structural parameters for each residue. We were interested in visually representing these parameters in a way that they can be examined to detect relevant similarities and differences as well as trends and exceptions, which constitutes a multivariate visualization problem.
A first task that domain specialists perform to identify similarities and differences among a set of proteins is a sequence alignment, which shows each protein sequence in a row and equivalent residues in the same column. In addition, residues are usually colored according to a color scheme that associates residues with similar physicochemical properties to the same color, helping to identify conservation in a particular column.
To take advantage of a visual representation that domain specialists are very familiar with, we used the multiple sequence alignment visualization as the basis for our platform. In addition to displaying sequence alignment, we include an intramolecular interaction network, solvent accessibility, physicochemical properties and complex network topological parameters in this basic sequence alignment visualization.
Each structure was modeled as a graph to study the network of intramolecular interactions and analyze its topological properties from a complex network perspective. We computed interatomic contacts using Delaunay triangulation, which is a geometric, cutoff-independent approach where edges represent interatomic interactions, excluding occluded contacts. Contact computation was performed using the CGAL library, version 3.3.1.
For each chain of a particular PDB id, we constructed an atomic level contact graph where nodes represent atoms and edges represent interactions among them. Nodes are labeled according to their physicochemical properties as positive, negative, hydrogen bond donor, hydrogen bond acceptor, aromatic, hydrophobic and cysteine based on our previous works [26, 27], which were, in turn, derived from . Edges are labeled according to interatomic interactions and distance criteria such as hydrogen bond, repulsive, salt bridge, aromatic, hydrophobic and disulfide bridge based on . The interactions were then mapped to residue level.
These graphs, which represent protein structures, are the basis for the Interactions and Topological properties modules of VERMONT. Table 1 provides the distance criteria and atom labels for each interaction type.
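The labeling of atomic contacts and their mapping to residue level can be sketched as follows. This is only an illustration under stated assumptions, not the VERMONT implementation: the atom records, the residue identifiers, and the distance cutoffs are hypothetical placeholders (the actual criteria are those listed in Table 1), and a brute-force pairwise distance check stands in for the Delaunay triangulation.

```python
from collections import defaultdict
from math import dist

# Hypothetical atoms: (residue id, atom name, label, xyz coordinates in angstroms).
ATOMS = [
    ("ARG10", "NH1", "positive", (0.0, 0.0, 0.0)),
    ("ASP14", "OD1", "negative", (3.0, 0.0, 0.0)),
    ("ILE20", "CD1", "hydrophobic", (10.0, 0.0, 0.0)),
    ("VAL22", "CG1", "hydrophobic", (13.5, 0.0, 0.0)),
    ("LEU40", "CD2", "hydrophobic", (30.0, 0.0, 0.0)),
]

# Placeholder rules: set of atom labels -> (interaction type, maximum distance).
# These numbers are assumptions for the sketch, not the Table 1 criteria.
RULES = {
    frozenset(["positive", "negative"]): ("salt bridge", 4.0),
    frozenset(["hydrophobic"]): ("hydrophobic", 4.5),
}

def atomic_contacts(atoms, rules):
    """Yield (residue_a, residue_b, interaction) for each atom pair matching a rule."""
    for i, (res_a, _, label_a, xyz_a) in enumerate(atoms):
        for res_b, _, label_b, xyz_b in atoms[i + 1:]:
            if res_a == res_b:
                continue  # skip contacts within the same residue
            rule = rules.get(frozenset([label_a, label_b]))
            if rule and dist(xyz_a, xyz_b) <= rule[1]:
                yield res_a, res_b, rule[0]

def residue_level(contacts):
    """Collapse atomic contacts into a residue -> set of interaction types map."""
    by_residue = defaultdict(set)
    for res_a, res_b, kind in contacts:
        by_residue[res_a].add(kind)
        by_residue[res_b].add(kind)
    return dict(by_residue)

residue_interactions = residue_level(atomic_contacts(ATOMS, RULES))
```

In this toy input, LEU40 is too far from every other atom to form a contact, so it does not appear in the residue-level map.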
Next, we describe some features that are common to more than one visualization module in VERMONT.
The VERMONT input module is shown in Figure S1 from the Additional file 1, and it takes three basic parameters:
Additionally, users can receive an email to be notified when the server finishes data processing.
A structure-based sequence alignment of each protein from the family set against the wild protein is performed in a pairwise manner using Multiprot. To represent this alignment, we used the multiple sequence alignment visualization, a representation that biologists routinely use to inspect aligned sequences. This visualization is the basis of our strategy, and an example of the Structure-based sequence alignment module is provided in Fig. 2. Sequences from the family set are stacked using the wild protein sequence, on the top, as a reference. The sequence of the mutant protein is then positioned above the wild protein. Each row and column represent a protein chain and a corresponding position in the alignment, respectively. Each residue is colored according to its physicochemical properties, and similar rows are organized next to each other using the Expectation Maximization (EM) clustering algorithm. The coloring and clustering help to identify conservation and exceptions in columns.
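As an illustration of what conservation in a column means numerically, the sketch below computes residue frequencies for one alignment column and checks whether a substitution stays within the same physicochemical class. The class table is a simplified assumption that covers only a few residues and is not the exact color scheme used by VERMONT; the toy alignment is hypothetical.

```python
from collections import Counter

# Simplified residue classes (an assumption for this sketch, covering only
# some amino acids); not the exact color scheme used by VERMONT.
RESIDUE_CLASS = {
    "R": "polar positive", "K": "polar positive", "H": "polar positive",
    "D": "polar negative", "E": "polar negative",
    "S": "polar neutral", "T": "polar neutral",
    "N": "polar neutral", "Q": "polar neutral",
    "A": "nonpolar aliphatic", "V": "nonpolar aliphatic",
    "L": "nonpolar aliphatic", "I": "nonpolar aliphatic",
}

def column_frequencies(alignment, column):
    """Return {residue: fraction} for one alignment column, ignoring gaps."""
    residues = [row[column] for row in alignment if row[column] != "-"]
    if not residues:
        return {}
    counts = Counter(residues)
    return {res: n / len(residues) for res, n in counts.items()}

def is_conservative(wild, mutant):
    """Treat a substitution as conservative if both residues share a class."""
    return RESIDUE_CLASS[wild] == RESIDUE_CLASS[mutant]

# Toy alignment (hypothetical): each string is one aligned chain.
alignment = ["MRI", "MRI", "MHI", "M-I", "MRI"]
freqs = column_frequencies(alignment, 1)
```

Here the gapped row is excluded, so the second column is 75% Arg and 25% His in the toy data.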
Three color schemes are provided for protein residues:
After selecting a color scheme, there are some features to help users analyze and make sense of the data that are common in VERMONT modules, so we describe them separately in the “Common features of VERMONT modules” section.
The intramolecular interactions of each structure are represented as a graph as detailed in the “Visualized attributes computation methods” section. However, it is not trivial to identify and grasp conserved patterns in protein interactions by visually inspecting graphs. Thus, we devised a 2D representation of intramolecular interactions that gives a panorama of the intramolecular network, delineating the conserved columns for the whole family dataset at once. An example of the Interactions module is provided in Fig. 3. In Fig. 3a, we show all interactions at once, while we show the interactions for a selected column in Fig. 3b.
The multiple sequence alignment visualization, which is the basis of our tool, is used to represent the interactions. Each residue is colored according to the interaction it establishes. If a residue establishes more than one interaction, it is colored in gray. Hence, VERMONT provides a general view of the interactions, delineating the conserved columns. Additionally, one can select a specific column to inspect its contacts, which points out specific patterns on the contacts of a correspondent position in the alignment. The user can choose to show or hide each type of interaction in the sequence alignment panel.
By clicking on a particular position (a residue) in the sequence alignment visualization, VERMONT shows the Interaction viewer. In this module, the interactions established by the selected residue are depicted as a 3D molecular representation of the protein (Fig. 3c) and as a 2D schematic representation in the form of a graph (Fig. 3d), which allows users to make sense of these interactions in the context of protein structures.
Some interactions involve residues that are close to each other in the sequence space while others involve residues that are distant. To support users on the visualization and analysis of both long and short range contacts, we have a zoom control to provide a general view of the interactions, maintaining long and short range contacts on the same screen by using low values for zoom. Contact details can be obtained by using high values for zoom, hovering the mouse over each residue to see more information or by clicking on a specific residue to see its interactions in 3D and in 2D representations.
Complex networks are graphs whose connections between nodes are neither purely regular nor purely random. Most real-world graphs, such as for protein-protein interactions or social or gene-regulatory networks, are complex .
In VERMONT, three common complex network centrality measures were computed for each residue; that is, each node from each graph that represents a protein structure. These metrics were computed using the iGraph package, version 1.0.1. Here, we briefly describe them. In the Additional file 1, we describe these metrics in detail and some of their uses and meanings in biology. Figure 4 shows the topological properties panel.
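For readers unfamiliar with these measures, the following pure-Python sketch computes degree, closeness and betweenness on a small undirected graph. It is a didactic stand-in rather than the code used in VERMONT: the toy graph is hypothetical, closeness is defined here as the number of reachable nodes divided by the sum of shortest-path distances to them, betweenness is left unnormalized, and normalization conventions vary between libraries.

```python
from collections import deque

def centralities(graph):
    """Degree, closeness and betweenness for an undirected, unweighted graph.

    `graph` maps each node to the set of its neighbors.
    """
    degree = {v: len(nbrs) for v, nbrs in graph.items()}
    closeness = {}
    betweenness = dict.fromkeys(graph, 0.0)

    for source in graph:
        # Breadth-first search from `source`, counting shortest paths
        # (the first phase of Brandes' algorithm).
        dist = {source: 0}
        n_paths = dict.fromkeys(graph, 0)
        n_paths[source] = 1
        preds = {v: [] for v in graph}
        order = []
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    n_paths[w] += n_paths[v]
                    preds[w].append(v)

        total = sum(dist.values())
        closeness[source] = (len(dist) - 1) / total if total else 0.0

        # Accumulate dependencies from the farthest nodes back to the source.
        dependency = dict.fromkeys(graph, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                dependency[v] += n_paths[v] / n_paths[w] * (1 + dependency[w])
            if w != source:
                betweenness[w] += dependency[w]

    # Each undirected pair was counted once from each endpoint.
    betweenness = {v: b / 2 for v, b in betweenness.items()}
    return degree, closeness, betweenness

# Toy residue contact graph (hypothetical): B is in contact with A, C and D.
toy = {"A": {"B"}, "B": {"A", "C", "D"}, "C": {"B"}, "D": {"B"}}
degree, closeness, betweenness = centralities(toy)
```

In the toy graph, every shortest path between two peripheral nodes passes through B, so B has the highest value for all three measures.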
These network topological properties are displayed in VERMONT using a heatmap constructed based on the multiple sequence alignment visualization. Each measure (degree in orange, betweenness in blue and closeness in yellow) is shown on a specific heatmap panel. Individual residues contained in the alignment visualization are represented as color intensities.
This heatmap representation supports users by detecting relevant residues/positions in the alignment from the complex network perspective. Columns with high values of topological properties are shown in a dark shade of the selected color and columns with low values are shown in light shades. As a column corresponds to a specific position in the alignment, columns that exhibit a trend should be further investigated.
Solvent accessibilities were computed with the Lee and Richards algorithm using the software Naccess. This software calculates the accessible area by rolling a probe of a given radius (typically 1.4 Å, the radius of a water molecule) around the van der Waals surface of the protein. The path traced out by the probe center is the accessible surface. Figure 5 shows the Accessibility module.
Hydrophobic interactions are important forces in initiating protein folding and stabilizing the 3D structures of proteins. Hydrophobicity and the packing of hydrophobes in the hydrophobic core of a protein can affect protein stability. In globular proteins, the hydrophobic (apolar) residues tend to be buried in the protein core, forming hydrophobic cores, whereas hydrophilic (polar) residues are more exposed to solvent. This hydrophobic packing in the protein core tends to be conserved in protein families. Thus, we believe a mutation in the protein core is more likely to be destabilizing than a mutation on the protein surface, with some exceptions, such as mutations in the binding site and the active site.
We combined a multiple sequence alignment visualization with a heatmap to display accessibilities. We provide one heatmap for each accessibility computed using Naccess, which are all-atoms relative, total-side relative, main-chain relative, nonpolar relative, all polar relative, all-atoms absolute, total-side absolute, main-chain absolute, nonpolar absolute and all polar absolute. Each alignment position, which corresponds to a residue, is associated with a color intensity: the higher the value, the more intense the color. This heatmap allows users to detect conserved columns (corresponding positions) in the alignment, that is, columns with consistently high or consistently low accessibility values.
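A value-to-intensity mapping of the kind used in such heatmaps can be sketched as below. This is an assumption about one reasonable way to do it (linear scaling and blending from white), not a description of how VERMONT renders its panels; the base color is a hypothetical choice.

```python
def color_intensity(value, low, high):
    """Map a value linearly to an intensity in [0, 1]; higher is more intense."""
    if high == low:
        return 0.0
    clamped = min(max(value, low), high)
    return (clamped - low) / (high - low)

def shade(base_rgb, intensity):
    """Blend from white toward `base_rgb`: 0 gives white, 1 the full color."""
    return tuple(round(255 + (c - 255) * intensity) for c in base_rgb)

BLUE = (0, 0, 255)  # hypothetical base color for one heatmap panel

# Relative accessibility is a percentage, so 0 to 100 is a natural scale.
cell_color = shade(BLUE, color_intensity(39.7, 0.0, 100.0))
```

With this mapping, a column whose cells all share a similar shade is one whose values are conserved, whether those values are high or low.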
To assess the ability of VERMONT to support domain specialists in analyzing a large number of structural properties, gaining insights into the impact of point mutations, and selecting those that are potentially damaging for further investigation, we carried out a use case based on a classical mutation dataset from Bongo, which has been used in many subsequent studies, such as [7, 8, 11, 12, 19]. We visually examined the mutations by integrating the sequence conservation, intramolecular interaction network, solvent accessibility, physicochemical properties and complex network topological parameters to gain insights into the impact of mutations. Additionally, we note a few mutations that could be potentially damaging according to VERMONT.
The p53 gene encodes a transcription factor with multiple, anti-proliferative functions activated in response to several forms of cellular stress. The core domain of tumor suppressor protein, p53, is responsible for approximately 50% of the mutations that lead to human cancers. Eight disease-associated mutations in the p53 core domain that were analyzed experimentally by Fersht and co-workers [37, 38] were used in this use case. In Table 2, we provide these eight mutations. Next, we describe how two of these mutations, Arg273His and Ile195Thr, could be visually analyzed as illustrative cases using VERMONT. The other six mutations are described in the Additional file 1 due to space limitations. In this analysis, we considered the all-atoms relative accessibility. We worked with relative accessibilities as they express the accessible surface as a percentage of that observed in an Ala-X-Ala tripeptide.
The input parameters used in VERMONT were (i) PDB id 1TSR.A as the wild protein; (ii) the mutant FASTA file, generated by manually replacing the original residues in the 1TSR.A FASTA file with the residues resulting from the mutations; (iii) PSI-BLAST as the alignment method; and (iv) 70% identity. The results are available to be explored and analyzed in VERMONT. A summary of the results obtained for accessibility, topological properties, and interactions is presented in Tables 3 and 4.
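The manual editing of the mutant sequence can also be scripted. The sketch below applies a point mutation written in three-letter notation, such as Arg273His, to a one-letter sequence; the function name, the `first_residue_number` parameter, and the toy fragment are hypothetical, and the code assumes contiguous residue numbering with no gaps or insertion codes.

```python
import re

THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

def apply_point_mutation(sequence, mutation, first_residue_number):
    """Return `sequence` with a mutation such as 'Arg273His' applied.

    `first_residue_number` is the residue number of sequence[0]; numbering
    is assumed to be contiguous (no gaps or insertion codes).
    """
    match = re.fullmatch(r"([A-Za-z]{3})(\d+)([A-Za-z]{3})", mutation)
    if match is None:
        raise ValueError(f"unrecognized mutation: {mutation!r}")
    wild = THREE_TO_ONE[match.group(1).upper()]
    mutant = THREE_TO_ONE[match.group(3).upper()]
    index = int(match.group(2)) - first_residue_number
    if not 0 <= index < len(sequence):
        raise ValueError("position outside the sequence")
    if sequence[index] != wild:
        # Guard against numbering offsets before editing the sequence.
        raise ValueError(
            f"expected {wild} at {match.group(2)}, found {sequence[index]}"
        )
    return sequence[:index] + mutant + sequence[index + 1:]

# Toy fragment (hypothetical, not the 1TSR.A sequence), numbered from 271.
fragment = "EVRVC"
mutated = apply_point_mutation(fragment, "Arg273His", first_residue_number=271)
```

Checking the wild-type residue before substituting catches the most common error in this step, which is an offset between PDB residue numbers and sequence indices.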
The mutation Arg273His, which corresponds to position 180 in the structural alignment, is a conservative mutation, as both residues are polar positive according to the CINEMA color scheme. The Structure-based sequence alignment module shows that this column is highly conserved, with Arg in approximately 89% of chains and His and Cys in approximately 5% each. The conservation at alignment position 180 is shown in Figure S2 of the Additional file 1. The accessibility, which is provided in Fig. 6, is conserved but does not have very low values (it ranges from 4 to 39.7) (Table 3), as the column presents a light shade of blue. Regarding the topological properties (complex network metrics) (Table 3), shown in Figure S3 of the Additional file 1, the degree is conserved (3 to 9); betweenness is not conserved, as the column does not have a uniform shade; and closeness is relatively conserved. In fact, for closeness, we see regions (sets) of conserved columns, which makes sense considering that if a vertex (residue) has a high closeness value, it is close to many vertices, and its neighbors are likely to behave similarly. The same holds for vertices with low closeness values. Regarding the interactions established by column 180 (Table 4), the majority of residues in this position establish charged attractive, charged repulsive and hydrogen bond interactions, so these interactions are highly conserved. Hydrophobic interactions, provided in Fig. 7, are not conserved, as only 8 chains (approximately 8%) establish such interactions in this position. In Figure S4 of the Additional file 1, we show an example of how the domain specialist can inspect the specific interactions established by a residue at the atomic level.
By clicking on any residue of the Interactions module, VERMONT shows the interactions established by a particular residue/atom in the context of protein 3D structure in a molecular viewer and in a 2D graph schematic representation. To sum up, we would not consider this mutation as damaging (which is in accordance with FoldX, which outlines this position with a gray rectangle) because the residue change is conservative, the accessibility is not low and there are few, non-conserved hydrophobic interactions in this position.
Ile195Thr corresponds to position 102 in the structural alignment and is non-conservative, as Ile is nonpolar aliphatic and Thr is polar neutral. Figure 8 shows that column 102 is highly conserved, presenting only Ile residues. The accessibility in this column, provided in Fig. 9 and Table 3, is low and conserved, as the whole column presents a light shade of gray (0.3 to 7.6). With regard to the topological properties (Table 3), shown in Figure S5 of the Additional file 1, the degree is relatively conserved (2 to 5), betweenness is not conserved, and closeness is relatively conserved. Regarding the interactions at alignment position 102, 100% of residues establish hydrogen bonds and 98% establish hydrophobic interactions (Table 4). Therefore, the hydrogen bonds and hydrophobic interactions are highly conserved. Figure 10 shows the conservation of hydrophobic interactions for alignment position 102. Figure 11 provides 3D and 2D views of the interactions established by the p53 wild type protein (1TSR.A, Ile195), showing that at alignment position 102, the wild type protein presents only hydrogen bonds and hydrophobic interactions. With these aspects in mind, we consider Ile195Thr as likely damaging, as it is non-conservative, with low and conserved accessibility and highly conserved hydrophobic interactions. This conclusion is in accordance with FoldX, which outlines this position with a red rectangle.
Analyzing these experimentally studied mutations through our visual platform, which combines different structural and physicochemical data in a fully visual and interpretable manner, gives relevant clues that support the identification of damaging mutations. In the case of p53, it seems that mutations in positions with conserved and low accessibilities, coupled with conserved hydrophobic interactions, are likely to have an impact on protein structure/function. We could also analyze this dataset by looking for these specific characteristics to identify critical positions. For instance, the positions 145 (Leu), 157 (Val) and 254 (Ile) in p53 (1TSR), which correspond to positions 52, 64 and 161 in the alignment, have very low and conserved accessibilities and have highly conserved hydrophobic interactions.
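The visual reasoning applied to the two p53 cases can be written down as a simple rule, shown below purely as an illustration. It is not an automated method provided by VERMONT: the `ColumnSummary` structure and both cutoffs are hypothetical, and the input values are the ones reported above for alignment positions 180 and 102.

```python
from dataclasses import dataclass

@dataclass
class ColumnSummary:
    conservative_change: bool    # wild and mutant residues share a class
    max_accessibility: float     # highest relative accessibility in the column (%)
    hydrophobic_fraction: float  # fraction of chains with hydrophobic interactions

# Hypothetical cutoffs chosen only for this illustration.
LOW_ACCESSIBILITY = 10.0
CONSERVED_FRACTION = 0.90

def likely_damaging(col):
    """Flag non-conservative changes at buried, hydrophobically packed positions."""
    return (
        not col.conservative_change
        and col.max_accessibility <= LOW_ACCESSIBILITY
        and col.hydrophobic_fraction >= CONSERVED_FRACTION
    )

# Values taken from the two cases discussed in the text.
arg273his = ColumnSummary(conservative_change=True, max_accessibility=39.7,
                          hydrophobic_fraction=0.08)
ile195thr = ColumnSummary(conservative_change=False, max_accessibility=7.6,
                          hydrophobic_fraction=0.98)

flags = {"Arg273His": likely_damaging(arg273his),
         "Ile195Thr": likely_damaging(ile195thr)}
```

Such a rule only encodes the pattern observed in this dataset; other proteins may require different cutoffs or additional properties.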
This report presents VERMONT 2.0, a visual interactive platform that integrates sequence conservation, the intramolecular interaction network, solvent accessibility, physicochemical properties and complex network topological parameters, combining them with powerful interactive visualizations to make the impact of protein point mutations more understandable.
To assess the ability of VERMONT to gain insight into the impact of point mutations, we presented a use case in which we analyzed mutations that have been experimentally characterized. We show that VERMONT is able to identify these mutations in a completely visual manner, providing clues that help to identify those that potentially have an impact on structure/function. In this specific dataset, harmful mutations tend to present low and conserved values for accessibility, combined with conserved hydrophobic interactions.
As future work, we intend to design an automatic strategy to support users on the detection of harmful mutations, based on the structural and physicochemical properties computed by VERMONT. Additionally, we would like to investigate how VERMONT can be extended to address the multi-chain protein complex as a whole, as currently, we use individual chains. Last, we consider allowing domain specialists to use not only structures that are experimentally solved and available in PDB but also their own models.
Supplementary material. Additional details and figures to support the understanding and usage of VERMONT. http://bioinfo.dcc.ufmg.br/vermont/download/supplementary-material.pdf. (PDF 3070 kb)
This work has been supported by Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES), Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) and Fundação de Amparo à Pesquisa do Estado de Minas Gerais (FAPEMIG). None of the funding agencies influenced the study design; the collection, analysis and interpretation of data; or the writing of the manuscript. The publication costs are funded by CAPES through the process 23038.004007/2014-82, whose project was contemplated in edict 51/2013 - Computational biology.
The VERMONT interactive platform is available at: http://bioinfo.dcc.ufmg.br/vermont/.
This article has been published as part of BMC Bioinformatics Volume 18 Supplement 10, 2017: Proceedings of the Symposium on Biological Data Visualization (BioVis) at ISMB 2017. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-18-supplement-10.
SAS and RCM conceived the VERMONT platform. AVF and PMM designed and implemented the tool. SSG, SSA, and VSR implemented algorithms for property computation. AVF, SAS and RCM analyzed the results and wrote the manuscript. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
From Symposium on Biological Data Visualization (BioVis) 2017 Prague, Czech Republic. 24 July 17
Alexandre V. Fassio, Email: rb.gmfu.ccd@oissaferdnaxela.
Samuel da S. Guimarães, Email: email@example.com.
Sócrates S. A. Junior, Email: firstname.lastname@example.org.
Vagner S. Ribeiro, Email: email@example.com.
Raquel C. de Melo-Minardi, Email: rb.gmfu.ccd@mcleuqar.
Sabrina de A. Silveira, Email: rb.vfu@anirbas.