VENN is a Java application interfaced to a local MySQL database. Users begin by selecting a protein structure, which is retrieved from the Protein Data Bank and displayed using the Jmol molecular viewer (http://www.jmol.org). A BLAST alignment identifies up to 500 putative homologs. Users interactively select among these homologs, and the calculated amino acid conservation at each position is mapped onto the protein structure as a heat map. The application and help videos are at http://sbtools.uchc.edu/venn/.
The VENN workflow is shown in Figure 1A. The user loads the protein structure and sequence into VENN via the Protein Data Bank (PDB) accession number (5). Sequences similar to the individual chain sequences in the structure (putative orthologs or paralogs) are retrieved with BLAST, either remotely from EBI (6) or locally via NCBI, and stored in the local VENN database. The user selects a set of sequences and initiates an alignment of these filtered sequences, which is shown in the alignment display. Sequence conservation at each position is calculated from the sequence alignment and used to generate a heat map that colors the protein structure in the Jmol structural display window. The user can repeat the filtration process, selecting more, fewer, or different groups of sequences, to titrate the sequence homology and map it onto the surface of the protein structure. A screenshot of the structural display and alignment windows is shown in Figure 1B.
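The conservation-to-heat-map step can be illustrated with a minimal sketch. The text does not state which conservation measure or color scale VENN uses, so the scoring rule here (fraction of sequences sharing the most common residue in a column) and the blue-to-red palette are assumptions chosen only for illustration; the function names `column_conservation` and `heat_color` are hypothetical.

```python
from collections import Counter


def column_conservation(alignment):
    """Score each alignment column as the fraction of sequences that carry
    the most common residue. Gaps ('-') never count as a match.

    Assumption: this identity-based score stands in for whatever measure
    VENN actually computes.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        if not residues:
            scores.append(0.0)  # column contains only gaps
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


def heat_color(score):
    """Map a score in [0, 1] to an (R, G, B) tuple.

    Hypothetical palette: blue for variable positions, red for conserved ones.
    """
    score = max(0.0, min(1.0, score))
    red = round(255 * score)
    return (red, 0, 255 - red)


# Toy alignment of four sequences (made-up data, not from the paper).
alignment = [
    "MKLV-",
    "MKIVA",
    "MRVVA",
    "MKLTA",
]
scores = column_conservation(alignment)
colors = [heat_color(s) for s in scores]
```

Each entry of `colors` corresponds to one alignment column; applying it to a structure would additionally require mapping alignment columns to residue numbers in the PDB chain, which this sketch leaves out.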
Figure 1. (A) Data processing model for VENN. Processes are shown as boxes (cyan); products are ellipses (orange); displays are yellow. (B) Screenshot of a VENN analysis of a complex of Fibroblast Growth Factor 8 (FGF8) bound to an FGF receptor 2c homodimer (blue) ...
We have identified four principal strategies for using homolog titration in VENN, and users are encouraged to create their own titration protocols: (i) Select all orthologs or paralogs; choosing proteins with the same name can be used for this analysis. This allows a user to determine which regions of the protein are evolutionarily conserved (e.g. Figure 2A). (ii) Select sequences with similar BLAST scores that include different proteins from different species. This reveals important functional sites that are conserved in protein families (e.g. Figure 2B). (iii) Select sparsely distributed sequences with a wide range of BLAST scores. In addition to identifying conserved functional sites in gene families, non-conserved residues can provide clues to the specificity of family members (e.g. Figures 2B and 2C). (iv) Select sequences that have low BLAST scores to reveal the modularity of functional sites in proteins (e.g. Figure 2D).
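The selection strategies above amount to different filters over a ranked list of BLAST hits. The following sketch shows one way such filters could look. The `Hit` record and its field names are hypothetical and do not reflect VENN's database schema, and the synthetic scores are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One BLAST hit. Hypothetical fields, not VENN's actual schema."""
    accession: str
    protein_name: str
    species: str
    score: float


def ranked(hits):
    """Sort hits from highest to lowest BLAST score."""
    return sorted(hits, key=lambda h: h.score, reverse=True)


def by_name(hits, name):
    """Strategy (i): keep hits with the same protein name (case-insensitive)."""
    wanted = name.casefold()
    return [h for h in ranked(hits) if h.protein_name.casefold() == wanted]


def score_window(hits, low, high):
    """Strategies (ii) and (iv): keep hits whose score lies in [low, high]."""
    return [h for h in ranked(hits) if low <= h.score <= high]


def sparse_sample(hits, step, limit):
    """Strategy (iii): take every `step`-th hit among the top `limit` hits."""
    return ranked(hits)[:limit:step]


# Synthetic hit list with strictly decreasing scores (illustrative only).
hits = [Hit(f"SEQ{i:03d}", "protein", "species", 500.0 - i) for i in range(500)]

# Sampling every 20th hit among the top 160 yields 8 sequences.
sample = sparse_sample(hits, step=20, limit=160)
```

A narrow `score_window` near the top of the ranking corresponds to strategy (ii), while a window at the low end of the score range corresponds to strategy (iv).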
Figure 2. Homology titration of C/EBPβ using VENN. (A–D) Images from VENN analysis of the C/EBPβ homodimer (1GU4); chain A is shown using larger spheres. DNA (green) and a heat-map color code of residue conservation are shown. Residue conservation ...
To demonstrate the utility of VENN, we explored these four strategies by examining CCAAT/enhancer-binding protein β (C/EBPβ; PDB: 1GU4), a transcription factor of the bZIP family. The automated BLAST analysis identified 500 C/EBPβ homologs for homology titration. Comparing four orthologs from human, frog, flounder and pufferfish shows high conservation of almost all residues (Figure 2A). As the user titrates in the 50 sequences with the highest BLAST scores, representing C/EBP family α, β, δ and γ members from many species, functional sites for coiled-coil homodimerization and DNA binding emerge (Figure 2B, cyan and yellow arrows, respectively). Residues L306, N310, L313, L320, E323, L324 and L327 are completely conserved among distant homologs and form contacts at the dimerization interface. Residues R278, N281, N282, A284, K287, S288, R289 and R295 comprise a DNA-binding site.
To identify differences among closely related members of the bZIP protein family, we selected every 20th sequence among the top 160 BLAST scores (Figure 2C). Within the highly conserved DNA-binding site, V285 (yellow arrow) was poorly conserved. Closer examination reveals that this residue is juxtaposed to a guanine base in the DNA. A literature search revealed that this residue is known to be important for base selectivity in bZIP transcription factors (7). In an analogous analysis, VENN was used to identify a similar recognition determinant among 15 different FGF family members for binding their receptors (Figure 1B). From this analysis we hypothesize that the critical Thr–Phe residues are specificity determinants for FGF receptor recognition of FGF8 ligands; this was previously recognized for FGF8 isoforms (8).
The BLAST results also revealed several myosins and centrosomal proteins that are not thought to bind DNA, a view supported by VENN analysis. When the conservation among these proteins is mapped onto the transcription factor, it is clear that the coiled-coil dimerization interface remains conserved while the DNA-binding region does not (Figure 2D).
VENN has other unique capabilities. VENN accommodates all protein chains in structures of protein complexes in a single analysis, which facilitates analysis of multiprotein complexes. VENN also provides different sequence alignment strategies. A neutral sequence alignment places no weight on any individual amino acid, whereas a BLOSUM alignment weights residues based on the BLOSUM62 matrix (9). VENN also offers a parametric sequence alignment in which alignment weights can be based on chemical and physical properties of amino acids (for instance, aliphatic, aromatic, acidic, basic, polar). From the visualization perspective, VENN can be used to interactively identify and color regions of a protein by searching for a regular expression. Thus, a user could search with ‘P.P’ to identify any motif that has two prolines separated by one residue. Alternatively, by entering a single amino acid, ‘M’, all methionines can be colored. These features can be used to examine the 3D location of conserved motifs or residues.
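The regular-expression search described above can be sketched in a few lines. This is an illustration rather than VENN's implementation: the function name `find_motif_positions` is hypothetical, overlapping matches are reported (an assumption about the desired behavior), and residue numbering is assumed to be contiguous from `first_residue_number`, which is often not true for real PDB chains.

```python
import re


def find_motif_positions(sequence, pattern, first_residue_number=1):
    """Return the sorted residue numbers covered by any match of `pattern`.

    The pattern is wrapped in a lookahead so that overlapping matches are
    found as well; a plain finditer would skip them.
    """
    lookahead = re.compile(f"(?=({pattern}))")
    positions = set()
    for match in lookahead.finditer(sequence):
        start, end = match.span(1)  # span of the captured motif itself
        positions.update(
            range(start + first_residue_number, end + first_residue_number)
        )
    return sorted(positions)


# Made-up sequence for illustration.
sequence = "MAPLPAPMPP"

# Two prolines separated by any one residue; the three matches overlap.
proline_motif = find_motif_positions(sequence, "P.P")

# A single-letter pattern selects every occurrence of that amino acid.
methionines = find_motif_positions(sequence, "M")
```

The returned residue numbers could then be passed to a viewer as a selection to color; how VENN hands selections to Jmol is not described in the text and is not modeled here.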