We present OnTheFly (http://bhapp.c2b2.columbia.edu/OnTheFly/index.php), a database comprising a systematic collection of transcription factors (TFs) of Drosophila melanogaster and their DNA-binding sites. TFs predicted in the Drosophila melanogaster genome are annotated and classified and their structures, obtained via experiment or homology models, are provided. All known preferred TF DNA-binding sites obtained from the B1H, DNase I and SELEX methodologies are presented. DNA shape parameters predicted for these sites are obtained from a high throughput server or from crystal structures of protein–DNA complexes where available. An important feature of the database is that all DNA-binding domains and their binding sites are fully annotated in a eukaryote using structural criteria and evolutionary homology. OnTheFly thus provides a comprehensive view of TFs and their binding sites that will be a valuable resource for deciphering non-coding regulatory DNA.
Cadherins embody a superfamily of cell-surface glycoproteins whose ectodomains contain multiple repeats of β-sandwich EC (extracellular cadherin) domains that adopt a similar fold to immunoglobulin domains. The best characterized cadherins are the vertebrate “classical” cadherins, which mediate adhesion via trans homodimerization between their membrane-distal EC1 domains that extend from apposed cells, and assemble intercellular adherens junctions through cis clustering. To form mature trans adhesive dimers, cadherin domains from apposed cells dimerize in a “strand-swapped” conformation. This occurs in a two-step binding process involving a fast-binding intermediate called the “X-dimer”. Trans dimers are less flexible than cadherin monomers, a factor which drives junction assembly following cell-cell contact by reducing the entropic cost associated with the formation of lateral cis oligomers. Cadherins outside of the classical subfamily appear to have evolved distinct adhesive mechanisms which are just now beginning to be understood.
High-quality NMR structures of the homo-dimeric proteins Bvu3908 (69 residues per monomer) from Bacteroides vulgatus and Bt2368 (74 residues per monomer) from Bacteroides thetaiotaomicron reveal the presence of winged helix-turn-helix (wHTH) motifs mediating tight complex formation. Such homo-dimer formation by winged HTH motifs is otherwise found only in two DNA-binding proteins with known structure: the C-terminal wHTH domain of transcriptional activator FadR from E. coli and protein TubR from B. thuringiensis, which is involved in plasmid DNA segregation. However, the relative orientation of the wHTH motifs is different and residues involved in DNA-binding are not conserved in Bvu3908 and Bt2368. Hence, the proteins of the present study are not very likely to bind DNA, but are likely to exhibit a function that has thus far not been ascribed to homo-dimers formed by winged HTH motifs. The structures of Bvu3908 and Bt2368 are the first atomic resolution structures for PFAM family PF10771, a family of unknown function (DUF2582) currently containing 128 members.
Bvu3908; Bt2368; PF10771; DUF2582; Winged helix-turn-helix; Structural genomics
Emerging evidence indicates that membrane lipids regulate protein networking by directly interacting with protein-interaction domains (PIDs). As a pilot study to identify and functionally annotate lipid-binding PIDs on a genomic scale, we performed experimental and computational studies of PDZ domains. Characterization of 70 PDZ domains showed that 40% had submicromolar membrane affinity. Using a computational model built from these data, we predicted the membrane binding properties of 2000 PDZ domains from 20 species. The accuracy of the prediction was experimentally validated for 26 PDZ domains. We also subdivided lipid-binding PDZ domains into three classes based on the interplay between membrane and protein binding sites. For different classes of PDZ domains, lipid binding regulates their protein interactions by different mechanisms. Functional studies of a PDZ domain protein, rhophilin2, suggest that all classes of lipid-binding PDZ domains serve as genuine dual-specificity modules regulating protein interactions at the membrane under physiological conditions.
The genome-wide identification of pairs of interacting proteins is an important step in the elucidation of cell regulatory mechanisms1,2. Much of our current knowledge derives from high-throughput techniques such as yeast two hybrid and affinity purification3, as well as from manual curation of experiments on individual systems4. A variety of computational approaches based, for example, on sequence homology, gene co-expression, and phylogenetic profiles have also been developed for the genome-wide inference of protein-protein interactions (PPIs)5,6. Yet, comparative studies suggest that the development of accurate and complete repertoires of PPIs is still in its early stages7–9. Here we show that three-dimensional structural information can be used to predict PPIs with an accuracy and coverage that are superior to predictions based on non-structural evidence. Moreover, an algorithm, PrePPI, that combines structural information with other functional clues is comparable in accuracy to high-throughput experiments, yielding over 30,000 high confidence interactions for yeast and over 300,000 for human. Experimental tests of a number of predictions demonstrate the ability of the PrePPI algorithm to identify unexpected PPIs of significant biological interest. The surprising effectiveness of three-dimensional structural information can be attributed to the use of homology models combined with the exploitation of both close and remote geometric relationships between proteins.
The Center for the Multiscale Analysis of Genetic Networks (MAGNet, http://magnet.c2b2.columbia.edu) was established in 2005, with the mission of providing the biomedical research community with Structural and Systems Biology algorithms and software tools for the dissection of molecular interactions and for the interaction-based elucidation of cellular phenotypes. Over the last 7 years, MAGNet investigators have developed many novel analysis methodologies, which have led to important biological discoveries, including understanding the role of the DNA shape in protein–DNA binding specificity and the discovery of genes causally related to the presentation of malignant phenotypes, including lymphoma, glioma, and melanoma. Software tools implementing these methodologies have been broadly adopted by the research community and are made freely available through geWorkbench, the Center's integrated analysis platform. Additionally, MAGNet has been instrumental in organizing and developing key conferences and meetings focused on the emerging field of systems biology and regulatory genomics, with special focus on cancer-related research.
Nectins are immunoglobulin superfamily glycoproteins that mediate intercellular adhesion in many vertebrate tissues. Homophilic and heterophilic interactions between nectin family members help to mediate tissue patterning. We determined homophilic binding affinities and heterophilic specificities of all four nectins and the related protein nectin-like 5 from human and mouse, revealing a range of homophilic strengths and a defined heterophilic specificity pattern. To understand the molecular basis of adhesion and specificity, we determined crystal structures of natively glycosylated full ectodomains or adhesive fragments of nectins 1–4 and nectin-like 5. All crystal structures reveal dimeric nectins bound through a stereotyped interface previously proposed to represent a cis dimer. However, conservation of this interface and results of targeted cross-linking experiments show that this dimer likely represents the adhesive trans interaction. Its structure provides a simple molecular explanation for the adhesive binding specificity of nectins.
DNA shape variation and the associated variation in minor groove electrostatic potential are widely exploited by proteins for DNA recognition. Here we show that the hydroxyl radical cleavage pattern is a quantitative measure of DNA backbone solvent accessibility, minor groove width, and minor groove electrostatic potential, at single nucleotide resolution. We introduce maps of DNA shape and electrostatic potential as tools for understanding how proteins recognize binding sites in a genome. These maps reveal periodic structural signals in yeast and Drosophila genomic DNA sequences that are associated with positioned nucleosomes.
Members of transcription factor families typically have similar DNA binding specificities yet execute unique functions in vivo. Transcription factors often bind DNA as multiprotein complexes, raising the possibility that complex formation might modify their DNA binding specificities. To test this hypothesis, we developed an experimental and computational platform, SELEX-seq, that can be used to determine the relative affinities to any DNA sequence for any transcription factor complex. Applying this method to all eight Drosophila Hox proteins, we show that they obtain novel recognition properties when they bind DNA with the dimeric cofactor Extradenticle-Homothorax (Exd). Exd-Hox specificities group into three main classes that obey Hox gene collinearity rules, and DNA structure predictions suggest that anterior and posterior Hox proteins prefer DNA sequences with distinct minor groove topographies. Together, these data suggest that emergent DNA recognition properties revealed by interactions with cofactors contribute to transcription factor specificities in vivo.
PrePPI (http://bhapp.c2b2.columbia.edu/PrePPI) is a database that combines predicted and experimentally determined protein–protein interactions (PPIs) using a Bayesian framework. Predicted interactions are assigned probabilities of being correct, which are derived from calculated likelihood ratios (LRs) by combining structural, functional, evolutionary and expression information, with the most important contribution coming from structure. Experimentally determined interactions are compiled from a set of public databases that manually collect PPIs from the literature and are also assigned LRs. A final probability is then assigned to every interaction by combining the LRs for both predicted and experimentally determined interactions. The current version of PrePPI contains ∼2 million PPIs that have a probability greater than ∼0.1, of which ∼60 000 PPIs for yeast and ∼370 000 PPIs for human are considered high confidence (probability > 0.5). The PrePPI database constitutes an integrated resource that enables users to examine aggregate information on PPIs, including both known and potentially novel interactions, and that provides structural models for many of the PPIs.
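The way likelihood ratios can be turned into a final probability may be illustrated with a minimal sketch. It assumes a naive Bayes combination, in which independent LRs are multiplied and scaled by prior odds; the abstract does not specify this exact form. The function names, the prior odds, and the example LR values are all hypothetical and are not taken from PrePPI.

```python
from math import prod


def posterior_probability(likelihood_ratios, prior_odds):
    """Convert likelihood ratios into a probability.

    Assumes the evidence sources are independent (naive Bayes):
    posterior odds = prior odds * product of the LRs.
    """
    posterior_odds = prior_odds * prod(likelihood_ratios)
    return posterior_odds / (1.0 + posterior_odds)


def is_high_confidence(probability, threshold=0.5):
    """Apply the probability > 0.5 cutoff quoted in the abstract."""
    return probability > threshold


# Hypothetical LRs for one candidate interaction (illustrative values only).
example_lrs = {"structural": 200.0, "functional": 5.0, "expression": 1.5}
# Hypothetical prior odds that a random protein pair interacts.
example_prior_odds = 1.0 / 1000.0

example_probability = posterior_probability(example_lrs.values(), example_prior_odds)
print(round(example_probability, 3), is_high_confidence(example_probability))
```

With these made-up numbers the combined LR is 1500, the posterior odds are 1.5, and the resulting probability is 0.6, which would pass the high-confidence cutoff. The example also shows how a single large structural LR can dominate the outcome.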
Specific interactions between proteins and DNA are fundamental to many biological processes. In this review, we provide a revised view of protein-DNA interactions that emphasizes the importance of the three-dimensional structures of both macromolecules. We divide protein-DNA interactions into two categories: those where the protein recognizes the unique chemical signatures of the DNA bases (base readout) and those where the protein recognizes a sequence-dependent DNA shape (shape readout). We further divide base readout into interactions that occur in the major groove and those that occur in the minor groove. Analogously, the readout of DNA shape is subdivided into global shape recognition, for example when the DNA helix exhibits an overall bend, and local shape recognition, for example when a base pair step is kinked or when a region of the minor groove is narrow. Based on the >1500 structures of protein-DNA complexes now available in the Protein Data Bank, we argue that individual DNA binding proteins combine multiple readout mechanisms to achieve DNA binding specificity. Specificity that distinguishes between families frequently involves base readout in the major groove, while shape readout is often exploited for higher resolution specificity, to distinguish between members within the same DNA-binding protein family.
Protein-DNA binding; Direct readout; Indirect readout; DNA base recognition; DNA shape recognition; Narrow minor groove; DNA kinks; DNA bending; B-DNA; A-DNA; Z-DNA
p53 binds as a tetramer to DNA targets consisting of two decameric half-sites separated by a variable spacer. Here we present high-resolution crystal structures of complexes between p53 core-domain tetramers and DNA targets consisting of contiguous half-sites. In contrast to previously reported p53-DNA complexes that display standard Watson-Crick base pairs, the newly reported structures exhibit non-canonical Hoogsteen base-pairing geometry at the central A/T doublet of each half-site. Structural and computational analyses demonstrate that the Hoogsteen geometry distinctly modulates the B-DNA helix in terms of local shape and electrostatic potential, which together with the contiguous DNA configuration results in enhanced protein-DNA and protein-protein interactions compared to non-contiguous half-sites. Our results suggest a mechanism that relates spacer length to protein-DNA binding affinity. Our findings also expand the current understanding of protein-DNA recognition and establish the structural and chemical properties of Hoogsteen base pairs as the basis for a novel mode of sequence readout.
Membrane-bound receptors often form large assemblies resulting from binding to soluble ligands, cell-surface molecules on other cells, and extracellular matrix proteins1. For example, the association of membrane proteins with proteins on different cells (trans interactions) can drive the oligomerization of proteins on the same cell (cis interactions)2. A central problem in understanding the molecular basis of such phenomena is that equilibrium constants are generally measured in three-dimensional (3D) solution and are thus difficult to relate to the two-dimensional (2D) environment of a membrane surface. Here we present a theoretical treatment that converts 3D to 2D affinities accounting directly for the structure and dynamics of the membrane-bound molecules. Using a multi-scale simulation approach we apply the theory to explain the formation of ordered junction-like clusters by classical cadherin adhesion proteins. The approach includes atomic-scale molecular dynamics simulations to determine inter-domain flexibility, Monte-Carlo simulations of multi-domain motion, and lattice simulations of junction formation3. A finding of general relevance is that changes in inter-domain motion upon trans binding play a crucial role in driving the lateral, cis, clustering of adhesion receptors.
Cell adhesion by classical cadherins is mediated by dimerization of their EC1 domains through the “swapping” of N-terminal β-strands. We use molecular simulations, measurements of binding affinities, and x-ray crystallography to provide a detailed picture of the structural and energetic factors that control the adhesive dimerization of cadherins. We show that strand swapping in EC1 is driven by conformational strain in cadherin monomers which arises from the anchoring of their short N-terminal strand at one end by the conserved Trp2 and at the other by ligation to Ca2+ ions. We also demonstrate that a conserved pro-pro motif functions to avoid the formation of an overly tight interface where affinity differences between different cadherins, crucial at the cellular level, are lost. We use these findings to design site-directed mutations which transform a monomeric EC2-EC3 domain cadherin construct into a strand-swapped dimer.
Vascular endothelial (VE)–cadherin, a divergent member of the type II classical cadherin family of cell adhesion proteins, mediates homophilic adhesion in the vascular endothelium. Previous investigations with a bacterially-produced protein suggested that VE-cadherin forms cell surface trimers which bind between apposed cells to form hexamers. Here we report studies of mammalian-produced VE-cadherin ectodomains which suggest that, like other classical cadherins, VE-cadherin forms adhesive trans-dimers between monomers located on opposing cell surfaces. Trimerization of the bacterially-produced protein appears to be an artifact that arises from a lack of glycosylation. We also present the 2.1Å resolution crystal structure of the VE-cadherin EC1-2 adhesive region which reveals homodimerization via the strand swap mechanism common to classical cadherins. In common with type II cadherins, strand swap binding involves two tryptophan anchor residues, but the adhesive interface resembles type I cadherins in that VE-cadherin does not form a large non-swapped hydrophobic surface. Thus, VE-cadherin is an outlier among classical cadherins, with characteristics of both type I and type II subfamilies.
Cell-cell adhesion; N-glycosylation; cadherin adhesive binding; domain swapping
Protein structure modeling by homology requires an accurate sequence alignment between the query protein and its structural template. However, sequence alignment methods based on dynamic programming (DP) are typically unable to generate accurate alignments for remote sequence homologs, thus limiting the applicability of modeling methods. A central problem is that the alignment that is “optimal” in terms of the DP score does not necessarily correspond to the alignment that produces the most accurate structural model. That is, the correct alignment based on structural superposition will generally have a lower score than the optimal alignment obtained from sequence. Variations of the DP algorithm have been developed that generate alternative alignments that are “suboptimal” in terms of the DP score, but these still encounter difficulties in detecting the correct structural alignment. We present here a new alternative sequence alignment method that relies heavily on the structure of the template. By initially aligning the query sequence to individual fragments in secondary structure elements and combining high-scoring fragments that pass basic tests for “modelability”, we can generate accurate alignments within a small ensemble. Our results suggest that the set of sequences that can currently be modeled by homology can be greatly extended.
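To make concrete what "optimal in terms of the DP score" means, the following minimal sketch computes a global alignment score with a standard Needleman-Wunsch recurrence. It is a generic textbook formulation, not the method of the paper; the match, mismatch, and linear gap values are arbitrary assumptions chosen for illustration.

```python
def global_alignment_score(query, template, match=1, mismatch=-1, gap=-2):
    """Return the best global alignment score between two sequences.

    Generic Needleman-Wunsch dynamic programming with a linear gap
    penalty. The scoring values are illustrative assumptions.
    """
    rows, cols = len(query) + 1, len(template) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against an empty sequence costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if query[i - 1] == template[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two residues
                score[i - 1][j] + gap,       # gap in the template
                score[i][j - 1] + gap,       # gap in the query
            )
    return score[-1][-1]


print(global_alignment_score("ACGT", "AGT"))
```

The recurrence returns only the highest-scoring path under the chosen scoring scheme. An alignment derived from structural superposition corresponds to some other path through the same matrix and can score lower, which is the gap between "optimal" and "correct" that the abstract describes.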
It has been suggested that, for nearly every protein sequence, there is already a protein with a similar structure in current protein structure databases. However, with poor or undetectable sequence relationships, it is expected that accurate alignments and models cannot be generated. Here we show that this is not the case, and that whenever structural relationship exists, there are usually local sequence relationships that can be used to generate an accurate alignment, no matter what the global sequence identity. However, this requires an alternative to the traditional dynamic programming algorithm and the consideration of a small ensemble of alignments. We present an algorithm, S4, and demonstrate that it is capable of generating accurate alignments in nearly all cases where a structural relationship exists between two proteins. Our results thus constitute an important advance in the full exploitation of the information in structural databases. That is, the expectation of an accurate alignment suggests that a meaningful model can be generated for nearly every sequence for which a suitable template exists.
The New York Consortium on Membrane Protein Structure (NYCOMPS) was formed to accelerate the acquisition of structural information on membrane proteins by applying a structural genomics approach. NYCOMPS comprises a bioinformatics group, a centralized facility operating a high-throughput cloning and screening pipeline, a set of associated wet labs that perform high-level protein production and structure determination by x-ray crystallography and NMR, and a set of investigators focused on methods development. In the first three years of operation, the NYCOMPS pipeline has so far produced and screened 7,250 expression constructs for 8,045 target proteins. Approximately 600 of these verified targets were scaled up to levels required for structural studies, so far yielding 24 membrane protein crystals. Here we describe the overall structure of NYCOMPS and provide details on the high-throughput pipeline.
Membrane proteins; Structural genomics; High throughput; NMR; X-ray
Adherens junctions, which play a central role in intercellular adhesion, comprise clusters of type I classical cadherins that bind via extracellular domains extended from opposing cell surfaces. We show that a molecular layer seen in crystal structures of E- and N-cadherin ectodomains reported here and in the C-cadherin structure corresponds to the extracellular architecture of adherens junctions. In all three ectodomain crystals, cadherins dimerize through a trans adhesive interface and are connected by a second, cis, interface. Assemblies formed by E-cadherin ectodomains coated on liposomes also appear to adopt this structure. Fluorescent imaging of junctions formed from wild-type and mutant E-cadherins in cultured cells confirms conclusions derived from structural evidence. Mutations that interfere with the trans interface ablate adhesion, whereas cis interface mutations disrupt stable junction formation. Our observations are consistent with a model for junction assembly involving strong trans and weak cis interactions localized in the ectodomain.
Alternatively spliced β-neurexins (β-NRXs) and neuroligins (NLs) are thought to have distinct extracellular binding affinities, potentially providing a β-NRX/NL synaptic recognition code. We have utilized surface plasmon resonance to measure binding affinities between all sixty combinations of alternatively spliced ectodomains of β-NRXs 1–3 and NLs 1–3. Binding was observed for all β-NRX/NL pairs. The presence of the NL1 B splice insertion lowers β-NRX binding affinity by ~2-fold, while β-NRX splice insertion 4 has small effects that do not synergize with NL splicing. New structures of glycosylated β-NRXs 1 and 2 containing splice insertion 4 reveal that the insertion forms a new β-strand that replaces the β10-strand, leaving the NL binding site intact. This helps to explain the limited effect of splice insert 4 on NRX/NL binding affinities. These results provide new structural insights and quantitative binding information to help determine whether and how splice isoform choice plays a role in β-NRX/NL mediated synaptic recognition.
We describe MarkUs, a web server for analysis and comparison of the structural and functional properties of proteins. In contrast to a ‘structure in/function out’ approach to protein function annotation, the server is designed to be highly interactive and to allow flexibility in the examination of possible functions, suggested either automatically by various similarity measures or specified by a user directly. This is combined with tools that allow a user to assess independently whether or not a suggested function is consistent with the bioinformatic and biophysical properties of a given query structure, further allowing the user to generate testable hypotheses. The server is available at http://wiki.c2b2.columbia.edu/honiglab_public/index.php/Software:Mark-Us.
We describe PredUs, an interactive web server for the prediction of protein–protein interfaces. Potential interfacial residues for a query protein are identified by ‘mapping’ contacts from known interfaces of the query protein’s structural neighbors to surface residues of the query. We calculate a score for each residue to be interfacial with a support vector machine. Results can be visualized in a molecular viewer and a number of interactive features allow users to tailor a prediction to a particular hypothesis. The PredUs server is available at: http://wiki.c2b2.columbia.edu/honiglab_public/index.php/Software:PredUs.
Automatic modeling methods using cryo-electron microscopy (cryoEM) density maps as constraints are promising approaches to building atomic models of individual proteins or protein domains. However, their application to large macromolecular assemblies has not been possible largely due to computational limitations inherent to such unsupervised methods. Here we describe a new method, EM-IMO, for building, modifying and refining local structures of protein models using cryoEM maps as a constraint. As a supervised refinement method, EM-IMO allows users to specify parameters derived from inspections, so as to guide, and as a consequence, significantly speed up the refinement. An EM-IMO-based refinement protocol is first benchmarked on a data set of 50 homology models using simulated density maps. A multi-scale refinement strategy that combines EM-IMO-based and molecular dynamics (MD)-based refinement is then applied to build backbone models for the seven conformers of the five capsid proteins in our near-atomic resolution cryoEM map of the grass carp reovirus (GCRV) virion, a member of the aquareovirus genus of the Reoviridae family. The refined models allow us to reconstruct a backbone model of the entire GCRV capsid and provide valuable functional insights that are described in the accompanying publication. Our study demonstrates that the integrated use of homology modeling and a multi-scale refinement protocol that combines supervised and automated structure refinement offers a practical strategy for building atomic models based on medium- to high-resolution cryoEM density maps.
cryo-electron microscopy; density fitting; homology modeling; structure refinement; protein structure prediction
VPA0419; yiiS; PFAM 04175; structural genomics; GFT NMR
Proteins rely on a variety of readout mechanisms to preferentially bind specific DNA sequences. The nucleosome offers a prominent example of a shape readout mechanism where arginines insert into narrow minor groove regions that face the histone core. Here we compare DNA shape and arginine recognition of three nucleosome core particle structures, expanding on our previous study by characterizing two additional structures, one with a different protein sequence and one with a different DNA sequence. The electrostatic potential in the minor groove is shown to be largely independent of the underlying sequence but is, however, dominated by groove geometry. Our results extend and generalize our previous observation that the interaction of arginines with narrow minor grooves plays an important role in stabilizing the deformed DNA in the nucleosome.
Crystal structures of classical cadherins have revealed two dimeric configurations: in the first, N-terminal β-strands of EC1 domains “swap” between partner molecules. The second configuration (the “X-dimer”), also observed for T-cadherin, is mediated by residues near the EC1-2 calcium binding sites, and N-terminal β-strands of partner EC1 domains, though held adjacent, do not swap. Here we show that strand swapping mutants of type I and II classical cadherins form X-dimers. Mutant cadherins impaired for X-dimer formation show no binding in short timeframe surface plasmon resonance assays but in long timeframe experiments, have homophilic binding affinities close to wild-type. Further experiments show that exchange between monomers and dimers is slowed in these mutants. These results reconcile apparently disparate results from prior structural studies, and suggest that X-dimers are binding intermediates that facilitate the formation of strand swapped dimers.