A new module, Guided Ligand Replacement (GLR), has been developed in Phenix to increase the ease and success rate of ligand placement when prior protein-ligand complexes are available.
The process of iterative structure-based drug design involves the X-ray crystal structure determination of upwards of 100 ligands with the same general scaffold (i.e. chemotype) complexed with very similar, if not identical, protein targets. In conjunction with insights from computational models and assays, this collection of crystal structures is analyzed to improve potency, to achieve better selectivity and to reduce liabilities such as absorption, distribution, metabolism, excretion and toxicology. Current methods for modeling ligands into electron-density maps typically do not utilize information on how similar ligands bound in related structures. Even if the electron density is of sufficient quality and resolution to allow de novo placement, the process can take considerable time as the size, complexity and torsional degrees of freedom of the ligands increase. A new module, Guided Ligand Replacement (GLR), was developed in Phenix to increase the ease and success rate of ligand placement when prior protein–ligand complexes are available. At the heart of GLR is an algorithm based on graph theory that associates atoms in the target ligand with analogous atoms in the reference ligand. Based on this correspondence, a set of coordinates is generated for the target ligand. GLR is especially useful in two situations: (i) modeling a series of large, flexible, complicated or macrocyclic ligands in successive structures and (ii) modeling ligands as part of a refinement pipeline that can automatically select a reference structure. Even in those cases for which no reference structure is available, if there are multiple copies of the bound ligand per asymmetric unit GLR offers an efficient way to complete the model after the first ligand has been placed. In all of these applications, GLR leverages prior knowledge from earlier structures to facilitate ligand placement in the current structure.
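The atom-correspondence idea at the heart of GLR can be illustrated with a small sketch. This is a toy stand-in, not the Phenix implementation: ligands are treated as element-labelled graphs, a bond-consistent mapping is found by backtracking search, and reference coordinates are copied onto matched target atoms to seed the new placement.

```python
# Toy illustration of graph-based atom matching in the spirit of GLR:
# find a correspondence between target and reference ligand atoms
# (element-labelled graphs), then seed target coordinates from the
# reference. Simplified sketch only; not the Phenix algorithm.

def match_atoms(target, reference):
    """Backtracking search for an element- and bond-consistent mapping
    target_atom -> reference_atom. Each ligand is given as
    {'elements': [...], 'bonds': {(i, j), ...}} with i < j."""
    t_el, r_el = target["elements"], reference["elements"]
    t_bonds, r_bonds = target["bonds"], reference["bonds"]

    def bonded(bonds, a, b):
        return (min(a, b), max(a, b)) in bonds

    def extend(mapping, used):
        i = len(mapping)
        if i == len(t_el):  # every target atom has been mapped
            return mapping
        for j in range(len(r_el)):
            if j in used or t_el[i] != r_el[j]:
                continue
            # bonds among already-mapped atoms must be preserved
            if all(bonded(t_bonds, i, k) == bonded(r_bonds, j, mapping[k])
                   for k in range(i)):
                result = extend(mapping + [j], used | {j})
                if result is not None:
                    return result
        return None

    return extend([], set())

def seed_coordinates(target, reference, mapping):
    """Copy reference coordinates onto the matched target atoms."""
    return [reference["xyz"][j] for j in mapping]
```

For a simple C-C-O fragment matched against an identical reference, the mapping is the identity and the target inherits the reference coordinates directly; in practice the interesting cases are partial matches between a new chemotype and an earlier analogue.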
ligand placement; guided ligand-replacement method; GLR
A software system for automated protein–ligand crystallography has been implemented in the Phenix suite. This significantly reduces the manual effort required in high-throughput crystallographic studies.
High-throughput drug-discovery and mechanistic studies often require the determination of multiple related crystal structures that only differ in the bound ligands, point mutations in the protein sequence and minor conformational changes. If performed manually, solution and refinement requires extensive repetition of the same tasks for each structure. To accelerate this process and minimize manual effort, a pipeline encompassing all stages of ligand building and refinement, starting from integrated and scaled diffraction intensities, has been implemented in Phenix. The resulting system is able to successfully solve and refine large collections of structures in parallel without extensive user intervention prior to the final stages of model completion and validation.
protein–ligand complexes; automation; crystallographic structure solution and refinement
A low-flow-rate liquid-microjet method for delivering hydrated protein crystals to X-ray lasers is presented. Data from the Linac Coherent Light Source demonstrate serial femtosecond protein crystallography with micrograms of protein, reducing sample consumption by orders of magnitude.
An electrospun liquid microjet has been developed that delivers protein microcrystal suspensions at flow rates of 0.14–3.1 µl min⁻¹ to perform serial femtosecond crystallography (SFX) studies with X-ray lasers. Thermolysin microcrystals flowed at 0.17 µl min⁻¹ and diffracted to beyond 4 Å resolution, producing 14 000 indexable diffraction patterns, or four per second, from 140 µg of protein. Nanoflow electrospinning extends SFX to biological samples that necessitate minimal sample consumption.
serial femtosecond crystallography; nanoflow electrospinning
The functionality of the molecular-replacement pipeline phaser.MRage is introduced and illustrated with examples.
Phaser.MRage is a molecular-replacement automation framework that implements a full model-generation workflow and provides several layers of model exploration to the user. It is designed to handle a large number of models and can distribute calculations efficiently onto parallel hardware. In addition, phaser.MRage can identify correct solutions and use this information to accelerate the search. Firstly, it can quickly score all alternative models of a component once a correct solution has been found. Secondly, it can perform extensive analysis of identified solutions to find protein assemblies and can employ assembled models for subsequent searches. Thirdly, it is able to use a priori assembly information (derived from, for example, homologues) to speculatively place and score molecules, thereby customizing the search procedure to a certain class of protein molecule (for example, antibodies) and incorporating additional biological information into molecular replacement.
molecular replacement; pipeline; automation; phaser.MRage
Intense femtosecond X-ray pulses produced at the Linac Coherent Light Source (LCLS) were used for simultaneous X-ray diffraction (XRD) and X-ray emission spectroscopy (XES) of microcrystals of Photosystem II (PS II) at room temperature. This method probes the overall protein structure and the electronic structure of the Mn4CaO5 cluster in the oxygen-evolving complex of PS II. XRD data are presented from both the dark state (S1) and the first illuminated state (S2) of PS II. Our simultaneous XRD/XES study shows that the PS II crystals are intact during our measurements at the LCLS, not only with respect to the structure of PS II, but also with regard to the electronic structure of the highly radiation sensitive Mn4CaO5 cluster, opening new directions for future dynamics studies.
The Computational Crystallography Toolbox (cctbx) is a flexible software platform that has been used to develop high-throughput crystal-screening tools for both synchrotron sources and X-ray free-electron lasers. Plans for data-processing and visualization applications are discussed, and the benefits and limitations of using graphics-processing units are evaluated.
Current pixel-array detectors produce diffraction images at extreme data rates (of up to 2 TB h⁻¹) that make severe demands on computational resources. New multiprocessing frameworks are required to achieve rapid data analysis, as it is important to be able to inspect the data quickly in order to guide the experiment in real time. By utilizing readily available web-serving tools that interact with the Python scripting language, it was possible to implement a high-throughput Bragg-spot analyzer (cctbx.spotfinder) that is presently in use at numerous synchrotron-radiation beamlines. Similarly, Python interoperability enabled the production of a new data-reduction package (cctbx.xfel) for serial femtosecond crystallography experiments at the Linac Coherent Light Source (LCLS). Future data-reduction efforts will need to focus on specialized problems such as the treatment of diffraction spots on interleaved lattices arising from multi-crystal specimens. In these challenging cases, accurate modeling of close-lying Bragg spots could benefit from the high-performance computing capabilities of graphics-processing units.
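The combination of spot analysis and multiprocessing described above can be sketched in miniature. This is a deliberately simplified, hypothetical stand-in for cctbx.spotfinder: each image is thresholded, bright pixels are grouped into connected components ("spots"), and a worker pool fans the per-image work out across cores.

```python
# Toy sketch of parallel Bragg-spot counting: threshold each image,
# group above-threshold pixels into 4-connected components, and process
# many images with a multiprocessing pool. The real cctbx.spotfinder
# uses far more sophisticated criteria; the cutoff below is an
# assumption for illustration only.
from multiprocessing import Pool

THRESHOLD = 100  # assumed intensity cutoff for "bright" pixels

def count_spots(image):
    """Count 4-connected components of above-threshold pixels in a
    2-D image given as a list of rows."""
    h, w = len(image), len(image[0])
    seen, spots = set(), 0
    for y in range(h):
        for x in range(w):
            if image[y][x] < THRESHOLD or (y, x) in seen:
                continue
            spots += 1
            stack = [(y, x)]  # flood-fill one spot
            while stack:
                cy, cx = stack.pop()
                if (cy, cx) in seen:
                    continue
                seen.add((cy, cx))
                for ny, nx in ((cy + 1, cx), (cy - 1, cx),
                               (cy, cx + 1), (cy, cx - 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and image[ny][nx] >= THRESHOLD):
                        stack.append((ny, nx))
    return spots

def analyze_run(images, workers=4):
    """Fan images out to a worker pool; returns per-image spot counts."""
    with Pool(workers) as pool:
        return pool.map(count_spots, images)
```

The `Pool.map` pattern is the essential point: per-image analysis is embarrassingly parallel, which is what makes real-time feedback at high detector data rates feasible.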
data processing; reusable code; multiprocessing; cctbx
X-ray crystallography is a critical tool in the study of biological systems. It is able to provide information that has been a prerequisite to understanding the fundamentals of life. It is also a method that is central to the development of new therapeutics for human disease. Significant time and effort are required to determine and optimize many macromolecular structures because of the need for manual interpretation of complex numerical data, often using many different software packages, and the repeated use of interactive three-dimensional graphics. The Phenix software package has been developed to provide a comprehensive system for macromolecular crystallographic structure solution with an emphasis on automation. This has required the development of new algorithms that minimize or eliminate subjective input in favour of built-in expert-systems knowledge, the automation of procedures that are traditionally performed by hand, and the development of a computational framework that allows a tight integration between the algorithms. The application of automated methods is particularly appropriate in the field of structural proteomics, where high throughput is desired. Features in Phenix for the automation of experimental phasing with subsequent model building, molecular replacement, structure refinement and validation are described and examples given of running Phenix from both the command line and graphical user interface.
macromolecular crystallography; automation; Phenix; X-ray; diffraction; Python
The foundations and current features of a widely used graphical user interface for macromolecular crystallography are described.
A new Python-based graphical user interface for the PHENIX suite of crystallography software is described. This interface unifies the command-line programs and their graphical displays, simplifying the development of new interfaces and avoiding duplication of function. With careful design, graphical interfaces can be displayed automatically, instead of being manually constructed. The resulting package is easily maintained and extended as new programs are added or modified.
macromolecular crystallography; graphical user interfaces; PHENIX
phenix.refine is a program within the PHENIX package that supports crystallographic structure refinement against experimental data with a wide range of upper resolution limits using a large repertoire of model parameterizations. This paper presents an overview of the major phenix.refine features, with extensive literature references for readers interested in more detailed discussions of the methods.
phenix.refine is a program within the PHENIX package that supports crystallographic structure refinement against experimental data with a wide range of upper resolution limits using a large repertoire of model parameterizations. It has several automation features and is also highly flexible. Several hundred parameters enable extensive customizations for complex use cases. Multiple user-defined refinement strategies can be applied to specific parts of the model in a single refinement run. An intuitive graphical user interface is available to guide novice users and to assist advanced users in managing refinement projects. X-ray or neutron diffraction data can be used separately or jointly in refinement. phenix.refine is tightly integrated into the PHENIX suite, where it serves as a critical component in automated model building, final structure refinement, structure validation and deposition to the wwPDB. This paper presents an overview of the major phenix.refine features, with extensive literature references for readers interested in more detailed discussions of the methods.
structure refinement; PHENIX; joint X-ray/neutron refinement; maximum likelihood; TLS; simulated annealing; subatomic resolution; real-space refinement; twinning; NCS
Recent developments in PHENIX are reported that allow the use of reference-model torsion restraints, secondary-structure hydrogen-bond restraints and Ramachandran restraints for improved macromolecular refinement in phenix.refine at low resolution.
Traditional methods for macromolecular refinement often have limited success at low resolution (3.0–3.5 Å or worse), producing models that score poorly on crystallographic and geometric validation criteria. To improve low-resolution refinement, knowledge from macromolecular chemistry and homology was used to add three new coordinate-restraint functions to the refinement program phenix.refine. Firstly, a ‘reference-model’ method uses an identical or homologous higher resolution model to add restraints on torsion angles to the geometric target function. Secondly, automatic restraints for common secondary-structure elements in proteins and nucleic acids were implemented that can help to preserve the secondary-structure geometry, which is often distorted at low resolution. Lastly, we have implemented Ramachandran-based restraints on the backbone torsion angles. In this method, a ϕ,ψ term is added to the geometric target function to minimize a modified Ramachandran landscape that smoothly combines favorable peaks identified from nonredundant high-quality data with unfavorable peaks calculated using a clash-based pseudo-energy function. All three methods show improved MolProbity validation statistics, typically complemented by a lowered Rfree and a decreased gap between Rwork and Rfree.
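The reference-model idea can be reduced to a minimal sketch: a harmonic penalty pulls each model torsion angle toward the value observed in a higher-resolution reference structure. This is an assumption-laden simplification; the actual phenix.refine term includes a weighting scheme and top-out behaviour for genuinely divergent torsions, both omitted here.

```python
# Minimal sketch of a reference-model torsion restraint: a harmonic
# penalty on the deviation of each model torsion from its reference
# value. Simplified illustration; not the phenix.refine functional form.
import math

def angle_diff(a, b):
    """Signed difference a - b in degrees, wrapped into [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0

def torsion_restraint_energy(model_torsions, reference_torsions,
                             weight=1.0):
    """Sum of weight * delta^2 over corresponding torsion angles."""
    return sum(weight * angle_diff(m, r) ** 2
               for m, r in zip(model_torsions, reference_torsions))
```

The angle-wrapping step matters: a model torsion of 350° and a reference of 10° differ by 20°, not 340°, and a restraint that ignores periodicity would pull the model the long way around.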
macromolecular crystallography; low resolution; refinement; automation
The combination of algorithms from the structure-modeling field with those of crystallographic structure determination can broaden the range of templates that are useful for structure determination by the method of molecular replacement. Automated tools in phenix.mr_rosetta simplify the application of these combined approaches by integrating Phenix crystallographic algorithms and Rosetta structure-modeling algorithms and by systematically generating and evaluating models with a combination of these methods. The phenix.mr_rosetta algorithms can be used to automatically determine challenging structures. The approaches used in phenix.mr_rosetta are described along with examples that show roles that structure-modeling can play in molecular replacement.
Molecular replacement; Automation; Macromolecular crystallography; Rosetta; Phenix
Structural biology and structural genomics projects routinely rely on recombinantly expressed proteins, but many proteins and complexes are difficult to obtain by this approach. We investigated native source proteins for high-throughput protein crystallography applications. The Escherichia coli proteome was fractionated, purified, crystallized, and structurally characterized. Macro-scale fermentation and fractionation were used to subdivide the soluble proteome into 408 unique fractions of which 295 fractions yielded crystals in microfluidic crystallization chips. Of the 295 crystals, 152 were selected for optimization, diffraction screening, and data collection. Twenty-three structures were determined, four of which were novel. This study demonstrates the utility of native source proteins for high-throughput crystallography.
The essential Mycobacterium tuberculosis Ser/Thr protein kinase (STPK), PknB, plays a key role in regulating growth and division, but the structural basis of activation has not been defined. Here we provide biochemical and structural evidence that dimerization through the kinase-domain (KD) N-lobe activates PknB by an allosteric mechanism. Promoting KD pairing using a small-molecule dimerizer stimulates the unphosphorylated kinase, and substitutions that disrupt N-lobe pairing decrease phosphorylation activity in vitro and in vivo. Multiple crystal structures of two monomeric PknB KD mutants in complex with nucleotide reveal diverse inactive conformations that contain large active-site distortions that propagate >30 Å from the mutation site. These results define flexible, inactive structures of a monomeric bacterial receptor KD and show how “back-to-back” N-lobe dimerization stabilizes the active KD conformation. This general mechanism of bacterial receptor STPK activation affords insights into the regulation of homologous eukaryotic kinases that form structurally similar dimers.
The PHENIX software for macromolecular structure determination is described.
Macromolecular X-ray crystallography is routinely applied to understand biological processes at a molecular level. However, significant time and effort are still required to solve and complete many of these structures because of the need for manual interpretation of complex numerical data using many software packages and the repeated use of interactive three-dimensional graphics. PHENIX has been developed to provide a comprehensive system for macromolecular crystallographic structure solution with an emphasis on the automation of all procedures. This has relied on the development of algorithms that minimize or eliminate subjective input, the development of algorithms that automate procedures that are traditionally performed by hand and, finally, the development of a framework that allows a tight integration between the algorithms.
PHENIX; Python; macromolecular crystallography; algorithms
The Myxococcus xanthus FrzS protein transits from pole-to-pole within the cell, accumulating at the pole that defines the direction of movement in social (S) motility. Here we show using atomic-resolution crystallography and NMR that the FrzS receiver domain (RD) displays the conserved switch Tyr102 in an unusual conformation, lacks the conserved Asp phosphorylation site, and fails to bind Mg2+ or the phosphoryl analogue, Mg2+·BeF3. Mutation of Asp55, closest to the canonical site of RD phosphorylation, showed no motility phenotype in vivo, demonstrating that phosphorylation at this site is not necessary for domain function. In contrast, the Tyr102Ala and His92Phe substitutions on the canonical output face of the FrzS RD abolished S-motility in vivo. Single-cell fluorescence microscopy measurements revealed a striking mislocalization of these mutant FrzS proteins to the trailing cell pole in vivo. The crystal structures of the mutants suggested that the observed conformation of Tyr102 in the wild-type FrzS RD is not sufficient for function. These results support the model that FrzS contains a novel ‘pseudo-receiver domain’ whose function requires recognition of the RD output face but not Asp phosphorylation.
The database of molecular motions, MolMovDB, has been in existence for the past decade. It classifies macromolecular motions and provides tools to interpolate between two conformations (the Morph Server) and predict possible motions in a single structure. In 2005, we expanded the services offered on MolMovDB. In particular, we further developed the Morph Server to produce improved interpolations between two submitted structures. We added support for multiple chains to the original adiabatic mapping interpolation, allowing the analysis of subunit motions. We also added the option of using FRODA interpolation, which allows for more complex pathways, potentially overcoming steric barriers. We added an interface to a hinge prediction service, which acts on single structures and predicts likely residue points for flexibility. We developed tools to relate such points of flexibility in a structure to particular key residue positions, i.e. active sites or highly conserved positions. Lastly, we began relating our motion classification scheme to function using descriptions from the Gene Ontology Consortium.
DNA microarrays are widely used in biological research; by analyzing differential hybridization on a single microarray slide, one can detect changes in mRNA expression levels, increases in DNA copy numbers and the location of transcription factor binding sites on a genomic scale. Having performed the experiments, the major challenge is to process large, noisy datasets in order to identify the specific array elements that are significantly differentially hybridized. This normally requires aggregating different, often incompatible programs into a multi-step pipeline. Here we present ExpressYourself, a fully integrated platform for processing microarray data. In completely automated fashion, it will correct the background array signal, normalize the Cy5 and Cy3 signals, score levels of differential hybridization, combine the results of replicate experiments, filter problematic regions of the array and assess the quality of individual and replicate experiments. ExpressYourself is designed with a highly modular architecture so various types of microarray analysis algorithms can readily be incorporated as they are developed; for example, the system currently implements several normalization methods, including those that simultaneously consider signal intensity and slide location. The processed data are presented using a web-based graphical interface to facilitate comparison with the original images of the array slides. In particular, ExpressYourself is able to regenerate images of the original microarray after applying various steps of processing, which greatly facilitates identification of position-specific artifacts. The program is freely available for use at http://bioinfo.mbb.yale.edu/expressyourself.
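The simplest normalization step mentioned above, centring the two-channel ratios, can be sketched as follows. This is a hypothetical simplification: ExpressYourself also models intensity- and position-dependent effects, which this toy median-centring ignores.

```python
# Toy sketch of median-centred two-channel normalization: compute
# log2(Cy5/Cy3) per array element, then shift so the median ratio is
# zero, i.e. assume most elements are unchanged between channels.
import math

def log_ratios(cy5, cy3):
    """Per-element log2 ratio of the two channel intensities."""
    return [math.log2(a / b) for a, b in zip(cy5, cy3)]

def median(values):
    s = sorted(values)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2.0

def normalize(cy5, cy3):
    """Median-centred log2 ratios; positive = up in the Cy5 channel."""
    m = log_ratios(cy5, cy3)
    shift = median(m)
    return [r - shift for r in m]
```

After centring, elements with ratios far from zero are candidates for genuine differential hybridization; the intensity-aware methods in the real system refine this by letting the centring vary with signal strength and slide position.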
We present version 2 of the SPINE system for structural proteomics. SPINE is available over the web at http://nesg.org. It serves as the central hub for the Northeast Structural Genomics Consortium, allowing collaborative structural proteomics to be carried out in a distributed fashion. The core of SPINE is a laboratory information management system (LIMS) for key bits of information related to the progress of the consortium in cloning, expressing and purifying proteins and then solving their structures by NMR or X-ray crystallography. Originally, SPINE focused on tracking constructs, but, in its current form, it is able to track target sample tubes and store detailed sample histories. The core database comprises a set of standard relational tables and a data dictionary that form an initial ontology for proteomic properties and provide a framework for large-scale data mining. Moreover, SPINE sits at the center of a federation of interoperable information resources. These can be divided into (i) local resources closely coupled with SPINE that enable it to handle less standardized information (e.g. integrated mailing and publication lists), (ii) other information resources in the NESG consortium that are inter-linked with SPINE (e.g. crystallization LIMS local to particular laboratories) and (iii) international archival resources that SPINE links to and passes on information to (e.g. TargetDB at the PDB).
The Database of Macromolecular Movements (http://MolMovDB.org) is a collection of data and software pertaining to flexibility in protein and RNA structures. The database is organized into two parts. Firstly, a collection of ‘morphs’ of solved structures representing different states of a molecule provides quantitative data for flexibility and a number of graphical representations. Secondly, a classification of known motions according to type of conformational change (e.g. ‘hinged domain’ or ‘allosteric’) incorporates textual annotation and information from the literature relating to the motion, linking together many of the morphs. A variety of subsets of the morphs are being developed for use in statistical analyses. In particular, for each subset it is possible to derive distributions of various motional quantities (e.g. maximum rotation) that can be used to place a specific motion in context as being typical or atypical for a given population. Over the past year, the database has been greatly expanded and enhanced to incorporate new structures and to improve the quality of data. The ‘morph server’, which enables users of the database to add new morphs either from their own research or the PDB, has also been enhanced to handle nucleic acid structures and multi-chain complexes.
Based on searches for disabled homologs to known proteins, we have identified a large population of pseudogenes in four sequenced eukaryotic genomes—the worm, yeast, fly and human (chromosomes 21 and 22 only). Each of our nearly 2500 pseudogenes is characterized by one or more disablements mid-domain, such as premature stops and frameshifts. Here, we perform a comprehensive survey of the amino acid and nucleotide composition of these pseudogenes in comparison to that of functional genes and intergenic DNA. We show that pseudogenes invariably have an amino acid composition intermediate between genes and translated intergenic DNA. Although the degree of intermediacy varies among the four organisms, in all cases, it is most evident for amino acid types that differ most in occurrence between genes and intergenic regions. The same intermediacy also applies to codon frequencies, especially in the worm and human. Moreover, the intermediate composition of pseudogenes applies even though the composition of the genes in the four organisms is markedly different, showing a strong correlation with the overall A/T content of the genomic sequence. Pseudogenes can be divided into ‘ancient’ and ‘modern’ subsets, based on the level of sequence identity with their closest matching homolog (within the same genome). Modern pseudogenes usually have a much closer sequence composition to genes than ancient pseudogenes. Collectively, our results indicate that the composition of pseudogenes that are under no selective constraints progressively drifts from that of coding DNA towards non-coding DNA. Therefore, we propose that the degree to which pseudogenes approach a random sequence composition may be useful in dating different sets of pseudogenes, as well as to assess the rate at which intergenic DNA accumulates mutations. Our compositional analyses with the interactive viewer are available over the web at http://genecensus.org/pseudogene.
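The compositional comparison described above amounts to placing each sequence set in amino-acid frequency space and asking which neighbours it lies between. A minimal sketch of that computation (toy code, not the published analysis pipeline):

```python
# Sketch of the compositional intermediacy test: build amino-acid
# frequency vectors for sequence sets and compare distances, e.g.
# pseudogene translations vs. genes vs. translated intergenic DNA.
# Illustrative only; the paper's analysis is considerably more detailed.

def composition(sequences):
    """Amino-acid frequency vector over a set of sequences."""
    counts, total = {}, 0
    for seq in sequences:
        for aa in seq:
            counts[aa] = counts.get(aa, 0) + 1
            total += 1
    return {aa: c / total for aa, c in counts.items()}

def distance(comp_a, comp_b):
    """Euclidean distance between two composition vectors."""
    keys = set(comp_a) | set(comp_b)
    return sum((comp_a.get(k, 0.0) - comp_b.get(k, 0.0)) ** 2
               for k in keys) ** 0.5
```

With gene, pseudogene and intergenic compositions in hand, intermediacy is simply the observation that the pseudogene vector sits closer to the gene vector for recently dead ("modern") pseudogenes and drifts toward the intergenic vector for ancient ones.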
As the number of protein folds is quite limited, a mode of analysis that will be increasingly common in the future, especially with the advent of structural genomics, is to survey and re-survey the finite parts list of folds from an expanding number of perspectives. We have developed a new resource, called PartsList, that lets one dynamically perform these comparative fold surveys. It is available on the web at http://bioinfo.mbb.yale.edu/partslist and http://www.partslist.org. The system is based on the existing fold classifications and functions as a form of companion annotation for them, providing ‘global views’ of many already completed fold surveys. The central idea in the system is that of comparison through ranking; PartsList will rank the approximately 420 folds based on more than 180 attributes. These include: (i) occurrence in a number of completely sequenced genomes (e.g. it will show the most common folds in the worm versus yeast); (ii) occurrence in the structure databank (e.g. most common folds in the PDB); (iii) both absolute and relative gene expression information (e.g. most changing folds in expression over the cell cycle); (iv) protein–protein interactions, based on experimental data in yeast and comprehensive PDB surveys (e.g. most interacting fold); (v) sensitivity to inserted transposons; (vi) the number of functions associated with the fold (e.g. most multi-functional folds); (vii) amino acid composition (e.g. most Cys-rich folds); (viii) protein motions (e.g. most mobile folds); and (ix) the level of similarity based on a comprehensive set of structural alignments (e.g. most structurally variable folds). The integration of whole-genome expression and protein–protein interaction data with structural information is a particularly novel feature of our system. We provide three ways of visualizing the rankings: a profiler emphasizing the progression of high and low ranks across many pre-selected attributes, a dynamic comparer for custom comparisons and a numerical rankings correlator. These allow one to directly compare very different attributes of a fold (e.g. expression level, genome occurrence and maximum motion) in the uniform numerical format of ranks. This uniform framework, in turn, highlights the way that the frequency of many of the attributes falls off with approximate power-law behavior (i.e. according to V^(-b), for attribute value V and constant exponent b), with a few folds having large values and most having small values.
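The ranking framework and the power-law observation can both be reduced to a short sketch. This is toy code over made-up data, not the PartsList implementation: folds are ranked by descending attribute value, and the exponent b of the falloff count(V) ∝ V^(-b) is estimated by least squares in log-log space.

```python
# Illustration of the PartsList ranking idea: sort folds by an attribute
# and estimate the power-law exponent b in count(V) ~ V^(-b) by a
# log-log least-squares fit. Toy example only.
import math

def rank_folds(attribute):
    """Fold names sorted by descending attribute value (rank 1 first)."""
    return sorted(attribute, key=attribute.get, reverse=True)

def powerlaw_exponent(values):
    """Estimate b in count(V) ~ V^(-b): histogram the values, then fit
    log(count) vs log(value) by ordinary least squares."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    pts = [(math.log(v), math.log(c)) for v, c in counts.items()]
    n = len(pts)
    sx = sum(x for x, _ in pts)
    sy = sum(y for _, y in pts)
    sxy = sum(x * y for x, y in pts)
    sxx = sum(x * x for x, _ in pts)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return -slope
```

Ranks put heterogeneous attributes (expression level, genome occurrence, maximum motion) on one numerical footing, which is exactly what makes the cross-attribute comparison and the power-law falloff visible.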
Pseudogenes are non-functioning copies of genes in genomic DNA, which may either result from reverse transcription from an mRNA transcript (processed pseudogenes) or from gene duplication and subsequent disablement (non-processed pseudogenes). As pseudogenes are apparently ‘dead’, they usually have a variety of obvious disablements (e.g., insertions, deletions, frameshifts and truncations) relative to their functioning homologs. We have derived an initial estimate of the size, distribution and characteristics of the pseudogene population in the Caenorhabditis elegans genome, performing a survey in ‘molecular archaeology’. Corresponding to the 18 576 annotated proteins in the worm (i.e., in Wormpep18), we have found an estimated total of 2168 pseudogenes, about one for every eight genes. Few of these appear to be processed. Details of our pseudogene assignments are available from http://bioinfo.mbb.yale.edu/genome/worm/pseudogene. The population of pseudogenes differs significantly from that of genes in a number of respects: (i) pseudogenes are distributed unevenly across the genome relative to genes, with a disproportionate number on chromosome IV; (ii) the density of pseudogenes is higher on the arms of the chromosomes; (iii) the amino acid composition of pseudogenes is midway between that of genes and (translations of) random intergenic DNA, with enrichment of Phe, Ile, Leu and Lys, and depletion of Asp, Ala, Glu and Gly relative to the worm proteome; and (iv) the most common protein folds and families differ somewhat between genes and pseudogenes—whereas the most common fold found in the worm proteome is the immunoglobulin fold and the most common ‘pseudofold’ is the C-type lectin. In addition, the size of a gene family bears little overall relationship to the size of its corresponding pseudogene complement, indicating a highly dynamic genome. There are in fact a number of families associated with large populations of pseudogenes. For example, one family of seven-transmembrane receptors (represented by gene B0334.7) has one pseudogene for every four genes, and another uncharacterized family (represented by gene B0403.1) is approximately two-thirds pseudogenic. Furthermore, over a hundred apparent pseudogenic fragments do not have any obvious homologs in the worm.