
Results 1-22 (22)

1.  Ligand placement based on prior structures: the guided ligand-replacement method 
A new module, Guided Ligand Replacement (GLR), has been developed in Phenix to increase the ease and success rate of ligand placement when prior protein-ligand complexes are available.
The process of iterative structure-based drug design involves the X-ray crystal structure determination of upwards of 100 ligands with the same general scaffold (i.e. chemotype) complexed with very similar, if not identical, protein targets. In conjunction with insights from computational models and assays, this collection of crystal structures is analyzed to improve potency, to achieve better selectivity and to reduce liabilities such as absorption, distribution, metabolism, excretion and toxicology. Current methods for modeling ligands into electron-density maps typically do not utilize information on how similar ligands bound in related structures. Even if the electron density is of sufficient quality and resolution to allow de novo placement, the process can take considerable time as the size, complexity and torsional degrees of freedom of the ligands increase. A new module, Guided Ligand Replacement (GLR), was developed in Phenix to increase the ease and success rate of ligand placement when prior protein–ligand complexes are available. At the heart of GLR is an algorithm based on graph theory that associates atoms in the target ligand with analogous atoms in the reference ligand. Based on this correspondence, a set of coordinates is generated for the target ligand. GLR is especially useful in two situations: (i) modeling a series of large, flexible, complicated or macrocyclic ligands in successive structures and (ii) modeling ligands as part of a refinement pipeline that can automatically select a reference structure. Even in those cases for which no reference structure is available, if there are multiple copies of the bound ligand per asymmetric unit GLR offers an efficient way to complete the model after the first ligand has been placed. In all of these applications, GLR leverages prior knowledge from earlier structures to facilitate ligand placement in the current structure.
PMCID: PMC3919265  PMID: 24419386
ligand placement; guided ligand-replacement method; GLR
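The atom-correspondence idea in the abstract above can be illustrated with a small sketch. This is not the GLR implementation: it is a plain backtracking search for a largest common induced subgraph between two small ligands, swapped in for illustration. The function names (`match_atoms`, `seed_coordinates`) and the representation of a ligand as a list of element symbols plus a list of bonded index pairs are assumptions made for this example.

```python
def _adjacency(n_atoms, bonds):
    """Build a neighbour-set lookup from a list of (i, j) bonded index pairs."""
    adj = {i: set() for i in range(n_atoms)}
    for a, b in bonds:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def match_atoms(tgt_elems, tgt_bonds, ref_elems, ref_bonds):
    """Return a largest mapping {target_atom: reference_atom} in which
    elements agree and bonded/non-bonded relations between mapped atoms
    are the same in both ligands. Exhaustive search: small ligands only."""
    tgt_adj = _adjacency(len(tgt_elems), tgt_bonds)
    ref_adj = _adjacency(len(ref_elems), ref_bonds)
    best = {}

    def compatible(t, r, mapping):
        if tgt_elems[t] != ref_elems[r]:
            return False
        # Every already-mapped pair must agree on whether a bond exists.
        return all((t2 in tgt_adj[t]) == (r2 in ref_adj[r])
                   for t2, r2 in mapping.items())

    def extend(t, mapping, used):
        nonlocal best
        if len(mapping) > len(best):
            best = dict(mapping)
        if t == len(tgt_elems):
            return
        # Prune: the remaining atoms cannot improve on the best mapping.
        if len(mapping) + (len(tgt_elems) - t) <= len(best):
            return
        for r in range(len(ref_elems)):
            if r not in used and compatible(t, r, mapping):
                mapping[t] = r
                used.add(r)
                extend(t + 1, mapping, used)
                del mapping[t]
                used.discard(r)
        extend(t + 1, mapping, used)  # leave target atom t unmatched

    extend(0, {}, set())
    return best


def seed_coordinates(mapping, ref_xyz):
    """Copy reference coordinates onto the matched target atoms only."""
    return {t: ref_xyz[r] for t, r in mapping.items()}


# Hypothetical example: reference C-C-O, target N-C-C-O (one extra atom).
ref_elems, ref_bonds = ["C", "C", "O"], [(0, 1), (1, 2)]
tgt_elems, tgt_bonds = ["N", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)]
ref_xyz = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.2, 1.2, 0.0)]

mapping = match_atoms(tgt_elems, tgt_bonds, ref_elems, ref_bonds)
placed = seed_coordinates(mapping, ref_xyz)
```

In this sketch, target atoms with no counterpart in the reference (the nitrogen here) get no coordinates and would still have to be built and fitted to the density by other means.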
2.  Automating crystallographic structure solution and refinement of protein–ligand complexes 
A software system for automated protein–ligand crystallography has been implemented in the Phenix suite. This significantly reduces the manual effort required in high-throughput crystallographic studies.
High-throughput drug-discovery and mechanistic studies often require the determination of multiple related crystal structures that differ only in the bound ligands, point mutations in the protein sequence and minor conformational changes. If performed manually, solution and refinement require extensive repetition of the same tasks for each structure. To accelerate this process and minimize manual effort, a pipeline encompassing all stages of ligand building and refinement, starting from integrated and scaled diffraction intensities, has been implemented in Phenix. The resulting system is able to successfully solve and refine large collections of structures in parallel without extensive user intervention prior to the final stages of model completion and validation.
PMCID: PMC3919266  PMID: 24419387
protein–ligand complexes; automation; crystallographic structure solution and refinement
3.  Nanoflow electrospinning serial femtosecond crystallography 
A low-flow-rate liquid microjet method for delivery of hydrated protein crystals to X-ray lasers is presented. Linac Coherent Light Source data demonstrate serial femtosecond protein crystallography with micrograms of protein, a reduction of sample consumption by orders of magnitude.
An electrospun liquid microjet has been developed that delivers protein microcrystal suspensions at flow rates of 0.14–3.1 µl min−1 to perform serial femtosecond crystallography (SFX) studies with X-ray lasers. Thermolysin microcrystals flowed at 0.17 µl min−1 and diffracted to beyond 4 Å resolution, producing 14 000 indexable diffraction patterns, or four per second, from 140 µg of protein. Nanoflow electrospinning extends SFX to biological samples that necessitate minimal sample consumption.
PMCID: PMC3478121  PMID: 23090408
serial femtosecond crystallography; nanoflow electrospinning
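The figures quoted in the abstract above can be cross-checked with a short calculation. This is only a sketch: it assumes the jet ran continuously at the stated flow rate for the whole collection and that all 140 µg were consumed in that time, neither of which the abstract states. The function name `sfx_sample_budget` is hypothetical, and "four per second" is treated as exact although it is presumably rounded.

```python
def sfx_sample_budget(n_patterns, patterns_per_s, flow_ul_per_min, protein_ug):
    """Derive run time, suspension volume and protein per indexed pattern
    from the headline numbers, assuming continuous constant-rate flow."""
    seconds = n_patterns / patterns_per_s
    minutes = seconds / 60.0
    volume_ul = flow_ul_per_min * minutes
    ng_per_pattern = protein_ug * 1000.0 / n_patterns
    return {
        "run_time_s": seconds,
        "run_time_min": minutes,
        "volume_ul": volume_ul,
        "ng_per_pattern": ng_per_pattern,
    }


# Numbers taken from the abstract: 14 000 patterns, ~4 per second,
# 0.17 µl/min flow, 140 µg of thermolysin.
budget = sfx_sample_budget(14_000, 4.0, 0.17, 140.0)
```

Under these assumptions the run lasts about 58 minutes, uses roughly 10 µl of suspension, and consumes about 10 ng of protein per indexable pattern.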
4.  Phaser.MRage: automated molecular replacement 
The functionality of the molecular-replacement pipeline phaser.MRage is introduced and illustrated with examples.
Phaser.MRage is a molecular-replacement automation framework that implements a full model-generation workflow and provides several layers of model exploration to the user. It is designed to handle a large number of models and can distribute calculations efficiently onto parallel hardware. In addition, phaser.MRage can identify correct solutions and use this information to accelerate the search. Firstly, it can quickly score all alternative models of a component once a correct solution has been found. Secondly, it can perform extensive analysis of identified solutions to find protein assemblies and can employ assembled models for subsequent searches. Thirdly, it is able to use a priori assembly information (derived from, for example, homologues) to speculatively place and score molecules, thereby customizing the search procedure to a certain class of protein molecule (for example, antibodies) and incorporating additional biological information into molecular replacement.
PMCID: PMC3817702  PMID: 24189240
molecular replacement; pipeline; automation; phaser.MRage
5.  Simultaneous Femtosecond X-ray Spectroscopy and Diffraction of Photosystem II at Room Temperature 
Science (New York, N.Y.)  2013;340(6131):491-495.
Intense femtosecond X-ray pulses produced at the Linac Coherent Light Source (LCLS) were used for simultaneous X-ray diffraction (XRD) and X-ray emission spectroscopy (XES) of microcrystals of Photosystem II (PS II) at room temperature. This method probes the overall protein structure and the electronic structure of the Mn4CaO5 cluster in the oxygen-evolving complex of PS II. XRD data are presented from both the dark state (S1) and the first illuminated state (S2) of PS II. Our simultaneous XRD/XES study shows that the PS II crystals are intact during our measurements at the LCLS, not only with respect to the structure of PS II, but also with regard to the electronic structure of the highly radiation sensitive Mn4CaO5 cluster, opening new directions for future dynamics studies.
PMCID: PMC3732582  PMID: 23413188
6.  New Python-based methods for data processing 
The Computational Crystallography Toolbox (cctbx) is a flexible software platform that has been used to develop high-throughput crystal-screening tools for both synchrotron sources and X-ray free-electron lasers. Plans for data-processing and visualization applications are discussed, and the benefits and limitations of using graphics-processing units are evaluated.
Current pixel-array detectors produce diffraction images at extreme data rates (of up to 2 TB h−1) that make severe demands on computational resources. New multiprocessing frameworks are required to achieve rapid data analysis, as it is important to be able to inspect the data quickly in order to guide the experiment in real time. By utilizing readily available web-serving tools that interact with the Python scripting language, it was possible to implement a high-throughput Bragg-spot analyzer (cctbx.spotfinder) that is presently in use at numerous synchrotron-radiation beamlines. Similarly, Python interoperability enabled the production of a new data-reduction package (cctbx.xfel) for serial femtosecond crystallography experiments at the Linac Coherent Light Source (LCLS). Future data-reduction efforts will need to focus on specialized problems such as the treatment of diffraction spots on interleaved lattices arising from multi-crystal specimens. In these challenging cases, accurate modeling of close-lying Bragg spots could benefit from the high-performance computing capabilities of graphics-processing units.
PMCID: PMC3689530  PMID: 23793153
data processing; reusable code; multiprocessing; cctbx
7.  The Phenix Software for Automated Determination of Macromolecular Structures 
Methods (San Diego, Calif.)  2011;55(1):94-106.
X-ray crystallography is a critical tool in the study of biological systems. It is able to provide information that has been a prerequisite to understanding the fundamentals of life. It is also a method that is central to the development of new therapeutics for human disease. Significant time and effort are required to determine and optimize many macromolecular structures because of the need for manual interpretation of complex numerical data, often using many different software packages, and the repeated use of interactive three-dimensional graphics. The Phenix software package has been developed to provide a comprehensive system for macromolecular crystallographic structure solution with an emphasis on automation. This has required the development of new algorithms that minimize or eliminate subjective input in favour of built-in expert-systems knowledge, the automation of procedures that are traditionally performed by hand, and the development of a computational framework that allows a tight integration between the algorithms. The application of automated methods is particularly appropriate in the field of structural proteomics, where high throughput is desired. Features in Phenix for the automation of experimental phasing with subsequent model building, molecular replacement, structure refinement and validation are described and examples given of running Phenix from both the command line and graphical user interface.
PMCID: PMC3193589  PMID: 21821126
Macromolecular Crystallography; Automation; Phenix; X-ray; Diffraction; Python
8.  Graphical tools for macromolecular crystallography in PHENIX  
Journal of Applied Crystallography  2012;45(Pt 3):581-586.
The foundations and current features of a widely used graphical user interface for macromolecular crystallography are described.
A new Python-based graphical user interface for the PHENIX suite of crystallography software is described. This interface unifies the command-line programs and their graphical displays, simplifying the development of new interfaces and avoiding duplication of function. With careful design, graphical interfaces can be displayed automatically, instead of being manually constructed. The resulting package is easily maintained and extended as new programs are added or modified.
PMCID: PMC3359726  PMID: 22675231
macromolecular crystallography; graphical user interfaces; PHENIX
9.  Towards automated crystallographic structure refinement with phenix.refine  
phenix.refine is a program within the PHENIX package that supports crystallographic structure refinement against experimental data with a wide range of upper resolution limits using a large repertoire of model parameterizations. This paper presents an overview of the major phenix.refine features, with extensive literature references for readers interested in more detailed discussions of the methods.
phenix.refine is a program within the PHENIX package that supports crystallographic structure refinement against experimental data with a wide range of upper resolution limits using a large repertoire of model parameterizations. It has several automation features and is also highly flexible. Several hundred parameters enable extensive customizations for complex use cases. Multiple user-defined refinement strategies can be applied to specific parts of the model in a single refinement run. An intuitive graphical user interface is available to guide novice users and to assist advanced users in managing refinement projects. X-ray or neutron diffraction data can be used separately or jointly in refinement. phenix.refine is tightly integrated into the PHENIX suite, where it serves as a critical component in automated model building, final structure refinement, structure validation and deposition to the wwPDB. This paper presents an overview of the major phenix.refine features, with extensive literature references for readers interested in more detailed discussions of the methods.
PMCID: PMC3322595  PMID: 22505256
structure refinement; PHENIX; joint X-ray/neutron refinement; maximum likelihood; TLS; simulated annealing; subatomic resolution; real-space refinement; twinning; NCS
10.  Use of knowledge-based restraints in phenix.refine to improve macromolecular refinement at low resolution 
Recent developments in PHENIX are reported that allow the use of reference-model torsion restraints, secondary-structure hydrogen-bond restraints and Ramachandran restraints for improved macromolecular refinement in phenix.refine at low resolution.
Traditional methods for macromolecular refinement often have limited success at low resolution (3.0–3.5 Å or worse), producing models that score poorly on crystallographic and geometric validation criteria. To improve low-resolution refinement, knowledge from macromolecular chemistry and homology was used to add three new coordinate-restraint functions to the refinement program phenix.refine. Firstly, a ‘reference-model’ method uses an identical or homologous higher resolution model to add restraints on torsion angles to the geometric target function. Secondly, automatic restraints for common secondary-structure elements in proteins and nucleic acids were implemented that can help to preserve the secondary-structure geometry, which is often distorted at low resolution. Lastly, we have implemented Ramachandran-based restraints on the backbone torsion angles. In this method, a ϕ,ψ term is added to the geometric target function to minimize a modified Ramachandran landscape that smoothly combines favorable peaks identified from nonredundant high-quality data with unfavorable peaks calculated using a clash-based pseudo-energy function. All three methods show improved MolProbity validation statistics, typically complemented by a lowered R free and a decreased gap between R work and R free.
PMCID: PMC3322597  PMID: 22505258
macromolecular crystallography; low resolution; refinement; automation
11.  phenix.mr_rosetta: molecular replacement and model rebuilding with Phenix and Rosetta 
The combination of algorithms from the structure-modeling field with those of crystallographic structure determination can broaden the range of templates that are useful for structure determination by the method of molecular replacement. Automated tools in phenix.mr_rosetta simplify the application of these combined approaches by integrating Phenix crystallographic algorithms and Rosetta structure-modeling algorithms and by systematically generating and evaluating models with a combination of these methods. The phenix.mr_rosetta algorithms can be used to automatically determine challenging structures. The approaches used in phenix.mr_rosetta are described along with examples that show roles that structure-modeling can play in molecular replacement.
PMCID: PMC3375004  PMID: 22418934
Molecular replacement; Automation; Macromolecular crystallography; Rosetta; Phenix
12.  Macro-to-Micro Structural Proteomics: Native Source Proteins for High-Throughput Crystallization 
PLoS ONE  2012;7(2):e32498.
Structural biology and structural genomics projects routinely rely on recombinantly expressed proteins, but many proteins and complexes are difficult to obtain by this approach. We investigated native source proteins for high-throughput protein crystallography applications. The Escherichia coli proteome was fractionated, purified, crystallized, and structurally characterized. Macro-scale fermentation and fractionation were used to subdivide the soluble proteome into 408 unique fractions of which 295 fractions yielded crystals in microfluidic crystallization chips. Of the 295 crystals, 152 were selected for optimization, diffraction screening, and data collection. Twenty-three structures were determined, four of which were novel. This study demonstrates the utility of native source proteins for high-throughput crystallography.
PMCID: PMC3290569  PMID: 22393408
13.  Allosteric activation mechanism of the Mycobacterium tuberculosis receptor Ser/Thr protein kinase, PknB 
Structure (London, England : 1993)  2010;18(12):1667-1677.
The essential Mycobacterium tuberculosis Ser/Thr protein kinase (STPK), PknB, plays a key role in regulating growth and division, but the structural basis of activation has not been defined. Here we provide biochemical and structural evidence that dimerization through the kinase-domain (KD) N-lobe activates PknB by an allosteric mechanism. Promoting KD pairing using a small-molecule dimerizer stimulates the unphosphorylated kinase, and substitutions that disrupt N-lobe pairing decrease phosphorylation activity in vitro and in vivo. Multiple crystal structures of two monomeric PknB KD mutants in complex with nucleotide reveal diverse inactive conformations that contain large active-site distortions that propagate >30 Å from the mutation site. These results define flexible, inactive structures of a monomeric bacterial receptor KD and show how “back-to-back” N-lobe dimerization stabilizes the active KD conformation. This general mechanism of bacterial receptor STPK activation affords insights into the regulation of homologous eukaryotic kinases that form structurally similar dimers.
PMCID: PMC3181147  PMID: 21134645
14.  PHENIX: a comprehensive Python-based system for macromolecular structure solution 
The PHENIX software for macromolecular structure determination is described.
Macromolecular X-ray crystallography is routinely applied to understand biological processes at a molecular level. However, significant time and effort are still required to solve and complete many of these structures because of the need for manual interpretation of complex numerical data using many software packages and the repeated use of interactive three-dimensional graphics. PHENIX has been developed to provide a comprehensive system for macromolecular crystallographic structure solution with an emphasis on the automation of all procedures. This has relied on the development of algorithms that minimize or eliminate subjective input, the development of algorithms that automate procedures that are traditionally performed by hand and, finally, the development of a framework that allows a tight integration between the algorithms.
PMCID: PMC2815670  PMID: 20124702
PHENIX; Python; macromolecular crystallography; algorithms
15.  An atypical receiver domain controls the dynamic polar localization of the Myxococcus xanthus social motility protein FrzS 
Molecular Microbiology  2007;65(2):319-332.
The Myxococcus xanthus FrzS protein transits from pole-to-pole within the cell, accumulating at the pole that defines the direction of movement in social (S) motility. Here we show using atomic-resolution crystallography and NMR that the FrzS receiver domain (RD) displays the conserved switch Tyr102 in an unusual conformation, lacks the conserved Asp phosphorylation site, and fails to bind Mg2+ or the phosphoryl analogue, Mg2+·BeF3. Mutation of Asp55, closest to the canonical site of RD phosphorylation, showed no motility phenotype in vivo, demonstrating that phosphorylation at this site is not necessary for domain function. In contrast, the Tyr102Ala and His92Phe substitutions on the canonical output face of the FrzS RD abolished S-motility in vivo. Single-cell fluorescence microscopy measurements revealed a striking mislocalization of these mutant FrzS proteins to the trailing cell pole in vivo. The crystal structures of the mutants suggested that the observed conformation of Tyr102 in the wild-type FrzS RD is not sufficient for function. These results support the model that FrzS contains a novel ‘pseudo-receiver domain’ whose function requires recognition of the RD output face but not Asp phosphorylation.
PMCID: PMC1974792  PMID: 17573816
16.  The Database of Macromolecular Motions: new features added at the decade mark 
Nucleic Acids Research  2005;34(Database issue):D296-D301.
The database of molecular motions, MolMovDB, has been in existence for the past decade. It classifies macromolecular motions and provides tools to interpolate between two conformations (the Morph Server) and predict possible motions in a single structure. In 2005, we expanded the services offered on MolMovDB. In particular, we further developed the Morph Server to produce improved interpolations between two submitted structures. We added support for multiple chains to the original adiabatic mapping interpolation, allowing the analysis of subunit motions. We also added the option of using FRODA interpolation, which allows for more complex pathways, potentially overcoming steric barriers. We added an interface to a hinge prediction service, which acts on single structures and predicts likely residue points for flexibility. We developed tools to relate such points of flexibility in a structure to particular key residue positions, i.e. active sites or highly conserved positions. Lastly, we began relating our motion classification scheme to function using descriptions from the Gene Ontology Consortium.
PMCID: PMC1347409  PMID: 16381870
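As a rough illustration of what interpolating between two conformations means, the sketch below generates intermediate frames by straight-line Cartesian interpolation. This is deliberately the simplest possible scheme and is not the adiabatic mapping or FRODA interpolation described in the abstract above: it does no energy minimization and can pass atoms through one another. The function name `linear_morph` and the coordinate layout (one `(x, y, z)` tuple per atom, atoms in the same order in both structures) are assumptions for this example.

```python
def linear_morph(start, end, n_frames):
    """Return n_frames coordinate sets running from start to end inclusive,
    moving every atom along a straight line between its two positions."""
    if len(start) != len(end):
        raise ValueError("both conformations must list the same atoms")
    if n_frames < 2:
        raise ValueError("need at least the two end-point frames")
    frames = []
    for k in range(n_frames):
        t = k / (n_frames - 1)  # interpolation fraction, 0.0 to 1.0
        frames.append([
            tuple(a + t * (b - a) for a, b in zip(p, q))
            for p, q in zip(start, end)
        ])
    return frames


# Hypothetical two-atom example with three frames (start, midpoint, end).
conf_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
conf_b = [(2.0, 4.0, 6.0), (1.0, 2.0, 0.0)]
frames = linear_morph(conf_a, conf_b, 3)
```

A realistic morph would first superpose the two structures and then relax each intermediate frame, which is the kind of refinement the interpolation methods named in the abstract address.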
17.  ExpressYourself: a modular platform for processing and visualizing microarray data 
Nucleic Acids Research  2003;31(13):3477-3482.
DNA microarrays are widely used in biological research; by analyzing differential hybridization on a single microarray slide, one can detect changes in mRNA expression levels, increases in DNA copy numbers and the location of transcription factor binding sites on a genomic scale. Having performed the experiments, the major challenge is to process large, noisy datasets in order to identify the specific array elements that are significantly differentially hybridized. This normally requires aggregating different, often incompatible programs into a multi-step pipeline. Here we present ExpressYourself, a fully integrated platform for processing microarray data. In completely automated fashion, it will correct the background array signal, normalize the Cy5 and Cy3 signals, score levels of differential hybridization, combine the results of replicate experiments, filter problematic regions of the array and assess the quality of individual and replicate experiments. ExpressYourself is designed with a highly modular architecture so various types of microarray analysis algorithms can readily be incorporated as they are developed; for example, the system currently implements several normalization methods, including those that simultaneously consider signal intensity and slide location. The processed data are presented using a web-based graphical interface to facilitate comparison with the original images of the array slides. In particular, ExpressYourself is able to regenerate images of the original microarray after applying various steps of processing, which greatly facilitates identification of position-specific artifacts. The program is freely available for use at
PMCID: PMC169034  PMID: 12824348
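The normalization step mentioned in the abstract above can be illustrated with a minimal sketch. It uses global median centering of log2(Cy5/Cy3) ratios, a common baseline chosen here for simplicity. The abstract does not say which methods ExpressYourself implements beyond noting that some consider signal intensity and slide location, so this should not be read as the program's own procedure. The function name `median_normalized_log_ratios` is hypothetical, and the inputs are assumed to be background-corrected, strictly positive intensities.

```python
import math
import statistics


def median_normalized_log_ratios(cy5, cy3):
    """Return log2(Cy5/Cy3) per spot, shifted so the median ratio is zero.
    Assumes background-corrected, strictly positive intensities."""
    if len(cy5) != len(cy3):
        raise ValueError("need one Cy5 and one Cy3 value per spot")
    log_ratios = [math.log2(r / g) for r, g in zip(cy5, cy3)]
    center = statistics.median(log_ratios)
    return [m - center for m in log_ratios]


# Hypothetical three-spot slide where the Cy5 channel is globally brighter.
cy5 = [200.0, 400.0, 800.0]
cy3 = [100.0, 100.0, 100.0]
normalized = median_normalized_log_ratios(cy5, cy3)
```

Median centering removes a uniform dye bias across the slide but, unlike the intensity- and location-aware methods the abstract refers to, it cannot correct biases that vary with signal strength or position.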
18.  SPINE 2: a system for collaborative structural proteomics within a federated database framework 
Nucleic Acids Research  2003;31(11):2833-2838.
We present version 2 of the SPINE system for structural proteomics. SPINE is available over the web at It serves as the central hub for the Northeast Structural Genomics Consortium, allowing collaborative structural proteomics to be carried out in a distributed fashion. The core of SPINE is a laboratory information management system (LIMS) for key bits of information related to the progress of the consortium in cloning, expressing and purifying proteins and then solving their structures by NMR or X-ray crystallography. Originally, SPINE focused on tracking constructs, but, in its current form, it is able to track target sample tubes and store detailed sample histories. The core database comprises a set of standard relational tables and a data dictionary that form an initial ontology for proteomic properties and provide a framework for large-scale data mining. Moreover, SPINE sits at the center of a federation of interoperable information resources. These can be divided into (i) local resources closely coupled with SPINE that enable it to handle less standardized information (e.g. integrated mailing and publication lists), (ii) other information resources in the NESG consortium that are inter-linked with SPINE (e.g. crystallization LIMS local to particular laboratories) and (iii) international archival resources that SPINE links to and passes on information to (e.g. TargetDB at the PDB).
PMCID: PMC156730  PMID: 12771210
19.  MolMovDB: analysis and visualization of conformational change and structural flexibility 
Nucleic Acids Research  2003;31(1):478-482.
The Database of Macromolecular Movements is a collection of data and software pertaining to flexibility in protein and RNA structures. The database is organized into two parts. Firstly, a collection of ‘morphs’ of solved structures representing different states of a molecule provides quantitative data for flexibility and a number of graphical representations. Secondly, a classification of known motions according to type of conformational change (e.g. ‘hinged domain’ or ‘allosteric’) incorporates textual annotation and information from the literature relating to the motion, linking together many of the morphs. A variety of subsets of the morphs are being developed for use in statistical analyses. In particular, for each subset it is possible to derive distributions of various motional quantities (e.g. maximum rotation) that can be used to place a specific motion in context as being typical or atypical for a given population. Over the past year, the database has been greatly expanded and enhanced to incorporate new structures and to improve the quality of data. The ‘morph server’, which enables users of the database to add new morphs either from their own research or the PDB, has also been enhanced to handle nucleic acid structures and multi-chain complexes.
PMCID: PMC165551  PMID: 12520056
20.  Comprehensive analysis of amino acid and nucleotide composition in eukaryotic genomes, comparing genes and pseudogenes 
Nucleic Acids Research  2002;30(11):2515-2523.
Based on searches for disabled homologs to known proteins, we have identified a large population of pseudogenes in four sequenced eukaryotic genomes—the worm, yeast, fly and human (chromosomes 21 and 22 only). Each of our nearly 2500 pseudogenes is characterized by one or more disablements mid-domain, such as premature stops and frameshifts. Here, we perform a comprehensive survey of the amino acid and nucleotide composition of these pseudogenes in comparison to that of functional genes and intergenic DNA. We show that pseudogenes invariably have an amino acid composition intermediate between genes and translated intergenic DNA. Although the degree of intermediacy varies among the four organisms, in all cases, it is most evident for amino acid types that differ most in occurrence between genes and intergenic regions. The same intermediacy also applies to codon frequencies, especially in the worm and human. Moreover, the intermediate composition of pseudogenes applies even though the composition of the genes in the four organisms is markedly different, showing a strong correlation with the overall A/T content of the genomic sequence. Pseudogenes can be divided into ‘ancient’ and ‘modern’ subsets, based on the level of sequence identity with their closest matching homolog (within the same genome). Modern pseudogenes usually have a much closer sequence composition to genes than ancient pseudogenes. Collectively, our results indicate that the composition of pseudogenes that are under no selective constraints progressively drifts from that of coding DNA towards non-coding DNA. Therefore, we propose that the degree to which pseudogenes approach a random sequence composition may be useful in dating different sets of pseudogenes, as well as to assess the rate at which intergenic DNA accumulates mutations. Our compositional analyses with the interactive viewer are available over the web at
PMCID: PMC117176  PMID: 12034841
21.  PartsList: a web-based system for dynamically ranking protein folds based on disparate attributes, including whole-genome expression and interaction information 
Nucleic Acids Research  2001;29(8):1750-1764.
As the number of protein folds is quite limited, a mode of analysis that will be increasingly common in the future, especially with the advent of structural genomics, is to survey and re-survey the finite parts list of folds from an expanding number of perspectives. We have developed a new resource, called PartsList, that lets one dynamically perform these comparative fold surveys. It is available on the web at and The system is based on the existing fold classifications and functions as a form of companion annotation for them, providing ‘global views’ of many already completed fold surveys. The central idea in the system is that of comparison through ranking; PartsList will rank the approximately 420 folds based on more than 180 attributes. These include: (i) occurrence in a number of completely sequenced genomes (e.g. it will show the most common folds in the worm versus yeast); (ii) occurrence in the structure databank (e.g. most common folds in the PDB); (iii) both absolute and relative gene expression information (e.g. most changing folds in expression over the cell cycle); (iv) protein–protein interactions, based on experimental data in yeast and comprehensive PDB surveys (e.g. most interacting fold); (v) sensitivity to inserted transposons; (vi) the number of functions associated with the fold (e.g. most multi-functional folds); (vii) amino acid composition (e.g. most Cys-rich folds); (viii) protein motions (e.g. most mobile folds); and (ix) the level of similarity based on a comprehensive set of structural alignments (e.g. most structurally variable folds). The integration of whole-genome expression and protein–protein interaction data with structural information is a particularly novel feature of our system. We provide three ways of visualizing the rankings: a profiler emphasizing the progression of high and low ranks across many pre-selected attributes, a dynamic comparer for custom comparisons and a numerical rankings correlator. 
These allow one to directly compare very different attributes of a fold (e.g. expression level, genome occurrence and maximum motion) in the uniform numerical format of ranks. This uniform framework, in turn, highlights the way that the frequency of many of the attributes falls off with approximate power-law behavior (i.e. according to $V^{-b}$, for attribute value V and constant exponent b), with a few folds having large values and most having small values.
PMCID: PMC31319  PMID: 11292848
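The power-law fall-off described in the abstract above can be made concrete with a short sketch that estimates the exponent b from (value, frequency) pairs. The fit is an ordinary least-squares line through the log-log points, a simple and commonly used estimator that is assumed here for illustration. The abstract does not say how the authors estimated b, and the function name `fit_power_law_exponent` and the sample data are hypothetical.

```python
import math


def fit_power_law_exponent(values, frequencies):
    """Estimate b in frequency ~ C * V**(-b) by least squares on
    log(frequency) against log(V). All inputs must be positive."""
    xs = [math.log(v) for v in values]
    ys = [math.log(f) for f in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return -slope  # the log-log slope is -b


# Synthetic data generated with b = 1.5, used only to exercise the fit.
attribute_values = [1.0, 2.0, 4.0, 8.0, 16.0]
fold_counts = [100.0 * v ** -1.5 for v in attribute_values]
b_estimate = fit_power_law_exponent(attribute_values, fold_counts)
```

On real rank data the log-log points scatter around the line, so the fitted exponent is only an approximate summary, in keeping with the "approximate power-law behavior" wording of the abstract.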
22.  Digging for dead genes: an analysis of the characteristics of the pseudogene population in the Caenorhabditis elegans genome 
Nucleic Acids Research  2001;29(3):818-830.
Pseudogenes are non-functioning copies of genes in genomic DNA, which may result either from reverse transcription from an mRNA transcript (processed pseudogenes) or from gene duplication and subsequent disablement (non-processed pseudogenes). As pseudogenes are apparently ‘dead’, they usually have a variety of obvious disablements (e.g., insertions, deletions, frameshifts and truncations) relative to their functioning homologs. We have derived an initial estimate of the size, distribution and characteristics of the pseudogene population in the Caenorhabditis elegans genome, performing a survey in ‘molecular archaeology’. Corresponding to the 18 576 annotated proteins in the worm (i.e., in Wormpep18), we have found an estimated total of 2168 pseudogenes, about one for every eight genes. Few of these appear to be processed. Details of our pseudogene assignments are available from The population of pseudogenes differs significantly from that of genes in a number of respects: (i) pseudogenes are distributed unevenly across the genome relative to genes, with a disproportionate number on chromosome IV; (ii) the density of pseudogenes is higher on the arms of the chromosomes; (iii) the amino acid composition of pseudogenes is midway between that of genes and (translations of) random intergenic DNA, with enrichment of Phe, Ile, Leu and Lys, and depletion of Asp, Ala, Glu and Gly relative to the worm proteome; and (iv) the most common protein folds and families differ somewhat between genes and pseudogenes: whereas the most common fold found in the worm proteome is the immunoglobulin fold, the most common ‘pseudofold’ is the C-type lectin. In addition, the size of a gene family bears little overall relationship to the size of its corresponding pseudogene complement, indicating a highly dynamic genome. There are in fact a number of families associated with large populations of pseudogenes.
For example, one family of seven-transmembrane receptors (represented by gene B0334.7) has one pseudogene for every four genes, and another uncharacterized family (represented by gene B0403.1) is approximately two-thirds pseudogenic. Furthermore, over a hundred apparent pseudogenic fragments do not have any obvious homologs in the worm.
PMCID: PMC30377  PMID: 11160906
