Mallott, Jacob | Kwan, Antonia | Church, Joseph | Gonzalez-Espinosa, Diana | Lorey, Fred | Tang, Ling Fung | Sunderam, Uma | Rana, Sadhna | Srinivasan, Rajgopal | Brenner, Steven E. | Puck, Jennifer
Purpose
Severe combined immunodeficiency (SCID) is characterized by failure of T lymphocyte development and absent or very low T cell receptor excision circles (TRECs), DNA byproducts of T cell maturation. Newborn screening for TRECs to identify SCID is now performed in several states using PCR of DNA from universally collected dried blood spots (DBS). In addition to infants with typical SCID, TREC screening identifies infants with T lymphocytopenia who appear healthy and in whom a SCID diagnosis cannot be confirmed. Deep sequencing was employed to find causes of T lymphocytopenia in such infants.
Methods
Whole exome sequencing and analysis were performed in infants and their parents. Upon finding deleterious mutations in the ataxia telangiectasia mutated (ATM) gene, we confirmed the diagnosis of ataxia telangiectasia (AT) in two infants and then tested archival newborn DBS of additional AT patients for TREC copy number.
Results
Exome sequencing and analysis led to 2 unsuspected gene diagnoses of AT. Of 13 older AT patients for whom newborn DBS had been stored, 7 samples tested positive for SCID under the criteria of California’s newborn screening program. AT children with low neonatal TRECs were subsequently found to have low CD4 T cell counts (R = 0.64).
Conclusions
T lymphocytopenia in newborns can be a feature of AT, as revealed by TREC screening and exome sequencing. Although there is no current cure for the progressive neurological impairment of AT, early detection permits avoidance of infectious complications, while providing information for families regarding reproductive recurrence risks and increased cancer risks in patients and carriers.
doi:10.1007/s10875-012-9846-1
PMCID: PMC3591536
PMID: 23264026
Ataxia telangiectasia; SCID; newborn screening; TREC; whole exome sequencing
Microbial community profiling using 16S rRNA gene sequences requires accurate taxonomy assignments. ‘Universal’ primers target conserved sequences and amplify sequences from many taxa, but they provide variable coverage of different environments, and regions of the rRNA gene differ in taxonomic informativeness, especially when high-throughput short-read sequencing technologies (for example, 454 and Illumina) are used. We introduce a new evaluation procedure that provides an improved measure of expected taxonomic precision when classifying environmental sequence reads from a given primer. Applying this measure to thousands of combinations of primers and read lengths, simulating single-ended and paired-end sequencing, reveals that these choices greatly affect taxonomic informativeness. The most informative sequence region may differ by environment, partly due to variable coverage of different environments in reference databases. Using our Rtax method of classifying paired-end reads, we found that paired-end sequencing provides substantial benefit in some environments including human gut, but not in others. Optimal primer choice for short reads totaling 96 nt provides 82–100% of the confident genus classifications available from longer reads.
doi:10.1038/ismej.2011.208
PMCID: PMC3379642
PMID: 22237546
16S ribosomal RNA; taxonomy; phylogeny; classification; bacteria; sequencing
doi:10.1371/journal.pcbi.0010004
PMCID: PMC1183510
PMID: 16103905
Biological macromolecules can adopt multiple conformational and compositional states due to structural flexibility and alternative subunit assemblies. This structural heterogeneity poses a major challenge in the study of macromolecular structure using single particle electron microscopy. We propose a fully automated, unsupervised method for the three-dimensional reconstruction of multiple structural models from heterogeneous data. As a starting reference, our method employs an initial structure that does not account for any heterogeneity. Then, a multi-stage clustering is used to create multiple models representative of the heterogeneity within the sample. The multi-stage clustering combines an existing approach based on Multivariate Statistical Analysis to perform clustering within individual Euler angles, and a newly developed approach to sort out class-averages from individual Euler angles into homogeneous groups. Structural models are computed from individual clusters. The whole data classification is further refined using an iterative multi-model projection matching approach. We tested our method on one synthetic and three distinct experimental datasets. The tests include the cases where a macromolecular complex exhibits structural flexibility and cases where a molecule is found in ligand-bound and unbound states. We propose the use of our approach as an efficient way to reconstruct distinct multiple models from heterogeneous data.
doi:10.1016/j.jsb.2010.01.007
PMCID: PMC2841227
PMID: 20085819
Heterogeneous reconstruction; heterogeneous data; multi-model reconstruction
Splicing factor 1 (SF1) binds to the branch point sequence (BPS) of mammalian introns and is believed to be important for the splicing of some, but not all, introns. To help identify BPSs, particularly those that depend on SF1, we generated a BPS profile model in which SF1 binding affinity data, validated by branch point mapping, were iteratively incorporated into computational models. We searched a data set of 117 499 human introns for best matches to the SF1 Affinity Model above a threshold, and counted the number of matches at each intronic position. After subtracting a background value, we found that 87.9% of remaining high-scoring matches identified were located in a region upstream of 3′-splice sites where BPSs are typically found. Since U2AF65 recognizes the polypyrimidine tract (PPT) and forms a cooperative RNA complex with SF1, we combined the SF1 model with a PPT model computed from high affinity binding sequences for U2AF65. The combined model, together with binding site location constraints, accurately identified introns bound by SF1 that are candidates for SF1-dependent splicing.
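The search described above (score each intron against a model, keep best matches above a threshold, and tally them by intronic position) can be sketched as follows. This is a simplified illustration only: the scoring matrix `TOY_MODEL`, the function names, and the choice to measure position from the 3′ end of the intron are assumptions, and the background subtraction step is omitted. It is not the published SF1 Affinity Model. Sequences are assumed to use the A, C, G, U alphabet.

```python
from collections import Counter

# Toy per-position scores for a 3-nt motif; invented for illustration only.
TOY_MODEL = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "U": -1.0},
    {"A": -1.0, "C": 1.0, "G": -1.0, "U": -1.0},
    {"A": 1.0, "C": -1.0, "G": -1.0, "U": -1.0},
]


def best_match(intron, model):
    """Return (score, start) of the highest-scoring window, or None if too short."""
    width = len(model)
    best = None
    for start in range(len(intron) - width + 1):
        window = intron[start:start + width]
        score = sum(column[base] for column, base in zip(model, window))
        if best is None or score > best[0]:
            best = (score, start)
    return best


def position_histogram(introns, model, threshold):
    """Count best matches above threshold by distance from the intron's 3' end."""
    counts = Counter()
    for intron in introns:
        hit = best_match(intron, model)
        if hit is not None and hit[0] > threshold:
            counts[len(intron) - hit[1]] += 1
    return counts
```

A peak in the resulting histogram at short distances from the 3′ end would correspond to the upstream region where BPSs are typically found.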
doi:10.1093/nar/gkq1046
PMCID: PMC3064769
PMID: 21071404
It is now easier to discover thousands of protein sequences in a new microbial genome than it is to biochemically characterize the specific activity of a single protein of unknown function. The molecular functions of protein sequences have typically been predicted using homology-based computational methods, which rely on the principle that homologous proteins share a similar function. However, some protein families include groups of proteins with different molecular functions. A phylogenetic approach for predicting molecular function (sometimes called “phylogenomics”) is an effective means to predict protein molecular function. These methods incorporate functional evidence from all members of a family that have functional characterizations using the evolutionary history of the protein family to make robust predictions for the uncharacterized proteins. However, they are often difficult to apply on a genome-wide scale because of the time-consuming step of reconstructing the phylogenies of each protein to be annotated. Our automated approach for function annotation using phylogeny, the SIFTER (Statistical Inference of Function Through Evolutionary Relationships) methodology, uses a statistical graphical model to compute the probabilities of molecular functions for unannotated proteins. Our benchmark tests showed that SIFTER provides accurate functional predictions on various protein families, outperforming other available methods.
doi:10.1088/1742-6596/180/1/012024
PMCID: PMC2909777
PMID: 20664722
Generation of cDNA using random hexamer priming induces biases in the nucleotide composition at the beginning of transcriptome sequencing reads from the Illumina Genome Analyzer. The bias is independent of organism and laboratory and impacts the uniformity of the reads along the transcriptome. We provide a read count reweighting scheme, based on the nucleotide frequencies of the reads, that mitigates the impact of the bias.
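One way a reweighting of this kind could work is sketched below. The assumption here is that the interior of a read is unaffected by priming bias, so each read is weighted by the ratio of how often its leading k-mer occurs in the interior of reads to how often it occurs at the start. The function names, `k=3`, and `interior_start=10` are illustrative choices, not the published parameters.

```python
from collections import Counter


def kmer_weights(reads, k=3, interior_start=10):
    """Weight each leading k-mer by (interior frequency) / (start frequency)."""
    usable = [r for r in reads if len(r) >= interior_start + k]
    start = Counter(r[:k] for r in usable)
    interior = Counter(r[interior_start:interior_start + k] for r in usable)
    total = len(usable)
    weights = {}
    for kmer, start_count in start.items():
        p_start = start_count / total
        p_interior = interior.get(kmer, 0) / total
        weights[kmer] = p_interior / p_start
    return weights


def reweighted_count(reads, weights, k=3):
    """Sum of per-read weights; k-mers without a weight count as 1.0."""
    return sum(weights.get(r[:k], 1.0) for r in reads)
```

Reads whose leading k-mer is over-represented at the start receive a weight below 1, and under-represented ones a weight above 1, so counts along a transcript become more uniform.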
doi:10.1093/nar/gkq224
PMCID: PMC2896536
PMID: 20395217
Motivation: Rapid methods for protein structure search enable biological discoveries based on flexibly defined structural similarity, unleashing the power of the ever greater number of solved protein structures. Projection methods show promise for the development of fast structural database search solutions. Projection methods map a structure to a point in a high-dimensional space and compare two structures by measuring distance between their projected points. These methods offer a tremendous increase in speed over residue-level structural alignment methods. However, current projection methods are not practical, partly because they are unable to identify local similarities.
Results: We propose a new projection-based approach that can rapidly detect global as well as local structural similarities. Local structural search is enabled by a topology-inspired writhe decomposition protocol that produces a small number of fragments while ensuring that similar structures are cut in a similar manner. In benchmark tests, we show that our method, writher, improves accuracy over existing projection methods in terms of recognizing SCOP domains out of multi-domain proteins, while maintaining accuracy comparable with existing projection methods in a standard single-domain benchmark test.
Availability: The source code is available at the following website: http://compbio.berkeley.edu/proj/writher/
Contact: dzhi@compbio.berkeley.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btq127
PMCID: PMC2859133
PMID: 20371498
We propose a feature-based image alignment method for single-particle electron microscopy that is able to accommodate various similarity scoring functions while efficiently sampling the two-dimensional transformational space. We use this image alignment method to evaluate the performance of a scoring function that is based on the Mutual Information (MI) of two images rather than one that is based on the cross-correlation function. We show that alignment using MI for the scoring function has far less model-dependent bias than is found with cross-correlation based alignment. We also demonstrate that MI improves the alignment of some types of heterogeneous data, provided that the signal to noise ratio is relatively high. These results indicate, therefore, that use of MI as the scoring function is well suited for the alignment of class-averages computed from single particle images. Our method is tested on data from three model structures and one real dataset.
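For readers unfamiliar with MI as a similarity score, a minimal sketch of the quantity follows. It assumes each image is supplied as a flat sequence of already quantized gray levels, and it uses a plain joint-histogram estimate. The function name and input format are illustrative, and this is not the alignment method itself.

```python
from collections import Counter
from math import log2


def mutual_information(img_a, img_b):
    """Mutual information (bits) between two equally sized quantized images."""
    if len(img_a) != len(img_b):
        raise ValueError("images must contain the same number of pixels")
    n = len(img_a)
    joint = Counter(zip(img_a, img_b))  # co-occurrence of gray levels per pixel
    count_a = Counter(img_a)
    count_b = Counter(img_b)
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * log2(c * n / (count_a[a] * count_b[b]))
    return mi
```

Unlike cross-correlation, this score is high whenever gray levels in one image predict those in the other, even if the mapping between them is not linear.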
doi:10.1016/j.jsb.2008.12.008
PMCID: PMC2740748
PMID: 19166941
Particle alignment; heterogeneous data; 2D alignment; EM reconstruction
Summary
Heterogeneous nuclear ribonucleoproteins (hnRNPs) have been traditionally seen as proteins packaging RNA nonspecifically into ribonucleoprotein particles (RNPs), but evidence suggests specific cellular functions on discrete target pre-mRNAs. Here we report genome-wide analysis of alternative splicing patterns regulated by four Drosophila homologues of the mammalian hnRNP A/B family (hrp36, hrp38, hrp40 and hrp48). Analysis of the global RNA binding distributions of each protein revealed both small and extensively bound regions on target transcripts. A significant subset of RNAs were bound and regulated by more than one hnRNP protein, revealing a combinatorial network of interactions. In vitro RNA binding site selection experiments (SELEX) identified distinct binding motif specificities for each protein that were over-represented in their respective regulated and bound transcripts. These results indicate that individual heterogeneous ribonucleoproteins have specific affinities for overlapping, but distinct, populations of target pre-mRNAs controlling their patterns of RNA processing.
doi:10.1016/j.molcel.2009.01.022
PMCID: PMC2674966
PMID: 19250905
alternative splicing; hnRNP proteins; RNA binding proteins; microarray; Drosophila melanogaster
Schwede, Torsten | Sali, Andrej | Honig, Barry | Levitt, Michael | Berman, Helen M. | Jones, David | Brenner, Steven E. | Burley, Stephen K. | Das, Rhiju | Dokholyan, Nikolay V. | Dunbrack, Roland L. | Fidelis, Krzysztof | Fiser, Andras | Godzik, Adam | Huang, Yuanpeng Janet | Humblet, Christine | Jacobson, Matthew P. | Joachimiak, Andrzej | Krystek, Stanley R. | Kortemme, Tanja | Kryshtafovych, Andriy | Montelione, Gaetano T. | Moult, John | Murray, Diana | Sanchez, Roberto | Sosnick, Tobin R. | Standley, Daron M. | Stouch, Terry | Vajda, Sandor | Vasquez, Max | Westbrook, John D. | Wilson, Ian A.
Summary
We describe the proceedings and conclusions from a “Workshop on Applications of Protein Models in Biomedical Research” that was held at University of California at San Francisco on 11 and 12 July, 2008. At the workshop, international scientists involved with structure modeling explored (i) how models are currently used in biomedical research, (ii) what the requirements and challenges for different applications are, and (iii) how the interaction between the computational and experimental research communities could be strengthened to advance the field.
doi:10.1016/j.str.2008.12.014
PMCID: PMC2739730
PMID: 19217386
The Structural Classification of Proteins (SCOP) database is a comprehensive ordering of all proteins of known structure, according to their evolutionary and structural relationships. The SCOP hierarchy comprises the following levels: Species, Protein, Family, Superfamily, Fold and Class. While keeping the original classification scheme intact, we have changed the production of SCOP in order to cope with a rapid growth of new structural data and to facilitate the discovery of new protein relationships. We describe ongoing developments and new features implemented in SCOP. A new update protocol supports batch classification of new protein structures by their detected relationships at Family and Superfamily levels in contrast to our previous sequential handling of new structural data by release date. We introduce pre-SCOP, a preview of the SCOP developmental version that enables earlier access to the information on new relationships. We also discuss the impact of worldwide Structural Genomics initiatives, which are producing new protein structures at an increasing rate, on the rates of discovery and growth of protein families and superfamilies. SCOP can be accessed at http://scop.mrc-lmb.cam.ac.uk/scop.
doi:10.1093/nar/gkm993
PMCID: PMC2238974
PMID: 18000004
doi:10.1371/journal.pcbi.0030157
PMCID: PMC1994973
PMID: 17907793
doi:10.1371/journal.pcbi.0030143
PMCID: PMC1994971
PMID: 17907791
Metal ions are essential for the folding of RNA into stable tertiary structures and for the catalytic activity of some RNA enzymes. To aid in the study of the roles of metal ions in RNA structural biology, we have created MeRNA (Metals in RNA), a comprehensive compilation of all metal binding sites identified in RNA 3D structures available from the PDB and Nucleic Acid Database. Currently, our database contains information relating to binding of 9764 metal ions corresponding to 23 distinct elements, in 256 RNA structures. The metal ion locations were confirmed and ligands characterized using original literature references. MeRNA includes eight manually identified metal-ion binding motifs, which are described in the literature. MeRNA is searchable by PDB identifier, metal ion, method of structure determination, resolution and R-values for X-ray structure and distance from metal to any RNA atom or to water. New structures with their respective binding motifs will be added to the database as they become available. The MeRNA database will further our understanding of the roles of metal ions in RNA folding and catalysis and have applications in structural and functional analysis, RNA design and engineering. The MeRNA database is accessible at .
doi:10.1093/nar/gkj058
PMCID: PMC1347421
PMID: 16381830
We present a statistical graphical model to infer specific molecular function for unannotated protein sequences using homology. Based on phylogenomic principles, SIFTER (Statistical Inference of Function Through Evolutionary Relationships) accurately predicts molecular function for members of a protein family given a reconciled phylogeny and available function annotations, even when the data are sparse or noisy. Our method produced specific and consistent molecular function predictions across 100 Pfam families in comparison to the Gene Ontology annotation database, BLAST, GOtcha, and Orthostrapper. We performed a more detailed exploration of functional predictions on the adenosine-5′-monophosphate/adenosine deaminase family and the lactate/malate dehydrogenase family, in the former case comparing the predictions against a gold standard set of published functional characterizations. Given function annotations for 3% of the proteins in the deaminase family, SIFTER achieves 96% accuracy in predicting molecular function for experimentally characterized proteins as reported in the literature. The accuracy of SIFTER on this dataset is a significant improvement over other currently available methods such as BLAST (75%), GeneQuiz (64%), GOtcha (89%), and Orthostrapper (11%). We also experimentally characterized the adenosine deaminase from Plasmodium falciparum, confirming SIFTER's prediction. The results illustrate the predictive power of exploiting a statistical model of function evolution in phylogenomic problems. A software implementation of SIFTER is available from the authors.
Synopsis
New genome sequences continue to be published at a prodigious rate. However, unannotated sequences are of limited use to biologists. To computationally annotate a hypothetical protein for molecular function, researchers generally attempt to carry out some form of information transfer from evolutionarily related proteins. Such transfer is most successfully achieved within the context of phylogenetic relationships, exploiting the comprehensive knowledge that is available regarding molecular evolution within a given protein family. This general approach to molecular function annotation is known as phylogenomics, and it is the best method currently available for providing high-quality annotations. A drawback of phylogenomics, however, is that it is a time-consuming manual process requiring expert knowledge. In the current paper, the authors have developed a statistical approach—referred to as SIFTER (Statistical Inference of Function Through Evolutionary Relationships)—that allows phylogenomic analyses to be carried out automatically.
The authors present the results of running SIFTER on a collection of 100 protein families. They also validate their method on a specific family for which a gold standard set of experimental annotations is available. They show that SIFTER annotates 96% of the gold standard proteins correctly, outperforming popular annotation methods including BLAST-based annotation (75%), GOtcha (89%), GeneQuiz (64%), and Orthostrapper (11%). The results support the feasibility of carrying out high-quality phylogenomic analyses of entire genomes.
doi:10.1371/journal.pcbi.0010045
PMCID: PMC1246806
PMID: 16217548
A report on the Keystone Symposium 'Structural Genomics', held concurrently with the 'Frontiers in Structural Biology' symposium, Snowbird, USA, 13-19 April 2004.
doi:10.1186/gb-2004-5-9-343
PMCID: PMC522866
PMID: 15345043
Release 2.0.1 of the Structural Classification of RNA (SCOR) database, http://scor.lbl.gov, contains a classification of the internal and hairpin loops in a comprehensive collection of 497 NMR and X-ray RNA structures. This report discusses findings of the classification that have not been reported previously. The SCOR database contains multiple examples of a newly described RNA motif, the extruded helical single strand. Internal loop base triples are classified in SCOR according to their three-dimensional context. These internal loop triples contain several examples of a frequently found motif, the minor groove AGC triple. SCOR also presents the predominant and alternate conformations of hairpin loops, as shown in the most well represented tetraloops, with consensus sequences GNRA, UNCG and ANYA. The ubiquity of the GNRA hairpin turn motif is illustrated by its presence in complex internal loops.
doi:10.1093/nar/gkh537
PMCID: PMC419439
PMID: 15121895
Following the hypothesis that the public databases contain cloned mRNAs that would be degraded in vivo by the nonsense-mediated mRNA decay mechanism, 144 isoform sequences deposited in SWISS-PROT have been identified that derive from mRNAs with premature termination codons.
Background
Nonsense-mediated mRNA decay (NMD) is a eukaryotic mRNA surveillance mechanism that detects and degrades mRNAs with premature termination codons (PTC+ mRNAs). In mammals, a termination codon is recognized as premature if it lies more than about 50 nucleotides upstream of the final intron position. More than a third of reliably inferred alternative splicing events in humans have been shown to result in PTC+ mRNA isoforms. As the mechanistic details of NMD have only recently been elucidated, we hypothesized that many PTC+ isoforms may have been cloned, characterized and deposited in the public databases, even though they would be targeted for degradation in vivo.
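The 50-nucleotide rule stated above can be written as a short check. This is a sketch under stated assumptions: positions are measured on the spliced mRNA, the distance is taken from the end of the termination codon to the last exon-exon junction, and the function and parameter names are hypothetical.

```python
def is_ptc(stop_end, exon_lengths, threshold=50):
    """Return True if the stop codon counts as premature under the 50-nt rule.

    stop_end: 0-based mRNA coordinate just past the termination codon.
    exon_lengths: exon lengths in transcript order (5' to 3').
    """
    if len(exon_lengths) < 2:
        return False  # no intron, so there is no junction to measure from
    last_junction = sum(exon_lengths[:-1])  # mRNA coordinate of the final junction
    return last_junction - stop_end > threshold
```

For exons of 200, 150 and 300 nt the final junction is at coordinate 350, so a stop codon ending at 250 (100 nt upstream) is flagged, while one ending at 320 (30 nt upstream) or anywhere in the last exon is not.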
Results
We analyzed the human alternative protein isoforms described in the SWISS-PROT database and found that 144 (5.8% of 2,483) isoform sequences amenable to analysis, from 107 (7.9% of 1,363) SWISS-PROT entries, derive from PTC+ mRNA.
Conclusions
For several of the PTC+ isoforms we identified, existing experimental evidence can be reinterpreted and is consistent with the action of NMD to degrade the transcripts. Several genes with mRNA isoforms that we identified as PTC+, namely calpain-10, the CDC-like kinases (CLKs) and LARD, show how previous experimental results may be understood in light of NMD.
doi:10.1186/gb-2004-5-2-r8
PMCID: PMC395752
PMID: 14759258
The ASTRAL Compendium provides several databases and tools to aid in the analysis of protein structures, particularly through the use of their sequences. Partially derived from the SCOP database of protein structure domains, it includes sequences for each domain and other resources useful for studying these sequences and domain structures. The current release of ASTRAL contains 54 745 domains, more than three times as many as the initial release 4 years ago. ASTRAL has undergone major transformations in the past 2 years. In addition to several complete updates each year, ASTRAL is now updated on a weekly basis with preliminary classifications of domains from newly released PDB structures. These classifications are available as a stand-alone database, as well as integrated into other ASTRAL databases such as representative subsets. To enhance the utility of ASTRAL to structural biologists, all SCOP domains are now made available as PDB-style coordinate files as well as sequences. In addition to sequences and representative subsets based on SCOP domains, sequences and subsets based on PDB chains are newly included in ASTRAL. Several search tools have been added to ASTRAL to facilitate retrieval of data by individual users and automated methods. ASTRAL may be accessed at http://astral.stanford.edu/.
doi:10.1093/nar/gkh034
PMCID: PMC308768
PMID: 14681391
SCOR, the Structural Classification of RNA (http://scor.lbl.gov), is a database designed to provide a comprehensive perspective and understanding of RNA motif three-dimensional structure, function, tertiary interactions and their relationships. SCOR 2.0 represents a major expansion and introduces a new classification organization. The new version represents the classification as a Directed Acyclic Graph (DAG), which allows a classification node to have multiple parents, in contrast to the strictly hierarchical classification used in SCOR 1.2. SCOR 2.0 supports three types of query terms in the updated search engine: PDB or NDB identifier, nucleotide sequence and keyword. We also provide parseable XML files for all information. This new release contains 511 RNA entries from the PDB as of 15 May 2003. A total of 5880 secondary structural elements are classified: 2104 hairpin loops and 3776 internal loops. RNA motifs reported in the literature, such as ‘Kink turn’ and ‘GNRA loops’, are now incorporated into the structural classification along with definitions and descriptions.
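The difference between a strict hierarchy and a DAG is that a node may sit under more than one parent. A minimal sketch of such a structure follows. The class name and the node labels used in the usage note are illustrative assumptions and do not reproduce the actual SCOR schema or its XML files.

```python
from collections import defaultdict


class ClassificationDAG:
    """Classification in which a node may have several parents."""

    def __init__(self):
        self.parents = defaultdict(set)  # child -> set of direct parents

    def ancestors(self, node):
        """Return every node reachable by following parent links upward."""
        seen = set()
        stack = list(self.parents.get(node, ()))
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.parents.get(current, ()))
        return seen

    def add_edge(self, parent, child):
        """Link child under parent, refusing links that would create a cycle."""
        if parent == child or child in self.ancestors(parent):
            raise ValueError(f"{parent!r} -> {child!r} would create a cycle")
        self.parents[child].add(parent)
```

With this structure a hypothetical node such as "GNRA loop" could be filed under both "hairpin loop" and "tertiary interaction", which a strictly hierarchical tree would not allow.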
doi:10.1093/nar/gkh080
PMCID: PMC308814
PMID: 14681389
The Structural Classification of Proteins (SCOP) database is a comprehensive ordering of all proteins of known structure, according to their evolutionary and structural relationships. Protein domains in SCOP are hierarchically classified into families, superfamilies, folds and classes. The continual accumulation of sequence and structural data allows more rigorous analysis and provides important information for understanding the protein world and its evolutionary repertoire. SCOP participates in a project that aims to rationalize and integrate the data on proteins held in several sequence and structure databases. As part of this project, starting with release 1.63, we have initiated a refinement of the SCOP classification, which introduces a number of changes mostly at the levels below superfamily. The pending SCOP reclassification will be carried out gradually through a number of future releases. In addition to the expanded set of static links to external resources, available at the level of domain entries, we have started modernization of the interface capabilities of SCOP allowing more dynamic links with other databases. SCOP can be accessed at http://scop.mrc-lmb.cam.ac.uk/scop.
doi:10.1093/nar/gkh039
PMCID: PMC308773
PMID: 14681400
The ASTRAL compendium provides several databases and tools to aid in the analysis of protein structures, particularly through the use of their sequences. It is partially derived from the SCOP database of protein domains, and it includes sequences for each domain as well as other resources useful for studying these sequences and domain structures. Several major improvements have been made to the ASTRAL compendium since its initial release 2 years ago. The number of protein domain sequences included has doubled from 15 190 to 30 867, and additional databases have been added. The Rapid Access Format (RAF) database contains manually curated mappings linking the biological amino acid sequences described in the SEQRES records of PDB entries to the amino acid sequences structurally observed (provided in the ATOM records) in a format designed for rapid access by automated tools. This information is used to derive sequences for protein domains in the SCOP database. In cases where a SCOP domain spans several protein chains, all of which can be traced back to a single genetic source, a ‘genetic domain’ sequence is created by concatenating the sequences of each chain in the order found in the original gene sequence. Both the original-style library of SCOP sequences and a new library including genetic domain sequences are available. Selected representative subsets of each of these libraries, based on multiple criteria and degrees of similarity, are also included. ASTRAL may be accessed at http://astral.stanford.edu/.
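The construction of a genetic domain sequence described above amounts to ordering the chain sequences by their position in the original gene and concatenating them. A minimal sketch, assuming each chain is supplied as a hypothetical `(gene_order, sequence)` pair; this input format is an illustration and is not the RAF format.

```python
def genetic_domain_sequence(chain_fragments):
    """Concatenate chain sequences in the order they occur in the source gene.

    chain_fragments: iterable of (gene_order, sequence) pairs, in any order.
    """
    ordered = sorted(chain_fragments, key=lambda fragment: fragment[0])
    return "".join(sequence for _, sequence in ordered)
```

For example, chains supplied as `[(2, "KLV"), (1, "MAS")]` yield `"MASKLV"`, regardless of the order in which the chains appear in the structure file.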
PMCID: PMC99063
PMID: 11752310
The SCOP (Structural Classification of Proteins) database is a comprehensive ordering of all proteins of known structure, according to their evolutionary and structural relationships. Protein domains in SCOP are grouped into species and hierarchically classified into families, superfamilies, folds and classes. Recently, we introduced a new set of features with the aim of standardizing access to the database, and providing a solid basis to manage the increasing number of experimental structures expected from structural genomics projects. These features include: a new set of identifiers, which uniquely identify each entry in the hierarchy; a compact representation of protein domain classification; a new set of parseable files, which fully describe all domains in SCOP and the hierarchy itself. These new features are reflected in the ASTRAL compendium. The SCOP search engine has also been updated, and a set of links to external resources added at the level of domain entries. SCOP can be accessed at http://scop.mrc-lmb.cam.ac.uk/scop.
PMCID: PMC99154
PMID: 11752311
The Structural Classification of RNA (SCOR) database provides a survey of the three-dimensional motifs contained in 259 NMR and X-ray RNA structures. In one classification, the structures are grouped according to function. The RNA motifs, including internal and external loops, are also organized in a hierarchical classification. The 259 database entries contain 223 internal and 203 external loops; 52 entries consist of fully complementary duplexes. A classification of the well-characterized tertiary interactions found in the larger RNA structures is also included along with examples. The SCOR database is accessible at http://scor.lbl.gov.
PMCID: PMC99131
PMID: 11752346