Results 1-25 (39)

1.  Newborn Screening for SCID Identifies Patients with Ataxia Telangiectasia 
Journal of clinical immunology  2012;33(3):540-549.
Severe combined immunodeficiency (SCID) is characterized by failure of T lymphocyte development and absent or very low T cell receptor excision circles (TRECs), DNA byproducts of T cell maturation. Newborn screening for TRECs to identify SCID is now performed in several states using PCR of DNA from universally collected dried blood spots (DBS). In addition to infants with typical SCID, TREC screening identifies infants with T lymphocytopenia who appear healthy and in whom a SCID diagnosis cannot be confirmed. Deep sequencing was employed to find causes of T lymphocytopenia in such infants.
Whole exome sequencing and analysis were performed in infants and their parents. Upon finding deleterious mutations in the ataxia telangiectasia mutated (ATM) gene, we confirmed the diagnosis of ataxia telangiectasia (AT) in two infants and then tested archival newborn DBS of additional AT patients for TREC copy number.
Exome sequencing and analysis led to 2 unsuspected gene diagnoses of AT. Of 13 older AT patients for whom newborn DBS had been stored, 7 samples tested positive for SCID under the criteria of California’s newborn screening program. AT children with low neonatal TRECs had low CD4 T cell counts subsequently detected (R=0.64).
T lymphocytopenia in newborns can be a feature of AT, as revealed by TREC screening and exome sequencing. Although there is no current cure for the progressive neurological impairment of AT, early detection permits avoidance of infectious complications, while providing information for families regarding reproductive recurrence risks and increased cancer risks in patients and carriers.
PMCID: PMC3591536  PMID: 23264026
ataxia telangiectasia; SCID; newborn screening; TREC; whole exome sequencing
2.  SCOPe: Structural Classification of Proteins—extended, integrating SCOP and ASTRAL data and classification of new structures 
Nucleic Acids Research  2013;42(Database issue):D304-D309.
Structural Classification of Proteins—extended (SCOPe) is a database of protein structural relationships that extends the SCOP database. SCOP is a manually curated ordering of domains from the majority of proteins of known structure in a hierarchy according to structural and evolutionary relationships. Development of the SCOP 1.x series concluded with SCOP 1.75. The ASTRAL compendium provides several databases and tools to aid in the analysis of the protein structures classified in SCOP, particularly through the use of their sequences. SCOPe extends version 1.75 of the SCOP database, using automated curation methods to classify many structures released since SCOP 1.75. We have rigorously benchmarked our automated methods to ensure that they are as accurate as manual curation, though there are many proteins to which our methods cannot be applied. SCOPe is also partially manually curated to correct some errors in SCOP. SCOPe aims to be backward compatible with SCOP, providing the same parseable files and a history of changes between all stable SCOP and SCOPe releases. SCOPe also incorporates and updates the ASTRAL database. The latest release of SCOPe, 2.03, contains 59 514 Protein Data Bank (PDB) entries, increasing the number of structures classified in SCOP by 55% and including more than 65% of the protein structures in the PDB.
PMCID: PMC3965108  PMID: 24304899
3.  A large-scale evaluation of computational protein function prediction 
Radivojac, Predrag | Clark, Wyatt T | Ronnen Oron, Tal | Schnoes, Alexandra M | Wittkop, Tobias | Sokolov, Artem | Graim, Kiley | Funk, Christopher | Verspoor, Karin | Ben-Hur, Asa | Pandey, Gaurav | Yunes, Jeffrey M | Talwalkar, Ameet S | Repo, Susanna | Souza, Michael L | Piovesan, Damiano | Casadio, Rita | Wang, Zheng | Cheng, Jianlin | Fang, Hai | Gough, Julian | Koskinen, Patrik | Törönen, Petri | Nokso-Koivisto, Jussi | Holm, Liisa | Cozzetto, Domenico | Buchan, Daniel W A | Bryson, Kevin | Jones, David T | Limaye, Bhakti | Inamdar, Harshal | Datta, Avik | Manjari, Sunitha K | Joshi, Rajendra | Chitale, Meghana | Kihara, Daisuke | Lisewski, Andreas M | Erdin, Serkan | Venner, Eric | Lichtarge, Olivier | Rentzsch, Robert | Yang, Haixuan | Romero, Alfonso E | Bhat, Prajwal | Paccanaro, Alberto | Hamp, Tobias | Kassner, Rebecca | Seemayer, Stefan | Vicedo, Esmeralda | Schaefer, Christian | Achten, Dominik | Auer, Florian | Böhm, Ariane | Braun, Tatjana | Hecht, Maximilian | Heron, Mark | Hönigschmid, Peter | Hopf, Thomas | Kaufmann, Stefanie | Kiening, Michael | Krompass, Denis | Landerer, Cedric | Mahlich, Yannick | Roos, Manfred | Björne, Jari | Salakoski, Tapio | Wong, Andrew | Shatkay, Hagit | Gatzmann, Fanny | Sommer, Ingolf | Wass, Mark N | Sternberg, Michael J E | Škunca, Nives | Supek, Fran | Bošnjak, Matko | Panov, Panče | Džeroski, Sašo | Šmuc, Tomislav | Kourmpetis, Yiannis A I | van Dijk, Aalt D J | ter Braak, Cajo J F | Zhou, Yuanpeng | Gong, Qingtian | Dong, Xinran | Tian, Weidong | Falda, Marco | Fontana, Paolo | Lavezzo, Enrico | Di Camillo, Barbara | Toppo, Stefano | Lan, Liang | Djuric, Nemanja | Guo, Yuhong | Vucetic, Slobodan | Bairoch, Amos | Linial, Michal | Babbitt, Patricia C | Brenner, Steven E | Orengo, Christine | Rost, Burkhard | Mooney, Sean D | Friedberg, Iddo
Nature methods  2013;10(3):221-227.
Automated annotation of protein function is challenging. As the number of sequenced genomes rapidly grows, the overwhelming majority of protein products can only be annotated computationally. If computational predictions are to be relied upon, it is crucial that the accuracy of these methods be high. Here we report the results from the first large-scale community-based Critical Assessment of protein Function Annotation (CAFA) experiment. Fifty-four methods representing the state-of-the-art for protein function prediction were evaluated on a target set of 866 proteins from eleven organisms. Two findings stand out: (i) today’s best protein function prediction algorithms significantly outperformed widely-used first-generation methods, with large gains on all types of targets; and (ii) although the top methods perform well enough to guide experiments, there is significant need for improvement of currently available tools.
PMCID: PMC3584181  PMID: 23353650
4.  The COMBREX Project: Design, Methodology, and Initial Results 
Anton, Brian P. | Chang, Yi-Chien | Brown, Peter | Choi, Han-Pil | Faller, Lina L. | Guleria, Jyotsna | Hu, Zhenjun | Klitgord, Niels | Levy-Moonshine, Ami | Maksad, Almaz | Mazumdar, Varun | McGettrick, Mark | Osmani, Lais | Pokrzywa, Revonda | Rachlin, John | Swaminathan, Rajeswari | Allen, Benjamin | Housman, Genevieve | Monahan, Caitlin | Rochussen, Krista | Tao, Kevin | Bhagwat, Ashok S. | Brenner, Steven E. | Columbus, Linda | de Crécy-Lagard, Valérie | Ferguson, Donald | Fomenkov, Alexey | Gadda, Giovanni | Morgan, Richard D. | Osterman, Andrei L. | Rodionov, Dmitry A. | Rodionova, Irina A. | Rudd, Kenneth E. | Söll, Dieter | Spain, James | Xu, Shuang-yong | Bateman, Alex | Blumenthal, Robert M. | Bollinger, J. Martin | Chang, Woo-Suk | Ferrer, Manuel | Friedberg, Iddo | Galperin, Michael Y. | Gobeill, Julien | Haft, Daniel | Hunt, John | Karp, Peter | Klimke, William | Krebs, Carsten | Macelis, Dana | Madupu, Ramana | Martin, Maria J. | Miller, Jeffrey H. | O'Donovan, Claire | Palsson, Bernhard | Ruch, Patrick | Setterdahl, Aaron | Sutton, Granger | Tate, John | Yakunin, Alexander | Tchigvintsev, Dmitri | Plata, Germán | Hu, Jie | Greiner, Russell | Horn, David | Sjölander, Kimmen | Salzberg, Steven L. | Vitkup, Dennis | Letovsky, Stanley | Segrè, Daniel | DeLisi, Charles | Roberts, Richard J. | Steffen, Martin | Kasif, Simon
PLoS Biology  2013;11(8):e1001638.
Experimental data exists for only a vanishingly small fraction of sequenced microbial genes. This community page discusses the progress made by the COMBREX project to address this important issue using both computational and experimental resources.
PMCID: PMC3754883  PMID: 24013487
5.  Association of gut microbiota with post-operative clinical course in Crohn’s disease 
BMC Gastroenterology  2013;13:131.
The gut microbiome is altered in Crohn’s disease. Although individual taxa have been correlated with post-operative clinical course, global trends in microbial diversity have not been described in this context.
We collected mucosal biopsies from the terminal ileum and ascending colon during surgery and post-operative colonoscopy in 6 Crohn’s patients undergoing ileocolic resection (and 40 additional Crohn’s and healthy control patients undergoing either surgery or colonoscopy). Using next-generation sequencing technology, we profiled the gut microbiota in order to identify changes associated with remission or recurrence of inflammation.
We performed 16S ribosomal profiling using 101 base-pair single-end sequencing on the Illumina GAIIx platform with deep coverage, at an average depth of 1.3 million high quality reads per sample. At the time of surgery, Crohn’s patients who would remain in remission were more similar to controls and more species-rich than Crohn’s patients with subsequent recurrence. Patients remaining in remission also exhibited greater stability of the microbiota through time.
These observations permitted an association of gut microbial profiles with probability of recurrence in this limited single-center study. These results suggest that profiling the gut microbiota may be useful in guiding treatment of Crohn’s patients undergoing surgery.
PMCID: PMC3848607  PMID: 23964800
Crohn's disease; Gut microbiome; Next-generation sequencing; Microbial profiling; 16S rRNA gene
6.  A continuous fluorescence assay for the characterization of Nudix hydrolases 
Analytical Biochemistry  2013;437(2):178-184.
The common substrate structure for the functionally diverse Nudix protein superfamily is nucleotide-diphosphate-X, where X is a large variety of leaving groups. The substrate specificity is known for less than 1% of the 29,400 known members. Most activities result in the release of an inorganic phosphate ion or of a product bearing a terminal phosphate moiety. Reactions have typically been monitored by a modification of the discontinuous Fiske–SubbaRow assay, which is relatively insensitive and slow. We report here the development of a continuous fluorescence assay that enables the rapid and accurate determination of substrate specificities in a 96-well format. We used this novel assay to confirm the reported substrate characterizations of MutT and NudD of Escherichia coli and to characterize DR_1025 of Deinococcus radiodurans and MM_0920 of Methanosarcina mazei. Novel findings enabled by the new assay include the following. First, in addition to the well-characterized hydrolysis of 8-oxo-dGTP at the α–β position, MutT cleaves at the β–γ phosphate bond at a rate of 3% of that recorded for hydrolysis at the α–β position. Second, MutT also catalyzes the hydrolysis of 5-methyl-dCTP. Third, 8-oxo-dGTP was observed to be the best substrate for DR_1025 of the 41 compounds screened.
PMCID: PMC3744803  PMID: 23481913
Nudix; Continuous assay; Fluorescence; Substrate screening; Kinetics
7.  Developing Computational Biology 
PLoS Computational Biology  2007;3(9):e157.
PMCID: PMC1994973  PMID: 17907793
8.  Newborn Screening for SCID Identifies Patients with Ataxia Telangiectasia 
Journal of Clinical Immunology  2012;33(3):540-549.
Severe combined immunodeficiency (SCID) is characterized by failure of T lymphocyte development and absent or very low T cell receptor excision circles (TRECs), DNA byproducts of T cell maturation. Newborn screening for TRECs to identify SCID is now performed in several states using PCR of DNA from universally collected dried blood spots (DBS). In addition to infants with typical SCID, TREC screening identifies infants with T lymphocytopenia who appear healthy and in whom a SCID diagnosis cannot be confirmed. Deep sequencing was employed to find causes of T lymphocytopenia in such infants.
Whole exome sequencing and analysis were performed in infants and their parents. Upon finding deleterious mutations in the ataxia telangiectasia mutated (ATM) gene, we confirmed the diagnosis of ataxia telangiectasia (AT) in two infants and then tested archival newborn DBS of additional AT patients for TREC copy number.
Exome sequencing and analysis led to 2 unsuspected gene diagnoses of AT. Of 13 older AT patients for whom newborn DBS had been stored, 7 samples tested positive for SCID under the criteria of California’s newborn screening program. AT children with low neonatal TRECs had low CD4 T cell counts subsequently detected (R = 0.64).
T lymphocytopenia in newborns can be a feature of AT, as revealed by TREC screening and exome sequencing. Although there is no current cure for the progressive neurological impairment of AT, early detection permits avoidance of infectious complications, while providing information for families regarding reproductive recurrence risks and increased cancer risks in patients and carriers.
PMCID: PMC3591536  PMID: 23264026
Ataxia telangiectasia; SCID; newborn screening; TREC; whole exome sequencing
9.  PLoS Computational Biology: A New Community Journal 
PMCID: PMC1183510  PMID: 16103905
10.  Selection of primers for optimal taxonomic classification of environmental 16S rRNA gene sequences 
The ISME Journal  2012;6(7):1440-1444.
Microbial community profiling using 16S rRNA gene sequences requires accurate taxonomy assignments. ‘Universal’ primers target conserved sequences and amplify sequences from many taxa, but they provide variable coverage of different environments, and regions of the rRNA gene differ in taxonomic informativeness, especially when high-throughput short-read sequencing technologies (for example, 454 and Illumina) are used. We introduce a new evaluation procedure that provides an improved measure of expected taxonomic precision when classifying environmental sequence reads from a given primer. Applying this measure to thousands of combinations of primers and read lengths, simulating single-ended and paired-end sequencing, reveals that these choices greatly affect taxonomic informativeness. The most informative sequence region may differ by environment, partly due to variable coverage of different environments in reference databases. Using our Rtax method of classifying paired-end reads, we found that paired-end sequencing provides substantial benefit in some environments including human gut, but not in others. Optimal primer choice for short reads totaling 96 nt provides 82–100% of the confident genus classifications available from longer reads.
PMCID: PMC3379642  PMID: 22237546
16S ribosomal RNA; taxonomy; phylogeny; classification; bacteria; sequencing
11.  Identification of Functional Elements and Regulatory Circuits by Drosophila modENCODE 
Roy, Sushmita | Ernst, Jason | Kharchenko, Peter V. | Kheradpour, Pouya | Negre, Nicolas | Eaton, Matthew L. | Landolin, Jane M. | Bristow, Christopher A. | Ma, Lijia | Lin, Michael F. | Washietl, Stefan | Arshinoff, Bradley I. | Ay, Ferhat | Meyer, Patrick E. | Robine, Nicolas | Washington, Nicole L. | Di Stefano, Luisa | Berezikov, Eugene | Brown, Christopher D. | Candeias, Rogerio | Carlson, Joseph W. | Carr, Adrian | Jungreis, Irwin | Marbach, Daniel | Sealfon, Rachel | Tolstorukov, Michael Y. | Will, Sebastian | Alekseyenko, Artyom A. | Artieri, Carlo | Booth, Benjamin W. | Brooks, Angela N. | Dai, Qi | Davis, Carrie A. | Duff, Michael O. | Feng, Xin | Gorchakov, Andrey A. | Gu, Tingting | Henikoff, Jorja G. | Kapranov, Philipp | Li, Renhua | MacAlpine, Heather K. | Malone, John | Minoda, Aki | Nordman, Jared | Okamura, Katsutomo | Perry, Marc | Powell, Sara K. | Riddle, Nicole C. | Sakai, Akiko | Samsonova, Anastasia | Sandler, Jeremy E. | Schwartz, Yuri B. | Sher, Noa | Spokony, Rebecca | Sturgill, David | van Baren, Marijke | Wan, Kenneth H. | Yang, Li | Yu, Charles | Feingold, Elise | Good, Peter | Guyer, Mark | Lowdon, Rebecca | Ahmad, Kami | Andrews, Justen | Berger, Bonnie | Brenner, Steven E. | Brent, Michael R. | Cherbas, Lucy | Elgin, Sarah C. R. | Gingeras, Thomas R. | Grossman, Robert | Hoskins, Roger A. | Kaufman, Thomas C. | Kent, William | Kuroda, Mitzi I. | Orr-Weaver, Terry | Perrimon, Norbert | Pirrotta, Vincenzo | Posakony, James W. | Ren, Bing | Russell, Steven | Cherbas, Peter | Graveley, Brenton R. | Lewis, Suzanna | Micklem, Gos | Oliver, Brian | Park, Peter J. | Celniker, Susan E. | Henikoff, Steven | Karpen, Gary H. | Lai, Eric C. | MacAlpine, David M. | Stein, Lincoln D. | White, Kevin P. | Kellis, Manolis
Science (New York, N.Y.)  2010;330(6012):1787-1797.
To gain insight into how genomic information is translated into cellular and developmental programs, the Drosophila model organism Encyclopedia of DNA Elements (modENCODE) project is comprehensively mapping transcripts, histone modifications, chromosomal proteins, transcription factors, replication proteins and intermediates, and nucleosome properties across a developmental time course and in multiple cell lines. We have generated more than 700 data sets and discovered protein-coding, noncoding, RNA regulatory, replication, and chromatin elements, more than tripling the annotated portion of the Drosophila genome. Correlated activity patterns of these elements reveal a functional regulatory network, which predicts putative new functions for genes, reveals stage- and tissue-specific regulators, and enables gene-expression prediction. Our results provide a foundation for directed experimental and computational studies in Drosophila and related species and also a model for systematic data integration toward comprehensive genomic and functional annotation.
PMCID: PMC3192495  PMID: 21177974
12.  The Developmental Transcriptome of Drosophila melanogaster 
Nature  2010;471(7339):473-479.
Drosophila melanogaster is one of the most well-studied genetic model organisms; nonetheless, its genome still contains unannotated coding and non-coding genes, transcripts, exons, and RNA editing sites. Full discovery and annotation are prerequisites for understanding how the regulation of transcription, splicing, and RNA editing directs development of this complex organism. We used RNA-Seq, tiling microarrays, and cDNA sequencing to explore the transcriptome in 30 distinct developmental stages. We identified 111,195 new elements, including thousands of genes, coding and non-coding transcripts, exons, splicing and editing events and inferred protein isoforms that previously eluded discovery using established experimental, prediction and conservation-based approaches. Together, these data substantially expand the number of known transcribed elements in the Drosophila genome and provide a high-resolution view of transcriptome dynamics throughout development.
PMCID: PMC3075879  PMID: 21179090
13.  Automated Multi-model Reconstruction from Single-Particle Electron Microscopy Data 
Journal of structural biology  2010;170(1):98-108.
Biological macromolecules can adopt multiple conformational and compositional states due to structural flexibility and alternative subunit assemblies. This structural heterogeneity poses a major challenge in the study of macromolecular structure using single particle electron microscopy. We propose a fully automated, unsupervised method for the three-dimensional reconstruction of multiple structural models from heterogeneous data. As a starting reference, our method employs an initial structure that does not account for any heterogeneity. Then, a multi-stage clustering is used to create multiple models representative of the heterogeneity within the sample. The multi-stage clustering combines an existing approach based on Multivariate Statistical Analysis to perform clustering within individual Euler angles, and a newly developed approach to sort out class-averages from individual Euler angles into homogeneous groups. Structural models are computed from individual clusters. The whole data classification is further refined using an iterative multi-model projection matching approach. We tested our method on one synthetic and three distinct experimental datasets. The tests include the cases where a macromolecular complex exhibits structural flexibility and cases where a molecule is found in ligand-bound and unbound states. We propose the use of our approach as an efficient way to reconstruct distinct multiple models from heterogeneous data.
PMCID: PMC2841227  PMID: 20085819
Heterogeneous reconstruction; heterogeneous data; multi-model reconstruction
14.  An SF1 affinity model to identify branch point sequences in human introns 
Nucleic Acids Research  2010;39(6):2344-2356.
Splicing factor 1 (SF1) binds to the branch point sequence (BPS) of mammalian introns and is believed to be important for the splicing of some, but not all, introns. To help identify BPSs, particularly those that depend on SF1, we generated a BPS profile model in which SF1 binding affinity data, validated by branch point mapping, were iteratively incorporated into computational models. We searched a data set of 117 499 human introns for best matches to the SF1 Affinity Model above a threshold, and counted the number of matches at each intronic position. After subtracting a background value, we found that 87.9% of remaining high-scoring matches identified were located in a region upstream of 3′-splice sites where BPSs are typically found. Since U2AF65 recognizes the polypyrimidine tract (PPT) and forms a cooperative RNA complex with SF1, we combined the SF1 model with a PPT model computed from high affinity binding sequences for U2AF65. The combined model, together with binding site location constraints, accurately identified introns bound by SF1 that are candidates for SF1-dependent splicing.
PMCID: PMC3064769  PMID: 21071404
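The search step described in entry 14 (scoring every intronic window against a profile model, keeping matches above a threshold, and counting matches by intronic position) can be sketched roughly as follows. This is a minimal illustration, not the authors' SF1 Affinity Model: the 5-nt `TOY_MODEL`, its scores, the threshold, and the function names are all hypothetical, and the background subtraction mentioned in the abstract is omitted.

```python
from collections import Counter

# Hypothetical position-specific scores for a 5-nt motif. The values are
# invented for illustration and are NOT measured SF1 binding affinities.
TOY_MODEL = [{b: (1.0 if b == p else -1.0) for b in "ACGU"} for p in "CUAAC"]


def score_window(window, model):
    """Sum the per-position scores of one window; unknown bases score -1.0."""
    return sum(pos.get(base, -1.0) for pos, base in zip(model, window))


def scan_intron(seq, model, threshold):
    """Return (distance_to_3prime_end, score) for every window >= threshold.

    The distance is the number of nucleotides between the end of the
    matched window and the 3' end of the intron sequence.
    """
    k = len(model)
    hits = []
    for start in range(len(seq) - k + 1):
        s = score_window(seq[start:start + k], model)
        if s >= threshold:
            hits.append((len(seq) - (start + k), s))
    return hits


def position_histogram(introns, model, threshold):
    """Count above-threshold matches at each distance from the 3' end."""
    counts = Counter()
    for seq in introns:
        for distance, _ in scan_intron(seq, model, threshold):
            counts[distance] += 1
    return counts


introns = ["GUAAGUCUAACUUUUUUUUCAG", "GUCUAACUUUCAG"]
histogram = position_histogram(introns, TOY_MODEL, threshold=5.0)
```

With a histogram of this kind, an enrichment of matches in a window upstream of the 3′-splice site would show up as a peak at small distances.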
15.  Phylogenetic molecular function annotation 
It is now easier to discover thousands of protein sequences in a new microbial genome than it is to biochemically characterize the specific activity of a single protein of unknown function. The molecular functions of protein sequences have typically been predicted using homology-based computational methods, which rely on the principle that homologous proteins share a similar function. However, some protein families include groups of proteins with different molecular functions. A phylogenetic approach for predicting molecular function (sometimes called “phylogenomics”) is an effective means to predict protein molecular function. These methods incorporate functional evidence from all members of a family that have functional characterizations using the evolutionary history of the protein family to make robust predictions for the uncharacterized proteins. However, they are often difficult to apply on a genome-wide scale because of the time-consuming step of reconstructing the phylogenies of each protein to be annotated. Our automated approach for function annotation using phylogeny, the SIFTER (Statistical Inference of Function Through Evolutionary Relationships) methodology, uses a statistical graphical model to compute the probabilities of molecular functions for unannotated proteins. Our benchmark tests showed that SIFTER provides accurate functional predictions on various protein families, outperforming other available methods.
PMCID: PMC2909777  PMID: 20664722
16.  Biases in Illumina transcriptome sequencing caused by random hexamer priming 
Nucleic Acids Research  2010;38(12):e131.
Generation of cDNA using random hexamer priming induces biases in the nucleotide composition at the beginning of transcriptome sequencing reads from the Illumina Genome Analyzer. The bias is independent of organism and laboratory and impacts the uniformity of the reads along the transcriptome. We provide a read count reweighting scheme, based on the nucleotide frequencies of the reads, that mitigates the impact of the bias.
PMCID: PMC2896536  PMID: 20395217
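The idea of a read count reweighting scheme based on nucleotide frequencies (entry 16) might be sketched as below. The abstract does not give the formula, so this sketch assumes one plausible form: each read is weighted by the ratio of its leading k-mer's frequency in the read interior to that k-mer's frequency at the read start. The k-mer length, the interior offset, and all function names are assumptions made for illustration.

```python
from collections import Counter


def kmer_freqs(reads, start, k):
    """Relative frequency of each k-mer found at offset `start` across reads."""
    counts = Counter(r[start:start + k] for r in reads if len(r) >= start + k)
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def start_weights(reads, k=7, interior_start=20):
    """Assumed weight: interior frequency / read-start frequency per k-mer.

    k-mers over-represented at the read start (the priming bias) get a
    weight below 1; under-represented k-mers get a weight above 1.
    """
    head = kmer_freqs(reads, 0, k)
    interior = kmer_freqs(reads, interior_start, k)
    return {kmer: interior.get(kmer, 0.0) / f for kmer, f in head.items()}


def reweighted_count(reads, weights, k=7):
    """Sum of per-read weights; reads with an unseen prefix count as 1.0."""
    return sum(weights.get(r[:k], 1.0) for r in reads)


# Toy data with k=2: "AA" is over-represented at the start of the reads.
toy_reads = ["AACC", "AACC", "AAGG", "CCAA"]
toy_weights = start_weights(toy_reads, k=2, interior_start=2)
toy_total = reweighted_count(toy_reads, toy_weights, k=2)
```

In practice the weights would be estimated once from the whole library and then applied when counting reads per transcript or per position.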
17.  Alignment-free local structural search by writhe decomposition 
Bioinformatics  2010;26(9):1176-1184.
Motivation: Rapid methods for protein structure search enable biological discoveries based on flexibly defined structural similarity, unleashing the power of the ever greater number of solved protein structures. Projection methods show promise for the development of fast structural database search solutions. Projection methods map a structure to a point in a high-dimensional space and compare two structures by measuring distance between their projected points. These methods offer a tremendous increase in speed over residue-level structural alignment methods. However, current projection methods are not practical, partly because they are unable to identify local similarities.
Results: We propose a new projection-based approach that can rapidly detect global as well as local structural similarities. Local structural search is enabled by a topology-inspired writhe decomposition protocol that produces a small number of fragments while ensuring that similar structures are cut in a similar manner. In benchmark tests, we show that our method, writher, improves accuracy over existing projection methods in terms of recognizing scop domains out of multi-domain proteins, while maintaining accuracy comparable with existing projection methods in a standard single-domain benchmark test.
Availability: The source code is available at the following website:
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2859133  PMID: 20371498
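The general projection idea from entry 17 (map each structure to a point in a vector space, then compare structures by the distance between those points) can be illustrated with a deliberately simple descriptor. This sketch does not implement writhe decomposition; it substitutes a normalized histogram of pairwise coordinate distances, and the bin width, bin count, and function names are assumptions.

```python
import math
from itertools import combinations


def project(coords, bin_width=4.0, n_bins=10):
    """Map 3D coordinates to a fixed-length vector.

    Stand-in descriptor: normalized histogram of all pairwise distances.
    Distances beyond the last bin are clamped into it.
    """
    hist = [0.0] * n_bins
    pairs = 0
    for p, q in combinations(coords, 2):
        d = math.dist(p, q)
        hist[min(int(d / bin_width), n_bins - 1)] += 1.0
        pairs += 1
    return [h / pairs for h in hist] if pairs else hist


def projected_distance(u, v):
    """Euclidean distance between two projected points."""
    return math.dist(u, v)


chain = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
shifted = [(x + 1.0, y + 1.0, z + 1.0) for x, y, z in chain]
stretched = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (20.0, 0.0, 0.0)]

same = projected_distance(project(chain), project(shifted))
different = projected_distance(project(chain), project(stretched))
```

Because each structure is reduced to one vector, a database search becomes a nearest-neighbor lookup rather than a residue-level alignment, which is the source of the speed advantage the abstract describes.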
18.  A Method for the Alignment of Heterogeneous Macromolecules from Electron Microscopy 
Journal of structural biology  2008;166(1):67-78.
We propose a feature-based image alignment method for single-particle electron microscopy that is able to accommodate various similarity scoring functions while efficiently sampling the two-dimensional transformational space. We use this image alignment method to evaluate the performance of a scoring function that is based on the Mutual Information (MI) of two images rather than one that is based on the cross-correlation function. We show that alignment using MI for the scoring function has far less model-dependent bias than is found with cross-correlation based alignment. We also demonstrate that MI improves the alignment of some types of heterogeneous data, provided that the signal to noise ratio is relatively high. These results indicate, therefore, that use of MI as the scoring function is well suited for the alignment of class-averages computed from single particle images. Our method is tested on data from three model structures and one real dataset.
PMCID: PMC2740748  PMID: 19166941
Particle alignment; heterogeneous data; 2D alignment; EM reconstruction
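For entry 18, the mutual information (MI) of two images can be estimated from a joint histogram of their quantized pixel intensities. The sketch below is a minimal version of that standard estimator, not the authors' implementation; it assumes grayscale images given as nested lists with values in [0, 1], and the number of intensity levels is an arbitrary choice.

```python
import math
from collections import Counter


def mutual_information(img_a, img_b, levels=8):
    """MI in bits between two equally sized grayscale images.

    Pixel values are assumed to lie in [0, 1] and are quantized into
    `levels` intensity bins before the joint histogram is built.
    """
    a = [min(int(v * levels), levels - 1) for row in img_a for v in row]
    b = [min(int(v * levels), levels - 1) for row in img_b for v in row]
    if len(a) != len(b):
        raise ValueError("images must contain the same number of pixels")
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_a[x] * count_b[y]))
    return mi


stripes = [[0.0, 1.0], [0.0, 1.0]]
flat = [[0.5, 0.5], [0.5, 0.5]]
mi_self = mutual_information(stripes, stripes, levels=2)
mi_flat = mutual_information(stripes, flat, levels=2)
```

An alignment routine would evaluate this score for each candidate rotation and translation of one image against the other and keep the transformation with the highest MI, in place of the highest cross-correlation.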
19.  Genome-wide analysis of alternative pre-mRNA splicing and RNA binding specificities of the Drosophila hnRNP A/B family members 
Molecular cell  2009;33(4):438-449.
Heterogeneous nuclear ribonucleoproteins (hnRNPs) have been traditionally seen as proteins packaging RNA nonspecifically into ribonucleoprotein particles (RNPs), but evidence suggests specific cellular functions on discrete target pre-mRNAs. Here we report genome-wide analysis of alternative splicing patterns regulated by four Drosophila homologues of the mammalian hnRNP A/B family (hrp36, hrp38, hrp40 and hrp48). Analysis of the global RNA binding distributions of each protein revealed both small and also extensively bound regions on target transcripts. A significant subset of RNAs were bound and regulated by more than one hnRNP protein, revealing a combinatorial network of interactions. In vitro RNA binding site selection experiments (SELEX) identified distinct binding motif specificities for each protein that were over-represented in their respective regulated and bound transcripts. These results indicate that individual heterogeneous ribonucleoproteins have specific affinities for overlapping, but distinct, populations of target pre-mRNAs controlling their patterns of RNA processing.
PMCID: PMC2674966  PMID: 19250905
alternative splicing; hnRNP proteins; RNA binding proteins; microarray; Drosophila melanogaster
20.  Outcome of a Workshop on Applications of Protein Models in Biomedical Research 
We describe the proceedings and conclusions from a “Workshop on Applications of Protein Models in Biomedical Research” that was held at University of California at San Francisco on 11 and 12 July, 2008. At the workshop, international scientists involved with structure modeling explored (i) how models are currently used in biomedical research, (ii) what the requirements and challenges for different applications are, and (iii) how the interaction between the computational and experimental research communities could be strengthened to advance the field.
PMCID: PMC2739730  PMID: 19217386
21.  Genome-Wide Identification of Alternative Splice Forms Down-Regulated by Nonsense-Mediated mRNA Decay in Drosophila 
PLoS Genetics  2009;5(6):e1000525.
Alternative mRNA splicing adds a layer of regulation to the expression of thousands of genes in Drosophila melanogaster. Not all alternative splicing results in functional protein; it can also yield mRNA isoforms with premature stop codons that are degraded by the nonsense-mediated mRNA decay (NMD) pathway. This coupling of alternative splicing and NMD provides a mechanism for gene regulation that is highly conserved in mammals. NMD is also active in Drosophila, but its effect on the repertoire of alternative splice forms has been unknown, as has the mechanism by which it recognizes targets. Here, we have employed a custom splicing-sensitive microarray to globally measure the effect of alternative mRNA processing and NMD on Drosophila gene expression. We have developed a new algorithm to infer the expression change of each mRNA isoform of a gene based on the microarray measurements. This method is of general utility for interpreting splicing-sensitive microarrays and high-throughput sequence data. Using this approach, we have identified a high-confidence set of 45 genes where NMD has a differential effect on distinct alternative isoforms, including numerous RNA–binding and ribosomal proteins. Coupled alternative splicing and NMD decrease expression of these genes, which may in turn have a downstream effect on expression of other genes. The NMD–affected genes are enriched for roles in translation and mitosis, perhaps underlying the previously observed role of NMD factors in cell cycle progression. Our results have general implications for understanding the NMD mechanism in fly. Most notably, we found that the NMD–target mRNAs had significantly longer 3′ untranslated regions (UTRs) than the nontarget isoforms of the same genes, supporting a role for 3′ UTR length in the recognition of NMD targets in fly.
Author Summary
A gene can be processed into multiple mRNAs through alternative splicing. Alternative splicing increases the number of proteins encoded by the genome, but not all alternative mRNAs produce protein. Instead, some are degraded by nonsense-mediated mRNA decay (NMD), a surveillance system that was originally identified as a means of clearing the cell of mRNAs with nonsense, or stop codon, mutations. Alternative splicing that introduces early stop codons will lead to NMD, offering a way for the cell to down-regulate gene expression after a gene has been transcribed. In this paper, we have developed a new analysis method to study the combined effect of alternative splicing and degradation in the fruit fly Drosophila melanogaster using microarrays. We have found a stringently defined set of 45 genes that can be spliced either into an mRNA that encodes a protein or into an mRNA that is degraded by NMD, down-regulating the overall gene expression. The affected genes include a number that are central to the cell's regulatory processes, including translation, RNA splicing, and cell cycle progression. Our results also help shed light on how NMD determines whether a stop codon is premature, and thus whether to target an mRNA for degradation.
PMCID: PMC2689934  PMID: 19543372
22.  Data growth and its impact on the SCOP database: new developments 
Nucleic Acids Research  2007;36(Database issue):D419-D425.
The Structural Classification of Proteins (SCOP) database is a comprehensive ordering of all proteins of known structure, according to their evolutionary and structural relationships. The SCOP hierarchy comprises the following levels: Species, Protein, Family, Superfamily, Fold and Class. While keeping the original classification scheme intact, we have changed the production of SCOP in order to cope with a rapid growth of new structural data and to facilitate the discovery of new protein relationships. We describe ongoing developments and new features implemented in SCOP. A new update protocol supports batch classification of new protein structures by their detected relationships at Family and Superfamily levels in contrast to our previous sequential handling of new structural data by release date. We introduce pre-SCOP, a preview of the SCOP developmental version that enables earlier access to the information on new relationships. We also discuss the impact of worldwide Structural Genomics initiatives, which are producing new protein structures at an increasing rate, on the rates of discovery and growth of protein families and superfamilies. SCOP can be accessed at
PMCID: PMC2238974  PMID: 18000004
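The six SCOP levels named in entry 22 form a strict hierarchy, so two classified domains can be compared by walking down from Class until their assignments diverge. The following is a minimal sketch of that idea, not SCOP's own software: the function name, the dictionary representation, and all labels such as "class-A" or "fam-1" are hypothetical placeholders.

```python
# Levels named in the abstract, ordered from most general to most specific.
SCOP_LEVELS_TOP_DOWN = ["Class", "Fold", "Superfamily", "Family", "Protein", "Species"]


def deepest_shared_level(a, b):
    """Return the most specific level down to which two classifications agree.

    `a` and `b` map each level name to a label. Comparison runs top-down and
    stops at the first mismatch, because a label at a lower level (for example
    a species name) is only meaningful within its parent node.
    Returns None if the two entries already differ at Class.
    """
    shared = None
    for level in SCOP_LEVELS_TOP_DOWN:
        if a[level] != b[level]:
            break
        shared = level
    return shared


# Hypothetical placeholder labels, not real SCOP entries.
dom1 = {"Class": "class-A", "Fold": "fold-1", "Superfamily": "sf-1",
        "Family": "fam-1", "Protein": "prot-1", "Species": "human"}
dom2 = {"Class": "class-A", "Fold": "fold-1", "Superfamily": "sf-1",
        "Family": "fam-2", "Protein": "prot-2", "Species": "human"}
dom3 = dict(dom1, Class="class-B")

print(deepest_shared_level(dom1, dom2))
```

Here the two domains share a Superfamily but not a Family, the kind of relationship the abstract says the new batch update protocol groups on. The matching Species label is ignored because the paths have already diverged higher up.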
23.  The Sorcerer II Global Ocean Sampling Expedition: Expanding the Universe of Protein Families 
PLoS Biology  2007;5(3):e16.
Metagenomics projects based on shotgun sequencing of populations of micro-organisms yield insight into protein families. We used sequence similarity clustering to explore proteins with a comprehensive dataset consisting of sequences from available databases together with 6.12 million proteins predicted from an assembly of 7.7 million Global Ocean Sampling (GOS) sequences. The GOS dataset covers nearly all known prokaryotic protein families. A total of 3,995 medium- and large-sized clusters consisting of only GOS sequences are identified, out of which 1,700 have no detectable homology to known families. The GOS-only clusters contain a higher than expected proportion of sequences of viral origin, thus reflecting a poor sampling of viral diversity until now. Protein domain distributions in the GOS dataset and current protein databases show distinct biases. Several protein domains that were previously categorized as kingdom specific are shown to have GOS examples in other kingdoms. About 6,000 sequences (ORFans) from the literature that heretofore lacked similarity to known proteins have matches in the GOS data. The GOS dataset is also used to improve remote homology detection. Overall, besides nearly doubling the number of current proteins, the predicted GOS proteins also add a great deal of diversity to known protein families and shed light on their evolution. These observations are illustrated using several protein families, including phosphatases, proteases, ultraviolet-irradiation DNA damage repair enzymes, glutamine synthetase, and RuBisCO. The diversity added by GOS data has implications for choosing targets for experimental structure characterization as part of structural genomics efforts. Our analysis indicates that new families are being discovered at a rate that is linear or almost linear with the addition of new sequences, implying that we are still far from discovering all protein families in nature.
Author Summary
The rapidly emerging field of metagenomics seeks to examine the genomic content of communities of organisms to understand their roles and interactions in an ecosystem. Given the wide-ranging roles microbes play in many ecosystems, metagenomics studies of microbial communities will reveal insights into protein families and their evolution. Because most microbes will not grow in the laboratory using current cultivation techniques, scientists have turned to cultivation-independent techniques to study microbial diversity. One such technique—shotgun sequencing—allows random sampling of DNA sequences to examine the genomic material present in a microbial community. We used shotgun sequencing to examine microbial communities in water samples collected by the Sorcerer II Global Ocean Sampling (GOS) expedition. Our analysis predicted more than six million proteins in the GOS data—nearly twice the number of proteins present in current databases. These predictions add tremendous diversity to known protein families and cover nearly all known prokaryotic protein families. Some of the predicted proteins had no similarity to any currently known proteins and therefore represent new families. A higher than expected fraction of these novel families is predicted to be of viral origin. We also found that several protein domains that were previously thought to be kingdom specific have GOS examples in other kingdoms. Our analysis opens the door for a multitude of follow-up protein family analyses and indicates that we are a long way from sampling all the protein families that exist in nature.
The GOS data identified 6.12 million predicted proteins covering nearly all known prokaryotic protein families, and several new families. This almost doubles the number of known proteins and shows that we are far from identifying all the proteins in nature.
PMCID: PMC1821046  PMID: 17355171
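Entry 23 describes clustering sequences by similarity and then picking out clusters that contain only GOS sequences. The sketch below illustrates that two-step idea on toy data. It assumes single-linkage grouping over a precomputed list of similar pairs, which is a stand-in chosen for brevity and not necessarily the clustering procedure used in the paper. All identifiers, the `min_size` cutoff, and the function names are hypothetical.

```python
from collections import defaultdict


def cluster(ids, similar_pairs):
    """Group sequence ids into clusters by single linkage (union-find).

    `similar_pairs` is an iterable of (id_a, id_b) pairs already judged
    similar by some upstream comparison (assumed, not shown here).
    """
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for i in ids:
        groups[find(i)].add(i)
    return list(groups.values())


def source_only_clusters(clusters, source, wanted="GOS", min_size=2):
    """Keep clusters of at least `min_size` whose members all come from `wanted`."""
    return [c for c in clusters
            if len(c) >= min_size and all(source[s] == wanted for s in c)]


# Toy data: "g*" ids stand for GOS sequences, "k*" for previously known ones.
ids = ["g1", "g2", "g3", "k1", "k2", "g4"]
source = {i: ("GOS" if i.startswith("g") else "known") for i in ids}
pairs = [("g1", "g2"), ("g2", "g3"), ("k1", "g4")]

clusters = cluster(ids, pairs)
gos_only = source_only_clusters(clusters, source)
print(gos_only)
```

In this toy input, g1, g2 and g3 form the only GOS-only cluster. g4 groups with a known sequence and k2 remains a singleton, so both are excluded.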
24.  MeRNA: a database of metal ion binding sites in RNA structures 
Nucleic Acids Research  2005;34(Database issue):D131-D134.
Metal ions are essential for the folding of RNA into stable tertiary structures and for the catalytic activity of some RNA enzymes. To aid in the study of the roles of metal ions in RNA structural biology, we have created MeRNA (Metals in RNA), a comprehensive compilation of all metal binding sites identified in RNA 3D structures available from the PDB and Nucleic Acid Database. Currently, our database contains information relating to binding of 9764 metal ions corresponding to 23 distinct elements, in 256 RNA structures. The metal ion locations were confirmed and ligands characterized using original literature references. MeRNA includes eight manually identified metal-ion binding motifs, which are described in the literature. MeRNA is searchable by PDB identifier, metal ion, method of structure determination, resolution and R-values for X-ray structure and distance from metal to any RNA atom or to water. New structures with their respective binding motifs will be added to the database as they become available. The MeRNA database will further our understanding of the roles of metal ions in RNA folding and catalysis and have applications in structural and functional analysis, RNA design and engineering. The MeRNA database is accessible at .
PMCID: PMC1347421  PMID: 16381830
25.  Protein Molecular Function Prediction by Bayesian Phylogenomics 
PLoS Computational Biology  2005;1(5):e45.
We present a statistical graphical model to infer specific molecular function for unannotated protein sequences using homology. Based on phylogenomic principles, SIFTER (Statistical Inference of Function Through Evolutionary Relationships) accurately predicts molecular function for members of a protein family given a reconciled phylogeny and available function annotations, even when the data are sparse or noisy. Our method produced specific and consistent molecular function predictions across 100 Pfam families in comparison to the Gene Ontology annotation database, BLAST, GOtcha, and Orthostrapper. We performed a more detailed exploration of functional predictions on the adenosine-5′-monophosphate/adenosine deaminase family and the lactate/malate dehydrogenase family, in the former case comparing the predictions against a gold standard set of published functional characterizations. Given function annotations for 3% of the proteins in the deaminase family, SIFTER achieves 96% accuracy in predicting molecular function for experimentally characterized proteins as reported in the literature. The accuracy of SIFTER on this dataset is a significant improvement over other currently available methods such as BLAST (75%), GeneQuiz (64%), GOtcha (89%), and Orthostrapper (11%). We also experimentally characterized the adenosine deaminase from Plasmodium falciparum, confirming SIFTER's prediction. The results illustrate the predictive power of exploiting a statistical model of function evolution in phylogenomic problems. A software implementation of SIFTER is available from the authors.
New genome sequences continue to be published at a prodigious rate. However, unannotated sequences are of limited use to biologists. To computationally annotate a hypothetical protein for molecular function, researchers generally attempt to carry out some form of information transfer from evolutionarily related proteins. Such transfer is most successfully achieved within the context of phylogenetic relationships, exploiting the comprehensive knowledge that is available regarding molecular evolution within a given protein family. This general approach to molecular function annotation is known as phylogenomics, and it is the best method currently available for providing high-quality annotations. A drawback of phylogenomics, however, is that it is a time-consuming manual process requiring expert knowledge. In the current paper, the authors have developed a statistical approach—referred to as SIFTER (Statistical Inference of Function Through Evolutionary Relationships)—that allows phylogenomic analyses to be carried out automatically.
The authors present the results of running SIFTER on a collection of 100 protein families. They also validate their method on a specific family for which a gold standard set of experimental annotations is available. They show that SIFTER annotates 96% of the gold standard proteins correctly, outperforming popular annotation methods including BLAST-based annotation (75%), GOtcha (89%), GeneQuiz (64%), and Orthostrapper (11%). The results support the feasibility of carrying out high-quality phylogenomic analyses of entire genomes.
PMCID: PMC1246806  PMID: 16217548
