The aptamer database is designed to contain comprehensive sequence information on aptamers and unnatural ribozymes that have been generated by in vitro selection methods. Such data are not normally collected in ‘natural’ sequence databases, such as GenBank. Besides serving as a storehouse of sequences that may have diagnostic or therapeutic utility, the database serves as a valuable resource for theoretical biologists who describe and explore fitness landscapes. The database is updated monthly and is publicly available at http://aptamer.icmb.utexas.edu/.
We have created an Amino Acid–Nucleotide Interaction Database (AANT; http://aant.icmb.utexas.edu/) that categorizes all amino acid–nucleotide interactions from experimentally determined protein–nucleic acid structures, and provides users with a graphic interface for visualizing these interactions in aggregate. AANT accomplishes this by extracting individual amino acid–nucleotide interactions from structures in the Protein Data Bank, combining and superimposing these interactions into multiple structure files (e.g. 20 amino acids × 5 nucleotides) and grouping structurally similar interactions into more readily identifiable clusters. Using the Chime web browser plug-in, users can view 3D representations of the superimpositions and clusters. The unique collection and representation of data on amino acid–nucleotide interactions facilitates understanding the specificity of protein–nucleic acid interactions at a more fundamental level, and allows comparison of otherwise extremely disparate sets of structures. Moreover, by modularly representing the fundamental interactions that govern binding specificity it may prove possible to better engineer nucleic acid binding proteins.
Motivation: We present the ‘Dynamic Packing Grid’ (DPG), a neighborhood data structure for maintaining and manipulating flexible molecules and assemblies, for efficient computation of binding affinities in drug design or in molecular dynamics calculations.
Results: DPG can efficiently maintain the molecular surface using only linear space and supports quasi-constant time insertion, deletion and movement (i.e. updates) of atoms or groups of atoms. DPG also supports constant time neighborhood queries from arbitrary points. Our results for maintenance of molecular surface and polarization energy computations using DPG exhibit marked improvement in time and space requirements.
Supplementary information: Supplementary data are available at Bioinformatics online.
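The kind of neighborhood structure described above can be illustrated with a plain hashed uniform grid. This is a minimal sketch and not the authors' DPG implementation: the class name `GridIndex`, its methods and the restriction that the query radius not exceed the cell size are all assumptions made for illustration. Space grows with the number of occupied cells, and updates and queries take expected constant time when the number of atoms per cell is bounded.

```python
from collections import defaultdict
from math import dist, floor


class GridIndex:
    """Hypothetical uniform hash grid: occupied cell -> ids of atoms inside it."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(set)   # (i, j, k) cell coordinates -> atom ids
        self.positions = {}             # atom id -> (x, y, z)

    def _cell(self, point):
        return tuple(floor(c / self.cell_size) for c in point)

    def insert(self, atom_id, point):
        self.positions[atom_id] = point
        self.cells[self._cell(point)].add(atom_id)

    def delete(self, atom_id):
        key = self._cell(self.positions.pop(atom_id))
        self.cells[key].discard(atom_id)
        if not self.cells[key]:
            del self.cells[key]         # drop empty cells so space stays linear

    def move(self, atom_id, new_point):
        self.delete(atom_id)
        self.insert(atom_id, new_point)

    def neighbors(self, point, radius):
        """Sorted ids of atoms within `radius` of an arbitrary point."""
        if radius > self.cell_size:
            raise ValueError("this sketch assumes radius <= cell_size")
        cx, cy, cz = self._cell(point)
        found = []
        # Only the 27 cells around the query point can contain a neighbor.
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    for atom_id in self.cells.get((cx + dx, cy + dy, cz + dz), ()):
                        if dist(self.positions[atom_id], point) <= radius:
                            found.append(atom_id)
        return sorted(found)
```

For example, after inserting a few atoms, `grid.neighbors((0.0, 0.0, 0.0), 2.0)` inspects only the cells adjacent to the query point rather than every atom.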
Isotope labeling combined with liquid chromatography–mass spectrometry (LC–MS) provides a robust platform for analyzing differential protein expression in proteomics research. We present a web service, called MaXIC-Q Web (http://ms.iis.sinica.edu.tw/MaXIC-Q_Web/), for quantitation analysis of large-scale datasets generated from proteomics experiments using various stable isotope-labeling techniques, e.g. SILAC, ICAT and user-developed labeling methods. It accepts spectral files in the standard mzXML format and search results from SEQUEST, Mascot and ProteinProphet as input. Furthermore, MaXIC-Q Web uses statistical and computational methods to construct two kinds of elution profiles for each ion from MS data, namely, the PIMS (projected ion mass spectrum) and the XIC (extracted ion chromatogram). To achieve accurate quantitation, a stringent validation procedure is performed on PIMSs to filter out peptide ions that suffer interference from co-eluting peptides or noise. The areas of XICs determine ion abundances, which are used to calculate peptide and protein ratios. Since MaXIC-Q Web adopts stringent validation on spectral data, it achieves high accuracy so that manual validation effort can be substantially reduced. Furthermore, it provides various visualization diagrams and comprehensive quantitation reports so that users can conveniently inspect quantitation results. In summary, MaXIC-Q Web is a user-friendly, interactive, robust, generic web service for quantitation based on ICAT and SILAC labeling techniques.
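The statement that XIC areas determine ion abundances can be illustrated with a small calculation. The following is a minimal sketch, not MaXIC-Q Web code: the function names are hypothetical, and the use of the trapezoidal rule to integrate the chromatogram is an assumption, since the abstract does not say how areas are computed.

```python
def xic_area(times, intensities):
    """Area under an extracted ion chromatogram (trapezoidal rule, assumed)."""
    if len(times) != len(intensities):
        raise ValueError("times and intensities must have the same length")
    area = 0.0
    for i in range(1, len(times)):
        width = times[i] - times[i - 1]
        area += 0.5 * (intensities[i] + intensities[i - 1]) * width
    return area


def peptide_ratio(light_xic, heavy_xic):
    """Light/heavy abundance ratio; each argument is a (times, intensities) pair."""
    light = xic_area(*light_xic)
    heavy = xic_area(*heavy_xic)
    if heavy == 0:
        raise ValueError("heavy-labeled XIC has zero area")
    return light / heavy
```

With a light profile of `[0, 10, 10, 0]` and a heavy profile of `[0, 5, 5, 0]` sampled at the same four time points, the areas are 20 and 10, so the ratio is 2.0.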
Despite the growing volumes of proteomic data, integration of the underlying results remains problematic owing to differences in formats, data captured, protein accessions and services available from the individual repositories. To address this, we present the ISPIDER Central Proteomic Database search (http://www.ispider.manchester.ac.uk/cgi-bin/ProteomicSearch.pl), an integration service offering novel search capabilities over leading, mature, proteomic repositories including PRoteomics IDEntifications database (PRIDE), PepSeeker, PeptideAtlas and the Global Proteome Machine. It enables users to search for proteins and peptides that have been characterised in mass spectrometry-based proteomics experiments from different groups, stored in different databases, and view the collated results with specialist viewers/clients. In order to overcome limitations imposed by the great variability in protein accessions used by individual laboratories, the European Bioinformatics Institute's Protein Identifier Cross-Reference (PICR) service is used to resolve accessions from different sequence repositories. Custom-built clients allow users to view peptide/protein identifications in different contexts from multiple experiments and repositories, as well as integration with the Dasty2 client supporting any annotations available from Distributed Annotation System servers. Further information on the protein hits may also be added via external web services able to take a protein as input. This web server offers the first truly integrated access to proteomics repositories and provides a unique service to biologists interested in mass spectrometry-based proteomics.
Summary: The BioRuby software toolkit contains a comprehensive set of free development tools and libraries for bioinformatics and molecular biology, written in the Ruby programming language. BioRuby has components for sequence analysis, pathway analysis, protein modelling and phylogenetic analysis; it supports many widely used data formats and provides easy access to databases, external programs and public web services, including BLAST, KEGG, GenBank, MEDLINE and GO. BioRuby comes with a tutorial, documentation and an interactive environment, which can be used in the shell, and in the web browser.
Availability: BioRuby is free and open source software, made available under the Ruby license. BioRuby runs on all platforms that support Ruby, including Linux, Mac OS X and Windows. With JRuby, BioRuby also runs on the Java Virtual Machine. The source code is available from http://www.bioruby.org/.
The Plant Proteomics Database (PPDB; http://ppdb.tc.cornell.edu), launched in 2004, provides an integrated resource for experimentally identified proteins in Arabidopsis and maize (Zea mays). Internal BLAST alignments link maize and Arabidopsis information. Experimental identification is based on in-house mass spectrometry (MS) of cell type-specific proteomes (maize), or specific subcellular proteomes (e.g. chloroplasts, thylakoids, nucleoids) and total leaf proteome samples (maize and Arabidopsis). So far more than 5000 accessions both in maize and Arabidopsis have been identified. In addition, more than 80 published Arabidopsis proteome datasets from subcellular compartments or organs are stored in PPDB and linked to each locus. Using MS-derived information and literature, more than 1500 Arabidopsis proteins have a manually assigned subcellular location, with a strong emphasis on plastid proteins. Additional new features of PPDB include searchable posttranslational modifications and searchable experimental proteotypic peptides and spectral count information for each identified accession based on in-house experiments. Various search methods are provided to extract more than 40 data types for each accession and to extract accessions for different functional categories or curated subcellular localizations. Protein report pages for each accession provide comprehensive overviews, including predicted protein properties, with hyperlinks to the most relevant databases.
We describe in this communication a set of functional perl script utilities for use in peptide mass spectral database searching and proteomics experiments, known as the Wildcat Toolbox. These are all freely available for download from our laboratory Web site (http://proteomics.arizona.edu/toolbox.html) as a combined zip file, and can also be accessed via the Proteome Commons Web site (www.proteomecommons.org) in the tools section. We make them available to other potential users in the spirit of open source software development; we do not have the resources to provide any significant technical support for them, but we hope users will share both bugs and improvements with the community at large.
Tandem mass spectrometry; protein identification; proteomics; perl; software development; fasta
Serial section electron microscopy (ssEM) is rapidly expanding as a primary tool to investigate synaptic circuitry and plasticity. The ultrastructural images collected through ssEM are content rich and their comprehensive analysis is beyond the capacity of an individual laboratory. Hence, sharing ultrastructural data is becoming crucial to visualize, analyze, and discover the structural basis of synaptic circuitry and function in the brain. We devised a web-based management system called SynapticDB (http://synapses.clm.utexas.edu/synapticdb/) that catalogues, extracts, analyzes, and shares experimental data from ssEM. The management strategy involves a library with check-in, checkout and experimental tracking mechanisms. We developed a series of spreadsheet templates (MS Excel, Open Office spreadsheet, etc) that guide users in methods of data collection, structural identification, and quantitative analysis through ssEM. SynapticDB provides flexible access to complete templates, or to individual columns with instructional headers that can be selected to create user-defined templates. New templates can also be generated and uploaded. Research progress is tracked via experimental note management and dynamic PDF forms that allow new investigators to follow standard protocols and experienced researchers to expand the range of data collected and shared. The combined use of templates and tracking notes ensures that the supporting experimental information is populated into the database and associated with the appropriate ssEM images and analyses. We anticipate that SynapticDB will serve future meta-analyses towards new discoveries about the composition and circuitry of neurons and glia, and new understanding about structural plasticity during development, behavior, learning, memory, and neuropathology.
Online database systems; Online data file checkout and check-in; Data and image sharing; Synapse structure and function; Data management; Dynamic PDF forms
FUGOID is a web-based, taxonomically broad organelle intron database that collects and integrates various functional and structural data on organellar (mitochondrial and chloroplast) introns. The main information provided by FUGOID includes intron sequence, subclass, resident ORF, self-splicing capability, host gene, protein factor(s) involved in splicing, mobility, insertion site, twintron, seminal references and taxonomic position of host organism. It is implemented in a relational database management system, allowing sophisticated, user-friendly searching, data entry and revision. Users can access the database by any common web browser using a variety of operating systems. The main page of the database is available at http://wnt.cc.utexas.edu/~ifmr530/introndata/main.htm.
Proteogenomic approaches have gained increasing popularity; however, it is still difficult to integrate mass spectrometry identifications with genomic data because of differing data formats. To address this difficulty, we introduce iPiG, a tool for integrating peptide identifications from mass spectrometry experiments into existing genome browser visualizations. This simplifies the concurrent analysis of proteomic and genomic data and allows proteomic results to be compared directly with genomic data. iPiG is freely available from https://sourceforge.net/projects/ipig/. It is implemented in Java and can be run as a stand-alone tool with a graphical user interface or integrated into existing workflows. Supplementary data are available at PLOS ONE online.
Many proteomics initiatives require seamless bioinformatics integration of a range of analytical steps between sample collection and systems modeling, immediately accessible to the participants involved in the process. The workflow, from proteomics profiling by 2D gel electrophoresis to the putative identification of differentially expressed proteins by comparison of mass spectrometry results with reference databases, includes many components of sample processing, not just analysis and interpretation, that are regularly revisited and updated. To support such updates and the dissemination of data, a suitable data structure is needed. However, no data structure is currently available for storing the data for multiple gels generated in a single proteomic experiment in a single XML file. This paper proposes a data structure based on XML standards to fill the gap between the data generated by proteomics experiments and their storage.
In order to address the resulting procedural fluidity we have adopted and implemented a data model centered on the concept of annotated gel (AG) as the format for delivery and management of 2D Gel electrophoresis results. An eXtensible Markup Language (XML) schema is proposed to manage, analyze and disseminate annotated 2D Gel electrophoresis results. The structure of AG objects is formally represented using XML, resulting in the definition of the AGML syntax presented here.
The proposed schema accommodates data on the electrophoresis results as well as the mass-spectrometry analysis of selected gel spots. A web-based software library is being developed to handle data storage, analysis and graphic representation. Computational tools described will be made available at . Our development of AGML provides a simple data structure for storing 2D gel electrophoresis data.
Summary: METAL provides a computationally efficient tool for meta-analysis of genome-wide association scans, which is a commonly used approach for improving power in complex trait gene mapping studies. METAL provides a rich scripting interface and implements efficient memory management to allow analyses of very large data sets and to support a variety of input file formats.
Availability and implementation: METAL, including source code, documentation, examples, and executables, is available at http://www.sph.umich.edu/csg/abecasis/metal/
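To make the idea of meta-analysis of association scans concrete, the sketch below combines per-study effect estimates for a single marker using inverse-variance weighting, a standard fixed-effects scheme. The abstract does not state which weighting METAL applies, so the scheme, the function name and the example numbers are assumptions for illustration only; this is not METAL code or its scripting syntax.

```python
from math import erfc, sqrt


def inverse_variance_meta(effects, std_errors):
    """Fixed-effects meta-analysis of one marker across several studies.

    effects    -- per-study effect size estimates (same allele, same scale)
    std_errors -- per-study standard errors of those estimates
    Returns (pooled effect, pooled standard error, z-score, two-sided p-value).
    """
    if len(effects) != len(std_errors) or not effects:
        raise ValueError("need one standard error per effect estimate")
    weights = [1.0 / se ** 2 for se in std_errors]   # precise studies weigh more
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, effects)) / total
    se = sqrt(1.0 / total)
    z = beta / se
    p = erfc(abs(z) / sqrt(2.0))                     # two-sided normal p-value
    return beta, se, z, p
```

Two studies with effects 0.2 and 0.4 and equal standard errors of 0.1 pool to an effect of 0.3 with a standard error of about 0.071, which shows how combining scans narrows the uncertainty and so improves power.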
The Protein and Metabolite Analysis Facility at the University of Texas at Austin is a joint effort of the College of Pharmacy, Center for Research on Environmental Disease (CRED), and the Institute for Cellular and Molecular Biology (ICMB). Services and collaborative research are offered for the detection, characterization and quantification of biomolecules. The Facility's goals are to provide sensitive protein identification and modification analyses, to provide custom peptide syntheses, to offer services for the identification and quantification of metabolites, nutrients and xenobiotics, to implement novel analytical methods, to improve the sensitivity of existing analyses, to provide consultation on the selection and implementation of analytical methods, to offer training in the usage and applications of the instrumentation, and to provide technical expertise in support of individual research goals. The ICMB portion of the Core contains an ABI Procise 492 cLC protein sequencer, a Protein Technologies Inc. Symphony peptide synthesizer, two Bio-rad Duoflows and a GE Healthcare AKTA protein purification system, two Beckman System Gold HPLC systems, a Berthold Technologies Mithras luminescence and fluorescence detector, an Invitrogen gel electrophoresis set-up, an Art Robbins Instruments Phoenix crystallography robot and an LC-MALDI-TOF/TOF (an ABI 4700 with an LC Packings Ultimate Nano-LC system with a Probot spotting robot). In the College of Pharmacy, the Core has an Applied Biosystems 4000 Q-trap LC MS/MS system with ESI, APCI and nanospray sources coupled with a Shimadzu LC-20AD HPLC system, a ThermoFinnigan LCQ ion trap mass spectrometer with ESI, APCI and microspray interfaces combined with a Michrom Magic 2002 HPLC system, a ThermoFinnigan Trace MS GC-quadrupole with EI positive, negative CI and selected ion monitoring (SIM), an ABI Voyager-DE Pro MALDI-TOF and a Bio-rad Bioplex 200 fluorescent microbead array system.
The amount of information stemming from proteomics experiments involving (multidimensional) separation techniques, mass spectrometric analysis, and computational analysis is ever-increasing. Data from such an experimental workflow need to be captured, related and analyzed. Biological experiments within this scope produce heterogeneous data ranging from pictures of one- or two-dimensional protein maps and spectra recorded by tandem mass spectrometry to text-based identifications made by algorithms which analyze these spectra. Additionally, peptide and corresponding protein information needs to be displayed.
In order to handle the large amount of data from computational processing of mass spectrometric experiments, automatic import scripts are available and the necessity for manual input to the database has been minimized. Information is in a generic format which abstracts from specific software tools typically used in such an experimental workflow. The software is therefore capable of storing and cross analysing results from many algorithms. A novel feature and a focus of this database is to facilitate protein identification by using peptides identified from mass spectrometry and link this information directly to respective protein maps. Additionally, our application employs spectral counting for quantitative presentation of the data. All information can be linked to hot spots on images to place the results into an experimental context. A summary of identified proteins, containing all relevant information per hot spot, is automatically generated, usually upon either a change in the underlying protein models or due to newly imported identifications. The supporting information for this report can be accessed in multiple ways using the user interface provided by the application.
We present a proteomics database which aims to greatly reduce evaluation time of results from mass spectrometric experiments and enhance result quality by allowing consistent data handling. Import functionality, automatic protein detection, and summary creation act together to facilitate data analysis. In addition, supporting information for these findings is readily accessible via the graphical user interface provided. The database schema and the implementation, which can easily be installed on virtually any server, can be downloaded in the form of a compressed file from our project webpage.
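Spectral counting, which the application uses for quantitative presentation, can be sketched in a few lines. This is an illustrative example and not the database's own code: the function name, the tuple layout of the input and the rule of counting each spectrum at most once per protein are assumptions.

```python
from collections import Counter


def spectral_counts(identifications):
    """Count identified spectra per protein.

    identifications -- iterable of (spectrum_id, peptide, protein) tuples,
    e.g. as imported from one or more search algorithms.
    """
    seen = set()
    counts = Counter()
    for spectrum_id, _peptide, protein in identifications:
        # A spectrum reported by several algorithms is counted once per protein.
        if (spectrum_id, protein) in seen:
            continue
        seen.add((spectrum_id, protein))
        counts[protein] += 1
    return counts
```

A protein supported by three distinct spectra therefore receives a count of 3 even if two search algorithms reported the same spectrum.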
Although data deposition is not yet common practice in the field of proteomics, several mass spectrometry (MS) based proteomics repositories are publicly available to the scientific community. The main existing resources are: the Global Proteome Machine Database (GPMDB), PeptideAtlas, the PRoteomics IDEntifications database (PRIDE), Tranche, and NCBI Peptidome. In this review the capabilities of each of these will be described, paying special attention to four key properties: data types stored, applicable data submission strategies, supported formats, and available data mining and visualization tools. Additionally, the data contents from model organisms will be enumerated for each resource. There are other valuable smaller and/or more specialized repositories but they will not be covered in this review. Finally, the concept behind the ProteomeXchange consortium, a collaborative effort among the main resources in the field, will be introduced.
CV, Controlled Vocabulary; HGNC, HUGO Gene Nomenclature Committee; MCP, Molecular and Cellular Proteomics; MRM, Multiple Reaction Monitoring; NIH, National Institutes of Health; OLS, Ontology Lookup Service; PICR, Protein Identifier Cross-Referencing; PSI, Proteomics Standards Initiative; QC, Quality Control; SRM, Selected Reaction Monitoring; SBEAMS, Systems Biology Experiment Analysis Management System; TPP, Trans Proteomics Pipeline.; Proteomics; Databases; Bioinformatics; Data standards; Repositories
PRIDE, the ‘PRoteomics IDEntifications database’, is a database of protein and peptide identifications that have been described in the scientific literature. These identifications will typically be from specific species, tissues and sub-cellular locations, perhaps under specific disease conditions. Any post-translational modifications that have been identified on individual peptides can be described. These identifications may be annotated with supporting mass spectra. At the time of writing, PRIDE includes the full set of identifications as submitted by individual laboratories participating in the HUPO Plasma Proteome Project and a profile of the human platelet proteome submitted by the University of Ghent in Belgium. By late 2005 PRIDE is expected to contain the identifications and spectra generated by the HUPO Brain Proteome Project. Proteomics laboratories are encouraged to submit their identifications and spectra to PRIDE to support their manuscript submissions to proteomics journals. Data can be submitted in PRIDE XML format if identifications are included or mzData format if the submitter is depositing mass spectra without identifications. PRIDE is a web application, so submission, searching and data retrieval can all be performed using an internet browser. PRIDE can be searched by experiment accession number, protein accession number, literature reference and sample parameters including species, tissue, sub-cellular location and disease state. Data can be retrieved as machine-readable PRIDE or mzData XML (the latter for mass spectra without identifications), or as human-readable HTML.
The University of California, Santa Cruz (UCSC) genome database is among the most used sources of genomic annotation in human and other organisms. The database offers an excellent web-based graphical user interface (the UCSC genome browser) and several means for programmatic queries. A simple application programming interface (API) in a scripting language aimed at the biologist was however not yet available. Here, we present the Ruby UCSC API, a library to access the UCSC genome database using Ruby.
The API is designed as a BioRuby plug-in and built on the ActiveRecord 3 framework for the object-relational mapping, making writing SQL statements unnecessary. The current version of the API supports databases of all organisms in the UCSC genome database including human, mammals, vertebrates, deuterostomes, insects, nematodes, and yeast.
The API uses the bin index—if available—when querying for genomic intervals. The API also supports genomic sequence queries using locally downloaded *.2bit files that are not stored in the official MySQL database. The API is implemented in pure Ruby and is therefore available in different environments and with different Ruby interpreters (including JRuby).
Assisted by the straightforward object-oriented design of Ruby and ActiveRecord, the Ruby UCSC API will help biologists query the UCSC genome database programmatically. The API is available through the RubyGem system. Source code and documentation are available at https://github.com/misshie/bioruby-ucsc-api/ under the Ruby license. Feedback and help are provided via the website at http://rubyucscapi.userecho.com/.
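The bin index used for genomic interval queries can be illustrated with the standard UCSC binning scheme, in which the smallest bins span 128 kb and each higher level is eight times larger. The sketch below is written in Python for illustration and is not taken from the Ruby UCSC API; the function names are hypothetical, and only the standard scheme (coordinates up to 512 Mb) is covered.

```python
# Offsets of the first bin at each level, from the smallest (128 kb) bins
# up to the single bin that covers the whole 512 Mb range.
BIN_OFFSETS = [512 + 64 + 8 + 1, 64 + 8 + 1, 8 + 1, 1, 0]
BIN_FIRST_SHIFT = 17   # 2**17 = 128 kb: size of the smallest bins
BIN_NEXT_SHIFT = 3     # each level up is 2**3 = 8 times larger


def bin_from_range(start, end):
    """Smallest bin that fully contains the 0-based half-open interval [start, end)."""
    start_bin = start >> BIN_FIRST_SHIFT
    end_bin = (end - 1) >> BIN_FIRST_SHIFT
    for offset in BIN_OFFSETS:
        if start_bin == end_bin:
            return offset + start_bin
        start_bin >>= BIN_NEXT_SHIFT
        end_bin >>= BIN_NEXT_SHIFT
    raise ValueError("interval is outside the range of the standard scheme")


def overlapping_bins(start, end):
    """All bins that may hold features overlapping [start, end)."""
    bins = []
    start_bin = start >> BIN_FIRST_SHIFT
    end_bin = (end - 1) >> BIN_FIRST_SHIFT
    for offset in BIN_OFFSETS:
        bins.extend(range(offset + start_bin, offset + end_bin + 1))
        start_bin >>= BIN_NEXT_SHIFT
        end_bin >>= BIN_NEXT_SHIFT
    return bins
```

An interval query can then restrict the search to rows whose bin value is in `overlapping_bins(start, end)` before comparing exact coordinates, which avoids scanning every feature on the chromosome.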
Summary: PIQA is a quality analysis pipeline designed to examine genomic reads produced by Next Generation Sequencing technology (Illumina G1 Genome Analyzer). A short statistical summary, as well as tile-by-tile and cycle-by-cycle graphical representations of cluster density, quality scores and nucleotide frequencies, allows easy identification of various technical problems including defective tiles, mistakes in sample/library preparations and abnormalities in the frequencies of appearance of sequenced genomic reads. PIQA is written in the R statistical programming language and is compatible with bustard, fastq and scarf Illumina G1 Genome Analyzer data formats.
Availability: The PIQA pipeline, installation instructions and examples are available at the supplementary web site (http://bioinfo.uh.edu/PIQA).
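One of the cycle-by-cycle summaries mentioned above, nucleotide frequencies, can be sketched as follows. PIQA itself is written in R; this Python fragment is only an illustration of the idea, and the function name and the choice to report the bases A, C, G, T and N are assumptions.

```python
from collections import Counter


def per_cycle_base_frequencies(reads):
    """Fraction of each base observed at every sequencing cycle.

    reads -- list of read sequences (strings); cycle i is position i in a read.
    Returns one dict per cycle mapping 'A', 'C', 'G', 'T', 'N' to a fraction.
    """
    if not reads:
        return []
    n_cycles = max(len(read) for read in reads)
    counts = [Counter() for _ in range(n_cycles)]
    for read in reads:
        for cycle, base in enumerate(read.upper()):
            counts[cycle][base] += 1
    frequencies = []
    for cycle_counts in counts:
        total = sum(cycle_counts.values())
        frequencies.append({base: cycle_counts[base] / total for base in "ACGTN"})
    return frequencies
```

A cycle at which the fraction of `N` calls rises sharply, or at which the base composition departs from that of neighboring cycles, is the kind of abnormality such a summary is meant to expose.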
Bioinformatics is used at three different steps of proteomic studies of sub-cellular compartments. The first is protein identification from mass spectrometry data, the second is prediction of sub-cellular localization, and the third is the search for functional domains to predict the function of identified proteins in order to answer biological questions. The aim of this work was to develop a new tool for improving the quality of proteomics of sub-cellular compartments. Starting from the analysis of problems found in databases, we designed a new Arabidopsis database named ProtAnnDB (http://www.polebio.scsv.ups-tlse.fr/ProtAnnDB/). It collects on one page the predictions of sub-cellular localization and of functional domains made by available software. Using this database improves not only the interpretation of proteomic data (top-down analysis), but also the procedures used to isolate sub-cellular compartments (bottom-up quality control).
bioinformatics; cell wall; plant; proteomics
Proteomics continues to play a critical role in post-genomic science as continued advances in mass spectrometry and analytical chemistry support the separation and identification of increasing numbers of peptides and proteins from their characteristic mass spectra. In order to facilitate the sharing of this data, various standard formats have been, and continue to be, developed. Still not fully mature however, these are not yet able to cope with the increasing number of quantitative proteomic technologies that are being developed.
We propose an extension to the PRIDE and mzData XML schema to accommodate the concept of multiple samples per experiment and, in addition, to capture the intensities of the iTRAQ™ reporter ions in the entry. A simple Java client has been developed to capture and convert the raw data from common spectral file formats into a valid PRIDE XML entry; it also uses a third-party open source tool to generate iTRAQ™ reporter intensities from Mascot output.
We describe an extension to the PRIDE and mzData schemas to enable the capture of quantitative data. Currently this is limited to iTRAQ™ data but is readily extensible for other quantitative proteomic technologies. Furthermore, a software tool has been developed which enables conversion from various mass spectrum file formats and corresponding Mascot peptide identifications to PRIDE formatted XML. The tool represents a simple approach to preparing quantitative and qualitative data for submission to repositories such as PRIDE, which is necessary to facilitate data deposition and sharing in public domain databases. The software is freely available from .
Summary: Roundup is an online database of gene orthologs for over 1800 genomes, including 226 Eukaryota, 1447 Bacteria, 113 Archaea and 21 Viruses. Orthologs are inferred using the Reciprocal Smallest Distance algorithm. Users may query Roundup for single-linkage clusters of orthologous genes based on any group of genomes. Annotated query results may be viewed in a variety of ways including as clusters of orthologs and as phylogenetic profiles. Genomic results may be downloaded in formats suitable for functional as well as phylogenetic analysis, including the recent OrthoXML standard. In addition, gene IDs can be retrieved using FASTA sequence search. All source code and orthologs are freely available.
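Single-linkage clustering of orthologs means that two genes fall into the same cluster whenever a chain of pairwise ortholog relationships connects them. The sketch below shows this with a union-find structure; it is not Roundup's source code, and the function name and the gene identifiers in the example are hypothetical. The ortholog pairs themselves would come from an inference step such as the Reciprocal Smallest Distance algorithm, which is not reproduced here.

```python
def single_linkage_clusters(ortholog_pairs):
    """Group genes into clusters connected by chains of ortholog pairs."""
    parent = {}

    def find(gene):
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]   # path halving
            gene = parent[gene]
        return gene

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for gene_a, gene_b in ortholog_pairs:
        union(gene_a, gene_b)

    clusters = {}
    for gene in parent:
        clusters.setdefault(find(gene), set()).add(gene)
    # Sort members and clusters so the result is deterministic.
    return sorted((sorted(members) for members in clusters.values()),
                  key=lambda members: members[0])
```

Given the pairs `("hs:A", "mm:A")`, `("mm:A", "dm:A")` and `("hs:B", "mm:B")`, the first two pairs merge into one three-gene cluster even though `hs:A` and `dm:A` were never paired directly.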
We here present the jmzReader library: a collection of Java application programming interfaces (APIs) to parse the most commonly used peak list and XML-based mass spectrometry (MS) data formats: DTA, MS2, MGF, PKL, mzXML, mzData, and mzML (based on the already existing API jmzML). The library is optimized to be used in conjunction with mzIdentML, the recently released standard data format for reporting protein and peptide identifications, developed by the HUPO proteomics standards initiative (PSI). mzIdentML files do not contain spectra data but contain references to different kinds of external MS data files. As a key functionality, all parsers implement a common interface that supports the various methods used by mzIdentML to reference external spectra. Thus, when developing software for mzIdentML, programmers no longer have to support multiple MS data file formats but only this one interface. The library (which includes a viewer) is open source and, together with detailed documentation, can be downloaded from http://code.google.com/p/jmzreader/.
Bioinformatics; Data standard; Java; MS data processing; Proteomics standards initiative
Summary: Biogem provides a software development environment for the Ruby programming language, which encourages community-based software development for bioinformatics while lowering the barrier to entry and encouraging best practices.
Biogem, with its targeted modular and decentralized approach, software generator, tools and tight web integration, is an improved general model for scaling up collaborative open source software development in bioinformatics.
Availability: Biogem and its modules are free and open source software (OSS). Biogem runs on all systems that support recent versions of Ruby, including Linux, Mac OS X and Windows. Further information is available at http://www.biogems.info. A tutorial is available at http://www.biogems.info/howto.html
Soybean continues to serve as a rich and inexpensive source of protein for humans and animals. A substantial amount of information has been reported on the genotypic variation and beneficial genetic manipulation of soybeans. For better understanding of the consequences of genetic manipulation, elucidation of soybean protein composition is necessary, because of its direct relationship to phenotype. We have conducted studies to determine the composition of storage, allergen and anti-nutritional proteins in cultivated soybean using a combined proteomics approach. Two-dimensional polyacrylamide gel electrophoresis (2D-PAGE) was implemented for the separation of proteins along with matrix-assisted laser desorption/ionization time of flight mass spectrometry (MALDI-TOF-MS) and liquid chromatography mass spectrometry (LC-MS/MS) for the identification of proteins. Our analysis resulted in the identification of several proteins, and a web based database named soybean protein database (SoyProDB) was subsequently built to house and allow scientists to search the data. This database will be useful to scientists who wish to genetically alter soybean with higher quality storage proteins, and also helpful for consumers to get a greater understanding about proteins that compose soy products available in the market. The database is freely accessible.
Soybean; 2D-PAGE; MALDI-TOF-MS; LC-MS/MS; Conglycinin; Glycinin; Allergen proteins