The aptamer database is designed to contain comprehensive sequence information on aptamers and unnatural ribozymes that have been generated by in vitro selection methods. Such data are not normally collected in ‘natural’ sequence databases, such as GenBank. Besides serving as a storehouse of sequences that may have diagnostic or therapeutic utility, the database serves as a valuable resource for theoretical biologists who describe and explore fitness landscapes. The database is updated monthly and is publicly available at http://aptamer.icmb.utexas.edu/.
We have created an Amino Acid–Nucleotide Interaction Database (AANT; http://aant.icmb.utexas.edu/) that categorizes all amino acid–nucleotide interactions from experimentally determined protein–nucleic acid structures, and provides users with a graphic interface for visualizing these interactions in aggregate. AANT accomplishes this by extracting individual amino acid–nucleotide interactions from structures in the Protein Data Bank, combining and superimposing these interactions into multiple structure files (e.g. 20 amino acids × 5 nucleotides) and grouping structurally similar interactions into more readily identifiable clusters. Using the Chime web browser plug-in, users can view 3D representations of the superimpositions and clusters. The unique collection and representation of data on amino acid–nucleotide interactions facilitates understanding the specificity of protein–nucleic acid interactions at a more fundamental level, and allows comparison of otherwise extremely disparate sets of structures. Moreover, by modularly representing the fundamental interactions that govern binding specificity it may prove possible to better engineer nucleic acid binding proteins.
Motivation: We present the ‘Dynamic Packing Grid’ (DPG), a neighborhood data structure for maintaining and manipulating flexible molecules and assemblies, for efficient computation of binding affinities in drug design or in molecular dynamics calculations.
Results: DPG can efficiently maintain the molecular surface using only linear space and supports quasi-constant time insertion, deletion and movement (i.e. updates) of atoms or groups of atoms. DPG also supports constant time neighborhood queries from arbitrary points. Our results for maintenance of molecular surface and polarization energy computations using DPG exhibit marked improvement in time and space requirements.
Supplementary information: Supplementary data are available at Bioinformatics online.
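The kind of neighborhood structure described for DPG can be illustrated with a plain uniform-grid spatial hash. This is a minimal sketch, not the DPG itself: the class name `GridNeighborhood`, its methods, and the restriction that the query radius not exceed the cell size are all assumptions made for illustration. It shows only why insertion, deletion, movement and fixed-radius queries can each touch a bounded number of cells.

```python
from collections import defaultdict
from math import floor


class GridNeighborhood:
    """Uniform-grid spatial hash (hypothetical, simplified stand-in for a packing grid)."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(set)  # cell index -> ids of atoms in that cell
        self.positions = {}            # atom id -> (x, y, z)

    def _cell(self, point):
        # Map a 3D point to the integer index of the cell that contains it.
        return tuple(floor(c / self.cell_size) for c in point)

    def insert(self, atom_id, point):
        self.positions[atom_id] = point
        self.cells[self._cell(point)].add(atom_id)

    def delete(self, atom_id):
        point = self.positions.pop(atom_id)
        key = self._cell(point)
        self.cells[key].discard(atom_id)
        if not self.cells[key]:
            del self.cells[key]  # drop empty cells so space stays linear in atoms

    def move(self, atom_id, new_point):
        self.delete(atom_id)
        self.insert(atom_id, new_point)

    def neighbors(self, point, radius):
        """Return sorted ids of atoms within `radius` of `point`."""
        if radius > self.cell_size:
            raise ValueError("radius must not exceed cell_size in this sketch")
        cx, cy, cz = self._cell(point)
        found = []
        # With radius <= cell_size, only the 27 surrounding cells can hold hits.
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    key = (cx + dx, cy + dy, cz + dz)
                    for atom_id in self.cells.get(key, ()):
                        p = self.positions[atom_id]
                        if sum((a - b) ** 2 for a, b in zip(p, point)) <= radius ** 2:
                            found.append(atom_id)
        return sorted(found)
```

When atom density per cell is bounded, as is typical for molecules, each operation inspects a constant number of atoms, which is the intuition behind the quasi-constant update and query times.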
We describe in this communication a set of functional Perl script utilities, known as the Wildcat Toolbox, for use in peptide mass spectral database searching and proteomics experiments. These are all freely available for download from our laboratory Web site (http://proteomics.arizona.edu/toolbox.html) as a combined zip file, and can also be accessed via the Proteome Commons Web site (www.proteomecommons.org) in the tools section. We make them available to other potential users in the spirit of open source software development; we do not have the resources to provide any significant technical support for them, but we hope users will share both bugs and improvements with the community at large.
Tandem mass spectrometry; protein identification; proteomics; perl; software development; fasta
Isotope labeling combined with liquid chromatography–mass spectrometry (LC–MS) provides a robust platform for analyzing differential protein expression in proteomics research. We present a web service, called MaXIC-Q Web (http://ms.iis.sinica.edu.tw/MaXIC-Q_Web/), for quantitation analysis of large-scale datasets generated from proteomics experiments using various stable isotope-labeling techniques, e.g. SILAC, ICAT and user-developed labeling methods. It accepts spectral files in the standard mzXML format and search results from SEQUEST, Mascot and ProteinProphet as input. Furthermore, MaXIC-Q Web uses statistical and computational methods to construct two kinds of elution profiles for each ion from MS data, namely PIMS (projected ion mass spectrum) and XIC (extracted ion chromatogram). To achieve accurate quantitation, a stringent validation procedure is performed on PIMSs to filter out peptide ions that suffer interference from co-eluting peptides or noise. The areas of XICs determine ion abundances, which are used to calculate peptide and protein ratios. Because MaXIC-Q Web applies stringent validation to spectral data, it achieves high accuracy, so manual validation effort can be substantially reduced. Furthermore, it provides various visualization diagrams and comprehensive quantitation reports so that users can conveniently inspect quantitation results. In summary, MaXIC-Q Web is a user-friendly, interactive, robust, generic web service for quantitation based on ICAT and SILAC labeling techniques.
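The step from XIC areas to peptide and protein ratios can be sketched as follows. This is an illustrative example only: the function names, the trapezoidal integration, the heavy/light orientation of the ratio, and the use of the median to summarize peptide ratios into a protein ratio are assumptions, not a description of how MaXIC-Q Web computes them.

```python
import statistics


def xic_area(times, intensities):
    """Trapezoidal area under an extracted ion chromatogram (XIC)."""
    if len(times) != len(intensities):
        raise ValueError("times and intensities must have the same length")
    area = 0.0
    for i in range(1, len(times)):
        area += 0.5 * (intensities[i] + intensities[i - 1]) * (times[i] - times[i - 1])
    return area


def peptide_ratio(light_xic, heavy_xic):
    """Heavy/light abundance ratio from two (times, intensities) pairs."""
    light = xic_area(*light_xic)
    heavy = xic_area(*heavy_xic)
    if light == 0:
        raise ValueError("light XIC has zero area")
    return heavy / light


def protein_ratio(peptide_ratios):
    """Summarize peptide ratios by their median (an assumed, robust choice)."""
    return statistics.median(peptide_ratios)
```

A median is used here because it is less sensitive to a single badly quantified peptide than a mean; other summaries are equally plausible.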
Despite the growing volumes of proteomic data, integration of the underlying results remains problematic owing to differences in formats, data captured, protein accessions and services available from the individual repositories. To address this, we present the ISPIDER Central Proteomic Database search (http://www.ispider.manchester.ac.uk/cgi-bin/ProteomicSearch.pl), an integration service offering novel search capabilities over leading, mature, proteomic repositories including PRoteomics IDEntifications database (PRIDE), PepSeeker, PeptideAtlas and the Global Proteome Machine. It enables users to search for proteins and peptides that have been characterised in mass spectrometry-based proteomics experiments from different groups, stored in different databases, and view the collated results with specialist viewers/clients. In order to overcome limitations imposed by the great variability in protein accessions used by individual laboratories, the European Bioinformatics Institute's Protein Identifier Cross-Reference (PICR) service is used to resolve accessions from different sequence repositories. Custom-built clients allow users to view peptide/protein identifications in different contexts from multiple experiments and repositories, as well as integration with the Dasty2 client supporting any annotations available from Distributed Annotation System servers. Further information on the protein hits may also be added via external web services able to take a protein as input. This web server offers the first truly integrated access to proteomics repositories and provides a unique service to biologists interested in mass spectrometry-based proteomics.
Summary: The BioRuby software toolkit contains a comprehensive set of free development tools and libraries for bioinformatics and molecular biology, written in the Ruby programming language. BioRuby has components for sequence analysis, pathway analysis, protein modelling and phylogenetic analysis; it supports many widely used data formats and provides easy access to databases, external programs and public web services, including BLAST, KEGG, GenBank, MEDLINE and GO. BioRuby comes with a tutorial, documentation and an interactive environment, which can be used in the shell, and in the web browser.
Availability: BioRuby is free and open source software, made available under the Ruby license. BioRuby runs on all platforms that support Ruby, including Linux, Mac OS X and Windows. With JRuby, BioRuby also runs on the Java Virtual Machine. The source code is available from http://www.bioruby.org/.
The Plant Proteomics Database (PPDB; http://ppdb.tc.cornell.edu), launched in 2004, provides an integrated resource for experimentally identified proteins in Arabidopsis and maize (Zea mays). Internal BLAST alignments link maize and Arabidopsis information. Experimental identification is based on in-house mass spectrometry (MS) of cell type-specific proteomes (maize), or specific subcellular proteomes (e.g. chloroplasts, thylakoids, nucleoids) and total leaf proteome samples (maize and Arabidopsis). So far, more than 5000 accessions have been identified in both maize and Arabidopsis. In addition, more than 80 published Arabidopsis proteome datasets from subcellular compartments or organs are stored in PPDB and linked to each locus. Using MS-derived information and literature, more than 1500 Arabidopsis proteins have a manually assigned subcellular location, with a strong emphasis on plastid proteins. Additional new features of PPDB include searchable posttranslational modifications and searchable experimental proteotypic peptides and spectral count information for each identified accession based on in-house experiments. Various search methods are provided to extract more than 40 data types for each accession and to extract accessions for different functional categories or curated subcellular localizations. Protein report pages for each accession provide comprehensive overviews, including predicted protein properties, with hyperlinks to the most relevant databases.
Serial section electron microscopy (ssEM) is rapidly expanding as a primary tool to investigate synaptic circuitry and plasticity. The ultrastructural images collected through ssEM are content rich and their comprehensive analysis is beyond the capacity of an individual laboratory. Hence, sharing ultrastructural data is becoming crucial to visualize, analyze, and discover the structural basis of synaptic circuitry and function in the brain. We devised a web-based management system called SynapticDB (http://synapses.clm.utexas.edu/synapticdb/) that catalogues, extracts, analyzes, and shares experimental data from ssEM. The management strategy involves a library with check-in, checkout and experimental tracking mechanisms. We developed a series of spreadsheet templates (MS Excel, Open Office spreadsheet, etc) that guide users in methods of data collection, structural identification, and quantitative analysis through ssEM. SynapticDB provides flexible access to complete templates, or to individual columns with instructional headers that can be selected to create user-defined templates. New templates can also be generated and uploaded. Research progress is tracked via experimental note management and dynamic PDF forms that allow new investigators to follow standard protocols and experienced researchers to expand the range of data collected and shared. The combined use of templates and tracking notes ensures that the supporting experimental information is populated into the database and associated with the appropriate ssEM images and analyses. We anticipate that SynapticDB will serve future meta-analyses towards new discoveries about the composition and circuitry of neurons and glia, and new understanding about structural plasticity during development, behavior, learning, memory, and neuropathology.
Online database systems; Online data file checkout and check-in; Data and image sharing; Synapse structure and function; Data management; Dynamic PDF forms
FUGOID is a web-based, taxonomically broad organelle intron database that collects and integrates various functional and structural data on organellar (mitochondrial and chloroplast) introns. The main information provided by FUGOID includes intron sequence, subclass, resident ORF, self-splicing capability, host gene, protein factor(s) involved in splicing, mobility, insertion site, twintron, seminal references and taxonomic position of host organism. It is implemented in a relational database management system, allowing sophisticated, user-friendly searching, data entry and revision. Users can access the database by any common web browser using a variety of operating systems. The main page of the database is available at http://wnt.cc.utexas.edu/~ifmr530/introndata/main.htm.
Proteogenomic approaches have gained increasing popularity; however, it is still difficult to integrate mass spectrometry identifications with genomic data because of differing data formats. To address this difficulty, we introduce iPiG, a tool for integrating peptide identifications from mass spectrometry experiments into existing genome browser visualizations. This simplifies the concurrent analysis of proteomic and genomic data and allows proteomic results to be compared directly with genomic data. iPiG is freely available from https://sourceforge.net/projects/ipig/. It is implemented in Java and can be run as a stand-alone tool with a graphical user interface or integrated into existing workflows. Supplementary data are available at PLOS ONE online.
Summary: Biogem provides a software development environment for the Ruby programming language, which encourages community-based software development for bioinformatics while lowering the barrier to entry and encouraging best practices.
Biogem, with its targeted modular and decentralized approach, software generator, tools and tight web integration, is an improved general model for scaling up collaborative open source software development in bioinformatics.
Availability: Biogem and its modules are free and open source software (OSS). Biogem runs on all systems that support recent versions of Ruby, including Linux, Mac OS X and Windows. Further information is available at http://www.biogems.info. A tutorial is available at http://www.biogems.info/howto.html.
Many proteomics initiatives require seamless bioinformatics integration of a range of analytical steps between sample collection and systems modeling, immediately accessible to the participants involved in the process. The workflow from proteomic profiling by 2D gel electrophoresis to the putative identification of differentially expressed proteins by comparison of mass spectrometry results with reference databases includes many components of sample processing, not just analysis and interpretation, that are regularly revisited and updated. To support such updates and the dissemination of data, a suitable data structure is needed. However, no data structure is currently available for storing the data for multiple gels generated in a single proteomic experiment in a single XML file. This paper proposes a data structure based on XML standards to fill the gap between the data generated by proteomics experiments and their storage.
In order to address the resulting procedural fluidity we have adopted and implemented a data model centered on the concept of annotated gel (AG) as the format for delivery and management of 2D Gel electrophoresis results. An eXtensible Markup Language (XML) schema is proposed to manage, analyze and disseminate annotated 2D Gel electrophoresis results. The structure of AG objects is formally represented using XML, resulting in the definition of the AGML syntax presented here.
The proposed schema accommodates data on the electrophoresis results as well as the mass-spectrometry analysis of selected gel spots. A web-based software library is being developed to handle data storage, analysis and graphic representation. Computational tools described will be made available at . Our development of AGML provides a simple data structure for storing 2D gel electrophoresis data.
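The idea of holding several annotated gels from one experiment in a single XML file, with mass-spectrometry results attached to selected spots, can be sketched with Python's standard `xml.etree.ElementTree`. The element and attribute names used here (`experiment`, `gel`, `spot`, `massSpec`, `protein`) are hypothetical placeholders chosen for illustration; they are not the AGML syntax, which this text does not reproduce.

```python
import xml.etree.ElementTree as ET


def build_experiment(gels):
    """Build one XML tree holding several gels (hypothetical element names)."""
    root = ET.Element("experiment")
    for gel in gels:
        gel_el = ET.SubElement(root, "gel", id=gel["id"])
        for spot in gel["spots"]:
            spot_el = ET.SubElement(
                gel_el, "spot", id=spot["id"], x=str(spot["x"]), y=str(spot["y"])
            )
            # Only spots that were excised and analyzed carry an MS annotation.
            if "protein" in spot:
                ET.SubElement(spot_el, "massSpec", protein=spot["protein"])
    return root


example = build_experiment([
    {"id": "g1", "spots": [{"id": "s1", "x": 10, "y": 20, "protein": "P12345"}]},
    {"id": "g2", "spots": [{"id": "s1", "x": 11, "y": 19}]},
])
xml_text = ET.tostring(example, encoding="unicode")
```

Serializing to one string (or file) keeps all gels of an experiment together, which is the gap the proposed schema is meant to fill.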
Summary: METAL provides a computationally efficient tool for meta-analysis of genome-wide association scans, a commonly used approach for improving power in complex trait gene mapping studies. METAL provides a rich scripting interface and implements efficient memory management to allow analyses of very large data sets and to support a variety of input file formats.
Availability and implementation: METAL, including source code, documentation, examples, and executables, is available at http://www.sph.umich.edu/csg/abecasis/metal/
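To make the notion of meta-analysis across association scans concrete, the sketch below combines per-study effect estimates for a single marker using fixed-effects inverse-variance weighting, a standard scheme in this setting. The function name and interface are assumptions for illustration and do not reflect METAL's scripting commands or internals.

```python
import math


def inverse_variance_meta(effects, std_errors):
    """Fixed-effects meta-analysis of one marker across studies.

    Returns the pooled effect, its standard error, the Z score and a
    two-sided p-value under a normal approximation.
    """
    if len(effects) != len(std_errors) or not effects:
        raise ValueError("need one standard error per effect estimate")
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, effects)) / total
    se = math.sqrt(1.0 / total)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return beta, se, z, p
```

Studies with smaller standard errors receive larger weights, so the pooled estimate is more precise than any single study, which is the source of the power gain.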
Summary: Wave-spec is a pre-processing package for mass spectrometry (MS) data. The package includes several novel algorithms that overcome conventional difficulties with the pre-processing of such data. In this application note, we demonstrate step-by-step use of this package on a real-world MALDI dataset.
Availability: The package can be downloaded at http://www.vicc.org/biostatistics/supp.php. A shared mailbox (firstname.lastname@example.org) also is available for questions regarding application of the package.
Supplementary information: Supplementary data are available at Bioinformatics online.
Summary: The Microbial Proteomic Resource (MPR) is a repository service that contains non-redundant protein databases of related bacterial strains, generated with in-house software called Multi-Strain Mass Spectrometry Prokaryotic DataBase Builder (MSMSpdbb). MSMSpdbb merges and clusters protein sequences inferred from genomic sequences, and provides a protein list in FASTA format that accounts for divergence in gene annotation, translational start site choice, and the presence of single nucleotide polymorphisms and other mutations.
Availability: MSMSpdbb was developed in C++ using the Qt libraries (Nokia) and is licensed under the GNU General Public License version 2. MSMSpdbb is freely available; its installation files, instructions for use and additional documentation can be found at the MPR web site (http://org.uib.no/prokaryotedb/) and at Proteomecommons.org (see Supplementary Methods for Hash number).
Supplementary information: Supplementary data are available at Bioinformatics online.
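The simplest part of building a non-redundant multi-strain protein list, collapsing identical sequences from several strains into one FASTA entry, can be sketched as below. This is a deliberately reduced illustration: the function names are hypothetical, only exact duplicates are merged, and the clustering of near-identical sequences (alternative start sites, SNPs) that MSMSpdbb is described as performing is not attempted.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def merge_strains(fasta_texts):
    """Map each distinct sequence to the headers of all strains that carry it."""
    merged = {}  # dicts keep insertion order, so output order is stable
    for text in fasta_texts:
        for header, seq in parse_fasta(text):
            merged.setdefault(seq, []).append(header)
    return merged


def to_fasta(merged):
    """Write one entry per distinct sequence, joining source headers with '|'."""
    lines = []
    for seq, headers in merged.items():
        lines.append(">" + "|".join(headers))
        lines.append(seq)
    return "\n".join(lines) + "\n"
```

Keeping every source header on the merged entry preserves the link back to each strain's annotation.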
Summary: The peptide identification algorithm is a major bottleneck for mass spectrometry-based chemical cross-linking experiments. Our lab recently developed an intensity-incorporated peptide identification algorithm, and here we implement this scheme for cross-linked peptide discovery. Our program, SQID-XLink, searches all regular, dead-end, intra- and inter-cross-linked peptides simultaneously, and its effectiveness is validated by testing on a published dataset. This new algorithm provides an alternative approach for high-confidence cross-linking identification.
Availability: The SQID-XLink program is freely available for download from http://quiz2.chem.arizona.edu/wysocki/bioinformatics.htm.
Supplementary information: Supplementary data are available at Bioinformatics online.
Mass spectrometry (MS) has evolved to become the primary high-throughput tool for proteomics-based biomarker discovery. To date, multiple challenges in protein MS data analysis remain: large-scale and complex data set management; MS peak identification and indexing; and high-dimensional differential peak analysis with false discovery rate (FDR) control across the concurrent statistical tests. “Turnkey” solutions are needed for biomarker investigations to rapidly process MS data sets to identify statistically significant peaks for subsequent validation.
Here we present an efficient and effective solution, which provides experimental biologists easy access to “cloud” computing capabilities to analyze MS data. The web portal can be accessed at http://transmed.stanford.edu/ssa/.
The presented web application supports online uploading and analysis of large-scale MS data through a simple user interface. This bioinformatic tool will facilitate the discovery of potential protein biomarkers using MS.
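Controlling the false discovery rate across many concurrent peak-wise tests can be illustrated with the Benjamini-Hochberg procedure. The choice of this particular procedure and the function name are assumptions made for illustration; the text above does not state which FDR method the web portal uses.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Peaks whose adjusted p-value falls below a chosen threshold (for example 0.05) would then be carried forward for validation.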
The Protein and Metabolite Analysis Facility at the University of Texas at Austin is a joint effort of the College of Pharmacy, Center for Research on Environmental Disease (CRED), and the Institute for Cellular and Molecular Biology (ICMB). Services and collaborative research are offered for the detection, characterization and quantification of biomolecules. The Facility's goals are to provide sensitive protein identification and modification analyses, to provide custom peptide syntheses, to offer services for the identification and quantification of metabolites, nutrients and xenobiotics, to implement novel analytical methods, to improve the sensitivity of existing analyses, to provide consultation on the selection and implementation of analytical methods, to offer training in the usage and applications of the instrumentation, and to provide technical expertise in support of individual research goals. The ICMB portion of the Core contains an ABI Procise 492 cLC protein sequencer, a Protein Technologies Inc. Symphony peptide synthesizer, two Bio-rad Duoflows and a GE Healthcare AKTA protein purification system, two Beckman System Gold HPLC systems, a Berthold Technologies Mithras luminescence and fluorescence detector, an Invitrogen gel electrophoresis set-up, an Art Robbins Instruments Phoenix crystallography robot and an LC-MALDI-TOF/TOF (an ABI 4700 with an LC Packings Ultimate Nano-LC system with a Probot spotting robot). In the College of Pharmacy, the Core has an Applied Biosystems 4000 Q-trap LC MS/MS system with ESI, APCI and nanospray sources coupled with a Shimadzu LC-20AD HPLC system, a ThermoFinnigan LCQ ion trap mass spectrometer with ESI, APCI and microspray interfaces combined with a Michrom Magic 2002 HPLC system, a ThermoFinnigan Trace MS GC-quadrupole with EI positive, negative CI and selected ion monitoring (SIM), an ABI Voyager-DE Pro MALDI-TOF and a Bio-rad Bioplex 200 fluorescent microbead array system.
The amount of information stemming from proteomics experiments involving (multi-dimensional) separation techniques, mass spectrometric analysis, and computational analysis is ever-increasing. Data from such an experimental workflow need to be captured, related and analyzed. Biological experiments within this scope produce heterogeneous data ranging from pictures of one- or two-dimensional protein maps and spectra recorded by tandem mass spectrometry to text-based identifications made by algorithms that analyze these spectra. Additionally, peptide and corresponding protein information needs to be displayed.
In order to handle the large amount of data from computational processing of mass spectrometric experiments, automatic import scripts are available and the necessity for manual input to the database has been minimized. Information is in a generic format which abstracts from specific software tools typically used in such an experimental workflow. The software is therefore capable of storing and cross analysing results from many algorithms. A novel feature and a focus of this database is to facilitate protein identification by using peptides identified from mass spectrometry and link this information directly to respective protein maps. Additionally, our application employs spectral counting for quantitative presentation of the data. All information can be linked to hot spots on images to place the results into an experimental context. A summary of identified proteins, containing all relevant information per hot spot, is automatically generated, usually upon either a change in the underlying protein models or due to newly imported identifications. The supporting information for this report can be accessed in multiple ways using the user interface provided by the application.
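Spectral counting, as used above for quantitative presentation, amounts to counting the identified spectra assigned to each protein. The sketch below is a minimal illustration; the tuple layout `(spectrum_id, peptide, protein)` and the function name are hypothetical and do not describe the database's actual schema or import scripts.

```python
from collections import Counter


def spectral_counts(identifications):
    """Count distinct identified spectra per protein.

    `identifications` is an iterable of (spectrum_id, peptide, protein)
    tuples; a spectrum reported twice for the same protein, for example
    by two search algorithms, is counted once.
    """
    seen = set()
    counts = Counter()
    for spectrum_id, _peptide, protein in identifications:
        key = (spectrum_id, protein)
        if key in seen:
            continue
        seen.add(key)
        counts[protein] += 1
    return counts
```

De-duplicating on the spectrum and protein pair matters when results from several algorithms are stored side by side, since the same spectrum would otherwise inflate the count.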
We present a proteomics database which aims to greatly reduce evaluation time of results from mass spectrometric experiments and enhance result quality by allowing consistent data handling. Import functionality, automatic protein detection, and summary creation act together to facilitate data analysis. In addition, supporting information for these findings is readily accessible via the graphical user interface provided. The database schema and the implementation, which can easily be installed on virtually any server, can be downloaded in the form of a compressed file from our project webpage.
PRIDE, the ‘PRoteomics IDEntifications database’, is a database of protein and peptide identifications that have been described in the scientific literature. These identifications will typically be from specific species, tissues and sub-cellular locations, perhaps under specific disease conditions. Any post-translational modifications that have been identified on individual peptides can be described. These identifications may be annotated with supporting mass spectra. At the time of writing, PRIDE includes the full set of identifications as submitted by individual laboratories participating in the HUPO Plasma Proteome Project and a profile of the human platelet proteome submitted by the University of Ghent in Belgium. By late 2005 PRIDE is expected to contain the identifications and spectra generated by the HUPO Brain Proteome Project. Proteomics laboratories are encouraged to submit their identifications and spectra to PRIDE to support their manuscript submissions to proteomics journals. Data can be submitted in PRIDE XML format if identifications are included or mzData format if the submitter is depositing mass spectra without identifications. PRIDE is a web application, so submission, searching and data retrieval can all be performed using an internet browser. PRIDE can be searched by experiment accession number, protein accession number, literature reference and sample parameters including species, tissue, sub-cellular location and disease state. Data can be retrieved as machine-readable PRIDE or mzData XML (the latter for mass spectra without identifications), or as human-readable HTML.
Summary: pybedtools is a flexible Python software library for manipulating and exploring genomic datasets in many common formats. It provides an intuitive Python interface that extends upon the popular BEDTools genome arithmetic tools. The library is well documented and efficient, and allows researchers to quickly develop simple, yet powerful scripts that enable complex genomic analyses.
Availability: pybedtools is maintained under the GPL license. Stable versions of pybedtools as well as documentation are available on the Python Package Index at http://pypi.python.org/pypi/pybedtools.
Contact: email@example.com; firstname.lastname@example.org
Supplementary Information: Supplementary data are available at Bioinformatics online.
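The kind of "genome arithmetic" that such tools perform can be illustrated with a plain-Python interval intersection. This sketch deliberately does not use the pybedtools API: the function `intersect` and the tuple layout are assumptions for illustration, and the quadratic loop is chosen for clarity rather than efficiency. Coordinates follow the BED convention of zero-based, half-open intervals.

```python
def intersect(a, b):
    """Return overlapping segments between two lists of (chrom, start, end).

    Intervals are half-open, as in BED files, so (chr1, 100, 200) and
    (chr1, 200, 300) touch but do not overlap.
    """
    result = []
    for chrom_a, start_a, end_a in a:
        for chrom_b, start_b, end_b in b:
            if chrom_a != chrom_b:
                continue
            start = max(start_a, start_b)
            end = min(end_a, end_b)
            if start < end:
                result.append((chrom_a, start, end))
    return sorted(result)
```

A library built on BEDTools would typically replace the nested loop with a sorted sweep or an index, but the result for each overlapping pair is the same segment.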
Summary: We report CRdata.org, a cloud-based, free, open-source web server for running analyses and sharing data and R scripts with others. In addition to using the free, public service, CRdata users can launch their own private Amazon Elastic Computing Cloud (EC2) nodes and store private data and scripts on Amazon's Simple Storage Service (S3) with user-controlled access rights. All CRdata services are provided via point-and-click menus.
Availability and Implementation: CRdata is open-source and free under the permissive MIT License (opensource.org/licenses/mit-license.php). The source code is in Ruby (ruby-lang.org/en/) and available at: github.com/seerdata/crdata.
Although data deposition is not yet general practice in the field of proteomics, several mass spectrometry (MS)-based proteomics repositories are publicly available to the scientific community. The main existing resources are the Global Proteome Machine Database (GPMDB), PeptideAtlas, the PRoteomics IDEntifications database (PRIDE), Tranche, and NCBI Peptidome. In this review the capabilities of each of these will be described, paying special attention to four key properties: data types stored, applicable data submission strategies, supported formats, and available data mining and visualization tools. Additionally, the data contents from model organisms will be enumerated for each resource. There are other valuable smaller and/or more specialized repositories, but they will not be covered in this review. Finally, the concept behind the ProteomeXchange consortium, a collaborative effort among the main resources in the field, will be introduced.
CV, Controlled Vocabulary; HGNC, HUGO Gene Nomenclature Committee; MCP, Molecular and Cellular Proteomics; MRM, Multiple Reaction Monitoring; NIH, National Institutes of Health; OLS, Ontology Lookup Service; PICR, Protein Identifier Cross-Referencing; PSI, Proteomics Standards Initiative; QC, Quality Control; SRM, Selected Reaction Monitoring; SBEAMS, Systems Biology Experiment Analysis Management System; TPP, Trans Proteomics Pipeline.
Proteomics; Databases; Bioinformatics; Data standards; Repositories
The Biomolecular Interaction Network Database (BIND) is a major source of curated biomolecular interactions, which has been unmaintained for the last few years, a trend which will eventually result in the loss of a significant amount of unique biomolecular interaction information, mostly as database identifiers become out of date. To help reverse this trend, we converted BIND to a standard format, Proteomics Standard Initiative-Molecular Interaction 2.5, starting from the last curated data release (from 2005) available in a custom XML format and made the core components (interactions and complexes) plus additional valuable curated information available for download (http://download.baderlab.org/BINDTranslation/). Major work during the conversion process was required to update out of date molecule identifiers resulting in a more comprehensive conversion of BIND, by measures including number of species and interactor types covered, than what is currently accessible elsewhere. This work also highlights issues of data modeling, controlled vocabulary adoption and data cleaning that can serve as a general case study on the future compatibility of interaction databases.
Database URL: http://download.baderlab.org/BINDTranslation/