1.  Automated nucleic acid chain tracing in real time 
IUCrJ  2014;1(Pt 6):387-392.
A method is presented for the automatic building of nucleotide chains into electron density which is fast enough to be used in interactive model-building software. Likely nucleotides lying in the vicinity of the current view are located and then grown into connected chains in a fraction of a second. When this development is combined with existing tools, assisted manual model building is as simple as or simpler than for proteins.
The crystallographic structure solution of nucleotides and nucleotide complexes is now commonplace. The resulting electron-density maps are often poorer than for proteins, and as a result interpretation in terms of an atomic model can require significant effort, particularly in the case of large structures. While model building can be performed automatically, as with proteins, the process is time-consuming, taking minutes to days depending on the software and the size of the structure. A method is presented for the automatic building of nucleotide chains into electron density which is fast enough to be used in interactive model-building software, with extended chain fragments built around the current view position in a fraction of a second. The speed of the method arises from the determination of the ‘fingerprint’ of the sugar and phosphate groups in terms of conserved high-density and low-density features, coupled with a highly efficient scoring algorithm. Use cases include the rapid evaluation of an initial electron-density map, addition of nucleotide fragments to prebuilt protein structures, and in favourable cases the completion of the structure while automated model-building software is still running. The method has been incorporated into the Coot software package.
doi:10.1107/S2052252514019290
PMCID: PMC4224457  PMID: 25485119
nucleic acid chain tracing; Coot
2.  A Probabilistic Fragment-Based Protein Structure Prediction Algorithm 
PLoS ONE  2012;7(7):e38799.
Conformational sampling is one of the bottlenecks in fragment-based protein structure prediction approaches. They generally start with a coarse-grained optimization where mainchain atoms and centroids of side chains are considered, followed by a fine-grained optimization with an all-atom representation of proteins. It is during this coarse-grained phase that fragment-based methods sample intensely the conformational space. If the native-like region is sampled more, the accuracy of the final all-atom predictions may be improved accordingly. In this work we present EdaFold, a new method for fragment-based protein structure prediction based on an Estimation of Distribution Algorithm. Fragment-based approaches build protein models by assembling short fragments from known protein structures. Whereas the probability mass functions over the fragment libraries are uniform in the usual case, we propose an algorithm that learns from previously generated decoys and steers the search toward native-like regions. A comparison with Rosetta AbInitio protocol shows that EdaFold is able to generate models with lower energies and to enhance the percentage of near-native coarse-grained decoys on a benchmark of proteins. The best coarse-grained models produced by both methods were refined into all-atom models and used in molecular replacement. All atom decoys produced out of EdaFold’s decoy set reach high enough accuracy to solve the crystallographic phase problem by molecular replacement for some test proteins. EdaFold showed a higher success rate in molecular replacement when compared to Rosetta. Our study suggests that improving low resolution coarse-grained decoys allows computational methods to avoid subsequent sampling issues during all-atom refinement and to produce better all-atom models. EdaFold can be downloaded from http://www.riken.jp/zhangiru/software/.
doi:10.1371/journal.pone.0038799
PMCID: PMC3400640  PMID: 22829868
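The estimation-of-distribution idea in entry 2 — steering fragment selection toward fragments that recur in low-energy decoys — can be sketched as follows. This is a minimal illustration, not the published EdaFold update rule; the function name, the fifty-percent elite cut, and the blending rate are assumptions.

```python
def update_fragment_probs(probs, decoys, energies, learn_rate=0.5):
    """One EDA-style update for a single sequence position: shift
    probability mass toward fragments used by low-energy decoys.

    probs    : dict fragment_id -> selection probability (sums to 1)
    decoys   : fragment_id chosen at this position, one per decoy
    energies : matching list of decoy energies (lower is better)
    """
    # Keep the better-scoring half of the decoys (the "elite" set).
    ranked = sorted(zip(decoys, energies), key=lambda de: de[1])
    elite = [frag for frag, _ in ranked[: max(1, len(ranked) // 2)]]

    # Empirical fragment distribution over the elite decoys.
    counts = {f: 0 for f in probs}
    for f in elite:
        counts[f] += 1
    empirical = {f: counts[f] / len(elite) for f in probs}

    # Blend the old distribution with the empirical one.
    return {f: (1 - learn_rate) * probs[f] + learn_rate * empirical[f]
            for f in probs}
```

Iterating such an update concentrates sampling on fragments that keep appearing in near-native decoys, which is the mechanism the abstract credits for the improved coarse-grained models.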
3.  Reconstruction of Protein Backbones from the BriX Collection of Canonical Protein Fragments 
PLoS Computational Biology  2008;4(5):e1000083.
As modeling of changes in backbone conformation still lacks a computationally efficient solution, we developed a discretisation of the conformational states accessible to the protein backbone similar to the successful rotamer approach in side chains. The BriX fragment database, consisting of fragments from 4 to 14 residues long, was realized through identification of recurrent backbone fragments from a non-redundant set of high-resolution protein structures. BriX contains an alphabet of more than 1,000 frequently observed conformations per peptide length for 6 different variation levels. Analysis of the performance of BriX revealed an average structural coverage of protein structures of more than 99% within a root mean square distance (RMSD) of 1 Angstrom. Globally, we are able to reconstruct protein structures with an average accuracy of 0.48 Angstrom RMSD. As expected, regular structures are well covered, but, interestingly, many loop regions that appear irregular at first glance are also found to form a recurrent structural motif, albeit with lower frequency of occurrence than regular secondary structures. Larger loop regions could be completely reconstructed from smaller recurrent elements, between 4 and 8 residues long. Finally, we observed that a significant amount of short sequences tend to display strong structural ambiguity between alpha helix and extended conformations. When the sequence length increases, this so-called sequence plasticity is no longer observed, illustrating the context dependency of polypeptide structures.
Author Summary
Large-scale DNA sequencing efforts produce large amounts of protein sequence data. However, in order to understand the function of a protein, its three-dimensional (tertiary) structure is required. Despite worldwide efforts in structural biology, experimental protein structures are determined at a significantly slower pace. As a result, computational methods for protein structure prediction receive significant attention. A large part of the structure prediction problem lies in the enormous size of the problem: proteins seem to occur in an infinite variety of shapes. Here, we propose that this huge complexity may be overcome by identifying recurrent protein fragments, which are frequently reused as building blocks to construct proteins that were hitherto thought to be unrelated. The BriX database is the outcome of identifying about 2,000 canonical shapes among 1,261 protein structures. We show that any given protein can be reconstructed from this library of building blocks at very high resolution, suggesting that the modelling of protein backbones may be greatly aided by our database.
doi:10.1371/journal.pcbi.1000083
PMCID: PMC2367438  PMID: 18483555
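The reconstruction step described in entry 3 — covering a backbone with the closest canonical fragments — reduces, at its core, to picking the library fragment with minimum RMSD to each target segment. A toy one-dimensional sketch under stated assumptions (real BriX fragments are superposed 3-D backbones; the function names are illustrative):

```python
import math

def rmsd(a, b):
    """Root-mean-square distance between two equal-length coordinate lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def best_fragment(segment, library):
    """Return (rmsd, fragment) for the library fragment closest to the segment."""
    return min(((rmsd(segment, frag), frag) for frag in library),
               key=lambda pair: pair[0])
```

Sliding this selection along the chain, segment by segment, yields the kind of fragment-wise coverage whose accuracy the abstract reports as 0.48 Angstrom RMSD on average.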
4.  SWISS-MODEL: modelling protein tertiary and quaternary structure using evolutionary information 
Nucleic Acids Research  2014;42(Web Server issue):W252-W258.
Protein structure homology modelling has become a routine technique to generate 3D models for proteins when experimental structures are not available. Fully automated servers such as SWISS-MODEL with user-friendly web interfaces generate reliable models without the need for complex software packages or downloading large databases. Here, we describe the latest version of the SWISS-MODEL expert system for protein structure modelling. The SWISS-MODEL template library provides annotation of quaternary structure and essential ligands and co-factors to allow for building of complete structural models, including their oligomeric structure. The improved SWISS-MODEL pipeline makes extensive use of model quality estimation for selection of the most suitable templates and provides estimates of the expected accuracy of the resulting models. The accuracy of the models generated by SWISS-MODEL is continuously evaluated by the CAMEO system. The new web site allows users to interactively search for templates, cluster them by sequence similarity, structurally compare alternative templates and select the ones to be used for model building. In cases where multiple alternative template structures are available for a protein of interest, a user-guided template selection step allows building models in different functional states. SWISS-MODEL is available at http://swissmodel.expasy.org/.
doi:10.1093/nar/gku340
PMCID: PMC4086089  PMID: 24782522
5.  Automated main-chain model building by template matching and iterative fragment extension 
A method for automated macromolecular main-chain model building is described.
An algorithm for the automated macromolecular model building of polypeptide backbones is described. The procedure is hierarchical. In the initial stages, many overlapping polypeptide fragments are built. In subsequent stages, the fragments are extended and then connected. Identification of the locations of helical and β-strand regions is carried out by FFT-based template matching. Fragment libraries of helices and β-strands from refined protein structures are then positioned at the potential locations of helices and strands and the longest segments that fit the electron-density map are chosen. The helices and strands are then extended using fragment libraries consisting of sequences three amino acids long derived from refined protein structures. The resulting segments of polypeptide chain are then connected by choosing those which overlap at two or more Cα positions. The fully automated procedure has been implemented in RESOLVE and is capable of model building at resolutions as low as 3.5 Å. The algorithm is useful for building a preliminary main-chain model that can serve as a basis for refinement and side-chain addition.
doi:10.1107/S0907444902018036
PMCID: PMC2745878  PMID: 12499537
model building; template matching; fragment extension
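The FFT-based template matching in entry 5 amounts to cross-correlating the electron-density map with a helix or strand template in Fourier space and scanning the result for peaks. A one-dimensional sketch under strong simplifying assumptions (real maps are 3-D and templates must also be scanned over rotations; the function name is hypothetical):

```python
import numpy as np

def fft_template_match(density, template):
    """Cross-correlate a 1-D density profile with a template via FFTs;
    returns the offset of the best match and the full correlation trace."""
    n = len(density)
    padded = np.zeros(n)
    padded[:len(template)] = template
    # Correlation = inverse FFT of (FFT(density) * conj(FFT(template))).
    corr = np.fft.ifft(np.fft.fft(density) * np.conj(np.fft.fft(padded))).real
    return int(np.argmax(corr)), corr
```

Computing the correlation in Fourier space costs O(n log n) rather than O(n^2), which is what makes exhaustive template scans over a whole map practical.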
6.  Kinesin-like protein CENP-E is upregulated in rheumatoid synovial fibroblasts 
Arthritis Research  1999;1(1):71-80.
Our aim was to identify specifically expressed genes using RNA arbitrarily primed (RAP)-polymerase chain reaction (PCR) for differential display in patients with rheumatoid arthritis (RA). In RA, amplification of a distinct PCR product suitable for sequencing could be observed. Sequence analysis identified the PCR product as highly homologous to a 434 base pair segment of the human centromere kinesin-like protein CENP-E. Differential expression of CENP-E was confirmed by quantitative reverse transcription PCR, immunohistochemistry and in situ hybridization. CENP-E expression was independent of prednisolone and could not be completely inhibited by serum starvation. RAP-PCR is a suitable method to identify differentially expressed genes in rheumatoid synovial fibroblasts. Also, because motifs of CENP-E show homologies to jun and fos oncogene products and are involved in virus assembly, CENP-E may be involved in the pathophysiology of RA.
Introduction:
Articular destruction by invading synovial fibroblasts is a typical feature in rheumatoid arthritis (RA). Recent data support the hypothesis that key players in this scenario are transformed-appearing synovial fibroblasts at the site of invasion into articular cartilage and bone. They maintain their aggressive phenotype toward cartilage, even when first cultured and thereafter coimplanted together with normal human cartilage into severe combined immunodeficient mice for an extended period of time. However, little is known about the upregulation of genes that leads to this aggressive fibroblast phenotype. To inhibit this progressive growth without interfering with pathways of physiological matrix remodelling, identification of pathways that operate specifically in RA synovial fibroblasts is required. In order to achieve this goal, identification of genes showing upregulation restricted to RA synovial fibroblasts is essential.
Aims:
To identify specifically expressed genes using RNA arbitrarily primed (RAP)-polymerase chain reaction (PCR) for differential display in patients with RA.
Methods:
RNA was extracted from cultured synovial fibroblasts from 10 patients with RA, four patients with osteoarthritis (OA), and one patient with psoriatic arthritis. RAP-PCR was performed using different arbitrary primers for first-strand and second-strand synthesis: US6 (5' -GTGGTGACAG-3') for first-strand, and Nuclear 1+ (5' -ACGAAGAAGAG-3'), OPN28 (5' -GCACCAGGGG-3'), Kinase A2+ (5' -GGTGCCTTTGG-3') and OPN24 (5' -AGGGGCACCA-3') for second-strand synthesis. PCR reactions were loaded onto 8 mol/l urea/6% polyacrylamide-sequencing gels and electrophoresed. Gel slices carrying the target fragment were then excised with a razor blade, eluted and reamplified. After verifying their correct size and purity on 4% agarose gels, the reamplified products derived from the single-strand conformation polymorphism gel were cloned, and five clones per transcript were sequenced. Thereafter, a GenBank® analysis was performed. Quantitative reverse transcription PCR of the segments was performed using the PCR MIMIC® technique. In-situ expression of centromere kinesin-like protein-E (CENP-E) messenger (m)RNA in RA synovium was assessed using digoxigenin-labelled riboprobes, and CENP-E protein expression in fibroblasts and synovium was assessed by immunogold-silver immunohistochemistry and cytochemistry. Functional analysis of CENP-E was done using different approaches (eg glucocorticoid stimulation, serum starvation and growth rate analysis of synovial fibroblasts that expressed CENP-E).
Results:
In RA, amplification of a distinct PCR product suitable for sequencing could be observed. The indicated complementary DNA fragment of 434 base pairs from RA mRNA corresponded to nucleotides 6615-7048 in the human centromere kinesin-like protein CENP-E mRNA (GenBank® accession No. emb/Z15005). The isolated sequence shared greater than 99% nucleic acid (P = 2.9e-169) identity with the human centromere kinesin-like protein CENP-E. Two base changes at positions 6624 (A to C) and 6739 (A to G) did not result in alteration in the amino acid sequence, and therefore 100% amino acid identity could be confirmed. The amplification of 10 clones of the cloned RAP product revealed the presence of CENP-E mRNA in every fibroblast culture examined, showing from 50% (271,000 ± 54,000 phosphor imager arbitrary units) up to fivefold (961,000 ± 145,000 phosphor imager arbitrary units) upregulation when compared with OA fibroblasts. Neither therapy with disease-modifying antirheumatic drugs such as methotrexate, gold, resochine or cyclosporine A, nor therapy with oral steroids influenced CENP-E expression in the RA fibroblasts. Of the eight RA fibroblast populations from RA patients who were receiving disease-modifying antirheumatic drugs, five showed CENP-E upregulation; and of the eight fibroblast populations from RA patients receiving steroids, four showed CENP-E upregulation.
Numerous synovial cells of the patients with RA showed a positive in situ signal for the isolated CENP-E gene segment, confirming CENP-E mRNA production in rheumatoid synovium, whereas in OA synovial tissue CENP-E mRNA could not be detected. In addition, CENP-E expression was independent of medication. This was further confirmed by analysis of the effect of prednisolone on CENP-E expression, which revealed no alteration in CENP-E mRNA after exposure to different (physiological) concentrations of prednisolone. Serum starvation also could not suppress CENP-E mRNA completely.
Discussion:
Since its introduction in 1992, numerous variants of the differential display method and continuous improvements including RAP-PCR have proved to have both efficiency and reliability in examination of differentially regulated genes. The results of the present study reveal that RAP-PCR is a suitable method to identify differentially expressed genes in rheumatoid synovial fibroblasts.
The mRNA, which has been found to be upregulated in rheumatoid synovial fibroblasts, codes for a kinesin-like motor protein named CENP-E, which was first characterized in 1991. It is a member of a family of centromere-associated proteins, of which six (CENP-A to CENP-F) are currently known. CENP-E itself is a kinetochore motor, which accumulates transiently at kinetochores in the G2 phase of the cell cycle before mitosis takes place, appears to modulate chromosome movement and spindle elongation, and is degraded at the end of mitosis. The presence or upregulation of CENP-E has never been associated with RA.
The three-dimensional structure of CENP-E includes a coiled-coil domain. This has important functions and shows links to known pathways in RA pathophysiology. Coiled-coil domains can also be found in jun and fos oncogene products, which are frequently upregulated in RA synovial fibroblasts. They are also involved in DNA binding and transactivation processes resembling the situation in AP-1 (Jun/Fos)-dependent DNA-binding in rheumatoid synovium. Most interestingly, these coiled-coil motifs are crucial for the assembly of viral proteins, and the upregulation of CENP-E might reflect the influence of infectious agents in RA synovium. We also performed experiments showing that serum starvation decreased, but did not completely inhibit, CENP-E mRNA expression. This shows that CENP-E is related to, but does not completely depend on, proliferation of these cells. In addition, we determined the growth rate of CENP-E high and low expressors, showing that it was independent of the amount of CENP-E expression, supporting the statement that upregulation of CENP-E reflects an activated RA fibroblast phenotype. In summary, the results of the present study support the hypothesis that CENP-E, presumably independently of medication, may not only be upregulated, but may also be involved in RA pathophysiology.
PMCID: PMC17776  PMID: 11056662
arthritis; centromere; differential display; immunohistochemistry; in situ hybridization; RNA fingerprinting
7.  A workflow learning model to improve geovisual analytics utility 
Introduction
This paper describes the design and implementation of the G-EX Portal Learn Module, a web-based, geocollaborative application for organizing and distributing digital learning artifacts. G-EX falls into the broader context of geovisual analytics, a new research area with the goal of supporting visually-mediated reasoning about large, multivariate, spatiotemporal information. Because this information is unprecedented in amount and complexity, GIScientists are tasked with the development of new tools and techniques to make sense of it. Our research addresses the challenge of implementing these geovisual analytics tools and techniques in a useful manner.
Objectives
The objective of this paper is to develop and implement a method for improving the utility of geovisual analytics software. The success of software is measured by its usability (i.e., how easy the software is to use) and its utility (i.e., how useful the software is). The usability and utility of software can be improved by refining the software, increasing user knowledge about the software, or both. It is difficult to achieve transparent usability (i.e., software that is immediately usable without training) of geovisual analytics software because of the inherent complexity of the included tools and techniques. In these situations, improving user knowledge about the software through the provision of learning artifacts is as important as, if not more important than, iterative refinement of the software itself. Therefore, our approach to improving utility is focused on educating the user.
Methodology
The research reported here was completed in two steps. First, we developed a model for learning about geovisual analytics software. Many existing digital learning models assist only with use of the software to complete a specific task and provide limited assistance with its actual application. To move beyond task-oriented learning about software use, we propose a process-oriented approach to learning based on the concept of scientific workflows. Second, we implemented an interface in the G-EX Portal Learn Module to demonstrate the workflow learning model. The workflow interface allows users to drag learning artifacts uploaded to the G-EX Portal onto a central whiteboard and then annotate the workflow using text and drawing tools. Once completed, users can visit the assembled workflow to get an idea of the kind, number, and scale of analysis steps, view individual learning artifacts associated with each node in the workflow, and ask questions about the overall workflow or individual learning artifacts through the associated forums. An example learning workflow in the domain of epidemiology is provided to demonstrate the effectiveness of the approach.
Results/Conclusions
In the context of geovisual analytics, GIScientists are not only responsible for developing software to facilitate visually-mediated reasoning about large and complex spatiotemporal information, but also for ensuring that this software works. The workflow learning model discussed in this paper and demonstrated in the G-EX Portal Learn Module is one approach to improving the utility of geovisual analytics software. While development of the G-EX Portal Learn Module is ongoing, we expect to release the G-EX Portal Learn Module by Summer 2009.
PMCID: PMC3186065  PMID: 21983545
geovisual analytics; workflows; learning; utility; usability; geocollaboration; G-EX Portal; epidemiology
8.  De novo protein sequence analysis of Macaca mulatta 
BMC Genomics  2007;8:270.
Background
Macaca mulatta is one of the most utilized non-human primate species in biomedical research, offering unique behavioral, neuroanatomical, and neurobiochemical similarities to humans. This makes it a unique organism to model various diseases such as psychiatric and neurodegenerative illnesses while also providing insight into the complexities of the primate brain. A major obstacle in utilizing rhesus monkey models for human disease is the paucity of protein annotations for this species (~42,000 protein annotations) compared to 330,210 protein annotations for humans. The lack of available information limits the use of rhesus monkey for proteomic scale studies which rely heavily on database searches for protein identification. While characterization of proteins of interest from Macaca mulatta using the standard database search engines (e.g., MASCOT) can be accomplished, searches must be performed using a 'broad species database' which does not provide optimal confidence in protein annotation. Therefore, it becomes necessary to determine partial or complete amino acid sequences using either manual or automated de novo peptide sequence analysis methods.
Results
The recently popularized MALDI-TOF-TOF mass spectrometer yields a complex MS/MS fragmentation pattern difficult to characterize by manual de novo sequencing method on a proteomics scale. Therefore, PEAKS assisted de novo sequencing was performed on nucleus accumbens cytosolic proteins from Macaca mulatta. The most abundant peptide fragments 'b-ions and y-ions', the less abundant peptide fragments 'a-ions' as well as the immonium ions were utilized to develop confident and complete peptide sequences de novo from MS/MS spectra. The generated sequences were used to perform homology searches to characterize the protein identification.
Conclusion
The current study validates a robust method to confidently characterize the proteins from an incomplete sequence database of Macaca mulatta, using the PEAKS de novo sequencing software, facilitating the use of this animal model in various neuroproteomics studies.
doi:10.1186/1471-2164-8-270
PMCID: PMC1965481  PMID: 17686166
9.  In silico fragmentation for computer assisted identification of metabolite mass spectra 
BMC Bioinformatics  2010;11:148.
Background
Mass spectrometry has become the analytical method of choice in metabolomics research. The identification of unknown compounds is the main bottleneck. In addition to the precursor mass, tandem MS spectra carry informative fragment peaks, but spectral libraries of measured reference compounds are far from covering the complete chemical space. Compound libraries such as PubChem or KEGG describe a far larger number of compounds, whose in silico fragmentation can be compared with the spectra of unknown metabolites.
Results
We created the MetFrag suite to obtain a candidate list from compound libraries based on the precursor mass, subsequently ranked by the agreement between measured and in silico fragments. In the evaluation, MetFrag was able to rank most of the correct compounds within the top 3 candidates returned by an exact mass query in KEGG. Compared to a previously published study, MetFrag obtained better results than the commercial MassFrontier software. Especially for large compound libraries, candidates with a good score show high structural similarity or differ only in stereochemistry; a subsequent clustering based on chemical distances reduces this redundancy. The in silico fragmentation requires less than a second to process a molecule, and MetFrag performs a search in KEGG or PubChem on average within 30 and 300 seconds, respectively, on an average desktop PC.
Conclusions
We presented a method that is able to identify small molecules from tandem MS measurements, even without spectral reference data or a large set of fragmentation rules. With today's massive general purpose compound libraries we obtain dozens of very similar candidates, which still allows a confident estimate of the correct compound class. Our tool MetFrag improves the identification of unknown substances from tandem MS spectra and delivers better results than comparable commercial software. MetFrag is available through a web application, web services and as java library. The web frontend allows the end-user to analyse single spectra and browse the results, whereas the web service and console application are aimed to perform batch searches and evaluation.
doi:10.1186/1471-2105-11-148
PMCID: PMC2853470  PMID: 20307295
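The ranking step in entry 9 — scoring each candidate by how well its in silico fragments explain the measured peaks — can be sketched with a simple peak-matching score. This is a toy stand-in under stated assumptions (the function names and the plain fraction-matched score are illustrative; MetFrag's actual score also weights fragments, e.g. by bond dissociation energies):

```python
def score_candidate(measured_mz, fragment_mz, tol=0.01):
    """Fraction of measured peaks matched by any predicted fragment m/z
    within an absolute tolerance."""
    matched = sum(any(abs(m - f) <= tol for f in fragment_mz)
                  for m in measured_mz)
    return matched / len(measured_mz)

def rank_candidates(measured_mz, candidates, tol=0.01):
    """candidates: dict name -> list of in silico fragment m/z values.
    Returns candidate names, best-scoring first."""
    return sorted(candidates,
                  key=lambda c: score_candidate(measured_mz, candidates[c], tol),
                  reverse=True)
```

Because every candidate is scored independently against the same spectrum, the search parallelizes trivially, consistent with the per-molecule sub-second fragmentation times the abstract reports.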
10.  A computational platform to maintain and migrate manual functional annotations for BioCyc databases 
BMC Systems Biology  2014;8(1):115.
Background
BioCyc databases are an important resource for information on biological pathways and genomic data. Such databases represent the accumulation of biological data, some of which has been manually curated from literature. An essential feature of these databases is the continuing data integration as new knowledge is discovered. As functional annotations are improved, scalable methods are needed for curators to manage annotations without detailed knowledge of the specific design of the BioCyc database.
Results
We have developed CycTools, a software tool which allows curators to maintain functional annotations in a model organism database. This tool builds on existing software to improve and simplify annotation data imports of user provided data into BioCyc databases. Additionally, CycTools automatically resolves synonyms and alternate identifiers contained within the database into the appropriate internal identifiers.
Conclusions
Automating steps in the manual data entry process can improve curation efforts for major biological databases. The functionality of CycTools is demonstrated by transferring GO term annotations from MaizeCyc to matching proteins in CornCyc, both maize metabolic pathway databases available at MaizeGDB, and by creating strain specific databases for metabolic engineering.
Electronic supplementary material
The online version of this article (doi:10.1186/s12918-014-0115-1) contains supplementary material, which is available to authorized users.
doi:10.1186/s12918-014-0115-1
PMCID: PMC4203924  PMID: 25304126
Annotation tool; BioCyc; Pathway/Genome database; JavaCycO
11.  ESTuber db: an online database for Tuber borchii EST sequences 
BMC Bioinformatics  2007;8(Suppl 1):S13.
Background
The ESTuber database includes 3,271 Tuber borchii expressed sequence tags (EST). The dataset consists of 2,389 sequences from an in-house prepared cDNA library from truffle vegetative hyphae, and 882 sequences downloaded from GenBank and representing four libraries from white truffle mycelia and ascocarps at different developmental stages. An automated pipeline was prepared to process EST sequences using public software integrated by in-house developed Perl scripts. Data were collected in a MySQL database, which can be queried via a php-based web interface.
Results
Sequences included in the ESTuber db were clustered and annotated against three databases: the GenBank nr database, the UniProtKB database and a third in-house prepared database of fungi genomic sequences. An algorithm was implemented to infer statistical classification among Gene Ontology categories from the ontology occurrences deduced from the annotation procedure against the UniProtKB database. Ontologies were also deduced from the annotation of more than 130,000 EST sequences from five filamentous fungi, for intra-species comparison purposes.
Further analyses were performed on the ESTuber db dataset, including tandem repeats search and comparison of the putative protein dataset inferred from the EST sequences to the PROSITE database for protein patterns identification. All the analyses were performed both on the complete sequence dataset and on the contig consensus sequences generated by the EST assembly procedure.
Conclusion
The resulting web site is a resource of data and links related to truffle expressed genes. The Sequence Report and Contig Report pages are the web interface core structures which, together with the Text search utility and the Blast utility, allow easy access to the data stored in the database.
doi:10.1186/1471-2105-8-S1-S13
PMCID: PMC1885842  PMID: 17430557
12.  CycADS: an annotation database system to ease the development and update of BioCyc databases 
In recent years, genomes from an increasing number of organisms have been sequenced, but their annotation remains a time-consuming process. The BioCyc databases offer a framework for the integrated analysis of metabolic networks. The Pathway Tools software suite allows the automated construction of a database starting from an annotated genome, but it requires prior integration of all annotations into a specific summary file or into a GenBank file. To allow the easy creation and update of a BioCyc database starting from the multiple genome annotation resources available over time, we have developed an ad hoc data management system that we called Cyc Annotation Database System (CycADS). CycADS is centred on a specific database model and on a set of Java programs to import, filter and export relevant information. Data from GenBank and other annotation sources (including for example: KAAS, PRIAM, Blast2GO and PhylomeDB) are collected into a database to be subsequently filtered and extracted to generate a complete annotation file. This file is then used to build an enriched BioCyc database using the PathoLogic program of Pathway Tools. The CycADS pipeline for annotation management was used to build the AcypiCyc database for the pea aphid (Acyrthosiphon pisum) whose genome was recently sequenced. The AcypiCyc database webpage includes also, for comparative analyses, two other metabolic reconstruction BioCyc databases generated using CycADS: TricaCyc for Tribolium castaneum and DromeCyc for Drosophila melanogaster. Linked to its flexible design, CycADS offers a powerful software tool for the generation and regular updating of enriched BioCyc databases. The CycADS system is particularly suited for metabolic gene annotation and network reconstruction in newly sequenced genomes. Because of the uniform annotation used for metabolic network reconstruction, CycADS is particularly useful for comparative analysis of the metabolism of different organisms.
Database URL: http://www.cycadsys.org
doi:10.1093/database/bar008
PMCID: PMC3072769  PMID: 21474551
13.  Use of RNA structure flexibility data in nanostructure modeling 
Methods (San Diego, Calif.)  2010;54(2):239-250.
In the emerging field of RNA-based nanotechnology there is a need for automation of the structure design process. Our goal is to develop computer methods for aiding in this process. Towards that end, we created the RNAJunction database, which is a repository of RNA junctions, i.e. internal, multi-branch and kissing loops with emanating stem stubs, extracted from the larger RNA structures stored in the PDB database. These junctions can be used as building blocks for nanostructures. Two programs developed in our laboratory, NanoTiler and RNA2D3D, can combine such building blocks with idealized fragments of A-form helices to produce desired 3D nanostructures. Initially, the building blocks are treated as rigid objects and the resulting geometry is tested against the design objectives. Experimental data, however, shows that RNA accommodates its shape to the constraints of larger structural contexts. Therefore we are adding analysis of the flexibility of our building blocks to the full design process. Here we present an example of RNA-based nanostructure design, putting emphasis on the need to characterize the structural flexibility of the building blocks to induce ring closure in the automated exploration. We focus on the use of kissing loops (KL) in nanostructure design, since they have been shown to play an important role in RNA self-assembly. By using an experimentally proven system, the RNA tectosquare, we show that considering the flexibility of the KLs as well as distortions of helical regions may be necessary to achieve a realistic design.
doi:10.1016/j.ymeth.2010.12.010
PMCID: PMC3107926  PMID: 21163354
RNA; Nanostructure; Design; Modeling; Flexibility; Molecular dynamics
14.  Gene models from ESTs (GeneModelEST): an application on the Solanum lycopersicum genome 
BMC Bioinformatics  2007;8(Suppl 1):S9.
Background
The structure annotation of a genome is based either on ab initio methodologies or on similarity searches against molecules that have already been annotated. Ab initio gene predictions in a genome are based on a priori knowledge of species-specific gene features. The training of ab initio gene finders is based on the definition of a data-set of gene models. To accomplish this task, the common approach is to align species-specific full-length cDNA and EST sequences along the genomic sequences in order to define the exon/intron structure of mRNA coding genes.
Results
GeneModelEST is the software here proposed for defining a data-set of candidate gene models using exclusively evidence derived from cDNA/EST sequences.
GeneModelEST requires the genome coordinates of the spliced-alignments of ESTs and of contigs (tentative consensus sequences) generated by an EST clustering/assembling procedure to be formatted in a General Feature Format (GFF) standard file. Moreover, the alignments of the contigs versus a protein database are required as an NCBI BLAST formatted report file.
The GeneModelEST analysis aims to: i) evaluate each exon as defined by contig spliced alignments onto the genome sequence; ii) classify the contigs according to quality levels in order to select candidate gene models; iii) assign preliminary functional annotations to the candidate gene models.
We discuss the application of the proposed methodology to build a data-set of gene models of Solanum lycopersicum, whose genome sequencing is an ongoing effort by the International Tomato Genome Sequencing Consortium.
Conclusion
The contig classification procedure used by GeneModelEST supports the detection of candidate gene models and the identification of potential alternative transcripts, and is useful for filtering out ambiguous information. An automated procedure, such as the one proposed here, is fundamental to support large-scale analyses in order to provide species-specific gene models that could be useful as a training data-set for ab initio gene finders and/or as a reference gene list for a human-curated annotation.
doi:10.1186/1471-2105-8-S1-S9
PMCID: PMC1885861  PMID: 17430576
15.  "TOF2H": A precision toolbox for rapid, high density/high coverage hydrogen-deuterium exchange mass spectrometry via an LC-MALDI approach, covering the data pipeline from spectral acquisition to HDX rate analysis 
BMC Bioinformatics  2008;9:387.
Background
Protein-amide proton hydrogen-deuterium exchange (HDX) is used to investigate protein conformation, conformational changes and surface binding sites for other molecules. To our knowledge, software tools to automate data processing and analysis from sample fractionating (LC-MALDI) mass-spectrometry-based HDX workflows are not publicly available.
Results
An integrated data pipeline (Solvent Explorer/TOF2H) has been developed for the processing of LC-MALDI-derived HDX data. Based on an experiment-wide template, and taking an ab initio approach to chromatographic and spectral peak finding, initial data processing is based on accurate mass-matching to fully deisotoped peaklists, accommodating, in MS/MS-confirmed peptide library searches, ambiguous mass-hits to non-target proteins. Isotope-shift re-interrogation of library search results allows quick assessment of the extent of deuteration from peaklist data alone. During raw spectrum editing, each spectral segment is validated in real time, consistent with the manageable spectral numbers resulting from LC-MALDI experiments. A semi-automated spectral-segment editor includes a semi-automated or automated assessment of the quality of all spectral segments as they are pooled across an XIC peak for summing, centroid mass determination, building of rates plots on-the-fly, and automated back-exchange correction. The resulting deuterium uptake rates plots from various experiments can be averaged, subtracted, re-scaled, error-barred, and/or scatter-plotted from individual spectral segment centroids, compared to solvent-exposure and hydrogen-bonding predictions, and given a color suggestion for 3D visualization. This software lends itself to a "divorced" HDX approach in which MS/MS-confirmed peptide libraries are built via nano or standard ESI without source modification, and HDX is performed via LC-MALDI using a standard MALDI-TOF. The complete TOF2H package includes additional (e.g. LC analysis) modules.
Conclusion
"TOF2H" provides a comprehensive HDX data analysis package that has accelerated the processing of LC-MALDI-based HDX data in the authors' lab from weeks to hours. It runs in a standard MS Windows (XP or Vista) environment, and can be downloaded or obtained from the authors at no cost.
doi:10.1186/1471-2105-9-387
PMCID: PMC2561049  PMID: 18803853
16.  A supersecondary structure library and search algorithm for modeling loops in protein structures 
Nucleic Acids Research  2006;34(7):2085-2097.
We present a fragment-search based method for predicting loop conformations in protein models. A hierarchical and multidimensional database has been set up that currently classifies 105 950 loop fragments and loop flanking secondary structures. Besides the length of the loops and the types of bracing secondary structures, the database is organized along four internal coordinates, a distance and three types of angles characterizing the geometry of stem regions. Candidate fragments are selected from this library by matching the length and the types of bracing secondary structures of the query and satisfying the geometrical restraints of the stems, and are subsequently inserted in the query protein framework, where their fit is assessed by the root mean square deviation (r.m.s.d.) of stem regions and by the number of rigid body clashes with the environment. In the final step, remaining candidate loops are ranked by a Z-score that combines information on sequence similarity and fit of predicted and observed ϕ/ψ main chain dihedral angle propensities. Confidence Z-score cut-offs were determined for each loop length that identify those predicted fragments that outperform a competitive ab initio method. A web server implements the method, regularly updates the fragment library and performs predictions. Predicted segments are returned, or optionally, these can be completed with side chain reconstruction and subsequently annealed in the environment of the query protein by conjugate gradient minimization. The prediction method was tested on artificially prepared search datasets where all trivial sequence similarities on the SCOP superfamily level were removed. Under these conditions it is possible to predict loops of length 4, 8 and 12 with coverage of 98, 78 and 28% and with an r.m.s.d. accuracy of at least 0.22, 1.38 and 2.47 Å, respectively.
In a head-to-head comparison on loops extracted from freshly deposited new protein folds the current method outperformed in a ∼5:1 ratio an earlier developed database search method.
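The selection and ranking procedure described in the abstract can be sketched in a few lines. This is a minimal illustration only: the `Fragment` record, the tolerance on the stem distance and the per-length Z-score cut-off values are all hypothetical stand-ins, not the published library schema or thresholds.

```python
from dataclasses import dataclass

@dataclass
class Fragment:
    length: int            # loop length in residues
    flanks: tuple          # bracing secondary-structure types, e.g. ("H", "E")
    stem_distance: float   # one of the stem-geometry descriptors (Å)
    zscore: float          # combined sequence / dihedral-propensity score

# hypothetical per-length confidence cut-offs (illustrative, not the paper's values)
Z_CUTOFF = {4: 1.0, 8: 1.5, 12: 2.0}

def select_candidates(library, length, flanks, query_distance, tol=1.0):
    """Filter the library by loop length, bracing secondary structures and
    stem geometry, then rank the survivors by Z-score and apply the
    length-dependent confidence cut-off."""
    hits = [f for f in library
            if f.length == length
            and f.flanks == flanks
            and abs(f.stem_distance - query_distance) <= tol]
    hits.sort(key=lambda f: f.zscore, reverse=True)
    cutoff = Z_CUTOFF.get(length, 1.5)
    return [f for f in hits if f.zscore >= cutoff]
```

In the actual method the geometric match involves a distance and three angles, and surviving candidates are further screened by stem r.m.s.d. and clash counts before the final Z-score ranking.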
doi:10.1093/nar/gkl156
PMCID: PMC1440879  PMID: 16617149
17.  SciDBMaker: new software for computer-aided design of specialized biological databases 
BMC Bioinformatics  2008;9:121.
Background
The exponential growth of research in molecular biology has brought concomitant proliferation of databases for stocking its findings. A variety of protein sequence databases exist. While all of these strive for completeness, the range of user interests is often beyond their scope. Large databases covering a broad range of domains tend to offer less detailed information than smaller, more specialized resources, often creating a need to combine data from many sources in order to obtain a complete picture. Scientific researchers are continually developing new specific databases to enhance their understanding of biological processes.
Description
In this article, we present the implementation of a new tool for protein data analysis. With its easy-to-use user interface, this software provides the opportunity to build more specialized protein databases from a universal protein sequence database such as Swiss-Prot. A family of proteins known as bacteriocins is analyzed as 'proof of concept'.
Conclusion
SciDBMaker is stand-alone software that allows the extraction of protein data from the Swiss-Prot database, sequence analysis comprising physicochemical profile calculations, homologous sequences search, multiple sequence alignments and the building of new and more specialized databases. It compiles information with relative ease, updates and compares various data relevant to a given protein family and could solve the problem of dispersed biological search results.
doi:10.1186/1471-2105-9-121
PMCID: PMC2267701  PMID: 18298861
18.  Fitting molecular fragments into electron density 
A number of techniques for the location of small and medium-sized model fragments in experimentally phased electron-density maps are explored. The application of one of these techniques to automated model building is discussed.
Molecular replacement is a powerful tool for the location of large models using structure-factor magnitudes alone. When phase information is available, it becomes possible to locate smaller fragments of the structure ranging in size from a few atoms to a single domain. The calculation is demanding, requiring a six-dimensional rotation and translation search. A number of approaches have been developed to this problem and a selection of these are reviewed in this paper. The application of one of these techniques to the problem of automated model building is explored in more detail, with particular reference to the problem of sequencing a protein main-chain trace.
doi:10.1107/S0907444907033938
PMCID: PMC2394793  PMID: 18094471
model fragments; electron-density maps; model building
19.  Integrated web service for improving alignment quality based on segments comparison 
BMC Bioinformatics  2004;5:98.
Background
Defining blocks forming the global protein structure on the basis of local structural regularity is a very fruitful idea, extensively used in the description and prediction of structure from sequence information alone. Over many years, secondary structure elements have been used as building blocks with great success. Specially prepared sets of possible structural motifs can be used to describe similarity between very distant, non-homologous proteins. The reason for utilizing structural information in the description of proteins is straightforward: structural comparison is able to detect approximately twice as many distant relationships as sequence comparison at the same error rate.
Results
Here we provide a new fragment library for Local Structure Segment (LSS) prediction called FRAGlib, which is integrated with a previously described segment alignment algorithm, SEA. A joined FRAGlib/SEA server provides easy access to both algorithms, allowing a one-stop alignment service using a novel approach to protein sequence alignment based on a network matching approach. FRAGlib used on its own as a secondary structure predictor achieves only 73% accuracy in the Q3 measure, but when combined with the SEA alignment it achieves a significant improvement in pairwise sequence alignment quality, as compared to the previous SEA implementation and other public alignment algorithms. The FRAGlib algorithm takes ~2 min to search the FRAGlib database for a typical query protein with 500 residues. The SEA service aligns two typical proteins within ~5 min. All supplementary materials (detailed results of all the benchmarks, the list of test proteins and the whole fragments library) are available for download on-line at .
Conclusions
The joined FRAGlib/SEA server will be a valuable tool both for molecular biologists working on protein sequence analysis and for bioinformaticians developing computational methods of structure prediction and alignment of proteins.
doi:10.1186/1471-2105-5-98
PMCID: PMC497040  PMID: 15271224
Library of protein motifs; Profile-profile sequence similarity (BLAST; FFAS); Fragments library (FRAGlib); Predicted Local Structure Segments (PLSSs); Segment Alignment (SEA); Network matching problem
20.  A software pipeline for processing and identification of fungal ITS sequences 
Background
Fungi from environmental samples are typically identified to species level through DNA sequencing of the nuclear ribosomal internal transcribed spacer (ITS) region for use in BLAST-based similarity searches in the International Nucleotide Sequence Databases. These searches are time-consuming and regularly require a significant amount of manual intervention and complementary analyses. We here present software – in the form of an identification pipeline for large sets of fungal ITS sequences – developed to automate the BLAST process and several additional analysis steps. The performance of the pipeline was evaluated on a dataset of 350 ITS sequences from fungi growing as epiphytes on building material.
Results
The pipeline was written in Perl and uses a local installation of NCBI-BLAST for the similarity searches of the query sequences. The variable subregion ITS2 of the ITS region is extracted from the sequences and used for additional searches of higher sensitivity. Multiple alignments of each query sequence and its closest matches are computed, and query sequences sharing at least 50% of their best matches are clustered to facilitate the evaluation of hypothetically conspecific groups. The pipeline proved to speed up the processing, as well as enhance the resolution, of the evaluation dataset considerably, and the fungi were found to belong chiefly to the Ascomycota, with Penicillium and Aspergillus as the two most common genera. The ITS2 was found to indicate a different taxonomic affiliation than did the complete ITS region for 10% of the query sequences, though this figure is likely to vary with the taxonomic scope of the query sequences.
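The clustering rule stated above (query sequences sharing at least 50% of their best matches are grouped) can be sketched as single-linkage clustering over best-match overlap. The pipeline itself is written in Perl; this Python sketch is only one plausible reading of the rule, and the overlap definition (here: intersection over the smaller best-match set) is an assumption.

```python
def shared_fraction(a, b):
    """Fraction of the smaller best-match set shared by two queries."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def cluster_queries(best_matches, threshold=0.5):
    """Single-linkage clustering of query IDs whose best BLAST matches
    overlap by at least `threshold` (50% in the pipeline described).
    `best_matches` maps each query ID to its list of best-match IDs."""
    ids = list(best_matches)
    parent = {q: q for q in ids}

    def find(x):  # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if shared_fraction(best_matches[a], best_matches[b]) >= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for q in ids:
        clusters.setdefault(find(q), []).append(q)
    return list(clusters.values())
```

The single-linkage behaviour means that two queries with no direct overlap can still land in the same hypothetically conspecific group via an intermediate query, which matches the intent of grouping candidates for joint evaluation.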
Conclusion
The present software readily assigns large sets of fungal query sequences to their respective best matches in the international sequence databases and places them in a larger biological context. The output is highly structured to be easy to process, although it still needs to be inspected and possibly corrected for the impact of the incomplete and sometimes erroneously annotated fungal entries in these databases. The open source pipeline is available for UNIX-type platforms, and updated releases of the target database are made available biweekly. The pipeline is easily modified to operate on other molecular regions and organism groups.
doi:10.1186/1751-0473-4-1
PMCID: PMC2649129  PMID: 19146660
21.  FragIdent – Automatic identification and characterisation of cDNA-fragments 
BMC Genomics  2009;10:95.
Background
Many genetic studies and functional assays are based on cDNA fragments. After the generation of cDNA fragments from an mRNA sample, their content is at first unknown and must be assigned by sequencing reactions or hybridisation experiments.
Even in characterised libraries, a considerable number of clones are wrongly annotated. Furthermore, mix-ups can happen in the laboratory. It is therefore essential to the relevance of experimental results to confirm or determine the identity of the employed cDNA fragments. However, the manual approach for the characterisation of these fragments using BLAST web interfaces is not suited to larger numbers of sequences, and so far no user-friendly software has been publicly available.
Results
Here we present the development of FragIdent, an application for the automatic identification of open reading frames (ORFs) within cDNA-fragments. The software performs BLAST analyses to identify the genes represented by the sequences and suggests primers to complete the sequencing of the whole insert. Gene-specific information as well as the protein domains encoded by the cDNA fragment are retrieved from Internet-based databases and included in the output. The application features an intuitive graphical interface and is designed for researchers without any bioinformatics skills. It is suited for projects comprising up to several hundred different clones.
Conclusion
We used FragIdent to identify 84 cDNA clones from a yeast two-hybrid experiment. Furthermore, we identified 131 protein domains within our analysed clones. The source code is freely available from our homepage at .
doi:10.1186/1471-2164-10-95
PMCID: PMC2672089  PMID: 19254371
22.  Toward the automated generation of genome-scale metabolic networks in the SEED 
BMC Bioinformatics  2007;8:139.
Background
Current methods for the automated generation of genome-scale metabolic networks focus on genome annotation and preliminary biochemical reaction network assembly, but do not adequately address the process of identifying and filling gaps in the reaction network, and verifying that the network is suitable for systems level analysis. Thus, current methods are only sufficient for generating draft-quality networks, and refinement of the reaction network is still largely a manual, labor-intensive process.
Results
We have developed a method for generating genome-scale metabolic networks that produces substantially complete reaction networks, suitable for systems level analysis. Our method partitions the reaction space of central and intermediary metabolism into discrete, interconnected components that can be assembled and verified in isolation from each other, and then integrated and verified at the level of their interconnectivity. We have developed a database of components that are common across organisms, and have created tools for automatically assembling appropriate components for a particular organism based on the metabolic pathways encoded in the organism's genome. This focuses manual efforts on that portion of an organism's metabolism that is not yet represented in the database. We have demonstrated the efficacy of our method by reverse-engineering and automatically regenerating the reaction network from a published genome-scale metabolic model for Staphylococcus aureus. Additionally, we have verified that our method capitalizes on the database of common reaction network components created for S. aureus, by using these components to generate substantially complete reconstructions of the reaction networks from three other published metabolic models (Escherichia coli, Helicobacter pylori, and Lactococcus lactis). We have implemented our tools and database within the SEED, an open-source software environment for comparative genome annotation and analysis.
Conclusion
Our method sets the stage for the automated generation of substantially complete metabolic networks for over 400 complete genome sequences currently in the SEED. With each genome that is processed using our tools, the database of common components grows to cover more of the diversity of metabolic pathways. This increases the likelihood that components of reaction networks for subsequently processed genomes can be retrieved from the database, rather than assembled and verified manually.
doi:10.1186/1471-2105-8-139
PMCID: PMC1868769  PMID: 17462086
23.  GPCR-SSFE: A comprehensive database of G-protein-coupled receptor template predictions and homology models 
BMC Bioinformatics  2011;12:185.
Background
G protein-coupled receptors (GPCRs) transduce a wide variety of extracellular signals to within the cell and therefore have a key role in regulating cell activity and physiological function. GPCR malfunction is responsible for a wide range of diseases including cancer, diabetes and hyperthyroidism and a large proportion of drugs on the market target these receptors. The three dimensional structure of GPCRs is important for elucidating the molecular mechanisms underlying these diseases and for performing structure-based drug design. Although structural data are restricted to only a handful of GPCRs, homology models can be used as a proxy for those receptors not having crystal structures. However, many researchers working on GPCRs are not experienced homology modellers and are therefore unable to benefit from the information that can be gleaned from such three-dimensional models. Here, we present a comprehensive database called the GPCR-SSFE, which provides initial homology models of the transmembrane helices for a large variety of family A GPCRs.
Description
Extending on our previous theoretical work, we have developed an automated pipeline for GPCR homology modelling and applied it to a large set of family A GPCR sequences. Our pipeline is a fragment-based approach that exploits available family A crystal structures. The GPCR-SSFE database stores the template predictions, sequence alignments, identified sequence and structure motifs and homology models for 5025 family A GPCRs. Users are able to browse the GPCR dataset according to their pharmacological classification or search for results using a UniProt entry name. It is also possible for a user to submit a GPCR sequence that is not contained in the database for analysis and homology model building. The models can be viewed using a Jmol applet and are also available for download along with the alignments.
Conclusions
The data provided by GPCR-SSFE are useful for investigating general and detailed sequence-structure-function relationships of GPCRs, performing structure-based drug design and for better understanding the molecular mechanisms underlying disease-associated mutations in GPCRs. The effectiveness of our multiple template and fragment approach is demonstrated by the accuracy of our predicted homology models compared to recently published crystal structures.
doi:10.1186/1471-2105-12-185
PMCID: PMC3113946  PMID: 21605354
24.  Gene-Boosted Assembly of a Novel Bacterial Genome from Very Short Reads 
PLoS Computational Biology  2008;4(9):e1000186.
Recent improvements in technology have made DNA sequencing dramatically faster and more efficient than ever before. The new technologies produce highly accurate sequences, but one drawback is that the most efficient technology produces the shortest read lengths. Short-read sequencing has been applied successfully to resequence the human genome and those of other species but not to whole-genome sequencing of novel organisms. Here we describe the sequencing and assembly of a novel clinical isolate of Pseudomonas aeruginosa, strain PAb1, using very short read technology. From 8,627,900 reads, each 33 nucleotides in length, we assembled the genome into one scaffold of 76 ordered contiguous sequences containing 6,290,005 nucleotides, including one contig spanning 512,638 nucleotides, plus an additional 436 unordered contigs containing 416,897 nucleotides. Our method includes a novel gene-boosting algorithm that uses amino acid sequences from predicted proteins to build a better assembly. This study demonstrates the feasibility of very short read sequencing for the sequencing of bacterial genomes, particularly those for which a related species has been sequenced previously, and expands the potential application of this new technology to most known prokaryotic species.
Author Summary
In this paper we demonstrate that a bacterial genome, Pseudomonas aeruginosa, can be decoded using very short DNA sequences, namely, those produced by the newest generation of DNA sequencers such as the Solexa sequencer from Illumina. Our method includes a novel algorithm that uses the protein sequences from other species to assist the assembly of the new genome. This algorithm breaks up the genome into gene-sized chunks that can be put back together relatively easily, even from sequence fragments as short as 30 bases of DNA. We also take advantage of the genomes of related species, using them as reference strains to assist the assembly. By combining these and other techniques, we were able to assemble 94% of the 6.7 million bases of P. aeruginosa into just 76 large pieces. The remaining 6% is contained in 436 smaller fragments. We have made all of our software available for free under open-source licenses, and we have deposited the newly assembled genome in the public GenBank database.
doi:10.1371/journal.pcbi.1000186
PMCID: PMC2529408  PMID: 18818729
25.  FragmentStore—a comprehensive database of fragments linking metabolites, toxic molecules and drugs 
Nucleic Acids Research  2010;39(Database issue):D1049-D1054.
Consideration of biomolecules in terms of their molecular building blocks provides valuable new information regarding their synthesis, degradation and similarity. Here, we present the FragmentStore, a resource for the comparison of fragments found in metabolites, drugs or toxic compounds. Starting from 13 000 metabolites, 16 000 drugs and 2200 toxic compounds we generated 35 000 different building blocks (fragments), which are not only relevant to their biosynthesis and degradation but also provide important information regarding side-effects and toxicity. The FragmentStore provides a variety of search options such as 2D structure, molecular weight, rotatable bonds, etc. Various analysis tools have been implemented including the calculation of amino acid preferences of fragments’ binding sites, classification of fragments based on the enzyme classification class of the enzyme(s) they bind to and small molecule library generation via a fragment-assembler tool. Using the FragmentStore, it is now possible to identify the common fragments of different classes of molecules and generate hypotheses about the effects of such intersections. For instance, the co-occurrence of fragments in different drugs may indicate similar targets and possible off-target interactions whereas the co-occurrence of fragments in a drug and a toxic compound/metabolite could be indicative of side-effects. The database is publicly available at: http://bioinformatics.charite.de/fragment_store.
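The co-occurrence analysis described above (shared fragments between, say, drugs and toxic compounds hinting at side-effects) reduces to intersecting per-molecule fragment sets across two classes. This is an illustrative sketch only; the molecule names, fragment identifiers and flat-dictionary layout are hypothetical, not the FragmentStore data model.

```python
def cooccurring_fragments(frags_a, frags_b):
    """Find fragments shared between molecules of two classes
    (e.g. drugs vs. toxic compounds). Each argument maps a molecule
    ID to its collection of fragment IDs; the result maps each
    (molecule_a, molecule_b) pair to the fragments they share."""
    shared = {}
    for mol_a, fa in frags_a.items():
        fa = set(fa)
        for mol_b, fb in frags_b.items():
            common = fa & set(fb)
            if common:
                shared[(mol_a, mol_b)] = common
    return shared
```

A pair appearing in the result is only a hypothesis generator: a fragment common to a drug and a toxic compound flags a possible shared binding mode worth investigating, not an established off-target interaction.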
doi:10.1093/nar/gkq969
PMCID: PMC3013803  PMID: 20965964