Results 1-25 (415)
1.  RS-WebPredictor: a server for predicting CYP-mediated sites of metabolism on drug-like molecules 
Bioinformatics  2012;29(4):497-498.
Summary: Regioselectivity-WebPredictor (RS-WebPredictor) is a server that predicts isozyme-specific cytochrome P450 (CYP)-mediated sites of metabolism (SOMs) on drug-like molecules. Predictions may be made for the promiscuous 2C9, 2D6 and 3A4 CYP isozymes, as well as CYPs 1A2, 2A6, 2B6, 2C8, 2C19 and 2E1. RS-WebPredictor is the first freely accessible server that predicts the regioselectivity of the last six isozymes. Server execution time is fast, taking on average 2s to encode a submitted molecule and 1s to apply a given model, allowing for high-throughput use in lead optimization projects.
Availability: RS-WebPredictor is accessible for free use at http://reccr.chem.rpi.edu/Software/RS-WebPredictor/
Contact: brenec@rpi.edu
doi:10.1093/bioinformatics/bts705
PMCID: PMC3570214  PMID: 23242264
2.  A novel link prediction algorithm for reconstructing protein–protein interaction networks by topological similarity 
Bioinformatics  2012;29(3):355-364.
Motivation: Recent advances in technology have dramatically increased the availability of protein–protein interaction (PPI) data and stimulated the development of many methods for improving the systems-level understanding of the cell. However, those efforts have been significantly hindered by the high level of noise, sparseness and highly skewed degree distribution of PPI networks. Here, we present a novel algorithm to reduce the noise present in PPI networks. The key idea of our algorithm is that two proteins sharing some higher-order topological similarities, measured by a novel random walk-based procedure, are likely to interact with each other and may belong to the same protein complex.
Results: Applying our algorithm to a yeast PPI network, we found that the edges in the reconstructed network have higher biological relevance than in the original network, assessed by multiple types of information, including gene ontology, gene expression, essentiality, conservation between species and known protein complexes. Comparison with existing methods shows that the network reconstructed by our method has the highest quality. Using two independent graph clustering algorithms, we found that the reconstructed network has resulted in significantly improved prediction accuracy of protein complexes. Furthermore, our method is applicable to PPI networks obtained with different experimental systems, such as affinity purification, yeast two-hybrid (Y2H) and protein-fragment complementation assay (PCA), and evidence shows that the predicted edges are likely bona fide physical interactions. Finally, an application to a human PPI network increased the coverage of the network by at least 100%.
Availability: www.cs.utsa.edu/∼jruan/RWS/.
Contact: Jianhua.Ruan@utsa.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/bts688
PMCID: PMC3562060  PMID: 23235927
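The key step in entry 2 is scoring higher-order topological similarity between protein pairs with a random walk-based procedure. The abstract does not give the exact formulation, so the following is only a minimal sketch of the general idea using random walk with restart on an adjacency matrix (numpy assumed); the restart probability, the cosine similarity measure and the toy network are illustrative assumptions, not the authors' method.

    import numpy as np

    def rwr_profiles(A, restart=0.5, n_iter=50):
        """Random walk with restart from every node; rows are diffusion profiles."""
        # column-normalize the adjacency matrix into a transition matrix
        W = A / np.maximum(A.sum(axis=0, keepdims=True), 1e-12)
        P = np.eye(A.shape[0])          # current probability vectors (one per start node)
        E = np.eye(A.shape[0])          # restart distributions
        for _ in range(n_iter):
            P = (1 - restart) * W @ P + restart * E
        return P.T                       # row i = steady-state profile of walks from node i

    def topological_similarity(A):
        """Cosine similarity between diffusion profiles (an illustrative choice)."""
        P = rwr_profiles(A)
        norms = np.linalg.norm(P, axis=1, keepdims=True)
        return (P @ P.T) / np.maximum(norms @ norms.T, 1e-12)

    # toy network: high-scoring non-edges would be candidate interactions
    A = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]], float)
    S = topological_similarity(A)
    print(np.round(S, 2))
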
3.  Concerning the accuracy of Fido and parameter choice 
Bioinformatics  2012;29(3):412.
Contact: Oliver.Serang@Childrens.Harvard.edu
doi:10.1093/bioinformatics/bts687
PMCID: PMC3562061  PMID: 23193221
4.  Glycosylation Network Analysis Toolbox: a MATLAB-based environment for systems glycobiology 
Bioinformatics  2012;29(3):404-406.
Summary: Systems glycobiology studies the interaction of various pathways that regulate glycan biosynthesis and function. Software tools for the construction and analysis of such pathways are not yet available. We present GNAT, a platform-independent, user-extensible MATLAB-based toolbox that provides an integrated computational environment to construct, manipulate and simulate glycans and their networks. It enables integration of XML-based glycan structure data into SBML (Systems Biology Markup Language) files that describe glycosylation reaction networks. Curation and manipulation of networks is facilitated using class definitions and glycomics database query tools. High quality visualization of networks and their steady-state and dynamic simulation are also supported.
Availability: The software package, including source code, help documentation and demonstrations, is available at http://sourceforge.net/projects/gnatmatlab/files/.
Contact: neel@buffalo.edu or gangliu@buffalo.edu
doi:10.1093/bioinformatics/bts703
PMCID: PMC3562062  PMID: 23230149
5.  PAIR: paired allelic log-intensity-ratio-based normalization method for SNP-CGH arrays 
Bioinformatics  2012;29(3):299-307.
Motivation: Normalization is critical in DNA copy number analysis. We propose a new method to correctly identify two-copy probes from the genome to obtain representative references for normalization in single nucleotide polymorphism arrays. The method is based on a two-state Hidden Markov Model. Unlike most currently available methods in the literature, the proposed method does not need to assume that two-copy state probes are dominant in the genome, as long as some two-copy probes exist.
Results: The real data analysis and simulation study show that the proposed algorithm is successful in that (i) it performs as well as the current methods (e.g. CGHnormaliter and popLowess) for samples with dominant two-copy states and outperforms these methods for samples with less dominant two-copy states; (ii) it can identify the copy-neutral loss of heterozygosity; and (iii) it is efficient in terms of the computational time used.
Availability: R scripts are available at http://publichealth.lsuhsc.edu/PAIR.html.
Contact: zfang@lsuhsc.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/bts683
PMCID: PMC3562063  PMID: 23196989
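Entry 5 rests on a two-state hidden Markov model that labels probes as two-copy or not before normalization. The sketch below is a generic Gaussian-emission two-state HMM decoded with the Viterbi algorithm; the emission parameters, transition probabilities and the final median-based scaling are illustrative assumptions, not the published PAIR model.

    import numpy as np

    def viterbi_two_state(x, means=(0.0, 0.6), sds=(0.2, 0.3), stay=0.99):
        """Label each probe 0 (two-copy) or 1 (altered) from log-intensity ratios x."""
        x = np.asarray(x, float)
        mu, sd = np.array(means), np.array(sds)
        # log emission probabilities under each state's Gaussian
        logB = -0.5 * ((x[:, None] - mu) / sd) ** 2 - np.log(sd * np.sqrt(2 * np.pi))
        logA = np.log(np.array([[stay, 1 - stay], [1 - stay, stay]]))
        V = np.zeros_like(logB)
        back = np.zeros(logB.shape, int)
        V[0] = np.log([0.5, 0.5]) + logB[0]
        for t in range(1, len(x)):
            scores = V[t - 1][:, None] + logA        # score of moving prev state -> current state
            back[t] = scores.argmax(axis=0)
            V[t] = scores.max(axis=0) + logB[t]
        path = [V[-1].argmax()]
        for t in range(len(x) - 1, 0, -1):
            path.append(back[t][path[-1]])
        return np.array(path[::-1])

    # simulated log-intensity ratios: 200 two-copy probes followed by 50 gained probes
    lrr = np.concatenate([np.random.normal(0, 0.2, 200), np.random.normal(0.6, 0.3, 50)])
    states = viterbi_two_state(lrr)
    normalized = lrr - np.median(lrr[states == 0])   # centre on the inferred two-copy probes
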
6.  nestly—a framework for running software with nested parameter choices and aggregating results 
Bioinformatics  2012;29(3):387-388.
Summary: The execution of a software application or pipeline using various combinations of parameters and inputs is a common task in bioinformatics. In the absence of a specialized tool to organize, streamline and formalize this process, scientists must frequently write complex scripts to perform these tasks. We present nestly, a Python package to facilitate running tools with nested combinations of parameters and inputs. nestly provides three components. First, a module to build nested directory structures corresponding to choices of parameters. Second, the nestrun script to run a given command using each set of parameter choices. Third, the nestagg script to aggregate results of the individual runs into a CSV file, as well as support for more complex aggregation. We also include a module for easily specifying nested dependencies for the SCons build tool, enabling incremental builds.
Availability: Source, documentation and tutorial examples are available at http://github.com/fhcrc/nestly. nestly can be installed from the Python Package Index via pip; it is open source (MIT license).
Contact: cmccoy@fhcrc.org or matsen@fhcrc.org
doi:10.1093/bioinformatics/bts696
PMCID: PMC3562064  PMID: 23220574
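Entry 6 automates a pattern that is easy to picture in plain Python: enumerate the cross-product of parameter choices, give each combination its own nested directory, run the command there, and concatenate the per-run outputs. The sketch below shows that pattern with the standard library only; the parameter grid, directory names, control-file format, the placeholder command "mytool" and the "result.csv" aggregation are illustrative assumptions, not nestly's actual API.

    import csv, itertools, json, pathlib, subprocess

    params = {"kmer": [21, 31], "coverage": [10, 30]}   # hypothetical parameter grid
    base = pathlib.Path("runs")

    # build one nested directory per parameter combination and record its settings
    for combo in itertools.product(*params.values()):
        setting = dict(zip(params.keys(), combo))
        d = base.joinpath(*[f"{k}-{v}" for k, v in setting.items()])
        d.mkdir(parents=True, exist_ok=True)
        d.joinpath("control.json").write_text(json.dumps(setting))
        # run the tool in that directory (the command below is a placeholder)
        # subprocess.run(["mytool", "--kmer", str(setting["kmer"])], cwd=d, check=True)

    # aggregate per-run result.csv files into one table, tagging each row with its settings
    with open("aggregated.csv", "w", newline="") as out:
        writer = None
        for ctl in base.rglob("control.json"):
            setting = json.loads(ctl.read_text())
            result = ctl.with_name("result.csv")
            if not result.exists():
                continue
            for row in csv.DictReader(result.open()):
                row.update(setting)
                if writer is None:
                    writer = csv.DictWriter(out, fieldnames=row.keys())
                    writer.writeheader()
                writer.writerow(row)
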
7.  Multistructural hot spot characterization with FTProd 
Bioinformatics  2012;29(3):393-394.
Summary: Computational solvent fragment mapping is typically performed on a single structure of a protein to identify and characterize binding sites. However, the simultaneous analysis of several mutant structures or frames of a molecular dynamics simulation may provide more realistic detail about the behavior of the sites. Here we present a plug-in for Visual Molecular Dynamics that streamlines the comparison of the binding configurations of several FTMAP-generated structures.
Availability: FTProd is a freely available and open-source plug-in that can be downloaded at http://amarolab.ucsd.edu/ftprod
Contact: ramaro@ucsd.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/bts689
PMCID: PMC3562065  PMID: 23202744
8.  Scribl: an HTML5 Canvas-based graphics library for visualizing genomic data over the web 
Bioinformatics  2012;29(3):381-383.
Motivation: High-throughput biological research requires simultaneous visualization as well as analysis of genomic data, e.g. read alignments, variant calls and genomic annotations. Traditionally, such integrative analysis required desktop applications operating on locally stored data. Many current terabyte-size datasets generated by large public consortia projects, however, are already only feasibly stored at specialist genome analysis centers. As even small laboratories can afford very large datasets, local storage and analysis are becoming increasingly limiting, and it is likely that most such datasets will soon be stored remotely, e.g. in the cloud. These developments will require web-based tools that enable users to access, analyze and view vast remotely stored data with a level of sophistication and interactivity that approximates desktop applications. As rapidly dropping cost enables researchers to collect data intended to answer questions in very specialized contexts, developers must also provide software libraries that empower users to implement customized data analyses and data views for their particular application. Such specialized, yet lightweight, applications would empower scientists to better answer specific biological questions than is possible with currently available general-purpose genome browsers.
Results: Using recent advances in core web technologies (HTML5), we developed Scribl, a flexible genomic visualization library specifically targeting coordinate-based data such as genomic features, DNA sequence and genetic variants. Scribl simplifies the development of sophisticated web-based graphical tools that approach the dynamism and interactivity of desktop applications.
Availability and implementation: Software is freely available online at http://chmille4.github.com/Scribl/ and is implemented in JavaScript with all modern browsers supported.
Contact: gabor.marth@bc.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/bts677
PMCID: PMC3562066  PMID: 23172864
9.  RetroSeq: transposable element discovery from next-generation sequencing data 
Bioinformatics  2012;29(3):389-390.
Summary: A significant proportion of eukaryote genomes consist of transposable element (TE)-derived sequence. These elements are known to have the capacity to modulate gene function and genome evolution. We have developed RetroSeq for detecting non-reference TE insertions from Illumina paired-end whole-genome sequencing data. We evaluate RetroSeq on a human trio from the 1000 Genomes Project, showing that it produces highly accurate TE calls.
Availability: RetroSeq is open source and available from https://github.com/tk2/RetroSeq.
Contact: tk2@sanger.ac.uk
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/bts697
PMCID: PMC3562067  PMID: 23233656
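Entry 9 calls non-reference TE insertions from paired-end data, essentially by collecting read pairs in which one mate anchors in the reference while the other looks TE-derived, then clustering the anchors. The sketch below illustrates that discordant-pair idea with pysam; treating TE sequences as extra contigs in the BAM, the mapping-quality cutoff, window size and support threshold are all illustrative assumptions, not RetroSeq's actual criteria.

    import pysam
    from collections import defaultdict

    def candidate_te_sites(bam_path, te_names, window=500, min_mapq=30, min_support=5):
        """Cluster anchor reads whose mates map to TE contigs (illustrative criteria)."""
        anchors = defaultdict(list)                     # chromosome -> anchor positions
        with pysam.AlignmentFile(bam_path, "rb") as bam:
            for read in bam.fetch():
                if read.is_unmapped or read.mate_is_unmapped:
                    continue
                if read.mapping_quality < min_mapq:
                    continue
                # mate maps to a TE contig while this read anchors in the reference
                if read.next_reference_name in te_names and read.reference_name not in te_names:
                    anchors[read.reference_name].append(read.reference_start)
        calls = []
        for chrom, positions in anchors.items():
            positions.sort()
            cluster = [positions[0]]
            for p in positions[1:]:
                if p - cluster[-1] <= window:
                    cluster.append(p)
                else:
                    if len(cluster) >= min_support:
                        calls.append((chrom, cluster[0], cluster[-1], len(cluster)))
                    cluster = [p]
            if len(cluster) >= min_support:
                calls.append((chrom, cluster[0], cluster[-1], len(cluster)))
        return calls
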
10.  NURBS: a database of experimental and predicted nuclear receptor binding sites of mouse 
Bioinformatics  2012;29(2):295-297.
Summary: Nuclear receptors (NRs) are a class of transcription factors playing important roles in various biological processes. An NR often impacts numerous genes, and different NRs share overlapping target networks. To fulfil the need for a database that incorporates binding sites of different NRs under various conditions, enabling easy comparison and visualization to improve our understanding of NR binding mechanisms, we have developed NURBS, a database of experimental and predicted nuclear receptor binding sites of mouse. NURBS currently contains binding sites across the whole mouse genome for 8 NRs, identified in 40 chromatin immunoprecipitation with massively parallel DNA sequencing experiments. All datasets are processed using a widely used procedure and the same statistical criteria to ensure that binding sites derived from different datasets are comparable. NURBS also provides binding sites predicted using NR-HMM, a Hidden Markov Model (HMM).
Availability: The GBrowse-based user interface of NURBS is freely accessible at http://shark.abl.ku.edu/nurbs/. NR-HMM and all results can be downloaded for free at the website.
Contact: jwfang@ku.edu
doi:10.1093/bioinformatics/bts693
PMCID: PMC3546791  PMID: 23196988
11.  VirusSeq: software to identify viruses and their integration sites using next-generation sequencing of human cancer tissue 
Bioinformatics  2012;29(2):266-267.
Summary: We developed a new algorithmic method, VirusSeq, for detecting known viruses and their integration sites in the human genome using next-generation sequencing data. We evaluated VirusSeq on whole-transcriptome sequencing (RNA-Seq) data of 256 human cancer samples from The Cancer Genome Atlas. Using these data, we showed that VirusSeq accurately detects the known viruses and their integration sites with high sensitivity and specificity. VirusSeq can also perform this function using whole-genome sequencing data of human tissue.
Availability: VirusSeq has been implemented in PERL and is available at http://odin.mdacc.tmc.edu/∼xsu1/VirusSeq.html.
Contact: xsu1@mdanderson.org
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/bts665
PMCID: PMC3546792  PMID: 23162058
12.  Defining and predicting structurally conserved regions in protein superfamilies 
Bioinformatics  2012;29(2):175-181.
Motivation: The structures of homologous proteins are generally better conserved than their sequences. This phenomenon is demonstrated by the prevalence of structurally conserved regions (SCRs) even in highly divergent protein families. Defining SCRs requires the comparison of two or more homologous structures and is affected by their availability and divergence, and our ability to deduce structurally equivalent positions among them. In the absence of multiple homologous structures, it is necessary to predict SCRs of a protein using information from only a set of homologous sequences and (if available) a single structure. Accurate SCR predictions can benefit homology modelling and sequence alignment.
Results: Using pairwise DaliLite alignments among a set of homologous structures, we devised a simple measure of structural conservation, termed structural conservation index (SCI). SCI was used to distinguish SCRs from non-SCRs. A database of SCRs was compiled from 386 SCOP superfamilies containing 6489 protein domains. Artificial neural networks were then trained to predict SCRs with various features deduced from a single structure and homologous sequences. Assessment of the predictions via a 5-fold cross-validation method revealed that predictions based on features derived from a single structure perform similarly to ones based on homologous sequences, while combining sequence and structural features was optimal in terms of accuracy (0.755) and Matthews correlation coefficient (0.476). These results suggest that even without information from multiple structures, it is still possible to effectively predict SCRs for a protein. Finally, inspection of the structures with the worst predictions pinpoints difficulties in SCR definitions.
Availability: The SCR database and the prediction server can be found at http://prodata.swmed.edu/SCR.
Contact: 91huangi@gmail.com or grishin@chop.swmed.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/bts682
PMCID: PMC3546793  PMID: 23193223
13.  A system for exact and approximate genetic linkage analysis of SNP data in large pedigrees 
Bioinformatics  2012;29(2):197-205.
Motivation: The use of dense single nucleotide polymorphism (SNP) data in genetic linkage analysis of large pedigrees is impeded by significant technical, methodological and computational challenges. Here we describe Superlink-Online SNP, a new powerful online system that streamlines the linkage analysis of SNP data. It features a fully integrated flexible processing workflow comprising both well-known and novel data analysis tools, including SNP clustering, erroneous data filtering, exact and approximate LOD calculations and maximum-likelihood haplotyping. The system draws its power from thousands of CPUs, performing data analysis tasks orders of magnitude faster than a single computer. By providing an intuitive interface to sophisticated state-of-the-art analysis tools coupled with high computing capacity, Superlink-Online SNP helps geneticists unleash the potential of SNP data for detecting disease genes.
Results: Computations performed by Superlink-Online SNP are automatically parallelized using novel paradigms, and executed on an unlimited number of private or public CPUs. One novel service is large-scale approximate Markov chain Monte Carlo (MCMC) analysis. The accuracy of the results is reliably estimated by running the same computation on multiple CPUs and evaluating the Gelman–Rubin score to set aside unreliable results. Another service within the workflow is a novel parallelized exact algorithm for maximum-likelihood haplotyping. The reported system enables genetic analyses that were previously infeasible. We demonstrate the system's capabilities through a study of a large complex pedigree affected with metabolic syndrome.
Availability: Superlink-Online SNP is freely available for researchers at http://cbl-hap.cs.technion.ac.il/superlink-snp. The system source code can also be downloaded from the system website.
Contact: omerw@cs.technion.ac.il
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/bts658
PMCID: PMC3546794  PMID: 23162081
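Entry 13 flags unreliable MCMC results by running the same computation as multiple independent chains and evaluating the Gelman–Rubin score. A minimal sketch of that convergence check is below (numpy assumed); the acceptance threshold of 1.1 is a common rule of thumb, not necessarily the cutoff used by Superlink-Online SNP.

    import numpy as np

    def gelman_rubin(chains):
        """Potential scale reduction factor R-hat for a list of equal-length chains."""
        chains = np.asarray(chains, float)       # shape: (m chains, n samples)
        m, n = chains.shape
        chain_means = chains.mean(axis=1)
        W = chains.var(axis=1, ddof=1).mean()    # mean within-chain variance
        B = n * chain_means.var(ddof=1)          # between-chain variance
        var_hat = (n - 1) / n * W + B / n
        return np.sqrt(var_hat / W)

    # accept the estimate only if the chains appear to have converged
    chains = np.random.normal(size=(4, 1000))
    r_hat = gelman_rubin(chains)
    reliable = r_hat < 1.1
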
14.  High-throughput microbial population genomics using the Cortex variation assembler 
Bioinformatics  2012;29(2):275-276.
Summary: We have developed a software package, Cortex, designed for the analysis of genetic variation by de novo assembly of multiple samples. This allows direct comparison of samples without using a reference genome as an intermediate and incorporates discovery and genotyping of single-nucleotide polymorphisms, indels and larger events in a single framework. We introduce pipelines which simplify the analysis of microbial samples and increase discovery power; these also enable the construction of a graph of known sequence and variation in a species, against which new samples can be compared rapidly. We demonstrate its ease of use and power by reproducing the results of studies using both long and short reads.
Availability: http://cortexassembler.sourceforge.net (GPLv3 license).
Contact: zam@well.ox.ac.uk, mcvean@well.ox.ac.uk
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/bts673
PMCID: PMC3546798  PMID: 23172865
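Entry 14 compares samples directly through their assemblies rather than through a reference. The toy sketch below captures only the simplest piece of that idea: build the k-mer set of each sample and flag k-mers private to one sample as a crude, reference-free signal of variation. Real coloured de Bruijn graph analysis, as in Cortex, tracks graph structure and bubbles, which this deliberately omits; the read data are invented.

    def kmers(seq, k=21):
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}

    def private_kmers(samples, k=21):
        """k-mers seen in exactly one sample (a crude reference-free variation signal)."""
        sets = {name: set() for name in samples}
        for name, reads in samples.items():
            for read in reads:
                sets[name] |= kmers(read, k)
        out = {}
        for name in samples:
            others = set().union(*(sets[o] for o in samples if o != name))
            out[name] = sets[name] - others
        return out

    samples = {
        "sampleA": ["ACGTACGTACGTACGTACGTACGTA"],
        "sampleB": ["ACGTACGTACGTTCGTACGTACGTA"],   # one substitution relative to sampleA
    }
    print({n: len(s) for n, s in private_kmers(samples).items()})
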
15.  STAR: ultrafast universal RNA-seq aligner 
Bioinformatics  2012;29(1):15-21.
Motivation: Accurate alignment of high-throughput RNA-seq data is a challenging and yet unsolved problem because of the non-contiguous transcript structure, relatively short read lengths and constantly increasing throughput of the sequencing technologies. Currently available RNA-seq aligners suffer from high mapping error rates, low mapping speed, read length limitation and mapping biases.
Results: To align our large (>80 billion reads) ENCODE Transcriptome RNA-seq dataset, we developed the Spliced Transcripts Alignment to a Reference (STAR) software based on a previously undescribed RNA-seq alignment algorithm that uses sequential maximum mappable seed search in uncompressed suffix arrays followed by a seed clustering and stitching procedure. STAR outperforms other aligners by a factor of >50 in mapping speed, aligning to the human genome 550 million 2 × 76 bp paired-end reads per hour on a modest 12-core server, while at the same time improving alignment sensitivity and precision. In addition to unbiased de novo detection of canonical junctions, STAR can discover non-canonical splices and chimeric (fusion) transcripts, and is also capable of mapping full-length RNA sequences. Using Roche 454 sequencing of reverse transcription polymerase chain reaction amplicons, we experimentally validated 1960 novel intergenic splice junctions with an 80–90% success rate, corroborating the high precision of the STAR mapping strategy.
Availability and implementation: STAR is implemented as a standalone C++ code. STAR is free open source software distributed under GPLv3 license and can be downloaded from http://code.google.com/p/rna-star/.
Contact: dobin@cshl.edu.
doi:10.1093/bioinformatics/bts635
PMCID: PMC3530905  PMID: 23104886
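The core operation described in entry 15 is the sequential search for the maximal mappable prefix of a read in an uncompressed suffix array. The sketch below demonstrates that primitive on a plain Python suffix array for a small reference; it illustrates the search idea only, is not STAR's C++ implementation, and omits the seed clustering and stitching stage entirely. The toy reference and read are invented.

    import bisect

    def build_suffix_array(text):
        return sorted(range(len(text)), key=lambda i: text[i:])

    def maximal_mappable_prefix(read, text, sa):
        """Longest prefix of `read` found in `text`, via binary search on the suffix array."""
        suffixes = [text[i:] for i in sa]
        best = 0
        for L in range(1, len(read) + 1):
            q = read[:L]
            lo = bisect.bisect_left(suffixes, q)
            # does some suffix start with the current prefix?
            if lo < len(suffixes) and suffixes[lo].startswith(q):
                best = L
            else:
                break
        return read[:best]

    ref = "ACGTACGTGGGTTTACGT"
    sa = build_suffix_array(ref)
    print(maximal_mappable_prefix("ACGTGGGAAA", ref, sa))   # prints "ACGTGGG"
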
16.  Binary Interval Search: a scalable algorithm for counting interval intersections 
Bioinformatics  2012;29(1):1-7.
Motivation: The comparison of diverse genomic datasets is fundamental to understanding genome biology. Researchers must explore many large datasets of genome intervals (e.g. genes, sequence alignments) to place their experimental results in a broader context and to make new discoveries. Relationships between genomic datasets are typically measured by identifying intervals that intersect, that is, they overlap and thus share a common genome interval. Given the continued advances in DNA sequencing technologies, efficient methods for measuring statistically significant relationships between many sets of genomic features are crucial for future discovery.
Results: We introduce the Binary Interval Search (BITS) algorithm, a novel and scalable approach to interval set intersection. We demonstrate that BITS outperforms existing methods at counting interval intersections. Moreover, we show that BITS is intrinsically suited to parallel computing architectures, such as graphics processing units, by illustrating its utility for efficient Monte Carlo simulations measuring the significance of relationships between sets of genomic intervals.
Availability: https://github.com/arq5x/bits.
Contact: arq5x@virginia.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/bts652
PMCID: PMC3530906  PMID: 23129298
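The counting trick behind entry 16 can be stated compactly: with the interval starts and ends kept in two sorted arrays, the number of database intervals overlapping a query equals the total count minus those that end before the query starts and those that start after it ends, so each query costs two binary searches. A minimal sketch under that description (half-open intervals assumed):

    import bisect

    class IntervalCounter:
        def __init__(self, intervals):
            """intervals: list of (start, end) half-open intervals."""
            self.n = len(intervals)
            self.starts = sorted(s for s, _ in intervals)
            self.ends = sorted(e for _, e in intervals)

        def count_overlaps(self, qstart, qend):
            # intervals that end at or before the query start cannot overlap
            ended_before = bisect.bisect_right(self.ends, qstart)
            # intervals that start at or after the query end cannot overlap
            started_after = self.n - bisect.bisect_left(self.starts, qend)
            return self.n - ended_before - started_after

    db = IntervalCounter([(1, 10), (5, 15), (20, 30)])
    print(db.count_overlaps(8, 18))   # prints 2 (overlaps the first two intervals)
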
17.  Rare variant discovery and calling by sequencing pooled samples with overlaps 
Bioinformatics  2012;29(1):29-38.
Motivation: For many complex traits/diseases, it is believed that rare variants account for some of the missing heritability that cannot be explained by common variants. Sequencing a large number of samples through DNA pooling is a cost-effective strategy to discover rare variants and to investigate their associations with phenotypes. Overlapping pool designs provide further benefit because such approaches can potentially identify variant carriers, which is important for downstream applications of association analysis of rare variants. However, existing algorithms for analysing sequence data from overlapping pools are limited.
Results: We propose a complete data analysis framework for overlapping pool designs, with novelties in all three major steps: variant pool and variant locus identification, variant allele frequency estimation and variant sample decoding. The framework can be used in combination with any design matrix. We have investigated its performance based on two different overlapping designs and have compared it with three state-of-the-art methods, by simulating targeted sequencing and by pooling real sequence data. Results on both datasets show that our algorithm has made significant improvements over existing ones. In conclusion, successful discovery of rare variants and identification of variant carriers using overlapping pool strategies critically depend on many steps, from generation of design matrices to decoding algorithms. The proposed framework in combination with the design matrices generated based on the Chinese remainder theorem achieves the best overall results.
Availability: Source code of the program, termed VIP for Variant Identification by Pooling, is available at http://cbc.case.edu/VIP.
Contact: jingli@cwru.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/bts645
PMCID: PMC3530907  PMID: 23104896
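Entry 17 pairs its decoding framework with overlapping-pool design matrices generated from the Chinese remainder theorem: with pairwise coprime moduli, each sample joins one pool per modulus according to its residue, so samples receive nearly unique pool patterns and carriers can be narrowed down from which pools test positive. A minimal sketch of building such a design matrix; the moduli (7, 8, 9) and pool counts are arbitrary illustrative choices, not the parameters used by VIP.

    import numpy as np

    def crt_design(n_samples, moduli=(7, 8, 9)):
        """0/1 design matrix: one block of pools per modulus; sample i joins pool (i mod m)."""
        n_pools = sum(moduli)
        D = np.zeros((n_pools, n_samples), dtype=int)
        offset = 0
        for m in moduli:
            for i in range(n_samples):
                D[offset + (i % m), i] = 1
            offset += m
        return D

    D = crt_design(100)
    print(D.shape, D.sum(axis=0))   # (24, 100); every sample sits in exactly 3 pools
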
18.  Pathway hunting by random survival forests 
Bioinformatics  2012;29(1):99-105.
Motivation: Pathway or gene set analysis has been widely applied to genomic data. Many current pathway testing methods use univariate test statistics calculated from individual genomic markers, an approach that ignores the correlations and interactions between candidate markers. Random forests-based pathway analysis is a promising approach for incorporating complex correlation and interaction patterns, but one limitation of previous approaches is that pathways have been considered separately, so pathway cross-talk information was not taken into account.
Results: In this article, we develop a new pathway hunting algorithm for survival outcomes using random survival forests, which prioritizes important pathways by accounting for gene correlations and genomic interactions. We show that the proposed method performs favourably compared with five popular pathway testing methods using both synthetic and real data. We find that the proposed methodology provides an efficient and powerful pathway modelling framework for high-dimensional genomic data.
Availability: The R code for the analysis used in this article is available upon request.
Contact: xi.steven.chen@gmail.com
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/bts643
PMCID: PMC3530909  PMID: 23129299
19.  A high-performance computing toolset for relatedness and principal component analysis of SNP data 
Bioinformatics  2012;28(24):3326-3328.
Summary: Genome-wide association studies are widely used to investigate the genetic basis of diseases and traits, but they pose many computational challenges. We developed gdsfmt and SNPRelate (R packages for multi-core symmetric multiprocessing computer architectures) to accelerate two key computations on SNP data: principal component analysis (PCA) and relatedness analysis using identity-by-descent measures. The kernels of our algorithms are written in C/C++ and highly optimized. Benchmarks show the uniprocessor implementations of PCA and identity-by-descent are ∼8–50 times faster than the implementations provided in the popular EIGENSTRAT (v3.0) and PLINK (v1.07) programs, respectively, and can be sped up to 30–300-fold by using eight cores. SNPRelate can analyse tens of thousands of samples with millions of SNPs. For example, our package was used to perform PCA on 55 324 subjects from the ‘Gene-Environment Association Studies’ consortium studies.
Availability and implementation: gdsfmt and SNPRelate are available from R CRAN (http://cran.r-project.org), including a vignette. A tutorial can be found at https://www.genevastudy.org/Accomplishments/software.
Contact: zhengx@u.washington.edu
doi:10.1093/bioinformatics/bts606
PMCID: PMC3519454  PMID: 23060615
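The PCA described in entry 19 is, at its core, an eigendecomposition of a genetic relationship matrix built from standardized genotype dosages (the EIGENSTRAT-style formulation). SNPRelate itself is an R package; the sketch below merely restates that computation in numpy to make the linear algebra explicit, with the allele-frequency standardization being the usual textbook choice rather than the package's exact code path, and the genotype matrix simulated.

    import numpy as np

    def genotype_pca(G, n_components=2):
        """G: samples x SNPs matrix of dosages in {0,1,2}; returns top principal components."""
        G = np.asarray(G, float)
        p = G.mean(axis=0) / 2.0                          # per-SNP allele frequency
        keep = (p > 0) & (p < 1)                          # drop monomorphic SNPs
        Z = (G[:, keep] - 2 * p[keep]) / np.sqrt(2 * p[keep] * (1 - p[keep]))
        K = Z @ Z.T / Z.shape[1]                          # genetic relationship matrix
        vals, vecs = np.linalg.eigh(K)                    # eigenvalues in ascending order
        order = np.argsort(vals)[::-1][:n_components]
        return vecs[:, order] * np.sqrt(vals[order])

    G = np.random.randint(0, 3, size=(50, 1000))          # simulated genotypes
    pcs = genotype_pca(G)                                 # 50 x 2 sample coordinates
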
20.  GWASTools: an R/Bioconductor package for quality control and analysis of genome-wide association studies 
Bioinformatics  2012;28(24):3329-3331.
Summary: GWASTools is an R/Bioconductor package for quality control and analysis of genome-wide association studies (GWAS). GWASTools brings the interactive capability and extensive statistical libraries of R to GWAS. Data are stored in NetCDF format to accommodate extremely large datasets that cannot fit within R’s memory limits. The documentation includes instructions for converting data from multiple formats, including variants called from sequencing. GWASTools provides a convenient interface for linking genotypes and intensity data with sample and single nucleotide polymorphism annotation.
Availability and implementation: GWASTools is implemented in R and is available from Bioconductor (http://www.bioconductor.org). An extensive vignette detailing a recommended work flow is included.
Contact: sdmorris@uw.edu
doi:10.1093/bioinformatics/bts610
PMCID: PMC3519456  PMID: 23052040
21.  FOLD-EM: automated fold recognition in medium- and low-resolution (4–15 Å) electron density maps 
Bioinformatics  2012;28(24):3265-3273.
Motivation: Owing to the size and complexity of large multi-component biological assemblies, the most tractable approach to determining their atomic structure is often to fit high-resolution radiographic or nuclear magnetic resonance structures of isolated components into lower resolution electron density maps of the larger assembly obtained using cryo-electron microscopy (cryo-EM). This hybrid approach to structure determination requires that an atomic resolution structure of each component, or a suitable homolog, is available. If neither is available, then the amount of structural information regarding that component is limited by the resolution of the cryo-EM map. However, even if a suitable homolog cannot be identified using sequence analysis, a search for structural homologs should still be performed because structural homology often persists throughout evolution even when sequence homology is undetectable. As macromolecules can often be described as a collection of independently folded domains, one way of searching for structural homologs would be to systematically fit representative domain structures from a protein domain database into the medium/low resolution cryo-EM map and return the best fits. Taken together, the best fitting non-overlapping structures would constitute a ‘mosaic’ backbone model of the assembly that could aid map interpretation and illuminate biological function.
Results: Using the computational principles of the Scale-Invariant Feature Transform (SIFT), we have developed FOLD-EM, a computational tool that can identify folded macromolecular domains in medium- to low-resolution (4–15 Å) electron density maps and return a model of the constituent polypeptides in a fully automated fashion. As a by-product, FOLD-EM can also do flexible multi-domain fitting that may provide insight into conformational changes that occur in macromolecular assemblies.
Availability and implementation: FOLD-EM is available at http://cs.stanford.edu/~mitul/foldEM/ as free open-source software for the structural biology community.
Contact: mitul@cs.stanford.edu or mcmorais@utmb.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/bts616
PMCID: PMC3519459  PMID: 23131460
22.  Two distinct SSB protein families in nucleo-cytoplasmic large DNA viruses 
Bioinformatics  2012;28(24):3186-3190.
Motivation: Eukaryote-infecting nucleo-cytoplasmic large DNA viruses (NCLDVs) feature some of the largest genomes in the viral world. These viruses typically do not strongly depend on the host DNA replication systems. In line with this observation, a number of essential DNA replication proteins, such as DNA polymerases, primases, helicases and ligases, have been identified in the NCLDVs. One other ubiquitous component of DNA replisomes is the single-stranded DNA-binding (SSB) protein. Intriguingly, no NCLDV homologs of canonical OB-fold-containing SSB proteins had previously been detected. Only in poxviruses, one of the seven NCLDV families, had an SSB protein, I3, been identified. However, whether I3 is related to any known protein structure has not yet been established.
Results: Here, we addressed the case of the ‘missing’ canonical SSB proteins in the NCLDVs and also probed the evolutionary origins of the I3 family. Using advanced computational methods, we detected homologs of the bacteriophage T7 SSB protein (gp2.5) in four NCLDV families. We found the properties of these homologs to be consistent with the SSB function. Moreover, we implicated specific residues in single-stranded DNA binding. At the same time, we found no evolutionary link between the T7 gp2.5-like NCLDV SSB homologs and the poxviral SSB protein (I3). Instead, we identified a distant relationship between I3 and small protein B (SmpB), a bacterial RNA-binding protein. Thus, the NCLDVs apparently have two major distinct sets of SSB proteins, of bacteriophage and bacterial origin, respectively.
Contact: venclovas@ibt.lt
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/bts626
PMCID: PMC3519460  PMID: 23097418
23.  A method for integrative structure determination of protein-protein complexes 
Bioinformatics  2012;28(24):3282-3289.
Motivation: Structural characterization of protein interactions is necessary for understanding and modulating biological processes. On the one hand, X-ray crystallography and NMR spectroscopy provide atomic-resolution structures, but the data collection process is typically long and the success rate is low. On the other hand, computational methods for modeling assembly structures from individual components frequently suffer from a high false-positive rate and rarely result in a unique solution.
Results: Here, we present a combined approach that computationally integrates data from a variety of fast and accessible experimental techniques for rapid and accurate structure determination of protein–protein complexes. The integrative method uses atomistic models of two interacting proteins and one or more datasets from five accessible experimental techniques: a small-angle X-ray scattering (SAXS) profile, 2D class average images from negative-stain electron microscopy (EM) micrographs, a 3D density map from single-particle negative-stain EM, residue type content of the protein–protein interface from NMR spectroscopy and chemical cross-linking detected by mass spectrometry. The method is tested on a docking benchmark consisting of 176 known complex structures and simulated experimental data. The near-native model is the top-scoring one for up to 61% of benchmark cases, depending on the included experimental datasets, compared with 10% for standard computational docking. We also collected SAXS profiles, 2D class average images and a 3D density map from negative-stain EM to model the PCSK9 antigen–J16 Fab antibody complex, followed by validation of the model by a subsequently available X-ray crystallographic structure.
Availability: http://salilab.org/idock
Contact: dina@salilab.org or sali@salilab.org
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/bts628
PMCID: PMC3519461  PMID: 23093611
24.  SCALCE: boosting sequence compression algorithms using locally consistent encoding 
Bioinformatics  2012;28(23):3051-3057.
Motivation: The high throughput sequencing (HTS) platforms generate unprecedented amounts of data that introduce challenges for the computational infrastructure. Data management, storage and analysis have become major logistical obstacles for those adopting the new platforms. The requirement for large investment for this purpose almost signalled the end of the Sequence Read Archive hosted at the National Center for Biotechnology Information (NCBI), which holds most of the sequence data generated worldwide. Currently, most HTS data are compressed through general purpose algorithms such as gzip. These algorithms are not designed for compressing data generated by the HTS platforms; for example, they do not take advantage of the specific nature of genomic sequence data, that is, limited alphabet size and high similarity among reads. Fast and efficient compression algorithms designed specifically for HTS data should be able to address some of the issues in data management, storage and communication. Such algorithms would also help with analysis provided they offer additional capabilities such as random access to any read and indexing for efficient sequence similarity search. Here we present SCALCE, a ‘boosting’ scheme based on the Locally Consistent Parsing technique, which reorganizes the reads in a way that results in a higher compression speed and compression rate, independent of the compression algorithm in use and without using a reference genome.
Results: Our tests indicate that SCALCE can improve the compression rate achieved through gzip by a factor of 4.19 when the goal is to compress the reads alone. In fact, on SCALCE-reordered reads, gzip running time can improve by a factor of 15.06 on a standard PC with a single core and 6 GB memory. Interestingly, even the running time of SCALCE + gzip improves on that of gzip alone by a factor of 2.09. When compared with the recently published BEETL, which aims to sort the (inverted) reads in lexicographic order to improve bzip2, SCALCE + gzip provides up to 2.01 times better compression while improving the running time by a factor of 5.17. SCALCE also provides the option to compress the quality scores as well as the read names, in addition to the reads themselves. This is achieved by compressing the quality scores through order-3 Arithmetic Coding (AC) and the read names through gzip, exploiting the reordering that SCALCE provides on the reads. This way, in comparison with gzip compression of the unordered FASTQ files (including reads, read names and quality scores), SCALCE (together with gzip and arithmetic encoding) can provide up to a 3.34-fold improvement in compression rate and a 1.26-fold improvement in running time.
Availability: Our algorithm, SCALCE (Sequence Compression Algorithm using Locally Consistent Encoding), is implemented in C++ with both gzip and bzip2 compression options. It also supports multithreading when the gzip option is selected and the pigz binary is available. It is available at http://scalce.sourceforge.net.
Contact: fhach@cs.sfu.ca or cenk@cs.sfu.ca
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/bts593
PMCID: PMC3509486  PMID: 23047557
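Entry 24's gain comes from reordering reads so that reads sharing 'core' substrings, found by Locally Consistent Parsing, sit next to each other before a general-purpose compressor runs. The sketch below demonstrates only the reordering-helps-gzip effect, using a crude stand-in for the core selection (the lexicographically smallest fixed-length substring of each read, a minimizer-like key); it is not SCALCE's LCP scheme, the reads are simulated, and the gain on real data will differ.

    import gzip, random

    def core_key(read, k=8):
        """Stand-in for an LCP core: the lexicographically smallest k-mer of the read."""
        return min(read[i:i + k] for i in range(len(read) - k + 1))

    def compressed_size(reads):
        return len(gzip.compress("\n".join(reads).encode()))

    # simulate overlapping 100 bp reads drawn from one reference sequence
    random.seed(0)
    ref = "".join(random.choice("ACGT") for _ in range(20000))
    reads = [ref[i:i + 100] for i in (random.randrange(0, 19900) for _ in range(5000))]

    random.shuffle(reads)
    before = compressed_size(reads)
    after = compressed_size(sorted(reads, key=core_key))   # bucket similar reads together
    print(before, after)                                   # reordered reads typically compress better
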
25.  SemMedDB: a PubMed-scale repository of biomedical semantic predications 
Bioinformatics  2012;28(23):3158-3160.
Summary: Effective access to the vast biomedical knowledge present in the scientific literature is challenging. Semantic relations are increasingly used in knowledge management applications supporting biomedical research to help address this challenge. We describe SemMedDB, a repository of semantic predications (subject–predicate–object triples) extracted from the entire set of PubMed citations. We propose the repository as a knowledge resource that can assist in hypothesis generation and literature-based discovery in biomedicine as well as in clinical decision-making support.
Availability and implementation: The SemMedDB repository is available as a MySQL database for non-commercial use at http://skr3.nlm.nih.gov/SemMedDB. A UMLS Metathesaurus license is required.
Contact: kilicogluh@mail.nih.gov
doi:10.1093/bioinformatics/bts591
PMCID: PMC3509487  PMID: 23044550
