PMCC PMCC

Search tips
Search criteria

Advanced
Results 1-25 (2182)
 

Clipboard (0)
None

Select a Filter Below

Journals
Year of Publication
more »
1.  A variable selection method for genome-wide association studies 
Bioinformatics  2010;27(1):1-8.
Motivation: Genome-wide association studies (GWAS) involving half a million or more single nucleotide polymorphisms (SNPs) allow genetic dissection of complex diseases in a holistic manner. The common practice of analyzing one SNP at a time does not fully realize the potential of GWAS to identify multiple causal variants and to predict risk of disease. Existing methods for joint analysis of GWAS data tend to miss causal SNPs that are marginally uncorrelated with disease and have high false discovery rates (FDRs).
Results: We introduce GWASelect, a statistically powerful and computationally efficient variable selection method designed to tackle the unique challenges of GWAS data. This method searches iteratively over the potential SNPs conditional on previously selected SNPs and is thus capable of capturing causal SNPs that are marginally correlated with disease as well as those that are marginally uncorrelated with disease. A special resampling mechanism is built into the method to reduce false positive findings. Simulation studies demonstrate that the GWASelect performs well under a wide spectrum of linkage disequilibrium patterns and can be substantially more powerful than existing methods in capturing causal variants while having a lower FDR. In addition, the regression models based on the GWASelect tend to yield more accurate prediction of disease risk than existing methods. The advantages of the GWASelect are illustrated with the Wellcome Trust Case-Control Consortium (WTCCC) data.
Availability: The software implementing GWASelect is available at http://www.bios.unc.edu/~lin.
Access to WTCCC data: http://www.wtccc.org.uk/
Contact: lin@bios.unc.edu
Supplementary information: Supplementary data are available at Bioinformatics Online.
doi:10.1093/bioinformatics/btq600
PMCID: PMC3025714  PMID: 21036813
2.  A variable selection method for genome-wide association studies 
Biometrics  2011;27(1):1-8.
Motivation
Genome-wide association studies (GWAS) involving half a million or more single nucleotide polymorphisms (SNPs) allow genetic dissection of complex diseases in a holistic manner. The common practice of analyzing one SNP at a time does not fully realize the potential of GWAS to identify multiple causal variants and to predict risk of disease. Existing methods for joint analysis of GWAS data tend to miss causal SNPs that are marginally uncorrelated with disease and have high false discovery rates (FDRs).
Results
We introduce GWASelect, a statistically powerful and computationally efficient variable selection method designed to tackle the unique challenges of GWAS data. This method searches iteratively over the potential SNPs conditional on previously selected SNPs and is thus capable of capturing causal SNPs that are marginally correlated with disease as well as those that are marginally uncorrelated with disease. A special resampling mechanism is built into the method to reduce false-positive findings. Simulation studies demonstrate that the GWASelect performs well under a wide spectrum of linkage disequilibrium (LD) patterns and can be substantially more powerful than existing methods in capturing causal variants while having a lower FDR. In addition, the regression models based on the GWASelect tend to yield more accurate prediction of disease risk than existing methods. The advantages of the GWASelect are illustrated with the Wellcome Trust Case-Control Consortium (WTCCC) data.
doi:10.1093/bioinformatics/btq600
PMCID: PMC3025714  PMID: 21036813
3.  Efficient whole-genome association mapping using local phylogenies for unphased genotype data 
Bioinformatics  2008;24(19):2215-2221.
Motivation: Recent advances in genotyping technology has made data acquisition for whole-genome association study cost effective, and a current active area of research is developing efficient methods to analyze such large-scale datasets. Most sophisticated association mapping methods that are currently available take phased haplotype data as input. However, phase information is not readily available from sequencing methods and inferring the phase via computational approaches is time-consuming, taking days to phase a single chromosome.
Results: In this article, we devise an efficient method for scanning unphased whole-genome data for association. Our approach combines a recently found linear-time algorithm for phasing genotypes on trees with a recently proposed tree-based method for association mapping. From unphased genotype data, our algorithm builds local phylogenies along the genome, and scores each tree according to the clustering of cases and controls. We assess the performance of our new method on both simulated and real biological datasets.
Availability The software described in this article is available at http://www.daimi.au.dk/~mailund/Blossoc and distributed under the GNU General Public License.
Contact:mailund@birc.au.dk
doi:10.1093/bioinformatics/btn406
PMCID: PMC2553438  PMID: 18667442
4.  iFoldRNA: three-dimensional RNA structure prediction and folding 
Bioinformatics  2008;24(17):1951-1952.
Summary: Three-dimensional RNA structure prediction and folding is of significant interest in the biological research community. Here, we present iFoldRNA, a novel web-based methodology for RNA structure prediction with near atomic resolution accuracy and analysis of RNA folding thermodynamics. iFoldRNA rapidly explores RNA conformations using discrete molecular dynamics simulations of input RNA sequences. Starting from simplified linear-chain conformations, RNA molecules (<50 nt) fold to native-like structures within half an hour of simulation, facilitating rapid RNA structure prediction. All-atom reconstruction of energetically stable conformations generates iFoldRNA predicted RNA structures. The predicted RNA structures are within 2–5 Å root mean squre deviations (RMSDs) from corresponding experimentally derived structures. RNA folding parameters including specific heat, contact maps, simulation trajectories, gyration radii, RMSDs from native state, fraction of native-like contacts are accessible from iFoldRNA. We expect iFoldRNA will serve as a useful resource for RNA structure prediction and folding thermodynamic analyses.
Availability: http://iFoldRNA.dokhlab.org.
Contact: dokh@med.unc.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btn328
PMCID: PMC2559968  PMID: 18579566
5.  Systematic biological prioritization after a genome-wide association study: an application to nicotine dependence 
Bioinformatics  2008;24(16):1805-1811.
Motivation: A challenging problem after a genome-wide association study (GWAS) is to balance the statistical evidence of genotype–phenotype correlation with a priori evidence of biological relevance.
Results: We introduce a method for systematically prioritizing single nucleotide polymorphisms (SNPs) for further study after a GWAS. The method combines evidence across multiple domains including statistical evidence of genotype–phenotype correlation, known pathways in the pathologic development of disease, SNP/gene functional properties, comparative genomics, prior evidence of genetic linkage, and linkage disequilibrium. We apply this method to a GWAS of nicotine dependence, and use simulated data to test it on several commercial SNP microarrays.
Availability: A comprehensive database of biological prioritization scores for all known SNPs is available at http://zork.wustl.edu/gin. This can be used to prioritize nicotine dependence association studies through a straightforward mathematical formula—no special software is necessary.
Contact: ssaccone@wustl.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btn315
PMCID: PMC2610477  PMID: 18565990
6.  Comprehensive in silico mutagenesis highlights functionally important residues in proteins 
Bioinformatics  2008;24(16):i207-i212.
Motivation: Mutating residues into alanine (alanine scanning) is one of the fastest experimental means of probing hypotheses about protein function. Alanine scans can reveal functional hot spots, i.e. residues that alter function upon mutation. In vitro mutagenesis is cumbersome and costly: probing all residues in a protein is typically as impossible as substituting by all non-native amino acids. In contrast, such exhaustive mutagenesis is feasible in silico.
Results: Previously, we developed SNAP to predict functional changes due to non-synonymous single nucleotide polymorphisms. Here, we applied SNAP to all experimental mutations in the ASEdb database of alanine scans; we identified 70% of the hot spots (≥1 kCal/mol change in binding energy); more severe changes were predicted more accurately. Encouraged, we carried out a complete all-against-all in silico mutagenesis for human glucokinase. Many of the residues predicted as functionally important have indeed been confirmed in the literature, others await experimental verification, and our method is ready to aid in the design of in vitro mutagenesis.
Availability: ASEdb and glucokinase scores are available at http://www.rostlab.org/services/SNAP. For submissions of large/whole proteins for processing please contact the author.
Contact: yb2009@columbia.edu
doi:10.1093/bioinformatics/btn268
PMCID: PMC2597370  PMID: 18689826
7.  Systematic biological prioritization after a genome-wide association study 
Bioinformatics (Oxford, England)  2008;24(16):1805-1811.
Motivation
A challenging problem after a genome-wide association study (GWAS) is to balance the statistical evidence of geno-type-phenotype correlation with a priori evidence of biological relevance.
Results
We introduce a method for systematically prioritizing single nucleotide polymorphisms (SNPs) for further study after a GWAS. The method combines evidence across multiple domains, including statistical evidence of genotype-phenotype correlation, known pathways in the pathologic development of disease, SNP/gene functional properties, comparative genomics, prior evidence of genetic linkage, and linkage disequilibrium. We apply this method to a GWAS of nicotine dependence, and use simulated data to test it on several commercial SNP microarrays.
doi:10.1093/bioinformatics/btn315
PMCID: PMC2610477  PMID: 18565990
8.  LOT: a tool for linkage analysis of ordinal traits for pedigree data 
Bioinformatics  2008;24(15):1737-1739.
Summary: Existing linkage-analysis methods address binary or quantitative traits. However, many complex diseases and human conditions, particularly behavioral disorders, are rated on ordinal scales. Herein, we introduce, LOT, a tool that performs linkage analysis of ordinal traits for pedigree data. It implements a latent-variable proportional-odds logistic model that relates inheritance patterns to the distribution of the ordinal trait. The likelihood-ratio test is used for testing evidence of linkage.
Availability: The LOT program is available for download at http://c2s2.yale.edu/software/LOT/
Contact: heping.zhang@yale.edu
doi:10.1093/bioinformatics/btn258
PMCID: PMC2566542  PMID: 18535081
9.  Memory-efficient dynamic programming backtrace and pairwise local sequence alignment 
Bioinformatics (Oxford, England)  2008;24(16):1772-1778.
Motivation
A backtrace through a dynamic programming algorithm’s intermediate results in search of an optimal path, or to sample paths according to an implied probability distribution, or as the second stage of a forward–backward algorithm, is a task of fundamental importance in computational biology. When there is insufficient space to store all intermediate results in high-speed memory (e.g. cache) existing approaches store selected stages of the computation, and recompute missing values from these checkpoints on an as-needed basis.
Results
Here we present an optimal checkpointing strategy, and demonstrate its utility with pairwise local sequence alignment of sequences of length 10 000.
Availability
Sample C++-code for optimal backtrace is available in the Supplementary Materials.
doi:10.1093/bioinformatics/btn308
PMCID: PMC2668612  PMID: 18558620
10.  Modeling peptide fragmentation with dynamic Bayesian networks for peptide identification 
Bioinformatics (Oxford, England)  2008;24(13):i348-i356.
Motivation
Tandem mass spectrometry (MS/MS) is an indispensable technology for identification of proteins from complex mixtures. Proteins are digested to peptides that are then identified by their fragmentation patterns in the mass spectrometer. Thus, at its core, MS/MS protein identification relies on the relative predictability of peptide fragmentation. Unfortunately, peptide fragmentation is complex and not fully understood, and what is understood is not always exploited by peptide identification algorithms.
Results
We use a hybrid dynamic Bayesian network (DBN)/support vector machine (SVM) approach to address these two problems. We train a set of DBNs on high-confidence peptide-spectrum matches. These DBNs, known collectively as Riptide, comprise a probabilistic model of peptide fragmentation chemistry. Examination of the distributions learned by Riptide allows identification of new trends, such as prevalent a-ion fragmentation at peptide cleavage sites C-term to hydrophobic residues. In addition, Riptide can be used to produce likelihood scores that indicate whether a given peptide-spectrum match is correct. A vector of such scores is evaluated by an SVM, which produces a final score to be used in peptide identification. Using Riptide in this way yields improved discrimination when compared to other state-of-the-art MS/MS identification algorithms, increasing the number of positive identifications by as much as 12% at a 1% false discovery rate.
Availability
Python and C source code are available upon request from the authors. The curated training sets are available at http://noble.gs.washington.edu/proj/intense/. The Graphical Model Tool Kit (GMTK) is freely available at http://ssli.ee.washington.edu/bilmes/gmtk.
Contact
noble@gs.washington.edu
doi:10.1093/bioinformatics/btn189
PMCID: PMC2665034  PMID: 18586734
11.  Comprehensive in silico mutagenesis highlights functionally important residues in proteins 
Bioinformatics (Oxford, England)  2008;24(16):i207-i212.
Motivation
Mutating residues into alanine (alanine scanning) is one of the fastest experimental means of probing hypotheses about protein function. Alanine scans can reveal functional hot spots, i.e. residues that alter function upon mutation. In vitro mutagenesis is cumbersome and costly: probing all residues in a protein is typically as impossible as substituting by all non-native amino acids. In contrast, such exhaustive mutagenesis is feasible in silico.
Results
Previously, we developed SNAP to predict functional changes due to non-synonymous single nucleotide polymorphisms. Here, we applied SNAP to all experimental mutations in the ASEdb database of alanine scans; we identified 70% of the hot spots (≥1kCal/mol change in binding energy); more severe changes were predicted more accurately. Encouraged, we carried out a complete all-against-all in silico mutagenesis for human glucokinase. Many of the residues predicted as functionally important have indeed been confirmed in the literature, others await experimental verification, and our method is ready to aid in the design of in vitro mutagenesis.
Availability
ASEdb and glucokinase scores are available at http://www.rostlab.org/services/SNAP. For submissions of large/whole proteins for processing please contact the author.
Contact: yb2009@columbia.edu
doi:10.1093/bioinformatics/btn268
PMCID: PMC2597370  PMID: 18689826
12.  Powerful fusion: PSI-BLAST and consensus sequences 
Bioinformatics (Oxford, England)  2008;24(18):1987-1993.
Motivation
A typical PSI-BLAST search consists of iterative scanning and alignment of a large sequence database during which a scoring profile is progressively built and refined. Such a profile can also be stored and used to search against a different database of sequences. Using it to search against a database of consensus rather than native sequences is a simple add-on that boosts performance surprisingly well. The improvement comes at a price: we hypothesized that random alignment score statistics would differ between native and consensus sequences. Thus PSI-BLAST-based profile searches against consensus sequences might incorrectly estimate statistical significance of alignment scores. In addition, iterative searches against consensus databases may fail. Here, we addressed these challenges in an attempt to harness the full power of the combination of PSI-BLAST and consensus sequences.
Results
We studied alignment score statistics for various types of consensus sequences. In general, the score distribution parameters of profile-based consensus sequence alignments differed significantly from those derived for the native sequences. PSI-BLAST partially compensated for the parameter variation. We have identified a protocol for building specialized consensus sequences that significantly improved search sensitivity and preserved score distribution parameters. As a result, PSI-BLAST profiles can be used to search specialized consensus sequences without sacrificing estimates of statistical significance. We also provided results indicating that iterative PSI-BLAST searches against consensus sequences could work very well. Overall, we showed how a widely popular and effective method could be used to identify significantly more relevant similarities among protein sequences.
Availability
http://www.rostlab.org/services/consensus/
Contact:
dsp23@columbia.edu
doi:10.1093/bioinformatics/btn384
PMCID: PMC2577777  PMID: 18678588
13.  Efficient Whole-Genome Association Mapping using Local Phylogenies for Unphased Genotype Data 
Bioinformatics (Oxford, England)  2008;24(19):2215-2221.
Motivation
Recent advances in genotyping technology has made data acquisition for whole-genome association study cost effective, and a current active area of research is developing efficient methods to analyze such large-scale data sets. Most sophisticated association mapping methods that are currently available take phased haplotype data as input. However, phase information is not readily available from sequencing methods and inferring the phase via computational approaches is time-consuming, taking days to phase a single chromosome.
Results
In this paper, we devise an efficient method for scanning unphased whole-genome data for association. Our approach combines a recently found linear-time algorithm for phasing genotypes on trees with a recently proposed tree-based method for association mapping. From unphased genotype data, our algorithm builds local phylogenies along the genome, and scores each tree according to the clustering of cases and controls. We assess the performance of our new method on both simulated and real biological data sets.
doi:10.1093/bioinformatics/btn406
PMCID: PMC2553438  PMID: 18667442
14.  LOT 
Bioinformatics (Oxford, England)  2008;24(15):1737-1739.
Summary
Existing linkage-analysis methods address binary or quantitative traits. However, many complex diseases and human conditions, particularly behavioral disorders, are rated on ordinal scales. Herein, we introduce, LOT, a tool that performs linkage analysis of ordinal traits for pedigree data. It implements a latent-variable proportional-odds logistic model that relates inheritance patterns to the distribution of the ordinal trait. The likelihood-ratio test is used for testing evidence of linkage.
doi:10.1093/bioinformatics/btn258
PMCID: PMC2566542  PMID: 18535081
15.  iFoldRNA: Three-dimensional RNA Structure Prediction and Folding 
Bioinformatics (Oxford, England)  2008;24(17):1951-1952.
Summary
Three-dimensional RNA structure prediction and folding is of significant interest in the biological research community. Here, we present iFoldRNA, a novel web-based methodology for RNA structure prediction with near atomic resolution accuracy and analysis of RNA folding thermodynamics. iFoldRNA rapidly explores RNA conformations using discrete molecular dynamics simulations of input RNA sequences. Starting from simplified linear-chain conformations, RNA molecules (<50 nucleotides) fold to native-like structures within half an hour of simulation, facilitating rapid RNA structure prediction. All-atom reconstruction of energetically stable conformations generates iFoldRNA predicted RNA structures. The predicted RNA structures are within 2–5 Angstrom root mean square deviations from corresponding experimentally derived structures. RNA folding parameters including specific heat, contact maps, simulation trajectories, gyration radii, root mean square deviations from native state, fraction of native-like contacts are accessible from iFoldRNA. We expect iFoldRNA will serve as a useful resource for RNA structure prediction and folding thermodynamic analyses.
doi:10.1093/bioinformatics/btn328
PMCID: PMC2559968  PMID: 18579566
16.  Powerful fusion: PSI-BLAST and consensus sequences 
Bioinformatics  2008;24(18):1987-1993.
Motivation: A typical PSI-BLAST search consists of iterative scanning and alignment of a large sequence database during which a scoring profile is progressively built and refined. Such a profile can also be stored and used to search against a different database of sequences. Using it to search against a database of consensus rather than native sequences is a simple add-on that boosts performance surprisingly well. The improvement comes at a price: we hypothesized that random alignment score statistics would differ between native and consensus sequences. Thus PSI-BLAST-based profile searches against consensus sequences might incorrectly estimate statistical significance of alignment scores. In addition, iterative searches against consensus databases may fail. Here, we addressed these challenges in an attempt to harness the full power of the combination of PSI-BLAST and consensus sequences.
Results: We studied alignment score statistics for various types of consensus sequences. In general, the score distribution parameters of profile-based consensus sequence alignments differed significantly from those derived for the native sequences. PSI-BLAST partially compensated for the parameter variation. We have identified a protocol for building specialized consensus sequences that significantly improved search sensitivity and preserved score distribution parameters. As a result, PSI-BLAST profiles can be used to search specialized consensus sequences without sacrificing estimates of statistical significance. We also provided results indicating that iterative PSI-BLAST searches against consensus sequences could work very well. Overall, we showed how a very popular and effective method could be used to identify significantly more relevant similarities among protein sequences.
Availability: http://www.rostlab.org/services/consensus/
Contact: dariusz@mit.edu
doi:10.1093/bioinformatics/btn384
PMCID: PMC2577777  PMID: 18678588
17.  Modeling peptide fragmentation with dynamic Bayesian networks for peptide identification 
Bioinformatics  2008;24(13):i348-i356.
Motivation: Tandem mass spectrometry (MS/MS) is an indispensable technology for identification of proteins from complex mixtures. Proteins are digested to peptides that are then identified by their fragmentation patterns in the mass spectrometer. Thus, at its core, MS/MS protein identification relies on the relative predictability of peptide fragmentation. Unfortunately, peptide fragmentation is complex and not fully understood, and what is understood is not always exploited by peptide identification algorithms.
Results: We use a hybrid dynamic Bayesian network (DBN)/support vector machine (SVM) approach to address these two problems. We train a set of DBNs on high-confidence peptide-spectrum matches. These DBNs, known collectively as Riptide, comprise a probabilistic model of peptide fragmentation chemistry. Examination of the distributions learned by Riptide allows identification of new trends, such as prevalent a-ion fragmentation at peptide cleavage sites C-term to hydrophobic residues. In addition, Riptide can be used to produce likelihood scores that indicate whether a given peptide-spectrum match is correct. A vector of such scores is evaluated by an SVM, which produces a final score to be used in peptide identification. Using Riptide in this way yields improved discrimination when compared to other state-of-the-art MS/MS identification algorithms, increasing the number of positive identifications by as much as 12% at a 1% false discovery rate.
Availability: Python and C source code are available upon request from the authors. The curated training sets are available at http://noble.gs.washington.edu/proj/intense/. The Graphical Model Tool Kit (GMTK) is freely available at http://ssli.ee.washington.edu/bilmes/gmtk.
Contact:noble@gs.washington.edu
doi:10.1093/bioinformatics/btn189
PMCID: PMC2665034  PMID: 18586734
18.  Memory-efficient dynamic programming backtrace and pairwise local sequence alignment 
Bioinformatics  2008;24(16):1772-1778.
Motivation: A backtrace through a dynamic programming algorithm's intermediate results in search of an optimal path, or to sample paths according to an implied probability distribution, or as the second stage of a forward–backward algorithm, is a task of fundamental importance in computational biology. When there is insufficient space to store all intermediate results in high-speed memory (e.g. cache) existing approaches store selected stages of the computation, and recompute missing values from these checkpoints on an as-needed basis.
Results: Here we present an optimal checkpointing strategy, and demonstrate its utility with pairwise local sequence alignment of sequences of length 10 000.
Availability: Sample C++-code for optimal backtrace is available in the Supplementary Materials.
Contact: leen@cs.rpi.edu
Supplementary information: Supplementary data is available at Bioinformatics online.
doi:10.1093/bioinformatics/btn308
PMCID: PMC2668612  PMID: 18558620
19.  GROMACS 4.5: a high-throughput and highly parallel open source molecular simulation toolkit 
Bioinformatics  2013;29(7):845-854.
Motivation: Molecular simulation has historically been a low-throughput technique, but faster computers and increasing amounts of genomic and structural data are changing this by enabling large-scale automated simulation of, for instance, many conformers or mutants of biomolecules with or without a range of ligands. At the same time, advances in performance and scaling now make it possible to model complex biomolecular interaction and function in a manner directly testable by experiment. These applications share a need for fast and efficient software that can be deployed on massive scale in clusters, web servers, distributed computing or cloud resources.
Results: Here, we present a range of new simulation algorithms and features developed during the past 4 years, leading up to the GROMACS 4.5 software package. The software now automatically handles wide classes of biomolecules, such as proteins, nucleic acids and lipids, and comes with all commonly used force fields for these molecules built-in. GROMACS supports several implicit solvent models, as well as new free-energy algorithms, and the software now uses multithreading for efficient parallelization even on low-end systems, including windows-based workstations. Together with hand-tuned assembly kernels and state-of-the-art parallelization, this provides extremely high performance and cost efficiency for high-throughput as well as massively parallel simulations.
Availability: GROMACS is an open source and free software available from http://www.gromacs.org.
Contact: erik.lindahl@scilifelab.se
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btt055
PMCID: PMC3605599  PMID: 23407358
20.  A dynamic Bayesian Markov model for phasing and characterizing haplotypes in next-generation sequencing 
Bioinformatics  2013;29(7):878-885.
Motivation: Next-generation sequencing (NGS) technologies have enabled whole-genome discovery and analysis of genetic variants in many species of interest. Individuals are often sequenced at low coverage for detecting novel variants, phasing haplotypes and inferring population structures. Although several tools have been developed for SNP and genotype calling in NGS data, haplotype phasing is often done separately on the called genotypes.
Results: We propose a dynamic Bayesian Markov model (DBM) for simultaneous genotype calling and haplotype phasing in low-coverage NGS data of unrelated individuals. Our method is fully probabilistic that produces consistent inference of genotypes, haplotypes and recombination probabilities. Using data from the 1000 Genomes Project, we demonstrate that DBM not only yields more accurate results than some popular methods, but also provides novel characterization of haplotype structures at the individual level for visualization, interpretation and comparison in downstream analysis. DBM is a powerful and flexible tool that can be applied to many sequencing studies. Its statistical framework can also be extended to accommodate broader scopes of data.
Availability and implementation: http://stat.psu.edu/∼yuzhang/software/dbm.tar
Contact: yuzhang@stat.psu.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btt065
PMCID: PMC3656686  PMID: 23407359
22.  BION web server: predicting non-specifically bound surface ions 
Bioinformatics  2013;29(6):805-806.
Motivation: Ions are essential component of the cell and frequently are found bound to various macromolecules, in particular to proteins. A binding of an ion to a protein greatly affects protein’s biophysical characteristics and needs to be taken into account in any modeling approach. However, ion’s bounded positions cannot be easily revealed experimentally, especially if they are loosely bound to macromolecular surface.
Results: Here, we report a web server, the BION web server, which addresses the demand for tools of predicting surface bound ions, for which specific interactions are not crucial; thus, they are difficult to predict. The BION is easy to use web server that requires only coordinate file to be inputted, and the user is provided with various, but easy to navigate, options. The coordinate file with predicted bound ions is displayed on the output and is available for download.
Availability: http://compbio.clemson.edu/bion_server/
Supplementary information: Supplementary data are available at Bioinformatics online.
Contact: ealexov@clemson.edu
doi:10.1093/bioinformatics/btt032
PMCID: PMC3597141  PMID: 23380591
23.  AbCD: arbitrary coverage design for sequencing-based genetic studies 
Bioinformatics  2013;29(6):799-801.
Summary: Recent advances in sequencing technologies have revolutionized genetic studies. Although high-coverage sequencing can uncover most variants present in the sequenced sample, low-coverage sequencing is appealing for its cost effectiveness. Here, we present AbCD (arbitrary coverage design) to aid the design of sequencing-based studies. AbCD is a user-friendly interface providing pre-estimated effective sample sizes, specific to each minor allele frequency category, for designs with arbitrary coverage (0.5–30×) and sample size (20–10 000), and for four major ethnic groups (Europeans, Africans, Asians and African Americans). In addition, we also present two software tools: ShotGun and DesignPlanner, which were used to generate the estimates behind AbCD. ShotGun is a flexible short-read simulator for arbitrary user-specified read length and average depth, allowing cycle-specific sequencing error rates and realistic read depth distributions. DesignPlanner is a full pipeline that uses ShotGun to generate sequence data and performs initial SNP discovery, uses our previously presented linkage disequilibrium-aware method to call genotypes, and, finally, provides minor allele frequency-specific effective sample sizes. ShotGun plus DesignPlanner can accommodate effective sample size estimate for any combination of high-depth and low-depth data (for example, whole-genome low-depth plus exonic high-depth) or combination of sequence and genotype data [for example, whole-exome sequencing plus genotyping from existing Genomewide Association Study (GWAS)].
Availability and implementation: AbCD, including its downloadable terminal interface and web-based interface, and the associated tools ShotGun and DesignPlanner, including documentation, examples and executables, are available at http://www.unc.edu/∼yunmli/AbCD.html.
Contact: yunli@med.unc.edu
doi:10.1093/bioinformatics/btt041
PMCID: PMC3597143  PMID: 23357921
24.  Bayesian analysis of gene essentiality based on sequencing of transposon insertion libraries 
Bioinformatics  2013;29(6):695-703.
Motivation: Next-generation sequencing affords an efficient analysis of transposon insertion libraries, which can be used to identify essential genes in bacteria. To analyse this high-resolution data, we present a formal Bayesian framework for estimating the posterior probability of essentiality for each gene, using the extreme-value distribution to characterize the statistical significance of the longest region lacking insertions within a gene. We describe a sampling procedure based on the Metropolis–Hastings algorithm to calculate posterior probabilities of essentiality while simultaneously integrating over unknown internal parameters.
Results: Using a sequence dataset from a transposon library for Mycobacterium tuberculosis, we show that this Bayesian approach predicts essential genes that correspond well with genes shown to be essential in previous studies. Furthermore, we show that by using the extreme-value distribution to characterize genomic regions lacking transposon insertions, this method is capable of identifying essential domains within genes. This approach can be used for analysing transposon libraries in other organisms and augmenting essentiality predictions with statistical confidence scores.
Availability: A python script implementing the method described is available for download from http://saclab.tamu.edu/essentiality/.
Contact: michael.dejesus@tamu.edu or ioerger@cs.tamu.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btt043
PMCID: PMC3597147  PMID: 23361328

Results 1-25 (2182)