Results 1-10 (10)

1.  The distribution and mutagenesis of short coding INDELs from 1,128 whole exomes 
BMC Genomics  2015;16(1):143.
Identifying insertion/deletion polymorphisms (INDELs) with high confidence has been intrinsically challenging in short-read sequencing data. Here we report our approach for improving INDEL calling accuracy by using a machine learning algorithm to combine call sets generated with three independent methods, and by leveraging the strengths of each individual pipeline. Utilizing this approach, we generated a consensus exome INDEL call set from a large dataset generated by the 1000 Genomes Project (1000G), maximizing both the sensitivity and the specificity of the calls.
This consensus exome INDEL call set features 7,210 INDELs, from 1,128 individuals across 13 populations included in the 1000 Genomes Phase 1 dataset, with a false discovery rate (FDR) of about 7.0%.
In our study we further characterize the patterns and distributions of these exonic INDELs with respect to density, allele length, and site frequency spectrum, as well as the potential mutagenic mechanisms of coding INDELs in humans.
Electronic supplementary material
The online version of this article (doi:10.1186/s12864-015-1333-7) contains supplementary material, which is available to authorized users.
PMCID: PMC4352271  PMID: 25765891
INDEL; 1000 Genomes Project; Distribution; Mutagenesis
2.  Bioinformatic Tools for Identifying Disease Gene and SNP Candidates 
As databases of genome data continue to grow, our understanding of the functional elements of the genome grows as well. Many genetic changes in the genome have now been discovered and characterized, including both disease-causing mutations and neutral polymorphisms. In addition to experimental approaches to characterize specific variants, over the past decade, there has been intense bioinformatic research to understand the molecular effects of these genetic changes. In addition to genomic experimental assays, the bioinformatic efforts have focused on two general areas. First, researchers have annotated genetic variation data with molecular features that are likely to affect function. Second, statistical methods have been developed to predict mutations that are likely to have a molecular effect. In this protocol manuscript, methods for understanding the molecular functions of single nucleotide polymorphisms (SNPs) and mutations are reviewed and described. The intent of this chapter is to provide an introduction to the online tools that are both easy to use and useful.
PMCID: PMC3957484  PMID: 20238089
Single nucleotide polymorphism; SNP; Genetic disease; Candidate gene; Genome; Bioinformatics; Machine learning
3.  Integrative Annotation of Variants from 1092 Humans: Application to Cancer Genomics 
Science (New York, N.Y.)  2013;342(6154):1235587.
Interpreting variants, especially noncoding ones, in the increasing number of personal genomes is challenging. We used patterns of polymorphisms in functionally annotated regions in 1092 humans to identify deleterious variants; then we experimentally validated candidates. We analyzed both coding and noncoding regions, with the former corroborating the latter. We found regions particularly sensitive to mutations (“ultrasensitive”) and variants that are disruptive because of mechanistic effects on transcription-factor binding (that is, “motif-breakers”). We also found variants in regions with higher network centrality tend to be deleterious. Insertions and deletions followed a similar pattern to single-nucleotide variants, with some notable exceptions (e.g., certain deletions and enhancers). On the basis of these patterns, we developed a computational tool (FunSeq), whose application to ~90 cancer genomes reveals nearly a hundred candidate noncoding drivers.
PMCID: PMC3947637  PMID: 24092746
4.  Tor1 regulates protein solubility in Saccharomyces cerevisiae 
Molecular Biology of the Cell  2012;23(24):4679-4688.
The transition of proteins targeted for autophagic degradation from the soluble to the insoluble phase is regulated in an ATG1-independent mechanism by TORC1. This process is likely a critical mechanism for maintaining protein homeostasis when challenged with proteomic stress.
Accumulation of insoluble protein in cells is associated with aging and aging-related diseases; however, the roles of insoluble protein in these processes are uncertain. The nature and impact of changes to protein solubility during normal aging are less well understood. Using quantitative mass spectrometry, we identify 480 proteins that become insoluble during postmitotic aging in Saccharomyces cerevisiae and show that this ensemble of insoluble proteins is similar to those that accumulate in aging nematodes. SDS-insoluble protein is present exclusively in a nonquiescent subpopulation of postmitotic cells, indicating an asymmetrical distribution of this protein. In addition, we show that nitrogen starvation of young cells is sufficient to cause accumulation of a similar group of insoluble proteins. Although many of the insoluble proteins identified are known to be autophagic substrates, induction of macroautophagy is not required for insoluble protein formation. However, genetic or chemical inhibition of the Tor1 kinase is sufficient to promote accumulation of insoluble protein. We conclude that target of rapamycin complex 1 regulates accumulation of insoluble proteins via mechanisms acting upstream of macroautophagy. Our data indicate that the accumulation of proteins in an SDS-insoluble state in postmitotic cells represents a novel autophagic cargo preparation process that is regulated by the Tor1 kinase.
PMCID: PMC3521677  PMID: 23097491
5.  STOP using just GO: a multi-ontology hypothesis generation tool for high throughput experimentation 
BMC Bioinformatics  2013;14:53.
Gene Ontology (GO) enrichment analysis remains one of the most common methods for hypothesis generation from high throughput datasets. However, we believe that researchers strive to test other hypotheses that fall outside of GO. Here, we developed and evaluated a tool for hypothesis generation from gene or protein lists using ontological concepts present in manually curated text that describes those genes and proteins.
As a consequence we have developed the method Statistical Tracking of Ontological Phrases (STOP) that expands the realm of testable hypotheses in gene set enrichment analyses by integrating automated annotations of genes to terms from over 200 biomedical ontologies. While not as precise as manually curated terms, we find that the additional enriched concepts have value when coupled with traditional enrichment analyses using curated terms.
Multiple ontologies have been developed for gene and protein annotation. By using a dataset of both manually curated GO terms and automatically recognized concepts from curated text, we can expand the realm of hypotheses that can be discovered. The web application STOP is available at
PMCID: PMC3635999  PMID: 23409969
6.  Atlas2 Cloud: a framework for personal genome analysis in the cloud 
BMC Genomics  2012;13(Suppl 6):S19.
Until recently, sequencing has primarily been carried out in large genome centers that have invested heavily in developing the computational infrastructure that enables genomic sequence analysis. Recent advancements in next generation sequencing (NGS) have led to a wide dissemination of sequencing technologies and data to highly diverse research groups. It is expected that clinical sequencing will become part of diagnostic routines shortly. However, limited accessibility to computational infrastructure and high quality bioinformatic tools, and the demand for personnel skilled in data analysis and interpretation, remain serious bottlenecks. Cloud computing and Software-as-a-Service (SaaS) technologies can help address these issues.
We successfully enabled the Atlas2 Cloud pipeline for personal genome analysis on two different cloud service platforms: a community cloud via the Genboree Workbench, and a commercial cloud via the Amazon Web Services using Software-as-a-Service model. We report a case study of personal genome analysis using our Atlas2 Genboree pipeline. We also outline a detailed cost structure for running Atlas2 Amazon on whole exome capture data, providing cost projections in terms of storage, compute and I/O when running Atlas2 Amazon on a large data set.
We find that providing a web interface and an optimized pipeline clearly facilitates usage of cloud computing for personal genome analysis, but for it to be routinely used for large scale projects there needs to be a paradigm shift in the way we develop tools, in standard operating procedures, and in funding mechanisms.
PMCID: PMC3481437  PMID: 23134663
7.  An integrative variant analysis suite for whole exome next-generation sequencing data 
BMC Bioinformatics  2012;13:8.
Whole exome capture sequencing allows researchers to cost-effectively sequence the coding regions of the genome. Although the exome capture sequencing methods have become routine and well established, there is currently a lack of tools specialized for variant calling in this type of data.
Using statistical models trained on validated whole-exome capture sequencing data, the Atlas2 Suite is an integrative variant analysis pipeline optimized for variant discovery on all three of the widely used next generation sequencing platforms (SOLiD, Illumina, and Roche 454). The suite employs logistic regression models in conjunction with user-adjustable cutoffs to accurately separate true SNPs and INDELs from sequencing and mapping errors with high sensitivity (96.7%).
We have implemented the Atlas2 Suite and applied it to 92 whole exome samples from the 1000 Genomes Project. The Atlas2 Suite is available for download. In addition to a command line version, the suite has been integrated into the Genboree Workbench, allowing biomedical scientists with minimal informatics expertise to remotely call, view, and further analyze variants through a simple web interface. The existing genomic databases displayed via the Genboree browser also streamline the process from variant discovery to functional genomics analysis, resulting in an off-the-shelf toolkit for the broader community.
PMCID: PMC3292476  PMID: 22239737
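The abstract above describes logistic regression models combined with user-adjustable cutoffs to separate true variants from errors. A minimal sketch of that general idea follows. It is not the Atlas2 implementation: the function names, feature values, weights, and intercept are all hypothetical and chosen only for illustration.

```python
from math import exp


def logistic_probability(features, weights, intercept):
    """Return the logistic-model probability that a candidate variant is real.

    `features`, `weights`, and `intercept` are hypothetical placeholders;
    a real pipeline would learn the weights from validated calls.
    """
    z = intercept + sum(w * x for w, x in zip(weights, features))
    # Numerically stable sigmoid: avoid overflow for large negative z.
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    e = exp(z)
    return e / (1.0 + e)


def call_variant(features, weights, intercept, cutoff=0.5):
    """Accept the candidate if its probability meets the user-chosen cutoff."""
    return logistic_probability(features, weights, intercept) >= cutoff


if __name__ == "__main__":
    # Illustrative inputs only, e.g. scaled quality and variant-read fraction.
    candidate = [0.8, 0.45]
    weights = [2.0, 3.0]
    intercept = -2.0
    print(logistic_probability(candidate, weights, intercept))
    print(call_variant(candidate, weights, intercept, cutoff=0.6))
```

Raising the cutoff trades sensitivity for specificity, which is the role an adjustable cutoff plays in this kind of classifier.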
8.  In Silico Functional Profiling of Human Disease-Associated and Polymorphic Amino Acid Substitutions 
Human mutation  2010;31(3):335-346.
An important challenge in translational bioinformatics is to understand how genetic variation gives rise to molecular changes at the protein level that can precipitate both monogenic and complex disease. To this end, we compiled datasets of human disease-associated amino acid substitutions (AAS) in the contexts of inherited monogenic disease, complex disease, functional polymorphisms with no known disease association, and somatic mutations in cancer, and compared them with respect to predicted functional sites in proteins. When the sequence homology-based tool SIFT was used to estimate the proportion of deleterious AAS in each dataset, only complex disease AAS were found to be indistinguishable from neutral polymorphic AAS. Monogenic disease AAS predicted to be non-deleterious by SIFT were characterized by a significant enrichment for inherited AAS within solvent accessible residues, regions of intrinsic protein disorder, and an association with the loss or gain of various post-translational modifications. Sites of structural and/or functional interest were therefore surmised to constitute useful additional features with which to identify the molecular disruptions caused by deleterious AAS. A range of bioinformatic tools, designed to predict structural and functional sites in protein sequences, were then employed to demonstrate that intrinsic biases exist in terms of the distribution of different types of human AAS with respect to specific structural, functional and pathological features. Our web tool, designed to potentiate the functional profiling of novel AAS, has been made available at
PMCID: PMC3098813  PMID: 20052762
amino acid substitutions; missense mutations; translational bioinformatics; disease mechanism; association study; SNP
9.  An Ontology-Neutral Framework for Enrichment Analysis 
Advanced statistical methods used to analyze high-throughput data (e.g. gene-expression assays) result in long lists of “significant genes.” One way to gain insight into the significance of altered expression levels is to determine whether Gene Ontology (GO) terms associated with a particular biological process, molecular function, or cellular component are over- or under-represented in the set of genes deemed significant. This process, referred to as enrichment analysis, profiles a gene-set, and is relevant for and extensible to data analysis with other high-throughput measurement modalities such as proteomics, metabolomics, and tissue-microarray assays. With the availability of tools for automatic ontology-based annotation of datasets with terms from biomedical ontologies besides the GO, we need not restrict enrichment analysis to the GO. We describe RANSUM (Rich Annotation Summarizer), which performs enrichment analysis using any ontology in the National Center for Biomedical Ontology’s (NCBO) BioPortal. We outline the methodology of enrichment analysis and the associated challenges, and discuss novel analyses enabled by RANSUM.
PMCID: PMC3041299  PMID: 21347088
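The over-representation question described in the abstract above is commonly framed as a hypergeometric tail test. The sketch below shows that common formulation using only the Python standard library. It is an assumption that this matches what RANSUM computes; the function name and the counts in the example are hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """Upper-tail hypergeometric probability P(X >= hits).

    population: number of genes in the background set
    annotated:  background genes annotated with the ontology term
    sample:     size of the "significant" gene list
    hits:       genes in the list that carry the term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    # Sum the probability of observing `hits` or more annotated genes by chance.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical counts: 40 of 200 listed genes carry a term that
    # annotates 500 of 20,000 background genes.
    print(enrichment_p_value(20000, 500, 200, 40))
```

A small p-value suggests the term is over-represented in the list. In practice one such test is run per term, so a multiple-testing correction is usually applied across terms.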
10.  Proteomic analysis of age-dependent changes in protein solubility identifies genes that modulate lifespan 
Aging Cell  2012;11(1):120-127.
While it is generally recognized that misfolding of specific proteins can cause late-onset disease, the contribution of protein aggregation to the normal aging process is less well understood. To address this issue, a mass spectrometry-based proteomic analysis was performed to identify proteins that adopt sodium dodecyl sulfate (SDS)-insoluble conformations during aging in Caenorhabditis elegans. SDS-insoluble proteins extracted from young and aged C. elegans were chemically labeled by isobaric tagging for relative and absolute quantification (iTRAQ) and identified by liquid chromatography and mass spectrometry. Two hundred and three proteins were identified as being significantly enriched in an SDS-insoluble fraction in aged nematodes and were largely absent from a similar protein fraction in young nematodes. The SDS-insoluble fraction in aged animals contains a diverse range of proteins including a large number of ribosomal proteins. Gene ontology analysis revealed highly significant enrichments for energy production and translation functions. Expression of genes encoding insoluble proteins observed in aged nematodes was knocked down using RNAi, and effects on lifespan were measured. 41% of genes tested were shown to extend lifespan after RNAi treatment, compared with 18% in a control group of genes. These data indicate that genes encoding proteins that become insoluble with age are enriched for modifiers of lifespan. This demonstrates that proteomic approaches can be used to identify genes that modify lifespan. Finally, these observations indicate that the accumulation of insoluble proteins with diverse functions may be a general feature of aging.
PMCID: PMC3437485  PMID: 22103665
C. elegans; lifespan; aging; protein solubility; protein aggregation
