Results 1-16 (16)
 

1.  Accurate and Robust Prediction of Genetic Relationship from Whole-Genome Sequences 
PLoS ONE  2014;9(2):e85437.
Computing the genetic relationship between two humans is important to studies in genetics, genomics, genealogy, and forensics. Relationship algorithms may be sensitive to noise, such as that arising from sequencing errors or imperfect reference genomes. We developed an algorithm for estimation of genetic relationship by averaged blocks (GRAB) that is designed for whole-genome sequencing (WGS) data. GRAB segments the genome into blocks, calculates the fraction of blocks sharing identity, and then uses a classification tree to infer 1st- to 5th- degree relationships and unrelated individuals. We evaluated GRAB on simulated and real sequenced families, and compared it with other software. GRAB achieves similar performance, and does not require knowledge of population background or phasing. GRAB can be used in workflows for identifying unreported relationships, validating reported relationships in family-based studies, and detection of sample-tracking errors or duplicate inclusion. The software is available at familygenomics.systemsbiology.net/grab.
doi:10.1371/journal.pone.0085437
PMCID: PMC3938395  PMID: 24586241
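The block-averaging idea in entry 1 can be illustrated with a minimal sketch. This is not the GRAB implementation: the genotype encoding (0, 1, or 2 copies of the alternate allele), the rule that a block "shares identity" when it contains no opposite-homozygous sites, and the names `block_sharing_fraction`, `block_size`, and `max_discordant` are all assumptions made for illustration. The classification tree that maps the fraction to a degree of relationship is not reproduced.

```python
from typing import Sequence


def block_sharing_fraction(
    g1: Sequence[int],
    g2: Sequence[int],
    block_size: int = 4,
    max_discordant: int = 0,
) -> float:
    """Return the fraction of fixed-size blocks consistent with shared identity.

    Genotypes are assumed to be alternate-allele counts (0, 1, or 2).
    A site is counted as discordant when the two samples are opposite
    homozygotes (0 versus 2). This rule is an illustrative assumption.
    """
    if len(g1) != len(g2):
        raise ValueError("genotype vectors must have the same length")
    n_blocks = 0
    n_shared = 0
    for start in range(0, len(g1), block_size):
        b1 = g1[start:start + block_size]
        b2 = g2[start:start + block_size]
        discordant = sum(1 for a, b in zip(b1, b2) if abs(a - b) == 2)
        n_blocks += 1
        if discordant <= max_discordant:
            n_shared += 1
    return n_shared / n_blocks if n_blocks else 0.0


sample_a = [0, 0, 2, 2, 1, 1, 0, 2]
sample_b = [0, 1, 2, 1, 2, 1, 2, 0]
fraction = block_sharing_fraction(sample_a, sample_b, block_size=4)
```

In this toy input the first block has no opposite-homozygous sites and the second has two, so half of the blocks count as shared.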
2.  Relationship Estimation from Whole-Genome Sequence Data 
PLoS Genetics  2014;10(1):e1004144.
The determination of the relationship between a pair of individuals is a fundamental application of genetics. Previously, we and others have demonstrated that identity-by-descent (IBD) information generated from high-density single-nucleotide polymorphism (SNP) data can greatly improve the power and accuracy of genetic relationship detection. Whole-genome sequencing (WGS) marks the final step in increasing genetic marker density by assaying all single-nucleotide variants (SNVs), and thus has the potential to further improve relationship detection by enabling more accurate detection of IBD segments and more precise resolution of IBD segment boundaries. However, WGS introduces new complexities that must be addressed in order to achieve these improvements in relationship detection. To evaluate these complexities, we estimated genetic relationships from WGS data for 1490 known pairwise relationships among 258 individuals in 30 families along with 46 population samples as controls. We identified several genomic regions with excess pairwise IBD in both the pedigree and control datasets using three established IBD methods: GERMLINE, fastIBD, and ISCA. These spurious IBD segments produced a 10-fold increase in the rate of detected false-positive relationships among controls compared to high-density microarray datasets. To address this issue, we developed a new method to identify and mask genomic regions with excess IBD. This method, implemented in ERSA 2.0, fully resolved the inflated cryptic relationship detection rates while improving relationship estimation accuracy. ERSA 2.0 detected all 1st through 6th degree relationships, and 55% of 9th through 11th degree relationships in the 30 families. We estimate that WGS data provides a 5% to 15% increase in relationship detection power relative to high-density microarray data for distant relationships. 
Our results identify regions of the genome that are highly problematic for IBD mapping and introduce new software to accurately detect 1st through 9th degree relationships from whole-genome sequence data.
Author Summary
The determination of the relationship between a pair of individuals is a fundamental application of genetics. The most accurate methods for relationship estimation rely on precise, localized estimates of genetic sharing between individuals. Earlier methods have generated these estimates from high-density genetic marker data. We performed relationship estimation using whole-genome sequence data for 1490 known pairwise relationships among 258 individuals in 30 families along with 46 population samples as controls. Our results demonstrate that complexities specific to whole-genome sequencing result in regions of the genome that are prone to false-positive estimates of genetic sharing. We provide a map of these spurious IBD regions and introduce new methods, implemented in the software package ERSA 2.0, to control for spurious IBD. We show that ERSA 2.0 provides a 5% to 15% increase in relationship detection power for distant relationships with whole-genome sequence data relative to high-density genetic marker data.
doi:10.1371/journal.pgen.1004144
PMCID: PMC3907355  PMID: 24497848
3.  Paramecium bursaria Chlorella Virus 1 Proteome Reveals Novel Architectural and Regulatory Features of a Giant Virus 
Journal of Virology  2012;86(16):8821-8834.
The 331-kbp chlorovirus Paramecium bursaria chlorella virus 1 (PBCV-1) genome was resequenced and annotated to correct errors in the original 15-year-old sequence; 40 codons was considered the minimum protein size of an open reading frame. PBCV-1 has 416 predicted protein-encoding sequences and 11 tRNAs. A proteome analysis was also conducted on highly purified PBCV-1 virions using two mass spectrometry-based protocols. The mass spectrometry-derived data were compared to PBCV-1 and its host Chlorella variabilis NC64A predicted proteomes. Combined, these analyses revealed 148 unique virus-encoded proteins associated with the virion (about 35% of the coding capacity of the virus) and 1 host protein. Some of these proteins appear to be structural/architectural, whereas others have enzymatic, chromatin modification, and signal transduction functions. Most (106) of the proteins have no known function or homologs in the existing gene databases except as orthologs with proteins of other chloroviruses, phycodnaviruses, and nuclear-cytoplasmic large DNA viruses. The genes encoding these proteins are dispersed throughout the virus genome, and most are transcribed late or early-late in the infection cycle, which is consistent with virion morphogenesis.
doi:10.1128/JVI.00907-12
PMCID: PMC3421733  PMID: 22696644
4.  Kaviar: an accessible system for testing SNV novelty 
Bioinformatics  2011;27(22):3216-3217.
Summary: With the rapidly expanding availability of data from personal genomes, exomes and transcriptomes, medical researchers will frequently need to test whether observed genomic variants are novel or known. This task requires downloading and handling large and diverse datasets from a variety of sources, and processing them with bioinformatics tools and pipelines. Alternatively, researchers can upload data to online tools, which may conflict with privacy requirements. We present here Kaviar, a tool that greatly simplifies the assessment of novel variants. Kaviar includes: (i) an integrated and growing database of genomic variation from diverse sources, including over 55 million variants from personal genomes, family genomes, transcriptomes, SNV databases and population surveys; and (ii) software for querying the database efficiently.
Availability: Kaviar is programmed in Perl and offered free of charge as Open Source Software. Kaviar may be used online as a programmatic web service or downloaded for local use from http://db.systemsbiology.net/kaviar. The database is also provided.
Contact: gustavo@systemsbiology.org
Supplementary Information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btr540
PMCID: PMC3208392  PMID: 21965822
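The novelty test described in entry 4 amounts to a membership lookup against a set of known variants. The sketch below is a hypothetical illustration in Python, not Kaviar's Perl code or its web-service interface; the `Variant` record and the `split_by_novelty` function are invented names, and the variant key (chromosome, position, reference allele, alternate allele) is an assumption.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Variant(NamedTuple):
    """Hypothetical variant key; real databases may key variants differently."""
    chrom: str
    pos: int
    ref: str
    alt: str


def split_by_novelty(
    observed: Iterable[Variant], known: Iterable[Variant]
) -> Tuple[List[Variant], List[Variant]]:
    """Partition observed variants into (novel, previously_seen) lists."""
    known_set = set(known)  # hashing gives constant-time lookups on average
    novel: List[Variant] = []
    seen: List[Variant] = []
    for variant in observed:
        (seen if variant in known_set else novel).append(variant)
    return novel, seen


known_variants = [Variant("1", 100, "A", "G")]
observed_variants = [Variant("1", 100, "A", "G"), Variant("2", 200, "C", "T")]
novel, seen = split_by_novelty(observed_variants, known_variants)
```

Running the lookup locally, as here, keeps the observed genotypes on the researcher's machine, which is the privacy concern the abstract raises about uploading data to online tools.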
5.  Analysis of Genetic Inheritance in a Family Quartet by Whole Genome Sequencing 
Science (New York, N.Y.)  2010;328(5978):636-639.
We analyzed the whole genome sequences of a family of four, consisting of two siblings and their parents. Family-based sequencing allowed us to delineate recombination sites precisely, identify 70% of the sequencing errors, and identify very rare SNVs. We also directly estimated a human intergeneration mutation rate of ∼1.1×10⁻⁸ per position per haploid genome. Both offspring in this family have two recessive disorders--Miller syndrome, for which the gene was concurrently identified, and primary ciliary dyskinesia, for which causative genes have been previously identified. Family-based genome analysis enabled us to narrow the candidate genes for both of these Mendelian disorders to only four. Our results demonstrate the unique value of complete genome sequencing in families.
doi:10.1126/science.1186802
PMCID: PMC3037280  PMID: 20220176
whole genome sequencing; rare genetic disease; inheritance analysis; recessive models; de novo mutations; recombination hotspot; crossover; haploidentity; haploidentical block; inheritance state; inheritance vector; HMM; haplotype; Miller syndrome; POADS; DHODH; DNAH5; KIAA0556; CES1
6.  TFCat: the curated catalog of mouse and human transcription factors 
Genome Biology  2009;10(3):R29.
TFCat is a catalog of mouse and human transcription factors based on a reliable core collection of annotations obtained by expert review of the scientific literature.
Unravelling regulatory programs governed by transcription factors (TFs) is fundamental to understanding biological systems. TFCat is a catalog of mouse and human TFs based on a reliable core collection of annotations obtained by expert review of the scientific literature. The collection, including proven and homology-based candidate TFs, is annotated within a function-based taxonomy and DNA-binding proteins are organized within a classification system. All data and user-feedback mechanisms are available at the TFCat portal.
doi:10.1186/gb-2009-10-3-r29
PMCID: PMC2691000  PMID: 19284633
7.  High Functional Diversity in Mycobacterium tuberculosis Driven by Genetic Drift and Human Demography 
PLoS Biology  2008;6(12):e311.
Mycobacterium tuberculosis infects one third of the human world population and kills someone every 15 seconds. For more than a century, scientists and clinicians have been distinguishing between the human- and animal-adapted members of the M. tuberculosis complex (MTBC). However, all human-adapted strains of MTBC have traditionally been considered to be essentially identical. We surveyed sequence diversity within a global collection of strains belonging to MTBC using seven megabase pairs of DNA sequence data. We show that the members of MTBC affecting humans are more genetically diverse than generally assumed, and that this diversity can be linked to human demographic and migratory events. We further demonstrate that these organisms are under extremely reduced purifying selection and that, as a result of increased genetic drift, much of this genetic diversity is likely to have functional consequences. Our findings suggest that the current increases in human population, urbanization, and global travel, combined with the population genetic characteristics of M. tuberculosis described here, could contribute to the emergence and spread of drug-resistant tuberculosis.
Author Summary
Tuberculosis remains a worldwide public health emergency. The emergence of drug-resistant forms of tuberculosis in many parts of the world is threatening to make this important human disease incurable. Even though many resources are being invested into the development of new tuberculosis control tools, we still do not know the extent of genetic diversity in tuberculosis bacteria, nor do we understand the evolutionary forces that shape this diversity. To address these questions, we studied a large collection of human tuberculosis strains using DNA sequencing. We found that strains originating in different parts of the world are more genetically diverse than previously recognized. Our results also suggest that much of this diversity has functional consequences and could affect the efficacy of new tuberculosis diagnostics, drugs, and vaccines. Furthermore, we found that the global diversity in tuberculosis strains can be linked to the ancient human migrations out of Africa, as well as to more recent movements that followed the increases of human populations in Europe, India, and China during the past few hundred years. Taken together, our findings suggest that the evolutionary characteristics of tuberculosis bacteria could synergize with the effects of increasing globalization and human travel to enhance the global spread of drug-resistant tuberculosis.
DNA sequence analysis of a global collection of M. tuberculosis strains reveals high functional diversity, severely reduced selective constraint, and global spread through both ancient and recent human migrations.
doi:10.1371/journal.pbio.0060311
PMCID: PMC2602723  PMID: 19090620
10.  Uncovering a Macrophage Transcriptional Program by Integrating Evidence from Motif Scanning and Expression Dynamics 
PLoS Computational Biology  2008;4(3):e1000021.
Macrophages are versatile immune cells that can detect a variety of pathogen-associated molecular patterns through their Toll-like receptors (TLRs). In response to microbial challenge, the TLR-stimulated macrophage undergoes an activation program controlled by a dynamically inducible transcriptional regulatory network. Mapping a complex mammalian transcriptional network poses significant challenges and requires the integration of multiple experimental data types. In this work, we inferred a transcriptional network underlying TLR-stimulated murine macrophage activation. Microarray-based expression profiling and transcription factor binding site motif scanning were used to infer a network of associations between transcription factor genes and clusters of co-expressed target genes. The time-lagged correlation was used to analyze temporal expression data in order to identify potential causal influences in the network. A novel statistical test was developed to assess the significance of the time-lagged correlation. Several associations in the resulting inferred network were validated using targeted ChIP-on-chip experiments. The network incorporates known regulators and gives insight into the transcriptional control of macrophage activation. Our analysis identified a novel regulator (TGIF1) that may have a role in macrophage activation.
Author Summary
Macrophages play a vital role in host defense against infection by recognizing pathogens through pattern recognition receptors, such as the Toll-like receptors (TLRs), and mounting an immune response. Stimulation of TLRs initiates a complex transcriptional program in which induced transcription factor genes dynamically regulate downstream genes. Microarray-based transcriptional profiling has proved useful for mapping such transcriptional programs in simpler model organisms; however, mammalian systems present difficulties such as post-translational regulation of transcription factors, combinatorial gene regulation, and a paucity of available gene-knockout expression data. Additional evidence sources, such as DNA sequence-based identification of transcription factor binding sites, are needed. In this work, we computationally inferred a transcriptional network for TLR-stimulated murine macrophages. Our approach combined sequence scanning with time-course expression data in a probabilistic framework. Expression data were analyzed using the time-lagged correlation. A novel, unbiased method was developed to assess the significance of the time-lagged correlation. The inferred network of associations between transcription factor genes and co-expressed gene clusters was validated with targeted ChIP-on-chip experiments, and yielded insights into the macrophage activation program, including a potential novel regulator. Our general approach could be used to analyze other complex mammalian systems for which time-course expression data are available.
doi:10.1371/journal.pcbi.1000021
PMCID: PMC2265556  PMID: 18369420
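The time-lagged correlation mentioned in entry 10 can be sketched as a Pearson correlation between a regulator's expression at time t and a target's expression at time t + lag. This is a minimal illustration under assumptions: the function names are hypothetical, time points are taken to be evenly spaced, and the authors' significance test is not reproduced.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def time_lagged_correlation(
    regulator: Sequence[float], target: Sequence[float], lag: int
) -> float:
    """Correlate regulator[t] with target[t + lag] for a non-negative lag."""
    if lag < 0:
        raise ValueError("lag must be non-negative")
    if lag == 0:
        return pearson(regulator, target)
    return pearson(regulator[:-lag], target[lag:])


# Toy series in which the target follows the regulator by one time step.
tf_expression = [1.0, 3.0, 2.0, 5.0, 4.0]
target_expression = [0.0, 1.0, 3.0, 2.0, 5.0]
r_lag1 = time_lagged_correlation(tf_expression, target_expression, lag=1)
```

A correlation that is stronger at a positive lag than at lag zero is the kind of pattern the abstract treats as a potential causal influence of the transcription factor on the target cluster.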
11.  The Innate Immune Database (IIDB) 
BMC Immunology  2008;9:7.
Background
As part of a National Institute of Allergy and Infectious Diseases funded collaborative project, we have performed over 150 microarray experiments measuring the response of C57/BL6 mouse bone marrow macrophages to toll-like receptor stimuli. These microarray expression profiles are available freely from our project web site. Here, we report the development of a database of computationally predicted transcription factor binding sites and related genomic features for a set of over 2000 murine immune genes of interest. Our database, which includes microarray co-expression clusters and a host of web-based query, analysis and visualization facilities, is available freely via the internet. It provides a broad resource to the research community, and a stepping stone towards the delineation of the network of transcriptional regulatory interactions underlying the integrated response of macrophages to pathogens.
Description
We constructed a database indexed on genes and annotations of the immediate surrounding genomic regions. To facilitate both gene-specific and systems biology oriented research, our database provides the means to analyze individual genes or an entire genomic locus. Although our focus to-date has been on mammalian toll-like receptor signaling pathways, our database structure is not limited to this subject, and is intended to be broadly applicable to immunology. By focusing on selected immune-active genes, we were able to perform computationally intensive expression and sequence analyses that would currently be prohibitive if applied to the entire genome. Using six complementary computational algorithms and methodologies, we identified transcription factor binding sites based on the Position Weight Matrices available in TRANSFAC. For one example transcription factor (ATF3) for which experimental data is available, over 50% of our predicted binding sites coincide with genome-wide chromatin immunoprecipitation (ChIP-chip) results. Our database can be interrogated via a web interface. Genomic annotations and binding site predictions can be automatically viewed with a customized version of the Argo genome browser.
Conclusion
We present the Innate Immune Database (IIDB) as a community resource for immunologists interested in gene regulatory systems underlying innate responses to pathogens. The database website can be freely accessed at .
doi:10.1186/1471-2172-9-7
PMCID: PMC2268913  PMID: 18321385
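Binding-site prediction with a position weight matrix, as mentioned in entry 11, can be sketched as a sliding-window log-odds scan. The code below is a generic illustration, not one of the six algorithms used for IIDB: the count matrix is a toy motif invented for the example (it is not a TRANSFAC matrix and not the ATF3 motif), and the uniform background, pseudocount, and threshold are assumptions.

```python
import math
from typing import Dict, List, Tuple

BASES = "ACGT"


def log_odds_matrix(
    counts: List[Dict[str, int]], pseudocount: float = 1.0, background: float = 0.25
) -> List[Dict[str, float]]:
    """Convert per-position base counts into log2 odds against a uniform background."""
    matrix = []
    for column in counts:
        total = sum(column.get(b, 0) for b in BASES) + 4 * pseudocount
        matrix.append({
            b: math.log2(((column.get(b, 0) + pseudocount) / total) / background)
            for b in BASES
        })
    return matrix


def scan(
    sequence: str, matrix: List[Dict[str, float]], threshold: float
) -> List[Tuple[int, str, float]]:
    """Return (offset, window, score) for every window scoring at or above threshold.

    The sequence is assumed to contain only uppercase A, C, G, and T.
    """
    width = len(matrix)
    hits = []
    for i in range(len(sequence) - width + 1):
        window = sequence[i:i + width]
        score = sum(matrix[j][base] for j, base in enumerate(window))
        if score >= threshold:
            hits.append((i, window, score))
    return hits


# Toy three-column motif that strongly prefers "TGA" (illustrative only).
toy_counts = [{"T": 10}, {"G": 10}, {"A": 10}]
toy_matrix = log_odds_matrix(toy_counts)
hits = scan("ACTGACG", toy_matrix, threshold=3.0)
```

Only the window starting at offset 2 matches the toy motif, so it is the single hit above the threshold.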
12.  Conservation of Toll-Like Receptor Signaling Pathways in Teleost Fish 
In mammals, Toll-like receptors (TLR) recognize ligands, including pathogen-associated molecular patterns (PAMPs), and respond with ligand-specific induction of genes. In this study, we establish evolutionary conservation in teleost fish of key components of the TLR-signaling pathway that act as switches for differential gene induction, including MYD88, TIRAP, TRIF, TRAF6, IRF3, and IRF7. We further explore this conservation with a molecular phylogenetic analysis of MYD88. To the extent that current genomic analysis can establish, each vertebrate has one ortholog to each of these genes. For molecular tree construction and phylogeny inference, we demonstrate a methodology for including genes with only partial primary sequences without disrupting the topology provided by the high-confidence full-length sequences. Conservation of the TLR-signaling molecules suggests that the basic program of gene regulation by the TLR-signaling pathway is conserved across vertebrates. To test this hypothesis, leukocytes from a model fish, rainbow trout (Oncorhynchus mykiss), were stimulated with known mammalian TLR agonists including: diacylated and triacylated forms of lipoprotein, flagellin, two forms of LPS, synthetic double-stranded RNA, and two imidazoquinoline compounds (loxoribine and R848). Trout leukocytes responded in vitro to a number of these agonists with distinct patterns of cytokine expression that correspond to mammalian responses. Our results support the key prediction from our phylogenetic analyses that strong selective pressure of pathogenic microbes has preserved both TLR recognition and signaling functions during vertebrate evolution.
doi:10.1016/j.cbd.2005.07.003
PMCID: PMC1524722  PMID: 17330145
pro-inflammatory cytokine; interferon; MYD88; TIRAP; TRIF; TRAF6; IRF3; phylogeny; molecular tree; PHYLIP
13.  Endotoxin recognition: In fish or not in fish?
FEBS letters  2005;579(29):6519-6528.
The interaction between pathogens and their multicellular hosts is initiated by activation of pathogen recognition receptors (PRRs). These receptors, which include most notably members of the toll-like receptor (TLR) family, recognize specific pathogen-associated molecular patterns (PAMPs). TLR4 is a central part of the receptor complex that is involved in the activation of the immune system by lipopolysaccharide (LPS) through the specific recognition of its endotoxic moiety (Lipid A). This is a critical event that is essential for the immune response to Gram-negative bacteria as well as the etiology of endotoxic shock. Interestingly, compared to mammals, fish are resistant to endotoxic shock. This in vivo resistance concurs with in vitro studies demonstrating significantly lowered sensitivity of fish leukocytes to LPS activation. Further, our in vitro analyses demonstrate that in trout mononuclear phagocytes, LPS fails to induce antiviral genes, an event that occurs down-stream of TLR4 and is required for the development of endotoxic shock. Finally, an in silico approach that includes mining of different piscine genomic and EST databases reveals the presence in fish of all of the major TLR signaling elements except for the molecules specifically involved in TLR4-mediated endotoxin recognition and signaling in mammals. Collectively, our analysis questions the existence of TLR4-mediated cellular responses to LPS in fish. We further speculate that other receptors, in particular beta-2 integrins, may play a primary role in the activation of piscine leukocytes by LPS.
doi:10.1016/j.febslet.2005.10.061
PMCID: PMC1365396  PMID: 16297386
innate immunity; pathogen recognition receptors; pathogen-associated molecular patterns; lipopolysaccharide; toll-like receptors; endotoxicity
14.  A Third Approach to Gene Prediction Suggests Thousands of Additional Human Transcribed Regions 
PLoS Computational Biology  2006;2(3):e18.
The identification and characterization of the complete ensemble of genes is a main goal of deciphering the digital information stored in the human genome. Many algorithms for computational gene prediction have been described, ultimately derived from two basic concepts: (1) modeling gene structure and (2) recognizing sequence similarity. Successful hybrid methods combining these two concepts have also been developed. We present a third orthogonal approach to gene prediction, based on detecting the genomic signatures of transcription, accumulated over evolutionary time. We discuss four algorithms based on this third concept: Greens and CHOWDER, which quantify mutational strand biases caused by transcription-coupled DNA repair, and ROAST and PASTA, which are based on strand-specific selection against polyadenylation signals. We combined these algorithms into an integrated method called FEAST, which we used to predict the location and orientation of thousands of putative transcription units not overlapping known genes. Many of the newly predicted transcriptional units do not appear to code for proteins. The new algorithms are particularly apt at detecting genes with long introns and lacking sequence conservation. They therefore complement existing gene prediction methods and will help identify functional transcripts within many apparent “genomic deserts.”
Synopsis
To date, genes have been identified from genomic sequence using two basic concepts: the identification of specific signals delineating the structure of the genes and by similarity to previously known genes. Here the authors describe four novel algorithms based on a third basic concept: the identification and quantification of mutational and selectional effects of transcription. Central to this work is a detailed analysis of interspersed repeats, the “junk DNA” left behind by transposon activity, that is usually discarded when predicting genes even though it amounts to nearly half the human genome. Using the new methodology, the authors identify thousands of potential novel genes, some of which appear not to code for protein products. The new algorithms are particularly apt at detecting genes with long introns and lacking sequence conservation. They therefore complement existing gene prediction methods and will help identify functional transcripts within many “genomic deserts,” regions currently thought to be devoid of genes.
doi:10.1371/journal.pcbi.0020018
PMCID: PMC1391917  PMID: 16543943
15.  Application of affymetrix array and massively parallel signature sequencing for identification of genes involved in prostate cancer progression 
BMC Cancer  2005;5:86.
Background
Affymetrix GeneChip Array and Massively Parallel Signature Sequencing (MPSS) are two high throughput methodologies used to profile transcriptomes. Each method has certain strengths and weaknesses; however, no comparison has been made between the data derived from Affymetrix arrays and MPSS. In this study, two lineage-related prostate cancer cell lines, LNCaP and C4-2, were used for transcriptome analysis with the aim of identifying genes associated with prostate cancer progression.
Methods
Affymetrix GeneChip array and MPSS analyses were performed. Data was analyzed with GeneSpring 6.2 and in-house Perl scripts. Expression array results were verified with RT-PCR.
Results
Comparison of the data revealed that both technologies detected genes the other did not. In LNCaP, 3,180 genes were only detected by Affymetrix and 1,169 genes were only detected by MPSS. Similarly, in C4-2, 4,121 genes were only detected by Affymetrix and 1,014 genes were only detected by MPSS. Analysis of the combined transcriptomes identified 66 genes unique to LNCaP cells and 33 genes unique to C4-2 cells. Expression analysis of these genes in prostate cancer specimens showed CA1 to be highly expressed in bone metastasis but not expressed in primary tumor and EPHA7 to be expressed in normal prostate and primary tumor but not bone metastasis.
Conclusion
Our data indicates that transcriptome profiling with a single methodology will not fully assess the expression of all genes in a cell line. A combination of transcription profiling technologies such as DNA array and MPSS provides a more robust means to assess the expression profile of an RNA sample. Finally, genes that were differentially expressed in cell lines were also differentially expressed in primary prostate cancer and its metastases.
doi:10.1186/1471-2407-5-86
PMCID: PMC1187880  PMID: 16042785
16.  Evolutionary algorithms for the selection of single nucleotide polymorphisms 
BMC Bioinformatics  2003;4:30.
Background
Large databases of single nucleotide polymorphisms (SNPs) are available for use in genomics studies. Typically, investigators must choose a subset of SNPs from these databases to employ in their studies. The choice of subset is influenced by many factors, including estimated or known reliability of the SNP, biochemical factors, intellectual property, cost, and effectiveness of the subset for mapping genes or identifying disease loci. We present an evolutionary algorithm for multiobjective SNP selection.
Results
We implemented a modified version of the Strength-Pareto Evolutionary Algorithm (SPEA2) in Java. Our implementation, Multiobjective Analyzer for Genetic Marker Acquisition (MAGMA), approximates the set of optimal trade-off solutions for large problems in minutes. This set is very useful for the design of large studies, including those oriented towards disease identification, genetic mapping, population studies, and haplotype-block elucidation.
Conclusion
Evolutionary algorithms are particularly suited for optimization problems that involve multiple objectives and a complex search space on which exact methods such as exhaustive enumeration cannot be applied. They provide flexibility with respect to the problem formulation if a problem description evolves or changes. Results are produced as a trade-off front, allowing the user to make informed decisions when prioritizing factors. MAGMA is open source and available at . Evolutionary algorithms are well suited for many other applications in genomics.
doi:10.1186/1471-2105-4-30
PMCID: PMC183839  PMID: 12875658
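The "trade-off front" in entry 16 is the set of non-dominated solutions. The sketch below shows only that filtering step for objectives that are all minimized; it is not SPEA2 and not the MAGMA implementation (which the abstract says is written in Java), and the two objectives in the example, cost and mapping error, are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b on every objective and strictly better on one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the points not dominated by any other point (all objectives minimized).

    This is a simple O(n^2) filter, adequate for small illustrative inputs.
    """
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical SNP subsets scored as (cost, mapping_error); lower is better.
candidates = [(1, 5), (2, 3), (3, 4), (4, 1), (5, 5)]
front = pareto_front(candidates)
```

Here (3, 4) is dominated by (2, 3) and (5, 5) by (1, 5), so three candidates remain for the user to choose among according to their own priorities.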
