Supercentenarians (110 years or older) are the world’s oldest people. Seventy-four are alive worldwide, with twenty-two in the United States. We performed whole-genome sequencing on 17 supercentenarians to explore the genetic basis underlying extreme human longevity. We found no significant evidence of enrichment for a single rare protein-altering variant or for a gene harboring different rare protein-altering variants in supercentenarian compared to control genomes. We followed up on the gene most enriched for rare protein-altering variants in our cohort of supercentenarians, TSHZ3, by sequencing it in a second cohort of 99 long-lived individuals but did not find a significant enrichment. The genome of one supercentenarian had a pathogenic mutation in DSC2, known to predispose to arrhythmogenic right ventricular cardiomyopathy, which is recommended to be reported to this individual as an incidental finding according to a recent position statement by the American College of Medical Genetics and Genomics. Even with this pathogenic mutation, the proband lived to over 110 years. The entire list of rare protein-altering variants and the DNA sequences of all 17 supercentenarian genomes are available as a resource to assist the discovery of the genetic basis of extreme longevity in future studies.
Genomic information reported as haplotypes rather than genotypes will be increasingly important for personalized medicine. Current technologies generate diploid sequence data that is rarely resolved into its constituent haplotypes. Furthermore, paradigms for thinking about genomic information are based on interpreting genotypes rather than haplotypes. Nevertheless, haplotypes have historically been useful in contexts ranging from population genetics to disease-gene mapping efforts. The main approaches for phasing genomic sequence data are molecular haplotyping, genetic haplotyping, and population-based inference. Long-read sequencing technologies are enabling longer molecular haplotypes, and decreases in the cost of whole-genome sequencing are enabling the sequencing of whole-chromosome genetic haplotypes. Hybrid approaches combining high-throughput short-read assembly with strategic approaches that enable physical or virtual binning of reads into haplotypes are enabling multi-gene haplotypes to be generated from single individuals. These techniques can be further combined with genetic and population approaches. Here, we review advances in whole-genome haplotyping approaches and discuss the importance of haplotypes for genomic medicine. Clinical applications include diagnosis by recognition of compound heterozygosity and by phasing regulatory variation to coding variation. Haplotypes, which are more specific than less complex variants such as single nucleotide variants, also have applications in prognostics and diagnostics, in the analysis of tumors, and in typing tissue for transplantation. Future advances will include technological innovations, the application of standard metrics for evaluating haplotype quality, and the development of databases that link haplotypes to disease.
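The role of phasing in recognizing compound heterozygosity can be illustrated with a minimal sketch. It assumes a hypothetical encoding in which each variant in a gene carries a phased genotype string such as "1|0" (damaging allele on the first haplotype) or "0|1" (damaging allele on the second); the function name and the toy positions are illustrative and do not come from any tool discussed in the review.

```python
from typing import List, Tuple

# (position, phased genotype); "1|0" and "0|1" are heterozygous calls whose
# damaging allele sits on haplotype A or haplotype B respectively.
# This encoding is an assumption made for illustration only.
Variant = Tuple[int, str]


def is_compound_heterozygote(variants: List[Variant]) -> bool:
    """True if heterozygous damaging alleles in one gene hit both haplotypes."""
    on_hap_a = [pos for pos, gt in variants if gt == "1|0"]
    on_hap_b = [pos for pos, gt in variants if gt == "0|1"]
    return bool(on_hap_a) and bool(on_hap_b)


# Two damaging variants in trans: both copies of the gene are affected.
in_trans = [(101, "1|0"), (250, "0|1")]
# The same two variants in cis: one copy of the gene remains intact.
in_cis = [(101, "1|0"), (250, "1|0")]

print(is_compound_heterozygote(in_trans))
print(is_compound_heterozygote(in_cis))
```

With unphased genotypes the two cases above are indistinguishable, since both appear as two heterozygous calls in the same gene.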
High-throughput sequencing of related individuals has become an important tool for studying human disease. However, owing to technical complexity and lack of available tools, most pedigree-based sequencing studies rely on an ad hoc combination of suboptimal analyses. Here we present pedigree-VAAST (pVAAST), a disease-gene identification tool designed for high-throughput sequence data in pedigrees. pVAAST uses a sequence-based model to perform variant and gene-based linkage analysis. Linkage information is then combined with functional prediction and rare variant case-control association information in a unified statistical framework. pVAAST outperformed linkage and rare-variant association tests in simulations and identified disease-causing genes from whole-genome sequence data in three human pedigrees with dominant, recessive and de novo inheritance patterns. The approach is robust to incomplete penetrance and locus heterogeneity and is applicable to a wide variety of genetic traits. pVAAST maintains high power across study designs ranging from monogenic, high-penetrance phenotypes in a single pedigree to highly polygenic, common phenotypes involving hundreds of pedigrees.
Computing the genetic relationship between two humans is important to studies in genetics, genomics, genealogy, and forensics. Relationship algorithms may be sensitive to noise, such as that arising from sequencing errors or imperfect reference genomes. We developed an algorithm for estimation of genetic relationship by averaged blocks (GRAB) that is designed for whole-genome sequencing (WGS) data. GRAB segments the genome into blocks, calculates the fraction of blocks sharing identity, and then uses a classification tree to infer 1st- to 5th- degree relationships and unrelated individuals. We evaluated GRAB on simulated and real sequenced families, and compared it with other software. GRAB achieves similar performance, and does not require knowledge of population background or phasing. GRAB can be used in workflows for identifying unreported relationships, validating reported relationships in family-based studies, and detection of sample-tracking errors or duplicate inclusion. The software is available at familygenomics.systemsbiology.net/grab.
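The block-based idea described above can be sketched roughly as follows. This is not the GRAB implementation: the genotype encoding (0, 1, or 2 copies of the alternate allele), the block size, and the rule that a block "shares identity" when it contains no opposite-homozygote site are all simplifying assumptions chosen for illustration, and the classification tree is omitted.

```python
from typing import Sequence


def block_sharing_fraction(g1: Sequence[int], g2: Sequence[int],
                           block_size: int) -> float:
    """Fraction of blocks with no opposite-homozygote site (0 vs 2)."""
    if len(g1) != len(g2):
        raise ValueError("genotype vectors must have equal length")
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    n_blocks = 0
    n_shared = 0
    for start in range(0, len(g1), block_size):
        pairs = zip(g1[start:start + block_size], g2[start:start + block_size])
        # |a - b| == 2 means one sample is 0/0 and the other 1/1,
        # so the two samples cannot share an allele at that site.
        compatible = all(abs(a - b) < 2 for a, b in pairs)
        n_blocks += 1
        n_shared += int(compatible)
    return n_shared / n_blocks if n_blocks else 0.0


# Toy genotype vectors (hypothetical data).
parent = [0, 1, 2, 1, 0, 2, 1, 1]
child = [1, 1, 1, 0, 0, 2, 2, 1]
stranger = [2, 1, 0, 1, 2, 2, 1, 1]

print(block_sharing_fraction(parent, child, block_size=2))
print(block_sharing_fraction(parent, stranger, block_size=2))
```

Averaging over blocks rather than single sites is what gives this kind of statistic some tolerance to isolated genotype errors; a real classifier would then map the resulting fractions to relationship degrees.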
The determination of the relationship between a pair of individuals is a fundamental application of genetics. Previously, we and others have demonstrated that identity-by-descent (IBD) information generated from high-density single-nucleotide polymorphism (SNP) data can greatly improve the power and accuracy of genetic relationship detection. Whole-genome sequencing (WGS) marks the final step in increasing genetic marker density by assaying all single-nucleotide variants (SNVs), and thus has the potential to further improve relationship detection by enabling more accurate detection of IBD segments and more precise resolution of IBD segment boundaries. However, WGS introduces new complexities that must be addressed in order to achieve these improvements in relationship detection. To evaluate these complexities, we estimated genetic relationships from WGS data for 1490 known pairwise relationships among 258 individuals in 30 families along with 46 population samples as controls. We identified several genomic regions with excess pairwise IBD in both the pedigree and control datasets using three established IBD methods: GERMLINE, fastIBD, and ISCA. These spurious IBD segments produced a 10-fold increase in the rate of detected false-positive relationships among controls compared to high-density microarray datasets. To address this issue, we developed a new method to identify and mask genomic regions with excess IBD. This method, implemented in ERSA 2.0, fully resolved the inflated cryptic relationship detection rates while improving relationship estimation accuracy. ERSA 2.0 detected all 1st through 6th degree relationships, and 55% of 9th through 11th degree relationships in the 30 families. We estimate that WGS data provides a 5% to 15% increase in relationship detection power relative to high-density microarray data for distant relationships. 
Our results identify regions of the genome that are highly problematic for IBD mapping and introduce new software to accurately detect 1st through 9th degree relationships from whole-genome sequence data.
The determination of the relationship between a pair of individuals is a fundamental application of genetics. The most accurate methods for relationship estimation rely on precise, localized estimates of genetic sharing between individuals. Earlier methods have generated these estimates from high-density genetic marker data. We performed relationship estimation using whole-genome sequence data for 1490 known pairwise relationships among 258 individuals in 30 families along with 46 population samples as controls. Our results demonstrate that complexities specific to whole-genome sequencing result in regions of the genome that are prone to false-positive estimates of genetic sharing. We provide a map of these spurious IBD regions and introduce new methods, implemented in the software package ERSA 2.0, to control for spurious IBD. We show that ERSA 2.0 provides a 5% to 15% increase in relationship detection power for distant relationships with whole-genome sequence data relative to high-density genetic marker data.
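The idea of identifying and masking regions with excess IBD can be sketched as below. This is a simple thresholding stand-in, not the method implemented in ERSA 2.0: the fixed bin size, the count threshold, the half-open segment coordinates, and the choice to drop (rather than trim) overlapping segments are all assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Segment = Tuple[int, int]  # half-open [start, end) on one chromosome, start < end


def bins_of(segment: Segment, bin_size: int) -> range:
    """Indices of the fixed-width bins that a segment overlaps."""
    start, end = segment
    return range(start // bin_size, (end - 1) // bin_size + 1)


def excess_bins(control_segments: Iterable[Segment], bin_size: int,
                max_count: int) -> List[int]:
    """Bins covered by more than max_count IBD segments among controls."""
    counts: Counter = Counter()
    for seg in control_segments:
        for b in bins_of(seg, bin_size):
            counts[b] += 1
    return sorted(b for b, c in counts.items() if c > max_count)


def mask_segments(segments: Iterable[Segment], bin_size: int,
                  masked: Iterable[int]) -> List[Segment]:
    """Keep only segments that do not overlap any masked bin."""
    masked_set = set(masked)
    return [seg for seg in segments
            if not any(b in masked_set for b in bins_of(seg, bin_size))]


# Hypothetical IBD segments detected among unrelated control pairs.
controls = [(0, 150), (100, 180), (120, 199), (300, 450), (900, 1000)]
masked_bins = excess_bins(controls, bin_size=100, max_count=2)

# Hypothetical segments from a pair under study.
pair_segments = [(50, 120), (500, 800)]
print(masked_bins)
print(mask_segments(pair_segments, bin_size=100, masked=masked_bins))
```

Using unrelated controls to define the mask rests on the expectation that true IBD among them should be rare, so bins with recurrent sharing are more likely to reflect artifacts than ancestry.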
The 331-kbp chlorovirus Paramecium bursaria chlorella virus 1 (PBCV-1) genome was resequenced and annotated to correct errors in the original 15-year-old sequence; 40 codons was considered the minimum protein size of an open reading frame. PBCV-1 has 416 predicted protein-encoding sequences and 11 tRNAs. A proteome analysis was also conducted on highly purified PBCV-1 virions using two mass spectrometry-based protocols. The mass spectrometry-derived data were compared to PBCV-1 and its host Chlorella variabilis NC64A predicted proteomes. Combined, these analyses revealed 148 unique virus-encoded proteins associated with the virion (about 35% of the coding capacity of the virus) and 1 host protein. Some of these proteins appear to be structural/architectural, whereas others have enzymatic, chromatin modification, and signal transduction functions. Most (106) of the proteins have no known function or homologs in the existing gene databases except as orthologs with proteins of other chloroviruses, phycodnaviruses, and nuclear-cytoplasmic large DNA viruses. The genes encoding these proteins are dispersed throughout the virus genome, and most are transcribed late or early-late in the infection cycle, which is consistent with virion morphogenesis.
Summary: With the rapidly expanding availability of data from personal genomes, exomes and transcriptomes, medical researchers will frequently need to test whether observed genomic variants are novel or known. This task requires downloading and handling large and diverse datasets from a variety of sources, and processing them with bioinformatics tools and pipelines. Alternatively, researchers can upload data to online tools, which may conflict with privacy requirements. We present here Kaviar, a tool that greatly simplifies the assessment of novel variants. Kaviar includes: (i) an integrated and growing database of genomic variation from diverse sources, including over 55 million variants from personal genomes, family genomes, transcriptomes, SNV databases and population surveys; and (ii) software for querying the database efficiently.
Availability: Kaviar is programmed in Perl and offered free of charge as Open Source Software. Kaviar may be used online as a programmatic web service or downloaded for local use from http://db.systemsbiology.net/kaviar. The database is also provided.
Supplementary Information: Supplementary data are available at Bioinformatics online.
We analyzed the whole genome sequences of a family of four, consisting of two siblings and their parents. Family-based sequencing allowed us to delineate recombination sites precisely, identify 70% of the sequencing errors, and identify very rare SNVs. We also directly estimated a human intergeneration mutation rate of ∼1.1×10⁻⁸ per position per haploid genome. Both offspring in this family have two recessive disorders: Miller syndrome, for which the gene was concurrently identified, and primary ciliary dyskinesia, for which causative genes have been previously identified. Family-based genome analysis enabled us to narrow the candidate genes for both of these Mendelian disorders to only four. Our results demonstrate the unique value of complete genome sequencing in families.
whole genome sequencing; rare genetic disease; inheritance analysis; recessive models; de novo mutations; recombination hotspot; crossover; haploidentity; haploidentical block; inheritance state; inheritance vector; HMM; haplotype; Miller syndrome; POADS; DHODH; DNAH5; KIAA0556; CES1
TFCat is a catalog of mouse and human transcription factors based on a reliable core collection of annotations obtained by expert review of the scientific literature.
Unravelling regulatory programs governed by transcription factors (TFs) is fundamental to understanding biological systems. TFCat is a catalog of mouse and human TFs based on a reliable core collection of annotations obtained by expert review of the scientific literature. The collection, including proven and homology-based candidate TFs, is annotated within a function-based taxonomy and DNA-binding proteins are organized within a classification system. All data and user-feedback mechanisms are available at the TFCat portal.
Mycobacterium tuberculosis infects one third of the human world population and kills someone every 15 seconds. For more than a century, scientists and clinicians have been distinguishing between the human- and animal-adapted members of the M. tuberculosis complex (MTBC). However, all human-adapted strains of MTBC have traditionally been considered to be essentially identical. We surveyed sequence diversity within a global collection of strains belonging to MTBC using seven megabase pairs of DNA sequence data. We show that the members of MTBC affecting humans are more genetically diverse than generally assumed, and that this diversity can be linked to human demographic and migratory events. We further demonstrate that these organisms are under extremely reduced purifying selection and that, as a result of increased genetic drift, much of this genetic diversity is likely to have functional consequences. Our findings suggest that the current increases in human population, urbanization, and global travel, combined with the population genetic characteristics of M. tuberculosis described here, could contribute to the emergence and spread of drug-resistant tuberculosis.
Tuberculosis remains a worldwide public health emergency. The emergence of drug-resistant forms of tuberculosis in many parts of the world is threatening to make this important human disease incurable. Even though many resources are being invested into the development of new tuberculosis control tools, we still do not know the extent of genetic diversity in tuberculosis bacteria, nor do we understand the evolutionary forces that shape this diversity. To address these questions, we studied a large collection of human tuberculosis strains using DNA sequencing. We found that strains originating in different parts of the world are more genetically diverse than previously recognized. Our results also suggest that much of this diversity has functional consequences and could affect the efficacy of new tuberculosis diagnostics, drugs, and vaccines. Furthermore, we found that the global diversity in tuberculosis strains can be linked to the ancient human migrations out of Africa, as well as to more recent movements that followed the increases of human populations in Europe, India, and China during the past few hundred years. Taken together, our findings suggest that the evolutionary characteristics of tuberculosis bacteria could synergize with the effects of increasing globalization and human travel to enhance the global spread of drug-resistant tuberculosis.
DNA sequence analysis of a global collection of M. tuberculosis strains reveals high functional diversity, severely reduced selective constraint, and global spread through both ancient and recent human migrations.
Macrophages are versatile immune cells that can detect a variety of pathogen-associated molecular patterns through their Toll-like receptors (TLRs). In response to microbial challenge, the TLR-stimulated macrophage undergoes an activation program controlled by a dynamically inducible transcriptional regulatory network. Mapping a complex mammalian transcriptional network poses significant challenges and requires the integration of multiple experimental data types. In this work, we inferred a transcriptional network underlying TLR-stimulated murine macrophage activation. Microarray-based expression profiling and transcription factor binding site motif scanning were used to infer a network of associations between transcription factor genes and clusters of co-expressed target genes. The time-lagged correlation was used to analyze temporal expression data in order to identify potential causal influences in the network. A novel statistical test was developed to assess the significance of the time-lagged correlation. Several associations in the resulting inferred network were validated using targeted ChIP-on-chip experiments. The network incorporates known regulators and gives insight into the transcriptional control of macrophage activation. Our analysis identified a novel regulator (TGIF1) that may have a role in macrophage activation.
Macrophages play a vital role in host defense against infection by recognizing pathogens through pattern recognition receptors, such as the Toll-like receptors (TLRs), and mounting an immune response. Stimulation of TLRs initiates a complex transcriptional program in which induced transcription factor genes dynamically regulate downstream genes. Microarray-based transcriptional profiling has proved useful for mapping such transcriptional programs in simpler model organisms; however, mammalian systems present difficulties such as post-translational regulation of transcription factors, combinatorial gene regulation, and a paucity of available gene-knockout expression data. Additional evidence sources, such as DNA sequence-based identification of transcription factor binding sites, are needed. In this work, we computationally inferred a transcriptional network for TLR-stimulated murine macrophages. Our approach combined sequence scanning with time-course expression data in a probabilistic framework. Expression data were analyzed using the time-lagged correlation. A novel, unbiased method was developed to assess the significance of the time-lagged correlation. The inferred network of associations between transcription factor genes and co-expressed gene clusters was validated with targeted ChIP-on-chip experiments, and yielded insights into the macrophage activation program, including a potential novel regulator. Our general approach could be used to analyze other complex mammalian systems for which time-course expression data are available.
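The time-lagged correlation mentioned above can be sketched as follows. The sketch assumes uniformly spaced time points and non-negative integer lags, and it uses the plain Pearson coefficient; the function names are illustrative, and the novel significance test described in the abstract is not reproduced here.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def time_lagged_correlation(tf: Sequence[float], target: Sequence[float],
                            lag: int) -> float:
    """Correlate tf[t] with target[t + lag] for a lag of >= 0 time steps."""
    if len(tf) != len(target):
        raise ValueError("series must have equal length")
    if lag < 0 or lag > len(tf) - 2:
        raise ValueError("lag must leave at least two overlapping points")
    if lag == 0:
        return pearson(tf, target)
    return pearson(tf[:-lag], target[lag:])


def best_lag(tf: Sequence[float], target: Sequence[float], max_lag: int) -> int:
    """Lag in 0..max_lag with the largest absolute correlation."""
    return max(range(max_lag + 1),
               key=lambda k: abs(time_lagged_correlation(tf, target, k)))


# Hypothetical expression profiles: the target follows the TF two steps later.
tf_profile = [0, 1, 3, 2, 1, 0, 0, 0]
target_profile = [0, 0, 0, 1, 3, 2, 1, 0]

print(best_lag(tf_profile, target_profile, max_lag=3))
print(time_lagged_correlation(tf_profile, target_profile, lag=2))
```

A high correlation at a positive lag is compatible with the transcription factor acting upstream of the target, but on its own it does not establish causation, which is why the abstract pairs it with motif scanning and ChIP-on-chip validation.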
As part of a collaborative project funded by the National Institute of Allergy and Infectious Diseases, we have performed over 150 microarray experiments measuring the response of C57BL/6 mouse bone marrow macrophages to toll-like receptor stimuli. These microarray expression profiles are available freely from our project web site. Here, we report the development of a database of computationally predicted transcription factor binding sites and related genomic features for a set of over 2000 murine immune genes of interest. Our database, which includes microarray co-expression clusters and a host of web-based query, analysis and visualization facilities, is available freely via the internet. It provides a broad resource to the research community, and a stepping stone towards the delineation of the network of transcriptional regulatory interactions underlying the integrated response of macrophages to pathogens.
We constructed a database indexed on genes and annotations of the immediate surrounding genomic regions. To facilitate both gene-specific and systems biology oriented research, our database provides the means to analyze individual genes or an entire genomic locus. Although our focus to date has been on mammalian toll-like receptor signaling pathways, our database structure is not limited to this subject, and is intended to be broadly applicable to immunology. By focusing on selected immune-active genes, we were able to perform computationally intensive expression and sequence analyses that would currently be prohibitive if applied to the entire genome. Using six complementary computational algorithms and methodologies, we identified transcription factor binding sites based on the Position Weight Matrices available in TRANSFAC. For one example transcription factor (ATF3) for which experimental data are available, over 50% of our predicted binding sites coincide with genome-wide chromatin immunoprecipitation (ChIP-chip) results. Our database can be interrogated via a web interface. Genomic annotations and binding site predictions can be automatically viewed with a customized version of the Argo genome browser.
We present the Innate Immune Database (IIDB) as a community resource for immunologists interested in gene regulatory systems underlying innate responses to pathogens. The database website is freely accessible.
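Binding-site prediction from a position weight matrix can be sketched as below. The count matrix is a toy 4-bp motif invented for illustration (it is not a TRANSFAC matrix), the uniform background, pseudocount, and score threshold are assumptions, and only the forward strand is scanned; none of the six algorithms used for the database is reproduced here.

```python
import math
from typing import Dict, List, Tuple

# Toy per-position base counts for a 4-bp motif (hypothetical).
TOY_COUNTS: List[Dict[str, int]] = [
    {"A": 8, "C": 1, "G": 0, "T": 1},
    {"A": 0, "C": 9, "G": 1, "T": 0},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 1, "C": 0, "G": 0, "T": 9},
]


def log_odds_matrix(counts: List[Dict[str, int]], pseudocount: float = 1.0,
                    background: float = 0.25) -> List[Dict[str, float]]:
    """Convert counts to log2 odds against a uniform background."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({base: math.log2(((column[base] + pseudocount) / total)
                                    / background)
                    for base in "ACGT"})
    return pwm


def scan(sequence: str, pwm: List[Dict[str, float]],
         threshold: float) -> List[Tuple[int, str, float]]:
    """Return (offset, site, score) for forward-strand windows at or above threshold."""
    seq = sequence.upper()
    width = len(pwm)
    hits = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing N or other ambiguity codes
        score = sum(pwm[j][base] for j, base in enumerate(window))
        if score >= threshold:
            hits.append((i, window, score))
    return hits


pwm = log_odds_matrix(TOY_COUNTS)
hits = scan("TTACGTGGACGTAA", pwm, threshold=5.0)
for offset, site, score in hits:
    print(offset, site, round(score, 2))
```

Matrix scans of this kind tend to produce many false positives on their own, which is consistent with the database combining several algorithms and comparing predictions against ChIP-chip data.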
In mammals, Toll-like receptors (TLR) recognize ligands, including pathogen-associated molecular patterns (PAMPs), and respond with ligand-specific induction of genes. In this study, we establish evolutionary conservation in teleost fish of key components of the TLR-signaling pathway that act as switches for differential gene induction, including MYD88, TIRAP, TRIF, TRAF6, IRF3, and IRF7. We further explore this conservation with a molecular phylogenetic analysis of MYD88. To the extent that current genomic analysis can establish, each vertebrate has one ortholog to each of these genes. For molecular tree construction and phylogeny inference, we demonstrate a methodology for including genes with only partial primary sequences without disrupting the topology provided by the high-confidence full-length sequences. Conservation of the TLR-signaling molecules suggests that the basic program of gene regulation by the TLR-signaling pathway is conserved across vertebrates. To test this hypothesis, leukocytes from a model fish, rainbow trout (Oncorhynchus mykiss), were stimulated with known mammalian TLR agonists including: diacylated and triacylated forms of lipoprotein, flagellin, two forms of LPS, synthetic double-stranded RNA, and two imidazoquinoline compounds (loxoribine and R848). Trout leukocytes responded in vitro to a number of these agonists with distinct patterns of cytokine expression that correspond to mammalian responses. Our results support the key prediction from our phylogenetic analyses that strong selective pressure of pathogenic microbes has preserved both TLR recognition and signaling functions during vertebrate evolution.
pro-inflammatory cytokine; interferon; MYD88; TIRAP; TRIF; TRAF6; IRF3; phylogeny; molecular tree; PHYLIP
The interaction between pathogens and their multicellular hosts is initiated by activation of pathogen recognition receptors (PRRs). These receptors, which most notably include members of the toll-like receptor (TLR) family, recognize specific pathogen-associated molecular patterns (PAMPs). TLR4 is a central part of the receptor complex that is involved in the activation of the immune system by lipopolysaccharide (LPS) through the specific recognition of its endotoxic moiety (Lipid A). This is a critical event that is essential for the immune response to Gram-negative bacteria as well as the etiology of endotoxic shock. Interestingly, compared to mammals, fish are resistant to endotoxic shock. This in vivo resistance concurs with in vitro studies demonstrating significantly lowered sensitivity of fish leukocytes to LPS activation. Further, our in vitro analyses demonstrate that in trout mononuclear phagocytes, LPS fails to induce antiviral genes, an event that occurs downstream of TLR4 and is required for the development of endotoxic shock. Finally, an in silico approach that includes mining of different piscine genomic and EST databases reveals the presence in fish of all of the major TLR signaling elements except for the molecules specifically involved in TLR4-mediated endotoxin recognition and signaling in mammals. Collectively, our analysis questions the existence of TLR4-mediated cellular responses to LPS in fish. We further speculate that other receptors, in particular beta-2 integrins, may play a primary role in the activation of piscine leukocytes by LPS.
innate immunity; pathogen recognition receptors; pathogen-associated molecular patterns; lipopolysaccharide; toll-like receptors; endotoxicity
The identification and characterization of the complete ensemble of genes is a main goal of deciphering the digital information stored in the human genome. Many algorithms for computational gene prediction have been described, ultimately derived from two basic concepts: (1) modeling gene structure and (2) recognizing sequence similarity. Successful hybrid methods combining these two concepts have also been developed. We present a third orthogonal approach to gene prediction, based on detecting the genomic signatures of transcription, accumulated over evolutionary time. We discuss four algorithms based on this third concept: Greens and CHOWDER, which quantify mutational strand biases caused by transcription-coupled DNA repair, and ROAST and PASTA, which are based on strand-specific selection against polyadenylation signals. We combined these algorithms into an integrated method called FEAST, which we used to predict the location and orientation of thousands of putative transcription units not overlapping known genes. Many of the newly predicted transcriptional units do not appear to code for proteins. The new algorithms are particularly apt at detecting genes with long introns and lacking sequence conservation. They therefore complement existing gene prediction methods and will help identify functional transcripts within many apparent “genomic deserts.”
To date, genes have been identified from genomic sequence using two basic concepts: identifying specific signals that delineate the structure of genes, and detecting similarity to previously known genes. Here the authors describe four novel algorithms based on a third basic concept: the identification and quantification of mutational and selectional effects of transcription. Central to this work is a detailed analysis of interspersed repeats, the “junk DNA” left behind by transposon activity, which is usually discarded when predicting genes even though it amounts to nearly half the human genome. Using the new methodology, the authors identify thousands of potential novel genes, some of which appear not to code for protein products. The new algorithms are particularly apt at detecting genes with long introns and lacking sequence conservation. They therefore complement existing gene prediction methods and will help identify functional transcripts within many “genomic deserts,” regions currently thought to be devoid of genes.
Affymetrix GeneChip Array and Massively Parallel Signature Sequencing (MPSS) are two high throughput methodologies used to profile transcriptomes. Each method has certain strengths and weaknesses; however, no comparison has been made between the data derived from Affymetrix arrays and MPSS. In this study, two lineage-related prostate cancer cell lines, LNCaP and C4-2, were used for transcriptome analysis with the aim of identifying genes associated with prostate cancer progression.
Affymetrix GeneChip array and MPSS analyses were performed. Data were analyzed with GeneSpring 6.2 and in-house Perl scripts. Expression array results were verified with RT-PCR.
Comparison of the data revealed that both technologies detected genes the other did not. In LNCaP, 3,180 genes were only detected by Affymetrix and 1,169 genes were only detected by MPSS. Similarly, in C4-2, 4,121 genes were only detected by Affymetrix and 1,014 genes were only detected by MPSS. Analysis of the combined transcriptomes identified 66 genes unique to LNCaP cells and 33 genes unique to C4-2 cells. Expression analysis of these genes in prostate cancer specimens showed CA1 to be highly expressed in bone metastasis but not expressed in primary tumor and EPHA7 to be expressed in normal prostate and primary tumor but not bone metastasis.
Our data indicate that transcriptome profiling with a single methodology will not fully assess the expression of all genes in a cell line. A combination of transcription profiling technologies such as DNA array and MPSS provides a more robust means to assess the expression profile of an RNA sample. Finally, genes that were differentially expressed in cell lines were also differentially expressed in primary prostate cancer and its metastases.
Large databases of single nucleotide polymorphisms (SNPs) are available for use in genomics studies. Typically, investigators must choose a subset of SNPs from these databases to employ in their studies. The choice of subset is influenced by many factors, including estimated or known reliability of the SNP, biochemical factors, intellectual property, cost, and effectiveness of the subset for mapping genes or identifying disease loci. We present an evolutionary algorithm for multiobjective SNP selection.
We implemented a modified version of the Strength-Pareto Evolutionary Algorithm (SPEA2) in Java. Our implementation, Multiobjective Analyzer for Genetic Marker Acquisition (MAGMA), approximates the set of optimal trade-off solutions for large problems in minutes. This set is very useful for the design of large studies, including those oriented towards disease identification, genetic mapping, population studies, and haplotype-block elucidation.
Evolutionary algorithms are particularly suited for optimization problems that involve multiple objectives and a complex search space to which exact methods such as exhaustive enumeration cannot be applied. They provide flexibility with respect to the problem formulation if a problem description evolves or changes. Results are produced as a trade-off front, allowing the user to make informed decisions when prioritizing factors. MAGMA is available as open source software. Evolutionary algorithms are well suited for many other applications in genomics.
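The notion of a trade-off front can be made concrete with a small sketch of Pareto dominance. This is not SPEA2 or MAGMA, which additionally maintain an archive and use strength-based fitness and density estimation to evolve candidate solutions; the code only filters a fixed list of candidates, and the two objectives shown (cost and fraction of the region left uncovered, both minimized) are hypothetical.

```python
from typing import List, Sequence, Tuple

Objectives = Tuple[float, ...]  # every objective is to be minimized


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a dominates b if it is no worse in all objectives and better in at least one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(points: List[Objectives]) -> List[Objectives]:
    """Return the non-dominated points, preserving input order."""
    return [p for p in points
            if not any(dominates(q, p) for q in points)]


# Hypothetical SNP subsets scored as (cost, fraction of region not covered).
candidates = [(10, 0.30), (12, 0.20), (15, 0.25), (20, 0.10), (11, 0.35)]
front = pareto_front(candidates)
print(front)
```

Each point on the front represents a defensible choice: moving from one to another lowers one objective only by raising the other, which is the decision the investigator is left to make.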
There is tremendous potential for genome sequencing to improve clinical diagnosis and care once it becomes routinely accessible, but this will require formalizing research methods into clinical best practices in the areas of sequence data generation, analysis, interpretation and reporting. The CLARITY Challenge was designed to spur convergence in methods for diagnosing genetic disease starting from clinical case history and genome sequencing data. DNA samples were obtained from three families with heritable genetic disorders and genomic sequence data were donated by sequencing platform vendors. The challenge was to analyze and interpret these data with the goals of identifying disease-causing variants and reporting the findings in a clinically useful format. Participating contestant groups were solicited broadly, and an independent panel of judges evaluated their performance.
A total of 30 international groups were engaged. The entries reveal a general convergence of practices on most elements of the analysis and interpretation process. However, even given this commonality of approach, only two groups identified the consensus candidate variants in all disease cases, demonstrating a need for consistent fine-tuning of the generally accepted methods. There was greater diversity of the final clinical report content and in the patient consenting process, demonstrating that these areas require additional exploration and standardization.
The CLARITY Challenge provides a comprehensive assessment of current practices for using genome sequencing to diagnose and report genetic diseases. There is remarkable convergence in bioinformatic techniques, but medical interpretation and reporting are areas that require further development by many groups.