We describe the first definitive case of a fibrous dysplastic neoplasm in a Neandertal rib (120.71) from the site of Krapina in present-day Croatia. The tumor predates other evidence for these kinds of tumor by well over 100,000 years. Tumors of any sort are a rare occurrence in recent archaeological periods or in living primates, but especially in the human fossil record. Several studies have surveyed bone diseases in past human populations and living primates and fibrous dysplasias occur in a low incidence. Within the class of bone tumors of the rib, fibrous dysplasia is present in living humans at a higher frequency than other bone tumors. The bony features leading to our diagnosis are described in detail. In living humans effects of the neoplasm present a broad spectrum of symptoms, from asymptomatic to debilitating. Given the incomplete nature of this rib and the lack of associated skeletal elements, we resist commenting on the health effects the tumor had on the individual. Yet, the occurrence of this neoplasm shows that at least one Neandertal suffered a common bone tumor found in modern humans.
We present a DNA library preparation method that has allowed us to reconstruct a high coverage (30X) genome sequence of a Denisovan, an extinct relative of Neandertals. The quality of this genome allows a direct estimation of Denisovan heterozygosity indicating that genetic diversity in these archaic hominins was extremely low. It also allows tentative dating of the specimen on the basis of “missing evolution” in its genome, detailed measurements of Denisovan and Neandertal admixture into present-day human populations, and the generation of a near-complete catalog of genetic changes that swept to high frequency in modern humans since their divergence from Denisovans.
Motivation: The conversion of the raw intensities obtained from next-generation sequencing platforms into nucleotide sequences with well-calibrated quality scores is a critical step in the generation of good sequence data. While recent model-based approaches can yield highly accurate calls, they require a substantial amount of processing time and/or computational resources. We previously introduced Ibis, a fast and accurate basecaller for the Illumina platform. We have continued active development of Ibis to take into account developments in the Illumina technology, as well as to make Ibis fully open source.
Results: We introduce here freeIbis, which offers significant improvements in sequence accuracy owing to the use of a novel multiclass support vector machine (SVM) algorithm. Sequence quality scores are now calibrated based on empirically observed scores, thus providing a high correlation to their respective error rates. These improvements result in downstream advantages including improved genotyping accuracy.
Availability and implementation: FreeIbis is freely available for use under the GPL (http://bioinf.eva.mpg.de/freeibis/). It requires a Python interpreter and a C++ compiler. Tailored versions of LIBOCAS and LIBLINEAR are distributed along with the package.
Supplementary data are available at Bioinformatics online.
Two African apes are the closest living relatives of humans: the chimpanzee (Pan troglodytes) and the bonobo (Pan paniscus). Although they are similar in many respects, bonobos and chimpanzees differ strikingly in key social and sexual behaviours1–4, and for some of these traits they show more similarity with humans than with each other. Here we report the sequencing and assembly of the bonobo genome to study its evolutionary relationship with the chimpanzee and human genomes. We find that more than three per cent of the human genome is more closely related to either the bonobo or the chimpanzee genome than these are to each other. These regions allow various aspects of the ancestry of the two ape species to be reconstructed. In addition, many of the regions that overlap genes may eventually help us understand the genetic basis of phenotypes that humans share with one of the two apes to the exclusion of the other.
The detection of genetic segments of Identical by Descent (IBD) in Genome-Wide Association Studies has proven successful in pinpointing genetic relatedness between reportedly unrelated individuals and leveraging such regions to shortlist candidate genes. These techniques depend on high-density genotyping arrays and their effectiveness in diverse sequence data is largely unknown. Due to decreasing costs and increasing effectiveness of high throughput techniques for whole-exome sequencing, an influx of exome sequencing data has become available. Studies using exomes and IBD-detection methods within known pedigrees have shown that IBD can be useful in finding hidden genetic candidates where known relatives are available. We set out to examine the viability of using IBD-detection in whole exome sequencing data in population-wide studies. In doing so, we extend GERMLINE, a method to detect IBD from exome sequencing data by finding small slices of matching alleles between pairs of individuals and extending them into full IBD segments. This algorithm allows for efficient population-wide detection in dense data. We apply this algorithm to a cohort of Crohn's Disease cases where whole-exome and GWAS array data is available. We confirm that GWAS-based detected segments are highly accurate and predictive of underlying shared variation. Where segments inferred from GWAS are expected to be of high accuracy, we compare exome-based detection accuracy of multiple detection strategies. We find detection accuracy to be prohibitively low in all assessments, both in terms of segment sensitivity and specificity. Even after isolating relatively long segments beyond 10cM, exome-based detection continued to offer poor specificity/sensitivity tradeoffs. We hypothesize that the variable coverage and platform biases of exome capture account for this decreased accuracy and look toward whole genome sequencing data as a higher quality source for detecting population-wide IBD.
While the importance of random sequencing errors decreases at higher DNA or RNA sequencing depths, systematic sequencing errors (SSEs) dominate at high sequencing depths and can be difficult to distinguish from biological variants. These SSEs can cause base quality scores to underestimate the probability of error at certain genomic positions, resulting in false positive variant calls, particularly in mixtures such as samples with RNA editing, tumors, circulating tumor cells, bacteria, mitochondrial heteroplasmy, or pooled DNA. Most algorithms proposed for correction of SSEs require a data set used to calculate association of SSEs with various features in the reads and sequence context. This data set is typically either from a part of the data set being “recalibrated” (Genome Analysis ToolKit, or GATK) or from a separate data set with special characteristics (SysCall). Here, we combine the advantages of these approaches by adding synthetic RNA spike-in standards to human RNA, and use GATK to recalibrate base quality scores with reads mapped to the spike-in standards. Compared to conventional GATK recalibration that uses reads mapped to the genome, spike-ins improve the accuracy of Illumina base quality scores by a mean of 5 Phred-scaled quality score units, and by as much as 13 units at CpG sites. In addition, since the spike-in data used for recalibration are independent of the genome being sequenced, our method allows run-specific recalibration even for the many species without a comprehensive and accurate SNP database. We also use GATK with the spike-in standards to demonstrate that the Illumina RNA sequencing runs overestimate quality scores for AC, CC, GC, GG, and TC dinucleotides, while SOLiD has less dinucleotide SSEs but more SSEs for certain cycles. We conclude that using these DNA and RNA spike-in standards with GATK improves base quality score recalibration.
The International Society for Computational Biology, ISCB, organizes the largest event in the field of computational biology and bioinformatics, namely the annual international conference on Intelligent Systems for Molecular Biology, the ISMB. This year at ISMB 2012 in Long Beach, ISCB celebrated the 20th anniversary of its flagship meeting. ISCB is a young, lean and efficient society that aspires to make a significant impact with only limited resources. Many constraints make the choice of venues for ISMB a tough challenge. Here, we describe those challenges and invite the contribution of ideas for solutions.
In addition to genome sequencing, accurate functional annotation of genomes is required in order to carry out comparative and evolutionary analyses between species. Among primates, the human genome is the most extensively annotated. Human miRNA gene annotation is based on multiple lines of evidence including evidence for expression as well as prediction of the characteristic hairpin structure. In contrast, most miRNA genes in non-human primates are annotated based on homology without any expression evidence. We have sequenced small-RNA libraries from chimpanzee, gorilla, orangutan and rhesus macaque from multiple individuals and tissues. Using patterns of miRNA expression in conjunction with a model of miRNA biogenesis we used these high-throughput sequencing data to identify novel miRNAs in non-human primates.
We predicted 47 new miRNAs in chimpanzee, 240 in gorilla, 55 in orangutan and 47 in rhesus macaque. The algorithm we used was able to predict 64% of the previously known miRNAs in chimpanzee, 94% in gorilla, 61% in orangutan and 71% in rhesus macaque. We therefore added evidence for expression in between one and five tissues to miRNAs that were previously annotated based only on homology to human miRNAs. We increased from 60 to 175 the number miRNAs that are located in orthologous regions in humans and the four non-human primate species studied here.
In this study we provide expression evidence for homology-based annotated miRNAs and predict de novo miRNAs in four non-human primate species. We increased the number of annotated miRNA genes and provided evidence for their expression in four non-human primates. Similar approaches using different individuals and tissues would improve annotation in non-human primates and allow for further comparative studies in the future.
MicroRNAs (miRNAs) are small RNA molecules involved in the regulation of mammalian gene expression. Together with other transcription regulators, miRNAs modulate the expression of genes and thereby potentially contribute to tissue and species diversity. To identify miRNAs that are differentially expressed between tissues and/or species, and the genes regulated by these, we have quantified expression of miRNAs and messenger RNAs in five tissues from multiple human, chimpanzee, and rhesus macaque individuals using high-throughput sequencing. The breadth of this tissue and species data allows us to show that downregulation of target genes by miRNAs is more pronounced between tissues than between species and that downregulation is more pronounced for genes with fewer binding sites for expressed miRNAs. Intriguingly, we find that tissue- and species-specific miRNAs target transcription factor genes (TFs) significantly more often than expected. Through their regulatory effect on transcription factors, miRNAs may therefore exert an indirect influence on a larger proportion of genes than previously thought.
microRNA; transcription factor; gene expression; gene regulation; primates
Several previous comparisons of the human genome with other primate and vertebrate genomes identified genomic regions that are highly conserved in vertebrate evolution but fast-evolving on the human lineage. These human accelerated regions (HARs) may be regions of past adaptive evolution in humans. Alternatively, they may be the result of non-adaptive processes, such as biased gene conversion. We captured and sequenced DNA from a collection of previously published HARs using DNA from an Iberian Neandertal. Combining these new data with shotgun sequence from the Neandertal and Denisova draft genomes, we determine at least one archaic hominin allele for 84% of all positions within HARs. We find that 8% of HAR substitutions are not observed in the archaic hominins and are thus recent in the sense that the derived allele had not come to fixation in the common ancestor of modern humans and archaic hominins. Further, we find that recent substitutions in HARs tend to have come to fixation faster than substitutions elsewhere in the genome and that substitutions in HARs tend to cluster in time, consistent with an episodic rather than a clock-like process underlying HAR evolution. Our catalog of sequence changes in HARs will help prioritize them for functional studies of genomic elements potentially responsible for modern human adaptations.
As biomedical investigators strive to integrate data and analyses across spatiotemporal scales and biomedical domains, they have recognized the benefits of formalizing languages and terminologies via computational ontologies. Although ontologies for biological entities—molecules, cells, organs—are well-established, there are no principled ontologies of physical properties—energies, volumes, flow rates—of those entities. In this paper, we introduce the Ontology of Physics for Biology (OPB), a reference ontology of classical physics designed for annotating biophysical content of growing repositories of biomedical datasets and analytical models. The OPB's semantic framework, traceable to James Clerk Maxwell, encompasses modern theories of system dynamics and thermodynamics, and is implemented as a computational ontology that references available upper ontologies. In this paper we focus on the OPB classes that are designed for annotating physical properties encoded in biomedical datasets and computational models, and we discuss how the OPB framework will facilitate biomedical knowledge integration.
The 2011 International Conference on Bioinformatics (InCoB) conference, which is the annual scientific conference of the Asia-Pacific Bioinformatics Network (APBioNet), is hosted by Kuala Lumpur, Malaysia, is co-organized with the first ISCB-Asia conference of the International Society for Computational Biology (ISCB). InCoB and the sequencing of the human genome are both celebrating their tenth anniversaries and InCoB’s goalposts for the next decade, implementing standards in bioinformatics and globally distributed computational networks, will be discussed and adopted at this conference. Of the 49 manuscripts (selected from 104 submissions) accepted to BMC Genomics and BMC Bioinformatics conference supplements, 24 are featured in this issue, covering software tools, genome/proteome analysis, systems biology (networks, pathways, bioimaging) and drug discovery and design.
In 2009 the International Society for Computational Biology (ISCB) started to roll out regional bioinformatics conferences in Africa, Latin America and Asia. The open and competitive bid for the first meeting in Asia (ISCB-Asia) was awarded to Asia-Pacific Bioinformatics Network (APBioNet) which has been running the International Conference on Bioinformatics (InCoB) in the Asia-Pacific region since 2002. InCoB/ISCB-Asia 2011 is held from November 30 to December 2, 2011 in Kuala Lumpur, Malaysia. Of 104 manuscripts submitted to BMC Genomics and BMC Bioinformatics conference supplements, 49 (47.1%) were accepted. The strong showing of Asia among submissions (82.7%) and acceptances (81.6%) signals the success of this tenth InCoB anniversary meeting, and bodes well for the future of ISCB-Asia.
More than 10 000 proteins were identified by high-resolution mass spectrometry in a human cancer cell line. The data cover most of the functional proteome as judged by RNA-seq data and it reveals the expression range of different protein classes.
While the number and identity of proteins expressed in a single human cell type is currently unknown, this fundamental question can be addressed by advanced mass spectrometry (MS)-based proteomics. Online liquid chromatography coupled to high-resolution MS and MS/MS yielded 166 420 peptides with unique amino-acid sequence from HeLa cells. These peptides identified 10 255 different human proteins encoded by 9207 human genes, providing a lower limit on the proteome in this cancer cell line. Deep transcriptome sequencing revealed transcripts for nearly all detected proteins. We calculate copy numbers for the expressed proteins and show that the abundances of >90% of them are within a factor 60 of the median protein expression level. Comparisons of the proteome and the transcriptome, and analysis of protein complex databases and GO categories, suggest that we achieved deep coverage of the functional transcriptome and the proteome of a single cell type.
mass spectrometry; proteomics; RNA-Seq; systems biology; transcriptomics
The OBML 2010 workshop, held at the University of Mannheim on September 9-10, 2010, is the 2nd in a series of meetings organized by the Working Group “Ontologies in Biomedicine and Life Sciences” of the German Society of Computer Science (GI) and the German Society of Medical Informatics, Biometry and Epidemiology (GMDS). Integrating, processing and applying the rapidly expanding information generated in the life sciences — from public health to clinical care and molecular biology — is one of the most challenging problems that research in these fields is facing today. As the amounts of experimental data, clinical information and scientific knowledge increase, there is a growing need to promote interoperability of these resources, support formal analyses, and to pre-process knowledge for further use in problem solving and hypothesis formulation.
The OBML workshop series pursues the aim of gathering scientists who research topics related to life science ontologies, to exchange ideas, discuss new results and establish relationships. The OBML group promotes the collaboration between ontologists, computer scientists, bio-informaticians and applied logicians, as well as the cooperation with physicians, biologists, biochemists and biometricians, and supports the establishment of this new discipline in research and teaching. Research topics of OBML 2010 included medical informatics, Semantic Web applications, formal ontology, bio-ontologies, knowledge representation as well as the wide range of applications of biomedical ontologies to science and medicine. A total of 14 papers were presented, and from these we selected four manuscripts for inclusion in this special issue.
An interdisciplinary audience from all areas related to biomedical ontologies attended OBML 2010. In the future, OBML will continue as an annual meeting that aims to bridge the gap between theory and application of ontologies in the life sciences. The next event emphasizes the special topic of the ontology of phenotypes, in Berlin, Germany on October 6-7, 2011.
Advances in DNA sequencing technologies have made it possible to generate large amounts of sequence data very rapidly and at substantially lower cost than capillary sequencing. These new technologies have specific characteristics and limitations that require either consideration during project design, or which must be addressed during data analysis. Specialist skills, both at the laboratory and the computational stages of project design and analysis, are crucial to the generation of high quality data from these new platforms. The Illumina sequencers (including the Genome Analyzers I/II/IIe/IIx and the new HiScan and HiSeq) represent a widely used platform providing parallel readout of several hundred million immobilized sequences using fluorescent-dye reversible-terminator chemistry. Sequencing library quality, sample handling, instrument settings and sequencing chemistry have a strong impact on sequencing run quality. The presence of adapter chimeras and adapter sequences at the end of short-insert molecules, as well as increased error rates and short read lengths complicate many computational analyses. We discuss here some of the factors that influence the frequency and severity of these problems and provide solutions for circumventing these. Further, we present a set of general principles for good analysis practice that enable problems with sequencing runs to be identified and dealt with.
Recently, next-generation sequencing-based technologies have enabled DNA methylation profiling at high resolution and low cost. Methyl-Seq and Reduced Representation Bisulfite Sequencing (RRBS) are two such technologies that interrogate methylation levels at CpG sites throughout the entire human genome. With rapid reduction of sequencing costs, these technologies will enable epigenotyping of large cohorts for phenotypic association studies. Existing quantification methods for sequencing-based methylation profiling are simplistic and do not deal with the noise due to the random sampling nature of sequencing and various experimental artifacts. Therefore, there is a need to investigate the statistical issues related to the quantification of methylation levels for these emerging technologies, with the goal of developing an accurate quantification method.
In this paper, we propose two methods for Methyl-Seq quantification. The first method, the Maximum Likelihood estimate, is both conceptually intuitive and computationally simple. However, this estimate is biased at extreme methylation levels and does not provide variance estimation. The second method, based on Bayesian hierarchical model, allows variance estimation of methylation levels, and provides a flexible framework to adjust technical bias in the sequencing process.
We compare the previously proposed binary method, the Maximum Likelihood (ML) method, and the Bayesian method. In both simulation and real data analysis of Methyl-Seq data, the Bayesian method offers the most accurate quantification. The ML method is slightly less accurate than the Bayesian method. But both our proposed methods outperform the original binary method in Methyl-Seq. In addition, we applied these quantification methods to simulation data and show that, with sequencing depth above 40–300 (which varies with different tissue samples) per cleavage site, Methyl-Seq offers a comparable quantification consistency as microarrays.
Herbivores rely on digestive tract lignocellulolytic microorganisms, including bacteria, fungi and protozoa, to derive energy and carbon from plant cell wall polysaccharides. Culture independent metagenomic studies have been used to reveal the genetic content of the bacterial species within gut microbiomes. However, the nature of the genes encoded by eukaryotic protozoa and fungi within these environments has not been explored using metagenomic or metatranscriptomic approaches.
In this study, a metatranscriptomic approach was used to investigate the functional diversity of the eukaryotic microorganisms within the rumen of muskoxen (Ovibos moschatus), with a focus on plant cell wall degrading enzymes. Polyadenylated RNA (mRNA) was sequenced on the Illumina Genome Analyzer II system and 2.8 gigabases of sequences were obtained and 59129 contigs assembled. Plant cell wall degrading enzyme modules including glycoside hydrolases, carbohydrate esterases and polysaccharide lyases were identified from over 2500 contigs. These included a number of glycoside hydrolase family 6 (GH6), GH48 and swollenin modules, which have rarely been described in previous gut metagenomic studies.
The muskoxen rumen metatranscriptome demonstrates a much higher percentage of cellulase enzyme discovery and an 8.7x higher rate of total carbohydrate active enzyme discovery per gigabase of sequence than previous rumen metagenomes. This study provides a snapshot of eukaryotic gene expression in the muskoxen rumen, and identifies a number of candidate genes coding for potentially valuable lignocellulolytic enzymes.
Expression profiling of restricted neural populations using microarrays can facilitate neuronal classification and provide insight into the molecular bases of cellular phenotypes. Due to the formidable heterogeneity of intermixed cell types that make up the brain, isolating cell types prior to microarray processing poses steep technical challenges that have been met in various ways. These methodological differences have the potential to distort cell-type-specific gene expression profiles insofar as they may insufficiently filter out contaminating mRNAs or induce aberrant cellular responses not normally present in vivo. Thus we have compared the repeatability, susceptibility to contamination from off-target cell-types, and evidence for stress-responsive gene expression of five different purification methods - Laser Capture Microdissection (LCM), Translating Ribosome Affinity Purification (TRAP), Immunopanning (PAN), Fluorescence Activated Cell Sorting (FACS), and manual sorting of fluorescently labeled cells (Manual). We found that all methods obtained comparably high levels of repeatability, however, data from LCM and TRAP showed significantly higher levels of contamination than the other methods. While PAN samples showed higher activation of apoptosis-related, stress-related and immediate early genes, samples from FACS and Manual studies, which also require dissociated cells, did not. Given that TRAP targets actively translated mRNAs, whereas other methods target all transcribed mRNAs, observed differences may also reflect translational regulation.
The present article proposes the adoption of a community-defined, uniform, generic description of the core attributes of biological databases, BioDBCore. The goals of these attributes are to provide a general overview of the database landscape, to encourage consistency and interoperability between resources; and to promote the use of semantic and syntactic standards. BioDBCore will make it easier for users to evaluate the scope and relevance of available resources. This new resource will increase the collective impact of the information present in biological databases.
The present article proposes the adoption of a community-defined, uniform, generic description of the core attributes of biological databases, BioDBCore. The goals of these attributes are to provide a general overview of the database landscape, to encourage consistency and interoperability between resources and to promote the use of semantic and syntactic standards. BioDBCore will make it easier for users to evaluate the scope and relevance of available resources. This new resource will increase the collective impact of the information present in biological databases.
Most biomedical ontologies are represented in the OBO Flatfile Format, which is an easy-to-use graph-based ontology language. The semantics of the OBO Flatfile Format 1.2 enforces a strict predetermined interpretation of relationship statements between classes. It does not allow flexible specifications that provide better approximations of the intuitive understanding of the considered relations. If relations cannot be accurately expressed then ontologies built upon them may contain false assertions and hence lead to false inferences. Ontologies in the OBO Foundry must formalize the semantics of relations according to the OBO Relationship Ontology (RO). Therefore, being able to accurately express the intended meaning of relations is of crucial importance. Since the Web Ontology Language (OWL) is an expressive language with a formal semantics, it is suitable to de ne the meaning of relations accurately.
We developed a method to provide definition patterns for relations between classes using OWL and describe a novel implementation of the RO based on this method. We implemented our extension in software that converts ontologies in the OBO Flatfile Format to OWL, and also provide a prototype to extract relational patterns from OWL ontologies using automated reasoning. The conversion software is freely available at http://bioonto.de/obo2owl, and can be accessed via a web interface.
Explicitly defining relations permits their use in reasoning software and leads to a more flexible and powerful way of representing biomedical ontologies. Using the extended langua0067e and semantics avoids several mistakes commonly made in formalizing biomedical ontologies, and can be used to automatically detect inconsistencies. The use of our method enables the use of graph-based ontologies in OWL, and makes complex OWL ontologies accessible in a graph-based form. Thereby, our method provides the means to gradually move the representation of biomedical ontologies into formal knowledge representation languages that incorporates an explicit semantics. Our method facilitates the use of OWL-based software in the back-end while ontology curators may continue to develop ontologies with an OBO-style front-end.
Genome annotation projects, gene functional studies, and phylogenetic analyses for a given organism all greatly benefit from access to a validated full-length cDNA resource. While increasingly common in model species, full-length cDNA resources in aquaculture species are scarce.
Methodology and Principal Findings
Through in silico analysis of catfish (Ictalurus spp.) ESTs, a total of 10,037 channel catfish and 7,382 blue catfish cDNA clones were identified as potentially encoding full-length cDNAs. Of this set, a total of 1,169 channel catfish and 933 blue catfish full-length cDNA clones were selected for re-sequencing to provide additional coverage and ensure sequence accuracy. A total of 1,745 unique gene transcripts were identified from the full-length cDNA set, including 1,064 gene transcripts from channel catfish and 681gene transcripts from blue catfish, with 416 transcripts shared between the two closely related species. Full-length sequence characteristics (ortholog conservation, UTR length, Kozak sequence, and conserved motifs) of the channel and blue catfish were examined in detail. Comparison of gene ontology composition between full-length cDNAs and all catfish ESTs revealed that the full-length cDNA set is representative of the gene diversity encoded in the catfish transcriptome.
This study describes the first catfish full-length cDNA set constructed from several cDNA libraries. The catfish full-length cDNA sequences, and data gleaned from sequence characteristics analysis, will be a valuable resource for ongoing catfish whole-genome sequencing and future gene-based studies of function and evolution in teleost fishes.
Biological data, and particularly annotation data, are increasingly being represented in directed acyclic graphs (DAGs). However, while relevant biological information is implicit in the links between multiple domains, annotations from these different domains are usually represented in distinct, unconnected DAGs, making links between the domains represented difficult to determine. We develop a novel family of general statistical tests for the discovery of strong associations between two directed acyclic graphs. Our method takes the topology of the input graphs and the specificity and relevance of associations between nodes into consideration. We apply our method to the extraction of associations between biomedical ontologies in an extensive use-case. Through a manual and an automatic evaluation, we show that our tests discover biologically relevant relations. The suite of statistical tests we develop for this purpose is implemented and freely available for download.