Methods for estimating genome-wide breeding values (GWEBV) can be classified according to the prior distribution assumed for marker effects. Genome-wide BLUP methods assume a normal prior with constant variance for all markers, and are computationally fast. Bayesian methods apply more flexible prior distributions of SNP effects, allowing a few very large SNP effects while most are small or even zero, but these methods are often computationally demanding because they rely on Markov chain Monte Carlo sampling. In this study, we adopted the Pareto principle to weight the available marker loci, i.e., we assume that x% of the loci explain (100 - x)% of the total genetic variance. Under this assumption, it is also possible to define the variances of the prior distributions of the 'big' and 'small' SNPs: the relatively few large SNPs explain a large proportion of the genetic variance, while the majority of SNPs have small effects and explain a minor proportion of it. We name this method MixP; its prior distribution is a mixture of two normal distributions, one with a big variance and one with a small variance. Simulation results, using a real Norwegian Red cattle pedigree, show that MixP is at least as accurate as the other methods in all studied cases. The method also reduces the number of hyper-parameters of the prior distribution from two (the proportion and the variance of SNPs with big effects) to one (the proportion of SNPs with big effects), assuming the overall genetic variance is known. The mixture-of-normals prior makes it possible to solve the equations iteratively, reducing the computational load by two orders of magnitude. With marker densities reaching the millions and whole-genome sequence data becoming available, MixP provides a computationally feasible Bayesian method of analysis.
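The Pareto weighting described above can be sketched numerically. In this hedged illustration (the parameterization is our assumption, not necessarily the authors' exact formulation), a proportion p of 'big' SNPs explains (1 - p) of the total genetic variance, which fixes both prior variances once the overall genetic variance is known:

```python
import numpy as np

def mixp_prior_variances(p_big, total_var, n_markers):
    """Variances of the two-component normal prior under the Pareto
    principle: a proportion p_big of markers explains (1 - p_big) of
    the total genetic variance (illustrative sketch only)."""
    var_big = (1.0 - p_big) * total_var / (p_big * n_markers)
    var_small = p_big * total_var / ((1.0 - p_big) * n_markers)
    return var_big, var_small

def sample_marker_effects(p_big, total_var, n_markers, rng):
    """Draw SNP effects from the two-component mixture prior."""
    var_big, var_small = mixp_prior_variances(p_big, total_var, n_markers)
    is_big = rng.random(n_markers) < p_big
    sd = np.where(is_big, np.sqrt(var_big), np.sqrt(var_small))
    return rng.normal(0.0, sd)

rng = np.random.default_rng(42)
# 20% of 100,000 markers carry 80% of a total genetic variance of 1.0
effects = sample_marker_effects(0.2, 1.0, 100_000, rng)
```

By construction, the expected contributions of the two components sum to the overall genetic variance, which is why only the proportion of big-effect SNPs remains as a hyper-parameter.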
Big defensin is an antimicrobial peptide composed of a highly hydrophobic N-terminal region and a cationic C-terminal region containing six cysteine residues involved in three internal disulfide bridges. While big defensin sequences have been reported in various mollusk species, few studies have addressed their sequence diversity, gene organization, or expression in response to microbial infections.
Using the high-throughput Digital Gene Expression approach, we identified in Crassostrea gigas oysters several sequences coding for big defensins induced in response to a Vibrio infection. We showed that the oyster big defensin family is composed of three members (named Cg-BigDef1, Cg-BigDef2 and Cg-BigDef3) that are encoded by distinct genomic sequences. All Cg-BigDefs contain a hydrophobic N-terminal domain and a cationic C-terminal domain that resembles vertebrate β-defensins. Both domains are encoded by separate exons. We found that big defensins form a group predominantly present in mollusks and closer to vertebrate defensins than to invertebrate and fungal CSαβ-containing defensins. Moreover, we showed that Cg-BigDefs are expressed in oyster hemocytes only and follow different patterns of gene expression. While Cg-BigDef3 is not regulated, both Cg-BigDef1 and Cg-BigDef2 transcripts are strongly induced in response to bacterial challenge. Induction was dependent on pathogen-associated molecular patterns but not on damage signals. The inducibility of Cg-BigDef1 was confirmed by HPLC and mass spectrometry, since ions with a molecular mass compatible with mature Cg-BigDef1 (10.7 kDa) were present in immune-challenged oysters only. From our biochemical data, native Cg-BigDef1 would result from the elimination of a prepropeptide sequence and the cyclization of the resulting N-terminal glutamine residue into a pyroglutamic acid.
We provide here the first report showing that big defensins form a family of antimicrobial peptides diverse not only in terms of sequences but also in terms of genomic organization and regulation of gene expression.
The emergence of massive datasets in clinical settings presents both challenges and opportunities in data storage and analysis. This so-called “big data” challenges traditional analytic tools and will increasingly require novel solutions adapted from other fields. Advances in information and communication technology present the most viable solutions to big data analysis in terms of efficiency and scalability. It is vital that big data solutions be multithreaded and that data access approaches be precisely tailored to large volumes of semi-structured and unstructured data.
The MapReduce programming framework uses two tasks common in functional programming: Map and Reduce. MapReduce is a new parallel processing framework, and Hadoop is its open-source implementation on a single computing node or on clusters. Compared with existing parallel processing paradigms (e.g., grid computing and graphics processing unit (GPU) computing), MapReduce and Hadoop have two advantages: 1) fault-tolerant storage resulting in reliable data processing, achieved by replicating computing tasks and cloning data chunks on different computing nodes across the computing cluster; 2) high-throughput data processing via a batch processing framework and the Hadoop Distributed File System (HDFS). Data are stored in the HDFS and made available to the slave nodes for computation.
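The Map and Reduce phases described above can be illustrated with a minimal single-machine sketch. This is plain Python, not Hadoop; the word-count task and the record contents are our own toy example of the programming model:

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit (key, value) pairs -- here, (word, 1) for each word."""
    for record in records:
        for word in record.split():
            yield word, 1

def shuffle(pairs):
    """Group values by key, as the framework does between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the list of values for each key."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["patient record text", "patient note text"]
counts = reduce_phase(shuffle(map_phase(lines)))
# counts == {"patient": 2, "record": 1, "text": 2, "note": 1}
```

In Hadoop, the mapper and reducer run in parallel on many nodes, with the shuffle handled by the framework over HDFS; the programmer supplies only the two functions.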
In this paper, we review the existing applications of the MapReduce programming framework and its implementation platform Hadoop in clinical big data and related medical health informatics fields. The usage of MapReduce and Hadoop on a distributed system represents a significant advance in clinical big data processing and utilization, and opens up new opportunities in the emerging era of big data analytics. The objective of this paper is to summarize the state-of-the-art efforts in clinical big data analytics and to highlight what might be needed to enhance the outcomes of clinical big data analytics tools. The paper concludes by summarizing the potential usage of the MapReduce programming framework and the Hadoop platform to process huge volumes of clinical data in medical health informatics and related fields.
MapReduce; Hadoop; Big data; Clinical big data analysis; Clinical data analysis; Bioinformatics; Distributed programming
The tRNA gene database curated by experts, “tRNADB-CE” (http://trna.ie.niigata-u.ac.jp), was constructed by analyzing 1,966 complete and 5,272 draft genomes of prokaryotes, 171 viral, 121 chloroplast, and 12 eukaryotic genomes, plus fragment sequences obtained in metagenome studies of environmental samples. In total, 595,115 tRNA genes (twice the number compiled previously) have been registered, and their sequences, clover-leaf structures, and results of sequence-similarity and oligonucleotide-pattern searches can be browsed. To gather collective knowledge with help from experts in tRNA research, we added a column for registering comments on each tRNA. By grouping bacterial tRNAs with an identical sequence, we found high phylogenetic conservation of tRNA sequences, especially at the phylum level. Since many species-unknown tRNAs from metagenomic sequences are identical to those found in species-known prokaryotes, the identical sequence group (ISG) can provide phylogenetic markers for investigating the microbial community in an environmental ecosystem. This strategy can be applied to the huge numbers of short sequences obtained from next-generation sequencers, showing that tRNADB-CE is a well-timed database in the era of big sequence data. We also discuss how a batch-learning self-organizing map (BLSOM) of oligonucleotide composition is useful for efficient knowledge discovery from big sequence data.
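The identical sequence group (ISG) idea can be sketched as follows. The sequences and species names below are invented placeholders, and the real database pipeline is far more elaborate; the sketch only shows how exact-sequence grouping turns species-known tRNAs into candidate phylogenetic markers for species-unknown metagenomic reads:

```python
from collections import defaultdict

def build_isgs(known_trnas):
    """Group tRNA sequences that are literally identical; each group
    maps a sequence to the set of species known to carry it."""
    isgs = defaultdict(set)
    for species, seq in known_trnas:
        isgs[seq].add(species)
    return isgs

def assign(metagenomic_seq, isgs):
    """A species-unknown sequence inherits the candidate taxa of the
    ISG it matches exactly, if any."""
    return isgs.get(metagenomic_seq, set())

# Toy records: (species, tRNA gene sequence) -- placeholders, not real tRNAs
known = [
    ("Escherichia coli", "GGGCTA"),
    ("Salmonella enterica", "GGGCTA"),  # identical sequence, same phylum
    ("Bacillus subtilis", "GCCTAA"),
]
isgs = build_isgs(known)
hit = assign("GGGCTA", isgs)  # {"Escherichia coli", "Salmonella enterica"}
```

Because tRNA sequences are highly conserved within phyla, an exact match typically narrows a metagenomic read down to a coherent taxonomic group rather than a single species.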
tRNA; database; metagenome; phylogenetic marker; BLSOM; big data
An era can be defined as a period in time identified by distinctive character, events, or practices. We are now in the genomic era. The pre-genomic era: There was a pre-genomic era. It started many years ago with novel and seminal animal experiments, primarily directed at studying cancer. It is marked by the development of the two-year rodent cancer bioassay and the ultimate realization that alternative approaches and short-term animal models were needed to replace this resource-intensive and time-consuming method for predicting human health risk. Many alternative approaches and short-term animal models were proposed and tried but, to date, none has completely replaced our dependence upon the two-year rodent bioassay. However, the alternative approaches and models themselves have made tangible contributions to basic research, clinical medicine and our understanding of cancer, and they remain useful tools to address hypothesis-driven research questions. The pre-genomic era was a time when toxicologic pathologists played a major role in drug development, evaluating the cancer bioassay and the associated dose-setting toxicity studies, and exploring the utility of proposed alternative animal models. It was a time when there was a shortage of qualified toxicologic pathologists. The genomic era: We are in the genomic era. It is a time when the genetic underpinnings of normal biological and pathologic processes are being discovered and documented. It is a time for sequencing entire genomes and deliberately silencing relevant segments of the mouse genome to see what each segment controls and whether that silencing leads to increased susceptibility to disease. What remains to be charted in this genomic era is the complex interaction of genes, gene segments, post-translational modifications of encoded proteins, and environmental factors that affect genomic expression.
In this current genomic era, the toxicologic pathologist has had to make room for a growing population of molecular biologists. In this present era, newly emerging DVM and MD scientists enter the work arena with a PhD in pathology, often based on some aspect of molecular biology or molecular pathology research. In molecular biology, the almost daily technological advances require one’s complete dedication to remain at the cutting edge of the science. Similarly, the practice of toxicologic pathology, like other morphological disciplines, is based largely on experience and requires dedicated daily examination of pathology material to maintain a well-trained eye capable of distilling specific information from stained tissue slides - a dedicated effort that cannot be done well as an intermezzo between other tasks. It is a rare individual who has true expertise in both molecular biology and pathology. In this genomic era, the newly emerging DVM-PhD or MD-PhD pathologist enters a marketplace without many job opportunities, in contrast to the pre-genomic era. Many face an identity crisis, needing to decide whether to become a competent pathologist or, alternatively, a competent molecular biologist. At the same time, more PhD molecular biologists without training in pathology are members of the research teams working in drug development and toxicology. How best can the toxicologic pathologist interact in the contemporary team approach to drug development, toxicology research and safety testing? Based on their biomedical training, toxicologic pathologists are in an ideal position to link data from the emerging technologies with their knowledge of pathobiology and toxicology. To enable this linkage and obtain the synergy it provides, the bench-level, slide-reading expert pathologist will need to have some basic understanding and appreciation of molecular biology methods and tools.
On the other hand, it is not likely that the typical molecular biologist could competently evaluate and diagnose stained tissue slides from a toxicology study or a cancer bioassay. The post-genomic era: The post-genomic era will likely arrive around 2050, at which time entire genomes from multiple species will exist in massive databases, data from thousands of robotic high-throughput chemical screenings will exist in other databases, and genetic toxicity and chemical structure-activity relationships will reside in yet other databases. All databases will be linked, and relevant information will be extracted and analyzed by appropriate algorithms following input of the latest molecular, submolecular, genetic, experimental, pathology and clinical data. The knowledge gained will permit the genetic components of many diseases to become amenable to therapeutic prevention and/or intervention. Much as computerized algorithms are currently used to forecast weather or to predict political elections, sophisticated computerized algorithms based largely on scientific data mining will categorize new drugs and chemicals relative to their health benefits versus their health risks for defined human populations and subpopulations. However, this form of virtual toxicity study or cancer bioassay will only identify probabilities of adverse consequences from the interaction of particular environmental and/or chemical/drug exposure(s) with specific genomic variables. Proof in many situations will require confirmation in intact in vivo mammalian animal models. The toxicologic pathologist in the post-genomic era will be the scientist best suited to confirm the data mining and its probability predictions for safety or adverse consequences against the actual tissue morphological features in test species that define specific test agent pathobiology and human health risk.
genomic era; history of toxicologic pathology; molecular biology
The Human Genome Project (HGP) is regarded by many as one of the major scientific achievements in recent science history, a large-scale endeavour that is changing the way in which biomedical research is done and that is expected, moreover, to yield considerable benefit for society. Thus, since the completion of the human genome sequencing effort, a debate has emerged over whether this effort merits a Nobel Prize and, if so, who should receive it, as (according to current procedures) no more than three individuals can be selected. In this article, the HGP is taken as a case study to consider the ethical question of to what extent it is still possible, in an era of big science, of large-scale consortia and global team work, to acknowledge and reward individual contributions to important breakthroughs in biomedical fields. Is it still viable to single out individuals for their decisive contributions in order to reward them in a fair and convincing way? Whereas the concept of the Nobel Prize as such seems to reflect an archetypical view of scientists as solitary researchers who, at a certain point in their careers, make their one decisive discovery, this vision has proven problematic from the very outset. Already during the first decade of the Nobel era, Ivan Pavlov was denied the Prize several times before finally receiving it, on the basis of the argument that he had been active as a research manager (a designer and supervisor of research projects) rather than as a researcher himself. The question, then, is whether in the case of the HGP, a research effort that involved the contributions of hundreds or even thousands of researchers worldwide, it is still possible to “individualise” the Prize. The “HGP Nobel Prize problem” is regarded as an exemplary issue in current research ethics, highlighting a number of quandaries and trends involved in contemporary life science research practices more broadly.
Human Genome Project; Nobel Prize; Research ethics; Fairness of reward mechanism in biomedical research
Cancer rates are set to increase at an alarming rate, from 10 million new cases globally in 2000 to 15 million in 2020. Regarding the pharmacological treatment of cancer, we are currently at the interface of two treatment eras: so-called pre-genomic therapy, comprising the traditional, mainly cytotoxic, cancer drugs, and post-genomic era drugs, which are rationally designed against defined molecular targets. Although there are successful examples of this newer drug discovery approach, most target-specific agents provide only small gains in symptom control and/or survival, whereas others have consistently failed in clinical testing. There is, however, a characteristic shared by these agents: their high cost. This is expected, as drug discovery and development are generally carried out within the commercial rather than the academic realm. Given the extraordinarily high costs and risks associated with therapeutic drug discovery, it is highly unlikely that any single public-sector research group will see a novel chemical "probe" become a "drug". An alternative drug development strategy is the exploitation of established drugs that have already been approved for the treatment of non-cancerous diseases and whose cancer target has already been discovered. This strategy is also termed drug repositioning, drug repurposing, or indication switching. Although development of these drugs was traditionally unlikely to be pursued by Big Pharma due to their limited commercial value, biopharmaceutical companies attempting to increase productivity are now pursuing drug repositioning. More and more companies are scanning the existing pharmacopoeia for repositioning candidates, and the number of repositioning success stories is increasing.
Here we provide noteworthy examples of known drugs whose potential anticancer activities have been highlighted, to encourage further research on them as a means to foster their translation into clinical trials utilizing the more limited public-sector resources. If these drugs eventually prove effective, they could be much more affordable for patients with cancer; their contribution to reducing cancer mortality at the global level would therefore be greater.
Advances in high-throughput sequencing technologies have brought us into the individual genome era. Projects such as the 1000 Genomes Project have made individual genome sequencing more and more popular. How to visualize, analyse and annotate individual genomes with knowledge bases to support genome studies and personalized healthcare remains a big challenge. The Personal Genome Browser (PGB) was developed to provide comprehensive functional annotation and visualization for individual genomes based on the genetic–molecular–phenotypic model. Investigators can easily view individual genetic variants, such as single nucleotide variants (SNVs), INDELs and structural variations (SVs), as well as the genomic features and phenotypes associated with the individual genetic variants. The PGB especially highlights potential functional variants using the PGB built-in method or SIFT/PolyPhen2 scores. Moreover, the functional risks of genes can be evaluated by scanning individual genetic variants across the whole genome, a chromosome, or a cytoband, based on the functional implications of the variants. Investigators can then navigate to high-risk genes on the scanned individual genome. The PGB accepts Variant Call Format (VCF) and Genetic Variation Format (GVF) files as input. The functional annotation of input individual genome variants can be visualized in real time by well-defined symbols and shapes. The PGB is available at http://www.pgbrowser.org/.
Molecular systematics occupies one of the central stages in biology in the genomic era, ushered in by unprecedented progress in DNA technology. The inference of organismal phylogeny is now based on many independent genetic loci, a widely accepted approach to assembling the tree of life. Surprisingly, this approach is hindered by the lack of appropriate nuclear gene markers for many taxonomic groups, especially at high taxonomic levels, partially due to the lack of tools for efficiently developing new phylogenetic markers. We report here a genome-comparison strategy for identifying nuclear gene markers for phylogenetic inference and apply it to the ray-finned fishes, the largest vertebrate clade in need of phylogenetic resolution.
A total of 154 candidate molecular markers – relatively well conserved, putatively single-copy gene fragments with long, uninterrupted exons – were obtained by comparing whole genome sequences of two model organisms, Danio rerio and Takifugu rubripes. Experimental tests of 15 of these (randomly picked) markers on 36 taxa (representing two-thirds of the ray-finned fish orders) demonstrate the feasibility of amplifying by PCR and directly sequencing most of these candidates from whole genomic DNA in a vast diversity of fish species. Preliminary phylogenetic analyses of sequence data obtained for 14 taxa and 10 markers (total of 7,872 bp for each species) are encouraging, suggesting that the markers obtained will make significant contributions to future fish phylogenetic studies.
We present a practical approach that systematically compares whole genome sequences to identify single-copy nuclear gene markers for inferring phylogeny. Our method is an improvement over traditional approaches (e.g., manually picking genes for testing) because it uses genomic information and automates the process of identifying large numbers of candidate markers. This approach is shown here to be successful for fishes, but it could also be applied to other groups of organisms for which two or more complete genome sequences exist, which has important implications for assembling the tree of life.
Very early after the identification of the human immunodeficiency virus (HIV), host genetic factors were anticipated to play a role in viral control and disease progression. As early as the mid-1990s, candidate gene studies demonstrated a central role for the chemokine co-receptor/ligand (e.g., CCR5) and human leukocyte antigen (HLA) systems. In the last decade, the advent of genome-wide arrays opened a new era of unbiased genetic exploration of the genome and raised great expectations for the identification of new, unexpected genes and pathways involved in HIV/AIDS. More than 15 genome-wide association studies targeting various HIV-linked phenotypes have been published since 2007. Surprisingly, only the two HIV chemokine co-receptor loci and the HLA loci have exhibited consistent and reproducible statistically significant genetic associations. In this chapter, we review the findings from the genome-wide studies, focusing especially on non-progression and HIV control phenotypes, and discuss current perspectives.
genome-wide association study; SNP; HIV-1; viral control; long-term non-progression; chemokine receptors region; HLA
Global climate change and its impact on human life has become one of our era's greatest challenges. Despite the urgency and the abundance of climate data, data science has had little impact on furthering our understanding of our planet. This is in stark contrast to fields such as advertising or electronic commerce, where big data has been a great success story. The discrepancy stems from the complex nature of climate data as well as the scientific questions climate science brings forth. This article introduces a data science audience to the challenges and opportunities of mining large climate datasets, with an emphasis on the nuanced differences between mining climate data and traditional big data approaches. We focus on the data, method, and application challenges that must be addressed for big data to fulfill its promise in climate science applications. More importantly, we highlight research showing that relying solely on traditional big data techniques results in dubious findings, and we instead propose a theory-guided data science paradigm that uses scientific theory to constrain both the big data techniques and the results-interpretation process to extract accurate insight from large climate data.
Understanding the relationship between the millions of functional DNA elements and their protein regulators, and how they work together to manifest diverse phenotypes, is key to advancing our understanding of the mammalian genome. Next-generation sequencing technology is now widely used to probe these protein-DNA interactions and to profile gene expression at a genome-wide scale. As the cost of DNA sequencing continues to fall, the interpretation of the ever-increasing amount of data generated represents a considerable challenge.
We have developed ngs.plot, a standalone program to visualize enrichment patterns of DNA-interacting proteins at functionally important regions based on next-generation sequencing data. We demonstrate that ngs.plot is not only efficient but also scalable. We use a few examples to demonstrate that ngs.plot is easy to use and yet powerful enough to generate publication-ready figures.
We conclude that ngs.plot is a useful tool to help fill the gap between massive datasets and genomic information in this era of big sequencing data.
Next-generation sequencing; Visualization; Epigenomics; Data mining; Genomic databases
A three-year study (July 2000 – June 2003) of fish assemblages was conducted in four tributaries of the Big Black River: Big Bywy, Little Bywy, Middle Bywy and McCurtain creeks, which cross the Natchez Trace Parkway, Choctaw County, Mississippi, USA. Little Bywy and Middle Bywy creeks were within watersheds influenced by lignite mining. Big Bywy and Middle Bywy creeks were historically impacted by channelisation. McCurtain Creek was chosen as a reference (control) stream. Fish were collected using a portable backpack electrofishing unit (Smith-Root Inc., Washington, USA). Insectivorous fish dominated all of the streams. There were no pronounced differences in the relative abundances of fishes among the streams (P > 0.05), but fish assemblages fluctuated seasonally. Although there were some differences among streams with regard to individual species, channelisation and lignite mining had no discernible adverse effects on the functional components of fish assemblages, suggesting that fishes in these systems are euryoecious fluvial generalist species adapted to the variable environments of small stream ecosystems.
Fish; Mining; Channelisation
Lactic acid bacteria (LAB) are among the powerhouses of the food industry; they colonize the surfaces of plants and animals and contribute to our health and well-being. The genomic characterization of LAB has rocketed, and presently over 100 complete or nearly complete genomes are available, many of which serve as scientific paradigms. Moreover, functional and comparative metagenomic studies are taking off and provide a wealth of insight into the activity of lactic acid bacteria used in a variety of applications, ranging from starters in complex fermentations to their marketing as probiotics. In this new era of high-throughput analysis, biology has become big science. Hence, there is a need to systematically store the generated information, apply it in an intelligent way, and provide modalities for constructing self-learning systems that can be used for future improvements. This review addresses these systems solutions with a state-of-the-art overview of the present paradigms that relate to the use of lactic acid bacteria in industrial applications. Moreover, an outlook is presented on future developments, including the transition into practice as well as the use of lactic acid bacteria in synthetic biology and other next-generation applications.
The trend of recent research, in which synthetic biology and white biotechnology driven by systems approaches based on “omics” technology are recognized as the foundation of biotechnology, indicates the coming of the ‘metagenome era’, in which the genomes of all microbes are accessed with the aim of understanding and industrially applying the whole of microbial resources. The remarkable advance of technologies for mining and analyzing metagenomes is enabling not only practical applications of metagenomes but also systems approaches at the mixed-genome level based on accumulated information. Against this background, the present review introduces the trends and methods of metagenome research and examines the big science that related resources will lead in the future.
Metagenome; Gene mining; Novel metabolites; Systems approach; Biological treasure
Human influenza virus isolates generally grow poorly in embryonated chicken eggs. Hence, gene reassortment of influenza A wild type (wt) viruses is performed with a highly egg adapted donor virus, A/Puerto Rico/8/1934 (PR8), to provide the high yield reassortant (HYR) viral ‘seeds’ for vaccine production. HYR must contain the hemagglutinin (HA) and neuraminidase (NA) genes of wt virus and one to six ‘internal’ genes from PR8. Most studies of influenza wt and HYRs have focused on the HA gene. The main objective of this study is the identification of the molecular signature in all eight gene segments of influenza A HYR candidate vaccine seeds associated with high growth in ovo.
The genomes of 14 wt parental viruses, 23 HYRs (5 H1N1; 2 1976 H1N1-SOIV; 2 2009 H1N1pdm; 2 H2N2; and 12 H3N2) and PR8 were sequenced using a high-throughput sequencing pipeline with BigDye terminator chemistry.
Silent and coding mutations were found in all internal genes derived from PR8, with the exception of the M gene. The M gene derived from PR8 was invariant in all 23 HYRs, underlining the critical role of PR8 M in the high-yield phenotype. None of the wt virus-derived internal genes had any silent changes except the PB1 gene in X-157. The highest number of recurrent silent and coding mutations was found in NS. With respect to the surface antigens, the majority of HYRs had coding mutations in HA; only two HYRs had coding mutations in NA.
In the era of application of reverse genetics to alter influenza A virus genomes, the mutations identified in the HYR gene segments associated with high growth in ovo may be of great practical benefit to modify PR8 and/or wt virus gene sequences for improved growth of vaccine ‘seed’ viruses.
The overwhelming amount of network data in functional genomics is making its visualization cluttered with jumbled nodes and edges. Such cluttered network visualization, known as a "hair-ball", significantly hinders researchers' interpretation and analysis of the data. Effective navigation approaches that can abstract network data properly and present them insightfully are hence required to help researchers interpret the data and acquire knowledge efficiently. Cytoscape is a de facto standard platform for network visualization and analysis with many users around the world. Apart from its sophisticated core features, its functionality can easily be extended by loading extra plug-ins.
We developed NaviClusterCS, which enables researchers to interactively navigate large biological networks of ~100,000 nodes in a "Google Maps-like" manner in the Cytoscape environment. NaviClusterCS rapidly and automatically identifies biologically meaningful clusters in large networks, e.g., proteins sharing similar biological functions in protein-protein interaction networks. It then displays not all nodes but only a manageable number of those clusters at any magnification, avoiding cluttered network visualization, while its zooming and re-centering functions still enable researchers to interactively analyze the networks in detail. Its application to a real Arabidopsis co-expression network dataset illustrates a practical use of the tool for suggesting knowledge that is hidden in large biological networks and difficult to obtain using other visualization methods.
NaviClusterCS provides interactive and multi-scale network navigation to a wide range of biologists in the big data era, via the de facto standard platform for network visualization. It can be freely downloaded at http://navicluster.cb.k.u-tokyo.ac.jp/cs/ and installed as a plug-in of Cytoscape.
With a remarkable increase in the genomic sequence data of a wide range of species, novel tools are needed for comprehensive analyses of these big sequence data. The self-organizing map (SOM) is a powerful tool for clustering high-dimensional data on a single plane. For oligonucleotide compositions handled as high-dimensional data, we previously modified the conventional SOM for genome informatics: the batch-learning SOM (BLSOM). In the present study, we constructed BLSOMs for oligonucleotide compositions in fragment sequences (e.g. 100 kb) from a wide range of vertebrates, including coelacanth, and found that the sequences were clustered primarily according to species without any species information being given. As one of the nearest living relatives of tetrapod ancestors, coelacanth is believed to provide access to the phenotypic and genomic transitions leading to the emergence of tetrapods. The characteristic oligonucleotide composition found for coelacanth was connected with the lowest dinucleotide CG occurrence (i.e. the highest CG suppression) among fishes, which was rather equivalent to that of tetrapods. This evident CG suppression in coelacanth should reflect the molecular evolutionary processes of epigenetic systems, including DNA methylation, during vertebrate evolution. The sequence of a de novo DNA methyltransferase (Dnmt3a) of coelacanth was found to be more closely related to that of tetrapods than to those of other fishes.
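CG suppression is commonly quantified as the observed/expected CG dinucleotide ratio (values well below 1.0 indicate strong suppression); whether the BLSOM study uses exactly this statistic is our assumption, but the following sketch shows the standard calculation on a sequence fragment:

```python
def cg_ratio(seq):
    """Observed/expected CG dinucleotide ratio of a DNA sequence.

    Uses the common formulation obs/exp = n_CG * N / (n_C * n_G),
    where N is the sequence length; values well below 1.0 indicate
    CG suppression.
    """
    seq = seq.upper()
    n = len(seq)
    n_c, n_g = seq.count("C"), seq.count("G")
    # Count overlapping CG dinucleotides
    n_cg = sum(1 for i in range(n - 1) if seq[i:i + 2] == "CG")
    if n_c == 0 or n_g == 0:
        return float("nan")
    return n_cg * n / (n_c * n_g)
```

Applied to 100-kb genome fragments, such a per-fragment statistic allows the CG suppression of coelacanth to be compared directly with that of other fishes and of tetrapods.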
big data; epigenetic; SOM; DNA methylation; CG suppression
Advances in biotechnology have created “big-data” situations in molecular and cellular biology. Several sophisticated algorithms have been developed that process big data to generate hundreds of biomedical hypotheses (or predictions). The bottleneck in translating this large number of biological hypotheses is that each must be studied experimentally to interpret its functional significance. Even when the predictions are estimated to be very accurate, from a biologist’s perspective the choice of which prediction to study further is made based on factors such as the availability of reagents and resources and the possibility of formulating a reasonable hypothesis about its biological relevance. Viewed from a global perspective, say that of a federal funding agency, the choice of which prediction to study would ideally be based on which of them can make the most translational impact.
We propose that algorithms be developed to identify which of the computationally generated hypotheses have potential for high translational impact; this way, funding agencies and the scientific community can invest resources and drive research based on a global view of biomedical impact without being deterred by a local view of feasibility. In short, data-analytic algorithms analyze big data and generate hypotheses; in contrast, the proposed inference-analytic algorithms analyze these hypotheses and rank them by predicted biological impact. We demonstrate this through the development of an algorithm to predict the biomedical impact of protein-protein interactions (PPIs), estimated as the number of future publications that cite the paper that originally reported the PPI.
This position paper describes a new computational problem that is relevant in the era of big data and discusses the challenges in studying it, highlighting the need for the scientific community to engage in this line of research. The proposed class of algorithms, namely inference-analytic algorithms, is necessary to ensure that resources are invested in translating those computational outcomes that promise maximum biological impact. Application of this concept to predicting the biomedical impact of PPIs illustrates not only the concept but also the challenges in designing such algorithms.
Impact prediction; Data analytics; Inference analytics; Protein-protein interaction prediction; Big-data
Big data has found its way into clinical practice since the advent of the information technology era. Medical records and follow-up data can be stored and extracted more efficiently with information technology. Immediately after admission, a patient produces a large amount of data, including laboratory findings, medications, fluid balance, progress notes and imaging findings. Clinicians and clinical investigators should make every effort to exploit the big data that is continuously generated by electronic medical record (EMR) systems and other healthcare databases. At this stage, more training courses on data management and statistical analysis are required before clinicians and clinical investigators can handle big data and translate it into advances in medical science. China is a large country with a population of 1.3 billion and can contribute greatly to clinical research by providing reliable, high-quality big data.
Big data; critical care medicine; mainland China
Summary: BigWig and BigBed files are compressed, binary, indexed files containing data at several resolutions that allow high-performance display of next-generation sequencing experiment results in the UCSC Genome Browser. The visualization is implemented using a multi-layered software approach that takes advantage of specific capabilities of web-based protocols and the Linux and UNIX operating systems, R trees and various indexing and compression tricks. As a result, only the data needed to support the current browser view is transmitted rather than the entire file, enabling fast remote access to large distributed data sets.
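The core idea (precomputed summaries at multiple resolutions, with range queries answered from the coarsest level that is still adequate for the current view) can be sketched in a few lines of Python; the bin layout, zoom factor and level-selection policy below are simplifications for illustration, not the actual BigWig on-disk format or its R-tree index.

```python
def build_zoom_levels(values, factor=4, min_bins=2):
    """Precompute per-bin (count, min, max, sum) summaries at successively
    coarser resolutions, analogous to BigWig zoom levels."""
    levels = {1: [(1, v, v, v) for v in values]}  # base level: one record per value
    bin_size = factor
    while len(values) // bin_size >= min_bins:
        summaries = []
        for start in range(0, len(values), bin_size):
            chunk = values[start:start + bin_size]
            summaries.append((len(chunk), min(chunk), max(chunk), sum(chunk)))
        levels[bin_size] = summaries
        bin_size *= factor
    return levels

def query_mean(levels, start, end, max_points=100):
    """Answer a range query from the coarsest level that still yields about
    max_points bins across [start, end); finer data is never touched."""
    span = end - start
    bin_size = max(b for b in levels if b <= max(1, span // max_points))
    summaries = levels[bin_size]
    first, last = start // bin_size, (end + bin_size - 1) // bin_size
    total = sum(s[3] for s in summaries[first:last])
    count = sum(s[0] for s in summaries[first:last])
    return total / count
```

A browser-style viewer using this scheme reads only on the order of span / bin_size summary records per screen, which is why zooming out over a whole chromosome stays fast even when the base-level file is enormous.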
Availability and implementation: Binaries for the BigWig and BigBed creation and parsing utilities may be downloaded at http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/. Source code for the creation and visualization software is freely available for non-commercial use at http://hgdownload.cse.ucsc.edu/admin/jksrc.zip, implemented in C and supported on Linux. The UCSC Genome Browser is available at http://genome.ucsc.edu.
Supplementary information: Supplementary byte-level details of the BigWig and BigBed file formats are available at Bioinformatics online. For an in-depth description of UCSC data file formats and custom tracks, see http://genome.ucsc.edu/FAQ/FAQformat.html and http://genome.ucsc.edu/goldenPath/help/hgTracksHelp.html
The National Wildlife Refuge system is a vital resource for the protection and conservation of biodiversity and biological integrity in the United States. Surveys were conducted to determine the spatial and temporal patterns of fish, macroinvertebrate, and crayfish populations in two watersheds that encompass three refuges in southern Indiana. The Patoka River National Wildlife Refuge had the highest number of aquatic species, with 355 macroinvertebrate taxa, six crayfish species, and 82 fish species, while the Big Oaks National Wildlife Refuge had 163 macroinvertebrate taxa, seven crayfish species, and 37 fish species. The Muscatatuck National Wildlife Refuge had the lowest macroinvertebrate diversity, with 96 taxa and six crayfish species, while possessing the second highest fish species richness with 51 species. Habitat quality was highest in the Muscatatuck River drainage, which had more forested habitat than the Patoka River drainage. Among the three refuges, the Patoka NWR had the lowest biological integrity (mean reach IBI score = 35 points), while Big Oaks had the highest (mean reach IBI score = 41 points). The Muscatatuck NWR had a mean reach IBI score of 31 during June, which increased seasonally to a mean of 40 points during summer. Watershed IBI scores and habitat condition were highest in the Big Oaks NWR.
Distribution; Conservation; Ecological health; Fish; Crayfish; Macroinvertebrates
Between five and fourteen per cent of genes in vertebrate genomes overlap, sharing some intronic and/or exonic sequence. It has been observed that the majority of these overlaps are not conserved among vertebrate lineages. Although several mechanisms have been proposed to explain the origin of gene overlaps, the evolutionary basis of this phenomenon is still not well understood. Here, we present the results of a comparative analysis of several vertebrate genomes. The purpose of this study was to examine overlapping genes in the context of their evolution and the mechanisms leading to their origin.
Based on the presence and arrangement of orthologs of human overlapping genes in rodent and fish genomes, we developed 15 theoretical scenarios of overlapping gene evolution. Analysis of these theoretical scenarios and close examination of genomic sequences revealed new mechanisms leading to the evolution of overlaps and confirmed that many vertebrate gene overlaps are not conserved. This study also demonstrates that repetitive elements contribute to the origination of overlapping genes and, for the first time, that evolutionary events can lead to the loss of an ancient overlap.
The birth, and most probably also the death, of gene overlaps occurred throughout vertebrate evolution; there was no rapid origin or 'big bang' in the course of overlapping gene evolution. The major forces in the origination of gene overlaps are transposition and exaptation. Our results also imply that the origin of overlapping genes is not a matter of saving space and contracting genome size.
The cancer Biomedical Informatics Grid (caBIG) was launched in 2003 by the US National Cancer Institute with the aim of connecting research teams through the use of shared infrastructure and software to collect, analyse and share data. It was an ambitious project, and the issue it aimed to address was huge and far-reaching. With such developments as the mapping of the human genome and the advancement of new technologies for the analysis of genes and proteins, cancer researchers have never produced so much complex data, nor have they understood so much about cancer on a molecular level. This new ‘molecular understanding’ of cancer, according to the caBIG 2007 ‘Pilot Report’, leads to molecular or ‘personalised’ medicine being the way forward in cancer research and treatment, and connects basic research to clinical care in an unprecedented way. But the former ‘silo-like’ nature of research does not lend itself to this brave new world of molecular medicine—individual labs and institutes working in isolation, “in effect, as cottage industries, each collecting and interpreting data using a unique language of their own” will not advance cancer research as it should be advanced. The solution proposed by the NCI in caBIG was to produce an integrated informatics grid (‘caGrid’) to incorporate open source, open access tools to collect, analyse and share data, enabling everyone to use the same methods and language for these tasks.
caBIG is primarily a US-based endeavour, and though the tools are openly available for users worldwide, it is in US NCI-funded cancer centres that they have been actively introduced and promoted with the eventual hope, according to the pilot report, of being able to do the same worldwide. caBIG also has a collaboration in place with the UK organisation NCRI to exchange technologies and research data. The European Association for Cancer Research, a member association for cancer researchers, conducted an online survey in January 2011 to identify the penetration of the ambitious caBIG project into European laboratories. The survey was sent to 6396 researchers based in Europe, with 764 respondents, a total response rate of 11.94%.
In this article we introduce modern statistical machine learning and bioinformatics approaches that have been used to learn statistical relationships from big data in medicine and behavioral science, which typically include clinical, genomic (and proteomic) and environmental variables. Every year, the data collected in biomedical and behavioral science become larger and more complicated. Thus, in medicine, we also need to be aware of this trend and understand the statistical tools that are available to analyze these datasets. Many statistical methods aimed at analyzing such big datasets have been introduced recently. However, given the many different types of clinical, genomic, and environmental data, it is rather uncommon to see statistical methods that combine the knowledge resulting from those different data types. To this end, we introduce big data in terms of clinical data, single nucleotide polymorphism and gene expression studies, and their interactions with the environment. We introduce well-known regression analyses, such as linear and logistic regression, that have been widely used in clinical data analysis, as well as modern statistical models, such as Bayesian networks, that have been introduced to analyze more complicated data. We also discuss how to represent the interactions among clinical, genomic, and environmental data using modern statistical models. We conclude with a promising modern statistical method, the Bayesian network, which is suitable for analyzing big datasets consisting of different types of large-scale clinical, genomic, and environmental data. Such statistical models of big data will provide a more comprehensive understanding of human physiology and disease.
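To make the Bayesian network idea concrete, the toy model below encodes a hypothetical genotype (G), environmental exposure (E), disease (D) and clinical symptom (S) with made-up conditional probability tables, and computes posterior disease probabilities by brute-force enumeration over the joint distribution; every variable and number here is illustrative, not taken from the article, and real networks would use learned structures and more scalable inference.

```python
from itertools import product

# Hypothetical CPTs for a toy network:  G -> D <- E,  D -> S
P_G1 = 0.3                    # P(risk genotype present)
P_E1 = 0.4                    # P(environmental exposure)
P_D1 = {(1, 1): 0.8, (1, 0): 0.5, (0, 1): 0.3, (0, 0): 0.05}  # P(D=1 | G, E)
P_S1 = {1: 0.9, 0: 0.1}       # P(symptom present | D)

def joint(g, e, d, s):
    """Joint probability of one full assignment, via the chain rule of the network."""
    p = (P_G1 if g else 1 - P_G1) * (P_E1 if e else 1 - P_E1)
    p *= P_D1[(g, e)] if d else 1 - P_D1[(g, e)]
    p *= P_S1[d] if s else 1 - P_S1[d]
    return p

def posterior_d(evidence):
    """P(D=1 | evidence) by enumerating all assignments consistent with the evidence."""
    num = den = 0.0
    for g, e, d, s in product((0, 1), repeat=4):
        assign = {"G": g, "E": e, "D": d, "S": s}
        if any(assign[k] != v for k, v in evidence.items()):
            continue
        p = joint(g, e, d, s)
        den += p
        if d == 1:
            num += p
    return num / den
```

The appeal for mixed clinical/genomic/environmental data is visible even in this sketch: adding a clinical observation (`{"S": 1}`) and then a genomic one (`{"S": 1, "G": 1}`) each updates the same disease posterior through one coherent model, rather than through separate per-data-type analyses.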
Bayesian analysis; Statistical data interpretation; Systems biology