Related Articles

1.  Using the Pareto principle in genome-wide breeding value estimation 
Genome-wide breeding value (GWEBV) estimation methods can be classified based on the prior distribution assumptions of marker effects. Genome-wide BLUP methods assume a normal prior distribution for all markers with a constant variance, and are computationally fast. Bayesian methods apply more flexible prior distributions of SNP effects that allow for very large SNP effects although most are small or even zero, but these prior distributions are often computationally demanding because they rely on Markov chain Monte Carlo sampling. In this study, we adopted the Pareto principle to weight available marker loci, i.e., we consider that x% of the loci explain (100 - x)% of the total genetic variance. Assuming this principle, it is also possible to define the variances of the prior distribution of the 'big' and 'small' SNP. The relatively few large SNP explain a large proportion of the genetic variance, and the majority of the SNP show small effects and explain a minor proportion of the genetic variance. We name this method MixP, where the prior distribution is a mixture of two normal distributions, i.e. one with a big variance and one with a small variance. Simulation results, using a real Norwegian Red cattle pedigree, show that MixP is at least as accurate as the other methods in all studied cases. This method also reduces the hyper-parameters of the prior distribution from 2 (proportion and variance of SNP with big effects) to 1 (proportion of SNP with big effects), assuming the overall genetic variance is known. The mixture-of-normals prior made it possible to solve the equations iteratively, which reduced the computational load by two orders of magnitude. In an era of marker densities reaching the millions and of whole-genome sequence data, MixP provides a computationally feasible Bayesian method of analysis.
PMCID: PMC3354342  PMID: 22044555
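The Pareto weighting described in the MixP abstract above can be illustrated with a small calculation. This is only a sketch under stated assumptions, not the authors' implementation: it assumes the per-marker prior variances sum to the total genetic variance (for example, with standardized genotypes), and the function name `mixp_prior_variances` and its parameters are hypothetical.

```python
import math


def mixp_prior_variances(total_genetic_variance, n_markers, big_fraction):
    """Split a known genetic variance between 'big' and 'small' SNP classes.

    Assumption (not from the abstract): per-marker variances add up to the
    total genetic variance. A fraction `big_fraction` of the markers is taken
    to explain (1 - big_fraction) of the variance, and the remaining markers
    explain `big_fraction` of it.
    """
    if not 0.0 < big_fraction < 1.0:
        raise ValueError("big_fraction must lie strictly between 0 and 1")
    n_big = big_fraction * n_markers
    n_small = (1.0 - big_fraction) * n_markers
    var_big = (1.0 - big_fraction) * total_genetic_variance / n_big
    var_small = big_fraction * total_genetic_variance / n_small
    return var_big, var_small


# Example: 1,000 markers, 20% of which explain 80% of a unit genetic variance.
var_big, var_small = mixp_prior_variances(1.0, 1000, 0.2)
assert math.isclose(200 * var_big + 800 * var_small, 1.0)
```

With the genetic variance treated as known, the only free quantity in this sketch is `big_fraction`, which matches the abstract's point that a single hyper-parameter remains.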
2.  Big Defensins, a Diverse Family of Antimicrobial Peptides That Follows Different Patterns of Expression in Hemocytes of the Oyster Crassostrea gigas 
PLoS ONE  2011;6(9):e25594.
Big defensin is an antimicrobial peptide composed of a highly hydrophobic N-terminal region and a cationic C-terminal region containing six cysteine residues involved in three internal disulfide bridges. While big defensin sequences have been reported in various mollusk species, few studies have been devoted to their sequence diversity, gene organization and their expression in response to microbial infections.
Using the high-throughput Digital Gene Expression approach, we have identified in Crassostrea gigas oysters several sequences coding for big defensins induced in response to a Vibrio infection. We showed that the oyster big defensin family is composed of three members (named Cg-BigDef1, Cg-BigDef2 and Cg-BigDef3) that are encoded by distinct genomic sequences. All Cg-BigDefs contain a hydrophobic N-terminal domain and a cationic C-terminal domain that resembles vertebrate β-defensins. Both domains are encoded by separate exons. We found that big defensins form a group predominantly present in mollusks and closer to vertebrate defensins than to invertebrate and fungi CSαβ-containing defensins. Moreover, we showed that Cg-BigDefs are expressed in oyster hemocytes only and follow different patterns of gene expression. While Cg-BigDef3 is non-regulated, both Cg-BigDef1 and Cg-BigDef2 transcripts are strongly induced in response to bacterial challenge. Induction was dependent on pathogen associated molecular patterns but not damage-dependent. The inducibility of Cg-BigDef1 was confirmed by HPLC and mass spectrometry, since ions with a molecular mass compatible with mature Cg-BigDef1 (10.7 kDa) were present in immune-challenged oysters only. From our biochemical data, native Cg-BigDef1 would result from the elimination of a prepropeptide sequence and the cyclization of the resulting N-terminal glutamine residue into a pyroglutamic acid.
We provide here the first report showing that big defensins form a family of antimicrobial peptides diverse not only in terms of sequences but also in terms of genomic organization and regulation of gene expression.
PMCID: PMC3182236  PMID: 21980497
3.  Big data: the next frontier for innovation in therapeutics and healthcare 
Advancements in genomics and personalized medicine not only affect healthcare delivery from patient and provider standpoints, but also reshape biomedical discovery. We are in the era of the “-omics”, wherein an individual’s genome, transcriptome, proteome and metabolome can be scrutinized to the finest resolution to paint a personalized biochemical fingerprint that enables tailored treatments, prognoses, risk factors, etc. Digitization of this information parlays into “big data” informatics-driven evidence-based medical practice. While individualized patient management is a key beneficiary of next-generation medical informatics, this data also harbors a wealth of novel therapeutic discoveries waiting to be uncovered. “Big data” informatics allows for networks-driven systems pharmacodynamics whereby drug information can be coupled to cellular- and organ-level physiology for determining whole-body outcomes. Patient “-omics” data can be integrated for ontology-based data-mining for the discovery of new biological associations and drug targets. Here we highlight the potential of “big data” informatics for clinical pharmacology.
PMCID: PMC4448933  PMID: 24702684
big data; clinical pharmacology; personalized medicine; systems medicine; therapeutics
4.  Lost in Translation (LiT) 
British Journal of Pharmacology  2014;171(9):2269-2290.
Translational medicine is a roller coaster with occasional brilliant successes and a large majority of failures. Lost in Translation 1 (‘LiT1’), beginning in the 1950s, was a golden era built upon earlier advances in experimental physiology, biochemistry and pharmacology, with a dash of serendipity, that led to the discovery of many new drugs for serious illnesses. LiT2 saw the large-scale industrialization of drug discovery using high-throughput screens and assays based on affinity for the target molecule. The links between drug development and university sciences and medicine weakened, but there were still some brilliant successes. In LiT3, the coverage of translational medicine expanded from molecular biology to drug budgets, with much greater emphasis on safety and official regulation. Compared with R&D expenditure, the number of breakthrough discoveries in LiT3 was disappointing, but monoclonal antibodies for immunity and inflammation brought in a new golden era and kinase inhibitors such as imatinib were breakthroughs in cancer. The pharmaceutical industry is trying to revive the LiT1 approach by using phenotypic assays and closer links with academia. LiT4 faces a data explosion generated by the genome project, GWAS, ENCODE and the ‘omics’ that is in danger of leaving LiT4 in a computerized cloud. Industrial laboratories are filled with masses of automated machinery while the scientists sit in a separate room viewing the results on their computers. Big Data will need Big Thinking in LiT4 but with so many unmet medical needs and so many new opportunities being revealed there are high hopes that the roller coaster will ride high again.
PMCID: PMC3997269  PMID: 24428732
roller coaster; golden years; breakthroughs; monoclonal antibodies; GWAS; ENCODE; cost; safety; patients; adherence; regulation; Big Data
5.  Big Data: the challenge for small research groups in the era of cancer genomics 
British Journal of Cancer  2015;113(10):1405-1412.
In the past decade, cancer research has seen an increasing trend towards high-throughput techniques and translational approaches. The increasing availability of assays that utilise smaller quantities of source material and produce higher volumes of data output has created a need for data storage solutions beyond those previously used. Multifactorial data, both large in sample size and heterogeneous in context, needs to be integrated in a standardised, cost-effective and secure manner. This requires technical solutions and administrative support not normally financially accounted for in small- to moderate-sized research groups. In this review, we highlight the Big Data challenges faced by translational research groups in the precision medicine era; an era in which the genomes of over 75 000 patients will be sequenced by the National Health Service over the next 3 years to advance healthcare. In particular, we have looked at three main themes of data management in relation to cancer research, namely (1) cancer ontology management, (2) IT infrastructures that have been developed to support data management and (3) the unique ethical challenges introduced by utilising Big Data in research.
PMCID: PMC4815885  PMID: 26492224
cancer research; database management systems; biobanking; genomics; ontology management; data ethics
6.  Lactococcus garvieae: a small bacteria and a big data world 
To describe the importance of bioinformatics tools to analyze the big data yielded from new "omics" generation-methods, with the aim of unraveling the biology of the pathogen bacteria Lactococcus garvieae.
The paper provides the vision of the large volume of data generated from genome sequences, gene expression profiles by microarrays and other experimental methods that require biomedical informatics methods for management and analysis.
The use of biomedical informatics methods improves the analysis of big data in order to obtain a comprehensive characterization and understanding of the biology of pathogenic organisms, such as L. garvieae.
The "Big Data" concepts of high volume, veracity and variety are nowadays part of microbiology research, which uses multiple methods in the "omic" era. Biomedical informatics methods are required to improve the analysis of these data.
PMCID: PMC4416232  PMID: 25960872
Lactococcus garvieae; Big Data; Genomics; Gene expression
7.  The Role of the Toxicologic Pathologist in the Post-Genomic Era 
Journal of Toxicologic Pathology  2013;26(2):105-110.
An era can be defined as a period in time identified by distinctive character, events, or practices. We are now in the genomic era. The pre-genomic era: There was a pre-genomic era. It started many years ago with novel and seminal animal experiments, primarily directed at studying cancer. It is marked by the development of the two-year rodent cancer bioassay and the ultimate realization that alternative approaches and short-term animal models were needed to replace this resource-intensive and time-consuming method for predicting human health risk. Many alternative approaches and short-term animal models were proposed and tried but, to date, none have completely replaced our dependence upon the two-year rodent bioassay. However, the alternative approaches and models themselves have made tangible contributions to basic research, clinical medicine and to our understanding of cancer and they remain useful tools to address hypothesis-driven research questions. The pre-genomic era was a time when toxicologic pathologists played a major role in drug development, evaluating the cancer bioassay and the associated dose-setting toxicity studies, and exploring the utility of proposed alternative animal models. It was a time when there was a shortage of qualified toxicologic pathologists. The genomic era: We are in the genomic era. It is a time when the genetic underpinnings of normal biological and pathologic processes are being discovered and documented. It is a time for sequencing entire genomes and deliberately silencing relevant segments of the mouse genome to see what each segment controls and if that silencing leads to increased susceptibility to disease. What remains to be charted in this genomic era is the complex interaction of genes, gene segments, post-translational modifications of encoded proteins, and environmental factors that affect genomic expression. 
In this current genomic era, the toxicologic pathologist has had to make room for a growing population of molecular biologists. In this present era newly emerging DVM and MD scientists enter the work arena with a PhD in pathology often based on some aspect of molecular biology or molecular pathology research. In molecular biology, the almost daily technological advances require one’s complete dedication to remain at the cutting edge of the science. Similarly, the practice of toxicologic pathology, like other morphological disciplines, is based largely on experience and requires dedicated daily examination of pathology material to maintain a well-trained eye capable of distilling specific information from stained tissue slides - a dedicated effort that cannot be well done as an intermezzo between other tasks. It is a rare individual that has true expertise in both molecular biology and pathology. In this genomic era, the newly emerging DVM-PhD or MD-PhD pathologist enters a marketplace without many job opportunities in contrast to the pre-genomic era. Many face an identity crisis needing to decide to become a competent pathologist or, alternatively, to become a competent molecular biologist. At the same time, more PhD molecular biologists without training in pathology are members of the research teams working in drug development and toxicology. How best can the toxicologic pathologist interact in the contemporary team approach in drug development, toxicology research and safety testing? Based on their biomedical training, toxicologic pathologists are in an ideal position to link data from the emerging technologies with their knowledge of pathobiology and toxicology. To enable this linkage and obtain the synergy it provides, the bench-level, slide-reading expert pathologist will need to have some basic understanding and appreciation of molecular biology methods and tools. 
On the other hand, it is not likely that the typical molecular biologist could competently evaluate and diagnose stained tissue slides from a toxicology study or a cancer bioassay. The post-genomic era: The post-genomic era will likely arrive around 2050, at which time entire genomes from multiple species will exist in massive databases, data from thousands of robotic high throughput chemical screenings will exist in other databases, and genetic toxicity and chemical structure-activity-relationships will reside in yet other databases. All databases will be linked and relevant information will be extracted and analyzed by appropriate algorithms following input of the latest molecular, submolecular, genetic, experimental, pathology and clinical data. Knowledge gained will permit the genetic components of many diseases to be amenable to therapeutic prevention and/or intervention. Much like computerized algorithms are currently used to forecast weather or to predict political elections, sophisticated computerized algorithms based largely on scientific data mining will categorize new drugs and chemicals relative to their health benefits versus their health risks for defined human populations and subpopulations. However, this form of a virtual toxicity study or cancer bioassay will only identify probabilities of adverse consequences from interaction of particular environmental and/or chemical/drug exposure(s) with specific genomic variables. Proof in many situations will require confirmation in intact in vivo mammalian animal models. The toxicologic pathologist in the post-genomic era will be the best suited scientist to confirm the data mining and its probability predictions for safety or adverse consequences with the actual tissue morphological features in test species that define specific test agent pathobiology and human health risk.
PMCID: PMC3695332  PMID: 23914052
genomic era; history of toxicologic pathology; molecular biology
8.  Applications of the MapReduce programming framework to clinical big data analysis: current landscape and future trends 
BioData Mining  2014;7:22.
The emergence of massive datasets in a clinical setting presents both challenges and opportunities in data storage and analysis. This so-called “big data” challenges traditional analytic tools and will increasingly require novel solutions adapted from other fields. Advances in information and communication technology present the most viable solutions to big data analysis in terms of efficiency and scalability. It is vital that those big data solutions are multithreaded and that data access approaches be precisely tailored to large volumes of semi-structured/unstructured data.
The MapReduce programming framework uses two tasks common in functional programming: Map and Reduce. MapReduce is a new parallel processing framework and Hadoop is its open-source implementation on a single computing node or on clusters. Compared with existing parallel processing paradigms (e.g. grid computing and graphical processing unit (GPU)), MapReduce and Hadoop have two advantages: 1) fault-tolerant storage resulting in reliable data processing by replicating the computing tasks, and cloning the data chunks on different computing nodes across the computing cluster; 2) high-throughput data processing via a batch processing framework and the Hadoop distributed file system (HDFS). Data are stored in the HDFS and made available to the slave nodes for computation.
In this paper, we review the existing applications of the MapReduce programming framework and its implementation platform Hadoop in clinical big data and related medical health informatics fields. The usage of MapReduce and Hadoop on a distributed system represents a significant advance in clinical big data processing and utilization, and opens up new opportunities in the emerging era of big data analytics. The objective of this paper is to summarize the state-of-the-art efforts in clinical big data analytics and highlight what might be needed to enhance the outcomes of clinical big data analytics tools. This paper is concluded by summarizing the potential usage of the MapReduce programming framework and Hadoop platform to process huge volumes of clinical data in medical health informatics related fields.
PMCID: PMC4224309  PMID: 25383096
MapReduce; Hadoop; Big data; Clinical big data analysis; Clinical data analysis; Bioinformatics; Distributed programming
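The Map and Reduce tasks named in the abstract above can be illustrated without a cluster. The following is a single-process sketch in plain Python, not Hadoop code: the record layout (a `diagnoses` list per patient) and all function names are hypothetical, and a real deployment would distribute the map and reduce steps across nodes and store the data in HDFS.

```python
from collections import defaultdict


def map_record(record):
    """Map step: emit one (diagnosis_code, 1) pair per code in a record."""
    return [(code, 1) for code in record["diagnoses"]]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between the steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: collapse all values for one key into a single total."""
    return key, sum(values)


def count_diagnoses(records):
    """Run map, shuffle and reduce in sequence on an in-memory list."""
    pairs = [pair for record in records for pair in map_record(record)]
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())


# Toy, made-up records used only to exercise the sketch.
records = [
    {"diagnoses": ["I10", "E11"]},
    {"diagnoses": ["I10"]},
]
counts = count_diagnoses(records)
```

Because each call to `map_record` depends on one record only, and each call to `reduce_counts` on one key only, both steps could in principle run in parallel on separate data chunks, which is the property the framework exploits.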
9.  tRNADB-CE: tRNA gene database well-timed in the era of big sequence data 
Frontiers in Genetics  2014;5:114.
The tRNA gene database curated by experts, “tRNADB-CE” ( was constructed by analyzing 1,966 complete and 5,272 draft genomes of prokaryotes, 171 viruses’, 121 chloroplasts’, and 12 eukaryotes’ genomes plus fragment sequences obtained by metagenome studies of environmental samples. In total, 595,115 tRNA genes have been registered, twice the number compiled previously; for each gene, the sequence, clover-leaf structure, and results of sequence-similarity and oligonucleotide-pattern searches can be browsed. To provide collective knowledge with help from experts in tRNA research, we added a column for registering comments on each tRNA. By grouping bacterial tRNAs with an identical sequence, we have found high phylogenetic preservation of tRNA sequences, especially at the phylum level. Since many species-unknown tRNAs from metagenomic sequences have sequences identical to those found in species-known prokaryotes, the identical sequence group (ISG) can provide phylogenetic markers to investigate the microbial community in an environmental ecosystem. This strategy can be applied to the huge number of short sequences obtained from next-generation sequencers, showing that tRNADB-CE is a well-timed database in the era of big sequence data. We also discuss how a batch-learning self-organizing map (BLSOM) with oligonucleotide composition is useful for efficient knowledge discovery from big sequence data.
PMCID: PMC4013482  PMID: 24822057
tRNA; database; metagenome; phylogenetic marker; BLSOM; big data
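The identical sequence group (ISG) idea in the abstract above amounts to grouping tRNA genes by exact sequence and then looking up metagenomic sequences in those groups. The sketch below shows that grouping under simple assumptions; it is not the tRNADB-CE pipeline, and the function names, the `(sequence, taxon)` record layout and the toy sequences are all hypothetical.

```python
from collections import defaultdict


def build_isgs(records):
    """Group (sequence, taxon) records so each exact sequence maps to its taxa."""
    groups = defaultdict(set)
    for sequence, taxon in records:
        groups[sequence.upper()].add(taxon)
    return dict(groups)


def candidate_taxa(isgs, metagenomic_sequence):
    """Return the known taxa whose tRNA is identical to a metagenomic sequence."""
    return sorted(isgs.get(metagenomic_sequence.upper(), set()))


# Made-up, shortened sequences; real tRNA genes are roughly 70-90 nt long.
known = [
    ("GCGGATTTAGCTC", "Proteobacteria"),
    ("gcggatttagctc", "Proteobacteria"),
    ("GGGCCCGTAGCTC", "Firmicutes"),
]
isgs = build_isgs(known)
hits = candidate_taxa(isgs, "GCGGATTTAGCTC")
```

A sequence with no identical match simply returns an empty list, so it stays unassigned rather than being forced into a group.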
10.  Reevaluation of the Immunological Big Bang: comparisons of two vertebrate adaptive immune systems 
Current biology : CB  2014;24(21):R1060-R1065.
Classically the immunological ‘Big Bang’ of adaptive immunity was believed to have resulted from the insertion of a transposon into an immunoglobulin superfamily gene member, initiating RAG-based antigen receptor gene rearrangement in an ancestor of jawed vertebrates. However, the discovery of a second, convergent adaptive immune system in jawless fish, focused on the so-called Variable Lymphocyte Receptors (VLR), was arguably the most exciting finding of the past decade in immunology, and has drastically changed the view of immune origins. The recent report of a new lymphocyte lineage in lampreys, defined by the antigen receptor VLRC, suggests that there were three lymphocyte lineages in the common ancestor of jawless and jawed vertebrates that coopted different antigen receptor supertypes. The developmental transcriptional control of these lineages is predicted to be remarkably similar in both the jawless (agnathan) and jawed (gnathostome) systems, suggesting that an early ‘division of labor’ among lymphocytes was a driving force in the emergence of adaptive immunity. The recent cartilaginous fish genome project suggests that most effector cytokines and chemokines were also present, and further studies of the lamprey and hagfish genomes will determine just how explosive the Big Bang actually was.
PMCID: PMC4354883  PMID: 25517375
11.  The Nobel Prize as a Reward Mechanism in the Genomics Era: Anonymous Researchers, Visible Managers and the Ethics of Excellence 
Journal of Bioethical Inquiry  2010;7(3):299-312.
The Human Genome Project (HGP) is regarded by many as one of the major scientific achievements in recent science history, a large-scale endeavour that is changing the way in which biomedical research is done and expected, moreover, to yield considerable benefit for society. Thus, since the completion of the human genome sequencing effort, a debate has emerged over the question whether this effort merits to be awarded a Nobel Prize and if so, who should be the one(s) to receive it, as (according to current procedures) no more than three individuals can be selected. In this article, the HGP is taken as a case study to consider the ethical question to what extent it is still possible, in an era of big science, of large-scale consortia and global team work, to acknowledge and reward individual contributions to important breakthroughs in biomedical fields. Is it still viable to single out individuals for their decisive contributions in order to reward them in a fair and convincing way? Whereas the concept of the Nobel prize as such seems to reflect an archetypical view of scientists as solitary researchers who, at a certain point in their careers, make their one decisive discovery, this vision has proven to be problematic from the very outset. Already during the first decade of the Nobel era, Ivan Pavlov was denied the Prize several times before finally receiving it, on the basis of the argument that he had been active as a research manager (a designer and supervisor of research projects) rather than as a researcher himself. The question then is whether, in the case of the HGP, a research effort that involved the contributions of hundreds or even thousands of researchers worldwide, it is still possible to “individualise” the Prize? The “HGP Nobel Prize problem” is regarded as an exemplary issue in current research ethics, highlighting a number of quandaries and trends involved in contemporary life science research practices more broadly.
PMCID: PMC2917546  PMID: 20730106
Human Genome Project; Nobel Prize; Research ethics; Fairness of reward mechanism in biomedical research
12.  The personal genome browser: visualizing functions of genetic variants 
Nucleic Acids Research  2014;42(Web Server issue):W192-W197.
Advances in high-throughput sequencing technologies have brought us into the individual genome era. Projects such as the 1000 Genomes Project have led the individual genome sequencing to become more and more popular. How to visualize, analyse and annotate individual genomes with knowledge bases to support genome studies and personalized healthcare is still a big challenge. The Personal Genome Browser (PGB) is developed to provide comprehensive functional annotation and visualization for individual genomes based on the genetic–molecular–phenotypic model. Investigators can easily view individual genetic variants, such as single nucleotide variants (SNVs), INDELs and structural variations (SVs), as well as genomic features and phenotypes associated to the individual genetic variants. The PGB especially highlights potential functional variants using the PGB built-in method or SIFT/PolyPhen2 scores. Moreover, the functional risks of genes could be evaluated by scanning individual genetic variants on the whole genome, a chromosome, or a cytoband based on functional implications of the variants. Investigators can then navigate to high risk genes on the scanned individual genome. The PGB accepts Variant Call Format (VCF) and Genetic Variation Format (GVF) files as the input. The functional annotation of input individual genome variants can be visualized in real time by well-defined symbols and shapes. The PGB is available at
PMCID: PMC4086072  PMID: 24799434
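The abstract above mentions VCF input and variant types such as SNVs, INDELs and SVs. As a rough illustration of what reading such input involves, the sketch below splits one tab-delimited VCF data line into its leading fixed columns (CHROM, POS, ID, REF, ALT) and labels each alternate allele by simple length rules. It is not the PGB's method; the function name and the classification rules are assumptions chosen for brevity.

```python
def classify_vcf_line(line):
    """Return (chrom, pos, ref, alt, kind) tuples for one VCF data line.

    Header lines (starting with '#') yield an empty list. The labels are a
    simplification: symbolic or breakend alleles are called 'SV', equal-length
    multi-base substitutions 'MNV', and any other length change 'INDEL'.
    """
    if line.startswith("#"):
        return []
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _variant_id, ref, alt_field = fields[:5]
    results = []
    for alt in alt_field.split(","):
        if alt in (".", "*"):
            continue  # no alternate allele, or allele removed by a deletion
        if alt.startswith("<") or "[" in alt or "]" in alt:
            kind = "SV"
        elif len(ref) == 1 and len(alt) == 1:
            kind = "SNV"
        elif len(ref) == len(alt):
            kind = "MNV"
        else:
            kind = "INDEL"
        results.append((chrom, int(pos), ref, alt, kind))
    return results


# A made-up data line with two alternate alleles at the same position.
variants = classify_vcf_line("1\t100\t.\tA\tG,AT\t50\tPASS\t.")
```

Splitting the ALT column on commas handles multi-allelic sites, so one input line can produce several labeled variants.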
13.  A practical approach to phylogenomics: the phylogeny of ray-finned fish (Actinopterygii) as a case study 
Molecular systematics occupies one of the central stages in biology in the genomic era, ushered in by unprecedented progress in DNA technology. The inference of organismal phylogeny is now based on many independent genetic loci, a widely accepted approach to assemble the tree of life. Surprisingly, this approach is hindered by lack of appropriate nuclear gene markers for many taxonomic groups, especially at high taxonomic levels, partially due to the lack of tools for efficiently developing new phylogenetic markers. We report here a genome-comparison strategy for identifying nuclear gene markers for phylogenetic inference and apply it to the ray-finned fishes – the largest vertebrate clade in need of phylogenetic resolution.
A total of 154 candidate molecular markers – relatively well conserved, putatively single-copy gene fragments with long, uninterrupted exons – were obtained by comparing whole genome sequences of two model organisms, Danio rerio and Takifugu rubripes. Experimental tests of 15 of these (randomly picked) markers on 36 taxa (representing two-thirds of the ray-finned fish orders) demonstrate the feasibility of amplifying by PCR and directly sequencing most of these candidates from whole genomic DNA in a vast diversity of fish species. Preliminary phylogenetic analyses of sequence data obtained for 14 taxa and 10 markers (total of 7,872 bp for each species) are encouraging, suggesting that the markers obtained will make significant contributions to future fish phylogenetic studies.
We present a practical approach that systematically compares whole genome sequences to identify single-copy nuclear gene markers for inferring phylogeny. Our method is an improvement over traditional approaches (e.g., manually picking genes for testing) because it uses genomic information and automates the process to identify large numbers of candidate markers. This approach is shown here to be successful for fishes, but also could be applied to other groups of organisms for which two or more complete genome sequences exist, which has important implications for assembling the tree of life.
PMCID: PMC1838417  PMID: 17374158
14.  The prince and the pauper. A tale of anticancer targeted agents 
Molecular Cancer  2008;7:82.
Cancer rates are set to increase at an alarming rate, from 10 million new cases globally in 2000 to 15 million in 2020. Regarding the pharmacological treatment of cancer, we are currently at the interface of two treatment eras: the so-called pregenomic therapy, which comprises the traditional cancer drugs, mainly cytotoxic drug types, and post-genomic era-type drugs, which are rationally designed. Although there are successful examples of this newer drug discovery approach, most target-specific agents only provide small gains in symptom control and/or survival, whereas others have consistently failed in clinical testing. There is, however, a characteristic shared by these agents: their high cost. This is expected, as drug discovery and development is generally carried out within the commercial rather than the academic realm. Given the extraordinarily high costs and risks associated with therapeutic drug discovery, it is highly unlikely that any single public-sector research group will see a novel chemical "probe" become a "drug". An alternative drug development strategy is the exploitation of established drugs that have already been approved for treatment of non-cancerous diseases and whose cancer target has already been discovered. This strategy is also called drug repositioning, drug repurposing, or indication switch. Although development of these drugs was traditionally unlikely to be pursued by Big Pharma due to their limited commercial value, biopharmaceutical companies attempting to increase productivity are now pursuing drug repositioning. More and more companies are scanning the existing pharmacopoeia for repositioning candidates, and the number of repositioning success stories is increasing. 
Here we provide noteworthy examples of known drugs whose potential anticancer activities have been highlighted, to encourage further research on these known drugs as a means to foster their translation into clinical trials utilizing the more limited public-sector resources. If these drug types eventually result in being effective, it follows that they could be much more affordable for patients with cancer; therefore, their contribution in terms of reducing cancer mortality at the global level would be greater.
PMCID: PMC2615789  PMID: 18947424
15.  IMG-ABC: A Knowledge Base To Fuel Discovery of Biosynthetic Gene Clusters and Novel Secondary Metabolites 
mBio  2015;6(4):e00932-15.
In the discovery of secondary metabolites, analysis of sequence data is a promising exploration path that remains largely underutilized due to the lack of computational platforms that enable such a systematic approach on a large scale. In this work, we present IMG-ABC (, an atlas of biosynthetic gene clusters within the Integrated Microbial Genomes (IMG) system, which is aimed at harnessing the power of “big” genomic data for discovering small molecules. IMG-ABC relies on IMG’s comprehensive integrated structural and functional genomic data for the analysis of biosynthetic gene clusters (BCs) and associated secondary metabolites (SMs). SMs and BCs serve as the two main classes of objects in IMG-ABC, each with a rich collection of attributes. A unique feature of IMG-ABC is the incorporation of both experimentally validated and computationally predicted BCs in genomes as well as metagenomes, thus identifying BCs in uncultured populations and rare taxa. We demonstrate the strength of IMG-ABC’s focused integrated analysis tools in enabling the exploration of microbial secondary metabolism on a global scale, through the discovery of phenazine-producing clusters for the first time in Alphaproteobacteria. IMG-ABC strives to fill the long-existent void of resources for computational exploration of the secondary metabolism universe; its underlying scalable framework enables traversal of uncovered phylogenetic and chemical structure space, serving as a doorway to a new era in the discovery of novel molecules.
IMG-ABC is the largest publicly available database of predicted and experimental biosynthetic gene clusters and the secondary metabolites they produce. The system also includes powerful search and analysis tools that are integrated with IMG’s extensive genomic/metagenomic data and analysis tool kits. As new research on biosynthetic gene clusters and secondary metabolites is published and more genomes are sequenced, IMG-ABC will continue to expand, with the goal of becoming an essential component of any bioinformatic exploration of the secondary metabolism world.
PMCID: PMC4502231  PMID: 26173699
16.  Pinpointing disease genes through phenomic and genomic data fusion 
BMC Genomics  2015;16(Suppl 2):S3.
Pinpointing genes involved in inherited human diseases remains a great challenge in the post-genomics era. Although approaches have been proposed either based on the guilt-by-association principle or making use of disease phenotype similarities, the low coverage of both diseases and genes in existing methods has been preventing the scan of causative genes for a significant proportion of diseases at the whole-genome level.
To overcome this limitation, we proposed a rigorous statistical method called pgFusion to prioritize candidate genes by integrating one type of disease phenotype similarity derived from the Unified Medical Language System (UMLS) and seven types of gene functional similarities calculated from gene expression, gene ontology, pathway membership, protein sequence, protein domain, protein-protein interaction and regulation pattern, respectively. Our method covered a total of 7,719 diseases and 20,327 genes, achieving the highest coverage thus far for both diseases and genes. We performed leave-one-out cross-validation experiments to demonstrate the superior performance of our method and applied it to a real exome sequencing dataset of epileptic encephalopathies, showing the capability of this approach in finding causative genes for complex diseases. We further provided the standalone software and online services of pgFusion at
pgFusion not only provided an effective way for prioritizing candidate genes, but also demonstrated feasible solutions to two fundamental questions in the analysis of big genomic data: the comparability of heterogeneous data and the integration of multiple types of data. Applications of this method in exome or whole genome sequencing studies would accelerate the finding of causative genes for human diseases. Other research fields in genomics could also benefit from the incorporation of our data fusion methodology.
PMCID: PMC4331717  PMID: 25708473
17.  Immunogenetics: Genome-Wide Association of Non-Progressive HIV and Viral Load Control: HLA Genes and Beyond 
Very early after the identification of the human immunodeficiency virus (HIV), host genetic factors were anticipated to play a role in viral control and disease progression. As early as the mid-1990s, candidate gene studies demonstrated a central role for the chemokine co-receptor/ligand (e.g., CCR5) and human leukocyte antigen (HLA) systems. In the last decade, the advent of genome-wide arrays opened a new era for unbiased genetic exploration of the genome and brought big expectations for the identification of new unexpected genes and pathways involved in HIV/AIDS. More than 15 genome-wide association studies targeting various HIV-linked phenotypes have been published since 2007. Surprisingly, only the two HIV-chemokine co-receptors and HLA loci have exhibited consistent and reproducible statistically significant genetic associations. In this chapter, we will review the findings from the genome-wide studies focusing especially on non-progressive and HIV control phenotypes, and discuss the current perspectives.
PMCID: PMC3664380  PMID: 23750159
genome-wide association study; SNP; HIV-1; viral control; long-term non-progression; chemokine receptors region; HLA
18.  Compression of Large genomic datasets using COMRAD on Parallel Computing Platform 
Bioinformation  2015;11(5):267-271.
Big data storage is a challenge in the post-genome era. Hence, there is a need for high-performance computing solutions for managing large genomic data. Therefore, it is of interest to describe a parallel-computing approach that uses a message-passing library to distribute the different compression stages in clusters. Genomic compression helps reduce the on-disk “footprint” of large volumes of sequence data, which supports a computational infrastructure for more efficient archiving. In this report, the approach was shown to be useful on 21 eukaryotic genomes using stratified sampling. The method achieves an average 6-fold reduction in disk space, with compression time three times better than that of COMRAD.
The source codes are written in C using message passing libraries and are available at https:// projects/ comradmpi/files / COMRADMPI/
PMCID: PMC4464544  PMID: 26124572
Genome compression; Sequence analysis; Parallel Computing; Big data storage; Genome Analysis
19.  ngs.plot: Quick mining and visualization of next-generation sequencing data by integrating genomic databases 
BMC Genomics  2014;15:284.
Understanding the relationship between the millions of functional DNA elements and their protein regulators, and how they work in conjunction to manifest diverse phenotypes, is key to advancing our understanding of the mammalian genome. Next-generation sequencing technology is now used widely to probe these protein-DNA interactions and to profile gene expression at a genome-wide scale. As the cost of DNA sequencing continues to fall, the interpretation of the ever increasing amount of data generated represents a considerable challenge.
We have developed ngs.plot – a standalone program to visualize enrichment patterns of DNA-interacting proteins at functionally important regions based on next-generation sequencing data. We demonstrate that ngs.plot is not only efficient but also scalable. We use a few examples to demonstrate that ngs.plot is easy to use and yet very powerful to generate figures that are publication ready.
We conclude that ngs.plot is a useful tool to help fill the gap between massive datasets and genomic information in this era of big sequencing data.
PMCID: PMC4028082  PMID: 24735413
Next-generation sequencing; Visualization; Epigenomics; Data mining; Genomic databases
20.  Metabolic fingerprinting of Arabidopsis thaliana accessions 
In the post-genomic era, much effort has been put into the discovery of gene function using functional genomics. Despite the advances achieved by these technologies in the understanding of gene function at the genomic and proteomic level, there is still a big genotype-phenotype gap. Metabolic profiling has been used to analyze organisms that have already been characterized genetically. However, only a small number of studies have compared the metabolic profiles of different tissues of distinct accessions. Here, we report the detection of over 14,000 and 17,000 features in inflorescences and leaves, respectively, in two widely used Arabidopsis thaliana accessions. A predictive Random Forest model was developed, which was able to reliably classify the tissue type and accession of samples based on their LC-MS profiles. We thereby demonstrate that the morphological differences among A. thaliana accessions are also reflected as distinct metabolic phenotypes within leaves and inflorescences.
PMCID: PMC4444734  PMID: 26074932
metabolic phenotyping; Arabidopsis; accessions; development; metabolites
21.  Fish Assemblages in Streams Subject to Anthropogenic Disturbances Along The Natchez Trace Parkway, Mississippi, USA 
A three-year study (July 2000 – June 2003) of fish assemblages was conducted in four tributaries of the Big Black River: Big Bywy, Little Bywy, Middle Bywy and McCurtain creeks that cross the Natchez Trace Parkway, Choctaw County, Mississippi, USA. Little Bywy and Middle Bywy creeks were within watersheds influenced by the lignite mining. Big Bywy and Middle Bywy creeks were historically impacted by channelisation. McCurtain Creek was chosen as a reference (control) stream. Fish were collected using a portable backpack electrofishing unit (Smith-Root Inc., Washington, USA). Insectivorous fish dominated all of the streams. There were no pronounced differences in relative abundances of fishes among the streams (P > 0.05) but fish assemblages fluctuated seasonally. Although there were some differences among streams with regard to individual species, channelisation and lignite mining had no discernable adverse effects on functional components of fish assemblages suggesting that fishes in these systems are euryceous fluvial generalist species adapted to the variable environments of small stream ecosystems.
PMCID: PMC3819055  PMID: 24575177
Fish; Mining; Channelisation
22.  A Big Data Guide to Understanding Climate Change: The Case for Theory-Guided Data Science 
Big Data  2014;2(3):155-163.
Global climate change and its impact on human life has become one of our era's greatest challenges. Despite the urgency, data science has had little impact on furthering our understanding of our planet in spite of the abundance of climate data. This is a stark contrast from other fields such as advertising or electronic commerce where big data has been a great success story. This discrepancy stems from the complex nature of climate data as well as the scientific questions climate science brings forth. This article introduces a data science audience to the challenges and opportunities to mine large climate datasets, with an emphasis on the nuanced difference between mining climate data and traditional big data approaches. We focus on data, methods, and application challenges that must be addressed in order for big data to fulfill their promise with regard to climate science applications. More importantly, we highlight research showing that solely relying on traditional big data techniques results in dubious findings, and we instead propose a theory-guided data science paradigm that uses scientific theory to constrain both the big data techniques as well as the results-interpretation process to extract accurate insight from large climate data.
PMCID: PMC4174912  PMID: 25276499
23.  Systems solutions by lactic acid bacteria: from paradigms to practice 
Microbial Cell Factories  2011;10(Suppl 1):S2.
Lactic acid bacteria are among the powerhouses of the food industry, colonize the surfaces of plants and animals, and contribute to our health and well-being. The genomic characterization of LAB has rocketed and presently over 100 complete or nearly complete genomes are available, many of which serve as scientific paradigms. Moreover, functional and comparative metagenomic studies are taking off and provide a wealth of insight in the activity of lactic acid bacteria used in a variety of applications, ranging from starters in complex fermentations to their marketing as probiotics. In this new era of high throughput analysis, biology has become big science. Hence, there is a need to systematically store the generated information, apply this in an intelligent way, and provide modalities for constructing self-learning systems that can be used for future improvements. This review addresses these systems solutions with a state of the art overview of the present paradigms that relate to the use of lactic acid bacteria in industrial applications. Moreover, an outlook is presented of the future developments that include the transition into practice as well as the use of lactic acid bacteria in synthetic biology and other next generation applications.
PMCID: PMC3231926  PMID: 21995776
24.  A biological treasure metagenome: pave a way for big science 
Indian Journal of Microbiology  2008;48(2):163-172.
Recent research trends, in which synthetic biology and white technology pursued through systems approaches based on “Omics technology” are recognized as the foundation of biotechnology, point to the coming of the ‘metagenome era’, which accesses the genomes of all microbes with the aim of understanding and industrially applying the whole of microbial resources. Remarkable advances in technologies for mining and analyzing metagenomes are enabling not only practical applications of the metagenome but also systems approaches at the mixed-genome level based on accumulated information. Against this background, the present review introduces the trends and methods of metagenome research and examines the big science that related resources may lead in the future.
PMCID: PMC3450180  PMID: 23100711
Metagenome; Gene mining; Novel metabolites; Systems approach; Biological treasure
25.  Molecular Signature of High Yield (Growth) Influenza A Virus Reassortants Prepared as Candidate Vaccine Seeds 
PLoS ONE  2013;8(6):e65955.
Human influenza virus isolates generally grow poorly in embryonated chicken eggs. Hence, gene reassortment of influenza A wild type (wt) viruses is performed with a highly egg adapted donor virus, A/Puerto Rico/8/1934 (PR8), to provide the high yield reassortant (HYR) viral ‘seeds’ for vaccine production. HYR must contain the hemagglutinin (HA) and neuraminidase (NA) genes of wt virus and one to six ‘internal’ genes from PR8. Most studies of influenza wt and HYRs have focused on the HA gene. The main objective of this study is the identification of the molecular signature in all eight gene segments of influenza A HYR candidate vaccine seeds associated with high growth in ovo.
The genomes of 14 wt parental viruses, 23 HYRs (5 H1N1; 2, 1976 H1N1-SOIV; 2, 2009 H1N1pdm; 2 H2N2 and 12 H3N2) and PR8 were sequenced using the high-throughput sequencing pipeline with big dye terminator chemistry.
Silent and coding mutations were found in all internal genes derived from PR8 with the exception of the M gene. The M gene derived from PR8 was invariant in all 23 HYRs underlining the critical role of PR8 M in high yield phenotype. None of the wt virus derived internal genes had any silent change(s) except the PB1 gene in X-157. The highest number of recurrent silent and coding mutations was found in NS. With respect to the surface antigens, the majority of HYRs had coding mutations in HA; only 2 HYRs had coding mutations in NA.
In the era of application of reverse genetics to alter influenza A virus genomes, the mutations identified in the HYR gene segments associated with high growth in ovo may be of great practical benefit to modify PR8 and/or wt virus gene sequences for improved growth of vaccine ‘seed’ viruses.
PMCID: PMC3679156  PMID: 23776579
