High-throughput sequencing of related individuals has become an important tool for studying human disease. However, owing to technical complexity and lack of available tools, most pedigree-based sequencing studies rely on an ad hoc combination of suboptimal analyses. Here we present pedigree-VAAST (pVAAST), a disease-gene identification tool designed for high-throughput sequence data in pedigrees. pVAAST uses a sequence-based model to perform variant and gene-based linkage analysis. Linkage information is then combined with functional prediction and rare variant case-control association information in a unified statistical framework. pVAAST outperformed linkage and rare-variant association tests in simulations and identified disease-causing genes from whole-genome sequence data in three human pedigrees with dominant, recessive and de novo inheritance patterns. The approach is robust to incomplete penetrance and locus heterogeneity and is applicable to a wide variety of genetic traits. pVAAST maintains high power across study designs ranging from monogenic, high-penetrance phenotypes in a single pedigree to highly polygenic, common phenotypes involving hundreds of pedigrees.
The prostate stromal mesenchyme controls organ-specific development. In cancer, the stromal compartment shows altered gene expression compared to non-cancer. The lineage relationship between cancer-associated stromal cells and normal tissue stromal cells is not known, nor is the cause underlying the expression difference. Previously, we used the embryonal carcinoma (EC) cell line NCCIT to study the stromal induction property. In the current study, stromal cells from non-cancer (NP) and cancer (CP) were isolated from tissue specimens and co-cultured with NCCIT cells in a trans-well format to preclude heterotypic cell contact. After 3 days, the stromal cells were analyzed by gene arrays for microRNA (miRNA) and mRNA expression. In co-culture, NCCIT cells were found to alter the miRNA and mRNA expression of NP stromal cells to one like that of CP stromal cells. In contrast, NCCIT had no significant effect on the gene expression of CP stromal cells. We conclude that the gene expression changes in stromal cells can be induced by diffusible factors synthesized by EC cells, and suggest that cancer-associated stromal cells represent a more primitive or less differentiated stromal cell type.
Phenotypic variation, including that which underlies health and disease in humans, results in part from multiple interactions among both genetic variation and environmental factors. While diseases or phenotypes caused by single gene variants can be identified by established association methods and family-based approaches, complex phenotypic traits resulting from multi-gene interactions remain very difficult to characterize. Here we describe a new method based on information theory, and demonstrate how it improves on previous approaches to identifying genetic interactions, including both synthetic and modifier kinds of interactions. We apply our measure, called interaction distance, to previously analyzed data sets of yeast sporulation efficiency, lipid-related mouse data and several human disease models to characterize the method. We show how the interaction distance can reveal novel gene interaction candidates in experimental and simulated data sets, and outperforms other measures in several circumstances. The method also allows us to optimize case/control sample composition for clinical studies.
Context dependence is central to the description of complexity. Keying on the pairwise definition of “set complexity,” we use an information theory approach to formulate general measures of systems complexity. We examine the properties of multivariable dependency starting with the concept of interaction information. We then present a new measure for unbiased detection of multivariable dependency, “differential interaction information.” This quantity for two variables reduces to the pairwise “set complexity” previously proposed as a context-dependent measure of information in biological systems. We generalize it here to an arbitrary number of variables. Critical limiting properties of the “differential interaction information” are key to the generalization. This measure extends previous ideas about biological information and provides a more sophisticated basis for the study of complexity. The properties of “differential interaction information” also suggest new approaches to data analysis. Given a data set of system measurements, differential interaction information can provide a measure of collective dependence, which can be represented in hypergraphs describing complex system interaction patterns. We investigate this kind of analysis using simulated data sets. The conjoining of a generalized set complexity measure, multivariable dependency analysis, and hypergraphs is our central result. While our focus is on complex biological systems, our results are applicable to any complex system.
complexity; entropy; gene network discovery; interaction information; multivariate dependency
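As a rough illustration of the starting point named in the abstract above, interaction information for a set of discrete variables can be written as an alternating sum of the joint entropies of all subsets. The sketch below is not the authors' code and does not implement "differential interaction information"; it assumes the co-information sign convention (other authors use the opposite sign), and the function names and the XOR toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2


def entropy(rows, cols):
    """Shannon entropy (bits) of the empirical joint distribution of the given columns."""
    counts = Counter(tuple(row[c] for c in cols) for row in rows)
    n = len(rows)
    return -sum((k / n) * log2(k / n) for k in counts.values())


def interaction_information(rows, cols):
    """Alternating sum of subset entropies (assumed co-information sign convention)."""
    total = 0.0
    for size in range(1, len(cols) + 1):
        sign = (-1) ** (size + 1)  # + for single variables, - for pairs, + for triples, ...
        for subset in combinations(cols, size):
            total += sign * entropy(rows, subset)
    return total


# Hypothetical toy data: Z = X XOR Y, with every (X, Y) combination equally likely.
xor_rows = [(x, y, x ^ y) for x in (0, 1) for y in (0, 1)]

pair = interaction_information(xor_rows, (0, 1))       # reduces to mutual information I(X;Y)
triple = interaction_information(xor_rows, (0, 1, 2))  # three-variable dependency
print(pair, triple)
```

In this toy case no pair of variables shares information, yet the three-variable quantity is nonzero, which is the kind of collective dependence that pairwise measures cannot detect.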
Dissecting the molecular basis of quantitative traits is a significant challenge and is essential for understanding complex diseases. Even in model organisms, precisely determining causative genes and their interactions has remained elusive, due in part to difficulty in narrowing intervals to single genes and in detecting epistasis or linked quantitative trait loci. These difficulties are exacerbated by limitations in experimental design, such as low numbers of analyzed individuals or of polymorphisms between parental genomes. We address these challenges by applying three independent high-throughput approaches for QTL mapping to map the genetic variants underlying 11 phenotypes in two genetically distant Saccharomyces cerevisiae strains, namely (1) individual analysis of >700 meiotic segregants, (2) bulk segregant analysis, and (3) reciprocal hemizygosity scanning, a new genome-wide method that we developed. We reveal differences in the performance of each approach and, by combining them, identify eight polymorphic genes that affect eight different phenotypes: colony shape, flocculation, growth on two nonfermentable carbon sources, and resistance to two drugs, salt, and high temperature. Our results demonstrate the power of individual segregant analysis to dissect QTL and address the underestimated contribution of interactions between variants. We also reveal confounding factors like mutations and aneuploidy in pooled approaches, providing valuable lessons for future designs of complex trait mapping studies.
QTL mapping; bulk segregant analysis; individual segregant analysis; next generation sequencing; yeast; reciprocal hemizygosity scanning
The current gold standard for diagnosis of hepatic fibrosis and cirrhosis is the traditional invasive liver biopsy. It is desirable to assess hepatic fibrosis with noninvasive means. Targeted proteomic techniques allow an unbiased assessment of proteins and might be useful to identify proteins related to hepatic fibrosis. We utilized Selected Reaction Monitoring (SRM) targeted proteomics combined with an organ-specific blood protein strategy to identify and quantify 38 liver-specific proteins. A combination of protein C and retinol binding protein 4 in serum gave promising preliminary results as candidate biomarkers to distinguish patients at different stages of hepatic fibrosis due to chronic infection with hepatitis C virus (HCV). Also, alpha-1-B glycoprotein, complement factor H and insulin-like growth factor binding protein acid labile subunit performed well in distinguishing patients from healthy controls.
hepatitis C; fibrosis; liver-specific blood biomarkers; quantitation; selected reaction monitoring
Biomolecular pathways and networks are dynamic and complex, and the perturbations to them which cause disease are often multiple, heterogeneous and contingent. Pathway and network visualizations, rendered on a computer or published on paper, however, tend to be static, lacking in detail, and ill-equipped to explore the variety and quantities of data available today, and the complex causes we seek to understand.
RCytoscape integrates R (an open-ended programming environment rich in statistical power and data-handling facilities) and Cytoscape (powerful network visualization and analysis software). RCytoscape extends Cytoscape's functionality beyond what is possible with the Cytoscape graphical user interface. To illustrate the power of RCytoscape, a portion of the Glioblastoma multiforme (GBM) data set from the Cancer Genome Atlas (TCGA) is examined. Network visualization reveals previously unreported patterns in the data suggesting heterogeneous signaling mechanisms active in GBM Proneural tumors, with possible clinical relevance.
Progress in bioinformatics and computational biology depends upon exploratory and confirmatory data analysis, upon inference, and upon modeling. These activities will eventually permit the prediction and control of complex biological systems. Network visualizations -- molecular maps -- created from an open-ended programming environment rich in statistical power and data-handling facilities, such as RCytoscape, will play an essential role in this progression.
Biological networks; Visualization; Exploratory data analysis; Statistical programming; Bioinformatics
Blood carries a wide array of biomolecules, including nutrients, hormones, and molecules that are secreted by cells for specific biological functions. The recent finding of stable RNA of both endogenous and exogenous origin in circulation opens a broad, new field: exploring the origins, functions, and applications of these extracellular RNA molecules. These findings raise many important questions, including: what are the mechanisms of export and cellular uptake, what is the nature and source of their stability, what molecules do they interact with in the blood, and what are the possible biological functions of the circulating RNA? This review summarizes some key recent developments in circulating RNA research and discusses some of the open questions in the field.
microRNA; exosomes; microvesicles; cell–cell communication; exogenous RNA
Patients with Type 1 Diabetes (T1D) are particularly vulnerable to development of Diabetic nephropathy (DN) leading to End Stage Renal Disease. Hence a better understanding of the factors affecting kidney disease progression in T1D is urgently needed. In recent years microRNAs have emerged as important post-transcriptional regulators of gene expression in many different health conditions. We hypothesized that urinary microRNA profile of patients will differ in the different stages of diabetic renal disease.
Methods and Findings
We studied urine microRNA profiles with qPCR in 40 patients with T1D and >20 years of follow-up: 10 who never developed renal disease (N) matched against 10 patients who went on to develop overt nephropathy (DN), and 10 patients with intermittent microalbuminuria (IMA) matched against 10 patients with persistent microalbuminuria (PMA). A Bayesian procedure was used to normalize and convert raw signals to expression ratios. We applied formal statistical techniques to translate fold changes to profiles of microRNA targets which were then used to make inferences about biological pathways in the Gene Ontology and REACTOME structured vocabularies. A total of 27 microRNAs were found to be present at significantly different levels in different stages of untreated nephropathy. These microRNAs mapped to overlapping pathways pertaining to growth factor signaling and renal fibrosis known to be targeted in diabetic kidney disease.
Urinary microRNA profiles differ across the different stages of diabetic nephropathy. Previous work using experimental, clinical chemistry or biopsy samples has demonstrated differential expression of many of these microRNAs in a variety of chronic renal conditions and diabetes. Combining expression ratios of microRNAs with formal inferences about their predicted mRNA targets and associated biological pathways may yield useful markers for early diagnosis and risk stratification of DN in T1D by inferring the alteration of renal molecular processes.
Human plasma has long been a rich source for biomarker discovery. It has recently become clear that plasma RNA molecules, such as microRNA, in addition to proteins are common and can serve as biomarkers. Surveying human plasma for microRNA biomarkers using next generation sequencing technology, we observed that a significant fraction of the circulating RNA appears to originate from exogenous species. With careful analysis of sequence error statistics and other controls, we demonstrated that there is a wide range of RNA from many different organisms, including bacteria and fungi as well as from other species. These RNAs may be associated with protein, lipid or other molecules protecting them from RNase activity in plasma. Some of these RNAs are detected in intracellular complexes and may be able to influence cellular activities under in vitro conditions. These findings raise the possibility that plasma RNAs of exogenous origin may serve as signaling molecules mediating for example the human-microbiome interaction and may affect and/or indicate the state of human health.
We describe some new conceptual tools for the rigorous, mathematical description of the “set-complexity” of graphs. This set-complexity has been shown previously to be a useful measure for analyzing some biological networks, and in discussing biological information in a quantitative fashion. The advances described here allow us to define some significant relationships between the set-complexity measure and the structure of graphs, and of their component sub-graphs. We show here that modular graph structures tend to maximize the set-complexity of graphs. We point out the relationship between modularity and redundancy, and discuss the significance of set-complexity in this regard. We specifically discuss the relationship between complexity and entropy in the case of complete-bipartite graphs, and present a new method for constructing highly complex, binary graphs. These results can be extended to the case of ternary graphs, and to other multi-edge graphs, which are fundamentally more relevant to biological structures and systems. Finally, our results lead us to an approach for extracting high complexity modular graphs from large, noisy graphs with low information content. We illustrate this approach with two examples.
Set-complexity; Biological networks; Modularity; Modular graphs; Bipartite graphs; Multi-partite graphs
MicroRNAs (miRNAs) are small, non-coding RNAs that regulate various biological processes, primarily through interaction with messenger RNAs. The levels of specific, circulating miRNAs in blood have been shown to associate with various pathological conditions including cancers. These miRNAs have great potential as biomarkers for various pathophysiological conditions. In this study we focused on different sample types’ effects on the spectrum of circulating miRNA in blood. Using serum and corresponding plasma samples from the same individuals, we observed higher miRNA concentrations in serum samples compared to the corresponding plasma samples. The difference between serum and plasma miRNA concentration showed some associations with miRNA from platelets, which may indicate that the coagulation process may affect the spectrum of extracellular miRNA in blood. Several miRNAs also showed platform dependent variations in measurements. Our results suggest that there are a number of factors that might affect the measurement of circulating miRNA concentration. Caution must be taken when comparing miRNA data generated from different sample types or measurement platforms.
RNA-seq, a transcriptomic profiling method based on next-generation sequencing (NGS) technologies, has been widely used to study global gene expression, alternative exon usage, new exon discovery, novel transcriptional isoforms and genomic sequence variations. However, this technique also poses many biological and informatics challenges to extracting meaningful biological information. RNA-seq data analysis is built on the foundation of high quality initial genome localization and alignment information for RNA-seq sequences. Toward this goal, we have developed RNASEQR to accurately and effectively map millions of RNA-seq sequences. We have systematically compared RNASEQR with four of the most widely used tools using a simulated data set created from the Consensus CDS project and two experimental RNA-seq data sets generated from a human glioblastoma patient. Our results showed that RNASEQR yields more accurate estimates for gene expression, complete gene structures and new transcript isoforms, as well as more accurate detection of single nucleotide variants (SNVs). RNASEQR analyzes raw data from RNA-seq experiments effectively and outputs results in a manner that is compatible with a wide variety of specialized downstream analyses on desktop computers.
MicroRNAs (miRNAs) have been linked with various regulatory functions and disorders, such as cancers and heart diseases. They therefore present an important target for detection technologies for future medical diagnostics. We report here a novel method for rapid and sensitive miRNA detection and quantitation using surface plasmon resonance (SPR) sensor technology and a DNA*RNA antibody-based assay. The approach takes advantage of a novel high-performance portable SPR sensor instrument for spectroscopy of surface plasmons based on a special diffraction grating called a surface plasmon coupler and disperser (SPRCD). The surface of the grating is functionalized with thiolated DNA oligonucleotides which specifically capture miRNA from a liquid sample without amplification. Subsequently, an antibody that recognizes DNA*RNA hybrids is introduced to bind to the DNA*RNA complex and enhance sensor response to the captured miRNA. This approach allows detecting miRNA in less than 30 minutes at concentrations down to 2 pM with an absolute amount at high attomoles. The methodology is evaluated for analysis of miRNA from mouse liver tissues and is found to yield results which agree well with those provided by the quantitative polymerase chain reaction (qPCR).
microRNA; surface plasmon resonance; biosensor; liver toxicity; cancer diagnostics
MicroRNAs (miRNAs) are a recently discovered class of small, non-coding RNAs that regulate protein levels post-transcriptionally. miRNAs play important regulatory roles in many cellular processes, including differentiation, neoplastic transformation, and cell replication and regeneration. Because of these regulatory roles, it is not surprising that aberrant miRNA expression has been implicated in several diseases. Recent studies have reported significant levels of miRNAs in serum and other body fluids, raising the possibility that circulating miRNAs could serve as useful clinical biomarkers. Here, we provide a brief overview of miRNA biogenesis and function, the identification and potential roles of circulating extracellular miRNAs, and the prospective uses of miRNAs as clinical biomarkers. Finally, we address several issues associated with the accurate measurement of miRNAs from biological samples.
An endogenous molecular-cellular network for both normal and abnormal functions is assumed to exist. This endogenous network forms a nonlinear stochastic dynamical system, with many stable attractors in its functional landscape. Normal or abnormal robust states can be decided by this network in a manner similar to the neural network. In this context cancer is hypothesized as one of its robust intrinsic states.
This hypothesis implies that a nonlinear stochastic mathematical cancer model is constructible based on available experimental data and that its quantitative predictions are directly testable. Within such a model, the genesis and progression of cancer may be viewed as stochastic transitions between different attractors. Thus it further suggests that progressions are not arbitrary. Other important issues in cancer, such as genetics vs. epigenetics, the double-edge effect, and dormancy, are discussed in the light of the present hypothesis. A different set of strategies for cancer prevention, cure, and care is therefore suggested.
The ability to construct biologically meaningful gene networks and modules is critical for contemporary systems biology. Though recent studies have demonstrated the power of using gene modules to shed light on the functioning of complex biological systems, most modules in these networks have shown little association with meaningful biological function. We have devised a method which directly incorporates gene ontology (GO) annotation in construction of gene modules in order to gain better functional association.
We have devised a method, Semantic Similarity-Integrated approach for Modularization (SSIM) that integrates various gene-gene pairwise similarity values, including information obtained from gene expression, protein-protein interactions and GO annotations, in the construction of modules using affinity propagation clustering. We demonstrated the performance of the proposed method using data from two complex biological responses: 1. the osmotic shock response in Saccharomyces cerevisiae, and 2. the prion-induced pathogenic mouse model. In comparison with two previously reported algorithms, modules identified by SSIM showed significantly stronger association with biological functions.
The incorporation of semantic similarity based on GO annotation with gene expression and protein-protein interaction data can greatly enhance the functional relevance of inferred gene modules. In addition, the SSIM approach can also reveal the hierarchical structure of gene modules to gain a broader functional view of the biological system. Hence, the proposed method can facilitate comprehensive and in-depth analysis of high throughput experimental data at the gene network level.
Chronic lung diseases are the third leading cause of death in the United States due in part to an incomplete understanding of pathways that govern the progressive tissue remodeling that occurs in these disorders. Adenosine is elevated in the lungs of animal models and humans with chronic lung disease where it promotes air-space destruction and fibrosis. Adenosine signaling increases the production of the pro-fibrotic cytokine interleukin-6 (IL-6). Based on these observations, we hypothesized that IL-6 signaling contributes to tissue destruction and remodeling in a model of chronic lung disease where adenosine levels are elevated.
We tested this hypothesis by neutralizing or genetically removing IL-6 in adenosine deaminase (ADA)-deficient mice that develop adenosine dependent pulmonary inflammation and remodeling. Results demonstrated that both pharmacologic blockade and genetic removal of IL-6 attenuated pulmonary inflammation, remodeling and fibrosis in this model. The pursuit of mechanisms involved revealed adenosine and IL-6 dependent activation of STAT-3 in airway epithelial cells.
These findings demonstrate that adenosine enhances IL-6 signaling pathways to promote aspects of chronic lung disease. This suggests that blocking IL-6 signaling during chronic stages of disease may provide benefit in halting remodeling processes such as fibrosis and air-space destruction.
We propose an innovative, integrated, cost-effective health system to combat major non-communicable diseases (NCDs), including cardiovascular, chronic respiratory, metabolic, rheumatologic and neurologic disorders and cancers, which together are the predominant health problem of the 21st century. This proposed holistic strategy involves comprehensive patient-centered integrated care and multi-scale, multi-modal and multi-level systems approaches to tackle NCDs as a common group of diseases. Rather than studying each disease individually, it will take into account their intertwined gene-environment, socio-economic interactions and co-morbidities that lead to individual-specific complex phenotypes. It will implement a road map for predictive, preventive, personalized and participatory (P4) medicine based on a robust and extensive knowledge management infrastructure that contains individual patient information. It will be supported by strategic partnerships involving all stakeholders, including general practitioners associated with patient-centered care. This systems medicine strategy, which will take a holistic approach to disease, is designed to allow the results to be used globally, taking into account the needs and specificities of local economies and health systems.
The antineoplastic drug bleomycin leads to the side effect of pulmonary fibrosis in both humans and mice. We challenged genetically diverse inbred lines of mice from the Collaborative Cross with bleomycin to determine the heritability of this phenotype. Sibling pairs of mice from 40 lines were treated with bleomycin. Lung disease was assessed by scoring lung pathology and by measuring soluble collagen levels in lavage fluid. Serum micro ribonucleic acids (miRNAs) were also measured. Inbred sibling pairs of animals demonstrated high coinheritance of the phenotypes of disease susceptibility or disease resistance. The plasma levels of one miRNA were clearly correlated in sibling mice. The results showed that, as in humans, the lines that comprise the Collaborative Cross exhibited wide genetic variation in response to this drug. This finding suggests that the genetically diverse Collaborative Cross animals may reveal drug effects that might be missed if a study were based on a conventional mouse strain.
collaborative cross; drug side effects; genetic diversity; disease susceptibility; disease resistance; bleomycin; lung disease
We analyzed the whole genome sequences of a family of four, consisting of two siblings and their parents. Family-based sequencing allowed us to delineate recombination sites precisely, identify 70% of the sequencing errors, and identify very rare SNVs. We also directly estimated a human intergeneration mutation rate of ∼1.1×10⁻⁸ per position per haploid genome. Both offspring in this family have two recessive disorders--Miller syndrome, for which the gene was concurrently identified, and primary ciliary dyskinesia, for which causative genes have been previously identified. Family-based genome analysis enabled us to narrow the candidate genes for both of these Mendelian disorders to only four. Our results demonstrate the unique value of complete genome sequencing in families.
whole genome sequencing; rare genetic disease; inheritance analysis; recessive models; de novo mutations; recombination hotspot; crossover; haploidentity; haploidentical block; inheritance state; inheritance vector; HMM; haplotype; Miller syndrome; POADS; DHODH; DNAH5; KIAA0556; CES1
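The unit "per position per haploid genome" in the abstract above can be made concrete with a small arithmetic sketch. This is a simplified reading under stated assumptions, not the authors' estimation procedure: it assumes the rate is the number of confirmed de novo mutations divided by the number of haploid genome positions in which a new mutation could have been observed. The function name and all input values are hypothetical placeholders, not figures from the study.

```python
def per_site_mutation_rate(n_de_novo, callable_positions, n_offspring):
    """De novo mutations per position per haploid genome per generation (simplified)."""
    # Each offspring carries two haploid genomes, one transmitted by each parent.
    haploid_transmissions = 2 * n_offspring
    return n_de_novo / (callable_positions * haploid_transmissions)


# Placeholder inputs for illustration only; these are NOT the study's counts.
rate = per_site_mutation_rate(
    n_de_novo=44,
    callable_positions=1_000_000_000,
    n_offspring=2,
)
print(f"{rate:.2e}")
```

A real estimate would also need to correct for false-positive and false-negative calls, which this sketch deliberately leaves out.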
The molecular pathways involved in the interstitial lung diseases (ILDs) are poorly understood. Systems biology approaches, with global expression data sets, were used to identify perturbed gene networks, to gain some understanding of the underlying mechanisms, and to develop specific hypotheses relevant to these chronic lung diseases.
Lung tissue samples from patients with different types of ILD were obtained from the Lung Tissue Research Consortium and total cell RNA was isolated. Global mRNA and microRNA were profiled by hybridization and amplification-based methods. Differentially expressed genes were compiled and used to identify critical signaling pathways and potential biomarkers. Modules of genes were identified that formed a regulatory network, and studies were performed on cultured cells in vitro for comparison with the in vivo results.
By profiling mRNA and microRNA (miRNA) expression levels, we found subsets of differentially expressed genes that distinguished patients with ILDs from controls and that correlated with different disease stages and subtypes of ILDs. Network analysis, based on pathway databases, revealed several disease-associated gene modules, involving genes from the TGF-β, Wnt, focal adhesion, and smooth muscle actin pathways that are implicated in advancing fibrosis, a critical pathological process in ILDs. A more comprehensive approach was also adopted to construct a putative global gene regulatory network based on the perturbation of key regulatory elements, transcription factors and microRNAs. Our data underscore the importance of TGF-β signaling and the persistence of smooth muscle actin-containing fibroblasts in these diseases. We present evidence that, downstream of TGF-β signaling, microRNAs of the miR-23a cluster and the transcription factor Zeb1 could have roles in mediating an epithelial to mesenchymal transition (EMT) and the resultant persistence of mesenchymal cells in these diseases.
We present a comprehensive overview of the molecular networks perturbed in ILDs, discuss several potential key molecular regulatory circuits, and identify microRNA species that may play central roles in facilitating the progression of ILDs. These findings advance our understanding of these diseases at the molecular level, provide new molecular signatures in defining the specific characteristics of the diseases, suggest new hypotheses, and reveal new potential targets for therapeutic intervention.
Systems biology is an approach that views biology as an information science and studies biological systems as a whole, including their interactions with the environment. This approach, for the reasons described here, has particular power in the search for informative diagnostic biomarkers of diseases because it focuses on the fundamental causes and keys on the identification and understanding of disease-perturbed molecular networks. In this review, we describe some recent developments that have used systems biology to address complex diseases (prion disease and drug-induced liver injury) and use these as examples to illustrate the importance of understanding network structure and dynamics. The knowledge of network dynamics through in vitro experimental perturbation and modeling allows us to determine the state of the networks, to identify molecular correlates, and to derive new disease treatment approaches to reverse the pathology or prevent its progress into a more severe state through the manipulation of network states. This general approach, including diagnostics and therapeutics, is becoming known as systems medicine.
Systems biology; biomarkers; systems medicine; prion disease; drug induced liver injury; microRNA; organ-specific proteins
Complex, non-additive genetic interactions are common and can be critical in determining phenotypes. Genome-wide association studies (GWAS) and similar statistical studies of linkage data, however, assume additive models of gene interactions in looking for genotype-phenotype associations. These statistical methods view the compound effects of multiple genes on a phenotype as a sum of influences of each gene and often miss a substantial part of the heritable effect. Such methods do not use any biological knowledge about underlying mechanisms. Modeling approaches from the artificial intelligence (AI) field that incorporate deterministic knowledge into models to perform statistical analysis can be applied to include prior knowledge in genetic analysis. We chose to use the most general such approach, Markov Logic Networks (MLNs), for combining deterministic knowledge with statistical analysis. Using simple, logistic regression-type MLNs we can replicate the results of traditional statistical methods, but we also show that we are able to go beyond finding independent markers linked to a phenotype by using joint inference without an independence assumption. The method is applied to genetic data on yeast sporulation, a complex phenotype with gene interactions. In addition to detecting all of the previously identified loci associated with sporulation, our method identifies four loci with smaller effects. Since their effect on sporulation is small, these four loci were not detected with methods that do not account for dependence between markers due to gene interactions. We show how gene interactions can be detected using more complex models, which can be used as a general framework for incorporating systems biology with genetics.
genetic interactions; genetic networks; genome-wide analysis; prior biological knowledge; probabilistic; logic-based modeling
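The point that additive, marker-by-marker tests can miss interacting loci can be illustrated with a toy calculation. The sketch below is not a Markov Logic Network and does not use the yeast sporulation data; it assumes a hypothetical, noise-free two-locus phenotype that equals 1 only when the two alleles differ, and all names and sample sizes are made up for illustration.

```python
from itertools import product
from statistics import mean

# Hypothetical toy data: 25 individuals per two-locus genotype,
# phenotype is 1.0 only when the alleles at the two loci differ.
samples = [
    (a, b, float(a != b))
    for a, b in product((0, 1), repeat=2)
    for _ in range(25)
]


def marginal_effect(samples, locus):
    """Difference in mean phenotype between the two alleles at a single locus."""
    by_allele = {0: [], 1: []}
    for s in samples:
        by_allele[s[locus]].append(s[2])
    return mean(by_allele[1]) - mean(by_allele[0])


def joint_means(samples):
    """Mean phenotype for every two-locus genotype."""
    groups = {}
    for a, b, y in samples:
        groups.setdefault((a, b), []).append(y)
    return {genotype: mean(ys) for genotype, ys in groups.items()}


effects = [marginal_effect(samples, 0), marginal_effect(samples, 1)]
table = joint_means(samples)
print(effects)  # single-marker view
print(table)    # joint two-locus view
```

Each locus considered alone shows no effect on the phenotype, while the joint genotype table determines it completely, which is why joint inference without an independence assumption can recover loci that single-marker analysis overlooks.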