1.  Data recovery and integration from public databases uncovers transformation-specific transcriptional downregulation of cAMP-PKA pathway-encoding genes 
BMC Bioinformatics  2009;10(Suppl 12):S1.
The integration of data from multiple genome-wide assays is essential for understanding dynamic spatio-temporal interactions within cells. Such integration, which leads to a more complete view of cellular processes, offers the opportunity to make better use of the large amount of "omics" data freely available in several public databases.
In particular, integration of microarray-derived transcriptome data with other high-throughput analyses (genomic and mutational analysis, promoter analysis) may allow us to unravel transcriptional regulatory networks under a variety of physio-pathological situations, such as the alteration in the cross-talk between signal transduction pathways in transformed cells.
Here we sequentially apply web-based and statistical tools to a case study: the role of oncogenic activation of different signal transduction pathways in the transcriptional regulation of genes encoding proteins involved in the cAMP-PKA pathway. To this end, we first re-analyzed available genome-wide expression data for genes encoding proteins of the downstream branch of the PKA pathway in normal tissues and human tumor cell lines. Then, in order to identify mutation-dependent transcriptional signatures, we classified cancer cells as a function of their mutational state. The results of this procedure were used as a starting point to analyze the promoters of PKA pathway-encoding genes, leading to the identification of specific combinations of transcription factor binding sites that are consistent with available experimental data and help to clarify the relationship between gene expression, transcription factors and oncogenes in our case study.
Genome-wide, large-scale "omics" experimental technologies give different, complementary perspectives on the structure and regulatory properties of complex systems. Even the relatively simple, integrated workflow presented here offers opportunities not only for filtering the noise intrinsic to high-throughput data, but also for progressively extracting novel information that would otherwise have remained hidden. In fact, we were able to detect a strong transcriptional repression of genes encoding proteins of the cAMP/PKA pathway in cancer cells of different genetic origins. The basic workflow presented herein may be easily extended by incorporating other tools, and can be applied even by researchers with limited bioinformatics skills.
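The mutation-state classification step described above can be illustrated with a minimal sketch: group expression values for a pathway gene by oncogene mutation status and compute a log2 difference against wild-type cells. All gene names, group labels and values below are hypothetical, not data from the study.

```python
from statistics import mean

# Hypothetical log2 expression values for a cAMP-PKA pathway gene across
# cell lines grouped by oncogene mutation status (illustrative data only).
expression = {
    "RAS_mutant":  [5.1, 4.8, 5.0, 4.7],
    "BRAF_mutant": [5.3, 5.0, 4.9],
    "wild_type":   [7.9, 8.2, 8.0, 8.1],
}

def repression_score(mutant, control):
    """Mean log2 difference; negative values indicate downregulation."""
    return mean(mutant) - mean(control)

for group in ("RAS_mutant", "BRAF_mutant"):
    score = repression_score(expression[group], expression["wild_type"])
    print(f"{group}: log2 fold change = {score:+.2f}")
```

A real analysis would of course add a statistical test and multiple-testing correction across all pathway genes; this only shows the grouping logic.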
PMCID: PMC2762058  PMID: 19828069
2.  Joint analysis of transcriptional and post-transcriptional brain tumor data: searching for emergent properties of cellular systems
BMC Bioinformatics  2011;12:86.
Advances in biotechnology offer a fast-growing variety of high-throughput data for screening molecular activities at the genomic, transcriptional, post-transcriptional and translational levels. However, to date, most computational and algorithmic efforts have been directed at mining data from each of these molecular levels (genomic, transcriptional, etc.) separately. In view of the rapid advances in technology (next-generation sequencing, high-throughput proteomics), it is important to address the problem of analyzing these data as a whole, i.e. preserving the emergent properties that appear in the cellular system when all molecular levels are interacting. We analyzed one of the (currently) few datasets that provide both transcriptional and post-transcriptional data from the same samples to investigate the possibility of extracting more information through a joint analysis approach.
We use Factor Analysis coupled with pre-established knowledge as a theoretical base to achieve this goal. Our intention is to identify structures that contain information from both mRNAs and miRNAs, and that can explain the complexity of the data. Despite the small sample available, we can show that this approach permits identification of meaningful structures, in particular two polycistronic miRNA genes related to transcriptional activity and likely to be relevant in the discrimination between gliosarcomas and other brain tumors.
This suggests the need to develop methodologies to simultaneously mine information from different levels of biological organization, rather than linking separate analyses performed in parallel.
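The joint-analysis idea can be sketched minimally: stack mRNA and miRNA measurements into a single matrix and extract the leading factor, whose loadings may span both layers. The toy data below are invented, and plain power iteration stands in for the authors' actual Factor Analysis model with prior knowledge.

```python
# Toy joint data: rows = features (3 mRNAs then 2 miRNAs), cols = samples.
# A shared "factor" drives mRNA A, mRNA B and miRNA X (illustrative values).
X = [
    [2.0, 1.9, -2.1, -1.8],   # mRNA A
    [1.8, 2.1, -1.9, -2.0],   # mRNA B
    [0.1, -0.2, 0.0, 0.1],    # mRNA C (noise)
    [2.2, 1.7, -2.0, -1.9],   # miRNA X (co-varies with mRNAs A, B)
    [0.0, 0.1, -0.1, 0.0],    # miRNA Y (noise)
]

def leading_factor(X, iters=100):
    """Power iteration on X X^T to approximate the top factor loadings."""
    n, cols = len(X), len(X[0])
    v = [1.0] * n
    for _ in range(iters):
        # w = (X X^T) v, computed as X (X^T v)
        s = [sum(X[i][j] * v[i] for i in range(n)) for j in range(cols)]
        w = [sum(X[i][j] * s[j] for j in range(cols)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

loadings = leading_factor(X)
```

The leading factor loads strongly on features 0, 1 and 3, i.e. on two mRNAs and one miRNA together, which is the sense in which a single structure "contains information from both layers".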
PMCID: PMC3078861  PMID: 21450054
3.  Multilevel omic data integration in cancer cell lines: advanced annotation and emergent properties 
BMC Systems Biology  2013;7:14.
High-throughput (omic) data have become more widespread in both quantity and frequency of use, thanks to technological advances, lower costs and higher precision. Consequently, computational scientists are confronted by two parallel challenges. On one side is the design of efficient methods to interpret each of these data types in its own right (gene expression signatures, protein markers, etc.). On the other side is a novel, pressing request from the biological field: to design methodologies that allow these data to be interpreted as a whole, i.e. not only as the union of relevant molecules in each layer, but as a complex molecular signature containing proteins, mRNAs and miRNAs, all directly associated in the results of analyses able to capture inter-layer connections and complexity.
We address the latter of these two challenges by testing an integrated approach on a known cancer benchmark: the NCI-60 cell panel. Here, high-throughput screens for mRNA, miRNA and proteins are jointly analyzed using factor analysis, combined with linear discriminant analysis, to identify the molecular characteristics of cancer. Comparisons with separate (non-joint) analyses show that the proposed integrated approach can uncover deeper and more precise biological information. In particular, the integrated approach gives a more complete picture of the set of miRNAs identified and the Wnt pathway, which represents an important surrogate marker of melanoma progression. We further test the approach on a more challenging patient-dataset, for which we are able to identify clinically relevant markers.
The integration of multiple layers of omics can bring more information than analysis of single layers alone. Using and expanding the proposed integrated framework to integrate omic data from other molecular levels will allow researchers to uncover further systemic information. The application of this approach to a clinically challenging dataset shows its promising potential.
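The discriminant step named in the abstract can be illustrated with a two-class Fisher linear discriminant over two features (e.g. scores on two integrated factors). The data and class labels below are invented for illustration; this is textbook LDA, not the authors' exact pipeline.

```python
# 2-D toy feature vectors (e.g. two factor scores) for two tumor classes.
class_a = [(1.0, 2.1), (1.2, 1.9), (0.8, 2.0)]
class_b = [(3.0, 0.9), (3.2, 1.1), (2.9, 1.0)]

def lda_direction(a, b):
    """Fisher discriminant w = S_w^-1 (mean_a - mean_b) for 2-D data."""
    ma = [sum(p[i] for p in a) / len(a) for i in (0, 1)]
    mb = [sum(p[i] for p in b) / len(b) for i in (0, 1)]
    # Pooled within-class scatter matrix S_w.
    s = [[0.0, 0.0], [0.0, 0.0]]
    for pts, m in ((a, ma), (b, mb)):
        for p in pts:
            d = [p[0] - m[0], p[1] - m[1]]
            for i in (0, 1):
                for j in (0, 1):
                    s[i][j] += d[i] * d[j]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    dm = [ma[0] - mb[0], ma[1] - mb[1]]
    return [inv[0][0] * dm[0] + inv[0][1] * dm[1],
            inv[1][0] * dm[0] + inv[1][1] * dm[1]]

w = lda_direction(class_a, class_b)

def project(p, w):
    """Score a sample by projecting it onto the discriminant direction."""
    return p[0] * w[0] + p[1] * w[1]
```

Projections of the two classes onto w separate cleanly, which is all a linear discriminant promises; a threshold on the projection then classifies new samples.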
PMCID: PMC3610285  PMID: 23418673
Multi-omic; Emergent property; Factor analysis; Linear discriminant analysis; NCI-60 cell panel
4.  Computational reconstruction of tissue-specific metabolic models: application to human liver metabolism 
The first computational approach for the rapid generation of genome-scale tissue-specific models from a generic species model. A genome-scale model of human liver metabolism, comprehensively tested and validated using cross-validation and the ability to carry out complex hepatic metabolic functions. The model's flux predictions are shown to correlate with flux measurements across a variety of hormonal and dietary conditions, and are successfully used to predict biomarker changes in genetic metabolic disorders, in both cases with higher accuracy than the generic human model.
The study of normal human metabolism and its alterations is central to the understanding and treatment of a variety of human diseases, including diabetes, metabolic syndrome, neurodegenerative disorders, and cancer. A promising systems biology approach for studying human metabolism is the development and analysis of large-scale stoichiometric network models of human metabolism. The reconstruction of these network models has followed two main paths: the first is the reconstruction of generic (non-tissue specific) models, characterizing the complete metabolic potential of human cells, based mostly on genomic data to trace enzyme-coding genes (Duarte et al, 2007; Ma et al, 2007); the second is the reconstruction of cell type- and tissue-specific models (Wiback and Palsson, 2002; Chatziioannou et al, 2003; Vo et al, 2004), based on a similar methodology, with the extra complexity of manually curating literature evidence for the cell/tissue specificity of metabolic enzymes and pathways.
Against this background, we present in this study, to the best of our knowledge, the first computational approach for the rapid generation of genome-scale tissue-specific models. The method relies on integrating the previously reconstructed generic human models with a variety of high-throughput molecular 'omics' data, including transcriptomic, proteomic, metabolomic, and phenotypic data, as well as literature-based knowledge characterizing the tissue at hand (Figure 1). Hence, it can be readily used to rapidly build a large array of human tissue-specific models. The resulting model satisfies stoichiometric, mass-balance, and thermodynamic constraints. It serves as a functional metabolic network that can be used to explore the metabolic state of a tissue under various genetic and physiological conditions, simulating enzymatic inhibition or drug applications through standard constraint-based modeling methods, without requiring additional context-specific molecular data.
We applied this approach to build a genome-scale model of liver metabolism, which was then comprehensively tested and validated. The model is shown to be able to simulate complex hepatic metabolic functions, as well as to depict the pathological alterations caused by urea cycle deficiencies. The liver model was applied to predict measured intra-cellular metabolic fluxes given measured metabolite uptake and secretion rates under different hepatic metabolic conditions. The predictions were tested against a comprehensive set of flux measurements performed by Chan et al (2003), showing that the liver model yielded more accurate predictions than the original, generic human model (an overall prediction accuracy of 0.67 versus 0.46). Furthermore, it was applied to identify metabolic biomarkers for inborn errors of liver metabolism, once again displaying superiority over the predictions generated by the generic human model (accuracy of 0.67 versus 0.59).
From a biotechnological standpoint, the liver model generated here can serve as a basis for future studies aiming to optimize the functioning of bioartificial liver devices. The application of the method to rapidly construct metabolic models of other human tissues can lead to many other important clinical insights, e.g., concerning means for metabolic salvage of ischemic heart and brain tissues. Last but not least, the application of the new method is not limited to the realm of human modeling; it can be used to generate tissue models for any multi-tissue organism for which a generic model exists, such as Mus musculus (Quek and Nielsen, 2008; Sheikh et al, 2005) and the model plant Arabidopsis thaliana (Poolman et al, 2009).
The computational study of human metabolism has been advanced with the advent of the first generic (non-tissue specific) stoichiometric model of human metabolism. In this study, we present a new algorithm for rapid reconstruction of tissue-specific genome-scale models of human metabolism. The algorithm generates a tissue-specific model from the generic human model by integrating a variety of tissue-specific molecular data sources, including literature-based knowledge, transcriptomic, proteomic, metabolomic and phenotypic data. Applying the algorithm, we constructed the first genome-scale stoichiometric model of hepatic metabolism. The model is verified using standard cross-validation procedures, and through its ability to carry out hepatic metabolic functions. The model's flux predictions correlate with flux measurements across a variety of hormonal and dietary conditions, and improve upon the predictive performance obtained using the original, generic human model (prediction accuracy of 0.67 versus 0.46). Finally, the model better predicts biomarker changes in genetic metabolic disorders than the generic human model (accuracy of 0.67 versus 0.59). The approach presented can be used to construct other human tissue-specific models, and be applied to other organisms.
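The core integration step can be pictured with a deliberately simplified rule: retain a generic-model reaction when at least one of its associated genes has tissue-specific evidence. This pruning rule is our assumption for illustration, not the paper's actual algorithm, and all reaction and gene names below are hypothetical.

```python
# Hypothetical slice of a generic human model: reaction id -> associated genes.
generic_model = {
    "R_urea_cycle_1": {"genes": {"OTC"}},
    "R_glycolysis_1": {"genes": {"HK1", "HK2"}},
    "R_muscle_only":  {"genes": {"MYH7"}},
}

# Tissue evidence: genes with detected transcript/protein in liver (invented).
liver_evidence = {"OTC", "HK1"}

def tissue_model(generic, evidence):
    """Keep a reaction if any of its associated genes has tissue evidence."""
    return {rid: rxn for rid, rxn in generic.items()
            if rxn["genes"] & evidence}

liver = tissue_model(generic_model, liver_evidence)
```

The real method additionally enforces stoichiometric, mass-balance and thermodynamic consistency on the pruned network, which a set-intersection sketch cannot capture.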
PMCID: PMC2964116  PMID: 20823844
constraint based; hepatic; liver; metabolism
5.  De-Convoluting the “Omics” for Organ Transplantation 
Purpose of review
The desire for biomarkers for diagnosis and prognosis of diseases has never been greater. With the availability of genome data and an increased availability of proteome data, the discovery of biomarkers has become increasingly feasible. This article reviews some recent applications of the many evolving “omic” technologies to organ transplantation.
Recent findings
With the advancement of many high-throughput “omic” techniques such as genomics, metabolomics, antibiomics, peptidomics and proteomics, efforts have been made to understand potential mechanisms of specific graft injuries and to develop novel biomarkers for acute rejection, chronic rejection, and operational tolerance.
The translation of potential biomarkers from the lab bench to the clinical bedside is not an easy task and will require the concerted effort of the immunologists, molecular biologists, transplantation specialists, geneticists, and experts in bioinformatics. Rigorous prospective validation studies will be needed using large sets of independent patient samples. The appropriate and timely exploitation of evolving “omic” technologies will lay the cornerstone for a new age of translational research for organ transplant monitoring.
PMCID: PMC2993238  PMID: 19644370
genomics; proteomics; organ transplant; biomarker; translational medicine
6.  Metabolic network reconstruction of Chlamydomonas offers insight into light-driven algal metabolism 
A comprehensive genome-scale metabolic network of Chlamydomonas reinhardtii, including a detailed account of light-driven metabolism, is reconstructed and validated. The model provides a new resource for research of C. reinhardtii metabolism and in algal biotechnology.
The genome-scale metabolic network of Chlamydomonas reinhardtii (iRC1080) was reconstructed, accounting for >32% of the estimated metabolic genes encoded in the genome, and including extensive details of lipid metabolic pathways. This is the first metabolic network to explicitly account for stoichiometry and wavelengths of metabolic photon usage, providing a new resource for research of C. reinhardtii metabolism and developments in algal biotechnology. Metabolic functional annotation and the largest transcript verification of a metabolic network to date were performed, at least partially verifying >90% of the transcripts accounted for in iRC1080. Analysis of the network supports hypotheses concerning the evolution of latent lipid pathways in C. reinhardtii, including very long-chain polyunsaturated fatty acid and ceramide synthesis pathways. A novel approach for modeling light-driven metabolism was developed that accounts for both light source intensity and spectral quality of emitted light. The constructs resulting from this approach, termed prism reactions, were shown to significantly improve the accuracy of model predictions, and their use was demonstrated for evaluation of light source efficiency and design.
Algae have garnered significant interest in recent years, especially for their potential application in biofuel production. The model eukaryotic microalga Chlamydomonas reinhardtii has been widely used to study photosynthesis, cell motility and phototaxis, cell wall biogenesis, and other fundamental cellular processes (Harris, 2001). Characterizing algal metabolism is key to engineering production strains and understanding photobiological phenomena. Based on extensive literature on C. reinhardtii metabolism, its genome sequence (Merchant et al, 2007), and gene functional annotation, we have reconstructed and experimentally validated the genome-scale metabolic network for this alga, iRC1080, the first network to account for detailed photon absorption, permitting growth simulations under different light sources. iRC1080 accounts for 1080 genes, associated with 2190 reactions and 1068 unique metabolites, and encompasses 83 subsystems distributed across 10 cellular compartments (Figure 1A). Its >32% coverage of estimated metabolic genes is a tremendous expansion over previous algal reconstructions (Boyle and Morgan, 2009; Manichaikul et al, 2009). The lipid metabolic pathways of iRC1080 are considerably expanded relative to existing networks, and chemical properties of all metabolites in these pathways are accounted for explicitly, providing sufficient detail to completely specify all individual molecular species: backbone molecule and stereochemical numbering of acyl-chain positions; acyl-chain length; and number, position, and cis-trans stereoisomerism of carbon-carbon double bonds. Such detail in lipid metabolism will be critical for model-driven metabolic engineering efforts.
We experimentally verified transcripts accounted for in the network under permissive growth conditions, detecting >90% of tested transcript models (Figure 1B) and providing validating evidence for the contents of iRC1080. We also analyzed the extent of transcript verification by specific metabolic subsystems. Some subsystems stood out as more poorly verified, including chloroplast and mitochondrial transport systems and sphingolipid metabolism, all of which exhibited <80% of transcripts detected, reflecting incomplete characterization of compartmental transporters and supporting a hypothesis of latent pathway evolution for ceramide synthesis in C. reinhardtii. Additional lines of evidence from the reconstruction effort similarly support this hypothesis including lack of ceramide synthetase and other annotation gaps downstream in sphingolipid metabolism. A similar hypothesis of latent pathway evolution was established for very long-chain fatty acids (VLCFAs) and their polyunsaturated analogs (VLCPUFAs) (Figure 1C), owing to the absence of this class of lipids in previous experimental measurements, lack of a candidate VLCFA elongase in the functional annotation, and additional downstream annotation gaps in arachidonic acid metabolism.
The network provides a detailed account of metabolic photon absorption by light-driven reactions, including photosystems I and II, light-dependent protochlorophyllide oxidoreductase, provitamin D3 photoconversion to vitamin D3, and rhodopsin photoisomerase; this network accounting permits the precise modeling of light-dependent metabolism. iRC1080 accounts for effective light spectral ranges through analysis of biochemical activity spectra (Figure 3A), either reaction activity or absorbance at varying light wavelengths. Defining effective spectral ranges associated with each photon-utilizing reaction enabled our network to model growth under different light sources via stoichiometric representation of the spectral composition of emitted light, termed prism reactions. Coefficients for different photon wavelengths in a prism reaction correspond to the ratios of photon flux in the defined effective spectral ranges to the total emitted photon flux from a given light source (Figure 3B). This approach distinguishes the amount of emitted photons that drive different metabolic reactions. We created prism reactions for most light sources that have been used in published studies for algal and plant growth including solar light, various light bulbs, and LEDs. We also included regulatory effects, resulting from lighting conditions insofar as published studies enabled. Light and dark conditions have been shown to affect metabolic enzyme activity in C. reinhardtii on multiple levels: transcriptional regulation, chloroplast RNA degradation, translational regulation, and thioredoxin-mediated enzyme regulation. Through application of our light model and prism reactions, we were able to closely recapitulate experimental growth measurements under solar, incandescent, and red LED lights. Through unbiased sampling, we were able to establish the tremendous statistical significance of the accuracy of growth predictions achievable through implementation of prism reactions. 
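The prism-reaction coefficients described above are defined as the ratio of photon flux within each effective spectral range to the total emitted photon flux of the light source. That arithmetic can be sketched directly; the spectrum and the effective ranges below are invented for illustration, not values from iRC1080.

```python
# Hypothetical emission spectrum: wavelength (nm) -> emitted photon flux.
spectrum = {440: 10.0, 550: 5.0, 680: 35.0, 700: 10.0}

# Hypothetical effective spectral ranges (nm) for photon-utilizing reactions.
effective_ranges = {
    "photosystem_II": (400, 470),
    "photosystem_I":  (650, 700),
}

def prism_coefficients(spectrum, ranges):
    """Fraction of total emitted photon flux falling in each effective range."""
    total = sum(spectrum.values())
    coeffs = {}
    for rxn, (lo, hi) in ranges.items():
        flux = sum(f for wl, f in spectrum.items() if lo <= wl <= hi)
        coeffs[rxn] = flux / total
    return coeffs

coeffs = prism_coefficients(spectrum, effective_ranges)
```

With these numbers, 75% of the emitted photons fall in the "photosystem I" range, so a prism reaction for this source would weight that range's photon species accordingly; comparing such coefficient vectors across sources is how relative light-utilization efficiency can be read off.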
Finally, application of the photosynthetic model was demonstrated prospectively to evaluate light utilization efficiency under different light sources. The results suggest that, of the existing light sources, red LEDs provide the greatest efficiency, about three times as efficient as sunlight. Extending this analysis, the model was applied to design a maximally efficient LED spectrum for algal growth. The result was a 677-nm peak LED spectrum with a total incident photon flux of 360 μE/m2/s, suggesting that for the simple objective of maximizing growth efficiency, LED technology has already reached an effective theoretical optimum.
In summary, the C. reinhardtii metabolic network iRC1080 that we have reconstructed offers insight into the basic biology of this species and may be employed prospectively for genetic engineering design and light source design relevant to algal biotechnology. iRC1080 was used to analyze lipid metabolism and generate novel hypotheses about the evolution of latent pathways. The predictive capacity of metabolic models developed from iRC1080 was demonstrated in simulating mutant phenotypes and in evaluation of light source efficiency. Our network provides a broad knowledgebase of the biochemistry and genomics underlying global metabolism of a photoautotroph, and our modeling approach for light-driven metabolism exemplifies how integration of largely unvisited data types, such as physicochemical environmental parameters, can expand the diversity of applications of metabolic networks.
Metabolic network reconstruction encompasses existing knowledge about an organism's metabolism and genome annotation, providing a platform for omics data analysis and phenotype prediction. The model alga Chlamydomonas reinhardtii is employed to study diverse biological processes from photosynthesis to phototaxis. Recent heightened interest in this species results from an international movement to develop algal biofuels. Integrating biological and optical data, we reconstructed a genome-scale metabolic network for this alga and devised a novel light-modeling approach that enables quantitative growth prediction for a given light source, resolving wavelength and photon flux. We experimentally verified transcripts accounted for in the network and physiologically validated model function through simulation and generation of new experimental growth data, providing high confidence in network contents and predictive applications. The network offers insight into algal metabolism and potential for genetic engineering and efficient light source design, a pioneering resource for studying light-driven metabolism and quantitative systems biology.
PMCID: PMC3202792  PMID: 21811229
Chlamydomonas reinhardtii; lipid metabolism; metabolic engineering; photobioreactor
7.  Reverse engineering biomolecular systems using −omic data: challenges, progress and opportunities 
Briefings in Bioinformatics  2012;13(4):430-445.
Recent advances in high-throughput biotechnologies have led to rapidly growing research interest in reverse engineering of biomolecular systems (REBMS). 'Data-driven' approaches, i.e. data mining, can be used to extract patterns from large volumes of biochemical data at molecular-level resolution, while 'design-driven' approaches, i.e. systems modeling, can be used to simulate emergent system properties. Consequently, both data- and design-driven approaches applied to -omic data may lead to novel insights in reverse engineering biological systems that could not be expected before using low-throughput platforms. However, several challenges exist in this fast-growing field: (i) integrating heterogeneous biochemical data for data mining, (ii) combining top-down and bottom-up approaches for systems modeling and (iii) validating system models experimentally. In addition to reviewing progress made by the community and opportunities encountered in addressing these challenges, we explore the emerging field of synthetic biology, which is an exciting approach to validate and analyze theoretical system models directly through experimental synthesis, i.e. analysis-by-synthesis. The ultimate goal is to address the present and future challenges in reverse engineering biomolecular systems using an integrated workflow of data mining, systems modeling and synthetic biology.
PMCID: PMC3404400  PMID: 22833495
reverse engineering biological systems; high-throughput technology; –omic data; synthetic biology; analysis-by-synthesis
8.  Data mining in the Life Sciences with Random Forest: a walk in the park or lost in the jungle? 
Briefings in Bioinformatics  2012;14(3):315-326.
In the Life Sciences, 'omics' data are increasingly generated by different high-throughput technologies. Often only the integration of these data allows uncovering biological insights that can be experimentally validated or mechanistically modelled, i.e. sophisticated computational approaches are required to extract the complex non-linear trends present in omics data. Classification techniques allow training a model based on variables (e.g. SNPs in genetic association studies) to separate different classes (e.g. healthy subjects versus patients). Random Forest (RF) is a versatile classification algorithm suited for the analysis of these large data sets. In the Life Sciences, RF is popular because RF classification models have high prediction accuracy and provide information on the importance of variables for classification. For omics data, variables or conditional relations between variables are typically important only for a subset of samples of the same class. For example, within a class of cancer patients, certain SNP combinations may be important for a subset of patients with a specific subtype of cancer, but not for a different subset. These conditional relationships can in principle be uncovered from the data with RF, as they are implicitly taken into account by the algorithm during the creation of the classification model. This review details some RF properties that, to the best of our knowledge, have rarely or never been used, but that allow maximizing the biological insights extracted from complex omics data sets with RF.
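The notion that forest-style classifiers surface informative variables can be sketched with a toy ensemble of decision stumps, where "importance" is simply how often each feature is chosen as the best splitter across bootstrap samples. This is a drastic simplification of Random Forest (one-level trees, no feature subsampling), and the data are invented.

```python
import random

random.seed(0)

# Toy data: feature 0 tracks the class label, feature 1 is pure noise.
def make_sample():
    label = random.randint(0, 1)
    x0 = label + random.uniform(-0.3, 0.3)   # informative feature
    x1 = random.uniform(0.0, 1.0)            # noise feature
    return (x0, x1), label

data = [make_sample() for _ in range(80)]

def stump_error(samples, f, t):
    """Misclassifications of the rule: predict class 1 when x[f] > t."""
    return sum((x[f] > t) != bool(y) for x, y in samples)

def best_stump(samples):
    """Exhaustively pick the (feature, threshold) with the fewest errors."""
    best = (None, None, len(samples) + 1)
    for f in (0, 1):
        for x, _ in samples:
            e = stump_error(samples, f, x[f])
            if e < best[2]:
                best = (f, x[f], e)
    return best

# "Importance" = how often each feature wins across bootstrap-trained stumps.
importance = {0: 0, 1: 0}
for _ in range(15):
    boot = [random.choice(data) for _ in range(len(data))]
    f, t, e = best_stump(boot)
    importance[f] += 1
```

The informative feature dominates the importance tally, which is the intuition behind RF variable-importance scores; the conditional (subset-specific) relationships the review discusses require inspecting the trees themselves, not just global counts.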
PMCID: PMC3659301  PMID: 22786785
Random Forest; variable importance; local importance; conditional relationships; variable interaction; proximity
9.  Stem cell systems informatics for advanced clinical biodiagnostics: tracing molecular signatures from bench to bedside 
Croatian Medical Journal  2013;54(4):319-329.
Development of innovative high-throughput technologies has enabled a variety of molecular landscapes to be interrogated with an unprecedented degree of detail. Emergence of next generation nucleotide sequencing methods, advanced proteomic techniques, and metabolic profiling approaches continues to produce a wealth of biological data that captures molecular frameworks underlying phenotype. The advent of these novel technologies has significant translational applications, as investigators can now explore molecular underpinnings of developmental states with a high degree of resolution. Application of these leading-edge techniques to patient samples has been successfully used to unmask nuanced molecular details of diseased versus healthy tissue, which may provide novel targets for therapeutic intervention. To enhance such approaches, concomitant development of algorithms to reprogram differentiated cells in order to recapitulate pluripotent capacity offers a distinct advantage to advancing diagnostic methodology. Bioinformatic deconvolution of several "-omic" layers extracted from reprogrammed patient cells could, in principle, provide a means by which the evolution of individual pathology can be developmentally monitored. Significant logistic challenges face current implementation of this novel paradigm of patient treatment and care; however, several of these limitations have been successfully addressed through continuous development of cutting-edge in silico archiving and processing methods. Comprehensive elucidation of the genomic, transcriptomic, proteomic, and metabolomic networks that define normal and pathological states, in combination with reprogrammed patient cells, is thus poised to become a high-value resource in the modern diagnosis and prognosis of patient disease.
PMCID: PMC3760656  PMID: 23986272
10.  An Integrative Approach for Interpretation of Clinical NGS Genomic Variant Data 
Antibody (Ab) discovery research has accelerated as monoclonal Ab (mAb)-based biologic strategies have proved efficacious in the treatment of many human diseases, ranging from cancer to autoimmunity. Initial steps in the discovery of therapeutic mAb require epitope characterization and preclinical studies in vitro and in animal models often using limited quantities of Ab. To facilitate this research, our Shared Resource Laboratory (SRL) offers microscale Ab conjugation. Ab submitted for conjugation may or may not be commercially produced, but have not been characterized for use in immunofluorescence applications. Purified mAb and even polyclonal Ab (pAb) can be efficiently conjugated, although the advantages of direct conjugation are more obvious for mAb. To improve consistency of results in microscale (<100ug) conjugation reactions, we chose to utilize several different varieties of commercial kits. Kits tested were limited to covalent fluorophore labeling. Established quality control (QC) processes to validate fluorophore labeling either rely solely on spectrophotometry or utilize flow cytometry of cells expected to express the target antigen. This methodology is not compatible with microscale reactions using uncharacterized Ab. We developed a novel method for cell-free QC of our conjugates that reflects conjugation quality, but is independent of the biological properties of the Ab itself. QC is critical, as amine reactive chemistry relies on the absence of even trace quantities of competing amine moieties such as those found in the Good buffers (HEPES, MOPS, TES, etc.) or irrelevant proteins. Herein, we present data used to validate our method of assessing the extent of labeling and the removal of free dye by using flow cytometric analysis of polystyrene Ab capture beads to verify product quality. 
This microscale custom conjugation and QC allows for the rapid development and validation of high quality reagents, specific to the needs of our colleagues and clientele. Next generation sequencing (NGS) technologies provide the potential for developing high-throughput and low-cost platforms for clinical diagnostics. A limiting factor to clinical applications of genomic NGS is downstream bioinformatics analysis. Most analysis pipelines do not connect genomic variants to disease and protein specific information during the initial filtering and selection of relevant variants. Robust bioinformatics pipelines were implemented for trimming, genome alignment, SNP, INDEL, or structural variation detection of whole genome or exon-capture sequencing data from Illumina. Quality control metrics were analyzed at each step of the pipeline to ensure data integrity for clinical applications. We further annotate the variants with statistics regarding the diseased population and variant impact. Custom algorithms were developed to analyze the variant data by filtering variants based upon criteria such as quality of variant, inheritance pattern (e.g. dominant, recessive, X-linked), and impact of variant. The resulting variants and their associated genes are linked to Integrated Genome Browser (IGV) in a genome context, and to the PIR iProXpress system for rich protein and disease information. This poster will present detailed analysis of whole exome sequencing performed on patients with facio-skeletal anomalies. We will compare and contrast data analysis methods and report on potential clinically relevant leads discovered by implementing our new clinical variant pipeline. Our variant analysis of these patients and their unaffected family members resulted in more than 500,000 variants. 
By applying our system of annotations, prioritizations, inheritance filters, and functional profiling and analysis, we have created a unique methodology for further filtering of disease relevant variants that impact protein coding genes. Taken together, the integrative approach allows better selection of disease relevant genomic variants by using both genomic and disease/protein centric information. This type of clustering approach can help clinicians better understand the association of variants to the disease phenotype, enabling application to personalized medicine approaches.
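As a rough illustration of the filtering step described above, the sketch below applies quality, impact, and inheritance-pattern filters to trio genotypes. All field names, thresholds, and genotype encodings are hypothetical; the pipeline's actual schema and criteria are not specified here.

```python
# Hypothetical sketch of variant filtering by quality, impact, and
# inheritance pattern. Field names and thresholds are illustrative.

def passes_filters(variant, min_qual=30.0, model="recessive"):
    """Keep a variant only if it meets quality, impact, and inheritance criteria."""
    if variant["qual"] < min_qual:
        return False
    if variant["impact"] not in {"missense", "nonsense", "frameshift", "splice"}:
        return False
    proband, mother, father = (variant["genotypes"][s]
                               for s in ("proband", "mother", "father"))
    if model == "recessive":
        # Affected child homozygous alt; unaffected parents heterozygous carriers.
        return proband == "1/1" and mother == "0/1" and father == "0/1"
    if model == "dominant":
        # De novo / dominant: child carries an allele absent from both parents.
        return proband in {"0/1", "1/1"} and mother == "0/0" and father == "0/0"
    return True

variants = [
    {"qual": 45.0, "impact": "missense",
     "genotypes": {"proband": "1/1", "mother": "0/1", "father": "0/1"}},
    {"qual": 12.0, "impact": "missense",
     "genotypes": {"proband": "1/1", "mother": "0/1", "father": "0/1"}},
]
kept = [v for v in variants if passes_filters(v, model="recessive")]
print(len(kept))  # → 1 (the low-quality variant is removed)
```

In a real pipeline these predicates would run over annotated VCF records rather than dictionaries, but the layered filter logic is the same.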
PMCID: PMC4162289
11.  Classifying Variants of Undetermined Significance in BRCA2 with Protein Likelihood Ratios 
Cancer informatics  2008;6:203-216.
Missense (amino-acid-changing) variants found in cancer predisposition genes often create difficulties when clinically interpreting genetic testing results. Although bioinformatics has developed approaches to predicting the impact of these variants, these approaches have not yet found their footing in clinical practice because 1) interpreting the medical relevance of predictive scores is difficult, and 2) the relationship between bioinformatics “predictors” (sequence conservation, protein structure) and cancer susceptibility is not understood.
Methodology/Principal Findings
We present a computational method that produces a probabilistic likelihood ratio predictive of whether a missense variant impairs protein function. We apply the method to a tumor suppressor gene, BRCA2, whose loss of function is important to cancer susceptibility. Protein likelihood ratios are computed for 229 unclassified variants found in individuals from high-risk breast/ovarian cancer families. We map the variants onto a protein structure model, and suggest that a cluster of predicted deleterious variants in the BRCA2 OB1 domain may destabilize BRCA2 and a protein binding partner, the small acidic protein DSS1. We compare our predictions with variant “re-classifications” provided by Myriad Genetics, a biotechnology company that holds the patent on BRCA2 genetic testing in the U.S., and with classifications made by an established medical genetics model [1]. Our approach uses bioinformatics data that is independent of these genetics-based classifications and yet shows significant agreement with them. Preliminary results indicate that our method is less likely to make false positive errors than other bioinformatics methods, which were designed to predict the impact of missense mutations in general.
Missense mutations are the most common disease-producing genetic variants. We present a fast, scalable bioinformatics method that integrates information about protein sequence, conservation, and structure in a likelihood ratio that can be integrated with medical genetics likelihood ratios. The protein likelihood ratio, together with medical genetics likelihood ratios, can be used by clinicians and counselors to communicate the relevance of a VUS to the individual who has that VUS. The approach described here is generalizable to regions of any tumor suppressor gene that have been structurally determined by X-ray crystallography or for which a protein homology model can be built.
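The arithmetic of combining a protein likelihood ratio with medical genetics likelihood ratios is a product on the odds scale, followed by conversion back to a probability. The sketch below shows only that arithmetic; the prior and the likelihood ratio values are illustrative numbers, not values from the study.

```python
# Sketch of combining likelihood ratios on the odds scale. All numeric
# values here are illustrative, not taken from the study.

def posterior_probability(prior, likelihood_ratios):
    """Bayes' rule in odds form: posterior odds = prior odds x product of LRs."""
    odds = prior / (1.0 - prior)
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1.0 + odds)

prior = 0.1             # hypothetical prior that the VUS is deleterious
protein_lr = 8.0        # hypothetical protein (sequence/structure) evidence
cosegregation_lr = 3.0  # hypothetical medical genetics evidence

p = posterior_probability(prior, [protein_lr, cosegregation_lr])
print(round(p, 3))  # → 0.727
```

Because the evidence sources are assumed independent, each new likelihood ratio simply multiplies onto the running odds, which is what makes the protein likelihood ratio directly composable with medical genetics likelihood ratios.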
PMCID: PMC2587343  PMID: 19043619
Breast cancer; Risk assessment; Mutagenesis; Cancer susceptibility genes; Bioinformatics and computational biology; missense variants
13.  The Genome Organization of Thermotoga maritima Reflects Its Lifestyle 
PLoS Genetics  2013;9(4):e1003485.
The generation of genome-scale data is becoming more routine, yet the subsequent analysis of omics data remains a significant challenge. Here, an approach that integrates multiple omics datasets with bioinformatics tools was developed that produces a detailed annotation of several microbial genomic features. This methodology was used to characterize the genome of Thermotoga maritima—a phylogenetically deep-branching, hyperthermophilic bacterium. Experimental data were generated for whole-genome resequencing, transcription start site (TSS) determination, transcriptome profiling, and proteome profiling. These datasets, analyzed in combination with bioinformatics tools, served as a basis for the improvement of gene annotation, the elucidation of transcription units (TUs), the identification of putative non-coding RNAs (ncRNAs), and the determination of promoters and ribosome binding sites. This revealed many distinctive properties of the T. maritima genome organization relative to other bacteria. This genome has a high number of genes per TU (3.3), a paucity of putative ncRNAs (12), and few TUs with multiple TSSs (3.7%). Quantitative analysis of promoters and ribosome binding sites showed increased sequence conservation relative to other bacteria. The 5′UTRs follow an atypical bimodal length distribution comprising “Short” 5′UTRs (11–17 nt) and “Common” 5′UTRs (26–32 nt). Transcriptional regulation is limited by a lack of intergenic space for the majority of TUs. Lastly, a high fraction of annotated genes are expressed independently of growth state, and a linear mRNA/protein correlation is observed (Pearson r = 0.63, p < 2.2×10⁻¹⁶, t-test). These distinctive properties are hypothesized to be a reflection of this organism's hyperthermophilic lifestyle and could yield novel insights into the evolutionary trajectory of microbial life on earth.
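The mRNA/protein correlation reported above is a standard Pearson correlation on paired abundance measurements. A minimal sketch on synthetic, log-normally distributed data (the study used genome-wide transcriptome and proteome profiles, not simulated values):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500  # number of genes (synthetic)

# Synthetic transcript levels and correlated protein levels.
mrna = rng.lognormal(mean=5.0, sigma=1.0, size=n)
protein = mrna * rng.lognormal(mean=0.0, sigma=0.8, size=n)

# Correlate on a log scale, as is typical for abundance data.
r = np.corrcoef(np.log10(mrna), np.log10(protein))[0, 1]
print(f"Pearson r = {r:.2f}")
```

Log-transforming first matters: abundance data are heavily right-skewed, and a linear-scale Pearson coefficient would be dominated by the few most abundant genes.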
Author Summary
Genomic studies have greatly benefited from the advent of high-throughput technologies and bioinformatics tools. Here, a methodology integrating genome-scale data and bioinformatics tools is developed to characterize the genome organization of the hyperthermophilic, phylogenetically deep-branching bacterium Thermotoga maritima. This approach elucidates several features of the genome organization and enables comparative analysis of these features across diverse taxa. Our results suggest that the genome of T. maritima is reflective of its hyperthermophilic lifestyle. Ultimately, constraints imposed on the genome have negative impacts on regulatory complexity and phenotypic diversity. Investigating the genome organization of Thermotogae species will help resolve various causal factors contributing to the genome organization such as phylogeny and environment. Applying a similar analysis of the genome organization to numerous taxa will likely provide insights into microbial evolution.
PMCID: PMC3636130  PMID: 23637642
14.  Integrative Genomic Analyses Identify BRF2 as a Novel Lineage-Specific Oncogene in Lung Squamous Cell Carcinoma 
PLoS Medicine  2010;7(7):e1000315.
William Lockwood and colleagues show that the focal amplification of a gene, BRF2, on Chromosome 8p12 plays a key role in squamous cell carcinoma of the lung.
Traditionally, non-small cell lung cancer is treated as a single disease entity in terms of systemic therapy. Emerging evidence suggests the major subtypes—adenocarcinoma (AC) and squamous cell carcinoma (SqCC)—respond differently to therapy. Identification of the molecular differences between these tumor types will have a significant impact in designing novel therapies that can improve the treatment outcome.
Methods and Findings
We used an integrative genomics approach, combining high-resolution comparative genomic hybridization and gene expression microarray profiles, to compare AC and SqCC tumors in order to uncover alterations at the DNA level, with corresponding gene transcription changes, which are selected for during development of lung cancer subtypes. Through the analysis of multiple independent cohorts of clinical tumor samples (>330), normal lung tissues, and bronchial epithelial cells obtained by bronchial brushing in smokers without lung cancer, we identified the overexpression of BRF2, a gene on Chromosome 8p12, which is specific to the development of lung SqCC. Genetic activation of BRF2, which encodes an RNA polymerase III (Pol III) transcription initiation factor, was found to be associated with increased expression of small nuclear RNAs (snRNAs) that are involved in processes essential for cell growth, such as RNA splicing. Ectopic expression of BRF2 in human bronchial epithelial cells induced a transformed phenotype and demonstrated downstream oncogenic effects, whereas RNA interference (RNAi)-mediated knockdown suppressed growth and colony formation of SqCC cells overexpressing BRF2, but not of AC cells. Frequent activation of BRF2 in >35% of preinvasive bronchial carcinomas in situ, as well as in dysplastic lesions, provides evidence that BRF2 expression is an early event in cancer development of this cell lineage.
This is the first study, to our knowledge, to show that the focal amplification of a gene in Chromosome 8p12 plays a key role in the squamous cell lineage specificity of the disease. Our data suggest that genetic activation of BRF2 represents a unique mechanism of SqCC lung tumorigenesis through the increase of Pol III-mediated transcription. It can serve as a marker for lung SqCC and may provide a novel target for therapy.
Please see later in the article for the Editors' Summary
Editors' Summary
Lung cancer is the commonest cause of cancer-related death. Every year, 1.3 million people die from this disease, which is mainly caused by smoking. Most cases of lung cancer are “non-small cell lung cancers” (NSCLCs). Like all cancers, NSCLC starts when cells begin to divide uncontrollably and to move round the body (metastasize) because of changes (mutations) in their genes. These mutations are often in “oncogenes,” genes that, when activated, encourage cell division. Oncogenes can be activated by mutations that alter the properties of the proteins they encode or by mutations that increase the amount of protein made from them, such as gene amplification (an increase in the number of copies of a gene). If NSCLC is diagnosed before it has spread from the lungs (stage I disease), it can be surgically removed and many patients with stage I NSCLC survive for more than 5 years after their diagnosis. Unfortunately, in more than half of patients, NSCLC has metastasized before it is diagnosed. This stage IV NSCLC can be treated with chemotherapy (toxic chemicals that kill fast-growing cancer cells) but only 2% of patients with stage IV lung cancer are alive 5 years after diagnosis.
Why Was This Study Done?
Traditionally, NSCLC has been regarded as a single disease in terms of treatment. However, emerging evidence suggests that the two major subtypes of NSCLC—adenocarcinoma and squamous cell carcinoma (SqCC)—respond differently to chemotherapy. Adenocarcinoma and SqCC start in different types of lung cell and experts think that for each cell type in the body, specific combinations of mutations interact with the cell type's own unique characteristics to provide the growth and survival advantage needed for cancer development. If this is true, then identifying the molecular differences between adenocarcinoma and SqCC could provide targets for more effective therapies for these major subtypes of NSCLC. Amplification of a chromosome region called 8p12 is very common in NSCLC, which suggests that an oncogene that drives lung cancer development is present in this chromosome region. In this study, the researchers investigate this possibility by looking for an amplified gene in the 8p12 chromosome region that makes increased amounts of protein in lung SqCC but not in lung adenocarcinoma.
What Did the Researchers Do and Find?
The researchers used a technique called comparative genomic hybridization to show that focal regions of Chromosome 8p are amplified in about 40% of lung SqCCs, but that DNA loss in this region is the most common alteration in lung adenocarcinomas. Ten genes in the 8p12 chromosome region were expressed at higher levels in the SqCC samples that they examined than in adenocarcinoma samples, they report, and overexpression of five of these genes correlated with amplification of the 8p12 region in the SqCC samples. Only one of the genes—BRF2—was more highly expressed in squamous carcinoma cells than in normal bronchial epithelial cells (the cell type that lines the tubes that take air into the lungs and from which SqCC develops). Artificially induced expression of BRF2 in bronchial epithelial cells made these normal cells behave like tumor cells, whereas reduction of BRF2 expression in squamous carcinoma cells made them behave more like normal bronchial epithelial cells. Finally, BRF2 was frequently activated in two early stages of squamous cell carcinoma—bronchial carcinoma in situ and dysplastic lesions.
What Do These Findings Mean?
Together, these findings show that the focal amplification of chromosome region 8p12 plays a role in the development of lung SqCC but not in the development of lung adenocarcinoma, the other major subtype of NSCLC. These findings identify BRF2 (which encodes an RNA polymerase III transcription initiation factor, a protein that is required for the synthesis of RNA molecules that help to control cell growth) as a lung SqCC-specific oncogene and uncover a unique mechanism for lung SqCC development. Most importantly, these findings suggest that genetic activation of BRF2 could be used as a marker for lung SqCC, which might facilitate the early detection of this type of NSCLC, and that BRF2 might provide a new target for therapy.
Additional Information
Please access these Web sites via the online version of this summary at
The US National Cancer Institute provides detailed information for patients and professionals about all aspects of lung cancer, including information on non-small cell carcinoma (in English and Spanish)
Cancer Research UK also provides information about lung cancer and information on how cancer starts
MedlinePlus has links to other resources about lung cancer (in English and Spanish)
PMCID: PMC2910599  PMID: 20668658
15.  Nuclear Receptor Expression Defines a Set of Prognostic Biomarkers for Lung Cancer 
PLoS Medicine  2010;7(12):e1000378.
David Mangelsdorf and colleagues show that nuclear receptor expression is strongly associated with clinical outcomes of lung cancer patients, and this expression profile is a potential prognostic signature for lung cancer patient survival time, particularly for individuals with early stage disease.
The identification of prognostic tumor biomarkers that also would have potential as therapeutic targets, particularly in patients with early stage disease, has been a long sought-after goal in the management and treatment of lung cancer. The nuclear receptor (NR) superfamily, which is composed of 48 transcription factors that govern complex physiologic and pathophysiologic processes, could represent a unique subset of these biomarkers. In fact, many members of this family are the targets of already identified selective receptor modulators, providing a direct link between individual tumor NR quantitation and selection of therapy. The goal of this study, which begins this overall strategy, was to investigate the association between mRNA expression of the NR superfamily and the clinical outcome for patients with lung cancer, and to test whether a tumor NR gene signature provided useful information (over available clinical data) for patients with lung cancer.
Methods and Findings
Using quantitative real-time PCR to study NR expression in 30 microdissected non-small-cell lung cancers (NSCLCs) and their pair-matched normal lung epithelium, we found great variability in NR expression among patients' tumor and non-involved lung epithelium, found a strong association between NR expression and clinical outcome, and identified an NR gene signature from both normal and tumor tissues that predicted patient survival time and disease recurrence. The NR signature derived from the initial 30 NSCLC samples was validated in two independent microarray datasets derived from 442 and 117 resected lung adenocarcinomas. The NR gene signature was also validated in 130 squamous cell carcinomas. The prognostic signature in tumors could be distilled to expression of two NRs, short heterodimer partner and progesterone receptor, as single gene predictors of NSCLC patient survival time, including for patients with stage I disease. Of equal interest, the studies of microdissected histologically normal epithelium and matched tumors identified expression in normal (but not tumor) epithelium of NGFIB3 and mineralocorticoid receptor as single gene predictors of good prognosis.
NR expression is strongly associated with clinical outcomes for patients with lung cancer, and this expression profile provides a unique prognostic signature for lung cancer patient survival time, particularly for those with early stage disease. This study highlights the potential use of NRs as a rational set of therapeutically tractable genes as theragnostic biomarkers, and specifically identifies short heterodimer partner and progesterone receptor in tumors, and NGFIB3 and MR in non-neoplastic lung epithelium, for future detailed translational study in lung cancer.
Please see later in the article for the Editors' Summary
Editors' Summary
Lung cancer, the most common cause of cancer-related death, kills 1.3 million people annually. Most lung cancers are “non-small-cell lung cancers” (NSCLCs), and most are caused by smoking. Exposure to chemicals in smoke causes changes in the genes of the cells lining the lungs that allow the cells to grow uncontrollably and to move around the body. How NSCLC is treated and responds to treatment depends on its “stage.” Stage I tumors, which are small and confined to the lung, are removed surgically, although chemotherapy is also sometimes given. Stage II tumors have spread to nearby lymph nodes and are treated with surgery and chemotherapy, as are some stage III tumors. However, because cancer cells in stage III tumors can be present throughout the chest, surgery is not always possible. For such cases, and for stage IV NSCLC, where the tumor has spread around the body, patients are treated with chemotherapy alone. About 70% of patients with stage I and II NSCLC but only 2% of patients with stage IV NSCLC survive for five years after diagnosis; more than 50% of patients have stage IV NSCLC at diagnosis.
Why Was This Study Done?
Patient responses to treatment vary considerably. Oncologists (doctors who treat cancer) would like to know which patients have a good prognosis (are likely to do well) to help them individualize their treatment. Consequently, the search is on for “prognostic tumor biomarkers,” molecules made by cancer cells that can be used to predict likely clinical outcomes. Such biomarkers, which may also be potential therapeutic targets, can be identified by analyzing the overall pattern of gene expression in a panel of tumors using a technique called microarray analysis and looking for associations between the expression of sets of genes and clinical outcomes. In this study, the researchers take a more directed approach to identifying prognostic biomarkers by investigating the association between the expression of the genes encoding nuclear receptors (NRs) and clinical outcome in patients with lung cancer. The NR superfamily contains 48 transcription factors (proteins that control the expression of other genes) that respond to several hormones and to diet-derived fats. NRs control many biological processes and are targets for several successful drugs, including some used to treat cancer.
What Did the Researchers Do and Find?
The researchers analyzed the expression of NR mRNAs using “quantitative real-time PCR” in 30 microdissected NSCLCs and in matched normal lung tissue samples (mRNA is the blueprint for protein production). They then used an approach called standard classification and regression tree analysis to build a prognostic model for NSCLC based on the expression data. This model predicted both survival time and disease recurrence among the patients from whom the tumors had been taken. The researchers validated their prognostic model in two large independent lung adenocarcinoma microarray datasets and in a squamous cell carcinoma dataset (adenocarcinomas and squamous cell carcinomas are two major NSCLC subtypes). Finally, they explored the roles of specific NRs in the prediction model. This analysis revealed that the ability of the NR signature in tumors to predict outcomes was mainly due to the expression of two NRs—the short heterodimer partner (SHP) and the progesterone receptor (PR). Expression of either gene could be used as a single gene predictor of the survival time of patients, including those with stage I disease. Similarly, the expression of either nerve growth factor induced gene B3 (NGFIB3) or mineralocorticoid receptor (MR) in normal tissue was a single gene predictor of a good prognosis.
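The classification-and-regression-tree idea behind a single-gene predictor can be sketched as a one-split search for an expression cutpoint. The data below are synthetic and the gene is only a stand-in; the study's actual CART models and survival endpoints are more involved.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Synthetic single-gene expression values (a stand-in for, e.g., PR mRNA)
# and a binary outcome loosely driven by that expression.
expression = rng.normal(size=n)
outcome = (expression + rng.normal(scale=0.7, size=n) > 0).astype(int)

def accuracy_at(cut):
    """Accuracy of splitting patients at `cut`, allowing either direction."""
    pred = (expression > cut).astype(int)
    agree = (pred == outcome).mean()
    return max(agree, 1.0 - agree)

# Exhaustive search over observed values: the essence of a depth-1 CART split.
best_cut = max(np.sort(expression), key=accuracy_at)
print(f"cutpoint: {best_cut:.2f}, training accuracy: {accuracy_at(best_cut):.2f}")
```

A full CART analysis would grow deeper trees, use impurity rather than raw accuracy, and validate the cutpoint on held-out patients, but the single-split search above is the core operation that turns a continuous expression value into a two-group prognostic classifier.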
What Do These Findings Mean?
These findings indicate that the expression of NR mRNA is strongly associated with clinical outcomes in patients with NSCLC. Furthermore, they identify a prognostic NR expression signature that provides information on the survival time of patients, including those with early stage disease. The signature needs to be confirmed in more patients before it can be used clinically, and researchers would like to establish whether changes in mRNA expression are reflected in changes in protein expression if NRs are to be targeted therapeutically. Nevertheless, these findings highlight the potential use of NRs as prognostic tumor biomarkers. Furthermore, they identify SHP and PR in tumors and two NRs in normal lung tissue as molecules that might provide new targets for the treatment of lung cancer and new insights into the early diagnosis, pathogenesis, and chemoprevention of lung cancer.
Additional Information
Please access these Web sites via the online version of this summary at
The Nuclear Receptor Signaling Atlas (NURSA) is a consortium of scientists sponsored by the US National Institutes of Health that provides scientific reagents, datasets, and educational material on nuclear receptors and their co-regulators to the scientific community through a Web-based portal
The Cancer Prevention and Research Institute of Texas (CPRIT) provides information and resources to anyone interested in the prevention and treatment of lung and other cancers
The US National Cancer Institute provides detailed information for patients and professionals about all aspects of lung cancer, including information on non-small-cell carcinoma and on tumor markers (in English and Spanish)
Cancer Research UK also provides information about lung cancer and information on how cancer starts
MedlinePlus has links to other resources about lung cancer (in English and Spanish)
Wikipedia has a page on nuclear receptors (note that Wikipedia is a free online encyclopedia that anyone can edit; available in several languages)
PMCID: PMC3001894  PMID: 21179495
16.  G-DOC: A Systems Medicine Platform for Personalized Oncology 
Neoplasia (New York, N.Y.)  2011;13(9):771-783.
Currently, cancer therapy remains limited by a “one-size-fits-all” approach, whereby treatment decisions are based mainly on the clinical stage of disease yet fail to reference the individual's underlying biology and its role in driving malignancy. Identifying better personalized therapies for cancer treatment is hindered by the lack of high-quality “omics” data of sufficient size to produce meaningful results and by the inability to integrate biomedical data from disparate technologies. Resolving these issues will help the translation of therapies from research to clinic by helping clinicians develop patient-specific treatments based on the unique signatures of a patient's tumor. Here we describe the Georgetown Database of Cancer (G-DOC), a Web platform that enables basic and clinical research by integrating patient characteristics and clinical outcome data with a variety of high-throughput research data in a unified environment. While several rich data repositories for high-dimensional research data exist in the public domain, most focus on a single data type and do not support integration across multiple technologies. Currently, G-DOC contains data from more than 2500 breast cancer patients and 800 gastrointestinal cancer patients. G-DOC also includes a broad collection of bioinformatics and systems biology tools for the analysis and visualization of four major “omics” types: DNA, mRNA, microRNA, and metabolites. We believe that G-DOC will help facilitate systems medicine by enabling the identification of trends and patterns in integrated data sets, and hence facilitate the use of better targeted therapies for cancer. A set of representative usage scenarios is provided to highlight the technical capabilities of this resource.
PMCID: PMC3182270  PMID: 21969811
17.  Robust clinical outcome prediction based on Bayesian analysis of transcriptional profiles and prior causal networks 
Bioinformatics  2014;30(12):i69-i77.
Motivation: Understanding and predicting an individual’s response in a clinical trial is the key to better treatments and cost-effective medicine. Over the coming years, more and more large-scale omics datasets will become available to characterize patients with complex and heterogeneous diseases at a molecular level. Unfortunately, genetic, phenotypical and environmental variation is much higher in a human trial population than currently modeled or measured in most animal studies. In our experience, this high variability can lead to failure of trained predictors in independent studies and undermines the credibility and utility of promising high-dimensional datasets.
Methods: We propose a method that utilizes patient-level genome-wide expression data in conjunction with causal networks based on prior knowledge. Our approach determines a differential expression profile for each patient and uses a Bayesian approach to infer corresponding upstream regulators. These regulators and their corresponding posterior probabilities of activity are used in a regularized regression framework to predict response.
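The two-stage idea described in the Methods can be sketched as follows: (1) score each upstream regulator per patient, and (2) use those scores as features in a regularized regression that predicts response. The regulator scoring below is a toy normalized target-set score standing in for the paper's Bayesian posterior inference, and all data, network edges, and hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n_patients, n_genes, n_regulators = 80, 300, 10

# Per-patient differential expression and a prior-knowledge network of
# regulator -> target-gene edges (both synthetic).
expr = rng.normal(size=(n_patients, n_genes))
targets = rng.integers(0, 2, size=(n_regulators, n_genes)).astype(float)

# Toy stand-in for inferred regulator activity: a normalized score over
# each regulator's targets, squashed to (0, 1). The paper instead infers
# activity posteriors with a Bayesian approach over the causal network.
raw = (expr @ targets.T) / np.sqrt(targets.sum(axis=1))
activity = 1.0 / (1.0 + np.exp(-raw))

# Simulated response driven by the first regulator's activity.
response = (activity[:, 0] + 0.05 * rng.normal(size=n_patients) > 0.5).astype(int)

# L2-regularized ("ridge") logistic regression fit by gradient descent.
w, b, lam, lr = np.zeros(n_regulators), 0.0, 0.1, 0.5
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(activity @ w + b)))
    w -= lr * (activity.T @ (p - response) / n_patients + lam * w)
    b -= lr * (p - response).mean()

pred = (activity @ w + b > 0).astype(int)
print(f"training accuracy: {(pred == response).mean():.2f}")
```

Reducing thousands of gene-level features to a handful of regulator activities is what gives the approach its robustness across independent trials: the regression operates on a low-dimensional, knowledge-constrained feature space rather than on the raw expression matrix.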
Results: We validated our approach using two clinically relevant phenotypes, namely acute rejection in kidney transplantation and response to Infliximab in ulcerative colitis. To demonstrate pitfalls in translating trained predictors across independent trials, we analyze performance characteristics of our approach as well as alternative feature sets in the regression on two independent datasets for each phenotype. We show that the proposed approach is able to successfully incorporate causal prior knowledge to give robust performance estimates.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC4058945  PMID: 24932007
18.  A regression model approach to enable cell morphology correction in high-throughput flow cytometry 
Large variations in cell size and shape can undermine traditional gating methods for analyzing flow cytometry data. Correcting for these effects enables analysis of high-throughput data sets, including >5000 yeast samples with diverse cell morphologies.
The regression model approach corrects for the effects of cell morphology on fluorescence as effectively as an extremely small and restrictive gate, but without removing any of the cells. In contrast to traditional gating, this approach enables the quantitative analysis of high-throughput flow cytometry experiments, since the regression model can compare biological samples that show little or no overlap in the morphology of their cells. The analysis of a high-throughput yeast flow cytometry data set consisting of >5000 biological samples identified key proteins that affect the time and intensity of the bifurcation event that follows the carbon-source transition from glucose to fatty acids, in which some yeast cells undergo major structural changes while others do not.
Flow cytometry is a widely used technique that enables the measurement of different optical properties of individual cells within large populations of cells in a fast and automated manner. For example, by targeting cell-specific markers with fluorescent probes, flow cytometry is used to identify (and isolate) cell types within complex mixtures of cells. In addition, fluorescence reporters can be used in conjunction with flow cytometry to measure protein, RNA or DNA concentration within single cells of a population.
One of the biggest advantages of this technique is that it provides information on how each cell behaves instead of just measuring the population average. This can be essential when analyzing complex samples that consist of diverse cell types or when measuring cellular responses to stimuli. For example, there is an important difference between a 50% expression increase in all cells of a population after stimulation and a 100% increase in only half of the cells, while the other half remains unresponsive. Another important advantage of flow cytometry is automation, which enables high-throughput studies with thousands of samples and conditions. However, current methods are confounded by populations of cells that are non-uniform in terms of size and granularity. Such variability affects the emitted fluorescence of the cell and adds undesired variability when estimating population fluorescence. This effect also frustrates a sensible comparison between conditions, where not only fluorescence but also cell size and granularity may be affected.
Traditionally, this problem has been addressed by using ‘gates’ that restrict the analysis to cells with similar morphological properties (i.e. cell size and cell granularity). Because cells inside the gate are morphologically similar to one another, they will show a smaller variability in their response within the population. Moreover, applying the same gate in all samples assures that observed differences between these samples are not due to differential cell morphologies.
Gating, however, comes with costs. First, since only a subgroup of cells is selected, the final number of cells analyzed can be significantly reduced. This means that in order to have sufficient statistical power, more cells have to be acquired, which, if even possible in the first place, increases the time and cost of the experiment. Second, finding a good gate for all samples and conditions can be challenging if not impossible, especially in cases where cellular morphology changes dramatically between conditions. Finally, gating is a very user-dependent process, where both the size and shape of the gate are determined by the researcher and will affect the outcome, introducing subjectivity in the analysis that complicates reproducibility.
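The cell-loss cost of gating described above can be illustrated with a minimal sketch on synthetic data. This is not code or data from the paper: the scatter distributions and the rectangular gate thresholds are purely illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Synthetic sample: forward scatter (a proxy for cell size) and
# side scatter (a proxy for granularity) per cell.
fsc = rng.lognormal(mean=4.0, sigma=0.4, size=n)
ssc = rng.lognormal(mean=3.5, sigma=0.5, size=n)

# A rectangular gate: keep only cells inside fixed morphology bounds.
# The thresholds here are arbitrary, chosen only for illustration.
gate = (fsc > 45) & (fsc < 75) & (ssc > 25) & (ssc < 45)

kept = int(gate.sum())
print(f"{kept} of {n} cells pass the gate ({100 * kept / n:.1f}%)")
```

Even a moderately sized gate discards a large fraction of the acquired cells, which is why more cells must be recorded to retain statistical power, and why a single gate that fits all conditions may not exist when morphology shifts between samples.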
In this paper, we present an alternative method to gating that addresses the issues stated above. The method is based on a regression model containing linear and non-linear terms that estimates and corrects for the effect of cell size and granularity on the observed fluorescence of each cell in a sample. The corrected fluorescence thus becomes ‘free’ of the morphological effects.
Because the model uses all cells in the sample, it assures that the corrected fluorescence is an accurate representation of the sample. In addition, the regression model can predict the expected fluorescence of a sample in areas where there are no cells. This makes it possible to compare between samples that have little overlap with good confidence. Furthermore, because the regression model is automated, it is fully reproducible between labs and conditions. Finally, it allows for a rapid analysis of big data sets containing thousands of samples.
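The correction idea described above can be sketched as a least-squares regression of fluorescence on morphology, with the residuals taken as morphology-free fluorescence. This is a minimal illustration on synthetic data, not the authors' implementation: the particular quadratic and interaction terms, and the simulated confounding coefficients, are assumptions made here for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Synthetic sample: standardized forward scatter (size), side scatter
# (granularity), and the true fluorescence signal we want to recover.
fsc = rng.normal(0.0, 1.0, n)
ssc = rng.normal(0.0, 1.0, n)
true_signal = rng.normal(10.0, 0.5, n)

# Observed fluorescence is confounded by morphology through
# linear and non-linear contributions (coefficients are illustrative).
observed = true_signal + 1.5 * fsc + 0.8 * ssc + 0.4 * fsc**2

# Design matrix with linear and non-linear (quadratic, interaction) terms.
X = np.column_stack([np.ones(n), fsc, ssc, fsc**2, ssc**2, fsc * ssc])
coef, *_ = np.linalg.lstsq(X, observed, rcond=None)

# Corrected fluorescence: subtract the predicted morphological component;
# adding back the sample mean keeps absolute levels comparable.
predicted = X @ coef
corrected = observed - predicted + observed.mean()

print(f"std before: {observed.std():.2f}, after: {corrected.std():.2f}")
```

Because the fitted model is defined over the whole morphology plane, it can also predict expected fluorescence in regions where one sample has few or no cells, which is what allows samples with little morphological overlap to be compared.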
To test the validity of the model, we performed several experiments. We show that the regression model removes morphology-associated variability as effectively as an extremely small and restrictive gate, but without the caveat of removing cells. We test the method in different organisms (yeast and human) and applications (protein level detection, separation of mixed subpopulations). We then apply this method to unveil new biological insights into the mechanistic processes involved in transcriptional noise.
Gene transcription is a process subject to the randomness intrinsic to any molecular event. Although such randomness may seem undesirable for the cell, since it prevents consistent behavior, there are situations where some degree of randomness is beneficial (e.g. bet hedging). For this reason, each gene is tuned to exhibit a different level of randomness, or noise, depending on its function. For core and essential genes, the cell has developed mechanisms to lower the level of noise, while for genes involved in the response to stress, the variability is greater.
This gene transcription tuning can be determined at many levels, from the architecture of the transcriptional network, to epigenetic regulation. In our study, we analyze the latter using the response of yeast to the presence of fatty acid in the environment. Fatty acid can be used as energy by yeast, but it requires major structural changes and commitments. We have observed that at the population level, there is a bifurcation event whereby some cells undergo these changes and others do not. We have analyzed this bifurcation event in mutants for all the non-essential epigenetic regulators in yeast and identified key proteins that affect the time and intensity of this bifurcation. Even though fatty acid triggers major morphological changes in the cell, the regression model still makes it possible to analyze the over 5000 flow cytometry samples in this data set in an automated manner, whereas a traditional gating approach would be impossible.
Cells exposed to stimuli exhibit a wide range of responses ensuring phenotypic variability across the population. Such single cell behavior is often examined by flow cytometry; however, gating procedures typically employed to select a small subpopulation of cells with similar morphological characteristics make it difficult, even impossible, to quantitatively compare cells across a large variety of experimental conditions because these conditions can lead to profound morphological variations. To overcome these limitations, we developed a regression approach to correct for variability in fluorescence intensity due to differences in cell size and granularity without discarding any of the cells, which gating ipso facto does. This approach enables quantitative studies of cellular heterogeneity and transcriptional noise in high-throughput experiments involving thousands of samples. We used this approach to analyze a library of yeast knockout strains and reveal genes required for the population to establish a bimodal response to oleic acid induction. We identify a group of epigenetic regulators and nucleoporins that, by maintaining an ‘unresponsive population,' may provide the population with the advantage of diversified bet hedging.
PMCID: PMC3202802  PMID: 21952134
flow cytometry; high-throughput experiments; statistical regression model; transcriptional noise
19.  Guidelines for the design, analysis and interpretation of ‘omics’ data: focus on human endometrium 
Human Reproduction Update  2013;20(1):12-28.
‘Omics’ high-throughput analyses, including genomics, epigenomics, transcriptomics, proteomics and metabolomics, are widely applied in human endometrial studies. Analysis of endometrial transcriptome patterns in physiological and pathophysiological conditions has been to date the most commonly applied ‘omics’ technique in human endometrium. As the technologies improve, proteomics holds the next big promise for this field. The ‘omics’ technologies have undoubtedly advanced our knowledge of human endometrium in relation to fertility and different diseases. Nevertheless, the challenges arising from the vast amount of data generated and the broad variation of ‘omics’ profiling according to different environments and stimuli make it difficult to assess the validity, reproducibility and interpretation of such ‘omics’ data. With the expansion of ‘omics’ analyses in the study of the endometrium, there is a growing need to develop guidelines for the design of studies, and the analysis and interpretation of ‘omics’ data.
Systematic review of the literature in PubMed, and references from relevant articles were investigated up to March 2013.
The current review aims to provide guidelines for future ‘omics’ studies on human endometrium, together with a summary of the status and trends, promise and shortcomings in the high-throughput technologies. In addition, the approaches presented here can be adapted to other areas of high-throughput ‘omics’ studies.
A highly rigorous approach to future studies, based on the guidelines provided here, is a prerequisite for obtaining data on biological systems which can be shared among researchers worldwide and will ultimately be of clinical benefit.
PMCID: PMC3845681  PMID: 24082038
endometrium; epigenomics; genomics; metabolomics; proteomics
20.  Information encoded in a network of inflammation proteins predicts clinical outcome after myocardial infarction 
BMC Medical Genomics  2011;4:59.
Inflammation plays an important role in cardiac repair after myocardial infarction (MI). Nevertheless, the systems-level characterization of inflammation proteins in MI remains incomplete. There is a need to demonstrate the potential value of molecular network-based approaches to translational research. We investigated the interplay of inflammation proteins and assessed network-derived knowledge to support clinical decisions after MI. The main focus is the prediction of clinical outcome after MI.
We assembled My-Inflamome, a network of protein interactions related to inflammation and prognosis in MI. We established associations between network properties, disease biology and capacity to distinguish between prognostic categories. The latter was tested with classification models built on blood-derived microarray data from post-MI patients with different outcomes. This was followed by experimental verification of significant associations.
My-Inflamome is organized into modules highly specialized in different biological processes relevant to heart repair. Highly connected proteins also tend to be high-traffic components. Such bottlenecks together with genes extracted from the modules provided the basis for novel prognostic models, which could not have been uncovered by standard analyses. Modules with significant involvement in transcriptional regulation are targeted by a small set of microRNAs. We suggest a new panel of gene expression biomarkers (TRAF2, SHKBP1 and UBC) with high discriminatory capability. Follow-up validations reported promising outcomes and motivate future research.
This study enhances understanding of the interaction network that executes inflammatory responses in human MI. Network-encoded information can be translated into knowledge with potential prognostic application. Independent evaluations are required to further estimate the clinical relevance of the new prognostic genes.
PMCID: PMC3152897  PMID: 21756327
21.  Genomic Predictors for Recurrence Patterns of Hepatocellular Carcinoma: Model Derivation and Validation 
PLoS Medicine  2014;11(12):e1001770.
In this study, Lee and colleagues develop a genomic predictor that can identify patients at high risk for late recurrence of hepatocellular carcinoma (HCC) and provide new biomarkers for risk stratification.
Typically observed at 2 y after surgical resection, late recurrence is a major challenge in the management of hepatocellular carcinoma (HCC). We aimed to develop a genomic predictor that can identify patients at high risk for late recurrence and assess its clinical implications.
Methods and Findings
Systematic analysis of gene expression data from human liver undergoing hepatic injury and regeneration revealed a 233-gene signature that was significantly associated with late recurrence of HCC. Using this signature, we developed a prognostic predictor that can identify patients at high risk of late recurrence, and tested and validated the robustness of the predictor in patients (n = 396) who underwent surgery between 1990 and 2011 at four centers (210 recurrences during a median of 3.7 y of follow-up). In multivariate analysis, this signature was the strongest risk factor for late recurrence (hazard ratio, 2.2; 95% confidence interval, 1.3–3.7; p = 0.002). In contrast, our previously developed tumor-derived 65-gene risk score was significantly associated with early recurrence (p = 0.005) but not with late recurrence (p = 0.7). In multivariate analysis, the 65-gene risk score was the strongest risk factor for very early recurrence (<1 y after surgical resection) (hazard ratio, 1.7; 95% confidence interval, 1.1–2.6; p = 0.01). The potential significance of STAT3 activation in late recurrence was predicted by gene network analysis and validated later. We also developed and validated 4- and 20-gene predictors from the full 233-gene predictor. The main limitation of the study is that most of the patients in our study were hepatitis B virus–positive. Further investigations are needed to test our prediction models in patients with different etiologies of HCC, such as hepatitis C virus.
Two independently developed predictors reflected well the differences between early and late recurrence of HCC at the molecular level and provided new biomarkers for risk stratification.
Please see later in the article for the Editors' Summary
Editors' Summary
Primary liver cancer—a tumor that starts when a liver cell acquires genetic changes that allow it to grow uncontrollably—is the second-leading cause of cancer-related deaths worldwide, killing more than 600,000 people annually. If hepatocellular cancer (HCC; the most common type of liver cancer) is diagnosed in its early stages, it can be treated by surgically removing part of the liver (resection), by liver transplantation, or by local ablation, which uses an electric current to destroy the cancer cells. Unfortunately, the symptoms of HCC, which include weight loss, tiredness, and jaundice (yellowing of the skin and eyes), are vague and rarely appear until the cancer has spread throughout the liver. Consequently, HCC is rarely diagnosed before the cancer is advanced and untreatable, and has a poor prognosis (likely outcome)—fewer than 5% of patients survive for five or more years after diagnosis. The exact cause of HCC is unclear, but chronic liver (hepatic) injury and inflammation (caused, for example, by infection with hepatitis B virus [HBV] or by alcohol abuse) promote tumor development.
Why Was This Study Done?
Even when it is diagnosed early, HCC has a poor prognosis because it often recurs. Patients treated for HCC can experience two distinct types of tumor recurrence. Early recurrence, which usually happens within the first two years after surgery, arises from the spread of primary cancer cells into the surrounding liver that were left behind during surgery. Late recurrence, which typically happens more than two years after surgery, involves the development of completely new tumors and seems to be the result of chronic liver damage. Because early and late recurrence have different clinical courses, it would be useful to be able to predict which patients are at high risk of which type of recurrence. Given that injury, inflammation, and regeneration seem to prime the liver for HCC development, might the gene expression patterns associated with these conditions serve as predictive markers for the identification of patients at risk of late recurrence of HCC? Here, the researchers develop a genomic predictor for the late recurrence of HCC by examining gene expression patterns in tissue samples from livers that were undergoing injury and regeneration.
What Did the Researchers Do and Find?
By comparing gene expression data obtained from liver biopsies taken before and after liver transplantation or resection and recorded in the US National Center for Biotechnology Information Gene Expression Omnibus database, the researchers identified 233 genes whose expression in liver differed before and after liver injury (the hepatic injury and regeneration, or HIR, signature). Statistical analyses indicate that the expression of the HIR signature in archived tissue samples was significantly associated with late recurrence of HCC in three independent groups of patients, but not with early recurrence (a significant association between two variables is one that is unlikely to have arisen by chance). By contrast, a tumor-derived 65-gene signature previously developed by the researchers was significantly associated with early recurrence but not with late recurrence. Notably, as few as four genes from the HIR signature were sufficient to construct a reliable predictor for late recurrence of HCC. Finally, the researchers report that many of the genes in the HIR signature encode proteins involved in inflammation and cell death, but that others encode proteins involved in cellular growth and proliferation such as STAT3, a protein with a well-known role in liver regeneration.
What Do These Findings Mean?
These findings identify a gene expression signature that was significantly associated with late recurrence of HCC in three independent groups of patients. Because most of these patients were infected with HBV, the ability of the HIR signature to predict late occurrence of HCC may be limited to HBV-related HCC and may not be generalizable to HCC related to other causes. Moreover, the predictive ability of the HIR signature needs to be tested in a prospective study in which samples are taken and analyzed at baseline and patients are followed to see whether their HCC recurs; the current retrospective study analyzed stored tissue samples. Importantly, however, the HIR signature associated with late recurrence and the 65-gene signature associated with early recurrence provide new insights into the biological differences between late and early recurrence of HCC at the molecular level. Knowing about these differences may lead to new treatments for HCC and may help clinicians choose the most appropriate treatments for their patients.
Additional Information
Please access these websites via the online version of this summary at
The US National Cancer Institute provides information about all aspects of cancer, including detailed information for patients and professionals about primary liver cancer (in English and Spanish)
The American Cancer Society also provides information about liver cancer (including information on support programs and services; available in several languages)
The UK National Health Service Choices website provides information about primary liver cancer (including a video about coping with cancer)
Cancer Research UK (a not-for-profit organization) also provides detailed information about primary liver cancer (including information about living with primary liver cancer)
MD Anderson Cancer Center provides information about symptoms, diagnosis, treatment, and prevention of primary liver cancer
MedlinePlus provides links to further resources about liver cancer (in English and Spanish)
PMCID: PMC4275163  PMID: 25536056
22.  Newt-omics: a comprehensive repository for omics data from the newt Notophthalmus viridescens 
Nucleic Acids Research  2011;40(Database issue):D895-D900.
Notophthalmus viridescens, a member of the salamander family, is an excellent model organism to study regenerative processes due to its unique ability to replace lost appendages and to repair internal organs. Molecular insights into regenerative events have been severely hampered by the lack of genomic, transcriptomic and proteomic data, as well as an appropriate database to store such novel information. Here, we describe ‘Newt-omics’, a database which enables researchers to locate, retrieve and store data sets dedicated to the molecular characterization of newts. Newt-omics is a transcript-centred database, based on an Expressed Sequence Tag (EST) data set from the newt, covering ∼50 000 Sanger sequenced transcripts and a set of high-density microarray data generated from regenerating hearts. Newt-omics also contains a large set of peptides identified by mass spectrometry, which was used to validate 13 810 ESTs as true protein coding. Newt-omics is open to implement additional high-throughput data sets without changing the database structure. Via a user-friendly interface, Newt-omics allows access to a huge set of molecular data without the need for prior bioinformatics expertise.
PMCID: PMC3245081  PMID: 22039101
23.  Criteria for the use of omics-based predictors in clinical trials: explanation and elaboration 
BMC Medicine  2013;11:220.
High-throughput ‘omics’ technologies that generate molecular profiles for biospecimens have been extensively used in preclinical studies to reveal molecular subtypes and elucidate the biological mechanisms of disease, and in retrospective studies on clinical specimens to develop mathematical models to predict clinical endpoints. Nevertheless, the translation of these technologies into clinical tests that are useful for guiding management decisions for patients has been relatively slow. It can be difficult to determine when the body of evidence for an omics-based test is sufficiently comprehensive and reliable to support claims that it is ready for clinical use, or even that it is ready for definitive evaluation in a clinical trial in which it may be used to direct patient therapy. Reasons for this difficulty include the exploratory and retrospective nature of many of these studies, the complexity of these assays and their application to clinical specimens, and the many potential pitfalls inherent in the development of mathematical predictor models from the very high-dimensional data generated by these omics technologies. Here we present a checklist of criteria to consider when evaluating the body of evidence supporting the clinical use of a predictor to guide patient therapy. Included are issues pertaining to specimen and assay requirements, the soundness of the process for developing predictor models, expectations regarding clinical study design and conduct, and attention to regulatory, ethical, and legal issues. The proposed checklist should serve as a useful guide to investigators preparing proposals for studies involving the use of omics-based tests. The US National Cancer Institute plans to refer to these guidelines for review of proposals for studies involving omics tests, and it is hoped that other sponsors will adopt the checklist as well.
PMCID: PMC3852338  PMID: 24228635
Analytical validation; Biomarker; Diagnostic test; Genomic classifier; Model validation; Molecular profile; Omics; Personalized medicine; Precision Medicine; Treatment selection
24.  BiofOmics: A Web Platform for the Systematic and Standardized Collection of High-Throughput Biofilm Data 
PLoS ONE  2012;7(6):e39960.
Consortia of microorganisms, commonly known as biofilms, are attracting much attention from the scientific community due to their impact on human activity. As biofilm research grows to be a data-intensive discipline, the need for suitable bioinformatics approaches becomes compelling to manage and validate individual experiments, and also to execute inter-laboratory large-scale comparisons. However, biofilm data is spread across ad hoc, non-standardized individual files and, thus, data interchange among researchers, or any attempt at cross-laboratory experimentation or analysis, is hardly possible or even attempted.
Methodology/Principal Findings
This paper presents BiofOmics, the first publicly accessible Web platform specialized in the management and analysis of data derived from biofilm high-throughput studies. The aim is to promote data interchange across laboratories, implementing collaborative experiments, and enable the development of bioinformatics tools in support of the processing and analysis of the increasing volumes of experimental biofilm data that are being generated. BiofOmics’ data deposition facility enforces data structuring and standardization, supported by controlled vocabulary. Researchers are responsible for the description of the experiments, their results and conclusions. BiofOmics’ curators interact with submitters only to enforce data structuring and the use of controlled vocabulary. Then, BiofOmics’ search facility makes publicly available the profile and data associated with a submitted study so that any researcher can profit from these standardization efforts to compare similar studies, generate new hypotheses to be tested or even extend the conditions experimented in the study.
BiofOmics’ novelty lies in its support to standardized data deposition, the availability of computerizable data files and the free-of-charge dissemination of biofilm studies across the community. Hopefully, this will open promising research possibilities, namely the comparison of results between different laboratories, the reproducibility of methods within and between laboratories, and the development of guidelines and standardized protocols for biofilm formation operating procedures and analytical methods.
PMCID: PMC3386978  PMID: 22768184
25.  Integrative clustering methods for high-dimensional molecular data 
Translational cancer research  2014;3(3):202-216.
High-throughput ‘omic’ data, such as gene expression, DNA methylation and DNA copy number, have played an instrumental role in furthering our understanding of the molecular basis of human health and disease. Because cells with similar morphological characteristics can exhibit entirely different molecular profiles, and because these discrepancies might further our understanding of patient-level variability in clinical outcomes, there is significant interest in the use of high-throughput ‘omic’ data for the identification of novel molecular subtypes of a disease. While numerous clustering methods have been proposed for identifying molecular subtypes, most were developed for single ‘omic’ data types and may not be appropriate when more than one ‘omic’ data type is collected on study subjects. Given that complex diseases, such as cancer, arise as a result of genomic, epigenomic, transcriptomic and proteomic alterations, integrative clustering methods for the simultaneous clustering of multiple ‘omic’ data types have great potential to aid in molecular subtype discovery. Traditionally, ad hoc manual data integration has been performed using the results obtained from the clustering of individual ‘omic’ data types on the same set of patient samples. However, such methods often result in inconsistent assignment of subjects to the molecular cancer subtypes. Recently, several methods that offer a rigorous framework for the simultaneous integration of multiple ‘omic’ data types in a single comprehensive analysis have been proposed in the literature. In this paper, we present a systematic review of existing integrative clustering methods.
PMCID: PMC4166480  PMID: 25243110
Consensus clustering; cophenetic correlation; latent models; mixture models; non-negative matrix factorization
