Related Articles

1.  Three RNA Binding Proteins Form a Complex to Promote Differentiation of Germline Stem Cell Lineage in Drosophila 
PLoS Genetics  2014;10(11):e1004797.
In regenerative tissues, one of the strategies to protect stem cells from genetic aberrations, potentially caused by frequent cell division, is to transiently expand the stem cell daughters before further differentiation. However, failure to exit the transit amplification may lead to overgrowth, and the molecular mechanism governing this regulation remains vague. In a Drosophila mutagenesis screen for factors involved in the regulation of germline stem cell (GSC) lineage, we isolated a mutation in the gene CG32364, which encodes a putative RNA-binding protein (RBP) and is designated as tumorous testis (tut). In tut mutant, spermatogonia fail to differentiate and over-amplify, a phenotype similar to that in mei-P26 mutant. Mei-P26 is a TRIM-NHL tumor suppressor homolog required for the differentiation of GSC lineage. We found that Tut binds preferentially a long isoform of mei-P26 3′UTR, and is essential for the translational repression of mei-P26 reporter. Bam and Bgcn are both RBPs that have also been shown to repress mei-P26 expression. Our genetic analyses indicate that tut, bam, or bgcn is required to repress mei-P26 and to promote the differentiation of GSCs. Biochemically, we demonstrate that Tut, Bam, and Bgcn can form a physical complex in which Bam holds Tut on its N-terminus and Bgcn on its C-terminus. Our in vivo and in vitro evidence illustrate that Tut acts with Bam, Bgcn to accurately coordinate proliferation and differentiation in Drosophila germline stem cell lineage.
Author Summary
In regenerative tissues, the successive differentiation of stem cell lineage is well controlled and coordinated with proper cell proliferation at each differentiation stage. Disruption of the control mechanism can lead to tumor growth or tissue degeneration. The germline stem cell lineage of Drosophila spermatogenesis provides an ideal research model to unravel the genetic network coordinating proliferation and differentiation. In a genetic screen, we identified a male-sterile mutant whose germ cells are under-differentiated and overproliferating. The responsible gene encodes an RNA-binding protein whose target belongs to a tumor suppressor family. We demonstrate that this and two other RNA-binding proteins form a physical and functional unit to ensure the proper differentiation and accurate proliferation of germline stem cell lineage.
PMCID: PMC4238977  PMID: 25412508
2.  Methods for time series analysis of RNA-seq data with application to human Th17 cell differentiation 
Bioinformatics  2014;30(12):i113-i120.
Motivation: Gene expression profiling using RNA-seq is a powerful technique for screening RNA species’ landscapes and their dynamics in an unbiased way. While several advanced methods exist for differential expression analysis of RNA-seq data, proper tools to analyze RNA-seq time-course data have not been proposed.
Results: In this study, we use RNA-seq to measure gene expression during the early human T helper 17 (Th17) cell differentiation and T-cell activation (Th0). To quantify Th17-specific gene expression dynamics, we present a novel statistical methodology, DyNB, for analyzing time-course RNA-seq data. We use non-parametric Gaussian processes to model temporal correlation in gene expression and combine that with negative binomial likelihood for the count data. To account for experiment-specific biases in gene expression dynamics, such as differences in cell differentiation efficiencies, we propose a method to rescale the dynamics between replicated measurements. We develop an MCMC sampling method to make inference of differential expression dynamics between conditions. DyNB identifies several known and novel genes involved in Th17 differentiation. Analysis of differentiation efficiencies revealed consistent patterns in gene expression dynamics between different cultures. We use qRT-PCR to validate differential expression and differentiation efficiencies for selected genes. Comparison of the results with those obtained via traditional timepoint-wise analysis shows that time-course analysis together with time rescaling between cultures identifies differentially expressed genes which would not otherwise be detected.
Availability: An implementation of the proposed computational methods will be available at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC4058923  PMID: 24931974
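The negative binomial likelihood for count data mentioned in the abstract above can be illustrated with a minimal sketch. This is not the DyNB implementation: the Gaussian process prior, the time rescaling, and the MCMC sampler are all omitted, and the mean/dispersion parameterization and the function names `nb_log_pmf` and `log_likelihood` are assumptions made for illustration.

```python
import math

def nb_log_pmf(count, mean, dispersion):
    """Log-probability of `count` under a negative binomial distribution.

    Assumed parameterization: variance = mean + dispersion * mean**2.
    """
    r = 1.0 / dispersion      # "size" parameter
    p = r / (r + mean)        # success probability
    return (math.lgamma(count + r) - math.lgamma(r) - math.lgamma(count + 1)
            + r * math.log(p) + count * math.log(1.0 - p))

def log_likelihood(counts, means, dispersion):
    """Sum of log-probabilities of observed read counts, one mean per time point."""
    return sum(nb_log_pmf(y, m, dispersion) for y, m in zip(counts, means))

# Hypothetical read counts for one gene at three time points.
counts = [10, 12, 9]
good_fit = log_likelihood(counts, [10.0, 11.0, 10.0], dispersion=0.1)
poor_fit = log_likelihood(counts, [100.0, 100.0, 100.0], dispersion=0.1)
print(good_fit > poor_fit)
```

In a time-course model of this kind, the per-time-point means would come from a latent smooth function rather than being supplied by hand as they are here.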
3.  Computational Methods for Estimation of Cell Cycle Phase Distributions of Yeast Cells 
Two computational methods for estimating the cell cycle phase distribution of a budding yeast (Saccharomyces cerevisiae) cell population are presented. The first one is a nonparametric method that is based on the analysis of DNA content in the individual cells of the population. The DNA content is measured with a fluorescence-activated cell sorter (FACS). The second method is based on budding index analysis. An automated image analysis method is presented for the task of detecting the cells and buds. The proposed methods can be used to obtain quantitative information on the cell cycle phase distribution of a budding yeast S. cerevisiae population. They therefore provide a solid basis for obtaining the complementary information needed in deconvolution of gene expression data. As a case study, both methods are tested with data that were obtained in a time series experiment with S. cerevisiae. The details of the time series experiment as well as the image and FACS data obtained in the experiment can be found in the online additional material at
PMCID: PMC3171340  PMID: 18354733
4.  A protein–protein interaction guided method for competitive transcription factor binding improves target predictions 
Nucleic Acids Research  2009;37(22):e146.
An important milestone in revealing cells' functions is to build a comprehensive understanding of transcriptional regulation processes. These processes are largely regulated by transcription factors (TFs) binding to DNA sites. Several TF binding site (TFBS) prediction methods have been developed, but they usually model the binding of a single TF at a time, although a few methods for predicting the binding of multiple TFs also exist. In this article, we propose a probabilistic model that predicts binding of several TFs simultaneously. Our method explicitly models the competitive binding between TFs and uses the prior knowledge of existing protein–protein interactions (PPIs), which mimics the situation in the nucleus. Modeling DNA binding for multiple TFs improves the accuracy of binding site prediction remarkably when compared with other programs and with the cases where individual binding prediction results of separate TFs have been combined. Traditional TFBS prediction methods usually predict an overwhelming number of false positives. This lack of specificity is overcome remarkably with our competitive binding prediction method. In addition, previously unpredictable binding sites can be detected with the help of PPIs. Source codes are available at∼harrila/.
PMCID: PMC2794167  PMID: 19786498
5.  In silico microdissection of microarray data from heterogeneous cell populations 
BMC Bioinformatics  2005;6:54.
Very few analytical approaches have been reported to resolve the variability in microarray measurements stemming from sample heterogeneity. For example, tissue samples used in cancer studies are usually contaminated with the surrounding or infiltrating cell types. This heterogeneity in the sample preparation hinders further statistical analysis, significantly so if different samples contain different proportions of these cell types. Thus, sample heterogeneity can result in the identification of differentially expressed genes that may be unrelated to the biological question being studied. Similarly, irrelevant gene combinations can be discovered in the case of gene expression based classification.
We propose a computational framework for removing the effects of sample heterogeneity by "microdissecting" microarray data in silico. The computational method provides estimates of the expression values of the pure (non-heterogeneous) cell samples. The inversion of the sample heterogeneity can be facilitated by providing accurate estimates of the mixing percentages of different cell types in each measurement. For those cases where no such information is available, we develop an optimization-based method for joint estimation of the mixing percentages and the expression values of the pure cell samples. We also consider the problem of selecting the correct number of cell types.
The efficiency of the proposed methods is illustrated by applying them to carefully controlled cDNA microarray data obtained from heterogeneous samples. The results demonstrate that the methods are capable of reconstructing both the sample and cell type specific expression values from heterogeneous mixtures and that the mixing percentages of different cell types can also be estimated. Furthermore, a general purpose model selection method can be used to select the correct number of cell types.
PMCID: PMC1274251  PMID: 15766384
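The idea of inverting sample heterogeneity when the mixing percentages are known can be sketched as an ordinary least-squares problem. The sketch below is an illustration only, not the authors' method: it assumes a linear mixing model with exactly two cell types, handles one gene at a time, and the function name `estimate_pure_expression` and all numbers are hypothetical.

```python
def estimate_pure_expression(mixing, measured):
    """Least-squares estimate of one gene's expression in two pure cell types.

    mixing   -- list of (a_i, b_i) mixing fractions, one pair per sample
    measured -- list of measured expression values, one per sample
    Assumed model: measured_i = a_i * s_a + b_i * s_b (+ noise).
    """
    # Entries of the 2x2 normal equations.
    saa = sum(a * a for a, _ in mixing)
    sab = sum(a * b for a, b in mixing)
    sbb = sum(b * b for _, b in mixing)
    say = sum(a * y for (a, _), y in zip(mixing, measured))
    sby = sum(b * y for (_, b), y in zip(mixing, measured))
    det = saa * sbb - sab * sab
    if abs(det) < 1e-12:
        raise ValueError("mixing fractions do not identify the two cell types")
    s_a = (sbb * say - sab * sby) / det
    s_b = (saa * sby - sab * say) / det
    return s_a, s_b

# Hypothetical example: three samples with different tumor/stroma proportions.
mixing = [(0.8, 0.2), (0.5, 0.5), (0.1, 0.9)]
measured = [84.0, 60.0, 28.0]   # generated from pure values 100 and 20
print(estimate_pure_expression(mixing, measured))
```

The check on `det` reflects a point made in the abstract: if every sample has the same proportions, the pure-type values cannot be separated, so the samples need to differ in composition.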
6.  Patterns of basal signaling heterogeneity can distinguish cellular populations with different drug sensitivities 
Non small cell lung cancer H460 clones exhibit a high degree of heterogeneity in signaling states.
Clones with similar patterns of basal signaling heterogeneity have similar paclitaxel sensitivities.
Models of signaling heterogeneity among the clones can be used to classify sensitivity to paclitaxel for other cancer populations.
A high degree of phenotypic diversity has been classically observed among cancer cells, even within a single tumor (Heppner, 1984; Anderson et al, 2006; Ichim and Wells, 2006; Campbell and Polyak, 2007). Importantly, not all cancer cells contribute equally to disease progression or respond equally to therapeutic intervention (Campbell and Polyak, 2007). This heterogeneity has traditionally been viewed as an impediment to efficient diagnosis and treatment. Understanding the relevance of cellular diversity to cancer requires methods for relating patterns of phenotypic heterogeneity to functional outcomes, such as drug sensitivity. Recent advances in fluorescence microscopy image-based analysis have enabled quantitative single-cell measurements of the activation and (co-)localization of signaling molecules within large cellular populations (Boland and Murphy, 2001; Perlman et al, 2004). Here, we apply this technology to explore the extent to which patterns of basal signaling heterogeneity, present within cancer populations before treatment, reveal information about population-level response to drug perturbation.
To investigate basal cell signaling heterogeneity among a collection of cancer populations having minimal exogenous differences, such as those due to environment, cell type, and genetic background, we generated a collection of 49 low-passage clonal populations from the highly metastatic nonsmall cell lung cancer cell line H460 (Kozaki et al, 2000). We chose to observe patterns of spatial organization and activation for multiple components from diverse signaling pathways associated with cancer (marker sets 1–4: DNA/pSTAT3/pPTEN; DNA/pERK/pP38; DNA/E-cadherin/β-catenin/pGSK3; DNA/pAkt/H3K9-Ac).
We identified an objective set of signaling stereotypes from each marker set based on a probabilistic description of the distribution of cells in the feature space. For each marker set, a ‘reference' set of representative cells was sampled from all 50 H460 cancer populations. Then, each reference set was represented as a mixture of subpopulations modeled as Gaussian distributions with means centered on distinct, ‘stereotyped' signaling states (Slack et al, 2008). Our quantitative analysis suggested that a small collection of signaling stereotypes was sufficient to characterize the complexity of observed cellular phenotypes among all clones. For simplicity, we chose to use five subpopulations to model cellular heterogeneity in each marker set.
For each clone, we computed the fraction of cells in each of the identified subpopulations (Figure 2, scatter plots). Estimation of these fractions allowed us to represent each clone as a probabilistic ensemble of subpopulations. Visual differences among the clones (Figure 2, thumbnail images) were reflected by clear differences in subpopulation mixtures (Figure 2, scatter plots). To compare the subpopulation mixtures of each clone to the parent, a ‘subpopulation enrichment' profile vector was computed. The vector measured the log-fold change between the clone and the H460 parent population for each subpopulation (Figure 2, heat map).
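The subpopulation enrichment profile described above can be sketched as follows. This is a simplified illustration, not the authors' code: it assumes each cell has already been assigned to one subpopulation (hard labels rather than mixture-model probabilities), uses a base-2 logarithm, and adds a small pseudocount to avoid taking the log of zero. The function names and the pseudocount value are hypothetical.

```python
import math
from collections import Counter

def subpopulation_fractions(labels, n_subpops):
    """Fraction of cells assigned to each subpopulation index 0..n_subpops-1."""
    counts = Counter(labels)
    total = len(labels)
    return [counts.get(k, 0) / total for k in range(n_subpops)]

def enrichment_profile(clone_labels, parent_labels, n_subpops=5, pseudo=1e-3):
    """Log2 fold change of each subpopulation fraction, clone versus parent."""
    clone = subpopulation_fractions(clone_labels, n_subpops)
    parent = subpopulation_fractions(parent_labels, n_subpops)
    return [math.log2((c + pseudo) / (p + pseudo))
            for c, p in zip(clone, parent)]

# Hypothetical labels: the clone is enriched in subpopulation 0.
parent = [0, 1, 1, 2, 3, 4, 4, 2]
clone = [0, 0, 0, 0, 1, 2, 3, 4]
print(enrichment_profile(clone, parent))
```

Each clone is then summarized by a short vector (five entries here), which is what makes comparison and clustering of clones tractable.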
We applied hierarchical clustering to group clones based on the similarity of their subpopulation enrichment profiles (Figure 2). Clustering by subpopulation enrichment profiles revealed only a small number of distinct patterns (or ‘signatures') of subpopulation mixtures (Figure 2, dendrogram and heat map). Thus, parameterization of observed cellular heterogeneity using subpopulation enrichment profiles succinctly encapsulated the apparent complexity of cancer cell phenotypes, and further allowed comparison of clonal populations at a resolution greater than provided by population means.
We next assessed the degree to which clones with distinct patterns of heterogeneity had distinct responses to the drug paclitaxel. We used a multidimensional scaling (Borg and Groenen, 1997) plot to visualize similarity among the clones and annotated each clone with the index of drug sensitivity. This visualization revealed striking geometric separation in ‘profile space' of paclitaxel-sensitive from paclitaxel-nonsensitive clones for each marker set (Figure 3A, green versus red and black circles). The significance of separation was further confirmed by machine learning-based classification studies. Thus heterogeneity of basal cellular signaling states contained information that could be used to predict sensitivity to drug treatment.
Our approach is general, and makes heterogeneity a computable property of cellular populations. Interrogation at subpopulation-resolution facilitated a dramatic reduction in the observed phenotypic complexity of cancer populations, yet retained sufficient biological information to identify drug responses. Our work suggests that rigorous analysis of cancer heterogeneity can provide a new resolution at which to match disease to more effective therapies.
Phenotypic heterogeneity has been widely observed in cellular populations. However, the extent to which heterogeneity contains biologically or clinically important information is not well understood. Here, we investigated whether patterns of basal signaling heterogeneity, in untreated cancer cell populations, could distinguish cellular populations with different drug sensitivities. We modeled cellular heterogeneity as a mixture of stereotyped signaling states, identified based on colocalization patterns of activated signaling molecules from microscopy images. We found that patterns of heterogeneity could be used to separate the most sensitive and resistant populations to paclitaxel within a set of H460 lung cancer clones and within the NCI-60 panel of cancer cell lines, but not for a set of less heterogeneous, immortalized noncancer human bronchial epithelial cell (HBEC) clones. Our results suggest that patterns of signaling heterogeneity, characterized as ensembles of a small number of distinct phenotypic states, can reveal functional differences among cellular populations.
PMCID: PMC2890326  PMID: 20461076
cancer; heterogeneity; multivariate analysis; signaling; systems biology
7.  NanoMiner — Integrative Human Transcriptomics Data Resource for Nanoparticle Research 
PLoS ONE  2013;8(7):e68414.
The potential impact of nanoparticles on the environment and on human health has attracted considerable interest worldwide. The amount of transcriptomics data, in which tissues and cell lines are exposed to nanoparticles, increases year by year. In addition to the importance of the original findings, this data can have value in a broader context when combined with other previously acquired and published results. In order to facilitate the efficient usage of the data, we have developed the NanoMiner web resource, which contains 404 human transcriptome samples exposed to various types of nanoparticles. All the samples in NanoMiner have been annotated, preprocessed and normalized using standard methods that ensure the quality of the data analyses and enable the users to utilize the database systematically across the different experimental setups and platforms. With NanoMiner it is possible to 1) search and plot the expression profiles of one or several genes of interest, 2) cluster the samples within the datasets, 3) find differentially expressed genes in various nanoparticle studies, 4) detect the nanoparticles causing differential expression of selected genes, 5) analyze enriched Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways and Gene Ontology (GO) terms for the detected genes and 6) search the expression values and differential expressions of the genes belonging to a specific KEGG pathway or Gene Ontology. In sum, the NanoMiner database is a valuable collection of microarray data which can also be used as a data repository for future analyses.
PMCID: PMC3709991  PMID: 23874618
8.  A mixture model for expression deconvolution from RNA-seq in heterogeneous tissues 
BMC Bioinformatics  2013;14(Suppl 5):S11.
RNA-seq, a next-generation sequencing based method for transcriptome analysis, is rapidly emerging as the method of choice for comprehensive transcript abundance estimation. The accuracy of RNA-seq can be highly impacted by the purity of samples. A prominent, outstanding problem in RNA-seq is how to estimate transcript abundances in heterogeneous tissues, where a sample is composed of more than one cell type and the inhomogeneity can substantially confound the transcript abundance estimation of each individual cell type. Although experimental methods have been proposed to dissect multiple distinct cell types, computationally "deconvoluting" heterogeneous tissues provides an attractive alternative, since it keeps the tissue sample as well as the subsequent molecular content yield intact.
Here we propose a probabilistic model-based approach, Transcript Estimation from Mixed Tissue samples (TEMT), to estimate the transcript abundances of each cell type of interest from RNA-seq data of heterogeneous tissue samples. TEMT incorporates positional and sequence-specific biases, and its online EM algorithm only requires a runtime proportional to the data size and a small constant memory. We test the proposed method on both simulation data and recently released ENCODE data, and show that TEMT significantly outperforms current state-of-the-art methods that do not take tissue heterogeneity into account. Currently, TEMT only resolves the tissue heterogeneity resulting from two cell types, but it can be extended to handle tissue heterogeneity resulting from multiple cell types. TEMT is written in Python, and is freely available at
The probabilistic model-based approach proposed here provides a new method for analyzing RNA-seq data from heterogeneous tissue samples. By applying the method to both simulation data and ENCODE data, we show that explicitly accounting for tissue heterogeneity can significantly improve the accuracy of transcript abundance estimation.
PMCID: PMC3622628  PMID: 23735186
9.  EPEPT: A web service for enhanced P-value estimation in permutation tests 
BMC Bioinformatics  2011;12:411.
In computational biology, permutation tests have become a widely used tool to assess the statistical significance of an event under investigation. However, the common way of computing the P-value, which expresses the statistical significance, requires a very large number of permutations when small (and thus interesting) P-values are to be accurately estimated. This is computationally expensive and often infeasible. Recently, we proposed an alternative estimator, which requires far fewer permutations compared to the standard empirical approach while still reliably estimating small P-values [1].
The proposed P-value estimator has been enriched with additional functionalities and is made available to the general community through a public website and web service, called EPEPT. This means that the EPEPT routines can be accessed not only via a website, but also programmatically using any programming language that can interact with the web. Examples of web service clients in multiple programming languages can be downloaded. Additionally, EPEPT accepts data of various common experiment types used in computational biology. For these experiment types EPEPT first computes the permutation values and then performs the P-value estimation. Finally, the source code of EPEPT can be downloaded.
Different types of users, such as biologists, bioinformaticians and software engineers, can use the method in an appropriate and simple way.
PMCID: PMC3277916  PMID: 22024252
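The "standard empirical approach" that the abstract above contrasts with its enhanced estimator can be sketched in a few lines. This is the common add-one empirical estimator, not the EPEPT method; the test statistic (absolute difference of group means), the function names, and the example data are assumptions chosen for illustration.

```python
import random

def empirical_p_value(observed, permuted_stats):
    """Add-one empirical P-value: (exceedances + 1) / (permutations + 1)."""
    exceed = sum(1 for s in permuted_stats if s >= observed)
    return (exceed + 1) / (len(permuted_stats) + 1)

def mean_difference_permutation_test(x, y, n_perm=999, seed=0):
    """Two-sample permutation test on the absolute difference of means."""
    rng = random.Random(seed)
    observed = abs(sum(x) / len(x) - sum(y) / len(y))
    pooled = list(x) + list(y)
    stats = []
    for _ in range(n_perm):
        rng.shuffle(pooled)                     # random relabeling of samples
        px, py = pooled[:len(x)], pooled[len(x):]
        stats.append(abs(sum(px) / len(px) - sum(py) / len(py)))
    return empirical_p_value(observed, stats)

# Hypothetical expression values for two groups.
group_a = [5.1, 4.9, 5.3, 5.0, 5.2]
group_b = [6.8, 7.1, 6.9, 7.3, 7.0]
print(mean_difference_permutation_test(group_a, group_b))
```

With this estimator the smallest attainable P-value is 1 / (n_perm + 1), which is why accurately estimating very small P-values this way requires a very large number of permutations.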
10.  The Human TUT1 Nucleotidyl Transferase as a Global Regulator of microRNA Abundance 
PLoS ONE  2013;8(7):e69630.
Post-transcriptional modifications of miRNAs with 3′ non-templated nucleotide additions (NTA) are a common phenomenon, and for a handful of miRNAs the additions have been demonstrated to modulate miRNA stability. However, it is unknown for the vast majority of miRNAs whether nucleotide additions are associated with changes in miRNA expression levels. We previously showed that miRNA 3′ additions are regulated by multiple nucleotidyl transferase enzymes. Here we examine the changes in abundance of miRNAs that exhibit altered 3′ NTA following the suppression of a panel of nucleotidyl transferases in cancer cell lines. Among the miRNAs examined, those with increased 3′ additions showed a significant decrease in abundance. More specifically, miRNAs that gained a 3′ uridine were associated with the greatest decrease in expression, consistent with a model in which 3′ uridylation influences miRNA stability. We also observed that suppression of one nucleotidyl transferase, TUT1, resulted in a global decrease in miRNA levels of approximately 40% as measured by qRT-PCR-based miRNA profiling. The mechanism of this global miRNA suppression appears to be indirect, as it occurred irrespective of changes in 3′ nucleotide addition. Also, expression of miRNA primary transcripts did not decrease following TUT1 knockdown, indicating that the mechanism is post-transcriptional. In conclusion, our results suggest that TUT1 affects miRNAs through both a direct effect on 3′ nucleotide additions to specific miRNAs and a separate, indirect effect on miRNA abundance more globally.
PMCID: PMC3715485  PMID: 23874977
11.  Dynamic Training Volume: A Construct of Both Time Under Tension and Volume Load 
The purpose of this study was to investigate the effects of three different weight training protocols, that varied in the way training volume was measured, on acute muscular fatigue. Ten resistance-trained males performed all three protocols which involved dynamic constant resistance exercise of the elbow flexors. Protocol A provided a standard for the time the muscle group was under tension (TUT) and volume load (VL), expressed as the product of the total number of repetitions and the load that was lifted. Protocol B involved 40% of the TUT but the same VL compared to protocol A; protocol C was equated with protocol A for TUT but only involved 50% of the VL. Fatigue was assessed by changes in maximum voluntary isometric force and integrated electromyography (iEMG) between the pre- and post-training protocols. The results of the study showed that, when equated for VL, greater TUT produced greater overall muscular fatigue (p ≤ 0.001) as reflected by the reduction in the force generating capability of the muscle. When the protocols were equated for TUT, greater VL (p ≤ 0.01) resulted in greater overall muscular fatigue. All three protocols resulted in significant decreases in iEMG (p ≤ 0.05) but they were not significantly different from each other. It was concluded that, because of the importance of training volume to neuromuscular adaptation, the training volume needs to be clearly described when designing resistance training programs.
Key Points
Increase in either time under tension (TUT) or volume load (VL) increases the acute fatigue response, despite being equated for volume (by another method).
A potential discrepancy in training volume may be present with training parameters that fail to control for either TUT or VL.
Neural fatigue may be a contributing factor to the development of muscular fatigue but is not influenced by various methods of calculating volume such as TUT or VL.
PMCID: PMC3861774  PMID: 24357968
Resistance training; maximal voluntary contraction; fatigue; electromyography
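The two volume measures compared in the abstract above can be made concrete with a small worked example. Volume load follows the definition given there (total repetitions times load). Computing TUT as repetitions times seconds per repetition is an assumption, and every number below is hypothetical, chosen only to reproduce the stated ratios between protocols (B: 40% of the TUT at the same VL; C: the same TUT at 50% of the VL).

```python
def volume_load(total_reps, load_kg):
    """Volume load: total repetitions multiplied by the load lifted."""
    return total_reps * load_kg

def time_under_tension(total_reps, seconds_per_rep):
    """Time under tension, assuming a fixed tempo per repetition."""
    return total_reps * seconds_per_rep

# Hypothetical protocols: (total reps, load in kg, seconds per rep).
protocols = {
    "A": (30, 20.0, 5.0),   # reference protocol
    "B": (30, 20.0, 2.0),   # same VL as A, 40% of A's TUT
    "C": (30, 10.0, 5.0),   # same TUT as A, 50% of A's VL
}

for name, (reps, load, tempo) in protocols.items():
    print(name, volume_load(reps, load), time_under_tension(reps, tempo))
```

Two sessions with identical volume load can therefore differ substantially in time under tension, and vice versa, which is the discrepancy the key points warn about.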
12.  Moving from Data on Deaths to Public Health Policy in Agincourt, South Africa: Approaches to Analysing and Understanding Verbal Autopsy Findings 
PLoS Medicine  2010;7(8):e1000325.
Peter Byass and colleagues compared two methods of assessing data from verbal autopsies, review by physicians or probabilistic modeling, and show that probabilistic modeling is the most efficient means of analyzing these data
Cause of death data are an essential source for public health planning, but their availability and quality are lacking in many parts of the world. Interviewing family and friends after a death has occurred (a procedure known as verbal autopsy) provides a source of data where deaths otherwise go unregistered; but sound methods for interpreting and analysing the ensuing data are essential. Two main approaches are commonly used: either physicians review individual interview material to arrive at probable cause of death, or probabilistic models process the data into likely cause(s). Here we compare and contrast these approaches as applied to a series of 6,153 deaths which occurred in a rural South African population from 1992 to 2005. We do not attempt to validate either approach in absolute terms.
Methods and Findings
The InterVA probabilistic model was applied to a series of 6,153 deaths which had previously been reviewed by physicians. Physicians used a total of 250 cause-of-death codes, many of which occurred very rarely, while the model used 33. Cause-specific mortality fractions, overall and for population subgroups, were derived from the model's output, and the physician causes coded into comparable categories. The ten highest-ranking causes accounted for 83% and 88% of all deaths by physician interpretation and probabilistic modelling respectively, and eight of the highest ten causes were common to both approaches. Top-ranking causes of death were classified by population subgroup and period, as done previously for the physician-interpreted material. Uncertainty around the cause(s) of individual deaths was recognised as an important concept that should be reflected in overall analyses. One notably discrepant group involved pulmonary tuberculosis as a cause of death in adults aged over 65, and these cases are discussed in more detail, but the group only accounted for 3.5% of overall deaths.
There were no differences between physician interpretation and probabilistic modelling that might have led to substantially different public health policy conclusions at the population level. Physician interpretation was more nuanced than the model, for example in identifying cancers at particular sites, but did not capture the uncertainty associated with individual cases. Probabilistic modelling was substantially cheaper and faster, and completely internally consistent. Both approaches characterised the rise of HIV-related mortality in this population during the period observed, and reached similar findings on other major causes of mortality. For many purposes probabilistic modelling appears to be the best available means of moving from data on deaths to public health actions.
Editors' Summary
Whenever someone dies in a developed country, the cause of death is determined by a doctor and entered into a “vital registration system,” a record of all the births and deaths in that country. Public-health officials and medical professionals use this detailed and complete information about causes of death to develop public-health programs and to monitor how these programs affect the nation's health. Unfortunately, in many developing countries dying people are not attended by doctors and vital registration systems are incomplete. In most African countries, for example, less than one-quarter of deaths are recorded in vital registration systems. One increasingly important way to improve knowledge about the patterns of death in developing countries is “verbal autopsy” (VA). Using a standard form, trained personnel ask relatives and caregivers about the symptoms that the deceased had before his/her death and about the circumstances surrounding the death. Physicians then review these forms and assign a specific cause of death from a shortened version of the International Classification of Diseases, a list of codes for hundreds of diseases.
Why Was This Study Done?
Physician review of VA forms is time-consuming and expensive. Consequently, computer-based, “probabilistic” models have been developed that process the VA data and provide a likely cause of death. These models are faster and cheaper than physician review of VAs and, because they do not rely on the views of local doctors about the likely causes of death, they are more internally consistent. But are physician review and probabilistic models equally sound ways of interpreting VA data? In this study, the researchers compare and contrast the interpretation of VA data by physician review and by a probabilistic model called the InterVA model by applying these two approaches to the deaths that occurred in Agincourt, a rural region of northeast South Africa, between 1992 and 2005. The Agincourt health and sociodemographic surveillance system is a member of the INDEPTH Network, a global network that is evaluating the health and demographic characteristics (for example, age, gender, and education) of populations in low- and middle-income countries over several years.
What Did the Researchers Do and Find?
The researchers applied the InterVA probabilistic model to 6,153 deaths that had been previously reviewed by physicians. They grouped the 250 cause-of-death codes used by the physicians into categories comparable with the 33 cause-of-death codes used by the InterVA model and derived cause-specific mortality fractions (the proportions of the population dying from specific causes) for the whole population and for subgroups (for example, deaths in different age groups and deaths occurring over specific periods of time) from the output of both approaches. The ten highest-ranking causes of death accounted for 83% and 88% of all deaths by physician interpretation and by probabilistic modelling, respectively. Eight of the most frequent causes of death (HIV, tuberculosis, chronic heart conditions, diarrhea, pneumonia/sepsis, transport-related accidents, homicides, and indeterminate) were common to both interpretation methods. Both methods coded about a third of all deaths as indeterminate, often because of incomplete VA data. Generally, there was close agreement between the methods for the five principal causes of death for each age group and for each period of time. One notable discrepancy was pulmonary (lung) tuberculosis in adults aged over 65, which accounted for 6.4% and 21.3% of deaths in this age group according to the physicians and to the model, respectively. However, these deaths accounted for only 3.5% of all the deaths.
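Cause-specific mortality fractions of the kind described above can be sketched as a simple aggregation. The input format below (one dictionary of cause likelihoods per death, with any unassigned probability counted as "indeterminate") is an assumption made for illustration and is not the actual InterVA output format; the function name and the example data are hypothetical.

```python
from collections import defaultdict

def cause_specific_mortality_fractions(deaths):
    """Aggregate per-death cause likelihoods into population-level fractions.

    deaths -- list of dicts mapping cause name to likelihood (summing to <= 1);
              any remaining probability mass is counted as "indeterminate".
    """
    totals = defaultdict(float)
    for likelihoods in deaths:
        assigned = sum(likelihoods.values())
        for cause, p in likelihoods.items():
            totals[cause] += p
        totals["indeterminate"] += max(0.0, 1.0 - assigned)
    n = len(deaths)
    return {cause: total / n for cause, total in totals.items()}

# Hypothetical verbal autopsy results for four deaths.
deaths = [
    {"HIV": 1.0},
    {"HIV": 0.5, "tuberculosis": 0.5},
    {"pneumonia/sepsis": 0.6},
    {},                                  # no cause could be assigned
]
csmf = cause_specific_mortality_fractions(deaths)
for cause, fraction in sorted(csmf.items(), key=lambda kv: -kv[1]):
    print(cause, round(fraction, 3))
```

Keeping fractional likelihoods rather than a single cause per death is one way to carry the uncertainty around individual deaths through to the population-level figures.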
What Do These Findings Mean?
These findings reveal no differences between the cause-specific mortality fractions determined from VA data by physician interpretation and by probabilistic modelling that might have led to substantially different public-health policy programmes being initiated in this population. Importantly, both approaches clearly chart the rise of HIV-related mortality in this South African population between 1992 and 2005 and reach similar findings on other major causes of mortality. The researchers note that, although preparing the amount of VA data considered here for entry into the probabilistic model took several days, the model itself runs very quickly and always gives consistent answers. Given these findings, the researchers conclude that in many settings probabilistic modeling represents the best means of moving from VA data to public-health actions.
Additional Information
Please access these Web sites via the online version of this summary at
The importance of accurate data on death is further discussed in a PLoS Medicine Perspective by Colin Mathers and Ties Boerma
The World Health Organization (WHO) provides information on the vital registration of deaths and on the International Classification of Diseases; the WHO Health Metrics Network is a global collaboration focused on improving sources of vital statistics; and the WHO Global Health Observatory brings together core health statistics for WHO member states
The INDEPTH Network is a global collaboration that is collecting health statistics from developing countries; it provides more information about the Agincourt health and socio-demographic surveillance system and access to standard VA forms
Information on the Agincourt health and sociodemographic surveillance system is available on the University of Witwatersrand Web site
The InterVA Web site provides resources for interpreting verbal autopsy data and the Umeå Centre for Global Health Research, where the InterVA model was developed, is found at
A recent PLoS Medicine Essay by Peter Byass, lead author of this study, discusses The Unequal World of Health Data
PMCID: PMC2923087  PMID: 20808956
13.  POMO - Plotting Omics analysis results for Multiple Organisms 
BMC Genomics  2013;14:918.
Systems biology experiments studying different topics and organisms produce thousands of data values across different types of genomic data. Further, data mining analyses are yielding ranked and heterogeneous results and association networks distributed over the entire genome. The visualization of these results is often difficult and standalone web tools allowing for custom inputs and dynamic filtering are limited.
We have developed POMO, an interactive web-based application to visually explore omics data analysis results and associations in circular, network and grid views. The circular graph represents the chromosome lengths as perimeter segments, with a reference outer ring such as the cytobands for human. The inner arcs between nodes represent the uploaded network. Further, multiple annotation rings, for example a depiction of gene copy number changes, can be uploaded as text files and represented as bar, histogram or heatmap rings. POMO has built-in references for human, mouse, nematode, fly, yeast, zebrafish, rice, tomato, Arabidopsis, and Escherichia coli. In addition, POMO provides custom options that allow integrated plotting of unsupported strains or closely related species associations, such as human and mouse orthologs or two yeast wild types, studied together within a single analysis. The web application also supports interactive label and weight filtering. Every iteratively filtered result in POMO can be exported as an image file and a text file for sharing or direct future input.
The POMO web application is a unique tool for omics data analysis, which can be used to visualize and filter the genome-wide networks in the context of chromosomal locations as well as multiple network layouts. With the several illustration and filtering options the tool supports the analysis and visualization of any heterogeneous omics data analysis association results for many organisms. POMO is freely available and does not require any installation or registration.
PMCID: PMC3880012  PMID: 24365393
Omics; Association; Visualization; Ortholog; Phenolog; Genome-wide; Network; Model organism
14.  A hierarchical Naïve Bayes Model for handling sample heterogeneity in classification problems: an application to tissue microarrays 
BMC Bioinformatics  2006;7:514.
Uncertainty often affects molecular biology experiments and data for different reasons. Heterogeneity of gene or protein expression within the same tumor tissue is an example of biological uncertainty which should be taken into account when molecular markers are used in decision making. Tissue Microarray (TMA) experiments allow for large scale profiling of tissue biopsies, investigating protein patterns characterizing specific disease states. TMA studies deal with multiple sampling of the same patient, and therefore with multiple measurements of same protein target, to account for possible biological heterogeneity. The aim of this paper is to provide and validate a classification model taking into consideration the uncertainty associated with measuring replicate samples.
We propose an extension of the well-known Naïve Bayes classifier, which accounts for biological heterogeneity in a probabilistic framework, relying on Bayesian hierarchical models. The model, which can be efficiently learned from the training dataset, exploits a closed-form of classification equation, thus providing no additional computational cost with respect to the standard Naïve Bayes classifier. We validated the approach on several simulated datasets comparing its performances with the Naïve Bayes classifier. Moreover, we demonstrated that explicitly dealing with heterogeneity can improve classification accuracy on a TMA prostate cancer dataset.
The proposed Hierarchical Naïve Bayes classifier can be conveniently applied in problems where within sample heterogeneity must be taken into account, such as TMA experiments and biological contexts where several measurements (replicates) are available for the same biological sample. The performance of the new approach is better than the standard Naïve Bayes model, in particular when the within sample heterogeneity is different in the different classes.
PMCID: PMC1698579  PMID: 17125514
15.  Nuclear Receptor Expression Defines a Set of Prognostic Biomarkers for Lung Cancer 
PLoS Medicine  2010;7(12):e1000378.
David Mangelsdorf and colleagues show that nuclear receptor expression is strongly associated with clinical outcomes of lung cancer patients, and this expression profile is a potential prognostic signature for lung cancer patient survival time, particularly for individuals with early stage disease.
The identification of prognostic tumor biomarkers that also would have potential as therapeutic targets, particularly in patients with early stage disease, has been a long sought-after goal in the management and treatment of lung cancer. The nuclear receptor (NR) superfamily, which is composed of 48 transcription factors that govern complex physiologic and pathophysiologic processes, could represent a unique subset of these biomarkers. In fact, many members of this family are the targets of already identified selective receptor modulators, providing a direct link between individual tumor NR quantitation and selection of therapy. The goal of this study, which begins this overall strategy, was to investigate the association between mRNA expression of the NR superfamily and the clinical outcome for patients with lung cancer, and to test whether a tumor NR gene signature provided useful information (over available clinical data) for patients with lung cancer.
Methods and Findings
Using quantitative real-time PCR to study NR expression in 30 microdissected non-small-cell lung cancers (NSCLCs) and their pair-matched normal lung epithelium, we found great variability in NR expression among patients' tumor and non-involved lung epithelium, found a strong association between NR expression and clinical outcome, and identified an NR gene signature from both normal and tumor tissues that predicted patient survival time and disease recurrence. The NR signature derived from the initial 30 NSCLC samples was validated in two independent microarray datasets derived from 442 and 117 resected lung adenocarcinomas. The NR gene signature was also validated in 130 squamous cell carcinomas. The prognostic signature in tumors could be distilled to expression of two NRs, short heterodimer partner and progesterone receptor, as single gene predictors of NSCLC patient survival time, including for patients with stage I disease. Of equal interest, the studies of microdissected histologically normal epithelium and matched tumors identified expression in normal (but not tumor) epithelium of NGFIB3 and mineralocorticoid receptor as single gene predictors of good prognosis.
NR expression is strongly associated with clinical outcomes for patients with lung cancer, and this expression profile provides a unique prognostic signature for lung cancer patient survival time, particularly for those with early stage disease. This study highlights the potential use of NRs as a rational set of therapeutically tractable genes as theragnostic biomarkers, and specifically identifies short heterodimer partner and progesterone receptor in tumors, and NGFIB3 and MR in non-neoplastic lung epithelium, for future detailed translational study in lung cancer.
Please see later in the article for the Editors' Summary
Editors' Summary
Lung cancer, the most common cause of cancer-related death, kills 1.3 million people annually. Most lung cancers are “non-small-cell lung cancers” (NSCLCs), and most are caused by smoking. Exposure to chemicals in smoke causes changes in the genes of the cells lining the lungs that allow the cells to grow uncontrollably and to move around the body. How NSCLC is treated and responds to treatment depends on its “stage.” Stage I tumors, which are small and confined to the lung, are removed surgically, although chemotherapy is also sometimes given. Stage II tumors have spread to nearby lymph nodes and are treated with surgery and chemotherapy, as are some stage III tumors. However, because cancer cells in stage III tumors can be present throughout the chest, surgery is not always possible. For such cases, and for stage IV NSCLC, where the tumor has spread around the body, patients are treated with chemotherapy alone. About 70% of patients with stage I and II NSCLC but only 2% of patients with stage IV NSCLC survive for five years after diagnosis; more than 50% of patients have stage IV NSCLC at diagnosis.
Why Was This Study Done?
Patient responses to treatment vary considerably. Oncologists (doctors who treat cancer) would like to know which patients have a good prognosis (are likely to do well) to help them individualize their treatment. Consequently, the search is on for “prognostic tumor biomarkers,” molecules made by cancer cells that can be used to predict likely clinical outcomes. Such biomarkers, which may also be potential therapeutic targets, can be identified by analyzing the overall pattern of gene expression in a panel of tumors using a technique called microarray analysis and looking for associations between the expression of sets of genes and clinical outcomes. In this study, the researchers take a more directed approach to identifying prognostic biomarkers by investigating the association between the expression of the genes encoding nuclear receptors (NRs) and clinical outcome in patients with lung cancer. The NR superfamily contains 48 transcription factors (proteins that control the expression of other genes) that respond to several hormones and to diet-derived fats. NRs control many biological processes and are targets for several successful drugs, including some used to treat cancer.
What Did the Researchers Do and Find?
The researchers analyzed the expression of NR mRNAs using “quantitative real-time PCR” in 30 microdissected NSCLCs and in matched normal lung tissue samples (mRNA is the blueprint for protein production). They then used an approach called standard classification and regression tree analysis to build a prognostic model for NSCLC based on the expression data. This model predicted both survival time and disease recurrence among the patients from whom the tumors had been taken. The researchers validated their prognostic model in two large independent lung adenocarcinoma microarray datasets and in a squamous cell carcinoma dataset (adenocarcinomas and squamous cell carcinomas are two major NSCLC subtypes). Finally, they explored the roles of specific NRs in the prediction model. This analysis revealed that the ability of the NR signature in tumors to predict outcomes was mainly due to the expression of two NRs—the short heterodimer partner (SHP) and the progesterone receptor (PR). Expression of either gene could be used as a single gene predictor of the survival time of patients, including those with stage I disease. Similarly, the expression of either nerve growth factor induced gene B3 (NGFIB3) or mineralocorticoid receptor (MR) in normal tissue was a single gene predictor of a good prognosis.
What Do These Findings Mean?
These findings indicate that the expression of NR mRNA is strongly associated with clinical outcomes in patients with NSCLC. Furthermore, they identify a prognostic NR expression signature that provides information on the survival time of patients, including those with early stage disease. The signature needs to be confirmed in more patients before it can be used clinically, and researchers would like to establish whether changes in mRNA expression are reflected in changes in protein expression if NRs are to be targeted therapeutically. Nevertheless, these findings highlight the potential use of NRs as prognostic tumor biomarkers. Furthermore, they identify SHP and PR in tumors and two NRs in normal lung tissue as molecules that might provide new targets for the treatment of lung cancer and new insights into the early diagnosis, pathogenesis, and chemoprevention of lung cancer.
Additional Information
Please access these Web sites via the online version of this summary at
The Nuclear Receptor Signaling Atlas (NURSA) is a consortium of scientists sponsored by the US National Institutes of Health that provides scientific reagents, datasets, and educational material on nuclear receptors and their co-regulators to the scientific community through a Web-based portal
The Cancer Prevention and Research Institute of Texas (CPRIT) provides information and resources to anyone interested in the prevention and treatment of lung and other cancers
The US National Cancer Institute provides detailed information for patients and professionals about all aspects of lung cancer, including information on non-small-cell carcinoma and on tumor markers (in English and Spanish)
Cancer Research UK also provides information about lung cancer and information on how cancer starts
MedlinePlus has links to other resources about lung cancer (in English and Spanish)
Wikipedia has a page on nuclear receptors (note that Wikipedia is a free online encyclopedia that anyone can edit; available in several languages)
PMCID: PMC3001894  PMID: 21179495
16.  Concentric and Eccentric Time-Under-Tension during Strengthening Exercises: Validity and Reliability of Stretch-Sensor Recordings from an Elastic Exercise-Band 
PLoS ONE  2013;8(6):e68172.
Total, single repetition and contraction-phase specific (concentric and eccentric) time-under-tension (TUT) are important exercise-descriptors, as they are linked to the physiological and clinical response in exercise and rehabilitation.
To investigate the validity and reliability of total, single repetition, and contraction-phase specific TUT during shoulder abduction exercises, based on data from a stretch-sensor attached to an elastic exercise band.
A concurrent validity and interrater reliability study with two raters was conducted. Twelve participants performed five sets of 10 repetitions of shoulder abduction exercises with an elastic exercise band. Exercises were video-recorded to assess concurrent validity between TUT from stretch-sensor data and from video recordings (gold standard). Agreement between methods was calculated using Limits of Agreement (LoA), and the association was assessed by Pearson correlation coefficients. Interrater reliability was calculated using intraclass correlation coefficients (ICC 2.1).
Total, single repetition, and contraction-phase specific TUT – determined from video and stretch-sensor data – were highly correlated (r>0.99). Agreement between methods was high, as LoA ranged from 0.0 to 3.1 seconds for total TUT (2.6% of mean TUT), from -0.26 to 0.56 seconds for single repetition TUT (6.9%), and from -0.29 to 0.56 seconds for contraction-phase specific TUT (13.2-21.1%). Interrater reliability for total, single repetition and contraction-phase specific TUT was high (ICC>0.99). Interrater agreement was high, as LoA ranged from -2.11 to 2.56 seconds for total TUT (4.7%), from -0.46 to 0.50 seconds for single repetition TUT (9.7%) and from -0.41 to 0.44 seconds for contraction-phase specific TUT (5.2-14.5%).
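The agreement and association statistics reported above can be sketched as follows. This is a minimal illustration that assumes the conventional Bland-Altman definition of 95% Limits of Agreement (mean difference ± 1.96 standard deviations of the differences); the abstract does not state the exact computation used, and the TUT values below are hypothetical.

```python
import math
from statistics import mean, stdev


def limits_of_agreement(a, b):
    """Bland-Altman 95% limits of agreement for paired measurements (b - a)."""
    diffs = [y - x for x, y in zip(a, b)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)  # sample standard deviation of the differences
    return bias - spread, bias + spread


def pearson_r(a, b):
    """Pearson correlation coefficient for two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical single-repetition TUT values in seconds (illustration only).
video = [3.0, 3.2, 2.9, 3.5, 3.1]    # reference method
sensor = [3.1, 3.3, 2.9, 3.7, 3.2]   # stretch-sensor readings

lower, upper = limits_of_agreement(video, sensor)
r = pearson_r(video, sensor)
print(f"LoA: {lower:.2f} to {upper:.2f} s, r = {r:.3f}")
```

Reporting both statistics is deliberate: a high correlation alone does not rule out a systematic offset between methods, which the Limits of Agreement expose.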
Data from a stretch-sensor attached to an elastic exercise band is a valid measure of total and single repetition time-under-tension, and the procedure is highly reliable. This method will enable clinicians and researchers to objectively quantify if home-based exercises are performed as prescribed, with respect to time-under-tension.
PMCID: PMC3692465  PMID: 23825696
17.  New Low Cost Cell and Tissue Acquisition System (CTAS): Microdissection of Live and Frozen Tissues 
Tissue heterogeneity is a serious limiting factor for sound cell-specific molecular studies, including genomic and proteomic analyses. Although tissue microdissection technologies (e.g. laser capture microdissection, LCM) have advanced tremendously over the last decades, several factors, such as their generally high cost and inability to microdissect fresh or live tissues, limit their widespread use. Therefore, there is a need for a low-cost and easy-to-use microdissection device. Here, we developed a low-cost vacuum-assisted capillary-based cell and tissue acquisition system (CTAS) and demonstrated its use for microdissection of brain tissue samples for several downstream applications, including isolation of high-quality RNA from microdissected brain tissue samples, their use for proteomics studies and electron microscopy, and microdissection of native living brain tissues for primary cell culturing. Unlike LCM, CTAS is capable of microdissecting fresh frozen and live tissues, works in thicker tissue sections ranging from 10 µm to 300 µm, and can collect individual cells, cell clusters and subanatomical regions. CTAS has been established as a straightforward and robust microdissection tool, allowing rapid, precise and efficient procurement of specific tissue and cell types at low cost. The developed microdissection protocol avoids extensive heating, chemical treatment, laser beam exposure, and other potentially harmful physical treatment of the tissue samples, thus preserving the primary functions of the dissected cells and the macromolecules within for subsequent downstream applications.
PMCID: PMC3630683
18.  Intra-tumor Genetic Heterogeneity and Mortality in Head and Neck Cancer: Analysis of Data from The Cancer Genome Atlas 
PLoS Medicine  2015;12(2):e1001786.
Although the involvement of intra-tumor genetic heterogeneity in tumor progression, treatment resistance, and metastasis is established, genetic heterogeneity is seldom examined in clinical trials or practice. Many studies of heterogeneity have had prespecified markers for tumor subpopulations, limiting their generalizability, or have involved massive efforts such as separate analysis of hundreds of individual cells, limiting their clinical use. We recently developed a general measure of intra-tumor genetic heterogeneity based on whole-exome sequencing (WES) of bulk tumor DNA, called mutant-allele tumor heterogeneity (MATH). Here, we examine data collected as part of a large, multi-institutional study to validate this measure and determine whether intra-tumor heterogeneity is itself related to mortality.
Methods and Findings
Clinical and WES data were obtained from The Cancer Genome Atlas in October 2013 for 305 patients with head and neck squamous cell carcinoma (HNSCC), from 14 institutions. Initial pathologic diagnoses were between 1992 and 2011 (median, 2008). Median time to death for 131 deceased patients was 14 mo; median follow-up of living patients was 22 mo. Tumor MATH values were calculated from WES results. Despite the multiple head and neck tumor subsites and the variety of treatments, we found in this retrospective analysis a substantial relation of high MATH values to decreased overall survival (Cox proportional hazards analysis: hazard ratio for high/low heterogeneity, 2.2; 95% CI 1.4 to 3.3). This relation of intra-tumor heterogeneity to survival was not due to intra-tumor heterogeneity’s associations with other clinical or molecular characteristics, including age, human papillomavirus status, tumor grade and TP53 mutation, and N classification. MATH improved prognostication over that provided by traditional clinical and molecular characteristics, maintained a significant relation to survival in multivariate analyses, and distinguished outcomes among patients having oral-cavity or laryngeal cancers even when standard disease staging was taken into account. Prospective studies, however, will be required before MATH can be used prognostically in clinical trials or practice. Such studies will need to examine homogeneously treated HNSCC at specific head and neck subsites, and determine the influence of cancer therapy on MATH values. Analysis of MATH and outcome in human-papillomavirus-positive oropharyngeal squamous cell carcinoma is particularly needed.
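For orientation, a MATH value can be sketched in a few lines. The formula below is an assumption, since this abstract does not give it: it follows the commonly cited definition of MATH as 100 times the scaled median absolute deviation (MAD, scaled by 1.4826) of the mutant-allele fractions divided by their median. The function name and the allele fractions are hypothetical.

```python
from statistics import median


def math_score(mutant_allele_fractions):
    """Assumed definition: 100 * (1.4826 * MAD) / median of allele fractions."""
    fractions = list(mutant_allele_fractions)
    if not fractions:
        raise ValueError("need at least one mutant-allele fraction")
    center = median(fractions)
    if center == 0:
        raise ValueError("median allele fraction must be non-zero")
    # Median absolute deviation, scaled to be comparable to a standard deviation.
    mad = 1.4826 * median([abs(f - center) for f in fractions])
    return 100.0 * mad / center


# Hypothetical mutant-allele fractions for two tumors (illustration only).
homogeneous = [0.40, 0.42, 0.38, 0.41, 0.39]
heterogeneous = [0.10, 0.45, 0.22, 0.60, 0.35]

print(f"homogeneous tumor:   MATH = {math_score(homogeneous):.1f}")
print(f"heterogeneous tumor: MATH = {math_score(heterogeneous):.1f}")
```

A wider spread of allele fractions relative to their median gives a higher score, which is the sense in which the metric summarizes intra-tumor heterogeneity from bulk sequencing.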
To our knowledge this study is the first to combine data from hundreds of patients, treated at multiple institutions, to document a relation between intra-tumor heterogeneity and overall survival in any type of cancer. We suggest applying the simply calculated MATH metric of heterogeneity to prospective studies of HNSCC and other tumor types.
In this study, Rocco and colleagues examine data collected as part of a large, multi-institutional study, to validate a measure of tumor heterogeneity called MATH and determine whether intra-tumor heterogeneity is itself related to mortality.
Editors’ Summary
Normally, the cells in human tissues and organs only reproduce (a process called cell division) when new cells are needed for growth or to repair damaged tissues. But sometimes a cell somewhere in the body acquires a genetic change (mutation) that disrupts the control of cell division and allows the cell to grow continuously. As the mutated cell grows and divides, it accumulates additional mutations that allow it to grow even faster and eventually form a lump, or tumor (cancer). Other mutations subsequently allow the tumor to spread around the body (metastasize) and destroy healthy tissues. Tumors can arise anywhere in the body (there are more than 200 different types of cancer), and about one in three people will develop some form of cancer during their lifetime. Many cancers can now be successfully treated, however, and people often survive for years after a diagnosis of cancer before, eventually, dying from another disease.
Why Was This Study Done?
The gradual acquisition of mutations by tumor cells leads to the formation of subpopulations of cells, each carrying a different set of mutations. This “intra-tumor heterogeneity” can produce tumor subclones that grow particularly quickly, that metastasize aggressively, or that are resistant to cancer treatments. Consequently, researchers have hypothesized that high intra-tumor heterogeneity leads to worse clinical outcomes and have suggested that a simple measure of this heterogeneity would be a useful addition to the cancer staging system currently used by clinicians for predicting the likely outcome (prognosis) of patients with cancer. Here, the researchers investigate whether a measure of intra-tumor heterogeneity called “mutant-allele tumor heterogeneity” (MATH) is related to mortality (death) among patients with head and neck squamous cell carcinoma (HNSCC)—cancers that begin in the cells that line the moist surfaces inside the head and neck, such as cancers of the mouth and the larynx (voice box). MATH is based on whole-exome sequencing (WES) of tumor and matched normal DNA. WES uses powerful DNA-sequencing systems to determine the variations of all the coding regions (exons) of the known genes in the human genome (genetic blueprint).
What Did the Researchers Do and Find?
The researchers obtained clinical and WES data for 305 patients who were treated in 14 institutions, primarily in the US, after diagnosis of HNSCC from The Cancer Genome Atlas, a catalog established by the US National Institutes of Health to map the key genomic changes in major types and subtypes of cancer. They calculated tumor MATH values for the patients from their WES results and retrospectively analyzed whether there was an association between the MATH values and patient survival. Despite the patients having tumors at various subsites and being given different treatments, every 10% increase in MATH value corresponded to an 8.8% increased risk (hazard) of death. Using a previously defined MATH-value cutoff to distinguish high- from low-heterogeneity tumors, compared to patients with low-heterogeneity tumors, patients with high-heterogeneity tumors were more than twice as likely to die (a hazard ratio of 2.2). Other statistical analyses indicated that MATH provided improved prognostic information compared to that provided by established clinical and molecular characteristics and human papillomavirus (HPV) status (HPV-positive HNSCC at some subsites has a better prognosis than HPV-negative HNSCC). In particular, MATH provided prognostic information beyond that provided by standard disease staging among patients with mouth or laryngeal cancers.
What Do These Findings Mean?
By using data from more than 300 patients treated at multiple institutions, these findings validate the use of MATH as a measure of intra-tumor heterogeneity in HNSCC. Moreover, they provide one of the first large-scale demonstrations that intra-tumor heterogeneity is clinically important in the prognosis of any type of cancer. Before the MATH metric can be used in clinical trials or in clinical practice as a prognostic tool, its ability to predict outcomes needs to be tested in prospective studies that examine the relation between MATH and the outcomes of patients with identically treated HNSCC at specific head and neck subsites, that evaluate the use of MATH for prognostication in other tumor types, and that determine the influence of cancer treatments on MATH values. Nevertheless, these findings suggest that MATH should be considered as a biomarker for survival in HNSCC and other tumor types, and raise the possibility that clinicians could use MATH values to decide on the best treatment for individual patients and to choose patients for inclusion in clinical trials.
Additional Information
Please access these websites via the online version of this summary at
The US National Cancer Institute (NCI) provides information about cancer and how it develops and about head and neck cancer (in English and Spanish)
Cancer Research UK, a not-for-profit organization, provides general information about cancer and how it develops, and detailed information about head and neck cancer; the Merseyside Regional Head and Neck Cancer Centre provides patient stories about HNSCC
Wikipedia provides information about tumor heterogeneity, and about whole-exome sequencing (note that Wikipedia is a free online encyclopedia that anyone can edit; available in several languages)
Information about The Cancer Genome Atlas is available
A PLOS Blog entry by Jessica Wapner explains more about MATH
PMCID: PMC4323109  PMID: 25668320
19.  Computational deconvolution of genome wide expression data from Parkinson's and Huntington's disease brain tissues using population-specific expression analysis 
The characterization of molecular changes in diseased tissues gives insight into pathophysiological mechanisms and is important for therapeutic development. Genome-wide gene expression analysis has proven valuable for identifying biological processes in neurodegenerative diseases using post mortem human brain tissue, and numerous datasets are publicly available. However, many studies utilize heterogeneous tissue samples consisting of multiple cell types, all of which contribute to global gene expression values, confounding biological interpretation of the data. In particular, changes in numbers of neuronal and glial cells occurring in neurodegeneration confound transcriptomic analyses, particularly in human brain tissues where sample availability and controls are limited. To identify cell-specific gene expression changes in neurodegenerative disease, we have applied our recently published computational deconvolution method, population-specific expression analysis (PSEA). PSEA estimates cell-type-specific expression values using reference expression measures, which in the case of brain tissue comprise mRNAs with cell-type-specific expression in neurons, astrocytes, oligodendrocytes and microglia. As an exercise in PSEA implementation and hypothesis development regarding neurodegenerative diseases, we applied PSEA to Parkinson's and Huntington's disease (PD, HD) datasets. Genes identified as differentially expressed in substantia nigra pars compacta neurons by PSEA were validated using external laser capture microdissection data. Network analysis and Annotation Clustering (DAVID) identified molecular processes implicated by differential gene expression in specific cell types. The results of these analyses provided new insights into the implementation of PSEA in brain tissues and additional refinement of molecular signatures in human HD and PD.
PMCID: PMC4288238  PMID: 25620908
computational deconvolution; Huntington's disease; Parkinson's disease; autophagy; microarray; transcriptomic analysis
20.  Cell population-specific expression analysis of human cerebellum 
BMC Genomics  2012;13:610.
Interpreting gene expression profiles obtained from heterogeneous samples can be difficult because bulk gene expression measures are not resolved to individual cell populations. We have recently devised Population-Specific Expression Analysis (PSEA), a statistical method that identifies individual cell types expressing genes of interest and achieves quantitative estimates of cell type-specific expression levels. This procedure makes use of marker gene expression and circumvents the need for additional experimental information like tissue composition.
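The idea of using marker gene expression to resolve bulk measurements can be illustrated with a deliberately simplified sketch. It assumes that a gene's bulk expression across samples is a linear combination of two per-sample reference signals (one for neurons, one for astrocytes, each derived from marker genes), with no intercept and no model selection. It is not the published PSEA implementation, and all names and numbers are hypothetical.

```python
def fit_two_population_model(neuron_ref, astro_ref, bulk):
    """Least-squares fit of bulk = b_n * neuron_ref + b_a * astro_ref.

    Solves the 2x2 normal equations in closed form and returns (b_n, b_a),
    interpreted here as population-specific expression estimates.
    """
    s_nn = sum(n * n for n in neuron_ref)
    s_aa = sum(a * a for a in astro_ref)
    s_na = sum(n * a for n, a in zip(neuron_ref, astro_ref))
    s_ny = sum(n * y for n, y in zip(neuron_ref, bulk))
    s_ay = sum(a * y for a, y in zip(astro_ref, bulk))
    det = s_nn * s_aa - s_na * s_na
    if abs(det) < 1e-12:
        raise ValueError("reference signals are collinear; cannot separate them")
    b_n = (s_ny * s_aa - s_na * s_ay) / det
    b_a = (s_nn * s_ay - s_na * s_ny) / det
    return b_n, b_a


# Hypothetical per-sample reference signals (illustration only).
neuron_ref = [1.0, 0.8, 0.6, 0.9, 0.5]
astro_ref = [0.5, 0.7, 0.9, 0.4, 1.0]

# Simulated bulk expression of one gene: mostly neuronal, weakly astrocytic.
bulk = [2.0 * n + 0.5 * a for n, a in zip(neuron_ref, astro_ref)]

b_neuron, b_astro = fit_two_population_model(neuron_ref, astro_ref, bulk)
print(f"neuron-specific estimate: {b_neuron:.2f}")
print(f"astrocyte-specific estimate: {b_astro:.2f}")
```

The fit only works when the reference signals vary independently across samples, which is why the sketch refuses collinear inputs.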
To systematically assess the performance of statistical deconvolution, we applied PSEA to gene expression profiles from cerebellum tissue samples and compared with parallel, experimental separation methods. Owing to the particular histological organization of the cerebellum, we could obtain cellular expression data from in situ hybridization and laser-capture microdissection experiments and successfully validated computational predictions made with PSEA. Upon statistical deconvolution of whole tissue samples, we identified a set of transcripts showing age-related expression changes in the astrocyte population.
PSEA can predict cell-type-specific expression levels from tissue homogenates on a genome-wide scale. It thus represents a computational alternative to experimental separation methods and allowed us to identify age-related expression changes in the astrocytes of the cerebellum. These molecular changes might underlie important physiological modifications previously observed in the aging brain.
PMCID: PMC3561119  PMID: 23145530
Genomics; Computational biology; Cerebellum; Gene expression; Aging; Astrocyte
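The idea of using marker gene expression to resolve bulk measurements can be illustrated with a small regression sketch. This is not the published PSEA implementation: it assumes a simple linear model in which a gene's bulk expression varies with a single marker-derived reference signal, and the function names and toy numbers are hypothetical.

```python
from statistics import mean

def reference_signal(marker_rows):
    """Average several marker genes (one list of per-sample values per gene)
    into a single per-sample reference signal for one cell population."""
    n_samples = len(marker_rows[0])
    return [mean(row[i] for row in marker_rows) for i in range(n_samples)]

def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

# Hypothetical toy data: two astrocyte marker genes measured in four samples.
astro_markers = [[1.0, 2.0, 3.0, 4.0],
                 [3.0, 4.0, 5.0, 6.0]]
astro_signal = reference_signal(astro_markers)

# Simulated bulk expression of a gene of interest that depends on the signal.
gene = [0.5 + 2.0 * s for s in astro_signal]

# Under the assumed model, the slope is read as the population-specific
# contribution of astrocytes to this gene's bulk expression.
intercept, slope = fit_line(astro_signal, gene)
```

In this toy setting the fitted slope recovers the simulated dependence on the astrocyte signal. A realistic analysis would include several populations at once and test each coefficient for significance.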
21.  A Mouse to Human Search for Plasma Proteome Changes Associated with Pancreatic Tumor Development 
PLoS Medicine  2008;5(6):e123.
The complexity and heterogeneity of the human plasma proteome have presented significant challenges in the identification of protein changes associated with tumor development. Refined genetically engineered mouse (GEM) models of human cancer have been shown to faithfully recapitulate the molecular, biological, and clinical features of human disease. Here, we sought to exploit the merits of a well-characterized GEM model of pancreatic cancer to determine whether proteomics technologies allow identification of protein changes associated with tumor development and whether such changes are relevant to human pancreatic cancer.
Methods and Findings
Plasma was sampled from mice at early and advanced stages of tumor development and from matched controls. Using a proteomic approach based on extensive protein fractionation, we confidently identified 1,442 proteins that were distributed across seven orders of magnitude of abundance in plasma. Analysis of proteins chosen on the basis of increased levels in plasma from tumor-bearing mice and corroborating protein or RNA expression in tissue documented concordance in the blood from 30 newly diagnosed patients with pancreatic cancer relative to 30 control specimens. A panel of five proteins selected on the basis of their increased level at an early stage of tumor development in the mouse was tested in a blinded study in 26 humans from the CARET (Carotene and Retinol Efficacy Trial) cohort. The panel discriminated pancreatic cancer cases from matched controls in blood specimens obtained between 7 and 13 mo prior to the development of symptoms and clinical diagnosis of pancreatic cancer.
Our findings indicate that GEM models of cancer, in combination with in-depth proteomic analysis, provide a useful strategy to identify candidate markers applicable to human cancer with potential utility for early detection.
Samir Hanash and colleagues identify proteins that are increased at an early stage of pancreatic tumor development in a mouse model and may be a useful tool in detecting early tumors in humans.
Editors' Summary
Cancers are life-threatening, disorganized masses of cells that can occur anywhere in the human body. They develop when cells acquire genetic changes that allow them to grow uncontrollably and to spread around the body (metastasize). If a cancer is detected when it is still small and has not metastasized, surgery can often provide a cure. Unfortunately, many cancers are detected only when they are large enough to press against surrounding tissues and cause pain or other symptoms. By this time, surgical removal of the original (primary) tumor may be impossible and there may be secondary cancers scattered around the body. In such cases, radiotherapy and chemotherapy can sometimes help, but the outlook for patients whose cancers are detected late is often poor. One cancer type for which late detection is a particular problem is pancreatic adenocarcinoma. This cancer rarely causes any symptoms in its early stages. Furthermore, the symptoms it eventually causes—jaundice, abdominal and back pain, and weight loss—are seen in many other illnesses. Consequently, pancreatic cancer has usually spread before it is diagnosed, and most patients die within a year of their diagnosis.
Why Was This Study Done?
If a test could be developed to detect pancreatic cancer in its early stages, the lives of many patients might be extended. Tumors often release specific proteins—“cancer biomarkers”—into the blood, a bodily fluid that can be easily sampled. If a protein released into the blood by pancreatic cancer cells could be identified, it might be possible to develop a noninvasive screening test for this deadly cancer. In this study, the researchers use a “proteomic” approach to identify potential biomarkers for early pancreatic cancer. Proteomics is the study of the patterns of proteins made by an organism, tissue, or cell and of the changes in these patterns that are associated with various diseases.
What Did the Researchers Do and Find?
The researchers started their search for pancreatic cancer biomarkers by studying the plasma proteome (the proteins in the fluid portion of blood) of mice genetically engineered to develop cancers that closely resemble human pancreatic tumors. Through the use of two techniques called high-resolution mass spectrometry and acrylamide isotopic labeling, the researchers identified 165 proteins that were present in larger amounts in plasma collected from mice with early and/or advanced pancreatic cancer than in plasma from control mice. Then, to test whether any of these protein changes were relevant to human pancreatic cancer, the researchers analyzed blood samples collected from patients with pancreatic cancer. These samples, they report, contained larger amounts of some of these proteins than blood collected from patients with chronic pancreatitis, a condition that has similar symptoms to pancreatic cancer. Finally, using blood samples collected during a clinical trial, the Carotene and Retinol Efficacy Trial (a cancer-prevention study), the researchers showed that the measurement of five of the proteins present in increased amounts at an early stage of tumor development in the mouse model discriminated between people with pancreatic cancer and matched controls up to 13 months before cancer diagnosis.
What Do These Findings Mean?
These findings suggest that in-depth proteomic analysis of genetically engineered mouse models of human cancer might be an effective way to identify biomarkers suitable for the early detection of human cancers. Previous attempts to identify such biomarkers using human samples have been hampered by the many noncancer-related differences in plasma proteins that exist between individuals and by problems in obtaining samples from patients with early cancer. The use of a mouse model of human cancer, these findings indicate, can circumvent both of these problems. More specifically, these findings identify a panel of proteins that might allow earlier detection of pancreatic cancer and that might, therefore, extend the life of some patients who develop this cancer. However, before a routine screening test becomes available, additional markers will need to be identified and extensive validation studies in larger groups of patients will have to be completed.
Additional Information.
Please access these Web sites via the online version of this summary at
The MedlinePlus Encyclopedia has a page on pancreatic cancer (in English and Spanish). Links to further information are provided by MedlinePlus
The US National Cancer Institute has information about pancreatic cancer for patients and health professionals (in English and Spanish)
The UK charity Cancerbackup also provides information for patients about pancreatic cancer
The Clinical Proteomic Technologies for Cancer Initiative (a US National Cancer Institute initiative) provides a tutorial about proteomics and cancer and information on the Mouse Proteomic Technologies Initiative
PMCID: PMC2504036  PMID: 18547137
22.  Laser capture microdissection in pathology 
Journal of Clinical Pathology  2000;53(9):666-672.
The molecular examination of pathologically altered cells and tissues at the DNA, RNA, and protein level has revolutionised research and diagnostics in pathology. However, the inherent heterogeneity of primary tissues with an admixture of various reactive cell populations can affect the outcome and interpretation of molecular studies. Recently, microdissection of tissue sections and cytological preparations has been used increasingly for the isolation of homogeneous, morphologically identified cell populations, thus overcoming the obstacle of tissue complexity. In conjunction with sensitive analytical techniques, such as the polymerase chain reaction, microdissection allows precise in vivo examination of cell populations, such as carcinoma in situ or the malignant cells of Hodgkin's disease, which are otherwise inaccessible for conventional molecular studies. However, most microdissection techniques are very time consuming and require a high degree of manual dexterity, which limits their practical use. Laser capture microdissection (LCM), a novel technique developed at the National Cancer Institute, is an important advance in terms of speed, ease of use, and versatility of microdissection. LCM is based on the adherence of visually selected cells to a thermoplastic membrane, which overlies the dehydrated tissue section and is focally melted by triggering of a low energy infrared laser pulse. The melted membrane forms a composite with the selected tissue area, which can be removed by simple lifting of the membrane. LCM can be applied to a wide range of cell and tissue preparations including paraffin wax embedded material. The use of immunohistochemical stains allows the selection of cells according to phenotypic and functional characteristics. Depending on the starting material, DNA, good quality mRNA, and proteins can be extracted successfully from captured tissue fragments, down to the single cell level. 
In combination with techniques like expression library construction, cDNA array hybridisation and differential display, LCM will allow the establishment of "genetic fingerprints" of specific pathological lesions, especially malignant neoplasms. In addition to the identification of new diagnostic and prognostic markers, this approach could help in establishing individualised treatments tailored to the molecular profile of a tumour. This review provides an overview of the technique of LCM, summarises current applications and new methodical approaches, and tries to give a perspective on future developments. In addition, LCM is compared with other recently developed laser microdissection techniques.
Key Words: laser capture microdissection • RNA analysis • DNA analysis • gene expression • profiling • immunohistochemistry
PMCID: PMC1731250  PMID: 11041055
23.  Probabilistic Inference of Transcription Factor Binding from Multiple Data Sources 
PLoS ONE  2008;3(3):e1820.
An important problem in molecular biology is to build a complete understanding of transcriptional regulatory processes in the cell. We have developed a flexible, probabilistic framework to predict TF binding from multiple data sources that differs from the standard hypothesis testing (scanning) methods in several ways. Our probabilistic modeling framework estimates the probability of binding and, thus, naturally reflects our degree of belief in binding. Probabilistic modeling also allows for easy and systematic integration of our binding predictions into other probabilistic modeling methods, such as expression-based gene network inference. The method answers the question of whether the whole analyzed promoter has a binding site, but can also be extended to estimate the binding probability at each nucleotide position. Further, we introduce an extension to model combinatorial regulation by several TFs. Most importantly, the proposed methods can make principled probabilistic inference from multiple evidence sources, such as multiple statistical models (motifs) of the TFs, evolutionary conservation, regulatory potential, CpG islands, nucleosome positioning, DNase hypersensitive sites, ChIP-chip binding segments and other (prior) sequence-based biological knowledge. We developed both a likelihood and a Bayesian method, where the latter is implemented with a Markov chain Monte Carlo algorithm. Results on a carefully constructed test set from the mouse genome demonstrate that principled data fusion can significantly improve the performance of TF binding prediction methods. We also applied the probabilistic modeling framework to all promoters in the mouse genome and the results indicate a sparse connectivity between transcriptional regulators and their target promoters. To facilitate analysis of other sequences and additional data, we have developed an on-line web tool, ProbTF, which implements our probabilistic TF binding prediction method using multiple data sources.
A test data set, a web tool, source code and supplementary data are available at:
PMCID: PMC2268002  PMID: 18364997
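How several evidence sources can be fused into one binding probability can be sketched with Bayes' rule in odds form. This is a simplified illustration rather than the ProbTF model itself: it assumes the evidence sources are conditionally independent and that each one has already been summarized as a likelihood ratio. The function name and the numbers are hypothetical.

```python
def posterior_binding(prior, likelihood_ratios):
    """Combine a prior binding probability with per-source likelihood ratios.

    Each ratio is P(evidence | bound) / P(evidence | not bound). Multiplying
    the ratios assumes the sources are conditionally independent, which is a
    simplification made for this sketch.
    """
    odds = prior / (1.0 - prior)
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds / (1.0 + odds)

# Hypothetical example: a weak prior, a motif match that favours binding,
# and a conservation score that favours it mildly.
p_motif_only = posterior_binding(0.1, [9.0])
p_motif_and_conservation = posterior_binding(0.1, [9.0, 2.0])
```

Each additional ratio above 1 raises the posterior and each ratio below 1 lowers it, which is the sense in which the estimate reflects a degree of belief rather than a pass/fail threshold.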
24.  Predicting Survival within the Lung Cancer Histopathological Hierarchy Using a Multi-Scale Genomic Model of Development 
PLoS Medicine  2006;3(7):e232.
The histopathologic heterogeneity of lung cancer remains a significant confounding factor in its diagnosis and prognosis—spurring numerous recent efforts to find a molecular classification of the disease that has clinical relevance.
Methods and Findings
Molecular profiles of tumors from 186 patients representing four different lung cancer subtypes (and 17 normal lung tissue samples) were compared with a mouse lung development model using principal component analysis in both temporal and genomic domains. An algorithm for the classification of lung cancers using a multi-scale developmental framework was developed. Kaplan–Meier survival analysis was conducted for lung adenocarcinoma patient subgroups identified via their developmental association. We found multi-scale genomic similarities between four human lung cancer subtypes and the developing mouse lung that are prognostically meaningful. Significant association was observed between the localization of human lung cancer cases along the principal mouse lung development trajectory and the corresponding patient survival rate at three distinct levels of classical histopathologic resolution: among different lung cancer subtypes, among patients within the adenocarcinoma subtype, and within the stage I adenocarcinoma subclass. The earlier the genomic association between a human tumor profile and the mouse lung development sequence, the poorer the patient's prognosis. Furthermore, decomposing this principal lung development trajectory identified a gene set that was significantly enriched for pyrimidine metabolism and cell-adhesion functions specific to lung development and oncogenesis.
From a multi-scale disease modeling perspective, the molecular dynamics of murine lung development provide an effective framework that is not only data driven but also informed by the biology of development for elucidating the mechanisms of human lung cancer biology and its clinical outcome.
Editors' Summary
Lung cancer causes the most deaths from cancer worldwide—around a quarter of all cancer deaths—and the number of deaths is rising each year. There are a number of different types of the disease, whose names come from early descriptions of the cancer cells when seen under the microscope: carcinoid, small cell, and non–small cell, which make up 2%, 13%, and 86% of lung cancers, respectively. To make things more complicated, each of these cancer types can be subdivided further. It is important to distinguish the different types of cancer because they differ in their rates of growth and how they respond to treatment; for example, small cell lung cancer is the most rapidly progressing type of lung cancer. But although these current classifications of cancers are useful, researchers believe that if the underlying molecular changes in these cancers could be discovered then a more accurate way of classifying cancers, and hence predicting outcome and response to treatment, might be possible.
Why Was This Study Done?
Previous work has suggested that some cancers come from very immature cells, that is, cells that are present in the early stages of an animal's development from an embryo in the womb to an adult animal. Many animals have been closely studied so as to understand how they develop; the best studied model that is also relevant to human disease is the mouse, and researchers have previously studied lung development in mice in detail. This group of researchers wanted to see if there was any relation between the activity (known as expression) of mouse genes during the development of the lung and the expression of genes in human lung cancers, particularly whether they could use gene expression to try to predict the outcome of lung cancer in patients.
What Did the Researchers Do and Find?
They compared the gene expression in lung cancer samples from 186 patients with four different types of lung cancer (and in 17 normal lung tissue samples) to the gene expression found in normal mice during development. They found similarities between expression patterns in the lung cancer subtypes and the developing mouse lung, and that these similarities explain some of the different outcomes for the patients. In general, they found that when the gene expression in the human cancer was similar to that of very immature mouse lung cells, patients had a poor prognosis. When the gene expression in the human cancer was more similar to mature mouse lung cells, the prognosis was better. However, the researchers found that carcinoid tumors had rather different expression profiles compared to the other tumors.
The researchers were also able to discover some specific gene types that seemed to have particularly strong associations between mouse development and the human cancers. Two of these gene types were ones that are involved in building and breaking down DNA itself, and ones involved in how cells stick together. This latter group of genes is thought to be involved in how cancers spread.
What Do These Findings Mean?
These results provide a new way of thinking about how to classify lung cancers, and also point to a few groups of genes that may be particularly important in the development of the tumor. However, before these results are used in any clinical assessment, further work will need to be done to work out whether they are true for other groups of patients.
Additional Information.
Please access these Web sites via the online version of this summary at
•  MedlinePlus has information from the United States National Library of Medicine and other government agencies and health-related organizations
•  The National Institute on Aging is also a good place to start looking for information
•  The National Cancer Institute and Lung Cancer Online have a wide range of information on lung cancer
Comparison of gene expression patterns in patients with lung cancer and in mouse lung development showed that those tumors associated with earlier mouse lung development had a poorer prognosis.
PMCID: PMC1483910  PMID: 16800721
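The Kaplan–Meier analysis mentioned in this abstract rests on the product-limit estimator, which can be sketched in a few lines. The function name and the toy follow-up data are hypothetical and are not taken from the study.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the survival curve.

    times  -- follow-up time for each patient
    events -- 1 if death was observed at that time, 0 if the patient was censored
    Returns a list of (time, survival probability) at each observed death time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Both deaths and censored patients leave the risk set.
        at_risk -= removed
    return curve

# Hypothetical toy cohort: four patients, one censored at time 2.
curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
```

Censored patients never lower the curve directly, but they shrink the risk set, so later deaths carry more weight. Comparing such curves between patient subgroups is what links the developmental classification to prognosis.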
25.  PERT: A Method for Expression Deconvolution of Human Blood Samples from Varied Microenvironmental and Developmental Conditions 
PLoS Computational Biology  2012;8(12):e1002838.
The cellular composition of heterogeneous samples can be predicted using an expression deconvolution algorithm to decompose their gene expression profiles based on pre-defined, reference gene expression profiles of the constituent populations in these samples. However, the expression profiles of the actual constituent populations are often perturbed from those of the reference profiles due to gene expression changes in cells associated with microenvironmental or developmental effects. Existing deconvolution algorithms do not account for these changes and give incorrect results when benchmarked against those measured by well-established flow cytometry, even after batch correction was applied. We introduce PERT, a new probabilistic expression deconvolution method that detects and accounts for a shared, multiplicative perturbation in the reference profiles when performing expression deconvolution. We applied PERT and three other state-of-the-art expression deconvolution methods to predict cell frequencies within heterogeneous human blood samples that were collected under several conditions (uncultured mono-nucleated and lineage-depleted cells, and culture-derived lineage-depleted cells). Only PERT's predicted proportions of the constituent populations matched those assigned by flow cytometry. Genes associated with cell cycle processes were highly enriched among those with the largest predicted expression changes between the cultured and uncultured conditions. We anticipate that PERT will be widely applicable to expression deconvolution strategies that use profiles from reference populations that vary from the corresponding constituent populations in cellular state but not cellular phenotypic identity.
Author Summary
The cellular composition of heterogeneous samples can be predicted from reference gene expression profiles that represent the homogeneous, constituent populations of the heterogeneous samples. However, existing methods fail when the reference profiles are not representative of the constituent populations. We developed PERT, a new probabilistic expression deconvolution method, to address this limitation. PERT was used to deconvolve the cellular composition of variably sourced and treated heterogeneous human blood samples. Our results indicate that even after batch correction is applied, cells presenting the same cell surface antigens display different transcriptional programs when they are uncultured versus culture-derived. Given gene expression profiles of culture-derived heterogeneous samples and profiles of uncultured reference populations, PERT was able to accurately recover proportions of the constituent populations composing the heterogeneous samples. We anticipate that PERT will be widely applicable to expression deconvolution strategies that use profiles from reference populations that vary from the corresponding constituent populations in cellular state but not cellular phenotypic identity.
PMCID: PMC3527275  PMID: 23284283
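The limitation that motivates PERT can be illustrated with the simplest form of reference-based deconvolution. The sketch below is ordinary least squares for a two-population mixture, not PERT's probabilistic model. It assumes the mixed profile is a proportion-weighted sum of the reference profiles, and all names and numbers are hypothetical.

```python
def estimate_proportion(mixed, ref_a, ref_b):
    """Least-squares estimate of the fraction p of population A, assuming
    mixed[g] = p * ref_a[g] + (1 - p) * ref_b[g] for every gene g."""
    num = sum((m - b) * (a - b) for m, a, b in zip(mixed, ref_a, ref_b))
    den = sum((a - b) ** 2 for a, b in zip(ref_a, ref_b))
    return min(1.0, max(0.0, num / den))  # keep the estimate in [0, 1]

def mix(p, profile_a, profile_b):
    """Simulate a heterogeneous sample containing a fraction p of population A."""
    return [p * a + (1 - p) * b for a, b in zip(profile_a, profile_b)]

# Hypothetical reference profiles for two populations over three genes.
ref_a = [10.0, 2.0, 6.0]
ref_b = [2.0, 8.0, 6.0]

# Case 1: the cells in the sample match the references exactly.
p_matched = estimate_proportion(mix(0.25, ref_a, ref_b), ref_a, ref_b)

# Case 2: population A is perturbed in the sample (first gene doubled),
# but the unperturbed reference is still used for deconvolution.
perturbed_a = [20.0, 2.0, 6.0]
p_perturbed = estimate_proportion(mix(0.25, perturbed_a, ref_b), ref_a, ref_b)
```

With matching references the true fraction of 0.25 is recovered, whereas the perturbed case yields a biased estimate. This is the kind of error that a method modelling perturbations in the reference profiles is intended to correct.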
