Related Articles

1.  Characterization of Vascular Disease Risk in Postmenopausal Women and Its Association with Cognitive Performance 
PLoS ONE  2013;8(7):e68741.
Objectives
While global measures of cardiovascular (CV) risk are used to guide prevention and treatment decisions, these estimates fail to account for the considerable interindividual variability in pre-clinical risk status. This study investigated heterogeneity in CV risk factor profiles and its association with demographic, genetic, and cognitive variables.
Methods
A latent profile analysis was applied to data from 727 recently postmenopausal women enrolled in the Kronos Early Estrogen Prevention Study (KEEPS). Women were cognitively healthy, within three years of their last menstrual period, and free of current or past CV disease. Education level, apolipoprotein E ε4 allele (APOE4), ethnicity, and age were modeled as predictors of latent class membership. The association between class membership, characterizing CV risk profiles, and performance on five cognitive factors was examined. A supervised random forest algorithm with a 10-fold cross-validation estimator was used to test accuracy of CV risk classification.
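The 10-fold cross-validation estimator mentioned above can be sketched as follows. This is a minimal illustration on synthetic data, not the KEEPS analysis: the helper names are hypothetical, and a simple threshold rule stands in for the random forest so that the example needs only the standard library.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validated_accuracy(features, labels, fit, predict, k=10, seed=0):
    """Average held-out accuracy over k folds."""
    accuracies = []
    for held_out in k_fold_indices(len(labels), k, seed):
        held = set(held_out)
        train = [i for i in range(len(labels)) if i not in held]
        model = fit([features[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(model, features[i]) == labels[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k

def mean(values):
    return sum(values) / len(values)

# Stand-in classifier (assumption, NOT a random forest):
# a threshold placed midway between the two class means.
def fit_threshold(xs, ys):
    low = mean([x for x, y in zip(xs, ys) if y == 0])
    high = mean([x for x, y in zip(xs, ys) if y == 1])
    return (low + high) / 2

def predict_threshold(threshold, x):
    return int(x > threshold)

# Synthetic one-dimensional risk score: class 1 is shifted upward.
rng = random.Random(1)
labels = [i % 2 for i in range(200)]
features = [rng.gauss(3.0 * y, 1.0) for y in labels]
accuracy = cross_validated_accuracy(features, labels, fit_threshold, predict_threshold)
print(f"10-fold cross-validated accuracy: {accuracy:.3f}")
```

Each sample is held out exactly once, so the averaged accuracy estimates performance on data the classifier did not see during fitting.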
Results
The best-fitting model generated two distinct phenotypic classes of CV risk: 62% of women were “low-risk” and 38% “high-risk”. Women classified as low-risk outperformed high-risk women on language and mental flexibility tasks (p = 0.008) and a global measure of cognition (p = 0.029). Women with a college degree or above were more likely to be in the low-risk class (OR = 1.595, p = 0.044). Older age and Hispanic ethnicity increased the probability of being at high risk (OR = 1.140, p = 0.002; OR = 2.622, p = 0.012; respectively). The prevalence rate of APOE-ε4 was higher in the high-risk class compared with rates in the low-risk class.
Conclusion
Among recently menopausal women, significant heterogeneity in CV risk is associated with education level, age, ethnicity, and genetic indicators. The model-based latent classes were also associated with cognitive function. These differences may point to phenotypes for CV disease risk. Evaluating the evolution of phenotypes could in turn clarify preclinical disease, and screening and preventive strategies.
ClinicalTrials.gov NCT00154180
doi:10.1371/journal.pone.0068741
PMCID: PMC3714288  PMID: 23874743
2.  Cascaded discrimination of normal, abnormal, and confounder classes in histopathology: Gleason grading of prostate cancer 
BMC Bioinformatics  2012;13:282.
Background
Automated classification of histopathology involves identification of multiple classes, including benign, cancerous, and confounder categories. The confounder tissue classes can often mimic and share attributes with both the diseased and normal tissue classes, and can be particularly difficult to identify, both manually and by automated classifiers. In the case of prostate cancer, there may be several confounding tissue types present in a biopsy sample, posing major sources of diagnostic error for pathologists. Two common multi-class approaches are one-shot classification (OSC), where all classes are identified simultaneously, and one-versus-all (OVA), where a “target” class is distinguished from all “non-target” classes. OSC is typically unable to handle discrimination of classes of varying similarity (e.g. with images of prostate atrophy and high grade cancer), while OVA forces several heterogeneous classes into a single “non-target” class. In this work, we present a cascaded (CAS) approach to classifying prostate biopsy tissue samples, where images from different classes are grouped to maximize intra-group homogeneity while maximizing inter-group heterogeneity.
Results
We apply the CAS approach to categorize 2000 tissue samples taken from 214 patient studies into seven classes: epithelium, stroma, atrophy, prostatic intraepithelial neoplasia (PIN), and prostate cancer Gleason grades 3, 4, and 5. A series of increasingly granular binary classifiers are used to split the different tissue classes until the images have been categorized into a single unique class. Our automatically-extracted image feature set includes architectural features based on location of the nuclei within the tissue sample as well as texture features extracted on a per-pixel level. The CAS strategy yields a positive predictive value (PPV) of 0.86 in classifying the 2000 tissue images into one of 7 classes, compared with the OVA (0.77 PPV) and OSC approaches (0.76 PPV).
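The cascade of increasingly granular binary classifiers can be pictured as a binary tree whose leaves are the seven classes. The sketch below is illustrative only: the grouping order, the score names, and the thresholds are hypothetical and are not taken from the paper; each split would in practice be a trained classifier over the architectural and texture features.

```python
def make_cascade(split, left, right):
    """Internal node: split(sample) is True to descend left, False to descend right."""
    return {"split": split, "left": left, "right": right}

def classify(node, sample):
    """Walk the cascade until a leaf (a class-label string) is reached."""
    while isinstance(node, dict):
        node = node["left"] if node["split"](sample) else node["right"]
    return node

def positive_predictive_value(true_labels, predicted_labels, target):
    """PPV = true positives / all samples predicted as `target`."""
    predicted_positive = [t for t, p in zip(true_labels, predicted_labels) if p == target]
    if not predicted_positive:
        return None
    return sum(t == target for t in predicted_positive) / len(predicted_positive)

# Hypothetical grouping: cancer vs. non-cancer first, then finer splits.
cascade = make_cascade(
    lambda s: s["cancer_score"] > 0.5,
    make_cascade(
        lambda s: s["grade_score"] < 3.5,
        "Gleason 3",
        make_cascade(lambda s: s["grade_score"] < 4.5, "Gleason 4", "Gleason 5"),
    ),
    make_cascade(
        lambda s: s["confounder_score"] > 0.5,
        make_cascade(lambda s: s["atrophy_score"] > 0.5, "atrophy", "PIN"),
        make_cascade(lambda s: s["stroma_score"] > 0.5, "stroma", "epithelium"),
    ),
)

sample = {"cancer_score": 0.9, "grade_score": 4.0}
print(classify(cascade, sample))
```

Because each node only separates two groups, dissimilar classes are split early and similar ones (for example adjacent Gleason grades) are left to the more specialized classifiers further down.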
Conclusions
Use of the CAS strategy increases the PPV for a multi-category classification system over two common alternative strategies. In classification problems such as histopathology, where multiple class groups exist with varying degrees of heterogeneity, the CAS system can intelligently assign class labels to objects by performing multiple binary classifications according to domain knowledge.
doi:10.1186/1471-2105-13-282
PMCID: PMC3563463  PMID: 23110677
3.  Developmental Profiles of Eczema, Wheeze, and Rhinitis: Two Population-Based Birth Cohort Studies 
PLoS Medicine  2014;11(10):e1001748.
Using data from two population-based birth cohorts, Danielle Belgrave and colleagues examine the evidence for atopic march in developmental profiles for allergic disorders.
Please see later in the article for the Editors' Summary
Background
The term “atopic march” has been used to imply a natural progression of a cascade of symptoms from eczema to asthma and rhinitis through childhood. We hypothesize that this expression does not adequately describe the natural history of eczema, wheeze, and rhinitis during childhood. We propose that this paradigm arose from cross-sectional analyses of longitudinal studies, and may reflect a population pattern that may not predominate at the individual level.
Methods and Findings
Data from 9,801 children in two population-based birth cohorts were used to determine individual profiles of eczema, wheeze, and rhinitis and whether the manifestations of these symptoms followed an atopic march pattern. Children were assessed at ages 1, 3, 5, 8, and 11 y. We used Bayesian machine learning methods to identify distinct latent classes based on individual profiles of eczema, wheeze, and rhinitis. This approach allowed us to identify groups of children with similar patterns of eczema, wheeze, and rhinitis over time.
Using a latent disease profile model, the data were best described by eight latent classes: no disease (51.3%), atopic march (3.1%), persistent eczema and wheeze (2.7%), persistent eczema with later-onset rhinitis (4.7%), persistent wheeze with later-onset rhinitis (5.7%), transient wheeze (7.7%), eczema only (15.3%), and rhinitis only (9.6%). When latent variable modelling was carried out separately for the two cohorts, similar results were obtained. Highly concordant patterns of sensitisation were associated with different profiles of eczema, rhinitis, and wheeze. The main limitation of this study was the difference in wording of the questions used to ascertain the presence of eczema, wheeze, and rhinitis in the two cohorts.
Conclusions
The developmental profiles of eczema, wheeze, and rhinitis are heterogeneous; only a small proportion of children (∼7% of those with symptoms) follow trajectory profiles resembling the atopic march.
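As a quick check on the proportion quoted above, the atopic march share among symptomatic children can be approximated from the published class percentages. This is only a back-of-the-envelope sketch: the percentages are rounded in the abstract (they sum to slightly more than 100), so the result is approximate and the authors' exact figure may have been derived from counts.

```python
# Latent class percentages as reported in the abstract (rounded values).
class_shares = {
    "no disease": 51.3,
    "atopic march": 3.1,
    "persistent eczema and wheeze": 2.7,
    "persistent eczema with later-onset rhinitis": 4.7,
    "persistent wheeze with later-onset rhinitis": 5.7,
    "transient wheeze": 7.7,
    "eczema only": 15.3,
    "rhinitis only": 9.6,
}

total = sum(class_shares.values())
symptomatic = sum(v for name, v in class_shares.items() if name != "no disease")
march_share = 100 * class_shares["atopic march"] / symptomatic

print(f"sum of reported percentages: {total:.1f}")
print(f"atopic march among symptomatic children: {march_share:.1f}%")
```

The result lies between 6% and 7%, in line with the statement that only a small proportion of children with symptoms follow the atopic march profile.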
Editors' Summary
Background
Our immune system protects us from viruses, bacteria, and other pathogens by recognizing specific molecules on the invader's surface and initiating a sequence of events that culminates in the death of the pathogen. Sometimes, however, our immune system responds to harmless materials (allergens such as pollen) and triggers allergic, or atopic, symptoms. Common atopic symptoms include eczema (transient dry itchy patches on the skin), wheeze (high pitched whistling in the chest, a symptom of asthma), and rhinitis (sneezing or a runny nose in the absence of a cold or influenza). All these symptoms are very common during childhood, but recent epidemiological studies (examinations of the patterns and causes of diseases in a population) have revealed age-related changes in the proportions of children affected by each symptom. So, for example, eczema is more common in infants than in school-age children. These findings have led to the idea of “atopic march,” a natural progression of symptoms within individual children that starts with eczema, then progresses to wheeze and finally rhinitis.
Why Was This Study Done?
The concept of atopic march has led to the initiation of studies that aim to prevent the development of asthma in children who are thought to be at risk of asthma because they have eczema. Moreover, some guidelines recommend that clinicians tell parents that children with eczema may later develop asthma or rhinitis. However, because of the design of the epidemiological studies that support the concept of atopic march, children with eczema who later develop wheeze and rhinitis may actually belong to a distinct subgroup of children, rather than representing the typical progression of atopic diseases. It is important to know whether atopic march adequately describes the natural history of atopic diseases during childhood to avoid the imposition of unnecessary strategies on children with eczema to prevent asthma. Here, the researchers use machine learning techniques to model the developmental profiles of eczema, wheeze, and rhinitis during childhood in two large population-based birth cohorts by taking into account time-related (longitudinal) changes in symptoms within individuals. Machine learning is a data-driven approach that identifies structure within the data (for example, a typical progression of symptoms) using unsupervised learning of latent variables (variables that are not directly measured but are inferred from other observable characteristics).
What Did the Researchers Do and Find?
The researchers used data from two UK birth cohorts—the Avon Longitudinal Study of Parents and Children (ALSPAC) and the Manchester Asthma and Allergy Study (MAAS)—for their study (9,801 children in total). Both studies enrolled children at birth and monitored their subsequent health at regular review clinics. At each review clinic, information about eczema, wheeze, and rhinitis was collected from the parents using validated questionnaires. The researchers then used these data and machine learning methods to identify groups of children with similar patterns of onset of eczema, wheeze, and rhinitis over the first 11 years of life. Using a type of statistical model called a latent disease profile model, the researchers found that the data were best described by eight latent classes—no disease (51.3% of the children), atopic march (3.1%), persistent eczema and wheeze (2.7%), persistent eczema with later-onset rhinitis (4.7%), persistent wheeze with later-onset rhinitis (5.7%), transient wheeze (7.7%), eczema only (15.3%), and rhinitis only (9.6%).
What Do These Findings Mean?
These findings show that, in two large UK birth cohorts, the developmental profiles of eczema, wheeze, and rhinitis were heterogeneous. Most notably, the progression of symptoms fitted the profile of atopic march in fewer than 7% of children with symptoms. The researchers acknowledge that their study has some limitations. For example, small differences in the wording of the questions used to gather information from parents about their children's symptoms in the two cohorts may have slightly affected the findings. However, based on their findings, the researchers propose that, because eczema, wheeze, and rhinitis are common, these symptoms often coexist in individuals, but as independent entities rather than as a linked progression of symptoms. Thus, using eczema as an indicator of subsequent asthma risk and assigning “preventative” measures to children with eczema is flawed. Importantly, clinicians need to understand the heterogeneity of patterns of atopic diseases in children and to communicate this variability to parents when advising them about the development and resolution of atopic symptoms in their children.
Additional Information
Please access these websites via the online version of this summary at http://dx.doi.org/10.1371/journal.pmed.1001748.
The UK National Health Service Choices website provides information about eczema (including personal stories), asthma (including personal stories), and rhinitis
The US National Institute of Allergy and Infectious Diseases provides information about atopic diseases
The UK not-for-profit organization Allergy UK provides information about atopic diseases and a description of the atopic march
MedlinePlus encyclopedia has pages on eczema, wheezing, and rhinitis (in English and Spanish)
MedlinePlus provides links to further resources about allergies, eczema, and asthma (in English and Spanish)
Information about ALSPAC and MAAS is available
Wikipedia has pages on machine learning and latent disease profile models (note that Wikipedia is a free online encyclopedia that anyone can edit; available in several languages)
doi:10.1371/journal.pmed.1001748
PMCID: PMC4204810  PMID: 25335105
4.  Patterns of basal signaling heterogeneity can distinguish cellular populations with different drug sensitivities 
Non small cell lung cancer H460 clones exhibit a high degree of heterogeneity in signaling states. Clones with similar patterns of basal signaling heterogeneity have similar paclitaxel sensitivities. Models of signaling heterogeneity among the clones can be used to classify sensitivity to paclitaxel for other cancer populations.
A high degree of phenotypic diversity has been classically observed among cancer cells, even within a single tumor (Heppner, 1984; Anderson et al, 2006; Ichim and Wells, 2006; Campbell and Polyak, 2007). Importantly, not all cancer cells contribute equally to disease progression or respond equally to therapeutic intervention (Campbell and Polyak, 2007). This heterogeneity has traditionally been viewed as an impediment to efficient diagnosis and treatment. Understanding the relevance of cellular diversity to cancer requires methods for relating patterns of phenotypic heterogeneity to functional outcomes, such as drug sensitivity. Recent advances in fluorescence microscopy image-based analysis have enabled quantitative single-cell measurements of the activation and (co-)localization of signaling molecules within large cellular populations (Boland and Murphy, 2001; Perlman et al, 2004). Here, we apply this technology to explore the extent to which patterns of basal signaling heterogeneity, present within cancer populations before treatment, reveal information about population-level response to drug perturbation.
To investigate basal cell signaling heterogeneity among a collection of cancer populations having minimal exogenous differences, such as those due to environment, cell type, and genetic background, we generated a collection of 49 low-passage clonal populations from the highly metastatic nonsmall cell lung cancer cell line H460 (Kozaki et al, 2000). We chose to observe patterns of spatial organization and activation for multiple components from diverse signaling pathways associated with cancer (marker sets 1–4: DNA/pSTAT3/pPTEN; DNA/pERK/pP38; DNA/E-cadherin/β-catenin/pGSK3; DNA/pAkt/H3K9-Ac).
We identified an objective set of signaling stereotypes from each marker set based on a probabilistic description of the distribution of cells in the feature space. For each marker set, a ‘reference' set of representative cells was sampled from all 50 H460 cancer populations. Then, each reference set was represented as a mixture of subpopulations modeled as Gaussian distributions with means centered on distinct, ‘stereotyped' signaling states (Slack et al, 2008). Our quantitative analysis suggested that a small collection of signaling stereotypes was sufficient to characterize the complexity of observed cellular phenotypes among all clones. For simplicity, we chose to use five subpopulations to model cellular heterogeneity in each marker set.
For each clone, we computed the fraction of cells in each of the identified subpopulations (Figure 2, scatter plots). Estimation of these fractions allowed us to represent each clone as a probabilistic ensemble of subpopulations. Visual differences among the clones (Figure 2, thumbnail images) were reflected by clear differences in subpopulation mixtures (Figure 2, scatter plots). To compare the subpopulation mixtures of each clone to the parent, a ‘subpopulation enrichment' profile vector was computed. The vector measured the log-fold change between the clone and the H460 parent population for each subpopulation (Figure 2, heat map).
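The subpopulation enrichment profile described above can be sketched as a vector of log-fold changes between clone and parent subpopulation fractions. The function names, the base-2 logarithm, and the small pseudocount (added to avoid taking the logarithm of zero) are assumptions made for illustration; the source states only that a log-fold change was measured.

```python
import math

def subpopulation_fractions(assignments, n_subpopulations):
    """Fraction of cells assigned to each subpopulation (0 .. n_subpopulations - 1)."""
    counts = [0] * n_subpopulations
    for subpopulation in assignments:
        counts[subpopulation] += 1
    return [count / len(assignments) for count in counts]

def enrichment_profile(clone_fractions, parent_fractions, pseudocount=1e-3):
    """Log2 fold change of each subpopulation fraction, clone relative to parent."""
    return [
        math.log2((clone + pseudocount) / (parent + pseudocount))
        for clone, parent in zip(clone_fractions, parent_fractions)
    ]

# Toy example with five subpopulations; the cell assignments are made up.
parent_cells = [0] * 20 + [1] * 20 + [2] * 20 + [3] * 20 + [4] * 20
clone_cells = [0] * 40 + [1] * 10 + [2] * 20 + [3] * 20 + [4] * 10

parent = subpopulation_fractions(parent_cells, 5)
clone = subpopulation_fractions(clone_cells, 5)
profile = enrichment_profile(clone, parent)
print([round(value, 2) for value in profile])
```

Positive entries mark subpopulations that are enriched in the clone relative to the parent, negative entries mark depleted ones, and values near zero indicate no change.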
We applied hierarchical clustering to group clones based on the similarity of their subpopulation enrichment profiles (Figure 2). Clustering by subpopulation enrichment profiles revealed only a small number of distinct patterns (or ‘signatures') of subpopulation mixtures (Figure 2, dendrogram and heat map). Thus, parameterization of observed cellular heterogeneity using subpopulation enrichment profiles succinctly encapsulated the apparent complexity of cancer cell phenotypes, and further allowed comparison of clonal populations at a resolution greater than provided by population means.
We next assessed the degree to which clones with distinct patterns of heterogeneity had distinct responses to the drug paclitaxel. We used a multidimensional scaling (Borg and Groenen, 1997) plot to visualize similarity among the clones and annotated each clone with the index of drug sensitivity. This visualization revealed striking geometric separation in ‘profile space' of paclitaxel-sensitive from paclitaxel-nonsensitive clones for each marker set (Figure 3A, green versus red and black circles). The significance of separation was further confirmed by machine learning-based classification studies. Thus heterogeneity of basal cellular signaling states contained information that could be used to predict sensitivity to drug treatment.
Our approach is general, and makes heterogeneity a computable property of cellular populations. Interrogation at subpopulation-resolution facilitated a dramatic reduction in the observed phenotypic complexity of cancer populations, yet retained sufficient biological information to identify drug responses. Our work suggests that rigorous analysis of cancer heterogeneity can provide a new resolution at which to match disease to more effective therapies.
Phenotypic heterogeneity has been widely observed in cellular populations. However, the extent to which heterogeneity contains biologically or clinically important information is not well understood. Here, we investigated whether patterns of basal signaling heterogeneity, in untreated cancer cell populations, could distinguish cellular populations with different drug sensitivities. We modeled cellular heterogeneity as a mixture of stereotyped signaling states, identified based on colocalization patterns of activated signaling molecules from microscopy images. We found that patterns of heterogeneity could be used to separate the most sensitive and resistant populations to paclitaxel within a set of H460 lung cancer clones and within the NCI-60 panel of cancer cell lines, but not for a set of less heterogeneous, immortalized noncancer human bronchial epithelial cell (HBEC) clones. Our results suggest that patterns of signaling heterogeneity, characterized as ensembles of a small number of distinct phenotypic states, can reveal functional differences among cellular populations.
doi:10.1038/msb.2010.22
PMCID: PMC2890326  PMID: 20461076
cancer; heterogeneity; multivariate analysis; signaling; systems biology
5.  A hierarchical Naïve Bayes Model for handling sample heterogeneity in classification problems: an application to tissue microarrays 
BMC Bioinformatics  2006;7:514.
Background
Uncertainty often affects molecular biology experiments and data for different reasons. Heterogeneity of gene or protein expression within the same tumor tissue is an example of biological uncertainty which should be taken into account when molecular markers are used in decision making. Tissue Microarray (TMA) experiments allow for large scale profiling of tissue biopsies, investigating protein patterns characterizing specific disease states. TMA studies deal with multiple sampling of the same patient, and therefore with multiple measurements of the same protein target, to account for possible biological heterogeneity. The aim of this paper is to provide and validate a classification model taking into consideration the uncertainty associated with measuring replicate samples.
Results
We propose an extension of the well-known Naïve Bayes classifier, which accounts for biological heterogeneity in a probabilistic framework, relying on Bayesian hierarchical models. The model, which can be efficiently learned from the training dataset, exploits a closed-form of classification equation, thus providing no additional computational cost with respect to the standard Naïve Bayes classifier. We validated the approach on several simulated datasets comparing its performances with the Naïve Bayes classifier. Moreover, we demonstrated that explicitly dealing with heterogeneity can improve classification accuracy on a TMA prostate cancer dataset.
Conclusion
The proposed Hierarchical Naïve Bayes classifier can be conveniently applied in problems where within sample heterogeneity must be taken into account, such as TMA experiments and biological contexts where several measurements (replicates) are available for the same biological sample. The performance of the new approach is better than the standard Naïve Bayes model, in particular when the within sample heterogeneity is different in the different classes.
doi:10.1186/1471-2105-7-514
PMCID: PMC1698579  PMID: 17125514
6.  Phenotype Recognition with Combined Features and Random Subspace Classifier Ensemble 
BMC Bioinformatics  2011;12:128.
Background
Automated, image based high-content screening is a fundamental tool for discovery in biological science. Modern robotic fluorescence microscopes are able to capture thousands of images from massively parallel experiments such as RNA interference (RNAi) or small-molecule screens. As such, efficient computational methods are required for automatic cellular phenotype identification capable of dealing with large image data sets. In this paper we investigated an efficient method for the extraction of quantitative features from images by combining second order statistics, or Haralick features, with curvelet transform. A random subspace based classifier ensemble with multiple layer perceptron (MLP) as the base classifier was then exploited for classification. Haralick features estimate image properties related to second-order statistics based on the grey level co-occurrence matrix (GLCM), which has been extensively used for various image processing applications. The curvelet transform has a more sparse representation of the image than wavelet, thus offering a description with higher time frequency resolution and high degree of directionality and anisotropy, which is particularly appropriate for many images rich with edges and curves. A combined feature description from Haralick feature and curvelet transform can further increase the accuracy of classification by taking their complementary information. We then investigate the applicability of the random subspace (RS) ensemble method for phenotype classification based on microscopy images. A base classifier is trained with a RS sampled subset of the original feature set and the ensemble assigns a class label by majority voting.
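The random subspace ensemble with majority voting can be sketched as follows. To keep the example self-contained, a nearest-centroid rule stands in for the MLP base classifier used in the paper; the function names and toy data are hypothetical.

```python
import random
from collections import Counter

def fit_nearest_centroid(rows, labels, feature_ids):
    """Base classifier (stand-in for an MLP): class centroids on a feature subset."""
    sums, counts = {}, Counter(labels)
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(feature_ids))
        for j, feature in enumerate(feature_ids):
            acc[j] += row[feature]
    centroids = {label: [v / counts[label] for v in acc] for label, acc in sums.items()}
    return feature_ids, centroids

def predict_nearest_centroid(model, row):
    """Return the label whose centroid is closest in the model's feature subset."""
    feature_ids, centroids = model
    def squared_distance(centroid):
        return sum((row[f] - centroid[j]) ** 2 for j, f in enumerate(feature_ids))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))

def fit_random_subspace_ensemble(rows, labels, n_members=5, subspace_size=2, seed=0):
    """Train each member on a randomly sampled subset of the original features."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    return [
        fit_nearest_centroid(rows, labels, rng.sample(range(n_features), subspace_size))
        for _ in range(n_members)
    ]

def predict_ensemble(ensemble, row):
    """Assign the class label that receives the most member votes."""
    votes = Counter(predict_nearest_centroid(member, row) for member in ensemble)
    return votes.most_common(1)[0][0]

# Toy data: four features, two well-separated classes.
rows = [[0, 0, 0, 0], [1, 1, 1, 1], [10, 10, 10, 10], [11, 11, 11, 11]]
labels = ["A", "A", "B", "B"]
ensemble = fit_random_subspace_ensemble(rows, labels)
print(predict_ensemble(ensemble, [0.5, 0.5, 0.5, 0.5]))
```

Every member sees all training samples but only a subset of features, which is what distinguishes the random subspace method from bagging, where samples rather than features are resampled.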
Results
Experimental results on phenotype recognition from three benchmarking image sets, HeLa, CHO and RNAi, show the effectiveness of the proposed approach. The combined feature is better than any individual one in classification accuracy. The ensemble model produces better classification performance than the component neural networks. For the three image sets HeLa, CHO and RNAi, the Random Subspace Ensembles offer classification rates of 91.20%, 98.86% and 91.03% respectively, which contrast sharply with the published results of 84%, 93% and 82% from a multi-purpose image classifier, WND-CHARM, which applied wavelet transforms and other feature extraction methods. We investigated the problem of estimating ensemble parameters and found that satisfactory performance improvement could be achieved with a relatively moderate dimensionality of feature subsets and a small ensemble size.
Conclusions
The characteristics of curvelet transform of being multiscale and multidirectional suit the description of microscopy images very well. It is empirically demonstrated that the curvelet-based feature is clearly preferred to wavelet-based feature for bioimage descriptions. The random subspace ensemble of MLPs is much better than a number of commonly applied multi-class classifiers in the investigated application of phenotype recognition.
doi:10.1186/1471-2105-12-128
PMCID: PMC3098787  PMID: 21529372
7.  Locally dependent latent class models with covariates: an application to under-age drinking in the USA 
Summary
Under-age drinking is a long-standing public health problem in the USA and the identification of underage drinkers suffering alcohol-related problems has been difficult by using diagnostic criteria that were developed in adult populations. For this reason, it is important to characterize patterns of drinking in adolescents that are associated with alcohol-related problems. Latent class analysis is a statistical technique for explaining heterogeneity in individual response patterns in terms of a smaller number of classes. However, the latent class analysis assumption of local independence may not be appropriate when examining behavioural profiles and could have implications for statistical inference. In addition, if covariates are included in the model, non-differential measurement is also assumed. We propose a flexible set of models for local dependence and differential measurement that use easily interpretable odds ratio parameterizations while simultaneously fitting a marginal regression model for the latent class prevalences. Estimation is based on solving a set of second-order estimating equations. This approach requires only specification of the first two moments and allows for the choice of simple ‘working’ covariance structures. The method is illustrated by using data from a large-scale survey of under-age drinking. This new approach indicates the effectiveness of introducing local dependence and differential measurement into latent class models for selecting substantively interpretable models over more complex models that are deemed empirically superior.
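For orientation, the standard latent class model that the proposed approach relaxes can be sketched as below. Under local independence, the probability of a response pattern within a class is the product of the item probabilities, and class membership follows from Bayes' rule. This is the conventional model, not the authors' locally dependent extension, and all numbers are made up.

```python
def class_posteriors(response, prevalences, item_probabilities):
    """Posterior class membership under local independence.

    response: list of 0/1 item responses for one individual.
    prevalences: prior probability of each latent class.
    item_probabilities: per class, probability of a positive response on each item.
    """
    joint = []
    for prevalence, probabilities in zip(prevalences, item_probabilities):
        likelihood = 1.0
        for answer, p in zip(response, probabilities):
            # Local independence: items multiply within a class.
            likelihood *= p if answer == 1 else 1.0 - p
        joint.append(prevalence * likelihood)
    total = sum(joint)
    return [value / total for value in joint]

# Hypothetical two-class model with three binary drinking-related items.
prevalences = [0.7, 0.3]
item_probabilities = [
    [0.1, 0.2, 0.1],  # class 0: low endorsement of each item
    [0.8, 0.7, 0.9],  # class 1: high endorsement of each item
]
posterior = class_posteriors([1, 1, 1], prevalences, item_probabilities)
print([round(p, 3) for p in posterior])
```

If items remain correlated within a class, the product in the inner loop misstates the likelihood, which is the motivation for modelling local dependence explicitly.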
doi:10.1111/j.1467-985X.2008.00544.x
PMCID: PMC2600526  PMID: 19079793
Differential measurement; Latent class; Local dependence; Marginal regression; Odds ratio; Second-order estimating equations
8.  Metabolic network reconstruction of Chlamydomonas offers insight into light-driven algal metabolism 
A comprehensive genome-scale metabolic network of Chlamydomonas reinhardtii, including a detailed account of light-driven metabolism, is reconstructed and validated. The model provides a new resource for research of C. reinhardtii metabolism and in algal biotechnology.
The genome-scale metabolic network of Chlamydomonas reinhardtii (iRC1080) was reconstructed, accounting for >32% of the estimated metabolic genes encoded in the genome, and including extensive details of lipid metabolic pathways. This is the first metabolic network to explicitly account for stoichiometry and wavelengths of metabolic photon usage, providing a new resource for research of C. reinhardtii metabolism and developments in algal biotechnology. Metabolic functional annotation and the largest transcript verification of a metabolic network to date was performed, at least partially verifying >90% of the transcripts accounted for in iRC1080. Analysis of the network supports hypotheses concerning the evolution of latent lipid pathways in C. reinhardtii, including very long-chain polyunsaturated fatty acid and ceramide synthesis pathways. A novel approach for modeling light-driven metabolism was developed that accounts for both light source intensity and spectral quality of emitted light. The constructs resulting from this approach, termed prism reactions, were shown to significantly improve the accuracy of model predictions, and their use was demonstrated for evaluation of light source efficiency and design.
Algae have garnered significant interest in recent years, especially for their potential application in biofuel production. The hallmark, model eukaryotic microalgae Chlamydomonas reinhardtii has been widely used to study photosynthesis, cell motility and phototaxis, cell wall biogenesis, and other fundamental cellular processes (Harris, 2001). Characterizing algal metabolism is key to engineering production strains and understanding photobiological phenomena. Based on extensive literature on C. reinhardtii metabolism, its genome sequence (Merchant et al, 2007), and gene functional annotation, we have reconstructed and experimentally validated the genome-scale metabolic network for this alga, iRC1080, the first network to account for detailed photon absorption permitting growth simulations under different light sources. iRC1080 accounts for 1080 genes, associated with 2190 reactions and 1068 unique metabolites and encompasses 83 subsystems distributed across 10 cellular compartments (Figure 1A). Its >32% coverage of estimated metabolic genes is a tremendous expansion over previous algal reconstructions (Boyle and Morgan, 2009; Manichaikul et al, 2009). The lipid metabolic pathways of iRC1080 are considerably expanded relative to existing networks, and chemical properties of all metabolites in these pathways are accounted for explicitly, providing sufficient detail to completely specify all individual molecular species: backbone molecule and stereochemical numbering of acyl-chain positions; acyl-chain length; and number, position, and cis–trans stereoisomerism of carbon–carbon double bonds. Such detail in lipid metabolism will be critical for model-driven metabolic engineering efforts.
We experimentally verified transcripts accounted for in the network under permissive growth conditions, detecting >90% of tested transcript models (Figure 1B) and providing validating evidence for the contents of iRC1080. We also analyzed the extent of transcript verification by specific metabolic subsystems. Some subsystems stood out as more poorly verified, including chloroplast and mitochondrial transport systems and sphingolipid metabolism, all of which exhibited <80% of transcripts detected, reflecting incomplete characterization of compartmental transporters and supporting a hypothesis of latent pathway evolution for ceramide synthesis in C. reinhardtii. Additional lines of evidence from the reconstruction effort similarly support this hypothesis including lack of ceramide synthetase and other annotation gaps downstream in sphingolipid metabolism. A similar hypothesis of latent pathway evolution was established for very long-chain fatty acids (VLCFAs) and their polyunsaturated analogs (VLCPUFAs) (Figure 1C), owing to the absence of this class of lipids in previous experimental measurements, lack of a candidate VLCFA elongase in the functional annotation, and additional downstream annotation gaps in arachidonic acid metabolism.
The network provides a detailed account of metabolic photon absorption by light-driven reactions, including photosystems I and II, light-dependent protochlorophyllide oxidoreductase, provitamin D3 photoconversion to vitamin D3, and rhodopsin photoisomerase; this network accounting permits the precise modeling of light-dependent metabolism. iRC1080 accounts for effective light spectral ranges through analysis of biochemical activity spectra (Figure 3A), either reaction activity or absorbance at varying light wavelengths. Defining effective spectral ranges associated with each photon-utilizing reaction enabled our network to model growth under different light sources via stoichiometric representation of the spectral composition of emitted light, termed prism reactions. Coefficients for different photon wavelengths in a prism reaction correspond to the ratios of photon flux in the defined effective spectral ranges to the total emitted photon flux from a given light source (Figure 3B). This approach distinguishes the amount of emitted photons that drive different metabolic reactions. We created prism reactions for most light sources that have been used in published studies for algal and plant growth including solar light, various light bulbs, and LEDs. We also included regulatory effects, resulting from lighting conditions insofar as published studies enabled. Light and dark conditions have been shown to affect metabolic enzyme activity in C. reinhardtii on multiple levels: transcriptional regulation, chloroplast RNA degradation, translational regulation, and thioredoxin-mediated enzyme regulation. Through application of our light model and prism reactions, we were able to closely recapitulate experimental growth measurements under solar, incandescent, and red LED lights. Through unbiased sampling, we were able to establish the tremendous statistical significance of the accuracy of growth predictions achievable through implementation of prism reactions. 
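The coefficient calculation for a prism reaction, as described above, amounts to dividing the photon flux inside each effective spectral range by the total emitted photon flux. The sketch below illustrates this ratio only; the spectrum values, range boundaries, and range names are invented for the example and are not taken from iRC1080.

```python
def prism_coefficients(spectrum, effective_ranges):
    """Ratio of photon flux in each effective spectral range to total emitted flux.

    spectrum: mapping of wavelength (nm) to emitted photon flux at that wavelength.
    effective_ranges: mapping of a range name to (lower_nm, upper_nm), inclusive.
    """
    total_flux = sum(spectrum.values())
    coefficients = {}
    for name, (lower, upper) in effective_ranges.items():
        flux_in_range = sum(
            flux for wavelength, flux in spectrum.items() if lower <= wavelength <= upper
        )
        coefficients[name] = flux_in_range / total_flux
    return coefficients

# Hypothetical light source sampled at three wavelengths (arbitrary flux units).
spectrum = {450: 10.0, 550: 30.0, 680: 60.0}
# Hypothetical effective ranges for two photon-utilizing reactions.
effective_ranges = {"blue_range": (400, 500), "red_range": (650, 700)}

coefficients = prism_coefficients(spectrum, effective_ranges)
print(coefficients)
```

Photons outside every effective range (here the 550 nm flux) contribute to the total but drive no reaction, which is how the representation separates usable from unusable light for a given source.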
Finally, application of the photosynthetic model was demonstrated prospectively to evaluate light utilization efficiency under different light sources. The results suggest that, of the existing light sources, red LEDs provide the greatest efficiency, about three times as efficient as sunlight. Extending this analysis, the model was applied to design a maximally efficient LED spectrum for algal growth. The result was a 677-nm peak LED spectrum with a total incident photon flux of 360 μE/m2/s, suggesting that for the simple objective of maximizing growth efficiency, LED technology has already reached an effective theoretical optimum.
In summary, the C. reinhardtii metabolic network iRC1080 that we have reconstructed offers insight into the basic biology of this species and may be employed prospectively for genetic engineering design and light source design relevant to algal biotechnology. iRC1080 was used to analyze lipid metabolism and generate novel hypotheses about the evolution of latent pathways. The predictive capacity of metabolic models developed from iRC1080 was demonstrated in simulating mutant phenotypes and in evaluation of light source efficiency. Our network provides a broad knowledgebase of the biochemistry and genomics underlying global metabolism of a photoautotroph, and our modeling approach for light-driven metabolism exemplifies how integration of largely unvisited data types, such as physicochemical environmental parameters, can expand the diversity of applications of metabolic networks.
Metabolic network reconstruction encompasses existing knowledge about an organism's metabolism and genome annotation, providing a platform for omics data analysis and phenotype prediction. The model alga Chlamydomonas reinhardtii is employed to study diverse biological processes from photosynthesis to phototaxis. Recent heightened interest in this species results from an international movement to develop algal biofuels. Integrating biological and optical data, we reconstructed a genome-scale metabolic network for this alga and devised a novel light-modeling approach that enables quantitative growth prediction for a given light source, resolving wavelength and photon flux. We experimentally verified transcripts accounted for in the network and physiologically validated model function through simulation and generation of new experimental growth data, providing high confidence in network contents and predictive applications. The network offers insight into algal metabolism and potential for genetic engineering and efficient light source design, a pioneering resource for studying light-driven metabolism and quantitative systems biology.
doi:10.1038/msb.2011.52
PMCID: PMC3202792  PMID: 21811229
Chlamydomonas reinhardtii; lipid metabolism; metabolic engineering; photobioreactor
9.  HIV, Gender, Race, Sexual Orientation, and Sex Work: A Qualitative Study of Intersectional Stigma Experienced by HIV-Positive Women in Ontario, Canada 
PLoS Medicine  2011;8(11):e1001124.
Mona Loutfy and colleagues used focus groups to examine experiences of stigma and coping strategies among HIV-positive women in Ontario, Canada.
Background
HIV infection rates are increasing among marginalized women in Ontario, Canada. HIV-related stigma, a principal factor contributing to the global HIV epidemic, interacts with structural inequities such as racism, sexism, and homophobia. The study objective was to explore experiences of stigma and coping strategies among HIV-positive women in Ontario, Canada.
Methods and Findings
We conducted a community-based qualitative investigation using focus groups to understand experiences of stigma and discrimination and coping methods among HIV-positive women from marginalized communities. We conducted 15 focus groups with HIV-positive women in five cities across Ontario, Canada. Data were analyzed using thematic analysis to enhance understanding of the lived experiences of diverse HIV-positive women. Focus group participants (n = 104; mean age = 38 years; 69% ethnic minority; 23% lesbian/bisexual; 22% transgender) described stigma/discrimination and coping across micro (intra/interpersonal), meso (social/community), and macro (organizational/political) realms. Participants across focus groups attributed experiences of stigma and discrimination to: HIV-related stigma, sexism and gender discrimination, racism, homophobia and transphobia, and involvement in sex work. Coping strategies included resilience (micro), social networks and support groups (meso), and challenging stigma (macro).
Conclusions
HIV-positive women described interdependent and mutually constitutive relationships between marginalized social identities and inequities such as HIV-related stigma, sexism, racism, and homo/transphobia. These overlapping, multilevel forms of stigma and discrimination are representative of an intersectional model of stigma and discrimination. The present findings also suggest that micro, meso, and macro level factors simultaneously present barriers to health and well being—as well as opportunities for coping—in HIV-positive women's lives. Understanding the deleterious effects of stigma and discrimination on HIV risk, mental health, and access to care among HIV-positive women can inform health care provision, stigma reduction interventions, and public health policy.
Editors' Summary
Background
HIV-related stigma and discrimination—prejudice, negative attitudes, abuse, and maltreatment directed at people living with HIV—is a major factor contributing to the global HIV epidemic. HIV-related stigma, which devalues and stereotypes people living with HIV, increases vulnerability to HIV infection by reducing access to HIV prevention, testing, treatment, and support. At the personal (micro) level, HIV-related stigma can make it hard for people to take tests to determine their HIV status or to tell other people that they are HIV positive. At the social/community (meso) level, it can mean that HIV-positive people are ostracized from their communities. At the organizational/political (macro) level, it can mean that health-care workers treat HIV-positive people differently and that governments are deterred from taking fast, effective action against the HIV epidemic. In addition, HIV-related stigma is negatively associated with well-being among people living with HIV. Thus, among HIV-positive people, those who have experienced HIV-related stigma have higher levels of mental and physical illness.
Why Was This Study Done?
Racism (oppression and inequity founded on ethno-racial differences), sexism and gender discrimination (oppression and inequity based on gender bias in attitudes), and homophobia and transphobia (discrimination, fear, hostility, and violence towards nonheterosexual and transgender people, respectively) can also affect access to HIV services. However, little is known about how these different forms of stigma and discrimination interact (intersect). A better understanding of the effect of intersecting stigmas on people living with HIV could help in the development of stigma reduction interventions and HIV prevention, treatment and care programs, and could help to control global HIV infection rates. In this qualitative study (an analysis of people's attitudes and experiences rather than numerical data), the researchers investigate the intersection of HIV-related stigma, racism, sexism and gender discrimination, homophobia and transphobia among marginalized HIV-positive women in Ontario, Canada. As elsewhere in the world, HIV infection rates are increasing among women in Canada. Nearly 25% of people living with HIV in Canada are women and about a quarter of all new infections are in women. Moreover, there is a disproportionately high infection rate among marginalized women in Canada such as sex workers and lesbian, bisexual, and queer women.
What Did the Researchers Do and Find?
The researchers held 15 focus groups with 104 marginalized HIV-positive women who were recruited by word-of-mouth and through flyers circulated in community agencies serving women of diverse ethno-cultural origins. Each focus group explored topics that included challenges in daily life, medical issues and needs, and issues that were silenced within the participants' communities. The researchers analyzed the data from these focus groups using thematic analysis, an approach that identifies, analyzes, and reports themes in qualitative data. They found that women living with HIV in Ontario experienced multiple types of stigma at different levels. So, for example, women experienced HIV-related stigma at the micro (“If you're HIV-positive, you feel shameful”), meso (“The thing I hate most for people that test positive for HIV is that society ostracizes them”), and macro (“A lot of women are not getting employed because they have to disclose their status”) levels. The women also attributed their experiences of stigma and discrimination to sexism and gender discrimination, racism, homophobia and transphobia, and involvement in sex work at all three levels and described coping strategies at the micro (resilience; “I always live with hope”), meso (participation in social networks), and macro (challenging stigma) levels.
What Do These Findings Mean?
These findings indicate that marginalized HIV-positive women living in Ontario experience overlapping forms of stigma and discrimination and that these forms of stigma operate over micro, meso, and macro levels, as do the coping strategies adopted by the women. Together, these results support an intersectional model of stigma and discrimination that should help to inform discussions about the complexity of stigma and coping strategies. However, because only a small sample of nonrandomly selected women was involved in this study, these findings need to be confirmed in other groups of HIV-positive women. If confirmed, the complex system of interplay of different forms of stigma revealed here should help to inform health-care provision, stigma reduction interventions, and public-health policy, and could, ultimately, help to bring the global HIV epidemic under control.
Additional Information
Please access these Web sites via the online version of this summary at http://dx.doi.org/10.1371/journal.pmed.1001124.
Information is available from the US National Institute of Allergy and Infectious Diseases on HIV infection and AIDS
NAM/aidsmap basic information about HIV/AIDS, and summaries of recent research findings on HIV care and treatment; its publication HIV and stigma deals with HIV-related stigma in the UK
Information is available from Avert, an international AIDS charity on many aspects of HIV/AIDS, including information on women, HIV, and AIDS, on HIV and AIDS stigma and discrimination, and on HIV/AIDS statistics for Canada (in English and Spanish)
The People Living with Stigma Index to address stigma relating to HIV and advocate on key barriers and issues perpetuating stigma; it has recently published Piecing it together for women and girls, the gender dimensions of HIV-related stigma; its website will soon include a selection of individual stories about HIV-related stigma
Patient stories about living with HIV/AIDS are available through Avert and through the charity website Healthtalkonline
doi:10.1371/journal.pmed.1001124
PMCID: PMC3222645  PMID: 22131907
10.  Sufficiency of FNAB aspirates of posterior uveal melanoma for cytologic versus GEP classification in 159 patients, and relative prognostic significance of these classifications 
Objective
To determine the relative sufficiency of paired aspirates of posterior uveal melanomas obtained by FNAB for cytopathology and GEP, and their prognostic significance for predicting death from metastasis.
Methods
Prospective non-randomized IRB-approved single-center longitudinal clinical study of 159 patients with posterior uveal melanoma sampled by FNAB in at least two tumor sites between 09/2007 and 12/2010. Cases were analyzed with regard to sufficiency of the obtained aspirates for cytopathologic classification and GEP classification. Statistical strength of associations between variables and GEP class was computed using Chi-square test. Cumulative actuarial survival curves of subgroups of these patients based on their cytopathologic versus GEP-assigned categories were computed by the Kaplan–Meier method. The endpoint for this survival analysis was death from metastatic uveal melanoma.
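The Kaplan-Meier method named above can be sketched in a few lines. The follow-up times and event flags below are invented for illustration and are not data from this study; the function name `kaplan_meier` is likewise hypothetical.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times:  follow-up time for each patient
    events: 1 if the endpoint (e.g. metastatic death) occurred, 0 if censored
    Returns a list of (time, survival_probability) at each distinct event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients whose follow-up ends at the same time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


# Hypothetical follow-up in months; 1 = event, 0 = censored.
toy_times = [6, 12, 12, 20, 30, 36]
toy_events = [1, 0, 1, 0, 1, 0]
curve = kaplan_meier(toy_times, toy_events)
```

The cumulative probability of the event by a given time is one minus the survival value at the last event time on or before it.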
Results
FNAB aspirates were insufficient for cytopathologic classification in 34 of 159 cases (21.9 %). In contrast, FNAB aspirates were insufficient for GEP classification in only one of 159 cases (0.6 %). This difference is statistically significant (P < 0.001). Six of 34 tumors (17.6 %) that yielded an insufficient aspirate for cytopathologic diagnosis were categorized as GEP class 2, while 43 of 125 tumors (34.7 %) that yielded a sufficient aspirate for cytopathologic diagnosis were categorized as GEP class 2. To date, 14 of the 49 patients with a GEP class 2 tumor (28.6 %) but only five of the 109 patients with a GEP class 1 tumor (5.6 %) have developed metastasis. Fifteen of 125 patients (12 %) whose tumors yielded sufficient aspirates for cytopathologic classification, compared with four of 34 patients (11.8 %) whose tumors yielded insufficient aspirates for cytopathologic classification, developed metastasis. The median post-biopsy follow-up time for surviving patients in this series was 32.5 months. The cumulative actuarial 5-year probability of death from metastasis was 14.1 % for those with an insufficient aspirate for cytopathologic classification versus 22.4 % for those with a sufficient aspirate for cytopathologic classification (log rank P = 0.68). In contrast, the cumulative actuarial 5-year probability of metastatic death was 8.0 % for those with an insufficient/unsatisfactory aspirate for GEP classification or GEP class 1 tumor, versus 45.0 % for those with a GEP class 2 tumor (log rank P = 0.005).
Conclusion
This study confirmed that GEP classification of posterior uveal melanoma cells obtained by FNAB is feasible in almost all cases, including most in which FNAB yields an insufficient aspirate for cytodiagnosis. The study also confirmed that GEP classification is substantially better than cytologic classification for predicting subsequent metastasis and metastatic death.
doi:10.1007/s00417-013-2515-0
PMCID: PMC3889697  PMID: 24270974
Melanoma; Uveal neoplasm; Choroidal melanoma; Cytology; Biopsy, needle/methods; Gene expression profile; Survival prognosis; Melanoma/metastasis
11.  Latent Class Analysis With Distal Outcomes: A Flexible Model-Based Approach 
Although prediction of class membership from observed variables in latent class analysis is well understood, predicting an observed distal outcome from latent class membership is more complicated. A flexible model-based approach is proposed to empirically derive and summarize the class-dependent density functions of distal outcomes with categorical, continuous, or count distributions. A Monte Carlo simulation study is conducted to compare the performance of the new technique to two commonly used classify-analyze techniques: maximum-probability assignment and multiple pseudo-class draws. Simulation results show that the model-based approach produces substantially less biased estimates of the effect compared to either classify-analyze technique, particularly when the association between the latent class variable and the distal outcome is strong. In addition, we show that only the model-based approach is consistent. The approach is demonstrated empirically: latent classes of adolescent depression are used to predict smoking, grades, and delinquency. SAS syntax for implementing this approach using PROC LCA and a corresponding macro are provided.
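The two classify-analyze techniques used as comparators above can be sketched as follows. This is only an illustration of maximum-probability assignment and pseudo-class draws from posterior class probabilities; the posterior values and function names are hypothetical, and the sketch does not implement the model-based approach or PROC LCA.

```python
import random


def max_probability_assignment(posteriors):
    """Assign each person to the class with the highest posterior probability."""
    return [max(range(len(p)), key=p.__getitem__) for p in posteriors]


def pseudo_class_draws(posteriors, n_draws, seed=0):
    """Draw class membership at random from each person's posterior, n_draws times."""
    rng = random.Random(seed)
    classes = list(range(len(posteriors[0])))
    return [
        [rng.choices(classes, weights=p)[0] for p in posteriors]
        for _ in range(n_draws)
    ]


# Hypothetical posterior class probabilities for three people and two classes.
toy_posteriors = [[0.9, 0.1], [0.4, 0.6], [1.0, 0.0]]
modal = max_probability_assignment(toy_posteriors)
draws = pseudo_class_draws(toy_posteriors, n_draws=5)
```

In either technique the distal outcome is then analyzed against the assigned classes as if they were observed, which is the step the abstract reports as a source of bias.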
doi:10.1080/10705511.2013.742377
PMCID: PMC4240499  PMID: 25419096
latent class analysis; distal outcome; finite mixture model; pseudo-class draws
12.  Text mining for the Vaccine Adverse Event Reporting System: medical text classification using informative feature selection 
Objective
The US Vaccine Adverse Event Reporting System (VAERS) collects spontaneous reports of adverse events following vaccination. Medical officers review the reports and often apply standardized case definitions, such as those developed by the Brighton Collaboration. Our objective was to demonstrate a multi-level text mining approach for automated text classification of VAERS reports that could potentially reduce human workload.
Design
We selected 6034 VAERS reports for H1N1 vaccine that were classified by medical officers as potentially positive (Npos=237) or negative for anaphylaxis. We created a categorized corpus of text files that included the class label and the symptom text field of each report. A validation set of 1100 labeled text files was also used. Text mining techniques were applied to extract three feature sets for important keywords, low- and high-level patterns. A rule-based classifier processed the high-level feature representation, while several machine learning classifiers were trained for the remaining two feature representations.
Measurements
Classifiers' performance was evaluated by macro-averaging recall, precision, and F-measure, and Friedman's test; misclassification error rate analysis was also performed.
Results
The rule-based classifier, boosted trees, and weighted support vector machines performed well in terms of macro-recall, but at the expense of a higher mean misclassification error rate. The rule-based classifier performed very well in terms of average sensitivity and specificity (79.05% and 94.80%, respectively).
Conclusion
Our validated results showed the possibility of developing effective medical text classifiers for VAERS reports by combining text mining with informative feature selection; this strategy has the potential to reduce reviewer workload considerably.
doi:10.1136/amiajnl-2010-000022
PMCID: PMC3168300  PMID: 21709163
Medical informatics; text mining, data mining; longitudinal data analysis; classification and clustering algorithms; adverse event identification; text
13.  Classification of adults suffering from typical gastroesophageal reflux disease symptoms: contribution of latent class analysis in a European observational study 
BMC Gastroenterology  2014;14:112.
Background
As illustrated by the Montreal classification, gastroesophageal reflux disease (GERD) is much more than heartburn, and patients constitute a heterogeneous group. Understanding whether links exist between patients’ characteristics and GERD symptoms, and classifying subjects based on symptom profile, could help to better understand, diagnose, and treat GERD. The aim of this study was to identify distinct classes of GERD patients according to symptom profiles, using a specific statistical tool: latent class analysis.
Methods
An observational single-visit study was conducted in 5 European countries in 7700 adults with typical symptoms. A latent class analysis was performed to identify “latent classes” and was applied to 12 indicator symptoms.
Results
Among the 7434 subjects with non-missing indicators, latent class analysis yielded 5 latent classes. Class 1 grouped the highest severity of typical GERD symptoms during day and night, more digestive and non-digestive GERD symptoms, and poor sleep quality. Class 3 represented less frequent and less severe digestive and non-digestive GERD symptoms, and better sleep quality than in class 1. In class 2, only typical GERD symptoms at night occurred. Classes 4 and 5 represented daytime and nighttime regurgitation. In class 4, heartburn and more atypical digestive symptoms were also identified. Multinomial logistic regression showed that country, age, sex, smoking, alcohol use, low-fat diet, waist circumference, recent weight gain (>5 kg), elevated triglycerides, metabolic syndrome, and medical GERD treatment had a significant effect on latent classes.
Conclusion
Latent class analysis classified GERD patients based on symptom profiles, which were related to patients’ characteristics. Although further studies considering these proposed classes have to be conducted to determine the reproducibility of this classification, this new tool might contribute to better management and follow-up of patients with GERD.
doi:10.1186/1471-230X-14-112
PMCID: PMC4094535  PMID: 24969728
GERD; Acid related disease; Adults; Latent class analysis; Symptoms; Classification
14.  Computer-assisted lip diagnosis on traditional Chinese medicine using multi-class support vector machines 
Background
In Traditional Chinese Medicine (TCM), lip diagnosis is an important diagnostic method that has a long history and is widely applied. The lip color of a person is considered a symptom that reflects the physical condition of organs in the body. However, the traditional diagnostic approach is mainly based on observation with the doctor’s naked eye, which is non-quantitative and subjective. This non-quantitative approach depends largely on the doctor’s experience and affects the accuracy of diagnosis and treatment in TCM. Developing new quantification methods to identify the exact syndrome based on the lip diagnosis of TCM has become urgent and important. In this paper, we design a computer-assisted classification model to provide an automatic and quantitative approach for the diagnosis of TCM based on lip images.
Methods
A computer-assisted classification method is designed and applied for syndrome diagnosis based on lip images. Our purpose is to classify the lip images into four groups: deep-red, red, purple and pale. The proposed scheme consists of four steps: lip image preprocessing, image feature extraction, feature selection and classification. The 84 extracted features comprise lip color space components, texture features and moment features. Feature subset selection is performed using SVM-RFE (Support Vector Machine with recursive feature elimination), mRMR (minimum Redundancy Maximum Relevance) and IG (information gain). The classification model is constructed from the collected lip image features using multi-class SVM and Weighted multi-class SVM (WSVM). In addition, we compare the diagnostic performance of SVM with that of the k-nearest neighbor (kNN) algorithm, the Multiple Asymmetric Partial Least Squares Classifier (MAPLSC) and Naïve Bayes. Consent was obtained from the participants for all displayed face images.
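Of the three feature-selection criteria named above, information gain (IG) is the simplest to write down. The sketch below is a generic illustration for a discrete feature, not the authors' implementation: the toy labels and feature values are invented, and it assumes features have already been discretized (the study's image features would be continuous).

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy after splitting on a discrete feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical lip-color labels and two hypothetical discretized features.
toy_labels = ["red", "red", "pale", "pale"]
informative = ["high", "high", "low", "low"]
uninformative = ["a", "b", "a", "b"]
ig_informative = information_gain(informative, toy_labels)
ig_uninformative = information_gain(uninformative, toy_labels)
```

Ranking features by this score and keeping the top few is one common way to use IG as a filter before training a classifier.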
Results
A total of 257 lip images are collected for the modeling of lip diagnosis in TCM. The feature selection method SVM-RFE selects 9 important features, composed of 5 color component features, 3 texture features and 1 moment feature. SVM, MAPLSC, Naïve Bayes and kNN showed better classification results based on the 9 selected features than on all 84 features. The total classification accuracy of the five methods is 84%, 81%, 79% and 81%, 77%, respectively. SVM therefore achieves the best classification accuracy. The classification accuracy of SVM is 81%, 71%, 89% and 86% on the Deep-red, Pale, Purple and Red lip image models, respectively. With the feature selection algorithms mRMR and IG, WSVM achieves the best total classification accuracy. Overall, the results show that the system achieves its best classification accuracy when SVM classifiers are combined with the SVM-RFE feature selection algorithm.
Conclusions
A diagnostic system is proposed that first segments the lip from the original facial image based on the Chan-Vese level set model and the Otsu method, then extracts three kinds of features (color space features, Haralick co-occurrence features and Zernike moment features) from the lip image. SVM-RFE is adopted to select the optimal features. Finally, SVM is applied to classify the four classes. We also compare different feature selection algorithms and classifiers to verify our system. The developed automatic and quantitative diagnosis system is effective in distinguishing four lip image classes: Deep-red, Purple, Red and Pale. This study puts forward a new method and idea for the quantitative examination of lip diagnosis in TCM, and provides a template for objective diagnosis in TCM.
doi:10.1186/1472-6882-12-127
PMCID: PMC3522569  PMID: 22898352
Traditional chinese medicine; Computer-assisted lip diagnosis; Image analysis; Feature selection; Support vector machine
15.  A full Bayesian hierarchical mixture model for the variance of gene differential expression 
BMC Bioinformatics  2007;8:124.
Background
In many laboratory-based high throughput microarray experiments, there are very few replicates of gene expression levels. Thus, estimates of gene variances are inaccurate. Visual inspection of graphical summaries of these data usually reveals that heteroscedasticity is present, and the standard approach to address this is to take a log2 transformation. In such circumstances, it is then common to assume that gene variability is constant when an analysis of these data is undertaken. However, this is perhaps too stringent an assumption. More careful inspection reveals that the simple log2 transformation does not remove the problem of heteroscedasticity. An alternative strategy is to assume independent gene-specific variances; although again this is problematic as variance estimates based on few replications are highly unstable. More meaningful and reliable comparisons of gene expression might be achieved, for different conditions or different tissue samples, where the test statistics are based on accurate estimates of gene variability; a crucial step in the identification of differentially expressed genes.
Results
We propose a Bayesian mixture model, which classifies genes according to similarity in their variance. The result is that genes in the same latent class share a similar variance, estimated from a larger number of replicates than those available for a single gene, i.e. the total of all replicates of all genes in the same latent class. An example dataset, consisting of 9216 genes with four replicates per condition, resulted in four latent classes based on the similarity of their variances.
Conclusion
The mixture variance model provides a realistic and flexible estimate for the variance of gene expression data under limited replicates. We believe that by using the latent class variances, estimated from a larger number of genes in each derived latent group, the p-values obtained are more robust than those obtained using either a constant or a gene-specific variance estimate.
doi:10.1186/1471-2105-8-124
PMCID: PMC1876253  PMID: 17439644
16.  Automating the Assignment of Diagnosis Codes to Patient Encounters Using Example-based and Machine Learning Techniques 
Objective
Human classification of diagnoses is a labor-intensive process that consumes significant resources. Most medical practices use specially trained medical coders to categorize diagnoses for billing and research purposes.
Methods
We have developed an automated coding system designed to assign codes to clinical diagnoses. The system uses the notion of certainty to recommend subsequent processing. Codes with the highest certainty are generated by matching the diagnostic text to frequent examples in a database of 22 million manually coded entries. These code assignments are not subject to subsequent manual review. Codes at a lower certainty level are assigned by matching to previously infrequently coded examples. The least certain codes are generated by a naïve Bayes classifier. The latter two types of codes are subsequently manually reviewed.
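The three certainty tiers described above can be sketched as a simple dispatch. Everything concrete here is hypothetical: the function name `assign_code`, the frequency threshold, the placeholder codes, and the stand-in fallback (the abstract names a naïve Bayes classifier for that tier, which is not implemented here).

```python
def assign_code(diagnosis_text, example_counts, frequent_threshold, fallback_classifier):
    """Return (code, tier, needs_manual_review) for one diagnosis string.

    example_counts maps normalized text -> (code, number of prior manual codings).
    """
    key = diagnosis_text.strip().lower()
    match = example_counts.get(key)
    if match is not None:
        code, count = match
        if count >= frequent_threshold:
            # Highest certainty: frequent example, no manual review.
            return code, "frequent_example", False
        # Lower certainty: infrequently coded example, reviewed manually.
        return code, "infrequent_example", True
    # Least certain: fall back to a statistical classifier, reviewed manually.
    return fallback_classifier(key), "classifier", True


# Hypothetical example database and placeholder codes.
toy_examples = {
    "type 2 diabetes": ("CODE_A", 500),
    "rare presentation": ("CODE_B", 2),
}


def toy_fallback(text):
    """Stand-in for a trained classifier; always returns one placeholder code."""
    return "CODE_UNKNOWN"


r1 = assign_code("Type 2 Diabetes ", toy_examples, 50, toy_fallback)
r2 = assign_code("rare presentation", toy_examples, 50, toy_fallback)
r3 = assign_code("never seen before", toy_examples, 50, toy_fallback)
```

Routing only the two lower tiers to review is what lets the workload shift from coding to verification.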
Measurements
Standard information retrieval accuracy measurements of precision, recall and f-measure were used. Micro- and macro-averaged results were computed.
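The difference between micro- and macro-averaging can be made concrete with a small sketch. The per-class counts below are invented; the sketch computes macro-averages as the unweighted mean of per-class scores and micro-averages from pooled counts.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-measure from true/false positives and false negatives."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def macro_average(counts):
    """Unweighted mean of per-class precision, recall and F-measure."""
    per_class = [precision_recall_f(*c) for c in counts.values()]
    n = len(per_class)
    return tuple(sum(m[i] for m in per_class) / n for i in range(3))


def micro_average(counts):
    """Precision, recall and F-measure computed from counts pooled over classes."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return precision_recall_f(tp, fp, fn)


# Hypothetical (tp, fp, fn) counts for one common and one rare class.
toy_counts = {"common_code": (90, 10, 10), "rare_code": (1, 1, 3)}
macro = macro_average(toy_counts)
micro = micro_average(toy_counts)
```

Macro-averaging gives the rare class the same weight as the common one, so it is pulled down by poor performance on rare codes, while micro-averaging is dominated by the frequent class.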
Results
At least 48% of all EMR problem list entries at the Mayo Clinic can be automatically classified with macro-averaged 98.0% precision, 98.3% recall and an f-score of 98.2%. An additional 34% of the entries are classified with macro-averaged 90.1% precision, 95.6% recall and 93.1% f-score. The remaining 18% of the entries are classified with macro-averaged 58.5%.
Conclusion
Over two thirds of all diagnoses are coded automatically with high accuracy. The system has been successfully implemented at the Mayo Clinic, which resulted in a reduction of staff engaged in manual coding from thirty-four coders to seven verifiers.
doi:10.1197/jamia.M2077
PMCID: PMC1561792  PMID: 16799125
17.  Feature engineering combined with machine learning and rule-based methods for structured information extraction from narrative clinical discharge summaries 
Objective
A system that translates narrative text in the medical domain into structured representation is in great demand. The system performs three sub-tasks: concept extraction, assertion classification, and relation identification.
Design
The overall system consists of five steps: (1) pre-processing sentences, (2) marking noun phrases (NPs) and adjective phrases (APs), (3) extracting concepts, using a dosage-unit dictionary to switch dynamically between two models based on Conditional Random Fields (CRF), (4) classifying assertions based on voting of five classifiers, and (5) identifying relations using normalized sentences with a set of effective discriminating features.
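Step (4) combines five classifiers by voting. The abstract does not say how votes are weighted or how ties are broken, so the sketch below assumes a plain unweighted majority with ties resolved in classifier order; the labels are illustrative.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label; ties go to the earliest classifier's label."""
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


# Hypothetical outputs of five assertion classifiers for one concept.
votes = ["present", "possible", "present", "absent", "present"]
winner = majority_vote(votes)
tie_winner = majority_vote(["a", "b", "b", "a", "c"])
```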
Measurements
Macro-averaged and micro-averaged precision, recall and F-measure were used to evaluate results.
Results
The performance is competitive with the state-of-the-art systems with micro-averaged F-measure of 0.8489 for concept extraction, 0.9392 for assertion classification and 0.7326 for relation identification.
Conclusions
The system exploits an array of common features and achieves state-of-the-art performance. Prudent feature engineering sets the foundation of our systems. In concept extraction, we demonstrated that switching models, one of which is especially designed for telegraphic sentences, significantly improved extraction of the treatment concept. In assertion classification, a set of features derived from a rule-based classifier proved effective for classes such as conditional and possible, which would suffer from data scarcity in conventional machine-learning methods. In relation identification, we use a two-stage architecture, the second stage of which applies pairwise classifiers to possible candidate classes. This architecture significantly improves performance.
doi:10.1136/amiajnl-2011-000776
PMCID: PMC3422834  PMID: 22586067
text processing; natural language processing; medical records
18.  Contemporary Modeling of Gene-by-Environment Effects In Randomized Multivariate Longitudinal Studies 
There is a great deal of interest in the analysis of genotype by environment interactions (GxE). There are some limitations in the typical models for the analysis of GxE, including well-known statistical problems in identifying interactions and unobserved heterogeneity of persons across groups. The impact of a treatment may depend on the level of an unobserved variable, and this variation may dampen the estimated impact of treatment. A case has been made that genetic variation may sometimes account for unobserved, and hence unaccounted for, heterogeneity. The statistical power associated with the GxE design has been studied in many different ways, and most results show that the small effects expected require relatively large or non-representative samples (i.e., extreme groups). In this report, we describe some alternative approaches, such as randomized designs with multiple measures, multiple groups, multiple occasions, and analyses to identify latent (unobserved) classes of people. These are illustrated with data from the HRS/ADAMs study, examining the relations among episodic memory (based on word recall), APOE4 genotype, and educational attainment (as a proxy for an environmental exposure). Randomized clinical trials (RCT) or randomized field trials (RFT) have multiple strengths in the estimation of causal influences, and we discuss how measured genotypes can be incorporated into these designs. Use of these contemporary modeling techniques often requires different kinds of data be collected and encourages the formation of parsimonious models with fewer overall parameters, allowing specific GxE hypotheses to be investigated with a reasonable statistical foundation.
doi:10.1177/1745691610383510
PMCID: PMC3004154  PMID: 22472970
19.  Meeting U.S. Healthy People 2010 Levels of Physical Activity: Agreement of 2 Measures across 2 years 
Annals of epidemiology  2010;20(7):511-523.
Background
Measuring the way people vary across time in meeting recommended levels of physical activity should be a fundamental component of public health surveillance. However, we were unaware of prospective cohort studies that had examined this in a population base using convergent measures.
Purpose
We examined agreement between two validated measures used to estimate periodic change in the rate of meeting U.S. Healthy People 2010 guidelines for participation in moderate or vigorous physical activity.
Methods
A cohort (N=497) from a random, multi-ethnic sample of adults living in Hawaii was assessed every 6 months for 2 years starting in spring 2004. Latent transition analysis classified people as meeting or not meeting the guidelines. Intra-class kappa statistics and multinomial logistic regression analysis were used to evaluate agreement.
Results
Agreement for classifying stable classes of people who met or did not meet the guideline each time was substantial for vigorous activity (kappa ∼ .65 - .70) but fair-to-moderate for moderate activity (kappa ∼ .38 - .48). Agreement was poorer for classifying people who transitioned between meeting and not meeting the vigorous guideline (kappa ∼ .45) or the moderate guideline (kappa ∼ .21 - .29).
Conclusion
Rates of meeting the guidelines varied across time and were estimated differently by the two measures, especially for moderate activity. This illustrates an understudied problem for public health promotion. Accurate classification of change within people is necessary for determining exposure in outcome studies, personal determinants of sufficient activity, and for evaluating whether interventions are successful in sustaining increases in rates of meeting physical activity guidelines.
doi:10.1016/j.annepidem.2010.04.004
PMCID: PMC2895401  PMID: 20538194
Asian American; inter-rater agreement; latent transition analysis; Native Hawaiian/Pacific Islander; public health policy
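The kappa values reported in the abstract above measure agreement between two measures beyond what chance alone would produce. The abstract names intra-class kappa; as a related illustration, the following is a minimal sketch of Cohen's kappa, which may differ from the exact statistic the authors computed. The function name and the example labels are hypothetical and are not taken from the study.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(labels_a)
    # Proportion of cases on which the two measures agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each measure's marginal frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both measures used one identical category throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical classifications of ten people by two measures (not study data).
measure_1 = ["met", "met", "not", "not", "met", "not", "met", "not", "met", "met"]
measure_2 = ["met", "not", "not", "not", "met", "not", "met", "met", "met", "met"]
kappa = cohen_kappa(measure_1, measure_2)
print(round(kappa, 3))  # 0.583
```

In this made-up example the two measures agree on 8 of 10 people, but chance agreement is 0.52, so kappa is well below the raw agreement rate of 0.80.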
20.  Semantic Classification of Diseases in Discharge Summaries Using a Context-aware Rule-based Classifier 
Objective
Automated, disease-specific classification of textual clinical discharge summaries is of great importance in the human life sciences, as it helps physicians conduct medical studies by providing statistically relevant data for analysis. This can be further facilitated if, when discharge summaries are labeled, semantic labels are also extracted from the text, such as whether a given disease is present, absent, or questionable in a patient, or is unmentioned in the document. The authors present a classification technique that successfully solves the semantic classification task.
Design
The authors introduce a context-aware rule-based semantic classification technique for use on clinical discharge summaries. The classification is performed in successive steps. First, some misleading parts are removed from the text; next, the text is partitioned into positive, negative, and uncertain context segments; finally, a sequence of binary classifiers is applied to assign the appropriate semantic labels.
Measurement
For evaluation, the authors used the documents of the i2b2 Obesity Challenge and adopted its evaluation measures: F1-macro and F1-micro.
Results
On the two subtasks of the Obesity Challenge (textual and intuitive classification), the system performed very well, achieving F1-macro = 0.80 for the textual task and F1-macro = 0.67 for the intuitive task, and placing second in the textual and first in the intuitive subtask of the challenge.
Conclusions
The authors show in the paper that a simple rule-based classifier can tackle the semantic classification task more successfully than machine learning techniques, if the training data are limited and some semantic labels are very sparse.
doi:10.1197/jamia.M3087
PMCID: PMC2705263  PMID: 19390101
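The two evaluation measures named in the abstract above differ in how they weight sparse labels: F1-macro averages the per-class F1 scores, so a rare class counts as much as a common one, while F1-micro pools all decisions first. The following is a minimal sketch of the standard definitions; the i2b2 challenge's exact scoring rules may differ, and the function name and example labels are hypothetical.

```python
def f1_macro_micro(gold, pred):
    """Return (F1-macro, F1-micro) for single-label multi-class predictions."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    labels = sorted(set(gold) | set(pred))
    per_class_f1 = []
    tp_total = fp_total = fn_total = 0
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        per_class_f1.append(2 * tp / denom if denom else 0.0)
        tp_total += tp
        fp_total += fp
        fn_total += fn
    macro = sum(per_class_f1) / len(labels)  # unweighted mean over classes
    micro = 2 * tp_total / (2 * tp_total + fp_total + fn_total)  # pooled counts
    return macro, micro


# Hypothetical labels for ten documents (not challenge data).
gold = ["present"] * 6 + ["absent"] * 3 + ["questionable"]
pred = ["present"] * 6 + ["absent"] * 2 + ["present"] + ["absent"]
macro, micro = f1_macro_micro(gold, pred)
print(round(macro, 3), round(micro, 3))  # 0.53 0.8
```

In this made-up example the single missed "questionable" document pulls F1-macro far below F1-micro, which is the kind of label sparsity the abstract's conclusion refers to.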
21.  Accuracy of reading liquid based cytology slides using the ThinPrep Imager compared with conventional cytology: prospective study 
BMJ : British Medical Journal  2007;335(7609):31.
Objective To compare the accuracy of liquid based cytology using the computerised ThinPrep Imager with that of manually read conventional cytology.
Design Prospective study.
Setting Pathology laboratory in Sydney, Australia.
Participants 55 164 split sample pairs (liquid based sample collected after conventional sample from one collection) from consecutive samples of women choosing both types of cytology and whose specimens were examined between August 2004 and June 2005.
Main outcome measures Primary outcome was accuracy of slides for detecting squamous lesions. Secondary outcomes were rate of unsatisfactory slides, distribution of squamous cytological classifications, and accuracy of detecting glandular lesions.
Results Fewer unsatisfactory slides were found for imager read cytology than for conventional cytology (1.8% v 3.1%; P<0.001). More slides were classified as abnormal by imager read cytology (7.4% v 6.0% overall and 2.8% v 2.2% for cervical intraepithelial neoplasia of grade 1 or higher). Among 550 patients in whom imager read cytology was cervical intraepithelial neoplasia grade 1 or higher and conventional cytology was less severe than grade 1, 133 of 380 biopsy samples taken were high grade histology. Among 294 patients in whom imager read cytology was less severe than cervical intraepithelial neoplasia grade 1 and conventional cytology was grade 1 or higher, 62 of 210 biopsy samples taken were high grade histology. Imager read cytology therefore detected 71 more cases of high grade histology than did conventional cytology, resulting from 170 more biopsies. Similar results were found when one pathologist reread the slides, masked to cytology results.
Conclusion The ThinPrep Imager detects 1.29 more cases of histological high grade squamous disease per 1000 women screened than conventional cytology, with cervical intraepithelial neoplasia grade 1 as the threshold for referral to colposcopy. More imager read slides than conventional slides were satisfactory for examination and more contained low grade cytological abnormalities.
doi:10.1136/bmj.39219.645475.55
PMCID: PMC1910624  PMID: 17604301
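The headline figure of 1.29 additional cases per 1000 women can be reproduced from the counts given in the Results of the abstract above. The following is a minimal arithmetic sketch; the variable names are illustrative, and the denominator is assumed to be the 55 164 split sample pairs reported under Participants.

```python
# Counts taken from the abstract above.
split_sample_pairs = 55_164
high_grade_when_imager_higher = 133        # of 380 biopsies
high_grade_when_conventional_higher = 62   # of 210 biopsies
biopsies_when_imager_higher = 380
biopsies_when_conventional_higher = 210

# Net gain in detected high grade histology and the biopsies it required.
extra_cases = high_grade_when_imager_higher - high_grade_when_conventional_higher
extra_biopsies = biopsies_when_imager_higher - biopsies_when_conventional_higher

# Assumption: rate expressed per 1000 split sample pairs screened.
extra_cases_per_1000 = 1000 * extra_cases / split_sample_pairs

print(extra_cases, extra_biopsies, round(extra_cases_per_1000, 2))  # 71 170 1.29
```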
22.  Social and geographic inequalities in premature adult mortality in Japan: a multilevel observational study from 1970 to 2005 
BMJ Open  2012;2(2):e000425.
Objectives
To examine trends in social and geographic inequalities in all-cause premature adult mortality in Japan.
Design
Observational study of the vital statistics and the census data.
Setting
Japan.
Participants
Entire population aged 25 years or older and less than 65 years in 1970, 1975, 1980, 1985, 1990, 1995, 2000 and 2005. The total number of decedents was 984 022 and 532 223 in men and women, respectively.
Main outcome measures
For each sex, ORs and 95% CIs for mortality were estimated by using multilevel logistic regression models with ‘cells’ (cross-tabulated by age and occupation) at level 1, 8 years at level 2 and 47 prefectures at level 3. The prefecture-level variance was used as an estimate of geographic inequalities of mortality.
Results
Adjusting for age and time-trends, compared with production process and related workers, ORs ranged from 0.97 (95% CI 0.96 to 0.98) among administrative and managerial workers to 2.22 (95% CI 2.19 to 2.24) among service workers in men. By contrast, in women, the lowest odds for mortality were observed among production process and related workers (reference), while the highest OR was 12.22 (95% CI 11.40 to 13.10) among security workers. The degree of occupational inequality increased in both sexes. Among men, higher occupational groups did not experience reductions in mortality throughout the period and were overtaken by lower occupational groups in the early 1990s. Conditional on individual age and occupation, overall geographic inequalities of mortality were relatively small in both sexes; the ORs ranged from 0.87 (Okinawa) to 1.13 (Aomori) for men and from 0.84 (Kanagawa) to 1.11 (Kagoshima) for women, even though there is a suggestion of increasing inequalities across prefectures since 1995 in both sexes.
Conclusions
The present findings suggest that both social and geographic inequalities in all-cause mortality have increased in Japan during the last 3 decades.
Article summary
Article focus
While Japan enjoys the highest average life expectancy in the world, less has been documented on the trends and patterns of health inequalities within the nation.
We examined trends in social and geographic inequalities in all-cause premature adult mortality from 1970 through 2005.
Key messages
This is the first study that simultaneously examines time-trends in premature mortality by occupational class as well as geographic locality, and the results of our study indicate that health disparities have widened during the decades following the collapse of the asset bubble in the early 1990s.
Given the multiple challenges that threaten to further dampen economic activity of the nation, it is imperative to continue to monitor future trends in health inequalities in order to avert the potential impacts on Japan's health security.
Strengths and limitations of this study
The data are census based and cover the whole of Japan from 1970 to 2005.
This study uses multilevel methods to properly adjust for micro- and macro-level bias simultaneously.
We lacked information on whether the individuals were in standard jobs or precarious jobs and a possibility of measurement error in occupation at the time of death cannot be ruled out.
doi:10.1136/bmjopen-2011-000425
PMCID: PMC3293144  PMID: 22389360
23.  Using a factor mixture modeling approach in alcohol dependence in a general population sample 
Drug and alcohol dependence  2008;98(1-2):105-114.
Alcohol dependence (AD) is a complex and heterogeneous disorder. The identification of more homogeneous subgroups of individuals with drinking problems and the refinement of the diagnostic criteria are inter-related research goals. They have the potential to improve our knowledge of etiology and treatment effects, and to assist in the identification of risk factors or specific genetic factors. Mixture modeling has advantages over traditional modeling that focuses on either the dimensional or categorical latent structure. Mixture modeling combines both latent class and latent trait models, but has not been widely applied in substance use research. The goal of the present study is to assess whether the AD criteria in the population could be better characterized by a continuous dimension, a few discrete subgroups, or a combination of the two. More than seven thousand participants were recruited from the population-based Virginia Twin Registry, and were interviewed to obtain DSM-IV (Diagnostic and Statistical Manual of Mental Disorders, fourth edition) symptoms and diagnosis of AD. We applied factor analysis, latent class analysis, and factor mixture models for symptom items based on the DSM-IV criteria.
Our results showed that a mixture model with one factor and three classes for both genders fit well. The three classes were a non-problem drinking group and severe and moderate drinking problem groups. By contrast, models constrained to conform to DSM-IV diagnostic criteria were rejected by model fitting indices, providing empirical evidence for heterogeneity in the AD diagnosis. Classification analysis showed different characteristics across subgroups, including alcohol-caused behavioral problems, comorbid disorders, age at onset for alcohol-related milestones, and personality. Clinically, the expanded classification of AD may aid in identifying suitable treatments, interventions and additional sources of comorbidity based on these more homogeneous subgroups of alcohol use problems.
doi:10.1016/j.drugalcdep.2008.04.018
PMCID: PMC2572186  PMID: 18586414
latent trait; latent class; mixture model; diagnosis
24.  Accuracy of liquid based versus conventional cytology: overall results of new technologies for cervical cancer screening: randomised controlled trial 
BMJ : British Medical Journal  2007;335(7609):28.
Objective To compare the accuracy of conventional cytology with liquid based cytology for primary screening of cervical cancer.
Design Randomised controlled trial.
Setting Nine screening programmes in Italy.
Participants Women aged 25-60 attending for a new screening round: 22 466 were assigned to the conventional arm and 22 708 were assigned to the experimental arm.
Interventions Conventional cytology compared with liquid based cytology and testing for human papillomavirus.
Main outcome measure Relative sensitivity for cervical intraepithelial neoplasia of grade 2 or more at blindly reviewed histology, with atypical cells of undetermined significance or more severe cytology considered a positive result.
Results In an intention to screen analysis liquid based cytology showed no significant increase in sensitivity for cervical intraepithelial neoplasia of grade 2 or more (relative sensitivity 1.17, 95% confidence interval 0.87 to 1.56) whereas the positive predictive value was reduced (relative positive predictive value v conventional cytology 0.58, 0.44 to 0.77). Liquid based cytology detected more lesions of grade 1 or more (relative sensitivity 1.68, 1.40 to 2.02), with a larger increase among women aged 25-34 (P for heterogeneity 0.0006), but did not detect more lesions of grade 3 or more (relative sensitivity 0.84, 0.56 to 1.25). Results were similar when only low grade intraepithelial lesions or more severe cytology were considered a positive result. No evidence was found of heterogeneity between centres or of improvement with increasing time from start of the study. The relative frequency of women with at least one unsatisfactory result was lower with liquid based cytology (0.62, 0.56 to 0.69).
Conclusion Liquid based cytology showed no statistically significant difference in sensitivity to conventional cytology for detection of cervical intraepithelial neoplasia of grade 2 or more. More positive results were found, however, leading to a lower positive predictive value. A large reduction in unsatisfactory smears was evident.
Trial registration Current Controlled Trials ISRCTN81678807.
doi:10.1136/bmj.39196.740995.BE
PMCID: PMC1910655  PMID: 17517761
25.  Comparison of Two Output-Coding Strategies for Multi-Class Tumor Classification Using Gene Expression Data and Latent Variable Model as Binary Classifier 
Cancer Informatics  2010;9:39-48.
Multi-class cancer classification based on microarray data is described. A generalized output-coding scheme based on One Versus One (OVO) combined with a Latent Variable Model (LVM) is used. Results from the proposed OVO output-coding strategy are compared with those obtained from the generalized One Versus All (OVA) method, and the efficiency of each for multi-class tumor classification is studied. This comparative study was done using two microarray gene expression datasets: the Global Cancer Map (GCM) dataset and a brain cancer (BC) dataset. Primary feature selection was based on fold change and penalized t-statistics. Evaluation was conducted with varying feature numbers. The OVO coding strategy worked quite well with the BC data, while OVO and OVA results appeared similar for the GCM data. The selection of output coding methods for combining binary classifiers for multi-class tumor classification depends on the number of tumor types considered, the discrepancies between the tumor samples used for training, and the heterogeneity of expression within the cancer subtypes used as training data.
PMCID: PMC2865770  PMID: 20458360
binary classifier; multi-class tumor classification; supervised classification; latent variable model; gibbs sampling; gene expression
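The two output-coding strategies compared in the abstract above differ in how a multi-class problem is split into binary problems and how the binary outputs are combined. The following is a minimal sketch of the generic OVO and OVA schemes only: the binary classifier itself (a latent variable model in the paper) is abstracted away, and the class names, pairwise winners, and scores are hypothetical.

```python
from collections import Counter
from itertools import combinations


def ovo_tasks(classes):
    """One Versus One: one binary problem per unordered pair of classes."""
    return list(combinations(classes, 2))


def ova_tasks(classes):
    """One Versus All: one binary problem per class against all the others."""
    return [(c, tuple(o for o in classes if o != c)) for c in classes]


def decode_ovo(pairwise_winners):
    """Majority vote over the winners of the pairwise classifiers."""
    votes = Counter(pairwise_winners.values())
    return votes.most_common(1)[0][0]


def decode_ova(scores):
    """Pick the class whose one-vs-all classifier gave the highest score."""
    return max(scores, key=scores.get)


classes = ["A", "B", "C", "D"]  # hypothetical tumor types
print(len(ovo_tasks(classes)), len(ova_tasks(classes)))  # 6 4

# Hypothetical outputs of already-trained binary classifiers for one sample.
winners = {("A", "B"): "B", ("A", "C"): "C", ("A", "D"): "A",
           ("B", "C"): "B", ("B", "D"): "B", ("C", "D"): "C"}
scores = {"A": 0.1, "B": 0.7, "C": 0.4, "D": 0.2}
print(decode_ovo(winners), decode_ova(scores))  # B B
```

With k classes, OVO needs k(k-1)/2 binary classifiers and OVA needs k, which is one reason the number of tumor types matters when choosing between them.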
