Search tips
Search criteria

Results 1-6 (6)

Clipboard (0)
more »
Year of Publication
Document Types
1.  Determining the subcellular location of new proteins from microscope images using local features 
Bioinformatics  2013;29(18):2343-2349.
Motivation: Evaluation of previous systems for automated determination of subcellular location from microscope images has been done using datasets in which each location class consisted of multiple images of the same representative protein. Here, we frame a more challenging and useful problem where previously unseen proteins are to be classified.
Results: Using CD-tagging, we generated two new image datasets for evaluation of this problem, which contain several different proteins for each location class. Evaluation of previous methods on these new datasets showed that it is much harder to train a classifier that generalizes across different proteins than one that simply recognizes a protein it was trained on.
We therefore developed and evaluated additional approaches, incorporating novel modifications of local features techniques. These extended the notion of local features to exploit both the protein image and any reference markers that were imaged in parallel. With these, we obtained a large accuracy improvement in our new datasets over existing methods. Additionally, these features help achieve classification improvements for other previously studied datasets.
Availability: The datasets are available for download at The software was written in Python and C++ and is available under an open-source license at The code is split into a library, which can be easily reused for other data and a small driver script for reproducing all results presented here. A step-by-step tutorial on applying the methods to new datasets is also available at that address.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3753569  PMID: 23836142
2.  Protein subcellular location pattern classification in cellular images using latent discriminative models 
Bioinformatics  2012;28(12):i32-i39.
Motivation: Knowledge of the subcellular location of a protein is crucial for understanding its functions. The subcellular pattern of a protein is typically represented as the set of cellular components in which it is located, and an important task is to determine this set from microscope images. In this article, we address this classification problem using confocal immunofluorescence images from the Human Protein Atlas (HPA) project. The HPA contains images of cells stained for many proteins; each is also stained for three reference components, but there are many other components that are invisible. Given one such cell, the task is to classify the pattern type of the stained protein. We first randomly select local image regions within the cells, and then extract various carefully designed features from these regions. This region-based approach enables us to explicitly study the relationship between proteins and different cell components, as well as the interactions between these components. To achieve these two goals, we propose two discriminative models that extend logistic regression with structured latent variables. The first model allows the same protein pattern class to be expressed differently according to the underlying components in different regions. The second model further captures the spatial dependencies between the components within the same cell so that we can better infer these components. To learn these models, we propose a fast approximate algorithm for inference, and then use gradient-based methods to maximize the data likelihood.
Results: In the experiments, we show that the proposed models help improve the classification accuracies on synthetic data and real cellular images. The best overall accuracy we report in this article for classifying 942 proteins into 13 classes of patterns is about 84.6%, which to our knowledge is the best so far. In addition, the dependencies learned are consistent with prior knowledge of cell organization.
PMCID: PMC3371862  PMID: 22689776
3.  Automated analysis of protein subcellular location in time series images 
Bioinformatics  2010;26(13):1630-1636.
Motivation: Image analysis, machine learning and statistical modeling have become well established for the automatic recognition and comparison of the subcellular locations of proteins in microscope images. By using a comprehensive set of features describing static images, major subcellular patterns can be distinguished with near perfect accuracy. We now extend this work to time series images, which contain both spatial and temporal information. The goal is to use temporal features to improve recognition of protein patterns that are not fully distinguishable by their static features alone.
Results: We have adopted and designed five sets of features for capturing temporal behavior in 2D time series images, based on object tracking, temporal texture, normal flow, Fourier transforms and autoregression. Classification accuracy on an image collection for 12 fluorescently tagged proteins was increased when temporal features were used in addition to static features. Temporal texture, normal flow and Fourier transform features were most effective at increasing classification accuracy. We therefore extended these three feature sets to 3D time series images, but observed no significant improvement over results for 2D images. The methods for 2D and 3D temporal pattern analysis do not require segmentation of images into single cell regions, and are suitable for automated high-throughput microscopy applications.
Availability: Images, source code and results will be available upon publication at
PMCID: PMC2887049  PMID: 20484328
4.  Improved Recognition of Figures containing Fluorescence Microscope Images in Online Journal Articles using Graphical Models 
Bioinformatics (Oxford, England)  2007;24(4):569-576.
There is extensive interest in automating the collection, organization, and analysis of biological data. Data in the form of images in online literature present special challenges for such efforts. The first steps in understanding the contents of a figure are decomposing it into panels and determining the type of each panel. In biological literature, panel types include many kinds of images collected by different techniques, such as photographs of gels or images from microscopes. We have previously described the SLIF system ( that identifies panels containing fluorescence microscope images among figures in online journal articles as a prelude to further analysis of the subcellular patterns in such images. This system contains a pretrained classifier that uses image features to assign a type (class) to each separate panel. However, the types of panels in a figure are often correlated, so that we can consider the class of a panel to be dependent not only on its own features but also on the types of the other panels in a figure.
In this paper, we introduce the use of a type of probabilistic graphical model, a factor graph, to represent the structured information about the images in a figure, and permit more robust and accurate inference about their types. We obtain significant improvement over results for considering panels separately.
The code and data used for the experiments described here are available from Contact:
PMCID: PMC2901545  PMID: 18033795
5.  Quantifying the distribution of probes between subcellular locations using unsupervised pattern unmixing 
Bioinformatics  2010;26(12):i7-i12.
Motivation: Proteins exhibit complex subcellular distributions, which may include localizing in more than one organelle and varying in location depending on the cell physiology. Estimating the amount of protein distributed in each subcellular location is essential for quantitative understanding and modeling of protein dynamics and how they affect cell behaviors. We have previously described automated methods using fluorescent microscope images to determine the fractions of protein fluorescence in various subcellular locations when the basic locations in which a protein can be present are known. As this set of basic locations may be unknown (especially for studies on a proteome-wide scale), we here describe unsupervised methods to identify the fundamental patterns from images of mixed patterns and estimate the fractional composition of them.
Methods: We developed two approaches to the problem, both based on identifying types of objects present in images and representing patterns by frequencies of those object types. One is a basis pursuit method (which is based on a linear mixture model), and the other is based on latent Dirichlet allocation (LDA). For testing both approaches, we used images previously acquired for testing supervised unmixing methods. These images were of cells labeled with various combinations of two organelle-specific probes that had the same fluorescent properties to simulate mixed patterns of subcellular location.
Results: We achieved 0.80 and 0.91 correlation between estimated and underlying fractions of the two probes (fundamental patterns) with basis pursuit and LDA approaches, respectively, indicating that our methods can unmix the complex subcellular distribution with reasonably high accuracy.
PMCID: PMC2881404  PMID: 20529939
6.  High-recall protein entity recognition using a dictionary 
Bioinformatics (Oxford, England)  2005;21(Suppl 1):i266-i273.
Protein name extraction is an important step in mining biological literature. We describe two new methods for this task: semiCRFs and dictionary HMMs. SemiCRFs are a recently-proposed extension to conditional random fields that enables more effective use of dictionary information as features. Dictionary HMMs are a technique in which a dictionary is converted to a large HMM that recognizes phrases from the dictionary, as well as variations of these phrases. Standard training methods for HMMs can be used to learn which variants should be recognized. We compared the performance of our new approaches to that of Maximum Entropy (Max-Ent) and normal CRFs on three datasets, and improvement was obtained for all four methods over the best published results for two of the datasets. CRFs and semiCRFs achieved the highest overall performance according to the widely-used F-measure, while the dictionary HMMs performed the best at finding entities that actually appear in the dictionary—the measure of most interest in our intended application.
PMCID: PMC2857312  PMID: 15961466
Protein Name Extraction; Dictionary HMMs; CRFs; SemiCRFs

Results 1-6 (6)