Microscopy images are rich sources of information about cell structure and function for systems biology. We have presented a framework to classify proteome-scale collections of proteins containing complex subcellular location patterns, and our classifier provides performance similar to human annotation on single-class proteins.
The only prior work on the automated classification of proteins using HPA confocal immunofluorescence images was described by Newberg et al. In this paper, we obtain similar classification accuracies on single-class proteins but analyze many more proteins and patterns. The cytoplasm pattern, which has the second largest number of proteins, was added and introduces some confusion with other patterns because of non-specific staining over the cell. The nucleus pattern was split into a nucleus pattern and a nucleus without nucleoli pattern to provide more detailed annotations, even though the two are highly blended in the staining and are difficult to distinguish visually in many images. The small cytoskeleton class was also split further into three patterns, actin, intermediate filaments and microtubules, which reduces the number of training images available for each. Nonetheless, good classification accuracies were maintained, which represents a significant advance over our prior work. However, the accuracies are not yet high enough to replace human annotators. In the future, we plan to implement new features specific to the centrosome pattern, and hope to add features that better discriminate the cytoskeleton and plasma membrane patterns from the cytoplasm pattern.
One of the main novelties we describe in this paper is the introduction of approaches to identify possibly mis-annotated proteins, derived from SVM classification and hierarchical clustering, and the demonstration that they identify proteins needing reannotation at a rate higher than random selection. Our results show that selecting proteins flagged by both schemes achieves a higher yield of reannotated proteins than either scheme alone. We plan to continue cycles of reannotation, and to incorporate the automated system into the annotation pipeline. Note that in this paper we only provide results for the A-431 cell line, but the whole framework introduced here can be applied to other cell lines, such as U-2OS and U-251MG. Indeed, some preliminary results have already been obtained (data not shown; included in the Reproducible Research Archive as described in Materials and Methods).
). We hope thereby to maximize the accuracy of reported annotations in the Human Protein Atlas. We anticipate that a similar approach may be applied to other proteome-scale image collections.
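The two flagging schemes can be illustrated with a minimal sketch. Everything here, the synthetic features, the number of clusters, and the idea of taking the intersection of the two flag sets, is a hypothetical simplification for illustration, not the HPA pipeline itself:

```python
# Hedged sketch: flag candidate mis-annotations with two independent
# schemes and review those flagged by both. Data and thresholds are
# synthetic, chosen only to make the example self-contained.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_predict
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two well-separated "pattern" classes of 50 samples each, 5 features;
# the first three samples are deliberately mislabeled.
X = np.vstack([rng.normal(0, 1, (50, 5)), rng.normal(3, 1, (50, 5))])
y = np.array([0] * 50 + [1] * 50)
y[:3] = 1  # simulated annotation errors

# Scheme 1: cross-validated SVM predictions that disagree with the label.
pred = cross_val_predict(SVC(), X, y, cv=5)
svm_flags = set(np.where(pred != y)[0])

# Scheme 2: hierarchical clustering; flag samples whose cluster's
# majority label differs from their own annotation.
clusters = fcluster(linkage(X, method="ward"), t=2, criterion="maxclust")
cluster_flags = set()
for c in np.unique(clusters):
    idx = np.where(clusters == c)[0]
    majority = np.bincount(y[idx]).argmax()
    cluster_flags.update(int(i) for i in idx if y[i] != majority)

# Candidates supported by both schemes are prioritized for review.
candidates = sorted(svm_flags & cluster_flags)
print(candidates)
```

On this toy data both schemes recover the three planted errors; in practice the intersection simply ranks proteins for human review rather than overriding the annotation.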
The dataset used in this paper contains 2D, static confocal images of fixed cells from the HPA. In the future, the temporal dynamics of protein subcellular location patterns, and their evolution over the course of stem cell differentiation, can be explored with our framework as suitable datasets become available.
Another novel aspect of this work is the results on full or partial recognition of mixed pattern proteins. Our results highlight the difficulty of handling these patterns. The main problem is that the features are affected by the degree of mixture. This is unlike the case for tasks like document classification, in which the addition of a second topic associated with new words does not alter the detection of words associated with the first topic. It is also unlike the case in many natural scene images in which adding a dog to an image of a cat does not change the local features associated with the cat. In these cases, a number of multiclass learning strategies have been successfully used. For protein patterns consisting of vesicular objects, we have used similar methods to show that the frequency of object types can be used to estimate mixing between patterns (using both supervised and unsupervised approaches). Unfortunately, this approach does not generalize to mixtures involving non-vesicular proteins, and preliminary work indicates that local features such as SIFT also do not perform well in that case.
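The object-type unmixing idea for vesicular patterns can be sketched as a small non-negative least-squares problem. The "signature" frequency vectors below are invented for illustration; the real method derives object-type frequencies from segmented vesicular objects:

```python
# Hedged sketch: recover mixing fractions of two pure vesicular patterns
# from an observed object-type frequency vector via non-negative least
# squares. All numbers here are hypothetical.
import numpy as np
from scipy.optimize import nnls

# Columns: object-type frequency signatures of two pure patterns;
# rows: object types.
pure = np.array([
    [0.7, 0.1],
    [0.2, 0.2],
    [0.1, 0.7],
])

# Observed frequencies for a protein mixing 60% of pattern A with
# 40% of pattern B (constructed, so the answer is known).
observed = 0.6 * pure[:, 0] + 0.4 * pure[:, 1]

coeffs, residual = nnls(pure, observed)
fractions = coeffs / coeffs.sum()
print(fractions)  # close to [0.6, 0.4]
```

The non-negativity constraint matters because mixing fractions cannot be negative; this is precisely where the approach breaks down for non-vesicular patterns, whose features do not combine linearly with the degree of mixture.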