Analyzing StarryNite errors
Initially, we investigated the types of errors produced by StarryNite, with the goal of focusing our analyses on the most common errors. To this end, we grouped StarryNite errors into five categories: (1) false positives, (2) false negatives, (3) positioning errors, (4) incorrect diameter estimation and (5) tracing errors. A false positive occurs when StarryNite detects a nucleus that does not in fact exist. Conversely, false negatives are nuclei that StarryNite fails to identify. Positioning errors occur when StarryNite misplaces the coordinates of the centroid of a nucleus. Incorrect diameter estimation occurs when the inferred diameter of a nucleus differs from the true value. Tracing errors include cases where a nucleus at a particular time point is not matched to the right nucleus (or nuclei) at the next time point. For each nucleus, there are three possible matches: one-to-one, one-to-two, or one-to-none, corresponding to movement, cell division (i.e., a division call) and cell death [5]. A moving nucleus simply changes its location from one time point to the next. A dividing nucleus splits into two children nuclei at the next time point. Finally, a cell death corresponds to the case where a cell disappears. Once the embryo finishes its development, it begins to crawl away from the imaging field. Hence, in the final stages of development, some cells start to disappear from the image data while others remain visible. Note that all of these errors are ultimately defined subjectively, by visual inspection by a human expert. Hence, there is no hard-and-fast rule for, say, how far off the centroid must be in order to qualify as a positioning error.
We collected statistics for each error type on a single benchmark series (081505), which contains image data up to the 195-cell stage. This series contains a total of 23,987 nuclei annotated by StarryNite and 24,355 annotations in the manually edited version. The results, summarized in Figure , suggest that false negatives are the most common error type, followed by tracing errors, positioning errors, incorrect diameter estimations and false positives. Although false negatives are the most commonly observed errors, we chose to concentrate on the second most common error type, tracing errors. We made this choice for two reasons. First, tracing errors are directly amenable to correction by a simple classifier, which can be applied systematically to all division calls made by StarryNite. In contrast, a classifier that attempts to correct false negative annotations would have to be applied to all empty regions of all image stacks. Second, tracing errors have a more complex morphology than simple false negative annotations, allowing us to use a rich set of features, as described below.
Histogram of various types of errors in one image series. (a) Major error types. (b) Subtypes of tracing errors.
Tracing errors can be further subdivided into eight categories: (1) division annotated as movement, (2) movement annotated as division, (3) division annotated as cell death, (4) movement annotated as cell death, (5) cell death annotated as division, (6) cell death annotated as movement, (7) indexing errors of moving nuclei and (8) indexing errors of dividing nuclei. An indexing error of a moving nucleus occurs when a moving nucleus at a particular time point is linked to the wrong nucleus at the next time point. Similarly, an indexing error of dividing nuclei occurs when the indices of the newborn children are incorrectly assigned. Figure shows that "movement annotated as division" is the most frequent type of tracing error: 42.3% of the tracing errors in series 081505 are of this type. Indeed, this series contains a total of 427 division calls, and 102 (24%) of those were in fact movements. In addition to being the most frequent tracing error type, movements detected as divisions are biologically important as well. Figure illustrates one such error. In the figure, a moving nucleus at a particular time point t is enclosed in a white square box. For simplicity, the figure shows only a single image slice, corresponding to a fixed z-value. In addition, Figures and contain 3D image representations of all the nuclei present at t and t + 1, respectively, where t = 35 in this example. According to the human annotator, M1 and M2 move from t = 35 to t = 36, and P1 at t = 35 divides into C1 and C2 at t = 36. However, StarryNite annotates M1 at t = 35 as the parent nucleus and links it to M2 and C1 at t = 36, which are incorrectly annotated as the children of M1. Thus, in this example a moving nucleus (M1) is annotated as dividing, causing a deviation from the true topology of the lineage tree. Based on these analyses, we decided to focus our initial efforts on the automatic recognition of movements detected as divisions.
Figure 3 Moving nucleus annotated as dividing. (a) An image plane from the series 081505 at t = 35 and z = 23, where t is the time index and z is the plane index within the image stack. A moving nucleus is enclosed in a white square box. (b) 3D view of the ...
Often, success in machine learning depends critically on the researcher's ability to incorporate significant prior knowledge about the problem domain into the learning framework. Such prior knowledge can be represented, for example, using a formal probabilistic prior or, for kernel methods, by selecting an appropriate similarity function. However, perhaps the most straightforward way to encode prior knowledge is by designing feature extraction routines that are tailored to the task and the data. In our case, we examined the "movement annotated as division" tracing errors in several of our image series and, on that basis, designed a collection of 82 features that provide a rich view of the types of errors produced by StarryNite. The 82 features are summarized in Table and explained below.
The set of 82 features is grouped into nine categories
The time index denotes how much time has elapsed since the start of embryonic development. The relationship between developmental age and time index depends, of course, on the time resolution of the experiment. In general, StarryNite makes more errors, on a per-nucleus basis, at later time points, simply because the images at later time points contain more nuclei and are hence more crowded.
A cell needs to mature to a certain age before dividing into two nuclei. Therefore, by including age information we aim to eliminate incorrect division annotations that would correspond to divisions of very young cells or would lead to very long-lived cells. We compute the ages of the parent nucleus as well as the two children nuclei, as described in Methods.
We obtain the diameter, in pixels, of the parent and the two children nuclei directly from StarryNite's annotation. We expect the diameters of the children to be similar to one another and smaller than the diameter of the parent.
We include 15 distance features that capture the spatial relationships among the parent, the parent's neighbors and the two children.
Normalized nucleus support
During mitosis, a cell typically elongates in one direction, deviating from its usual spherical shape. To enable discriminating between dividing and non-dividing nuclei, we introduce a feature called normalized nucleus support that quantifies how spherical the nucleus is. Details are provided in Methods.
During mitotic division, the two children typically move in opposite directions. To capture these directional changes, we define a set of five angle features, as described in Methods.
Similar to the diameter features, we expect the GFP signals of the two children to be similar to each other and less than the GFP signal of the parent. To capture this information, we include six GFP features: the GFP signals of the parent at t - 1, the parent at t, and the two children at t + 1 and at t + 2.
Number of nuclei at a given time
This feature allows the learner to exploit the correlation between the number of nuclei at a given time point and the probability of error. When an image stack contains many cells, the nuclei are packed more tightly together and are more likely to experience collisions that affect the direction of moving nuclei. Accordingly, we observed that StarryNite makes more errors when the number of nuclei is high.
We included the x-y-z coordinates for the centroids of the parent as well as the two children (9 features). With this set of features, we allow the learner to identify a tendency for StarryNite to make more errors at particular locations.
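As an illustration of how such geometric features can be derived, the sketch below computes a few representative distance and angle quantities from the 3D centroids of a candidate parent and its two candidate children. The function name and the specific quantities are our own illustrative choices; the exact definitions of the 15 distance and five angle features are given in Methods.

```python
import numpy as np

def distance_and_angle_features(parent, child1, child2):
    """Illustrative distance/angle features from 3D centroids.

    parent, child1, child2: (x, y, z) coordinates of the parent at
    time t and the two candidate children at t + 1.
    """
    parent, child1, child2 = map(np.asarray, (parent, child1, child2))
    d_p_c1 = np.linalg.norm(child1 - parent)   # parent to child-1
    d_p_c2 = np.linalg.norm(child2 - parent)   # parent to child-2
    d_c1_c2 = np.linalg.norm(child1 - child2)  # separation of children

    # Angle between the two displacement vectors: near 180 degrees
    # when the children move in opposite directions, as expected
    # during mitosis.
    v1, v2 = child1 - parent, child2 - parent
    cos_theta = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    angle = np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))
    return {"d_p_c1": d_p_c1, "d_p_c2": d_p_c2,
            "d_c1_c2": d_c1_c2, "angle": angle}
```

For a true division, we would expect `d_c1_c2` to be appreciable and `angle` to be close to 180 degrees; for a movement misannotated as a division, these expectations tend to be violated.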
Preliminary feature analysis
Prior to performing any machine learning, we measured the discriminative power of each feature individually. The goal of this analysis is two-fold: to identify features that are not individually informative, and to provide a performance baseline against which to compare our machine learning results. Using a development data set consisting of 10 experimental series (see "Benchmark Datasets" for details), we ranked all of the division calls according to each of the 82 features. Each such ranking induces a receiver operating characteristic (ROC) curve, which plots the true positive rate as a function of the false positive rate as we traverse the ranked list. We use the area under the curve (AUC) as a performance metric to rank features. Figure shows the ROC curves for the top five features according to this metric, and Figure shows the AUC values of the 70 features with non-zero AUC, sorted in decreasing order. All 82 features, along with their AUC scores, are listed in Additional file 1.
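A single-feature ranking of this kind can be reproduced with a short routine like the one below, which scores each feature column by its AUC against the curated labels. The `max(auc, 1 - auc)` convention, which also credits features that rank in the reverse direction, is our own assumption rather than a detail stated in the text.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def rank_features_by_auc(X, y):
    """Rank features by single-feature AUC.

    X: (n_calls, n_features) matrix of division-call features.
    y: 1 for a correct division, 0 for a movement misannotated
       as a division.
    Returns the feature indices sorted best-first, and the AUCs.
    """
    aucs = []
    for j in range(X.shape[1]):
        auc = roc_auc_score(y, X[:, j])
        # Credit reverse-direction features as well (our convention).
        aucs.append(max(auc, 1.0 - auc))
    aucs = np.asarray(aucs)
    order = np.argsort(aucs)[::-1]  # best feature first
    return order, aucs
```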
Figure 4 Analysis of individual features. (a) The figure plots ROC curves for the features with the top five AUC scores. Feature names are defined as follows: "Dist P-C1" is the distance from parent to child-1; "Dist C1-C2 at t+1" is the distance between children ...
This ROC analysis leads to several observations. First, some of the normalized nucleus supports (the ones that are far from the centroid) are zero for all examples in the dataset we used, suggesting that all the nuclei we evaluated are smaller in size than expected. These non-informative features were eliminated from all subsequent analyses. Second, the best feature is "distance from parent to child-1" with an AUC of 0.8857. This provides a baseline against which to compare our trained classifier.
In general, the single-feature rankings are consistent with our biological expectations. For instance, correct divisions have a higher average distance between parent and child-1 than incorrect divisions, because we expect a certain amount of separation between parent and children. When a candidate child is too close to a candidate parent, StarryNite is more likely to make an incorrect division call linking those close cells. Furthermore, in mitosis we expect the children to move rapidly away from each other, resulting in a certain amount of separation between them; this separation is related to the second best feature. We expect the children to move in opposite directions from each other, which is captured by the third best feature. We expect the age of a newborn child to be larger than the time it would take a regular moving cell to divide, starting from the current time point; this information is represented in the fourth best feature. Finally, we expect a certain distance relation between a parent and its neighbors, as observed in the fifth best feature, because if they are close to each other StarryNite is more likely to be confused about choosing the right parent, labeling a moving nucleus as a dividing one.
Initial testing of an SVM classifier
We performed a 10-fold cross-validation experiment on the development data set. At each cross-validation iteration, we chose one experimental series as the test set and used the remaining series as the training data. Then we learned the optimal parameters and hyperparameters of the SVM classifier by performing internal cross-validation on the training set (see "SVM Classifier"), and we classified each division call in the test set as correct ("Dividing") or incorrect ("Moving").
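A sketch of this leave-one-series-out protocol, with an inner grid search standing in for the internal cross-validation, might look as follows. The hyperparameter grid and the 3-fold inner split are placeholders, not the values used in the paper.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def leave_one_series_out(X, y, series_ids):
    """Hold out each experimental series in turn; select RBF-SVM
    hyperparameters by an inner grid search on the remaining series.
    Returns the mean test accuracy across held-out series."""
    grid = {"svc__C": [1.0, 10.0],            # placeholder values
            "svc__gamma": ["scale", 0.1]}     # placeholder values
    accuracies = []
    for train, test in LeaveOneGroupOut().split(X, y, groups=series_ids):
        model = GridSearchCV(
            make_pipeline(StandardScaler(), SVC(kernel="rbf")),
            grid, cv=3)                       # inner cross-validation
        model.fit(X[train], y[train])
        accuracies.append(model.score(X[test], y[test]))
    return float(np.mean(accuracies))
```

Because hyperparameters are chosen only from the training series at each iteration, the reported accuracy is an honest estimate of performance on an unseen series.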
At the threshold selected by the SVM, we achieve an accuracy of 88.16%, which represents a 4.3% improvement over StarryNite (83.84%). Several additional performance metrics are detailed in Table . By definition, StarryNite has 100% sensitivity, since we only consider division calls made by StarryNite. On the other hand, our method is 8.5% better than StarryNite in terms of precision (i.e., the likelihood that a prediction is correct), although it annotates some of the true division calls made by StarryNite as errors. We should note that, for guiding manual reannotation, it is better to identify more errors and thereby speed up the editing process, even if some of the movement annotations made by the SVM are in fact divisions that were correctly captured by StarryNite. Such incorrect annotations by our method can still be corrected by the human expert, reducing the overall effort spent in the editing phase. Figure shows the ROC curve achieved by the SVM, with a point indicating the selected decision threshold. For comparison, we also include the ROC curve produced by the best-performing single feature. The AUC score of the SVM classifier is 0.9330, which is better than the AUC score of the best feature (0.8857). These results show that the SVM classifier is capable of identifying this particular class of StarryNite errors with high accuracy.
Comparison of the SVM and StarryNite on the development set.
Figure 5 ROC curves of the best feature and the SVM. Cross-validated ROC curve produced by the SVM on the development data set and the ROC curve of the best performing single feature ("distance from parent to child-1"). The SVM decision threshold is indicated ...
Having established a baseline accuracy in the previous experiment, we next explored the possibility of achieving improved performance by eliminating uninformative or redundant features from the classifier. We performed two such experiments, both of which suggest that feature selection for this particular task is not necessary.
In the first feature selection experiment, we adopt a simple filter based on the per-feature AUCs shown in Figure . Figure shows the results of a series of tests conducted with smaller and smaller feature sets. At each step, we eliminated the feature with the lowest AUC. We then performed the same cross-validation experiment described in the previous section, including internal cross-validation to select hyperparameters. For each cross-validation split, we compared the accuracy of the reduced-feature SVM with the accuracy of the baseline SVM that uses all 70 features. The figure shows that, although some reduced feature sets yield a slight improvement in accuracy--e.g., eliminating the worst 28 features gives an improvement of 0.622%--the mean is always less than one standard deviation from zero. This result suggests that this simple feature selection strategy does not significantly improve the performance of the classifier.
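The backward-filter loop can be sketched as below; `train_eval` is a hypothetical callback standing in for the full cross-validation experiment described above.

```python
import numpy as np

def auc_filter_curve(X, y, auc_scores, train_eval):
    """Backward filter: repeatedly drop the feature with the lowest
    single-feature AUC and re-run the evaluation routine
    train_eval(X_subset, y) -> accuracy (a stand-in for the SVM
    cross-validation experiment)."""
    order = np.argsort(auc_scores)          # worst feature first
    kept = list(range(X.shape[1]))
    baseline = train_eval(X, y)             # all features
    deltas = []
    for worst in order[:-1]:                # keep at least one feature
        kept.remove(worst)
        acc = train_eval(X[:, kept], y)
        deltas.append(acc - baseline)       # > 0 means pruning helped
    return deltas
```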
Figure 6 Two feature selection experiments. (a) The figure plots the mean difference in accuracy, across 10 cross-validation splits, of an SVM that uses all features compared to an SVM with some features removed. The number of features eliminated is given on the ...
In the second feature selection experiment, we considered the joint effect of groups of related features. In this analysis, we used the nine feature groups introduced in the "Feature Design" section. Rather than considering all 2^9 - 1 = 511 possible combinations of groups, we considered 18 possibilities: each of the nine feature groups alone, and all nine combinations of eight feature groups. As before, we performed a cross-validation experiment on each reduced feature set and then compared the accuracy of the reduced-feature classifier to the accuracy of the baseline SVM. The results, shown in Figure , agree with the previous experiment: in no case does the reduced-feature SVM significantly out-perform the baseline SVM.
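Enumerating these 18 group subsets is straightforward; the sketch below takes a mapping from (hypothetical) group names to their column indices and returns each group alone plus each leave-one-group-out combination.

```python
from itertools import combinations

def group_subsets(groups):
    """Enumerate the 18 feature-group subsets tested: each of the
    nine groups alone, plus every combination of eight groups
    (i.e., each group left out in turn).

    groups: dict mapping a group name to its feature column indices.
    Returns a list of tuples of group names.
    """
    names = sorted(groups)
    singles = [(n,) for n in names]
    leave_one_out = list(combinations(names, len(names) - 1))
    return singles + leave_one_out
```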
Although these two experiments do not prove that feature selection for this particular task is a bad idea, they do suggest that any gains provided by feature selection are likely to be modest. On the basis of these experiments, we therefore decided not to pursue more sophisticated feature selection experiments.
Evaluation on two validation sets
Finally, we tested the SVM classifier on independent data. Our goal was to find a set of SVM parameters that yield good generalization performance with respect to previously unseen data. In pursuit of this goal, we performed two rounds of analyses, on the two validation sets described in Tables and .
Development and validation sets.
Initially, we performed a similar cross-validation experiment as before using this new data. The results are shown in Table , in the column labeled "CV SVM." Apparently, this data set is easier for StarryNite, which achieves a 4.7% improvement in accuracy, compared to the development data set (88.52% versus 83.84%). However, the SVM still provides a significant boost in performance, giving a 5.8% improvement relative to StarryNite (94.35% versus 88.52%).
Comparison of SVM and StarryNite on the validation data set.
Unfortunately, when we use the validation data set as a test set--i.e., when we train on the development set and test the resulting SVM on the validation set--our results are not as good. The SVM, using hyperparameters selected on the development set, achieves an accuracy of only 90.9%, which is only 2.4% better than StarryNite's accuracy of 88.5%. This difference is statistically significant according to McNemar's test, with a p-value of 0.0003. On the other hand, this improvement is smaller than what we achieved via cross-validation on the development set (4.3%) or the validation set (5.8%), suggesting that, although the SVM does a good job of learning to identify errors, these two data sets contain systematic differences that make it difficult for the SVM to generalize from one to the other. We have thus violated the basic premise of most machine learning algorithms: that the test data is drawn from the same underlying distribution as the training data. This hypothesis is supported by the observation that the hyperparameters selected during internal cross-validation are quite different from one another: the learned hyperparameters for the development set were C = 66.3692 and γ = 2^-11, whereas for the validation set they were C = 2.02018 and γ = 2^-7.
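For reference, the exact (binomial) form of McNemar's test used for such paired comparisons can be computed from the two discordant counts alone; the helper below is an illustration of the test, not the paper's own code.

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact two-sided McNemar p-value from the discordant counts:
    b = calls only classifier A gets right,
    c = calls only classifier B gets right.
    Under the null hypothesis, the discordant calls split 50/50."""
    n = b + c
    k = min(b, c)
    # One-sided binomial tail probability with p = 1/2, then doubled.
    p = sum(comb(n, i) for i in range(0, k + 1)) / 2.0 ** n
    return min(1.0, 2.0 * p)
```

The test uses only the calls on which the two classifiers disagree, which is what makes it appropriate for comparing two classifiers evaluated on the same division calls.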
As mentioned above, our ultimate goal is to produce a static SVM classifier that yields robust performance across a variety of possible data sets. Because our experiments suggest that our initial development and validation sets contain systematic differences, we next trained an SVM on the combination of the two data sets and tested the performance of the classifier on a second validation data set, which contains ten new series (Table ). As shown in Table , the SVM performs 9.1% better than StarryNite when tested on the new validation set. Furthermore, the similarity between the first and the second columns implies that the test data and the training data come from similar sources.
Comparison of SVM and StarryNite on the new validation set.
Note that in Table , the accuracies of both StarryNite and the SVM are lower than the results presented in Tables and . This is mainly because all the images in our first dataset are sampled at 1 minute time resolution, while some of the series in the second dataset have 1.5 minute resolution (see Methods for details of the datasets). We also trained and tested our SVM classifier only on series with 1.5 minute resolution and obtained a similar drop in performance (data not shown). This result can be explained as follows. When the sampling interval increases from 1 min to 1.5 min, the newborn cells sampled by the imaging protocol have moved further away from the parent cell, which makes divisions more difficult to detect, because immediately after a division we expect the parent and children to remain relatively close to each other. This larger separation between parent and children cells leads to an increase in divisions detected as movements. At the same time, because the newborn cells come closer to other cells that are actually moving, the number of movements detected as divisions also increases. Therefore, having experiments with 1.5 min time resolution in our test set makes the classification task more challenging for both methods. Nonetheless, the performance of the SVM was significantly better than that of StarryNite, validating the success of our approach.
The final trained SVM, which is incorporated into the StarryNite program, is trained from all three data sets, in an effort to provide the best generalization performance.