|Home | About | Journals | Submit | Contact Us | Français|
A graph of tissue vasculature is an essential requirement to model the exchange of gasses and nutriments between the blood and cells in the brain. Such a graph is derived from a vectorized representation of anatomical data, provides a map of all vessels as vertices and segments, and may include the location of nonvascular components, such as neuronal and glial somata. Yet vectorized data sets typically contain erroneous gaps, spurious endpoints, and spuriously merged strands. Current methods to correct such defects only address the issue of connecting gaps and further require manual tuning of parameters in a high dimensional algorithm.
To address these shortcomings, we introduce a supervised machine learning method that (1) connects vessel gaps by “learned threshold relaxation”; (2) removes spurious segments by “learning to eliminate deletion candidate strands”; and (3) enforces consistency in the joint space of learned vascular graph corrections through “consistency learning.” Human operators are only required to label individual objects they recognize in a training set and are not burdened with tuning parameters. The supervised learning procedure examines the geometry and topology of features in the neighborhood of each vessel segment under consideration.
We demonstrate the effectiveness of these methods on four sets of microvascular data, each with > 8003 voxels, obtained with all optical histology of mouse tissue and vectorization by state-of-the-art techniques in image segmentation. Through statistically validated sampling and analysis in terms of precision recall curves, we find that learning with bagged boosted decision trees reduces equal-error error rates for threshold relaxation by 5 to 21 % and strand elimination performance by 18 to 57 %. We benchmark generalization performance across datasets; while improvements vary between data sets, learning always leads to a useful reduction in error rates. Overall, learning is shown to more than halve the total error rate, and therefore, human time spent manually correcting such vectorizations.
The brain’s ability to process information and coordinate sensory and motor function relies critically on the body’s ability to deliver nutrients to the brain as well as to remove heat and waste from the brain’s constituent neurons. In the mammalian brain, these functions are provided by the blood that flows throughout the brain through an interconnected three-dimensional (3D) vascular network (Fig. 1). A need to understand the topology and morphology of the brain’s vasculature has received renewed interest (Risser et al, 2007b; Heinzer et al, 2008; Cassot et al, 2006; Blinder et al, 2010) as a result of the clinical importance of blood flow in neurological disease states as well as the growing popularity of technologies that infer neuronal activity from blood flow. Positron emission tomography and magnetic resonance imaging are among the imaging modalities that exploit the neurovascular coupling between neurological activity, the ensuing oxygen and energy consumption, and changes to local blood perfusion to infer neuronal activity. Yet the microscopic details of the vascular network and its relationship to surrounding neuronal units remains poorly understood, and investigators continue to debate which specific aspects of the neuronal activity elicit these observable changes in blood flow (Devor et al, 2007; Logothetis et al, 2001). Furthermore, the spatial extent of the imaged response extends beyond the anatomical limits of its corresponding neuronal origin (Grinvald et al, 2000; Fukuda et al, 2006; Schaffer et al, 2006), a phenomenon that is likely related to the topological properties of the vascular network.
It is necessary to generate complete, vectorized representations of the vascular connectivity over sufficiently large volumes so that edge effects are minimized, yet with sufficient resolution to capture the smallest microvascular elements. To accurately model the flow of blood in such a network, the graph should be free of both spurious connections and erroneous gaps. Although, in principle, direct error-free segmentation and vectorization of the vascular network should be possible, in practice there are several potential sources of error. These include upstream issues such as the completeness and stability of vascular staining, aberrations in the acquisition of images, and reduced signal-to-noise necessary to accommodate large-scale data collection in reasonable scan times. Furthermore, no segmentation method is guaranteed to be robust across all datasets and signal conditions. In general, the generation of topologically error-free vectorizations of 3D tubular structures from 3D grayscale data with tens of thousands of tubular segments is an unsolved problem. Specifically, the recent review of vascular segmentation by Lesage (Lesage et.al., 2009), concludes: “Aiming for a generic, flawless segmentation framework is probably illusory”, which can be interpreted as follows: state of the art segmentation (even in the simpler vascular networks explored in all of the cited work in Lesage) will be incorrect in that it will connect some vessels spuriously and miss other connections that should be made between vessels. Related work in connectomics highlights the importance of scalable image segmentation algorithms that learn from labeled datasets (Jain et.al. 2010b, Kleinfeld et.al. 2011). As one indication of the importance of such scalable automation, current state of the art in Serial Block Face Scanning Electron Microscopy (SBF-SEM) can acquire a cortical column in a year, but manually tracing the same volume with state of the art methods would take 1000 years (Helmstaedter et.al. 2008). Though state of the art in connectomics in electron microscopy still relies heavily on manual tracing, newer work has improved labeling speed approximately 50-fold with a centerline skeleton representation (Helmstaedter, et.al. 2011). Consensus centerline skeletons from multiple labelers are then found by voting to either eliminate or keep centerline edges. This evaluation of many individual segments by voting lends itself to the kind of learning on object-level connections explored in this work—in both cases, some centerline strands are added and some are deleted based on voting or learning.
Importantly, the presence of even a few defects in the connectivity graph can result in significant errors in analytical results. Specifically, the flow of blood through a vessel scales approximately as the fourth power of the vessel radius. Because vessel radii can span an order of magnitude or more, the resulting flow can readily span several orders of magnitude. Therefore the erroneous addition or exclusion of even a single medium-sized vessel has the potential to either shunt flow from or induce flow into a significant volume of smaller vessels. Thus the detection and correction of even a small number of graph defects can significantly improve the quality of downstream analytical results. This corrected vascular graph forms the underlying model for calculation of blood flow in the brain. The presence of errors could lead to errors in estimates of vascular flow and concomitant perfusion of tissue with gases and metabolites.
Here we develop methods for the detection and characterization of graph errors in large vascular datasets, together with the application of machine learning to optimize and automate the correction of such errors. Our algorithms utilize a combination of topology-, morphometry- and intensity-based features to locate defects and to learn the defect correction decision. A consistency learning algorithm is developed to resolve cases in which learned decisions for strand removal and strand addition conflict. We build on previous work (Tsai et al, 2009a; Tsai et al, 2009b) and image, segment, and vectorize labeled vasculature in cubic-millimeter-size blocks of brain tissue with micrometer resolution.
The paper is organized as follows: Section 1.2 introduces the types of errors the algorithms in this paper aim to correct and highlights the difficulties in the chain of sample preparation, acquisition and analysis that lead to these errors. Section 1.3 discusses published methods to address the problems described in Section 1.2. Section 2 introduces our Methods, with an overview of the relevant aspects of our acquired data in Section 2.1, followed by a discussion of the threshold relaxation method used to bridge gaps and a method to select candidate strands to delete from the vascular graph, respectively, in Sections 2.2 and 2.3. Section 2.4 introduces our labeled training data, features, bagged and boosted decision tree learning methods and our performance validation methods. Section 2.5 introduces the concept of consistency learning, which infers joint labels based on pairs of learned scores that interact with each other through common endpoints. Section 3 collects the performance results of our methods on real data, breaking down performance for each component of our method; bridging strand performance is addressed in Section 3.1, deletion of strands is addressed in Section 3.2, generalization error is addressed in Section 3.3, and consistency learning is addressed in Section 3.4. Section 4 discusses the key contributions of the paper; improvements to bridging strand performance is discussed in Section 4.1, the benefits of learning beyond the confidence metric are discussed in Sections 4.2 and 4.3, learning to delete strands is discussed in Section 4.4, consistency learning is discussed in Section 4.5, upstream acquisition and algorithm considerations are discussed in Section 4.6, and issues of generalization error are discussed in Section 4.7. Section 5 presents the overall conclusions and suggests the value of active learning as a next step.
The methods described in this paper take connectivity graphs, in the form of vectorized maps, as their starting point. The process of producing a vectorized graph from the topological structure of a real object can be divided into three stages: 1) data acquisition, 2) image segmentation, and 3) vectorization. Each stage has the potential to introduce discrepancies between the real anatomy and the final graph representation. We review these issues as a means to motivate our error-correcting algorithms.
The details of the data acquisition stage depend on the structure of interest and the imaging modality applied, yet a few considerations apply in nearly all cases. A series of 2D or 3D grayscale images are acquired with a fundamental resolution limit set by the imaging modality (Tsai et al, 2003; Weber et al, 2008). Technical realities such as detector noise, signal scatter, refractive aberrations, and acquisition speed can degrade the quality of the image. Regions of incomplete, weak, or extraneous contrast, due to poor labeling, intrinsic contrast gradations, or shadowing effects can adversely affect the perceived connectivity of the structure in the grayscale data.
The segmentation stage involves the conversion from grayscale data to a binary mask that accurately represents the morphometry and topology of the vascular structure. In particular, many methods exist to segment 3D vessels from raw data (Kirbas et al, 2004; Lesage et al, 2009), including multiscale eigenanalyses of local Hessian operators (Krissian et al, 1998; Frangi et al, 1998; Manniesing et al, 2006b), the use of filterbanks (Szymczak et al, 2005; Grady, 2010), tracking methods (Friman et al, 2010), vessel crawling (McIntosh et al, 2006), level set-based curve evolution methods (Manniesing et al 2006a) and segmentation learning approaches (Lu et al, 2009). Previous work on segmentation and vectorization of microvasculature from 3D optical image stacks (Cassot et al, 2009; Tsai et al, 2009b) demonstrate that high-fidelity segmentation can be achieved, particularly at high image contrast and signal-to-noise. Nonetheless, at regions of locally low image quality, segmentation errors persist (Fig. 2). Regions of partial dye occlusion, e.g., by a trapped blood cell, or weak staining can lead to breaks in the mask. Similarly, close-proximity vessels may result in erroneously joined masks, particularly in the limit of low signal-to-background, e.g., the presence of non-vascular structures in the data.
The vectorization stage entails the conversion from a binary mask to a graph or vector representation consisting of individual 1D strands and their interconnections (Risser et al, 2007a). A common first step in this process is the extraction of centerlines from the binary mask. Existing skeletonization methods can accomplish this for 2D datasets, but the ill-conditioned medial axis transform (MAT) that is fundamental to these techniques results in many small erroneous branches that develop as a result of either noise or local bulges along the structure itself. In 3D, MATs can also develop “medial surfaces” which are not centerlines at all. Curve evolution methods and morphological operators have been introduced to help mitigate these issues (Maddah et al, 2005; Leymarie et al, 2004; Bouix et al, 2005), and include recent work on vectorizing 3D microvascular networks (Risser et al, 2007b; Cassot et al, 2006; Maddah et al, 2005; Tyrrell et al, 2006). Still, discrimination between short outcroppings that require representative strands and bulges that result in erroneous hairs remains an unsolved issue.
Beyond the prior image analysis work on vessels, many biomedical imaging applications treat the analysis of filamentous structures and these may benefit from some of the methods described here (Gillette et al, 2011). For instance, the DIADEM challenge winners (http://diademchallenge.org/) developed the FARSIGHT toolkit for semi-automated tracing and vessel segmentation (Bjornsson et al, 2008). One interesting aspect of their approach (Bjornsson et al, 2008) is that automated analysis results are scored and manually corrected in rank order; similar use of confidence is envisioned for this work. The proof-editing of reconstructed neurons in the DIADEM challenge was the key bottleneck to produce accurate 3D reconstructions at scale (Peng et al, 2011). A recently developed 3D WYSIWYG interface to volumetric data for neuron tracing that allows editing the connectivity in 3D aims to address this bottleneck (Peng et al, 2011).3
The problem of connecting gaps in 2D, or weak edge linking, is a rich and well-established field of study (Casadei et al, 1996; Canny, 1986) and includes some recent studies on inferring missing edges in road network extraction (Sharon et al, 2000; Ren et al, 2007; Rothwell et al, 1994; Poullis et al, 2008). Similarly, edge deletion using machine learning to remove false edges from an overcomplete set has been demonstrated on the Berkeley Segmentation Benchmark (Kaufhold et al, 2004). Far fewer results have been collected for the analogous 3D problems (Krissian et al, 1998; Pock et al, 2005). One promising connection method grounded in the formalism of tensor voting uses the vectorized graph alone to infer gap connections (Risser et al, 2007a). More recently, the method has been shown to perform favorably to mathematical morphology and an Ising model for the same task (Risser et al, 2008). While the tensor voting method shows considerable promise, it relies only on the graph, and thus does not make use of the underlying grayscale data to inform the gap filling method and has also not yet been extended to exploit machine learning.
The methods in this paper learn to add or delete strand objects in vascular graphs, where these objects are results of other lower level image analysis algorithms. Similarly, related work in connectomics formerly aimed at optimizing voxel-level cluster similarity via the Rand index (Turaga, et.al. 2009) or a topology-preserving index (Jain, et.al. 2010a) have recently progressed to learning to cluster superpixels (Jain, et.al., 2011). Arguably, these superpixel approaches could also be applied to connecting gaps.
Data analyzed in this paper were obtained using previously described protocols (Tsai et al, 2009b). In brief, mice are transcardially perfused with a low viscosity gelatin solution containing the fluorescent label, fluorescein-conjugated albumin. The solution solidifies by cooling and is fixed with paraformaldehyde to prevent dye diffusion. The brain is extracted and blocked into mm3-scale tissue blocks which are then imaged with all-optical histology using two photon laser scanning microscopy (Tsai et al, 2003). This produces hundreds to thousands of overlapping 3D image stacks of fluorescein-stained microvasculature and surrounding tissue structures. These image stacks are nominally aligned based on approximate microscope stage coordinates and then aligned to voxel resolution by minimizing a quadratic registration energy in overlap regions across all blocks (Preibisch et al, 2009). This produces a large raw grayscale volume, typically 1000×1000×1000 voxels for our preparation.
A suite of image analysis algorithms (Tsai et al, 2009a; Tsai et al, 2009b) is applied to selectively filter this block of raw, noisy vasculature data, segment it into a 3D binary mask, and extract a set of monofilament centerlines. Vectorization of the centerline results in a graph, G = (V, E). Although not the main focus of the work in this paper, the main components of the image analysis algorithms applied prior to the learning in this paper include modules to enhance vessels, create volume-accurate vessel masks, create topologically accurate vessel masks and compute vessel centerlines (Fig. 3). These stages represent the degree to which data has been processed prior to the strand addition and deletion learning algorithms described here. The quality of the perfusion, fluorescein concentration, laser settings, sectioning, and registration stages can lead to variation in image quality between datasets. In this work, for instance, dataset A is significantly reduced in SNR compared to the others. In addition to these acquisition variables, the strand addition and deletion learning algorithm performances will depend on a number of image analysis parameters above, including the rod filter length, various filter sizes in the mask-generation stages, various adaptive thresholds, and centerlining parameters. The centerlining stage has perhaps the most direct influence on the threshold relaxation performance because it directly controls the number of degree 1 endpoints in the graph, G. Adjustable parameters in the centerlining stage affect the resultant graph by rejecting small fragments of the vascular network that would otherwise form spurious “Islands” of vasculature or “hairs”.
In the vascular graph, G = (V, E), each vertex vi, with deg(vi) ≠ 2, is either a bifurcation point or an endpoint, while each edge connects two vertices and is attributed with associated radii estimates, ri (Table 1A). This graph representation, along with the intermediate masks and grayscale volumes serve as a starting point for the work described here. Four such vectorizations, one for each of four distinct samples A through D, are analyzed in this work. Nearly the entire vascular mask is correctly segmented and fully connected. Thus errors of any type are relatively infrequent. The number of gaps and spurious connections is small relative to the totality of strands in the vascular graph, typically numbering in the hundreds for the more than 10,000 strands in a 1000×1000×1000 μm3 volume. Given the discussion above regarding the imperfect state of the art in segmentation and vectorization, and the sensitivity of topological metrics to graph structure, it is this population of graph defects which is the focus of the present work.
To describe the learning aspects of our method, the work presented here represents and describes operations on a number of intermediate results of variable type and structure. These symbols are collected for reference in this Section (Table 1A, Table 1B).
In this work, potential bridges to gaps in the vectorized strand network are found by a process of local threshold relaxation, but another algorithm that produces connections, e.g., tensor voting, could theoretically be used in its place. Tensor voting has been demonstrated to outperform an Ising model, a mathematical morphology model, and a standard phase field model of gap connection (Risser et al 2008). An evaluation of precision-recall F-score measures for threshold relaxation was shown to outperform tensor voting for a range of parameterizations (Kaufhold et al 2008).
At every internal endpoint in the network, a corresponding sub-volume of grayscale data is evaluated at progressively lower thresholds until a connection is made to another strand, subject to geometric constraints. A confidence is assigned to each bridge based on the value of the final threshold relative to the background noise in the grayscale sub-volume (Figs. 4 and and66).
The threshold relaxation algorithm accepts as input a filter-enhanced grayscale image volume, Γ, the corresponding binary segmentation, B, and its vectorization, G (Table 1A). The output of the method is a set of bridging strands, Si, and their respective confidence levels, Ci, (Fig. 5).
The confidence level for any individual bridging strand is the highest z-score that would allow the gap to be bridged through a path between an endpoint in the vascular graph and another point in the vascular graph not excluded by geometric constraints (Fig. 6).
Gap connection proceeds in two stages (Pseudocode in Fig. 7). First, candidate endpoints in the graph are found for each gap (Figs. 4 and and7A).7A). Every gap is presumed to originate at an endpoint vertex, PEi (Table 1B). Second, a monofilament strand is found that bridges these points. In a local volume about the endpoint, a threshold, Tz, is applied to the grayscale volume, Γ, to produce a new binary mask, Bz. Geometric constraints are applied to avoid backtracking by excluding points that fall behind the endpoint vertex. The threshold is relaxed until the resulting binary mask produces at least one connection between the endpoint vertex and at least one other graph point, PCi, in the space not rejected by geometric constraints. If multiple points on the graph become joined at a particular Tz, then the point that minimizes the mask-constrained path length to the endpoint vertex is chosen to be the connection point, PCi.
In the second stage, the strand that bridges the gap is found. Two steps are performed to refine the binary mask, Bz. First a binary search over thresholds is used to tighten the mask to include the fewest voxels while still maintaining a connection. Second, a “paired path length distance transform” is used to eliminate all points in the mask except those most likely to participate in the path. In brief, the shortest path connecting PEi and PCi in the resulting smaller mask, BG, computed with Dijkstra’s algorithm, is the output strand, Si.
The second stage of the threshold relaxation algorithm makes use of the endpoint, connection point, thresholds, and local volumes found in lines 1 to 20 of the pseudocode (Fig. 7A) and is encapsulated by the call to GetBridgingStrand in line 21 of the code. In line 1 of GetBridgingStrand (Fig. 7B), the tight threshold, TT, is chosen via a binary search of thresholds between TCi and TUi such that BT(PEi) = 1, BT(PCi) = 1 and DP(BT, PEi, PCi) < ∞, i.e., PEi and PCi are connected via BT (Fig. 8).
The output of the threshold relaxation algorithm is the set of candidate bridging strands, Si, and their corresponding confidence metrics, Ci. Each candidate bridging strand is a sequential list of coordinates for the monofilament curve which spans the gap. The confidence metric is the statistical z-score of the refined threshold used to generate the bridging strand. Note that threshold relaxation is the algorithm applied in this case, but theoretically any algorithm that made connections on endpoints could be used (such as Tensor Voting (Risser, et.al. 2007a), methods based on Ising models (Risser, et.al. 2008), and morphological closing (Gonzalez 2002), for example).
Candidates for strand deletion were chosen based solely on strand length. All strands that fell below a threshold length of 15 μm were considered candidates for strand deletion. At this threshold, approximately 17% of all strands were flagged as deletion candidates.
A human operator is presented with a tool for ground truthing as a means to produce a training set for the machine learning process. The operator visually considers each candidate strand in its local 3D context, including volume renderings of the grayscale data, the binary mask, and the vectorized graph. The operator then labels each candidate strand as a true or false candidate for bridging or removal. The necessity of training sets in which each candidate is ground-truthed by a human operator motivates the selection of only a sub-population of short strands as deletion candidates, rather than allowing all strands to serve as candidates; in theory all strands are deletion candidates.
Basic threshold relaxation produces only a single confidence metric on each bridging strand based on the strand’s intensity statistics (Fig. 9). For the purpose of machine learning, the threshold relaxation algorithm is extended to compute a number of additional features, including vertex count, curve length, etc. (Table 2B). These additional features are stored along with the original confidence metric in ΦBS. Note that each column of ΦBS corresponds to the features of an individual bridging strand.
Just as an intensity-based confidence metric can be computed as the minimum z-score for bridging strands (Fig. 6), it can serve as a confidence metric for candidate deletion strands as well. Similarly, additional features beyond confidence are computed for each candidate deletion strand, including features based on intensity, geometry, and topology of either the deletion candidate strand itself, or the neighboring strands to which the candidate connects. The list of deletion strand features is stored in ΦDCS (Table 2C). Note that each column of ΦDCS corresponds to the features of an individual candidate deletion strand.
AdaBoosted decision trees were used to score strands in this work. AdaBoost was run for nT = 100 (Table 1A) rounds for each training set using decision tree weak learners. In this analysis, the MATLAB® (R2007b) (MATLAB, 2009) Statistics Toolbox™ (Version 6.1, The Mathworks, Inc.) implementation of a decision tree was used. To prevent overfitting, the minimum example count per leaf node was set to 5, and the minimum splitting fraction was 10% of the total number of examples used to train each tree at each round of boosting.
The method of bootstrap aggregating, or bagging (Breiman, 1996), is used to improve the boosted decision tree model’s stability, reduce its variance, and prevent overfitting of parameters. In bagging, nMC (Table 1A) individual classification models are each trained (via boosting as described above) on a set of training data uniformly resampled with replacement. When classifying a new strand, the individual classifier results on that strand are averaged into a single score. The two key parameters in bagging are the number of iterations of bagging to combine, nMC (Table 1A), and the number of examples, n′ to uniformly sample from a training set. The number of labeled strand exemplars in our datasets are very few (only 300 to 400 strands per training sample) and can be unbalanced (for bridging strands found by threshold relaxation, typically only 20 to 50 bridging strands per ΦBS are incorrect, for example). For this reason, twice the number of elements are sampled to reduce the likelihood that resamplings are rejected for containing only positive exemplars. This procedure of bagging boosted decision trees is carried out for both bridging and deletion candidate strands and for all training/testing sets described in Section 2.4.6.
Our scoring convention is such that a high score for either bridging or candidate deletion strands indicate that strand should be kept, i.e. the strand is likely part of the vascular graph—and a low score indicates that strand should not be kept.
When evaluating detection performance for a single dataset (e.g., computing the improvement in detection performance on dataset C based on the other labeled elements of dataset C), the method of LOOCV validation is used. LOOCV, a form of cross-fold validation (Kohavi, 1995), uses all the training data for performance estimation, prevents over-fitting, and can be used to derive confidence metrics on ROC and Precision Recall curves (Fawcett, 2004). Because the number of strands in ΦBS and ΦDCS is small, each bagged boosted classifier is trained on all but one of the strands and tested on the strand that was left out. In this way, strandwise bootstrap aggregated scores that do not train and test on the same data are generated for each bridging strand and deletion candidate strand within a dataset.
When evaluating detection performance across a number of different datasets, the performance estimates are valid without LOOCV because training and testing datasets are different by construction. With different datasets, the resampled training set in each iteration of bagging is used to train a classifier to evaluate all samples in the test dataset instead of generating a bagged LOOCV set of scores per strand as is necessary for single datasets. We cross train two ways in this work: (1) training on individual datasets other than the test set (this is the case when classifiers are learned on dataset C and used to classify dataset B, for example) and (2) training on all three datasets left out (this is the case when classifiers are learned from datasets A, B, and C and used to classify dataset D, for example).
In summary, for each individual dataset, nMC × m × nT ~ 15,000,000, decision trees were trained in the computation of the bagged, boosted, and LOOCV bridging strand scores, where m ~ 300 (Table 1A), and approximately 20,000,000 trees were trained in the computation of the deletion candidate strand scores, where m′ ~ 400 (Table 1A). At the conclusion of the learning process for both ΦBS and ΦDCS, pairs of scores that share a common endpoint are selected and indexed by k. Thus, k is an index over all short strand endpoints that also have a bridging strand. These are B(k) for the element of ΦBS corresponding to the kth endpoint and H(k) for the element of ΦDCS corresponding to the kth endpoint.
Learning for bridging strand addition and deletion strand removal are treated thus far as separate decisions through separate machine learning stages, but these decisions are not independent. Specifically, if a bridging strand connects a point on a deletion candidate strand, a conflict between the addition and deletion decisions can arise. The cartoon diagram (Fig. 10) illustrates both consistent decisions and decisions in conflict. A graph consistency and conflict approach is applied, aimed at fusing these two decisions into consistent decisions over all interacting deleted and added strands.
Decision conflicts arise from endpoints on strands that are short enough to qualify as a deletion candidate. These short source strands will be associated with two learned scores: one for the bridging strand that emanates from it and one for the deletion candidate strand. The two scores, H(k) and B(k) for the kth endpoint, identify a point in the joint feature space that summarizes that endpoint’s consistency. If both scores are high, then both the short source strand and the bridging strand are likely kept; if both scores are low, then they will both likely be eliminated. An example plot in the H-B score space is shown (Fig. 11). The endpoints that are jointly labeled “keep” are shown with green “plus” signs, endpoints that are jointly labeled “remove” are shown with red dots, and endpoints with conflicting labels are shown with black circles. A classifier is trained to separate the two types of consistent endpoints, and then it is applied to classify endpoints with conflicting labels.
A typical vascular graph in this work has 500,000 vertices, over 10,000 strands, and 40 vertices/strand. In a vascular graph of this size, there are typically 7,500 junction points of degree 3 or higher and ~ 3,000 endpoints of degree 1. The majority of the endpoints occur where vessels leave the bounds of the 3D dataset, i.e. the data cube, while ~ 300 occur internally. The vessels that create endpoints exiting the data cube are not considered for bridging or evaluated in performance measures. Thus, approximately 300 internal bridging strand candidates are generated per dataset. Additionally, based on a length threshold of 15 voxels, approximately 17 % of all strands were considered for deletion.
We first apply machine learning to the problem of bridging gaps in the vascular graph with the threshold relaxation method described above. Example bridging strands, computed by threshold relaxation are shown (Fig. 12). For the purpose of precision recall curve scoring, bridging strands are scored as a true positive, a false positive, or a false negative by a human operator (Fig. 13). A false negative constitutes an endpoint that the human operator believes should be connected to a strand other than the one bridged by the algorithm. These false negatives contribute both a false positive and false negative to the precision recall curve if the strand’s score is such that the bridge is retained (middle panel of Fig. 13). If the bridge is not retained, only the false negative contributes to the precision recall curve. We compare the detection performance based on different features of the candidate bridging strand. A set of precision recall curves for all candidate bridging strands (m > 300) in a dataset (left of Fig. 14). Curves are color coded according to the feature used to score the candidate strand. Note that the confidence computed by the original threshold relaxation algorithm does not dominate the detection performance, and thus does not constitute a dominant feature of the bridging strands.
We next implemented machine learning, as described above, on the features produced by the threshold relaxation algorithm to produce bagged boosted scores for each bridging candidate. The original threshold relaxation confidence feature is compared to the detection performance learned classifier for an individual dataset (right of Fig. 14). For the example of Dataset B, the results indicate that the learned bridging strand scores dominate the detection performance compared to results without learning. The learned bridging strand scores also dominate the detection performance of all other individual features computed (comparison of the left and right panels of Fig. 14). This domination by the learned threshold relaxation method holds for all four datasets tested.
We also applied machine learning to the removal of spurious connections and hairs, defined as short strands that fell below a threshold length and marked as candidates for removal. Again, the machine learning algorithm was trained on a ground truth subset of the data, in which removal candidates were judged by a human operator as strands that should be retained or deleted (Fig. 15). If a strand is correctly retained, it is a true positive. If it is incorrectly retained, it is a false positive. If a strand should be retained but is not, it is a false negative.
We consider the effect of bagging on the detection performance. Precision recall curves for the performance of learned candidate strand deletion performance demonstrate that bagging the boosted LOOCV score (Methods) generally outperforms individual classifiers (Fig. 16). This indicates that LOOCV on a single boosted, but not bagged, classifier may be sufficient to prevent overfitting, but that performance can be improved by bagging. The detection performance of the intensity-based deletion candidate feature, zmin, compared to each deletion candidate’s learned score illustrates the benefit of learning on multiple features compared to a more narrow feature-based approach (Fig. 16).
In our study of the vasculature, there is no known a priori reason to weigh false positives more or less than false negatives. Thus the intercept of a precision recall curve and the unity-slope line serves as an estimate of the overall performance of a learning algorithm on a set of data. A set of unity-slope intercepts of bridging strand detection performance curves are plotted for all datasets (left of Fig. 17). When the dataset in the column is the same as the training dataset, LOOCV as described above is used for validation. The detection performance for deletion candidate strands is similarly presented (right of Fig. 17).
Note that in Section 2.4.1, only the shortest deletion candidate strands (<15 μm) were ground truthed. In theory, all strands in G could be false positives and this restricted sample could introduce sampling bias. To test the relative effect of this sample bias, a random sample of 500 deletion candidates between 15 and 40 μm in length was also ground truthed. Only one of these 500 candidates was marked as a strand to delete; this error rate of approximately 0.2 % is acceptable considering that all deletion candidate strand error rates are greater than 5 % (right panel in Fig. 17). Note also that in the restricted sample of short strands that was ground truthed, more than 20 % of deletion candidates < 10 μm in length and approximately 10 % of deletion candidates between 10 μm and 15 μm in length were marked for deletion. This analysis by length indicates that the ground truthing is focused on the candidate strands where learning is of value. All performance metrics are also computed only for the restricted set.
Some endpoints produce simultaneous candidate strands for deletion and addition. As described in the Methods section, such strands are associated with two scores, H(k) and B(k) for deletion and gap bridging, respectively. A decision surface that separates the consistently labeled and scored endpoints is learned from all the endpoints with consistent ground-truth labels, using quadratic discriminant analysis (Fig. 18). Conflicting labels, indicated by open circles, are relabeled according to the learned consistency rule. In this way, learning is used to arbitrate among possible strand addition and deletion decisions to impose a consistency among all joint decisions made from two different learned functions. Note that only the atomic labels associated with addition or deletion were used in Consistency Learning. The labels in the joint space are inferred.
The focus of this paper is on methodologies to correct erroneous gaps and spurious strands in a vascular graph (Figs. 1 & 2). In particular, machine learning is applied to a set of features derived from both intensity and topological characteristics of the dataset. First, a method of threshold relaxation is used to detect gaps and provide candidate bridging strands to add to the graph (Kaufhold et al, 2008) (Figs. 2, ,44 to to8).8). The threshold relaxation method, itself, enjoys a number of desirable characteristics: (1) it exploits graph topology to locate defects; (2) it exploits underlying intensity data to guide connections; (3) it prevents backtracking, and (4) it can connect large gaps. Next, machine learning is applied to features extracted from the candidate strands (Figs. 9 to to17).17). This learning framework introduces several additional advantages, including (1) the incorporation of multiple features based on both intensity and topology; (2) the potential to incorporate more sophisticated search strategies, e.g., tensor voting (Risser et al, 2007a; Risser et al, 2008) as additional learning features; (3) the minimization of the parameter sensitivity in graph correction decisions; and (4) when used in conjunction with deletion candidate strands, the ability to balance conflicts between decisions to jointly add or remove strands.
Every bridging strand discovered by the original threshold relaxation algorithm was assigned a confidence metric (Kaufhold et al, 2008) (Figs. 6, ,77 and and1414 and Tables 1A and and2B).2B). The bridging strand confidence for each strand is a measure of the “weakest link” in the intensity evidence between two points. Specifically, it is the path through the intensity evidence whose minimum z-score is highest (Fig. 6). As such, it may fail to connect breaks that are not supported by grayscale data, e.g., regions in which the staining in a vessel falls to background levels for stretches that exceed the width of applied filters. This situation is ameliorated by the incorporation of other features, including geometric and topologically based features, into the decision process through machine learning.
The deterministic bridging strand confidence alone reaches maximal equal precision/recall rates on the lowest signal-to-noise ratio datasets of 78 % to 86 % (Fig 17). However, we show that a number of other metrics can improve prediction performance, including strand length, tortuosity, and distance profile to closest vessel, among others. It is worth noting the lists of bridging and deletion strand features are not exhaustive (Tables 2B & 2C). These additional features are meant to quantify simple strand characteristics beyond the intensity-based confidence metric that may be used to decide whether each strand should be kept or removed. Note that bridging strand and deletion strand features need not be identical or even similar. The precision recall curve improves significantly with the use of validated machine learning scores that were generated by LOOCV scores on bagged and boosted decision tree classifiers using these additional features. The generalized confidence metrics computed using machine learning methods improve the best equal precision/recall values for these data sets by about 5 % (left of Fig 17).
Spurious endpoints, or vectorization “hairs,” can lead to erroneous connections in the threshold relaxation procedure (Kaufhold et al, 2008). The additional algorithmic step of selecting candidate strands for deletion addresses this limitation (Fig. 2, ,1313 and and15).15). The decision to remove a candidate is learned from a set of topology- and intensity-based features of the candidate strand and of neighboring strands. As discussed above, only short strands were considered deletion candidates in this work, but this strand validation concept can be extended to include all strands, thus providing a probabilistic framework for inference on the vascular graph.
The implementation of a strand deletion process along with a strand addition, i.e., gap bridging, process introduces the potential for conflicting decisions. Specifically, a decision to remove an endpoint-terminated hair may be inconsistent with a decision to add a bridging strand from that endpoint. While the deletion candidate strands could, in theory, be removed first, and bridging strands estimated on the remaining vascular graph, or vice versa, to avoid cascaded errors, we instead choose to coordinate the addition and deletion of strands according to joint statistics on both because both algorithms produce uncertain labels on strands. In its current implementation, consistency learning makes use of two features, the bridging strand score B(k) and the deletion candidate strand score H(k), to generate a decision surface (Figs. 11 and and18).18). Extension of the decision consistency concept to a larger feature space is straight-forward and can be accomplished within the same learning framework used to generate the individual scores.
In some datasets, e.g., dataset B, the application of learning to the strand deletion decision added relatively little value in detection performance compared to the strand confidence feature alone. The comparatively larger improvement in another data set, e.g., dataset D, illustrates the sensitivity of learning performance to upstream considerations. These considerations include tissue preparation, image acquisition, and a number of image analysis parameters such as filter sizes, adaptive thresholds, and centerlining parameters. Centerlining parameters most directly affect the number of degree 1 endpoints in the graph by controlling rejection of short strands that are either unconnected, i.e., “islands”, or singly-connected, i.e., vectorization “hairs.” Nonetheless, any segmentation and vectorization algorithm may serve as upstream sources for the graph correction algorithms described in this paper.
Note that small vascular loops were ignored in this work for two reasons; they are infrequent and inconsequential to the flow modeling work which motivates the learning algorithms proposed. In the context of our learning algorithms and the size of our datasets, the frequency of small loops introduced by threshold relaxation before learning is very low. For instance, across our four datasets, the number of small loops introduced by threshold relaxation can be bounded above at 3 % because that is the total number of bridges connected by threshold relaxation; i.e., there are approximately 300 bridging strands added to approximately 10,000 strands total per dataset. Further, on a sample dataset, only approximately 5 % of the bridging strands introduced small loops. On dataset B, 13 of 326 bridging strands created spurious loops before learning, for example. In sum, approximately 0.15 % additional spurious loops will likely be introduced into a dataset due to learned threshold relaxation. We also do not attempt to detect and/or delete loops or parts of loops. From the flow modeling perspective that motivates this work, small loops only introduce an infrequent additional parallel path in flow modeling at some locations, and are very minor effects when compared to the vessel lengths they are incident to. The loops introduced by threshold relaxation are unlikely to affect any statistical measure on datasets this size, and further, very small erroneous loops are typically simple to detect and reevaluate when they do arise.
Evaluating LOOCV detection performance across datasets A to D, learned threshold relaxation error rates are reduced between by 5 % and 21 %, with a mean of 16 %, compared to threshold relaxation without learning. Similarly, deletion candidate strand detection performance error rates are reduced between 18 % and 57 %, with a mean of 44 %. This demonstrates the potential for learning to improve detection peformance substantially for both tasks. However, bridging and deletion candidate strand population feature statistics are variable, and training and testing on different datasets (cross-training) reduces the observed performance gain from learning. Cross-trained detection performance (Fig. 17) for bridging strands show some expected degradation in the error rate reduction to between −3 % to 15 %, with a mean of 7 %, as compared to LOOCV detection performance. The negative error rate indicates that the learned result was poorer than the result without learning. It is important to note that dataset A differs substantially in its signal-to-noise ratio from datasets B, C, and D, and was included to quantify learned performance in a significantly altered population of features. Surprisingly, despite its differences, bridging strand detection for dataset A improved the most on cross-training compared to no learning. For deletion candidate strands, datasets B, C, and D benefited from learning when trained on a different dataset, reducing error rates 31 % to 53 % on those datasets, with a mean of 40 %, while dataset A’s performance was significantly worse than performance without learning. The implication is that in dataset A, many valid deletion candidates were short, and thus fit the criteria for selection without learning; but the topology- or intensity-based features of these hairs differed significantly from those in other datasets. Thus learning on other datasets served to confound decisions on dataset A.
The empirical analysis of generalization error suggests that different datasets have different feature/label distributions. The degradation in performance when training and testing on different datasets compared to validated learning performance on the same dataset points to at least two strategies to reduce generalization error. First, upstream improvements in data quality as well as improved algorithm parameter stability can both be expected to make datasets more uniform, and consequently, benefit more from learning which vascular graph strands to delete and/or add. However, even without improving upstream data quality or parameter stability, learned results may be improved by better matching between individual dataset feature/label population statistics. Cross-trained results by dataset indicate that some datasets appear to match other datasets with variable fidelity -- using only the best matched data for training may improve performance. Similarly, selective sampling methods can also be applied to the set of all training datasets to tune a sampled dataset to more closely reflect the population of features observed in the dataset targeted for testing. Finally, in the case that dataset statistics differ not only by feature distribution, but by ground truth label, active learning methods can be used, in conjunction with sampling methods described here, to minimize the number of additional labels in the test set required to improve performance with learning some incremental amount.
This work applies machine learning to correct connectivity defects in the graph representation of a 3D vascular network. Threshold relaxation (Kaufhold et al, 2008) is used to detect gaps and generate candidate bridging strands, while strand length is used to select candidate strands for deletion. Machine learning is then applied to a number of intensity- and topology-based features to generate deletion strand and bridging strand scores. A decision surface is trained to resolve conflicting scores, yielding a consistent decision for every candidate defect. Machine learning generally showed improved performance against unlearned single feature discrimination. Error analysis based on leave one out cross validation for same-set training yielded 16% and 44% mean improvements for bridging and deletion respectively, compared to no training, and achieved precision recall values as high as 88 % and 94 %.
As described above, each dataset is different, but a 1000x1000x1000 μm3 dataset typically contains more than 10,000 strands. In this graph, there will be approximately 300 internal endpoints, of which approximately 280 should be connected as judged by a human and approximately 20 whose strand should be eliminated. If learning on bridging strands yields P=R=0.85 (Fig. 17), there will be approximately 222 true positives found by learning, with about 39 false positives and 39 false negatives (Fig. 13). In the same dataset, approximately 1000 short deletion candidate strands will contain approximately 150 spurious strands, and assuming learning yields P=R=0.92 (Fig. 17), approximately 36 of these spurious strands will remain and 36 strands will be incorrectly deleted. Of the strands not interrogated for deletion due to selection bias, there may be an additional 20 spurious strands based on a 0.2% spurious strand rate observed for strands longer than deletion candidates. Thus, the learning methods described in this paper reduce 470 total errors to 170 total errors, more than halving the error rate, and therefore, human time spent manually correcting datasets, which is argued to be the bottleneck for similar efforts (Peng et al, 2011). The additional benefit of consistency learning is expected to further reduce this error rate.
For both of the learning problems studied, we discovered a number of informative strand features that lead to improved decision-making performance regarding which strands to add and which strands to delete from a real vascular graph. Both strand addition accuracy and strand deletion accuracy were demonstrated to improve using a limited set of features. A richer feature set that includes more information on the local topological and geometric context, e.g., local or regional grammar on strands, can be expected to incrementally improve performance of both algorithms. Finally, we formalized a tradeoff measure of consistency between adding new candidate strands and deleting candidate strands.
Algorithmic elements presented here provide intermediate vascular graph correction tools which are part of a larger system to reason more globally over the intermediate labels and scores computed from these modules. For instance, the consistency learning method provides a natural interface to active learning (Freund et al, 1997b; Dasgupta et al, 2009). By finding the endpoints that exhibit the most conflict across strand addition and deletion scores, human operator effort can more efficiently be directed to labeling the elements of the vascular graph that lie on or nearest the decision margin of the consistency learning problem. The learning algorithms also provide a formalism to rank and present confidence-rated examples to a human operator. Such a formalism may be of particular value in managing the variable SNR within and across these large datasets.
We thank Mahnaz Maddah for making available her 3D Chamfer3,4,5 path length distance transform. We thank the anonymous reviewers for their careful read and insightful comments which substantially improved the quality and overall readability of this manuscript. We thank Uygar Sumbul and Sebastian Seung for their helpful discussions on using convolutional networks for topology preservation in dendritic trees. This work was supported by grants from the National Institutes of Health (EB003832, MH085499, and OD006831).
Dr. John Kaufhold is an image scientist and SAIC Technical Fellow at Science Applications International Corporation, where he conducts research on medical and biological image analysis. Dr. Kaufhold has reviewed for IEEE’s Computer Vision and Pattern Recognition, and has served on various program committees, including Microscopic Image Analysis and Applications in Biology. Prior to joining SAIC in 2005, Dr. Kaufhold conducted medical image analysis research at General Electric’s Visualization and Computer Vision Lab, where he was named as inventor on ten issued patents in medical image analysis. Dr. Kaufhold earned his Ph.D. in biomedical engineering from Boston University in 2001.
Dr. Philbert Tsai is an associate project scientist in the Physics department at the University of California, San Diego, where he conducts research on imaging neurovascular structure and function. Since 2003, he has served as a lecturer for the Cold Spring Harbor Labs course on Imaging Structure and Function in the Nervous System. He has served on the program committee for the SPIE Photonics West Conference on Commercial and Biomedial Applications of Ultrafast Lasers, and the review committee for the OSA Frontiers in Optics Conference. Dr. Tsai earned his Ph.D. in Physics from UCSD in 2004.
Dr. Pablo Blinder is a newly appointed assistant professor at the Tel-Aviv University. He completed an Israeli Science Foundation post-doctoral fellowship with Prof. David Kleinfeld at University of California, San Diego. Dr. Blinder studied the structure-function relationship in different biological systems and currently focuses on large scale vascular networks. Dr Blinder’s multidisciplinary approach to this topic encompasses in vivo physiology, block-face volumetric imaging and a graph theoretic analysis of biological networks. Dr. Blinder earned his doctorate in Biology from the Ben-Gurion University in 2007 with a joint supervision from the Physics department from the Tel-Aviv University. Dr. Blinder’s post-doctoral work is partly supported by a fellowship from the Israel Science Foundation.
Prof. David Kleinfeld is at the University of California, San Diego, where he conducts research on fundamental and biomedical aspects of neuroscience. He is a past director of the “Imaging Structure and Function in the Nervous System” postgraduate school at Cold Spring Harbor Laboratories and the “Methods in Computational Neuroscience” and “NeuroInformatics” schools at the Marine Biological Laboratory. In 2010, he received a NIH Director’s Pioneer Award. Prof. Kleinfeld earned his Ph.D. in physics in the laboratory of Prof. George Feher at UCSD in 1984, completed the “Neurobiology” postgraduate school at the Marine Biological Laboratory that same year, and worked at AT&T Bell Laboratories from 1985 to 1995.
The standard Euclidean distance transform, Dr(B), yields the distance from every bright pixel to its closest dark pixel. Note that the distance transform is 0 everywhere outside the mask. 3D Euclidean distance transforms are used in this work. The path length distance transform, Dp(B, s), returns the path length distance, Dp, from a chosen starting point, s, inside the mask. The path length distance is defined as the geodesic, i.e., shortest path, from one point in the mask to another point in the mask constrained such that all intervening edges are also in the mask. By definition, the path length distance between points on the mask and points outside the mask is ∞, i.e., they are not connected. Many methods can be used to compute the path length distance transform, including fast marching methods and Chamfer methods. In this work, we use the Chamfer3,4,5 method for path length computation as defined in (Maddah et al, 2005) (and abstracted from Mahnaz Maddah’s work).
The 3D extent of each candidate strand sets its 3D bounding box for visualization -- a margin of 15 μm about each bridging strand is the default setting for volume and surface renderings. Volume renderings were produced using (Conti, 2011). For each bridging strand, the user is presented with a 3D volume rendering of the data, ΓB, overlaid with a rendering of the mask, BB. The strands in the graph, GB, within some distance margin of the bridging strand, are also plotted in the same coordinate system. This default behavior can be adjusted with a control panel that can increase or decrease the size of the visualized bounding box. The volume rendering can be toggled off, set to render in 2D or 3D, or made more or less opaque. The surface rendering can also be toggled off and can be adjusted in transparency. By adjusting a proximity threshold, the local vascular network topology representation of centerlines can be expanded or made to include only those strands touching the bridging strand. The nearby centerline plots can also be toggled off.
The visualization of each candidate strand in its 3D local vascular network context can be rotated, zoomed, and translated. Using these features, the user can rotate the bridging strand into a view angle and zoom level that most clearly illustrates whether it is or is not a valid candidate strand. A 2D snapshot of each candidate strand is retained for later auditing. Audit images streamline the process of resolving differences between different ground-truth labels assigned by different users for various candidate strands. At the user’s discretion, multiple audit images can be collected. The labeling of bridging strands and deletion candidate strands differ slightly, as described below.
After viewing each bridging strand in its local 3D context, the human operator labels the strand as either a correctly connected gap, i.e., a true positive, a “hair”, i.e., a false positive, or an endpoint that should be connected to the vascular graph, but threshold relaxation found the wrong connection point so that it is both a false positive and a false negative. The human operator can also label a bridging strand as ambiguous, in which case its label is not used in learning.
Deletion candidate strands are similarly labeled by a human operator. After viewing each deletion candidate in its local 3D context, the human operator labels the strand for either retention or removal, by selecting whether the strand and/or its neighbors should be kept, or the strand or part of it should be deleted. The human operator can also mark the strand as “not sure,” in which case its label is not used in learning.
The interested reader can find more theoretical and implementation details of Adaboost in (Freund et al, 1997a). For the ith bridging strand, AdaBoost accepts as input a feature vector, xi, and a label which indicates whether or not (y = +1/y = −1) the example is labeled as an accurate bridging strand by the human observer. AdaBoost produces a score, H(xi) from this input. In this analysis, AdaBoost was run for nT = 100 rounds for each training set using a decision tree weak learner.
Decision trees are well known classification tools which repeatedly partition a dataset by thresholding multiple features on successively smaller subsets of the feature set. Caruana et. al. (Caruana et al, 2006) demonstrated that boosted decision trees generate highly robust classifiers across multiple types of datasets and performance metrics, providing broad evidence that absent any other information or intuition on choosing a learning algorithm for a specific problem, boosted decision trees produce high-quality results more frequently than competing algorithms, including random forests, logistic regression, Naïve Bayes, k-nearest neighbors, standard decision trees, bagged decision trees, neural nets, support vector machiness, or other statistical methods. Specifically, Caruana (Caruana et al, 2006) shows that boosted decision trees consistently have the best area under the receiver operator curve, which is the closest proxy to the precision recall detection metrics we aim to optimize in this analysis. This motivates the use of boosted decision trees in our work.
Decision trees are known to suffer from overfitting when grown to full size on a large, noisy dataset (Bramer, 2007). In the case of fully grown trees, each leaf node will 100% correctly classify some of the training examples, leading to a 100% detection performance on the train data, but may suffer terrible performance on a novel test dataset. Many decision tree fitting techniques and algorithms have been developed to mitigate these overfitting problems, e.g., implementations include Quinlan’s C4.5 (Quinlan, 1993) and Breiman et.al.’s CART (Breiman et al, 1984), and include techniques to halt growing the tree after some complexity is reached or to grow a tree to full size and then prune back decisions that are likely due to noise alone. In this analysis, the MATLAB® (R2007b) (MATLAB, 2009) Statistics Toolbox (Version 6.1) implementation of a decision tree was used. To prevent overfitting, the minimum example count per leaf node was set to 5, and the minimum splitting fraction was 10% of the total number of examples used to train each tree at each round of boosting.
Note that other learning methods were also benchmarked using 500 iterations of bagging as was done for boosted decision trees. Support Vector Machines, linear discriminants, boosted stumps and boosted linear discriminants were evaluated and none outperformed boosted decision trees for the same number of bags for either bridging strand or deletion candidate strand learning.
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.