Recent advances in computational image processing have made it possible to reconstruct camera motion and scene geometry from a series of monocular images. By applying these methods to endoscopic image sequences, it is possible to create detailed, quantitative anatomic reconstructions. Such anatomic reconstructions have many potential clinical uses. Our objectives in this study are to (1) develop a process flow for reconstruction from endoscopic image sequences and (2) present results supporting the hypothesis that such reconstructions can be computed.
We first outline the overall process flow for endoscopic reconstruction. Then, we present an instantiation of this process flow using recently developed methods in computational vision. We apply these methods to cadaveric specimens for which ground-truth endoscopic motion is known.
We are able to produce consistent estimates of endoscopic motion and dense reconstructions of the surrounding anatomy for >65% of 1373 image pairs.
Our study indicates that processing endoscopic images to produce anatomic structure is feasible. Such reconstructions have high potential clinical value for intraoperative navigation, diagnosis, and treatment planning.
Computer-assisted surgery is a maturing field that has expanded our ability to perform complex sinus and anterior skull base surgeries safely. However, these new techniques demand new levels of precision in surgical navigation and pose challenging problems in the visualization of anatomy and of anatomic revisions or reconstructions.
In this study, we present a new concept in computer-assisted surgery, which we refer to as “quantitative endoscopy” (QE). QE allows for the precise and quantitative reconstruction of surface anatomy from a series of monocular endoscopic images. QE has been made possible by recent advances in image processing algorithms that reconstruct camera motion and scene geometry from a series of monocular images. Solutions to this problem, referred to as structure from motion in the computer vision literature,1 have found numerous applications in digital imaging. Our goal is to apply these methods to endoscopic image sequences to create detailed, quantitative anatomic reconstructions.
Several groups have specifically used endoscopic video to extract local geometry. Taylor et al.2,3 used simple hierarchical correlation methods to track identified landmarks in successive images from a robotically manipulated laparoscope. Mori4 has developed a virtual endoscopy system making use of epipolar geometry. Several groups, including ours5,6 and Mourgues et al.,7 have produced dense depth/surface maps from carefully calibrated stereo endoscopes. The work of Higgins extensively considered the development of CT to video registration for the purpose of lung cancer assessment and virtual bronchoscopy.8–11 However, all of this work assumes the presence of a 3D model (from CT), and the goal is to register against this model. Consequently, the methods they have developed are not able to perform explicit surface reconstruction from video data. Our group developed an initial demonstration of motion-based anatomic reconstruction on phantom data.12
The development of QE, particularly of tubular structures such as the sinuses, ear canals, or throat, presents a number of technical challenges. Extreme changes in lighting, low image texture, and unfavorable motion trajectories all contribute to the difficulty of the reconstruction problem. Thus, our goals in this study are to (1) provide an instantiation of the feature matching and reconstruction framework suitable for typical sinus imagery, (2) validate the hypothesis that motion estimation can be performed reliably on endoscopic data with this framework, and (3) provide illustrative data showing the resulting anatomic surface reconstructions.
Consider a camera observing a point P = (x, y, z)t ∈ R3 on a surface S from a position characterized by the location of the optical center c and a rotation matrix R, both expressed relative to a fixed coordinate system. The projection of P onto the camera image plane can be described in two steps. First, the point P is projected into camera coordinates as Pc = (xc, yc, zc)t = Rt(P − c), where Rt denotes the transpose of R. Once in camera coordinates, the point is projected onto the image plane as p = (u, v)t = (fxc/zc, fyc/zc)t, where f is the focal length of the imaging system. In practice, there are additional parameters that identify the location of the optical center in the pixel grid and scale coordinates to pixel coordinates. The identification of these parameters is well studied in the computer vision and photogrammetry literature.13
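As an illustration, the two-step projection model above can be sketched in a few lines of Python (the function name and numeric values are purely illustrative):

```python
import numpy as np

def project(P, c, R, f):
    """Project a world point P through a camera with optical center c,
    rotation R, and focal length f, following the two-step model:
    P_c = R^t (P - c), then p = (f x_c / z_c, f y_c / z_c)."""
    Pc = R.T @ (P - c)          # world point expressed in camera coordinates
    return f * Pc[:2] / Pc[2]   # perspective division onto the image plane

# A camera at the origin looking along +z sees the point (0.1, 0.2, 2.0)
# at f * (0.05, 0.1) on the image plane.
p = project(np.array([0.1, 0.2, 2.0]), np.zeros(3), np.eye(3), f=1000.0)
# p == [50.0, 100.0]
```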
Consider now observing the same physical point P from two camera positions (c1, R1) and (c2, R2) with c1 ≠ c2. The point P will project to two image locations p1 = (u1, v1)t and p2 = (u2, v2)t. Writing p̃i = (ui, vi, f)t for the corresponding homogeneous image coordinates, the following condition now holds1:

p̃2t E p̃1 = 0,  E = R2t sk(c2 − c1) R1,  (1)
where sk(·) denotes the skew-symmetric matrix form of the cross product.14 The matrix E therefore encodes the motion of the camera between the two viewpoints and is referred to as the essential matrix.15 The essential matrix can be estimated using linear techniques from eight or more matched points in two images16 or nonlinear techniques from as few as five points.17 The process of computing E (or an equivalent form) and using it to estimate the geometric relationship between the cameras is referred to as the motion problem. It is important to note that if some translation component c2 − c1 satisfies Eq. 1, then any scalar multiple satisfies it as well, because scaling has no effect on the zero vector on the right-hand side. Thus, translation can be estimated only up to a scale factor using this technique.1
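The constraint can be checked numerically. The sketch below takes E in the form E = R2t sk(c2 − c1) R1 (a reconstruction consistent with the world-referenced poses used above, not a quotation of the paper's formula) and verifies that the rays to a common point from two poses satisfy the epipolar constraint:

```python
import numpy as np

def skew(v):
    """sk(v): the matrix satisfying skew(v) @ x == np.cross(v, x)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

# A point in front of two cameras related by a pure lateral translation.
P = np.array([0.4, -0.3, 5.0])
c1, c2 = np.zeros(3), np.array([1.0, 0.0, 0.0])
R1 = R2 = np.eye(3)
p1 = R1.T @ (P - c1); p1 /= p1[2]   # homogeneous image ray (u, v, 1)
p2 = R2.T @ (P - c2); p2 /= p2[2]
E = R2.T @ skew(c2 - c1) @ R1
# The epipolar constraint p2^t E p1 = 0 holds for any point both cameras see.
assert abs(p2 @ E @ p1) < 1e-12
```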
Given two known camera positions, it also is possible to triangulate observed points and compute their position in 3D space. This is referred to as the stereo problem.18 The essential issue in stereo is to match corresponding points in the two images. Extensive studies of local matching solutions and globally optimal matching solutions19 have been published, although no definitive answers exist.18 However, in most cases it is possible to implement efficient and robust local matching solutions.
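For intuition, a minimal local matching step can be sketched as follows. It assumes rectified images (so correspondences lie on the same scanline) and uses a plain sum-of-squared-differences window; practical systems add smoothness constraints, subpixel refinement, and left-right consistency checks. All names here are illustrative:

```python
import numpy as np

def match_scanline(left, right, row, x, half=3, max_disp=16):
    """Disparity of pixel (row, x): slide a (2*half+1)^2 window along
    the same row of `right` and keep the offset with the smallest sum
    of squared differences (SSD)."""
    patch = left[row - half:row + half + 1, x - half:x + half + 1].astype(float)
    best_d, best_cost = 0, np.inf
    for d in range(max_disp + 1):
        if x - d - half < 0:
            break
        cand = right[row - half:row + half + 1,
                     x - d - half:x - d + half + 1].astype(float)
        cost = np.sum((patch - cand) ** 2)
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d

# Synthetic check: the right image is the left image shifted by 4 pixels.
left = np.random.default_rng(1).integers(0, 255, size=(40, 60))
right = np.roll(left, -4, axis=1)
d = match_scanline(left, right, row=20, x=30)
# d == 4
```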
If both the motion and the observed scene geometry are unknown and must be jointly estimated, the problem is referred to as the structure and motion problem.20 For the case of two frames, this problem can be solved as follows: (1) find matching points in the two images, (2) compute the essential matrix from the matched points, (3) extract the motion (rotation and translation c2 − c1) between the two positions from the essential matrix, and (4) perform stereo matching and reconstruction to compute the geometry of the observed scene. If an extended series of frames is available (e.g., by tracking points through a video sequence), methods for reconstruction from multiple frames have been developed for both simplified21 and full perspective22 camera models. In particular, bundle-adjustment methods23 can be applied to extract surface geometry and motion information with extremely high accuracy. This geometry can then be used for a wide variety of clinical applications. The overall process flow for QE reconstruction thus can be viewed as shown in Fig. 1.
We performed a series of video-assisted nasal endoscopies on a cadaveric porcine specimen. Images were collected using a Storz Telecam 202212113U NTSC camera attached to a 4-mm Storz zero-degree rigid Hopkins rod endoscope 7210AA (Karl Storz, Inc., Culver City, CA). The images were digitally captured at a rate of roughly 30 frames/s and stored for later processing. An Optotrak system (Northern Digital, Waterloo, ON, Canada) was used to measure and record endoscope motion during image acquisition via an optical target attached to the shaft of the endoscope. The Optotrak motion data are considered the ground truth against which the reconstructed endoscope motion was compared. Optotrak motion data were recorded with timing information to allow proper synchronization with the image data.
Images of a standard optical calibration target were also recorded before data collection. An optical calibration procedure13 was performed on these images using a publicly available Matlab toolkit (Mathworks, Inc., Natick, MA).24 The resulting calibration had an RMS error of 0.33 pixels, indicating that an accurate calibration was achieved.
The principal goal of this study was to compare the accuracy of image-based motion estimation with the Optotrak system. To perform this comparison, it is necessary to relate the pose (position and orientation) of the Optotrak target to that of the image sensor. Computing this transformation involves solving a so-called “AX = XB” problem. We implemented the method of Park and Martin25 and applied it to the Optotrak pose data acquired during the optical calibration procedure. The resulting AX = XB solution had an RMS error of 0.94° in orientation and 0.53 mm in translation.
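For reference, the Park and Martin closed form can be sketched as below. This is an illustrative reimplementation from the published formulas (R_X = (MtM)−1/2 Mt with M = Σ βi αit, followed by a stacked linear solve for the translation), not the code used in this study:

```python
import numpy as np

def skew(v):
    """sk(v): the matrix with skew(v) @ x == np.cross(v, x)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def rot_exp(w):
    """Rodrigues formula: axis-angle vector -> rotation matrix."""
    th = np.linalg.norm(w)
    if th < 1e-12:
        return np.eye(3)
    K = skew(w / th)
    return np.eye(3) + np.sin(th) * K + (1.0 - np.cos(th)) * (K @ K)

def rot_log(R):
    """Principal matrix logarithm of a rotation, as an axis-angle vector."""
    th = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if th < 1e-12:
        return np.zeros(3)
    return th / (2.0 * np.sin(th)) * np.array(
        [R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])

def park_martin(As, Bs):
    """Closed-form AX = XB solution: R_X = (M^t M)^(-1/2) M^t with
    M = sum_i beta_i alpha_i^t, where alpha_i, beta_i are the rotation
    logs of A_i, B_i; the translation follows from a linear system."""
    M = sum(np.outer(rot_log(B[:3, :3]), rot_log(A[:3, :3]))
            for A, B in zip(As, Bs))
    w, V = np.linalg.eigh(M.T @ M)
    Rx = (V @ np.diag(w ** -0.5) @ V.T) @ M.T        # (M^t M)^(-1/2) M^t
    # (R_A - I) t_X = R_X t_B - t_A, stacked over all motion pairs.
    C = np.vstack([A[:3, :3] - np.eye(3) for A in As])
    d = np.concatenate([Rx @ B[:3, 3] - A[:3, 3] for A, B in zip(As, Bs)])
    tx = np.linalg.lstsq(C, d, rcond=None)[0]
    X = np.eye(4)
    X[:3, :3], X[:3, 3] = Rx, tx
    return X
```

On exact synthetic data (B_i = X−1 A_i X), the solver recovers X to machine precision given at least two motion pairs with non-parallel rotation axes.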
We have developed a reliable method for estimating endoscope epipolar geometry using scale-invariant feature transform (SIFT) features26 and robust estimation techniques.27 The algorithm proceeds as follows. First, SIFT features are detected and matched in a given pair of images using methods reported by Lowe.26 Five feature matches are then sampled randomly and without replacement, and a candidate E matrix is computed using the five-point algorithm.17 The resulting E matrix is normalized, and the left side of Eq. 1 is evaluated on all matched pairs. The candidate is given a score that is the median of these evaluations. The final solution is the E matrix with the least median score. The advantage of this algorithm is that it uses the minimum number (five) of correct matches and can tolerate up to 50% outliers in the data.
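The least-median scoring step can be sketched as follows. Candidate E matrices are assumed to come from an external five-point solver (not implemented here), and the function names are illustrative:

```python
import numpy as np

def lmeds_score(E, p1, p2):
    """Median of the epipolar residuals |p2_i^t E p1_i| over all
    matched pairs; p1, p2 are N x 3 arrays of homogeneous points.
    The median makes the score robust to up to ~50% bad matches."""
    return np.median(np.abs(np.einsum('ij,jk,ik->i', p2, E, p1)))

def select_essential(candidates, p1, p2):
    """Least-median-of-squares selection: keep the candidate E matrix
    (e.g., produced by repeated five-point solves on random samples
    of five matches) with the smallest median residual."""
    return min(candidates, key=lambda E: lmeds_score(E, p1, p2))
```

Because the score is a median rather than a sum, a candidate fit to the correct matches wins even when a large minority of the matches are grossly wrong.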
Matched features between pairs of images can be used for 3D reconstruction. However, it is well known that more accurate reconstructions result from combining information across multiple images.12,23 We have implemented an initial version of such an algorithm.
In this algorithm, SIFT features are extracted and matched as described previously. However, as the sequence proceeds, SIFT features that match across multiple frames are tracked throughout the video sequence. Reconstruction from the tracked features proceeds as follows.
First, two frames are selected, the camera motion is estimated as described previously, and the validated matches (i.e., inliers) are obtained. An initial 3D reconstruction (up to an unknown scale) of the matched points is determined through triangulation using the estimated camera motion and the validated matches. For a new frame, matches corresponding to the tracked features are used to infer correspondences between 2D image features and their 3D reconstructed points (known from previous frames). The camera motion is then estimated from these 2D–3D matches using standard methods.28 The 3D positions of the reconstructed points are then refined using a least-squares technique applied to all frames. Newly reconstructed 3D points corresponding to new SIFT features are initialized and added to the reconstruction. In this way, a sequence of motions and an associated extended reconstruction are computed.
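The triangulation step can be illustrated with a midpoint method, one common choice (the study does not specify which triangulation variant was used, so this is a sketch under that assumption):

```python
import numpy as np

def triangulate(p1, p2, c1, R1, c2, R2):
    """Midpoint triangulation: each homogeneous image point (u, v, 1)
    defines the world-space ray c_i + lam_i * (R_i @ p_i); solve a
    small least-squares problem for lam_1, lam_2 and return the
    midpoint of the two rays' closest approach."""
    d1, d2 = R1 @ p1, R2 @ p2
    A = np.column_stack([d1, -d2])
    lam = np.linalg.lstsq(A, c2 - c1, rcond=None)[0]
    return 0.5 * ((c1 + lam[0] * d1) + (c2 + lam[1] * d2))

# With exact, noise-free matches the two rays intersect at the true point.
P = np.array([0.3, -0.2, 4.0])
c1, c2, R = np.zeros(3), np.array([1.0, 0.0, 0.0]), np.eye(3)
q1 = (P - c1) / (P - c1)[2]   # homogeneous projection in camera 1
q2 = (P - c2) / (P - c2)[2]   # homogeneous projection in camera 2
X = triangulate(q1, q2, c1, R, c2, R)
# X recovers P up to floating-point error
```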
Figure 2 illustrates the performance of the robust motion estimation algorithm. In the upper row, the initial feature matches from a pair of sinus images are shown. At the lower left, the feature matches that are judged to be inliers after the robust E matrix estimation are shown. For each inlier match in the left image, it is possible to compute the line along which the corresponding match in the right image must lie. The intersection of those lines shows the direction of motion, as depicted in the lower right image.
Figure 3 illustrates the steps of the two-frame structure reconstruction algorithm on a typical image pair. On the upper left is the image pair. On the lower right are the epipolar lines indicating, in this case, a lateral motion. The lower middle image shows the results of a dense stereo match, in which brightness encodes depth (brighter is closer). The right-most image shows a view of the 3D reconstruction results, clearly showing the curving tubular shape of the viewed cavity.
To further validate the accuracy of the image matching component, we applied the algorithm to 1373 images from the porcine video sequence. Each image was paired with a second image from the sequence taken at least 1 mm away, and the two-frame algorithm was applied. The interframe motion values from image-based motion estimation and from tracked motion were compared. Because the absolute magnitude of translation cannot be computed from images alone, the direction of translation was used as the basis of comparison.
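The two error measures used in this comparison can be written down directly (illustrative helpers, assuming rotations are given as 3×3 matrices):

```python
import numpy as np

def rotation_error_deg(R_est, R_true):
    """Angle (degrees) of the residual rotation R_est^t R_true."""
    c = (np.trace(R_est.T @ R_true) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

def translation_direction_error_deg(t_est, t_true):
    """Angle (degrees) between the two translation directions; only
    the direction is compared because the scale is unrecoverable."""
    c = np.dot(t_est, t_true) / (np.linalg.norm(t_est) * np.linalg.norm(t_true))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))
```

The clipping guards against arguments marginally outside [−1, 1] caused by floating-point roundoff.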
It was found that accurate motion information was acquired in frames for which at least 50 feature matches were present. This comprised 855 (65.8%) of the image pairs. For these pairs, the RMS error in rotation was 1.14°, with a median error of 0.42°; the RMS error in translation direction was 26.4°, with a median error of 14.0°. However, these results were quite variable. For example, the comparable figures for frames 250–350 were 0.47° RMS and 0.26° median in rotation, and 8.79° RMS and 6.14° median in translation direction. The latter are well within the expected noise floor of motion estimation using the five-point method,17 and are very reasonable candidates for further refinement.
We have also applied the multiframe tracking and reconstruction algorithm to the data. At this point, we have not performed a quantitative comparison of this algorithm. However, qualitatively, the algorithm results in a larger area of reconstructed points, and the reconstructed data have higher accuracy. Figure 4 shows one such reconstruction result.
Our results indicate that traditional endoscopic images can be computer enhanced to produce accurate geometric models of viewed structures. The proposed algorithm was able to generate reliable motion estimates on 65.8% of 1373 contiguous frame pairs and highly accurate motion estimates on 100% of a selected series of 100 frame pairs. Inspection of the resulting reconstructed points showed that the reconstructions are dense and are qualitatively correct relative to the viewed anatomy. We are now validating the accuracy of the geometric reconstruction. This will be performed by registering the reconstructed data to surfaces extracted from a CT image of the anatomy. This process also will provide a demonstration of anatomy-based navigation using video-CT registration.
The technique presented here has been optimized for tubular geometries that severely constrain camera motion, such as those typically found in the head and neck region. However, it is fully generalizable to unconstrained endoscopic procedures. As such, these methods are immediately applicable to registration for surgical navigation, sizing of anatomic structures, and 3D visualization of surgical anatomy for surgical planning or simulation. QE may also be useful for updating anatomic models during surgery.
We foresee the integration of QE into current clinical procedures to be simple and straightforward. However, it is clear from our results that performance can be enhanced by careful and appropriate image acquisition. The development of appropriate clinical protocols as well as further technical improvements to the algorithms are the next steps toward clinical application.
Supported by Grant NIH 1R21EB005201-01A1
Presented at the meeting of the American Rhinologic Society, Washington, D.C., September 15, 2007
The authors have no financial interest or other conflict of interest in this work