|Home | About | Journals | Submit | Contact Us | Français|
We realized graphics processing unit (GPU) based real-time 4D (3D + time) signal processing and visualization on a regular Fourier-domain optical coherence tomography (FD-OCT) system with a nonlinear k-space spectrometer. An ultra-high speed linear spline interpolation (LSI) method for λ-to-k spectral re-sampling is implemented in the GPU architecture, which gives average interpolation speeds of >3,000,000 line/s for 1024-pixel OCT (1024-OCT) and >1,400,000 line/s for 2048-pixel OCT (2048-OCT). The complete FD-OCT signal processing including λ-to-k spectral re-sampling, fast Fourier transform (FFT) and post-FFT processing have all been implemented on a GPU. The maximum complete A-scan processing speeds are investigated to be 680,000 line/s for 1024-OCT and 320,000 line/s for 2048-OCT, which correspond to 1GByte processing bandwidth. In our experiment, a 2048-pixel CMOS camera running up to 70 kHz is used as an acquisition device. Therefore the actual imaging speed is camera-limited to 128,000 line/s for 1024-OCT or 70,000 line/s for 2048-OCT. 3D Data sets are continuously acquired in real time at 1024-OCT mode, immediately processed and visualized as high as 10 volumes/second (12,500 A-scans/volume) by either en face slice extraction or ray-casting based volume rendering from 3D texture mapped in graphics memory. For standard FD-OCT systems, a GPU is the only additional hardware needed to realize this improvement and no optical modification is needed. This technique is highly cost-effective and can be easily integrated into most ultrahigh speed FD-OCT systems to overcome the 3D data processing and visualization bottlenecks.
The acquisition line (A-scan) speed of Fourier-domain optical coherence tomography (FD-OCT) has been advancing rapidly to >100,000 line/s level in the last few years. For a spectrometer based FD-OCT, an ultrahigh speed CMOS line scan camera based system has achieved up to 312,500 line/s ; while for a swept laser type, 370,000 line/s has been realized by using a Fourier domain mode-locking swept laser . Such ultrahigh acquisition speed enables time-resolved volumetric (4D) recording and reconstruction of dynamic processes such as eye blinking, papillary reaction to light stimulus [3,4], and embryonic heart beating [5–7]. However, parallel efforts have not been embarked on data processing and visualization at the matching speed of the acquisition. Therefore, current real-time video-rate display is generally limited to 2D (B-scan) images. The most common way dealing with a huge volumetric data (C-scan) is to “capture and save” and then perform post-processing at a later time. The post-processing of 3D data usually includes two stages, FD-OCT signal processing and volumetric visualization, both of which are heavy-duty computing task due to the huge data size. Therefore the real-time signal processing and volumetric visualization become two bottlenecks for an ultra-high speed FD-OCT system that could be a practical system for clinical applications such as surgical intervention and instrument guidance, which usually requires a real-time 4D imaging capability.
To overcome the signal processing bottleneck, several solutions have recently been proposed and demonstrated: Multi-CPU parallel processing has been implemented and achieved 80,000 line/s processing rate on nonlinear-k system  and 207,000 line/s on linear-k system for 1024-OCT ; A linear-k Fourier-domain mode-locked laser (FDML) with direct hardware frequency demodulation method enabled real-time en face image by yielding the analytic reflectance signal from one depth for each axial scan ; More recently, a graphics processing unit (GPU) has been utilized for processing FD-OCT data  using linear-k spectrometer. However, the methods in [9–11] are limited to highly-special linear-k FD-OCT systems to avoid interpolation for λ-to-k spectral re-sampling. Therefore they are not applicable to the majority of nonlinear-k FD-OCT systems. Moreover, a linear-k spectrometer is not completely linear over the whole spectrum range [11,12] so re-sampling would still be required for a wide spectrum range which is essential for achieving ultra-high axial resolution.
For the volumetric visualization issue, multiple 2D slice extraction and co-registration is the simplest approach, while volume rendering offers more comprehensive spatial view of the whole 3D data set, which is not immediately available from 2D slices. However, volume rendering such as ray-casting is usually very time-consuming for CPU. So real-time rendering for a large data volume is only available through GPU. Moreover, a complete 3D data set must be ready prior to any volumetric visualization due to the feature of FD-OCT signal processing method , which still would require a solution.
In this paper, we realized GPU based real-time 4D signal processing and visualization on a regular FD-OCT system with nonlinear k-space for the first time to the best of our knowledge. An ultra-high speed linear spline interpolation (LSI) method for λ-to-k spectral re-sampling is implemented in GPU architecture. The complete FD-OCT signal processing including interpolation, fast Fourier transform (FFT) and post-FFT processing have all been implemented on a GPU. 3D Data sets are continuously acquired in real time, immediately processed and visualized by either en face slice extraction or ray-casting based volume rendering from 3D texture mapped in graphics memory. For standard FD-OCT systems, a GPU is the only additional hardware needed to realize this improvement and no optical modification is needed. This technique is highly cost-effective and can be easily integrated into most ultrahigh speed FD-OCT systems to overcome the 3D data processing and visualization bottlenecks.
The FD-OCT system used in the test is shown in Fig. 1 . A 12-bit CMOS camera (Sprint spL2048-140k, Basler AG, Germany) with 70,000 line/s effective line rate at 2048 pixel mode works as the detector of the OCT spectrometer. A superluminescence diode (SLED) (λ0 = 825nm, Δλ = 70nm, Superlum, Ireland) was used as the light source, which gives an axial resolution of approximately 5.5µm in air /4.1µm in water. The beam scanning was implemented by a pair of high speed galvanometer mirrors driven by a dual channel function generator and synchronized with a high speed frame grabber (PCIe-1429, National Instruments, USA). To simplify the alignment issues, our OCT system was configured in a common-path mode, where the reference signal comes from the bottom surface reflection of a glass window placed in between the scanning lens and sample, while the up surface is anti-reflective coated. The common-path structure doesn’t require dispersion compensation optics while maintain a high axial resolution [13–16]. The lateral resolution is estimated to be 9.5 µm assuming Gaussian beam. An 8-core Dell T7500 workstation was used to obtain and display images, and a GPU (NVIDIA Quadro FX5800 graphics card) with 240 stream processors (1.3GHz clock rate) and 4GBytes graphics memory was used to perform OCT signal processing and 3D visualization such as en face slice extraction or volume rendering.
Figure 2 presents the CPU-GPU hybrid system architecture, where two synchronized threads were used for data acquisition, signal processing and visualization, respectively. The solid arrows describe the main data stream and the hollow arrows indicate the internal data flow of the GPU. In the acquisition thread, a certain number of raw OCT interference spectrums were sequentially grabbed from the CMOS camera, transferred into the frame grabber through camera link cable and then routed into the host computer’s memory as a whole data block. In the processing and visualization thread, the grabbed data block was transferred into the graphics card memory through the PCI Express x16 2.0 interface, and then operated for each interferogram in parallel by the 240 graphics stream processors to complete the standard processing including λ-to-k spectral remapping, fast Fourier transform, and post-FFT processing. In the post-FFT processing part, a reference volume acquired and saved in the graphics memory prior to imaging any sample was subtracted by the whole volume before logarithm scaling to remove the DC component as well as the noise and artifact caused by irregular reference signal from the reference plane. The processed 3D data set is then sent to the next stage for visualization by either direct en face slices extraction or being mapped to 3D texture allocated in the graphics memory to perform volume rendering, which will be illustrated in details in the later section. Finally the processed 2D frame is transferred back to the host memory and displayed in the graphical user interface (GUI). The GPU is programmed through NVIDIA’s Compute Unified Device Architecture (CUDA) technology . The FFT operation is implemented by the CUFFT library . Since currently there is no suitable function in CUDA library for λ-to-k spectral re-sampling, here we propose a GPU accelerated linear spline interpolation (LSI) method as following:
Start from the LSI equation:
where is the nonlinear k-space value series and is the calibrated wavelength values of the FD-OCT system. is the spectral intensity series corresponding to . is the linear k-space series covering the same frequency range as . Linear spline interpolation requires a proper interval for each , that is:
Let a series to present the lower ends for each element of , then Eq. (1) can be written as:
can be easily obtained before interpolation by comparing and . From Eq. (3), one would notice that is independent of other values in the series , therefore this LSI algorithm is highly suitable for the parallel computation. Figure 3 shows the flowchart of parallelized LSI, where the parallel loops are distributed onto the GPU’s 240 stream processors. The values of , and are all stored in graphics global memory prior to interpolation, while the and are allocated in real-timely refreshed memory blocks.
Volume rendering is a numeric simulation of the eye’s physical vision process in the real world, which provides better presentation of the entire 3D image data than the 2D slice extraction [19–21]. Ray-casting is the simplest and most straightforward method for volume rendering, shown as Fig. 4(a) . An imaging plane is defined between the observer’s eye and the data volume, and each pixel of the imaging plane is the integration along the specific eye ray through the pixel, which can be presented by the following recursive back-to-front compositing equations :
where and stands for the color and opacity values of a single voxel at the spatial position . , , and are the color and opacity values on a particular eye ray in and out of this voxel. The eye ray corresponds to a pixel position on the image plane, and voxels along the ray will be taken into account for color and opacity accumulation.
The principle of ray-casting demands heavy computing duty, so in general real-time volume rendering can only be realized by using hardware acceleration devices like GPU. Figure 4(b) illustrates the details of the interactive volume rendering portion for Fig. 2. After post-FFT processing, the 3D data set is mapped into 3D texture, a pre-allocated read-only section on the graphics memory. A certain modelview matrix is obtained through the GUI functions to determine the relative virtual position between data volume and imaging plane . Then the GPU performs ray-casting method to render the 2D frame from the 3D texture according to the modelview matrix. To insure compatibility with the NI-IMAQ Win32 API and simplify the software structure, we have developed and implemented the ray-casting function using the CUDA language and the 2D frames are finally displayed using an NI-IMAQ window.
To test the GPU’s OCT data processing ability, we processed a series of large numbers of A-scan lines in one batch. The complete processing time is recorded in millisecond from the interval between the data transfer-in (host memory to graphics memory) and data transfer-out (graphics memory to host memory), and the time for interpolation is also recorded. Here both 2048-pixel and 1024-pixel OCT modes were tested and the 1024-pixel mode was enabled by the CMOS camera’s area-of-interest (AOI) output feature. The processing time versus one-batch line number is shown as Fig. 5(a) . The corresponding processing line rate can be easily calculated and shown in Fig. 5(b). The interpolation speed averages at >3,000,000 line/s for 1024-OCT and >1,400,000 line/s for 2048-OCT. The complete processing speed goes up to 320,000 line/s for the 2048-OCT and 680,000 line/s for the 1024-OCT. This is equivalent to approximately 1GBytes/s processing bandwidth at 12 bit/pixel. Since commonly used high-end frame grabbers (i.e. PCIe-1429) has an acquisition bandwidth limit of 680MBytes/s, the GPU processing should be able to process all the OCT data in real-time. As one can see, the processing bandwidth decreases in the case of smaller A-scan batch numbers (1000~10,000) due to the GPU’s hardware acceleration feature but it is still above 140,000 line/s for 2048-pixel and above 200,000 line/s for 1024-pixel, which is adequate enough to over-speed the camera and also leaves enough time for volume rendering.
Figure 6 shows the system sensitivity roll-off at both 1024-OCT and 2048-OCT modes, where the A-scans are processed by GPU based LSI and FFT. As one can see, the background noise increases with imaging depth due to the error of linear interpolation, and this issue can be solved by a more complex zero-filling method , which will be implemented on GPU in our future work.
Then we tested the actual imaging speed by performing the real-time acquisition and display of 2-D B-scan images. The target used is an infrared sensing card, as in Fig. 7 . Each frame consists of 10,000 A-scans and we got 12.8 frame/s for 1024-OCT (minimum line period = 7.8µs) and 7.0 frame/s for 2048-OCT (minimum line period = 14.2 µs), corresponding to 128,000 and 70,000 A-scan/s respectively, which is limited by the CMOS camera’s acquisition speed.
To demonstrate the higher acquisition speed case and evaluate the possible bus and memory contention issue, for each frame the raw data transferring-in and processing were repeated for 4 times within each frame period, while achieving the same frame rate for both OCT modes. Therefore the minimum effective processing speeds of 512,000 A-scan/s for 1024-OCT and 280,000 A-scan/s for 2048-OCT can be expected. These speeds represents more than double the currently highest acquisition speed using a CMOS camera, which is 215,000 A-scan/s for 1024-OCT  and 135,000 A-scan/s for 2048-OCT .
We further tested the real-time volumetric data processing and en face image reconstruction by running the OCT at 1024-pixel mode. The line scan rate was set to 100,000 line/second for the convenience of the synchronization. A Naval orange juice sac was used as the sample. Three different volume sizes are tested: 250 × 160 × 512 voxels (40,000 A-scans/volume); 250 × 80 × 512 voxels (20,000 A-scans/volume); 125 × 80 × 512 voxels (10,000 A-scans/volume); corresponding to a volume rate of 2.5, 5 and 10 volume/second, respectively. Figure 8 shows the en face slices of approximately 1mm × 1mm region in two different depths extracted from the same volumetric data and the depth difference of about 25 µm. All the A-scans of one volume were acquired and processed as one batch and remapped for en face slice extraction. More than one en face images at different depth can be quickly reconstructed and displayed simultaneously since the complete 3D data is available. As one can see, with decreasing volume size and increasing volume rate, the image quality degenerate but the major details such as cell wall are still clear enough to be visible compared with the largest volume size slices as in Fig. 8(a) and 8(b).
Here it is necessary to compare an en face FD-OCT imaging with another en face OCT imaging technology—time-domain transverse-scanning OCT/OCM (TD-TS-OCT/OCM) which acquires only one resolution element per A-scan. A typical TD-TS-OCT/OCM system can achieve a large en face image size (250,000 pixels) at 4 frame/s , giving 1,000,000 transverse points per second. In contrast, en face FD-OCT has less transverse scan rate (typically <500,000 A-scan/s) because a whole spectrum has to be acquired for each A-scan. However, en face FD-OCT provides a complete 3D data set so multiple en face images at different depth of the volume can be extracted simultaneously, which is not available by TD-TS-OCT/OCM.
Then we implemented the real-time volume rendering of continuous acquired data volume and realized the 10 volume per second 4D FD-OCT “live” image. The acquisition line rate is set to be 125,000 line/s at 1024-OCT mode. The acquisition volume size is set to be 12,500 A-scans, providing 125(X) × 100(Y) × 512(Z) voxels after the signal processing stage, which takes less than 10 ms and leaves more than 90 ms for each volume interval at the volume rate of 10 volume/s. As noticed from Fig. 6(a), the 1024-OCT has a 10-dB roll-off depth of about 0.8mm, and the background noise also increases with the depth. Therefore the optimum volume for the rendering in the visualization stage is truncated in half from the acquisition volume to be 125(X) × 100(Y) × 256(Z) voxels excluding the DC component and the low SNR portion in each A-scan. Nevertheless, the whole volume rendering is available if a larger image range is required. The image plane is set to 512 × 512 pixels, which means a total number of 5122 = 262144 eye rays are used to accumulate though the whole rendering volume for the ray-casting process according to Eq. (4) and Eq. (5). The actual rendering time is recorded during the imaging processing to be ~3ms for half volume and ~6ms for full volume, which is much shorter than the volume interval residual (>90ms). Also for the purpose of demonstrating the higher acquisition speed case, the data transfer-in, the complete FD-OCT processing and the volume rendering of the same frame were repeated 3 times within each volume period, while still maintaining 10 volume/s real-time rendering. Therefore a minimum effective processing and visualization speeds of 375,000 A-scan/s for 1024-OCT can be expected.
First we tested the real-time visualization ability by imaging non-biological samples. Here the half volume rendering is applied and the real volume size is approximately 4mm × 4mm × 0.66mm. The dynamic scenarios are captured by free screen-recording software (BB FlashBack Express). Figure 9(a) presents the top surface of a piece of sugar-shell coated chocolate, which is moving up and down in axial direction with a manual translation stage. Here the perspective projection is used for the eye’s viewpoint , and the rendering volume frame is indicated by the white lines. As played in Media 1, Fig. 9(b) shows the situation when the target surface is truncated by the rendering volume’s boundary, the X-Y plane, where the sugar shell is virtually “peeled” and the inner structures of the chocolate core is clearly recognizable. Figure 9(c) illustrates a five-layer plastic phantom mimicking the retina, where the layers are easily distinguishable. The volume rendering frame in Fig. 9(c) is configured as “L” shape so the tapes are virtually “cut” to reveal the inside layer structures.
Then we implemented the in vivo real-time 3D imaging of a human finger tip. Figure 10(a) shows the skin and fingernail connection, the full volume rendering is applied here giving a real size of 4mm × 4mm × 1.32mm considering the large topology range of the nail connection region. The major dermatologic structures such as epidermis (E), dermis (D), nail fold (NF), nail root (NF) and nail body (N) are clearly distinguishable from Fig. 10(a). Media 2 captured the dynamic scenario of the finger’s vertical vibration due to artery pulsing when the finger is firmly pressing against the sample stage. The fingerprint is imaged and shown in Media 3 in Fig. 10(b), where the epithelium structures such as sweat duct (SD), stratum corneum (SC) can be clearly identified. Figure 10(c) offers a top-view of the fingerprint region in Media 4, where the surface is virtually peeled by the image frame and the inner sweat duct are clearly visible. The volume size for Fig. 10(b) and Fig. 10(c) is 2mm × 2mm × 0.66mm.
Finally, to make full use of the ultrahigh processing speed and the whole 3D data, we implemented multiple 2D frames real-time rendering from the same 3D data set with different model view matrix in Media 5, including side-view [Figs. 11(a, b, d, e) ], top-view [Fig. 11(c)] and bottom-view [Fig. 11(f)], where Fig. 11(a) and Fig. 11(d) are actually using the same model view matrix but the later displayed with the “L” volume rendering frame to give more information of inside. All frames are rendered within the same volume period and displayed simultaneously, thus gives more comprehensive information of the target. The two vertexes with the big red and green dot indicate the same edge for each rendering volume frame.
The processing bandwidth showed in Section 4 is much higher than most of the current FD-OCT system’s acquisition speed, which indicates a huge potential for improving the image quality and volume speed of real-time 3D FD-OCT by increasing the acquisition bandwidth. The GPU processing speed can be increased even higher by implementing a multiple-GPU architecture using more than one GPU in parallel. Therefore the bottleneck for 3D FD-OCT imaging would now lie in the acquisition speed.
For all the experiments described above, the only additional device required to implement the real-time high speed OCT data processing and display for most cases is a high-end graphics card which cost far less compared to the most optical setup and acquisition devices. The graphics card is a plug-and-play computer hardware without need for any optical modifications. And it is much simpler than adding a prism to build a linear-k spectrometer or developing a linear-k swept laser. The both are complicated to build and will change the overall physical behavior of the OCT system.
In conclusion, we realized GPU based real-time 4D signal processing and visualization on a regular FD-OCT system with nonlinear k-space for the first time to the best of our knowledge. An ultra-high speed linear spline interpolation (LSI) method for interpolation for λ-to-k spectral re-sampling is implemented in GPU architecture. The complete FD-OCT signal processing including interpolation for λ-to-k spectral re-sampling, fast Fourier transform (FFT) and post-FFT processing have all been implemented on a GPU. 3D Data sets are continuously acquired in real time, immediately processed and visualized by either en face slice extraction or ray-casting based volume rendering from 3D texture mapped in graphics memory. For standard FD-OCT systems, a GPU is the only additional hardware needed to realize this improvement and no optical modification is needed. This technique is highly cost-effective and can be easily integrated into most ultrahigh speed FD-OCT systems to overcome the 3D data processing and visualization bottlenecks.
This work was supported by National Institutes of Health (NIH) grant R21 1R21NS063131-01A1.