The system configuration is shown in Fig. 1. In the FD-OCT system section, a 12-bit dual-line CMOS line-scan camera (Sprint spL2048-140k, Basler AG, Germany) is used as the detector of the OCT spectrometer. A superluminescent diode (SLED) (λ0 = 825 nm, Δλ = 70 nm, Superlum, Ireland) is used as the light source, giving a theoretical axial resolution of 5.5 µm in air. The transverse resolution is approximately 40 µm, assuming a Gaussian beam profile. The CMOS camera is set to operate in the 1024-pixel mode by selecting the area of interest (AOI). The minimum line period is camera-limited to 7.8 µs, corresponding to a maximum line rate of 128k A-scans/s, and the exposure time is 6.5 µs. Beam scanning is implemented by a pair of high-speed galvanometer mirrors controlled by a function generator and a data acquisition (DAQ) card. The raw data are acquired using a high-speed frame grabber with a Camera Link interface. To realize the full-range complex OCT mode, a phase modulation is applied to each B-scan's 2D interferogram frame by slightly displacing the probe beam off the first galvanometer's pivoting point (only the first galvanometer is illustrated in Fig. 1) [11
Fig. 1 System configuration: CMOS, CMOS line-scan camera; G, grating; L1, L2, L3, L4, achromatic collimators; C, 50:50 broadband fiber coupler; CL, camera link cable; CTRL, galvanometer control signal; GVS, galvanometer pairs (only the first galvanometer is illustrated) ...
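The line rate quoted for the camera follows directly from its minimum line period, and the exposure time fixes the fraction of each line period spent integrating light. The short C++ sketch below reproduces that arithmetic; the function names `lineRateHz` and `exposureDutyCycle` are illustrative and are not part of the system software described here.

```cpp
#include <cassert>

// Camera timing values quoted in the text.
const double kLinePeriodUs = 7.8;  // minimum line period, microseconds
const double kExposureUs   = 6.5;  // exposure time, microseconds

// Maximum line rate (A-scans per second) for a given line period.
inline double lineRateHz(double linePeriodUs) {
    assert(linePeriodUs > 0.0);
    return 1.0e6 / linePeriodUs;
}

// Fraction of each line period during which the sensor is exposed.
inline double exposureDutyCycle(double exposureUs, double linePeriodUs) {
    assert(linePeriodUs > 0.0 && exposureUs >= 0.0 && exposureUs <= linePeriodUs);
    return exposureUs / linePeriodUs;
}
```

Calling `lineRateHz(kLinePeriodUs)` gives roughly 128,000 A-scans/s, matching the stated maximum line rate, and `exposureDutyCycle(kExposureUs, kLinePeriodUs)` gives a duty cycle of about 83%.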
A quad-core Dell T7500 workstation was used to host the frame grabber (PCIE-x4 interface), the DAQ card (PCI interface), and GPU-1 and GPU-2 (both PCIE-x16 interface), all on the same motherboard. GPU-1 (NVIDIA GeForce GTX 580), with 512 stream processors, a 1.59 GHz processor clock, and 1.5 GBytes of graphics memory, is dedicated to raw data processing of B-scan frames. GPU-2 (NVIDIA GeForce GTS 450), with 192 stream processors, a 1.76 GHz processor clock, and 1.0 GBytes of graphics memory, is dedicated to volume rendering and display of the complete C-scan data processed by GPU-1. Both GPUs are programmed through NVIDIA's Compute Unified Device Architecture (CUDA) technology [14]. The software is developed in the Microsoft Visual C++ environment with National Instruments' IMAQ Win32 APIs.
The signal processing flow of the dual-GPUs architecture is illustrated in the flow chart below, where three major threads are used for the FD-OCT raw data acquisition (Thread 1), the GPU-accelerated FD-OCT data processing (Thread 2), and the GPU-based volume rendering (Thread 3). The three threads synchronize in pipeline mode: Thread 1 triggers Thread 2 for every B-scan, and Thread 2 triggers Thread 3 for every complete C-scan, as indicated by the dashed arrows. The solid arrows describe the main data stream, and the hollow arrows indicate the internal data flow of each GPU. Since the CUDA technology currently does not support direct data transfer between GPU memories, a C-scan buffer is placed in the host memory to relay the data from GPU-1 to GPU-2.
Signal processing flow chart of the dual-GPUs architecture. Dashed arrows, thread triggering; Solid arrows, main data stream; Hollow arrows, internal data flow of the GPU. Here the graphics memory refers to global memory.
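The triggering scheme among the three threads can be sketched in standard C++ as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Channel` class, the `runPipeline` function, and the placeholder `BScan`/`CScan` types are hypothetical, the frame grabber and both GPUs are replaced by CPU stand-ins, and a blocking queue plays the role of the per-B-scan and per-C-scan triggers. The host-side `hostBuffer` corresponds to the C-scan buffer used to relay data between the two GPUs.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Blocking queue standing in for the trigger between two pipeline stages.
template <typename T>
class Channel {
public:
    void push(T item) {
        { std::lock_guard<std::mutex> lock(mutex_); queue_.push(std::move(item)); }
        ready_.notify_one();
    }
    // Blocks until an item arrives; returns false once closed and drained.
    bool pop(T& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !queue_.empty() || closed_; });
        if (queue_.empty()) return false;
        out = std::move(queue_.front());
        queue_.pop();
        return true;
    }
    void close() {
        { std::lock_guard<std::mutex> lock(mutex_); closed_ = true; }
        ready_.notify_all();
    }
private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<T> queue_;
    bool closed_ = false;
};

using BScan = std::vector<std::uint16_t>;  // raw spectra of one B-scan (placeholder)
using CScan = std::vector<float>;          // processed volume held in host memory

struct PipelineStats {
    std::size_t bscansProcessed = 0;
    std::size_t volumesRendered = 0;
};

PipelineStats runPipeline(std::size_t numVolumes, std::size_t bscansPerVolume,
                          std::size_t samplesPerBscan) {
    assert(bscansPerVolume > 0 && samplesPerBscan > 0);
    Channel<BScan> rawFrames;  // Thread 1 -> Thread 2, one item per B-scan
    Channel<CScan> volumes;    // Thread 2 -> Thread 3, one item per C-scan
    PipelineStats stats;

    // Thread 1: acquisition (stand-in for the frame grabber).
    std::thread acquisition([&] {
        for (std::size_t i = 0; i < numVolumes * bscansPerVolume; ++i)
            rawFrames.push(BScan(samplesPerBscan, 0));
        rawFrames.close();
    });

    // Thread 2: B-scan processing (stand-in for GPU-1) into the host C-scan buffer.
    std::thread processing([&] {
        CScan hostBuffer;
        std::size_t framesInVolume = 0;
        BScan frame;
        while (rawFrames.pop(frame)) {
            for (std::uint16_t s : frame) hostBuffer.push_back(static_cast<float>(s));
            ++stats.bscansProcessed;
            if (++framesInVolume == bscansPerVolume) {  // complete C-scan: trigger Thread 3
                volumes.push(std::move(hostBuffer));
                hostBuffer = CScan();
                framesInVolume = 0;
            }
        }
        volumes.close();
    });

    // Thread 3: volume rendering (stand-in for GPU-2).
    std::thread rendering([&] {
        CScan volume;
        while (volumes.pop(volume)) ++stats.volumesRendered;
    });

    acquisition.join();
    processing.join();
    rendering.join();
    return stats;
}
```

For example, `runPipeline(3, 4, 8)` processes 12 B-scans and renders 3 volumes. Because acquisition and processing never wait on rendering, B-scan processing continues while a completed volume is being rendered, which is the behavior the pipeline mode is meant to provide.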
Compared to previously reported systems, this dual-GPUs architecture assigns the signal processing and the visualization to separate GPUs, which has the following advantages:
(1) Assigning different computing tasks to different GPUs makes the entire system more stable and consistent. In the real-time 4D imaging mode, volume rendering is conducted only when a complete C-scan is ready, while B-scan frame processing runs continuously. Therefore, if the signal processing and the visualization were performed on the same GPU, the two tasks would compete for GPU resources whenever volume rendering started while B-scan processing was still in progress, which could make both tasks unstable.
(2) It is more convenient to enhance the system performance from the software engineering perspective. For example, the A-scan processing could be further accelerated and the point spread function (PSF) could be refined by improving the algorithms on GPU-1, while more complex 3D image processing tasks such as segmentation or target tracking could be added to GPU-2.
In our experiment, the B-scan size is set to 256 A-scans of 1024 pixels each. Using the GPU-based NUFFT algorithm, GPU-1 achieved a peak A-scan processing rate of 252,000 lines/s and an effective rate of 186,000 lines/s when the host-device data transfer bandwidth of the PCIE-x16 interface was taken into account; both rates are higher than the camera's acquisition line rate. The NUFFT method was effective in suppressing the side lobes of the PSF and in improving the image quality, especially when surgical tools with metallic surfaces are used. The C-scan size is set to 100 B-scans, resulting in 256 × 100 × 1024 voxels (effectively 250 × 98 × 1024 voxels after removing the edge pixels affected by the fly-back time of the galvanometers) and a volume rate of 5 volumes/s. GPU-2 takes about 8 ms to render one 2D image of 512 × 512 pixels from this 3D data set using the ray-casting algorithm [8