Home | About | Journals | Submit | Contact Us | Français |

**|**HHS Author Manuscripts**|**PMC4819977

Formats

Article sections

Authors

Related links

Technol Cancer Res Treat. Author manuscript; available in PMC 2017 April 1.

Published in final edited form as:

Published online 2015 May 3. doi: 10.1177/1533034615580694

PMCID: PMC4819977

NIHMSID: NIHMS736855

Department of Mechanical Engineering, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213

The publisher's final edited version of this article is available at Technol Cancer Res Treat

See other articles in PMC that cite the published article.

This study focuses on the implementation of an efficient numerical technique for cryosurgery simulations on a graphics processing unit (GPU) as an alternative means to accelerate run time. This study is part of an ongoing effort to develop computerized training tools for cryosurgery, with prostate cryosurgery as a developmental model. The ability to perform rapid simulations of varying test cases is critical to facilitate sound decision making associated with medical training. Consistent with clinical practice, the training tool aims at correlating the frozen region contour and the corresponding temperature field with the target region shape. The current study focuses on the feasibility of GPU-based computation using C++ accelerated massive parallelism (AMP), as one possible implementation. Benchmark results on a variety of computation platforms display between three-fold acceleration (laptop) and thirteen-fold acceleration (gaming computer) of cryosurgery simulation, in comparison with the more common implementation on a multi-core central processing unit (CPU). While the general concept of GPU-based simulations is not new, its application to phase-change problems, combined with the unique requirements for cryosurgery optimization, represents the core contribution of the current study.

Simulation of heat transfer problems involving phase change is of great importance in the study of thermal injury to biological systems (1,2). One important application is cryosurgery, which uses controlled freezing to destroy cancerous tumors (3). Modern cryosurgery is frequently performed as a minimally invasive procedure, with the application of a large number of cooling probes (cryoprobes), in the shape of long hypodermic needles, strategically inserted into the area to be destroyed (the target region) (4). The minimally invasive nature of the cryoprocedure presents unique challenges associated with the appropriate selection of cooling technology (e.g., nitrogen boiling versus Joule-Thomson), the selection of the most suitable cryoprobe dimensions, the optimization of the number of cryoprobes and their layout, and the application of the most effective thermal history (5–8). Optimal selection of the cryosurgery parameters can reduce the cost of the medical treatment, improve its quality, and minimize post-treatment complications. Furthermore, methodologies for how to select the optimal cryosurgery parameters affect the quality and duration of clinical training. Rapid computer simulations of bioheat transfer can help alleviate these difficulties, both in clinical training and during the planning phase of a specific procedure.

Techniques for computer-assisted planning and training essentially rely on decision making based on an array of solutions to closely related problems, in a process similar to classical optimization methods (9). Methodologies for defining the corresponding set of problems belong to the general area of optimization, where the objective function for optimization may vary among clinicians based on accepted criteria for cryosurgery success and hardware capabilities. In practice, the most intensively monitored parameter during minimally invasive cryosurgery is the freezing-front location with the aid of medical imaging. While the cryosurgeons may aim at reaching specific temperatures within the target region, such as the so-called *lethal temperature*, the clinical practice revolves around matching the freezing front shape with the shape of the target region. With cryosurgery optimization in mind, this practice has led to the development of the so-called *defect region*—a quality measure of the geometrical match between a planning isotherm and the shape of the target region (10). While the defect region is not the only possible measure to guide cryosurgery planning and training, it is perceived as the most intuitive by the authors of this study, closely following current clinical practice (11).

In order to be clinically relevant for planning and training, cryosurgery simulations must be rapidly performed—ideally in a matter of seconds. An efficient numerical scheme and parallel computation appear to be key for the development of computation tools for cryosurgery (12). Although originally designed for image rendering, the GPU has emerged as a competitive platform for computing massively parallel problems for non-graphical applications (13,14). While GPU-based computation appears promising for the application of cryosurgery simulations, its interactive application for planning and training may come with difficulties inherent to GPU-based software architecture. As a part of an ongoing effort to develop computerized planning and training tools for cryosurgery, the objective of the current study is to develop a proof-of-concept for GPU-based cryosurgery simulations.

In effort to place a realistic runtime target for the current proof-of-concept development, it is assumed by the authors that a complete surgical planning of under two minutes is required. This timeframe bears negligible impact on the duration of the cryosurgical procedure, being much shorter than the cryoprobes insertion phase. Furthermore, a two minutes timeframe correlates very well with parallel educational objectives during computerized training (15,16). While using bubble-packing planning (17) may take only a couple of seconds and execution of one heat transfer simulation, the more accurate force-field analogy method may require up to 45 consecutive heat transfer simulations to reach an optimal layout (6). Using the recently developed finite difference (FD) numerical scheme (18) and CPU-based computerized framework (12), each simulation requires approximately 30 seconds to complete on a high-end machine, leading to an overall planning time over 20 minutes. It follows that the current state-of-the-art performance is at least an order of magnitude slower than desired, which sets the target for the current study.

Bioheat transfer in this study is modeled with the classic bioheat equation (19):

$C\frac{\partial T}{\partial t}=\nabla \xb7\left(k\nabla T\right)+{\dot{w}}_{b}{C}_{b}\left({T}_{b}-T\right)+{\dot{q}}_{\mathit{\text{met}}}$

(1)

where *C* is the volumetric specific heat of tissue, *T* the temperature, *t* the time, *k* the thermal conductivity of the tissue,
${\dot{w}}_{b}$ the blood perfusion rate (ml_{blood}/ml_{tissue} per second), *C _{b}* the specific heat of blood,

Representative thermophysical properties of biological tissues used in the current study, where *T* is given in degrees Kelvin

The validity and mathematical consistency of the classical bioheat equation has drawn a lot of attention in the literature of the past six decades, with greater interest around the 1970s and 1980s (1,2). Despite the above controversy, the classical bioheat equation is deemed adequate for the current study, whereas more sophisticated mathematical models may not warrant higher accuracy for cryosurgery computation but will involve greater mathematical complications (20,21). Typical to bioheat transfer simulations of cryosurgery, the heating effect resulting from blood perfusion is far more significant than the metabolic heat generation, with the latter neglected in the current study (21).

Further note that the dependency of blood perfusion on temperature may represent a more complex behavior than the step-like change described in Table 1. While the specific behavior may be unknown for the case of prostate freezing, Rabin and Shitzer (21) have already demonstrated that the overall effect of blood perfusion on the size of the frozen region is measured in only a few percent, for the special case of an extremely high blood perfusion rate but in the absence of major blood vessels. Given these prior findings, the simplistic temperature-dependent blood perfusion behavior in the current study is assumed adequate to predict the frozen region size, which is the monitored parameter via medical imaging during prostate cryosurgery.

An efficient numerical scheme to solve Eq. (1) has been published previously (18), and is presented here in brief for the completeness of presentation. The numerical scheme is based on a finite differences (FD) approach, using a variable grid size and grid-dependent time intervals, which is well suited for parallel computation (12). The variable grid size is used in order to reduce the computational cost, where a fine grid is only necessary in regions with steep thermal gradients, such as those around the urethral warmer and cryoprobes, while a coarse grid is used in the rest of the domain. Since the time interval to ensure stability is proportional to the typical grid size to the second power, an area with a coarse grid permits a much larger time step.

The characteristic FD equation for solving Eq. (1) is (21):

${T}_{i,j,k}^{p+1}=\frac{\mathrm{\Delta}t}{\mathrm{\Delta}{V}_{i,j,k}\left[{C}_{i,j,k}+{\left({\dot{w}}_{b}{C}_{b}\right)}_{i,j,k}\mathrm{\Delta}t\right]}\sum _{l,m,n}\frac{{T}_{l,m,n}^{p}-{T}_{i,j,k}^{p}}{{R}_{l,m,n-i,j,k}}+\frac{\mathrm{\Delta}t\left[{\left({\dot{w}}_{b}{C}_{b}\right)}_{i,j,k}{T}_{b}+{\left({\dot{q}}_{\mathit{\text{met}}}\right)}_{i,j,k}+{q}_{\mathit{\text{input}}}/\mathrm{\Delta}V\right]+{C}_{i,j,k}{T}_{i,j,k}^{p}}{{C}_{i,j,k}+{\left({\dot{w}}_{b}{C}_{b}\right)}_{i,j,k}\mathrm{\Delta}t}$

(2)

where *i, j, k* are the indices of the grid point under consideration, the indices *l*, *m*, *n* represent its neighboring grid points, *p* is the time level, *V* is the element volume associated with the grid point under consideration, and *R* is the thermal resistance to heat transfer by conduction between grid point *i*, *j*, *k* and its neighbor *l*, *m*, *n*. The thermal resistance to heat transfer by conduction in a Cartesian geometry is given by:

${R}_{l,\phantom{\rule{0.2em}{0ex}}m,\phantom{\rule{0.2em}{0ex}}n-i,\phantom{\rule{0.2em}{0ex}}j,\phantom{\rule{0.2em}{0ex}}k}={\left[\frac{L}{2kA}\right]}_{l,\phantom{\rule{0.2em}{0ex}}m,\phantom{\rule{0.2em}{0ex}}n}+{\left[\frac{L}{2kA}\right]}_{i,\phantom{\rule{0.2em}{0ex}}j,\phantom{\rule{0.2em}{0ex}}k}$

(3)

where *L* is the space interval in the direction of heat flow, and *A* is the representative cross-sectional area perpendicular to the same direction.

Figure 1 schematically illustrates a 2D domain, representative of a cross-section during prostate cryosurgery—the focus of the current line of research. For demonstration purposes, Fig. 1 includes two grid sizes: a general coarse grid and a fine grid around the cryoprobes and the urethra; the urethra runs through the prostate and is warmed by a special heater during cryosurgery. Figure 1 also illustrates the thermal resistance network around a fine grid point and at a transition area between fine and coarse grid. Figure 1 presents a coarse-to-fine grid ratio of 1:3, which corresponds to a grid point ratio of 1:9 in 2D. These ratios illustrate the potential run time reduction by varying grid size. Consistent with (18), the simulated domain is assumed much bigger than the frozen region, such that the boundary condition represent constant temperature throughout the simulation—equal to the core-body temperature.

The stability criterion for applying Eq. (2) is given by (18):

$\mathrm{\Delta}t\le \left[\frac{{\left(C\mathrm{\Delta}V\right)}_{i,j,k}}{{\sum}_{l,m,n}1/{R}_{l,m,n-i,j,k}}\right]$

(4)

In the current study, the maximum allowable time interval for the finest grid, *Δt _{fine}*, is evaluated first, to be used in the finest grid regions. Next, the maximum allowable time interval for the coarse grid is calculated,

The numerical scheme presented above was selected as a choice of practice, which has already proven to work well for parallelization of cryosurgery simulations (12). Other FD techniques may also serve as good candidates for parallel computation of cryosurgery, such as the classical alternating direction implicit (ADI) method (22). For example, Zeng and co-workers (23) have demonstrated that ADI can be used for parallel computation of cryosurgery, while using the same physical properties compiled by Rabin and Shitzer from previous publications (24), and by applying the same enthalpy approach proposed by Rabin and Korin for phase change processes (25) and adopted for cryosurgery by Rabin and Shitzer (21). While the ADI may be used for parallel computation, we believe that it represents an inferior match for GPU implementation due to the required computation framework for the inherently serial tri-diagonal solver. The number of intermediate steps that an ADI solver needs to take in order to solve the tri-diagonal matrix for each simulated time interval is proportional to $2{\text{log}}_{2}n-1$, where n is the number of nodes in the simulation (26), while the currently applied solution only requires a single step to calculate the new temperature field at the end of a time interval (12,25).

When the ADI is based on an implicit and unconditionally stable solution, such as the modified Crank–Nicolson method (22), each time step can be much extended, which represents an advantage. However, the implicit nature of the ADI strategy calls for additional measures to ensure the convergence of the solution and also energy conservation of the numerical scheme (20), which would heavily tax any GPU-based implementation. While the explicit scheme used in the current study requires relatively short time steps, its stability is a sufficient condition for convergence and its small time step guarantees energy conservation (25), a combination which makes it very suitable for the parallel nature of GPU implementation.

As presented in the introduction, the completion of a cryoprocedure is assumed when the best match between a planning isotherm and the target region is obtained, where this match is evaluated by the defect region value (10):

$G=\frac{1}{{V}_{t}}{\int}_{{V}_{s}}u\phantom{\rule{0.2em}{0ex}}d{V}_{s}\phantom{\rule{0.2em}{0ex}};\phantom{\rule{0.2em}{0ex}}u=\{\begin{array}{c}\begin{array}{c}1\phantom{\rule{0.2em}{0ex}}{T}_{p}<T\\ 0\phantom{\rule{0.2em}{0ex}}T\le {T}_{p}\end{array}\}\mathit{\text{interior}}\phantom{\rule{0.2em}{0ex}}\mathit{\text{to}}\phantom{\rule{0.2em}{0ex}}\mathit{\text{the}}\phantom{\rule{0.2em}{0ex}}\mathit{\text{target}}\phantom{\rule{0.2em}{0ex}}\mathit{\text{region}}\\ \begin{array}{c}1\phantom{\rule{0.2em}{0ex}}T\le {T}_{p}\\ 0\phantom{\rule{0.2em}{0ex}}{T}_{p}<T\end{array}\}\mathit{\text{exterior}}\phantom{\rule{0.2em}{0ex}}\mathit{\text{to}}\phantom{\rule{0.2em}{0ex}}\mathit{\text{the}}\phantom{\rule{0.2em}{0ex}}\mathit{\text{target}}\phantom{\rule{0.2em}{0ex}}\mathit{\text{region}}\end{array}$

(5)

where *V _{t}* is the volume of the target region,

This study focuses on prostate cryosurgery as a developmental model for GPU-based simulations. In the absence of available imaging data to reconstruct the transient frozen region in a clinical setup, results of this study are validated against an established FD numerical scheme (18), which in turn has been validated against experimental results on a bench-top setup (28). In essence, this two-phase validation is particularly useful to study runtime acceleration, when uncertainties in experimental data do not propagate into computer simulations and affect benchmarking results (29). The current study employs a bubble-packing method as a technique to identify the optimal cryoprobe layout (6), which has been validated against experimental data independently (5). The FD implementation scheme has been previously optimized for parallel computation on multicore CPU machines (12), which represents an appropriate benchmark for GPU-based implementation in the current study.

GPU implementation and optimization have been conducted in an iterative manner. Below, the key elements of GPU implementation are presented first, the key optimization steps and their relative impact on run-time acceleration are presented next, and benchmarking results against an optimized CPU implementation on selected machines are presented last. Implementation results, optimization analyses, and benchmarking in this study are based on a cancerous prostate model created previously, using a urethral warmer and 12 cryoprobes at a uniform insertion depth, with optimal layout of the probes generated by bubble packing (17). Figure 2(a) displays typical results obtained with the simulation code (12) at the point of minimum defect, presented at the largest prostate cross section. The compiler C++ AMP was selected for GPU implementation as a choice of practice, based on its user friendliness as well as its advantage of future graphics interoperability with DirectX.

A typical simulated temperature field at the largest prostate cross section, subject to the operation of a urethral warmer and 12 cryoprobes: (top) temperature field results obtained with a CPU-based parallel computation, using double precision; (bottom) **...**

Conceptually, the current study represents an advanced-stage development, where validation with experimental data is established in two consecutive steps. The first step has been completed in previous studies, where the applied FD numerical scheme (18) has been validated against experimental results (28). That previous study (28) demonstrated mismatch between the predicted freezing front location and the simulated freezing front location of less than 4.8%, with an average mismatch of 2.9% (n=24). Note that the freezing front location is the monitored parameter during minimally invasive cryosurgery. That 2.9% mismatch has been translated to 0.3 mm of disagreement in the actual freezing front location (28), which is much smaller than the resolution in medical imaging for minimally invasive cryosurgery (in the range of 1 to 2 mm for ultrasound). The second step of comparison with experimental data is presented here, where any new computational development in the current study is validated against the established FD scheme and CPU-based implementation. For example, Fig. 2(b) displays a mismatch of up to 0.4°C between GPU-based and CPU-based simulations of the same case, with a negligible difference in the freezing front location (less than 0.01 mm, which has no clinical meaning). A wider discussion on the propagation of uncertainty in measurements into cryosurgery simulations has been presented by Rabin (29).

With reference to Fig. 3, the cryosurgical simulation is composed of a main loop that marches in time subject to six key operations: (i) field-properties update, where the boundary conditions and material properties are recalculated during each computation cycle, based on the corresponding temperature field; (ii) fine-grid temperature calculations subject to small time step intervals (calculated every loop cycle); (iii) coarse-grid temperature calculations subject to a corresponding larger time step intervals (in practice calculated only every *m* loop cycles, where *m* is the ratio of time step intervals); (iv) interpolation of the irregular simulation grid into a regular fine grid in preparation for the next step; (v) calculation of the defect field and its integrated value; and, (vi) periodic data transfer from the GPU to the host machine for logging and further thermal analyses. Only operations (i) and (ii) are executed every time main loop cycle, where the other operations are performed periodically, based on numerical stability criteria and operational parameters. For this purpose, special tests are scheduled to indicate whether the execution of operation (iii) through (vi) is due in the current time step.

The GPU simulation algorithm stores temperature and thermal conductivity data for each node in the simulation. Other thermal properties such as heat generation and specific heat are calculated ad-hoc based on the nodal temperature. In addition, two matrices are used to store the nodal connectivity and thermal resistance data. The iterative optimization process in this study starts with a GPU implementation base-case, designated as *naïve*, and then progresses through various optimizations stages to improve runtime performance.

Since most of the simulation algorithm can be modeled as a sparse-matrix vector operation (SPMV), the naïve implementation is based on a compressed sparse row (CSR) matrix format for both the matrices of nodal connectivity, thermal resistance, and interpolation. The access pattern for both of these matrices follows the scalar SPMV algorithm (30), where each thread is responsible for a row in the matrix-vector multiplication operation. In order to calculate the defect value, a reduction operation is applied to the sum of defect values from all interpolation nodes. A cascade algorithm (31) is used for the reduction, where each thread sums a fixed number of elements from the interpolation grid, each processor group sums up an intermediate defect value from all of its individual threads, and finally each processor group copies its partial sum to the CPU where the final reduction occurs. Only one of the above steps were run at a time in the naïve implementation.

Optimization of the GPU-based simulation scheme progressed based on a prioritized list of expected improvements in terms of runtime, with the outcome displayed in Fig. 4:

Typical runtime results of successive optimization steps for the GPU implementation, where Naïve represents the non-optimized base-case GPU implementation.

Converting from double-precision in the naïve implementation to float-precision for all applicable parameters dramatically reduced runtime by 44%. No significant degradation of the quality of calculated temperature field (Fig. 2(b)), the defect field, or the time to minimum defect was identified due to the decrease in numerical precision. As can be seen from Fig. 2(b), the maximum mismatch between the float precision and the CPU-based implementation is 0.4°C, while being practically negligible in the great majority of the field. By reducing to one half the amount of data transferred from the global memory, this change speeds up all of the steps in the heat transfer simulation. Most significantly, this change speeds up the defect value calculation, which is the most global memory-intensive process. While no appreciable degradation in the computed temperature field is observed due to the conversion to float-precision in the current study, order-of-magnitude reduction in time or space intervals may call for revaluation of the temperature field quality.

Since file I/O can be performed in a separate thread on the CPU, in parallel to GPU operations, asynchronous file operations were tested in the next step of optimization, shifting this costly step to the background. Asynchronous I/O saved additional 14% in terms of run time with reference to the float-precision implementation.

Defect calculations are based on a tri-linear interpolation scheme, where each interpolated grid point is calculated based on eight neighboring nodes. This allows an ELLPACK (ELL) matrix representation to be used instead of CSR. The ELLPACK matrix more efficiently represents the sparse matrix, as it saves only the non-zero entries (eight in this case). Anytime that the ELL representation is used to store a matrix on the GPU, the non-zero entries are arranged so that the first non-zero entry in each row are laid out sequentially in the ELL array, with the pattern continuing for each additional non-zero per row. This layout significantly improves global memory coherency. In the third step of optimization, temperature interpolations and defect calculations were further combined into a single kernel, dramatically reducing the number of global memory operations. Unfortunately, the current optimization effort did not significantly shorten runtime (Fig. 4) since the current effort covered only a small portion of the total time spent within this kernel. In practice, the majority of the computation time is devoted the scheme of defect calculations, which could not be further accelerated. Interestingly, due to the asynchronous I/O operations, it was also observed that modifying this kernel had an adverse effect on the runtime of the field property update kernel.

Texture memory is a specially mapped location of global memory, which allows faster lookup and writing times. To take advantage of this special GPU feature, reused data for the execution of the main program loop such as temperature and thermal properties, was stored in texture memory. Applying texture memory directions reduced the runtime of all kernels by 28% compared to the previous optimization stage, and overall reduction of 12% compared with the naïve implementation. The only difficulty with using texture memory with C++ AMP is in the need for 2D texture structure in order to contain the large data structures typical to cryosurgery simulations.

While any node within a uniform grid, either fine or coarse, has connections with six neighboring nodes, more connections may develop at the interface between the fine and coarse regions. To calculate this connection network more effective, a hybrid matrix representation was formulated as follows. The six connections within an uniform grid areas are stored in the memory efficient ELLPACK matrix representation. For interface nodes (more than six neighbors), these connections are stored in a permutated CSR (PCSR) format, with rows ordered in descending order based on the number of non-zero elements. This hybrid representation is similar to the one presented in (30), except for the non-zero entries that do not fit into the ELLPACK representation, which in turn are saved in a PCSR format rather than the coordinate (COO) format suggested in (30). In general, the COO format saves a row number, a column number, and a value for each non-zero entry. The PCSR representation has less memory coherency than the COO format when performing sparse matrix vector-like operations. However, for the application of cryosurgery, the PCSR format significantly reduces the amount of computation compared to the COO format, since calculations of the temperature-dependent terms in the update step only need to occur once per row rather than once per non-zero element. The PCSR format does, however decrease code divergence within a processor group compared to the original CSR format, by processing rows with a similar number of non-zeroes within one group. The real acceleration due to the improved connectivity comes about by storing the majority of the connections in the network in the more efficient ELLPACK representation. Improved network connectivity as described above reduced runtime by 10% compared with the naïve implementation.

With all the above five steps of optimization in mind, the optimized GPU-based code was benchmarked against a CPU-based parallel computation code (12) on the selected computation platforms as listed in Table 2, where the results are displayed in Fig. 5. These machines were selected to represent the typical performance of a midrange laptop, a high-end home PC, a gaming PC, and a workstation PC. All heat transfer simulations were run for 201 seconds of simulated runtime, which coincides with the point at which the minimum defect was identified. The grid size used in this study is 1 mm × 1 mm × 1 mm for the fine grid and 3 mm × 3 mm × 1 mm for the course grid, resulting in a total of about 80,000 grid points in the entire 3D domain. These grid sizes were chosen in a previous study in order to reduce computational cost (18).

In all tested cases, the GPU-based implementation was able to significantly outperform the optimized CPU-based implementation. Simulation speed increase ranged from 3× on the laptop and up to 13× on the gaming PC, which housed the most powerful GPU. Note that the CPU-based computation has already been optimized for multi-processors (12), where the implementation of the applied FD scheme is trivial using parallel *for* loops with OpenMP. In the pre-optimized CPU implementation (12) every node was represented as an object that contained all of the material properties and connectivity information with the surrounding nodes. Optimization in that study focused on data restructuring in order to reduce the system memory load, while targeting the more commonly used properties such as conductivity, volume increments, and specific heat. For example, with reference to the pre-optimized serial CPU implementation, the optimized and parallelized CPU implementation performed 3.9× faster using Hyper-Threading on all 4 cores of an Intel Core i7 960 (12). Combining the results of the current study with those previous optimization measures, the optimized GPU-based implementation performs about 50× faster than the serial CPU simulation (3.9× in the previous study times 13× in the current study). Figure 2(b) displays a very good agreement between the two simulation approaches.

Parametric investigation using Nvidia GeForce GTX 580 (512 GPU cores) reveals that simulation runtime is linearly proportional to the number of numerical nodes for larger problems, such as the one used in the current study for benchmarking (about 80,000 nodes). This linear relationship holds as long as the problem is big enough to effectively use the entire GPU memory bandwidth. For example, a 40,000 nodes problem of similar parameters would run twice as fast, but the relative performance illustrated in Fig. 5 would remain unaffected. When additional computational work is called while memory fetches occur, much of it can be fully hidden (31). However, there is a problem-size threshold below which the computation performance starts to degrade, as the large number of cores becomes underutilized during memory fetches. This occurs when the problem becomes smaller than about 30,000 numerical nodes for the current numerical scheme and the above computation unit. Given the internal GPU architecture, the lower boundary for such a linear behavior cannot be predicted a priori, and the 30,000 nodes limit was found experimentally in the current study. Similarly, the upper boundary for this linear behavior must be found experimentally, where the higher performance GPU will exhibit a wider linearity range.

When evaluating the scalability of the GPU implementation of the current algorithm, three aspects should be bared in mind: (i) Three grids are simultaneously used during calculations (small and large grids of simulation, and an interpolation grid for defect evaluation), in which kernels are executed on; each numerical grid affect the computation time of a different set of kernels. (ii) The number of interface nodes between the small and the large simulation grids affects the computation time, as each large-grid node at the interface has more than the usual six neighbors. These interface grid nodes call for additional hardware operations. (iii) In the context of clinical practice, the geometrical variation in problem size is expected be relatively small, as the typical candidate for cryosurgery has a prostate volume in the range of 35 ml to 50 ml. Hence, the comparative results presented above are quite representative and expected to only vary with hardware performance, rather than with the size of the target region.

In terms of actual runtime for a clinical application, a full-scale GPU-based simulation of cryosurgery can already be achieved in less than 2 seconds (Fig. 5). Such a short time can be considered instantaneous for the purpose of providing an early prediction of the cryoprocedures outcome when compared with the actual cryosurgical procedure, which is measured in several minutes or longer. This capability of rapid simulations advances the discussion about cryosurgery computation from merely analyzing a special case to analyzing a battery of cases in effort to improve decision making, which would be an integral part of any computerized planning (6,18,32), computerized training (12,16), or an intelligent approach to control of the procedure. Such a battery of cases may contain a varying set of problems in the search for an optimum, or a set of “what-if” scenarios for the benefit of training and medical education.

The current trend of increasing the number of cores on modern GPU platforms, combined with the threshold on the effective size of numerical problem, suggest that the effectiveness of GPU technology development may get to the point of diminishing returns, where the GPU may become “too big” for the mathematical problem. However, this would only be true for serial execution of a large array of simulations (each of which is parallelized for GPU implementation). Hence, computer-based decision making associated with advanced GPU systems does not only call for rapid computation techniques but also for advanced decision-making frameworks, where many simulations are concurrently executed on the same system. Consistently, it is suggested that the next generation of computational cryosurgery research should focus on new ways to utilize the rapid advancing technology of GPU-based systems.

As a part of our ongoing program to develop a computerized training tool for cryosurgery, the current study focuses on decreasing the time necessary to simulate a cryosurgery procedure. For this purpose, a previously developed FD algorithm was implemented on a GPU machine, using C++ AMP. The FD algorithm was chosen for this study due to its demonstrated superiority compared with a finite elements commercial code in a previous study (12). C++ AMP was chosen for the current study as a choice of practice.

This study discusses several aspects in optimizing GPU-based cryosurgery simulations, with the defect calculations as a key element in computation. Most significant acceleration was achieved by switching to float-precision, implementing asynchronous I/O, saving frequently used data in texture memory, and the integration of a hybrid ELLPACK matrix structure to store the nodal connectivity grid. Results of this study demonstrate runtime acceleration of up to 13× compared with multicore-CPU processing, up to 50× compared with sequential CPU processing, up to about 90× compared with actual cryosurgery operation runtime, and an actual runtime of about 2 seconds for a single procedure. The GPU-based implementation reproduced previously simulated cryosurgery cases on a multicore CPU-based implementation within 0.4°C margins, where results of that earlier implementation have been validated experimentally (28). Additional modifications, like the reduction of stair-casing artifacts (33) may further be applied to the proposed GPU implementation to improve accuracy and tailor the scheme to other applications. Further performance improvements to the proposed algorithm could include modification of the storage scheme for the resistance matrix, in order to better use the processors-shared memory cache (30). While advanced optimization techniques have been proposed for finite-difference methods using regular grids, where calculations are executed to reduce the processor’s global memory load (34), their development and benchmarking for the type of irregular grid used in the current study is left for future work. This level of performance makes GPU-based computation ideal for computerized planning and training of cryosurgery.

This study is a part of an ongoing effort to develop training and planning tools for cryosurgery, with emphasis on prostate cryosurgery. Prostate cryosurgery presents a unique challenge where many cryoprobes are operated simultaneously, which demands high computation performance for real-time simulations. A previous study has characterized the evolution of prostate cancer as it pertains to geometric deformations (16), while another study addressed reconstruction of organ models from clinical data (35). With the current computation capabilities in mind, future studies can now expand on additional cryosurgical applications, when the geometries of the cancer tumors or the cancerous organ become available. These capabilities are particularly useful for cases of multiple cryoprobes and complex geometries.

This study was supported by Award Number R01CA134261 from the National Cancer Institute. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Cancer Institute or the National Institutes of Health. The authors declare no conflict of interest in the presentation of the current manuscript.

**Conflict of Interest**

All authors certify that this manuscript has not been published in whole or in part nor is it being considered for publication elsewhere. The authors have no conflicts of interest to declare.

1. Diller KR. In: Advances in Heat Transfer. Young I Cho., editor. Vol. 22. Elsevier; 1992. pp. 157–357.

2. Charny CK. In: Advances in Heat Transfer. Young I Cho., editor. Vol. 22. Elsevier; 1992. pp. 19–155.

3. Cooper IS, Lee AS. Cryostatic congelation: a system for producing a limited, controlled region of cooling or freezing of biologic tissues. The Journal of nervous and mental disease. 1961;133:259–263. doi: 10.1097/00005053-196109000-00013. [PubMed] [Cross Ref]

4. Gage AA. Cryosurgery in the treatment of cancer. Surgery, gynecology & obstetrics. 1992;174:73–92. [PubMed]

5. Rossi MR, Tanaka D, Shimada K, Rabin Y. Computerized Planning of Cryosurgery Using Bubble Packing: An Experimental Validation on a Phantom Material. International journal of heat and mass transfer. 2008;51:5671–5678. doi: 10.1016/j.ijheatmasstransfer.2008.04.045. [PMC free article] [PubMed] [Cross Ref]

6. Tanaka D, Shimada K, Rabin Y. Two-Phase Computerized Planning of Cryosurgery Using Bubble-Packing and Force-Field Analogy. Journal of Biomechanical Engineering. 2005;128:49–58. doi: 10.1115/1.2136166. [PMC free article] [PubMed] [Cross Ref]

7. Tanaka D, Shimada K, Rossi MR, Rabin Y. Computerized planning of prostate cryosurgery with pullback operation. Computer aided surgery: official journal of the International Society for Computer Aided Surgery. 2008;13:1–13. doi: 10.3109/10929080701882556. [PubMed] [Cross Ref]

8. Tanaka D, Shimada K, Rossi MR, Rabin Y. Towards intra-operative computerized planning of prostate cryosurgery. The international journal of medical robotics + computer assisted surgery: MRCAS. 2007;3:10–19. doi: 10.1002/rcs.124. [PMC free article] [PubMed] [Cross Ref]

9. Sehrawat A, Keelan R, Shimada K, Wilfong DM, McCormick JT, Rabin Y. Simulation-based cryosurgery intelligent tutoring system (ITS) prototype. 2014 submitted. [PMC free article] [PubMed]

10. Lung DC, Stahovich TF, Rabin Y. Computerized planning for multiprobe cryosurgery using a force-field analogy. Computer methods in biomechanics and biomedical engineering. 2004;7:101–110. doi: 10.1080/10255840410001689376. [PubMed] [Cross Ref]

11. Onik GM, Cohen JK, Reyes GD, Rubinsky B, Chang Z, Baust J. Transrectal ultrasound-guided percutaneous radical cryosurgical ablation of the prostate. Cancer. 1993;72:1291–1299. [PubMed]

12. Keelan R, Yamakawa S, Shimada K, Rabin Y. Computerized training of cryosurgery - a system approach. Cryo letters. 2013;34:324–337. [PMC free article] [PubMed]

13. Pratx G, Xing L. GPU computing in medical physics: a review. Medical physics. 2011;38:2685–2697. [PubMed]

14. Brodtkorb AR, Hagen TR, Sætra ML. Graphics processing unit (GPU) programming strategies and trends in GPU computing. Journal of Parallel and Distributed Computing. 2013;73:4–13. http://dx.doi.org/10.1016/j.jpdc.2012.04.003.

15. Rabin Y, Shimada K, Keelan R, Sehrawat A, Zhang H. ACCryo 2014 – Annual Meeting of the American College of Cryosurgery and the Society for Cryobiology

16. Sehrawat A, Keelan R, Shimada K, Rabin Y. 098 Developing an Intelligent tutoring system for prostate cryosurgery. Cryobiology. 2013;67:425–426. http://dx.doi.org/10.1016/j.cryobiol.2013.09.104.

17. Tanaka D, Shimada K, Rossi MR, Rabin Y. Cryosurgery planning using bubble packing in 3D. Computer methods in biomechanics and biomedical engineering. 2008;11:113–121. doi: 10.1080/10255840802297069. [PMC free article] [PubMed] [Cross Ref]

18. Rossi MR, Tanaka D, Shimada K, Rabin Y. An efficient numerical technique for bioheat simulations and its application to computerized cryosurgery planning. Computer methods and programs in biomedicine. 2007;85:41–50. doi: 10.1016/j.cmpb.2006.09.014. [PMC free article] [PubMed] [Cross Ref]

19. Pennes HH. Analysis of tissue and arterial blood temperatures in the resting human forearm. Journal of applied physiology. 1998;85:5–34. 1948. [PubMed]

20. Rabin Y, Shitzer A. Combined solution of the inverse Stefan problem for successive freezing/thawing in nonideal biological tissues. J Biomech Eng. 1997;119:146–152. doi: 10.1115/1.2796073. [PubMed] [Cross Ref]

21. Rabin Y, Shitzer A. Numerical solution of the multidimensional freezing problem during cryosurgery. J Biomech Eng. 1998;120:32–37. doi: 10.1115/1.2834304. [PubMed] [Cross Ref]

22. Carnahan B, Luther HA, Wilkes JO. Applied numerical methods. Wiley; 1969.

23. Zeng P, Deng ZS, Liu J. Parallel Algorithm for Solving 3-D Freezing Problems in Biological Tissues during Cryosurgery. Applied Mechanics and Materials 195–196. 2012:1131–1136. doi:10.4028. www.scientific.net/AMM.195-196.1131.

24. Rabin Y, Shitzer A. Exact Solution to the One-Dimensional Inverse-Stefan Problem in Nonideal Biological Tissues. Journal of Heat Transfer. 1995;117:425–431. doi: 10.1115/1.2822539. [Cross Ref]

25. Rabin Y, Korin E. An efficient numerical solution for the multidimensional solidification (or melting) problem using a microcomputer. International journal of heat and mass transfer. 1993;36:673–683. doi: 10.1016/0017-9310(93)80043-t. [Cross Ref]

26. Zhang Y, Cohen J, Owens JD. Fast tridiagonal solvers on the GPU. SIGPLAN Not. 2010;45:127–136. doi: 10.1145/1837853.1693472. [Cross Ref]

27. Harris M. Optimizing parallel reduction in CUDA. NVIDIA Developer Technology. 2007;6

28. Rossi MR, Rabin Y. Experimental verification of numerical simulations of cryosurgery with application to computerized planning. Physics in medicine and biology. 2007;52:4553–4567. doi: 10.1088/0031-9155/52/15/013. [PMC free article] [PubMed] [Cross Ref]

29. Rabin Y. A general model for the propagation of uncertainty in measurements into heat transfer simulations and its application to cryosurgery. Cryobiology. 2003;46:109–120. [PubMed]

30. Bell N, Garland M. Efficient sparse matrix-vector multiplication on CUDA. Nvidia Corporation; 2008. (Nvidia Technical Report NVR-2008-004).

31. Gregory K, Miller A. C++ AMP: Accelerated Massive Parallelism with Microsoft® Visual C++®. O’Reilly Media, Inc; 2012.

32. Rabin Y, Lung DC, Stahovich TF. Computerized planning of cryosurgery using cryoprobes and cryoheaters. Technology in cancer research & treatment. 2004;3:229–243. [PubMed]

33. Neufeld E, Chavannes N, Samaras T, Kuster N. Novel conformal technique to reduce staircasing artifacts at material boundaries for FDTD modeling of the bioheat equation. Physics in medicine and biology. 2007;52:4371–4381. doi: 10.1088/0031-9155/52/15/001. [PubMed] [Cross Ref]

34. Christen M, Schenk O, Neufeld E, Messmer P, Burkart H. Parallel data-locality aware stencil computations on modern micro-architectures. Parallel & Distributed Processing IPDPS. 2009;2009:1–10. doi: 10.1109/IPDPS.2009.5161031. [Cross Ref]

35. Furuhata T, Song I, Rabin Y, Shimada K. Interactive prostate shape reconstruction from 3D TRUS images. Journal of Computational Design and Engineering. 2014;1:272–288. doi: 10.7315/JCDE.2014.027. [Cross Ref]

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |