Home | About | Journals | Submit | Contact Us | Français |

**|**Biomed Opt Express**|**v.3(12); 2012 December 1**|**PMC3521306

Formats

Article sections

Authors

Related links

Biomed Opt Express. 2012 December 1; 3(12): 3223–3230.

Published online 2012 November 12. doi: 10.1364/BOE.3.003223

PMCID: PMC3521306

Received 2012 July 23; Revised 2012 September 25; Accepted 2012 September 27.

Copyright © 2012 Optical Society of America

This is an open-access article distributed under the terms of the Creative Commons
Attribution-Noncommercial-No Derivative Works 3.0 Unported License, which permits download and
redistribution, provided that the original work is properly cited. This license restricts the
article from being modified or used commercially.

This article has been cited by other articles in PMC.

In this report, we discuss the use of contemporary ray-tracing techniques to accelerate 3D
mesh-based Monte Carlo photon transport simulations. Single Instruction Multiple Data (SIMD) based
computation and branch-less design are exploited to accelerate ray-tetrahedron intersection tests
and yield a 2-fold speed-up for ray-tracing calculations on a multi-core CPU. As part of this work,
we have also studied SIMD-accelerated random number generators and math functions. The combination
of these techniques achieved an overall improvement of 22% in simulation speed as compared
to using a non-SIMD implementation. We applied this new method to analyze a complex numerical
phantom and both the phantom data and the improved code are available as open-source software at
*http://mcx.sourceforge.net/mmc/*.

Near-infrared (NIR) light with wavelength between 600 and 1000 nm can penerate deep into the tissue [1], owning primarily to the relatively low optical absorption of human tissue chromorphores, namely oxy/deoxy-hemoglobin, lipids and water. This remarkable characteristic makes NIR light a suitable candidate to probe deep tissue physiology such as angiogenesis and oxygen metabolism in a safe and non-invasive manner. Substantial efforts have been made in the past decade to utilize NIR imaging to facilitate the investigation of breast cancer and human brain functions, monitoring therapy responses, and a range of other applications [2]. A notable challenge for NIR imaging is the high scattering within the tissue. The transport of low-energy NIR photons is highly nonlinear and exhibits a complex and diffusive pattern. Accurate and efficient numerical methods play an essential role in NIR-based techniques such as diffuse optical tomography (DOT) and fluorescence molecular tomography (FMT).

The Monte Carlo (MC) method is a numerical technique offering both generality and accuracy in
modeling photon transport inside human tissue, and often serves as the gold standard in biophotonic
applications. After 20 years of evolution, biophotonic MC algorithms have become increasingly more
efficient and capable in handling complex tissue structures. One of the first MC codes, MCML
[3], can only simulate problems within layered
infinite media. A more general approach was proposed by Boas *et al.* for 3D
heterogeneous domains using vox-elated representations [4, 5]. However, to achieve higher accuracy
near curved boundaries in a voxelated domain, one has to increase the global density of the grid
resulting in a significant increase in memory costs and execution time. Over the past few years,
several mesh-based approaches have been pursued to obtain better accuracy in modelling complex
tissue boundaries. In [6], Li *et
al.* reported a Mouse Optical Simulation Environment (MOSE) that allows one to define 3D
heterogeneous structures using geometric primitives. Margallo-Balbás *et al.*
[7] and Ren *et al.*
[8] have reported methods using triangular
surface meshes and ray-surface intersection calculations to propagate photons in a
piece-wise-constant domain. This approach achieves better accuracy near tissue boundaries, however,
the speed of the ray-surface intersection tests depends on the partitioning scheme of the surfaces
[9] and the traversal of the data structure
often introduces significant computational overhead [10].

In the past year, two new MC algorithms were proposed to perform photon transport simulations using a 3D volumetric mesh. These algorithms include the Tetrahedron-based Inhomogeneous MC Optical Simulator (TIM-OS) by Shen & Wang [11] and the Mesh-based MC Method (MMCM) by Fang [12]. It has been shown that both methods pursue a similar approach, but differ slightly in the ray-tracing formulation used [13]. Compared to surface-based approaches, the use of a tetrahedral mesh leads to significant savings in the overhead of testing ray-triangle intersections. On average, only 2.5 ray-triangle intersection tests are required per step in MMCM and four for TIM-OS. Combined with the recent advances in image-based mesh generation methods [14],[15], the development of these new mesh-based techniques provides a robust path to fast and accurate modeling of photon propagation inside complex anatomical structures.

Ray-tracing techniques are at the heart of both mesh-based MC techniques. It is natural to
hypothesize that by incorporating the latest ray-tracing methods developed by the computer graphics
community, one can further advance mesh-based MC simulations. In particular, several Single
Instruction Multiple Data (SIMD)-based methods have been proposed recently, including Ward’s
method [10] and Shevtsov’s method
[16]. Recently, Havel and Herout proposed a
hybrid approach (referred to as H&H hereafter) that effectively combines both methods with
an efficient Streaming SIMD Extensions (SSE) implementation [17]. Moreover, in [16],
Shevtsov *et al.* described a branch-less ray-tracing strategy to take advantage of
the pipelining mechanism in modern CPU architectures. Incorporating these innovations into
biophotonic MC simulation is expected to improve overall speed. Our main interest in this report is
to discuss the implementation and comparison of four ray-tracing techniques: a Plücker-based
ray-tracer, a Badouel-based ray-tracer and their SSE counterparts, in a mesh-based MC
simulation.

The contribution of this work is to demonstrate the use of compteporary ray-tracing techniques for efficient MC simulations on a modern CPU processor. We highlight the ray-tracing nature of biophotonic MC simulations, and suggest that a thorough investigation to the ray-tracing technique state-of-the-art can not only guide the performance improvement of many existing MC algorithms, but also provide inspirations for the future generations of MC algorithms. In the remaining sections, we first introduce the general structure of a mesh-based Monte Carlo photon simulation. We subsequently examine the computational bottlenecks present in non-SSE implementations with a profiling analysis. Using the knowledge learned from the profiling study, we improve the simulation performance by incorporating SSE in the ray-tracing, random number generation and the calculations of math functions. We then examine the improvement in simulation speed by re-profiling the SSE-enabled codes and report the improved runtimes. We conclude with a brief discussion on limitations and future prospects.

In a mesh-based MC simulation, the problem domain is first discretized using a tetrahedral mesh
(or other shapes [12]). This is followed by
the launching of a large number of photon packets each initialized with a unitary weight. Each
simulated photon packet enters the domain at a specified source location **p**, enclosed by
a tetrahedron *e*_{0}, along an initial directional vector, **v**.
Position **p** and vector **v** define a 3D “ray”. For each
movement, we calculate a random scattering length *d* along **v** based on
the local scattering coefficient (*μ _{s}*) inside

When a photon hits a boundary between two media with different refractive indices, a reflection
coefficient, *R*, is calculated based on Fresnel’s law. A uniformly distributed random number,
*s* [0, 1], is selected. If *s* >
*R*, we transmit the photon across the boundary and update **v** by the
refraction direction. Otherwise, the photon stays in the current medium and propagates along a
reflected direction. When a photon passes through an external surface of the mesh, we terminate the
photon and launch a new one. At the end of the simulation, we output the volumetric flux (or
fluence) accumulated over the mesh by all simulated photon packets and save the distances traveled
by the exiting photons inside each tissue type (referred to as the partial path-lengths). With the
path-length information, one can quickly re-calculate the optical intensity at the detectors without
repeating the entire simulation [4].

We have published an open-source C implementation of the above algorithm, namely
*mmc*. Independently, Shen and Wang have published a C++ software,
*timos* [11]. By comparison
[13], *mmc* uses a
Plücker-coordinates-based ray-tracing scheme for generality while *timos*
uses a simplified Badouel algorithm [18] for
better speed. In both cases, the simulations are accelerated using multi-threaded computing
[12, 11] but neither of the published codes were optimized with SIMD.

Here, we perform a profiling analysis to identify the computational bottlenecks in the non-SSE
*mmc* implementation. An open-source profiler, Callgrind, is picked for its
availability and performance. The benchmark details are described in Section 3. Running the
profiler, we extract the run-time for each function of the simulation code and generate a
run-time-sensitive call-graph using a visualization tool, Kcachgrind. The resulting call-graph is
shown in Fig. 1(a). Each block in Fig. 1(a) represents a C-function corresponding to each step of the simulation
( `sincos` is a math function to produce a pair of sine and cosine
simultaneously). The size of the block is proportional to the execution time spent within the
function. We report the percentage of the run-time for each function by normalizing with the total
run-time. Note that the result here is for a single CPU core as the profiler is restricted to run
with a single thread.

Subroutine-level profiling for mesh-based MC simulation (a) without SSE, (b) with Havel &
Herout’s SSE-based algorithm, and (c) with additional SSE-enabled math library and RNG. The
area and percentage in each block represent the run-time **...**

From Fig. 1(a), we observe that the
Plücker-coordinate-based ray-tracing subroutine ( `plucker_raytet`) and
the scattering calculations ( `next_scatter`) take the majority of the
simulation time, accounting for 66.74% and 26.92%, respectively. Within the
ray-tracing function, the absorption and energy deposit calculations (
`plucker_absorb`) account for 24.98% of the total run-time, while the
rest of the calculations are related to the ray-tetrahedron intersection tests. Within the
photon-scattering calculations, we notice that the math and random number generator (RNG) functions
are the main consumers of the run-time, accounting for 11.01% and 8.91%,
respectively. From Fig. 1(a), it is clear that the
ray-tetrahedron intersection test is one of the biggest hotspots in the simulation. In addition,
optimizing math and RNG functions is also helpful in enhancing the overall speed.

We implemented two SSE-accelerated ray-triangle intersection methods in our software: an SSE ray-tracer based on the H&H method [17] and an SSE partial Baudoul’s method with branch-less structure [18, 16]. Both the H&H ray-tracer and the Plücker-based ray-tracer in [12] require the calculation of the Barycentric coordinates. The partial Baudoul’s methods, either SSE or non-SSE, skip the Barycentric coordinates with presumably reduced accuracy [13].

For the branch-less Baudoul’s method, we use a Structure-of-Arrays (SoA) format
[10] to organize the face norm data in the
memory. For example, instead of defining **N** =
{*N _{x}*,

From Fig. 1(a), the RNG and math functions account for
72% of the scattering-related calculations. Using SSE to accelerate these functions is
expected to improve the overall speed as well. In the previously released *mmc* code,
the default RNG was a 48-bit multi-threaded RNG function, `drand48_r`, defined
in the GNU C library (glibc). This RNG uses a linear-congruent algorithm to produce each random
number. In this work, we replace `drand48_r` by a more efficient
SIMD-oriented Fast Mersenne Twister (SFMT-19937) algorithm [19]. The SFMT-19937 algorithm has proven to have a vary long period
(2^{19937} − 1), and is specifically optimized with SIMD parallelism.

To accelerate the calculation of the math functions ( `log`,
`sincos`), we leveraged a free math library, SSEMath by Julien Pommier
[20]. In this library, each call to the math
function produces four values simultaneously with SSE commands. To utilize this fast math library in
the photon scattering module, we created a static buffer with a length multiple of four and filled
the buffer with random numbers generated by the SFMT RNG. We subsequently call the SSE version of
the `log` and `sincos` functions to compute four function
values simultaneously. Each time the MC simulator requests a random function, we pop a precomputed
value from the buffer. When a buffer is used up, we refill the buffer by repeating the above
process.

In this section, we profile the mesh-based Monte Carlo code with different ray-tracing algorithms and RNG/math configurations. From the profiling output, we quantitatively compare the proposed algorithms to demonstrate the benefits of using SSE acceleration.

For this purpose, we build a benchmark problem using the widely-used Digimouse atlas
[21]. From the segmented digital mouse CT
image, we create a 3D tetrahedral mesh using an image-based mesh generator, iso2mesh [15], and utilities based on the CGAL library [14]. As seen in Fig.
2, the resulting mesh contains 42,301 nodes and 209,907 tetrahedral elements, including 21
distinct tissue regions according to the segmentation. The mesh data, script, and optical properties
of each tissue type [22] are provided online
on our website [12]. A pencil-beam point
source is positioned on the skin above the right-hemisphere of the brain. All simulations are
performed on a Dell workstation running 64bit Ubuntu Linux 10.04. The computer has a quad-core Intel
E5520 CPU. The software was compiled with gcc 4.4.3 with `-O3`and
`-msse4` options. We launched a total of 10^{6} photons to produce the
profiling results using Callgrind.

A tetrahedral mesh generated from the Digimouse atlas. Different colors represent different
tissue types.

In Fig. 1(b), we report the performance benefits of
replacing the non-SSE Plücker-based ray-tracer with the H&H’s SSE-based
ray-tracer. The area of each box is proportional to its run-time. The width of the largest box in
Fig. 1(b) is scaled by matching the area of the
`next_scatter` block with that in Fig.
1(a), given that RNG and math functions are not changed in the two cases.

Next, we evaluate the improvement obtained from the SSE-accelerated math functions and the added
SFMT RNG in the code. The updated performance results are shown in Fig. 1(c). Again, the size of each box is proportional to its run-time and the area for
`havel_raytet` is set to match that in Fig.
1(b).

In addition to the subroutine-level profiling analysis, in Table 1, we summarize total runtimes for the ray-tracing calculation and the total
simulation time across various combinations of ray-tracers and SSE-enabled techniques. In each case,
we simulate 10^{7} photons with four CPU cores running in parallel. Note that the two
Badouel-based algorithms perform less computation than the other two algorithms at a cost of reduced
accuracy.

From Table 1, it is clear that the SSE-accelerated ray-tracing techniques make measurable improvements in the overall simulation speed. The total run-time is reduced by 22% for SSE-based H&H’s method compared to the Plücker’s method. For the simplified ray-tracing using the Badouel’s method, the branch-less algorithm reduces the simulation time by 26% by improving pipelining on a modern CPU. The acceleration of the ray-tracing calculation alone is 2X when compared to H&H with Plücker-based methods; comparing the branch-less and branched Badouel’s methods, the speedup is 1.3X. From the profiling results shown in Fig. 1(b) and (c), we see that the SSE-enabled RNG and math functions have accelerated the photon scattering calculations by 2.7X. Better computational efficiency was clearly achieved by simultaneously exploiting the multiple processor cores, the SIMD capability, and pipelining available in modern CPU processors. It is interesting to note that the speed-up ratio calculated by the profiling data in Fig. 1 is over-estimated compared to that derived from Table 1. We believe this is due to the fact that the profiler can only utilize a single CPU thread, thus, is a result of variable overhead when running single-threaded versus multi-threaded. Nonetheless, the relative run-times of the subroutines can provide clear insight and help guide us during code optimization.

The ability to fully utilize the parallelism provided in a modern CPU makes this code one of the
efficient MC models available. In the recent years, given the considerable progress on effectively
exploiting graphics processing units (GPU) for general purpose computing, new opportunities are
available to further accelerate MC simulations. Although the SSE instructions used in this work can
not be executed on a GPU directly, modern GPU hardware typically contains many wide SIMD units,
allowing parallel execution of code fragments on a grander scale. This can potentially lead to
dramatic reduction in simulation speed and ease of parallel programming. Recently, Powell and Leung
reported a GPU implementation of the mesh-based MC for acousto-optic applications [23]. The authors compared their GPU code to the first
release of *mmc* and demonstrated a 2X speed-up with a standard benchmark
[12, 23]. Interestingly, with the techniques reported here and other improvements, the
latest version of *mmc* has achieved 103 photons/ms for the same benchmark, a speed
twice as fast as the reported GPU code. Based on our previous study [5], we strongly believe that a GPU has great potential to provide
massive acceleration compared to a modern CPU. The rather low speed-up achieved in the forementioned
GPU implementation may be related to suboptimal memory utilization. This GPU code, as well as other
voxel- or surface-based MC codes, can potentially benefit from the efficient ray-tracing and memory
layout schemes reported here.

In summary, we have investigated the potential of accelerating biophotonic Monte Carlo
simulations on modern CPUs. We have used SSE constructs to accelerate ray-tracing calculations, as
well as key math functions and random number generators. With these techniques, we were able to
speed up the ray-tracing calculation by a factor of two, and improve the overall speed by 22 to
26%. The updated source code and benchmark dataset is provided on our website at
*http://mcx.sourceforge.net/mmc/*. The software can not only take advantage of
multi-core processors on a symmetric multiprocessor (SMP) system, but also has the potential to gain
massive acceleration when running on a software distributed shared memory (SDSM) cluster, either
locally or via cloud computing. The proposed techniques should also scale gracefully when many-core
CPU chipsets become available on the market.

QF wishes to thank the funding support from the Bill & Melinda Gates Foundation ( #OPP1035992).

1. Boas D. A., Brooks D. H., Miller E. L., DiMarzio C. A., Kilmer M., Gaudette R. J., Zhang Q., “Imaging the body with diffuse optical
tomography,” IEEE Signal Proc. Mag. 18, 57–75 (2001).10.1109/79.962278 [Cross Ref]

2. Gibson A. P., Hebden J. C., Arridge S. R., “Recent advances in diffuse optical
imaging,” Phys. Med. Biol. 50, R1–R43 (2005).10.1088/0031-9155/50/4/R01 [PubMed] [Cross Ref]

3. Wang L. H., Jacques S. L., Zheng L. Q., “MCML - Monte Carlo modeling of light transport in
multilayered tissues,” Comput. Meth. Prog. Bio. 47, 131–146 (1995).10.1016/0169-2607(95)01640-F [PubMed] [Cross Ref]

4. Boas D. A., Culver J., Stott J., Dunn A., “Three dimensional Monte Carlo code for photon migration
through complex heterogeneous media including the adult human head,”
Opt. Express 10, 159–170 (2002). [PubMed]

5. Fang Q., Boas D. A., “Monte Carlo simulation of photon migration in 3D turbid
media accelerated by graphics processing,” Opt. Express 17, 20178–20190 (2009).10.1364/OE.17.020178 [PMC free article] [PubMed] [Cross Ref]

6. Li H., Tian J., Zhu F., Cong W., Wang L. V., Hoffman E. A., Wang G., “A mouse optical simulation environment (MOSE) to
investigate bioluminescent phenomena in the living mouse with the monte carlo
method,” Academ. Radiol. 11, 1029–1038 (2004).10.1016/j.acra.2004.05.021 [PubMed] [Cross Ref]

7. Margallo-Balbäs E., French P. J., “Shape based Monte Carlo code for light transport in complex
heterogeneous Tissues,” Opt. Express 15, 14086–14098 (2007).10.1364/OE.15.014086 [PubMed] [Cross Ref]

8. Ren N., Liang J., Qu X., Li J., Lu B., Tian J., “GPU-based Monte Carlo simulation for light propagation in
complex heterogeneous tissues,” Opt. Express 18, 6811–6823 (2010).10.1364/OE.18.006811 [PubMed] [Cross Ref]

9. Wächter C., “Quasi-Monte Carlo light transport simulation by efficient
ray tracing,” Ph.D. dissertation (Ulm
University, Ulm, Germany,
2007).

10. Wald I., “Realtime ray tracing and interactive global
illumination,” Ph.D. dissertation (Saarland
Univ., Saarbrücken, Germany,
2004).

11. Shen H., Wang G., “A tetrahedron-based inhomogeneous Monte Carlo optical
simulator,” Phys. Med. Biol. 55, 947–962 (2010).10.1088/0031-9155/55/4/003 [PMC free article] [PubMed] [Cross Ref]

12. Fang Q., “Mesh-based Monte Carlo method using fast ray-tracing in
Plücker coordinates,” Biomed. Opt. Express 1, 165–175 (2010).10.1364/BOE.1.000165 [PMC free article] [PubMed] [Cross Ref]

13. Fang Q., “Comment on ‘A study on tetrahedron-based
inhomogeneous Monte-Carlo optical simulation’,” Biomed. Opt.
Express 2, 1258–1264 (2011).10.1364/BOE.2.001258 [PMC free article] [PubMed] [Cross Ref]

14. CGAL Editorial Board , CGAL User and Reference Manual, 3rd ed.
(2009).

15. Fang Q., Boas D. A., “Tetrahedral mesh generation from volumetric binary and
grayscale images,” in IEEE International Symposium on Biomedical
Imaging: from Nano to Macro, 2009. ISBI ’09 (IEEE,
2009), pp. 1142–114510.1109/ISBI.2009.5193259 [Cross Ref]

16. Shevtsov M., Soupikov A., Kapustin A., “Ray-triangle intersection algorithm for modern CPU
architectures,” in Proceedings of GraphiCon 2007
(2007), Vol. 11, pp.
33–39

17. Havel J., Herout A., “Yet faster ray-triangle intersection (using
SSE4),” IEEE Trans. Visualiz. Comput. Graphics 16, 434–438 (2010).10.1109/TVCG.2009.73 [PubMed] [Cross Ref]

18. Badouel D., Graphics Gems (Academic,
1990).

19. Saito M., Matsumoto M., Monte Carlo and Quasi-Monte Carlo Methods 2006
(Springer, 2008).

20. Pommier J., “Simple SSE and SSE2 optimized sin, cos, log and
exp” (2007), http://gruntthepeon.free.fr/ssemath/

21. Dogdas B., Stout D., Chatziioannou A., Leahy R. M., “Digimouse: a 3D whole body mouse atlas from CT and
cryosection data,” Phys Med Biol. 52, 577–87 (2007).10.1088/0031-9155/52/3/003 [PMC free article] [PubMed] [Cross Ref]

22. Cheong W. F., Prahl S. A., Welch A. J., “A review of the optical properties of biological
tissues,” IEEE J. Quantum Electronics 26, 2166–2185 (1990).10.1109/3.64354 [Cross Ref]

Articles from Biomedical Optics Express are provided here courtesy of **Optical Society of America**

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |