The modern CPU is significantly different from that available in the early 1990s when MCML was developed. Currently, one of the most important CPU features is the single instruction multiple data (SIMD) support; for example, the Intel processor provides Streaming SIMD Extensions (SSE), which handles two double-precision floating-point numbers or four single-precision floating-point numbers in one instruction. This suggests that the modern CPU may run faster if the same instructions are grouped and applied together to an array of data. In the Monte Carlo simulation, there are multiple places where we can apply this technique. For example, the random number generation, a core function in the Monte Carlo simulation, is a time-consuming procedure in all the Monte Carlo simulation programs, because for each step or scattering a new random number is needed. Instead of generating one random number each time as in MCML, a better way is to generate an array of random numbers in one function call and use them one by one as needed. The Intel Math Kernel Library provides a function to generate an array of uniformly distributed random numbers, which gives a significant performance improvement over the use of a classic random number generator. We found that the speed of MCML can be improved 10–20% by simply replacing the original random number generator in MCML with an Intel vector random number generator. In addition to the random number generation, the Intel Math Kernel Library also provides vector implementations for certain computationally expensive core mathematical functions, such as log, power and trigonometric functions. These vector mathematical functions can be adapted to improve the Monte Carlo simulation. For example, when generating a random step for a photon, the computation involves a log function. After the generation of the array of random numbers, we can directly apply the vector log function on the array. By doing this, we obtained 30–50% improvement in speed, which was machine dependent. It is also possible to replace the Intel MKL library with the AMD math library for optimal performance on an AMD machine.

Another important trend is multi-core computer architecture. It becomes difficult to boost the CPU speed significantly by increasing the working frequency. A natural alternative way to increase computer speed is to pack more processor cores in one single CPU. For example, the current Intel Xeon and AMD Opteron CPUs have 4–6 computational cores. In the near future, general CPUs may likely have 8 or even 128 cores. Thus, there will be more leverage for us to utilize the power of multi-core techniques. It is fortunate that the Monte Carlo simulation is one of the most attractive parallel computing applications. In the real world, the trajectories of photons are essentially independent. In an ideal case, let us suppose that the number of processors is the same as the number of photons. Then, we can simulate the photon propagation in a single run. In the context of the current TIM-OS software, we use multi-thread programming to the degree the PC allows, that is, each of all available threads runs part of the simulation job. As a result, multi-thread programming improves the performance by almost eight times on an eight-core PC.

It is also natural to extend the parallelization to a cluster of machines each with multiple cores. We can always generate multiple TIM-OS runs on various machines, each of which is initialized with a different random seed and all of which collectively produce an improved outcome. In this way, we can even extend the parallelization to a large computer heterogeneous grid (under Windows, Linux, Unix, and so on). Note that a good parallel random generator is critical to manage the random number sequence for all the parallel threads. In our TIM-OS program, the MT2203 pseudorandom number generator (

Marsaglia and Tsang 2000,

Intel 2009a,

2009b) is used. This pseudorandom number generator provides a way to generate up to 1024 independent sequences. Hence, we can have up to 1024 independent threads in a parallel environment.

While the shared memory machine has more and more computational cores, this scheme may encounter a memory bandwidth bottleneck. Each time a photon enters into a new tetrahedron, the program needs to fetch information on the tetrahedron, including the geometry of the tetrahedron and its optical coefficients. By organizing these tetrahedrons in such a way that neighboring tetrahedrons are close to each other in the memory, the cache missing rate can be reduced to improve the simulation performance. This problem is exactly the same as the bandwidth problem in the area of sparse symmetric matrices. The bandwidth problem is NP-hard, and a complete search method is not suitable in our case. In our implementation, we reorder all the tetrahedrons on a preprocessor using the so-called reverse Cuthill–McKee algorithm (

Cuthill and McKee 1969). Even though we did not observe any significant improvement of the performance on our eight-core machine, it could be useful when the number of cores is increased in a shared memory machine. While our current SMP system has only eight cores, the current GPU system has already integrated up to 960 cores in one single machine. A total of 960 cores reading and writing memory randomly will greatly reduce the performance of the GPU machine. A recent GPU-based MCML implementation (

Alerstam *et al* 2009) on a Nvidia 9800GT graphic card showed a performance gain of about 58–137 times compared with the original MCML on a Core i7 CPU. It was observed that half of the running time was used to store the photon absorption information for a simple multi-layered geometry. For a complex geometry, such as the anatomy of a mouse, the memory bottleneck problem will be much worse than that with a multi-layered geometry. Organizing those data wisely will help reduce this problem.

The algorithm presented here is not limited to a traditional CPU-based machine. We can apply it to a GPU-based machine. The current mainstream eight-core workstation can reach 75G FLOPS for double-precision floating-point operations and 150G FLOPS for single-precision floating-point number operations. The current best GPU-based workstation has four Tesla S1070 graphic cards with 300G FLOPS for double-precision floating-point number operations and 4000G FLOPS for single-precision floating-point number operations. Hence, we will also work along this path, which may provide more than tenfold acceleration with an approximate fourfold increase in hardware cost.

The application of tetrahedron-based inhomogeneous Monte Carlo simulation is not limited to optical imaging. The TIM-OS scheme can also be adapted in x-ray studies, such as x-ray dual energy, backscattering and dark field imaging. X-ray small-angle scattering is a unique contrast technique that reveals subtle structural variations within soft tissues. By incorporating small-angle scattering phase functions in our TIM-OS software, we can accurately model the small-angle scattering process in a biomedically relevant environment. Since the number of x-ray scattering events is much smaller than the number of optical scattering events, it is possible to simulate dark field imaging in real time on a regular PC.

In conclusion, we have developed a new Monte Carlo scheme based on a relatively simple yet quite flexible tetrahedron-based optically inhomogeneous finite element mesh. This scheme greatly reduces the computational cost to solve optical radiative transport problems in the Monte Carlo simulation of complex heterogeneous biological tissues. The high performance of this scheme stems from the tetrahedron representation. In each tetrahedron, there are only four triangles. If a photon is in a tetrahedron, for each Monte Carlo propagation step the program only needs to test four triangles for a possible photon–triangle interaction. This type of interaction test is much simpler than traditional ray triangle interaction in the ray tracing scheme associated with a triangle-based surface mesh such as in triMC3D. Our numerical simulation results have confirmed the advantages of our tetrahedron-based approach. For a multi-layered tissue, TIM-OS is faster than the original MCML even if both the programs use only one CPU core. For highly complex mouse geometry, TIM-OS is capable of solving the optical forward propagation problem on a current desktop PC. This type of capability has never been previously demonstrated on a desktop PC.