Prior to testing the system performance, the 2D Barbara test image was chosen to verify our implementation since this image is quite often used for system verification [9]. Fig. 3 shows the reconstructions of the Barbara data set, which contains 141 projections in the range of ±70° with an angular step of 1°: Fig. 3(a) is the original image, and Figs. 3(b) and 3(c) are 100 iterations of SIRT and EM reconstructions, respectively. While the overall quality looks satisfactory, it is worth pointing out that the smearing around the mouth is due to the nature of limited-angle reconstruction: the projections outside the range of ±70° are missing, a typical practical challenge encountered in EMT reconstruction.
Fig. 3 Iterative reconstruction of the Barbara data set that contains 141 projections in the range of ±70° with an angular step of 1°. (a) Original Barbara image, (b) 100 iterations of SIRT reconstruction, and (c) 100 iterations of EM reconstruction.
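For reference, the standard textbook forms of the two update rules are reproduced below; the symbols (the system matrix A with entries a_{ij}, the projection data p, the relaxation factor \lambda) are generic, and the exact variants used in our implementation may differ in detail. Writing the volume estimate at iteration k as x^{(k)}, SIRT performs the additive update

x^{(k+1)} = x^{(k)} + \lambda\, C A^{T} R\, (p - A x^{(k)}),

where R and C are diagonal matrices of the reciprocal row and column sums of A, whereas EM performs the multiplicative update

x_j^{(k+1)} = \frac{x_j^{(k)}}{\sum_i a_{ij}} \sum_i a_{ij}\, \frac{p_i}{(A x^{(k)})_i}.

Both drive the reprojections of the volume toward the measured tilt series; the wedge of directions outside ±70° is simply not constrained by any data, which is why both reconstructions in Fig. 3(b) and 3(c) show the smearing noted above.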
We then reconstructed a 3D volume of 2048² × 256 voxels from a real tilt series to further verify the quality of reconstruction. This tilt series contains 122 projection images of 2048² pixels (4-byte floating point per pixel) acquired every 1° within ±60° for a Drosophila centrosome with microtubules re-grown with bovine alpha/beta-tubulin and frozen in vitreous ice. This experiment was conducted at liquid nitrogen temperature on an FEI G2 Polara 300 kV F30 Helium transmission electron microscope equipped with a post-column GATAN energy filter and a 4096² pixel lens-coupled CCD camera (Ultracam, GATAN). The camera was set to 2× on-chip binning to form the final image size of 2048² pixels. The energy filter was centered at 0 eV with a 20 eV energy-loss window. An in-house iterative alignment method was applied to the original tilt series prior to tomographic reconstruction. Fig. 4(a) presents the aligned projection image at 0°, while the central z slice of the volume reconstructed on a cluster containing 4 GPUs is shown in Fig. 4(b). The quality of the reconstructed volume is comparable to the CPU reconstruction.
Fig. 4 Drosophila centrosome with microtubules re-grown with bovine alpha/beta-tubulin and frozen in vitreous ice. (a) Aligned tilt image at 0°. (b) Central z slice of the tomogram reconstructed on a 4-GPU system.
Several factors are expected to affect the overall system performance, including the number of GPUs involved in the computation, the number of nodes, which determines the data traffic flow over the network, and the use of GPUs with different computing power. The role of the number of GPUs was investigated first. This test was performed on a system that contains 5 GTX295 cards (10 GPUs total, hereafter referred to as GTX GPUs) distributed over 3 computing nodes running the Fedora Linux operating system (2 nodes have 2 cards each and one node has a single card). Each GTX GPU has 240 CUDA cores and 896 MB of memory, and delivers a peak single-precision floating-point performance of 894 GFLOPS. The entire system therefore provides nearly 9 TFLOPS of single-precision floating-point computation and approximately 9 GB of memory for tomographic reconstruction. Although our system places no restriction on where the data should be stored, the input projections and generated volumes were all placed on the local disk of the client host to allow separate measurement of disk IO and network overheads. The reconstruction was performed repeatedly to test the system performance with various numbers of GPUs. (Note that users can specify not only the nodes but also the number of GPUs on each node to join the computation without physically altering the hardware.) This information is useful not only when deciding what scale of GPU computing system should be built for the intended reconstruction dimensions, but also when end users select enough GPUs for their own jobs and leave the remainder available for other purposes.

Fig. 5 depicts the performance change under the various GPU configurations. The total time is the duration, from the end user's perspective, needed to complete the reconstruction; it was measured from the moment the user issued the reconstruction command to the moment the volume was saved on disk. The reconstruction time represents the computational expense, measured from the moment the client process requests the reconstruction to the moment the last node reports completion of its part of the reconstruction; it is interchangeably referred to as the computational time hereafter. The system overhead, i.e., the difference between the total time and the reconstruction time, stems mainly from disk IO operations and from projection and volume data flow across the network. The overall system performance is given in Fig. 5, where the reconstruction time is also presented for comparison. Although the system performance improves in terms of absolute time in Fig. 5, the acceleration factor, i.e., the ratio of the time needed by one GPU to solve the problem to the time needed by a specified number of GPUs to solve the same problem, reflects more specifically the performance gain obtained by expanding the number of GPUs in the system. In the ideal case, where the workload can be distributed among all the GPUs without additional overhead, the acceleration factor would vary with the number of GPUs as a straight line with a slope of one. In terms of the computational performance, the acceleration factor is found to lie very close to a straight line with a slope of 0.95. This trend suggests that adding more GPUs is an effective approach for continuous enhancement of the computational performance. This is not always true, however, in terms of the overall system performance.
As the computational performance keeps improving, the system overhead becomes increasingly relevant, thereby diminishing the benefit of adding further GPUs (Fig. 5). Even so, the overall system performance is boosted nearly seven-fold when the system is expanded to 10 GPUs.
Fig. 5 Performance of the NVIDIA GTX295 based system under various GPU configurations for 10 cycles of SIRT reconstruction of a tomogram of 2048² × 256 voxels from a tilt series containing 122 projection images of 2048² pixels. (a) The total and reconstruction time.
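The saturation of the overall speedup follows from a simple decomposition; the specific overhead fraction used in the example below is an illustrative assumption, not a measured value. Writing the total time on N GPUs as

T_{total}(N) \approx T_{ov} + \frac{T_{comp}(1)}{A_c(N)},

where T_{ov} is the (roughly constant) system overhead and A_c(N) is the computational acceleration factor, the computational term keeps shrinking almost linearly with N (slope of about 0.95) while T_{ov} does not. For instance, if T_{ov} were about 4% of the single-GPU computation time T_{comp}(1), then with A_c(10) ≈ 9.5 the overall acceleration would be roughly 1.04 / (0.04 + 1/9.5) ≈ 7, consistent with the nearly seven-fold overall improvement reported above.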
Fig. 5 also shows that the difference between the total and computational time, i.e., the system overhead, is nearly constant regardless of the number of GPUs or, equivalently, the number of nodes, since the different GPU configurations correspond to different numbers of nodes in the cluster. As mentioned, the system overhead originates mainly from the disk IOs and the data flow across the network. Since the disk IOs are performed only on the client host, the corresponding overhead is clearly independent of the number of nodes in the cluster; the constancy of the total overhead therefore suggests that the network overhead is also independent of the number of nodes. To confirm this observation and identify the major contribution to the system overhead, a fixed-size volume (2048² × 256) was repeatedly reconstructed from 122 projections of 2048² pixels under various node configurations that all have 4 GTX GPUs in total. The results are given in Table 1.
Table 1 Overhead for reconstruction performed over various numbers of computing nodes with four GPUs in total. The input projections have 2048² × 122 pixels whereas the generated volume is of 2048² × 256 voxels. Both the input and output are 4-byte floating point.
The first row in Table 1 indicates how many nodes were involved in the reconstruction. “Local” represents the configuration where both the client and reconstruction processes were running on the same machine. The second row shows how the four GPUs are distributed over these nodes. For example, the configuration “2+1+1” indicates that the four GPUs are scattered over three nodes, of which one provides 2 GPUs whereas the remaining two each provide a single GPU. The system overhead is given in the third row. The “Local” configuration should not incur any network overhead, so the system overhead under this configuration represents a baseline overhead common to all other configurations. Therefore, the network overhead given in the fourth row was calculated by subtracting this baseline (23.2 seconds) from the system overhead. The network overhead remains quite constant, although a slight decrease can be observed as more nodes were involved. To understand this behavior, note that the bulk of the network traffic (neglecting control messages between the client and the nodes) is set solely by the size of the reconstruction and the projection series, not by the number of nodes; this places a firm lower limit on the network overhead. It is also worth pointing out that the network operations contribute a major portion of the system overhead. Therefore, using a higher-speed network should have the most impact on lowering the system overhead.
For convenience, the corresponding reconstruction time is also included in Table 1; it is almost identical across configurations, with only slight fluctuation, suggesting that the computational load is evenly distributed among the nodes.
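A back-of-the-envelope estimate illustrates this lower limit; the link speed used below is a hypothetical figure for a standard gigabit Ethernet connection, not a value taken from our measurements. Assuming the projection series is shipped to the nodes once and the reconstructed sub-volumes are returned once, the aggregate traffic is fixed by the data sizes:

input: 2048² × 122 × 4 B ≈ 2.0 GB,   output: 2048² × 256 × 4 B ≈ 4.3 GB,

i.e., roughly 6.3 GB in total regardless of how many nodes share the work. At an assumed usable bandwidth of about 110 MB/s, moving this data alone would take on the order of 6.3 GB / 110 MB/s ≈ 57 s; whatever the actual link speed, the same volume of data must cross the network, which is why the network overhead in Table 1 is insensitive to the number of nodes.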
Obviously, not all reconstructions necessitate a large number of GPUs, and having a large-scale system frequently reconstruct small volumes is certainly not an efficient use of resources. Therefore, we would like to explore what volume size makes the most efficient use of the system containing 10 GTX GPUs. In the subsequent test (Fig. 6), we reconstructed volumes of various sizes by altering the z height while keeping the xy dimensions fixed (2048 × 2048). The reconstruction time shown in Fig. 6(a) varies roughly linearly for most z heights; however, the time cost per voxel (Fig. 6(b)) is high for small z heights. This may indicate that the GPU computing power is not saturated when small volumes are reconstructed. As the z height grows, this profile decreases more slowly and becomes essentially flat around z = 640, at which point the system has reached its maximum computing power. Therefore, it is recommended that the balance between performance and cost not be overlooked when building such a system.
Fig. 6 Performance of a system formed by 10 GTX GPUs reconstructing volumes of various z heights with 2048² voxels in the x and y dimensions. (a) Reconstruction time in seconds. (b) Reconstruction time per voxel in microseconds.
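The shape of the per-voxel curve is what one would expect from a fixed per-reconstruction cost plus a cost proportional to the number of voxels; the decomposition below is an illustrative model, not a fit to the measured data. If the reconstruction time behaves as

T(z) \approx T_{fixed} + c \cdot 2048^2 \cdot z,

where T_{fixed} collects per-launch and per-projection costs that do not grow with z and c is the asymptotic cost per voxel, then the time per voxel is

\frac{T(z)}{2048^2\, z} \approx c + \frac{T_{fixed}}{2048^2\, z},

which is large for small z and flattens toward c once z is large enough for the fixed cost to be amortized, in line with the plateau observed around z = 640.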
Although the EMT reconstruction can be successfully performed on GTX295 cards, they were originally designed as video cards. Inspired by the great success of GPUs in scientific computation, NVIDIA has also developed Tesla and, most recently, Fermi GPUs as co-processors dedicated to computation. Since Fermi was not available when this work was conducted, the performance of Tesla GPUs was investigated by running the EMT reconstruction on our Tesla S1070 system with 4 Tesla C1060 cards installed. Since a Tesla C1060 card has 1 GPU as opposed to 2 GPUs per GTX295 card, the second system was configured with 2 GTX295 cards to allow comparison of the same number of GPUs. It should also be pointed out that the Tesla and GTX GPUs have the same number of CUDA cores (240 per GPU) and nearly identical processor speeds (Tesla: 1.3 GHz, GTX: 1.24 GHz), but differ significantly in memory size (Tesla: 4 GB/GPU, GTX: 896 MB/GPU), with the GTX295 having about ten percent higher memory bandwidth (Tesla: 102 GB/s, GTX: 112 GB/s). The same projection data was used to repeatedly reconstruct volumes with fixed xy dimensions (2048²) but variable z height. The reconstructions were performed locally to avoid biasing the performance measurement with network traffic. Therefore, Fig. 7 presents only the time consumed for reconstruction.
Fig. 7 Performance comparison of GTX295 versus Tesla C1060 graphics cards for EMT reconstruction. The reconstructions were performed on 4 GPUs provided either by 2 GTX295 cards or by 4 Tesla C1060 cards. The volumes have various z heights but fixed xy dimensions of 2048² pixels.
While the Tesla S1070 system performs better for smaller volumes with z heights less than 400 voxels, the GTX295 system catches up and eventually outpaces the Tesla S1070 system as the volume size grows. It should be pointed out that this comparison is based upon the same number of GPUs. However, since each GTX295 card has two GPUs and is less expensive than the single-GPU Tesla C1060 card, a system with 4 GTX295 cards obviously performs reconstructions far faster, and at significantly lower cost, than its counterpart with the same number of Tesla C1060 cards: according to the performance data given in Fig. 5, the reconstruction time is cut by nearly half when the number of GPUs is doubled. The less favorable performance of the Tesla GPUs is, in our opinion, attributable to the 2D textures used in our implementation of the kernel functions, which lead to sequential reconstruction, slice by slice, along the y direction. As a result, the potential benefit of the larger memory provided by Tesla is not realized, and the number of processing cores becomes the decisive factor in the current implementation. This, combined with the significantly lower cost of the GTX295 cards, makes them the most cost-effective choice for high performance EM tomography.
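To make this point concrete, the sketch below shows how one y-slice might be back-projected through a 2D texture. It is a minimal illustration written against the modern CUDA texture-object API, with assumed names, simple parallel-beam geometry, and no filtering or SIRT/EM weighting; it is not the authors' actual kernel. Because each launch touches only one slice's sinogram and one slice of the volume, the working set is only a few megabytes, which is why the 4 GB of memory on a Tesla C1060 offers no advantage over the 896 MB of a GTX GPU under this slice-by-slice scheme.

#include <cuda_runtime.h>

// Back-project one xz slice (fixed y) from its sinogram held in a 2D texture.
// The texture axes are (detector coordinate u, tilt index t); linear filtering
// gives hardware-interpolated reads along the detector.
__global__ void backprojectSlice(cudaTextureObject_t sino, float* slice,
                                 int nx, int nz, int nTilts,
                                 const float* cosA, const float* sinA)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iz = blockIdx.y * blockDim.y + threadIdx.y;
    if (ix >= nx || iz >= nz) return;

    float x = ix - 0.5f * nx;   // voxel position relative to the tilt axis
    float z = iz - 0.5f * nz;

    float sum = 0.0f;
    for (int t = 0; t < nTilts; ++t) {
        // Detector coordinate of this voxel at tilt t (parallel-beam geometry).
        float u = x * cosA[t] + z * sinA[t] + 0.5f * nx;
        sum += tex2D<float>(sino, u + 0.5f, t + 0.5f);
    }
    slice[iz * nx + ix] += sum / nTilts;   // simple unweighted back-projection
}

// Minimal host-side wrapper: copy one slice's sinogram (nx bins x nTilts rows)
// into a CUDA array, create a texture object over it, and launch the kernel.
void backprojectOneSlice(const float* h_sino, float* d_slice,
                         int nx, int nz, int nTilts,
                         const float* d_cosA, const float* d_sinA)
{
    cudaArray_t arr;
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaMallocArray(&arr, &desc, nx, nTilts);
    cudaMemcpy2DToArray(arr, 0, 0, h_sino, nx * sizeof(float),
                        nx * sizeof(float), nTilts, cudaMemcpyHostToDevice);

    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = arr;

    cudaTextureDesc tex = {};
    tex.addressMode[0] = cudaAddressModeClamp;
    tex.addressMode[1] = cudaAddressModeClamp;
    tex.filterMode = cudaFilterModeLinear;    // free bilinear interpolation
    tex.readMode = cudaReadModeElementType;

    cudaTextureObject_t sino = 0;
    cudaCreateTextureObject(&sino, &res, &tex, nullptr);

    dim3 block(16, 16);
    dim3 grid((nx + block.x - 1) / block.x, (nz + block.y - 1) / block.y);
    backprojectSlice<<<grid, block>>>(sino, d_slice, nx, nz, nTilts, d_cosA, d_sinA);
    cudaDeviceSynchronize();

    cudaDestroyTextureObject(sino);
    cudaFreeArray(arr);
}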
Although this 10-GPU system formed by three computing nodes is by no means a large-scale cluster and costs less than $15,000, we would like to see how it performs when reconstructing volumes with typical dimensions in EMT. Both the total and reconstruction times for 10 iterations of SIRT reconstruction are presented below. Note that all the input and output data are single-precision floating point and are all stored locally on the client host. It takes less than 40 seconds in total to reconstruct a volume of 1024² × 256 voxels from an EMT tilt series of 1024² × 122 pixels, whereas reconstructing a volume of 2048² × 512 voxels from 122 projections of 2048² pixels takes less than six minutes (total time). Although 4096² × 512 voxels is at this time by no means a typical size for EMT reconstructions, we list the corresponding performance data to show that such a small cluster can indeed reconstruct a very large volume within a reasonable amount of time (~30 minutes total).
Performance of reconstructing EMT volumes of various dimensions by 10 iterations of SIRT using a GPU cluster containing three computing nodes with a total of 10 GTX295 GPUs.
It should also be noted that the system overhead exceeds the reconstruction time when the volume of 2048² × 512 voxels is computed from projections of 2048² × 122 pixels. This suggests that further effort on improving the system performance should be balanced between expanding the computing power and reducing the system overhead. Since Table 1 indicates that the network operations contribute a major portion of the system overhead, a higher-speed network, such as 10 Gb Ethernet or InfiniBand, is worth considering.