An Acceleration Method for GPU-Based Volume Rendering by Localizing Texture Memory Reference
Abstract
This paper presents a cache-aware method for accelerating texture-based volume rendering on the graphics processing unit (GPU). Because GPUs have a hierarchical architecture in terms of processing and memory units, cache optimization is important to maximize their effective performance for this kind of memory-intensive application. To accomplish this, our method localizes texture memory references according to the location of the viewpoint. The key idea behind this localization is to dynamically select the width and height of thread blocks (TBs) such that each warp, a series of 32 threads processed simultaneously on the GPU, minimizes the stride of its memory accesses. We also incorporate transposed indexing of threads to perform TB-level cache optimization for specific viewpoints. Furthermore, we maximize the TB size so that spatial locality can be exploited with fewer active TBs. For relatively large strides, we synchronize the threads of the same TB at regular intervals to realize synchronous ray propagation. In experiments using a GeForce GTX 580 card, we find that our cache-aware method doubles the worst-case rendering performance compared with the original implementation provided by the CUDA and OpenCL software development kits (SDKs).
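The stride argument behind the TB-shape and transposed-indexing ideas can be illustrated with a small, hypothetical Python model (this is not the paper's implementation; the block shapes, the row-major slice layout, and the function names are assumptions made for illustration). It maps the 32 threads of a warp to pixel coordinates, optionally with transposed indexing, and measures the largest address stride between consecutive threads:

```python
# Hypothetical model of warp-to-texture-address mapping (illustration only,
# not the authors' code). One texture fetch per pixel; a slice of the volume
# is assumed row-major, so address = y * row_len + x.

WARP_SIZE = 32

def warp_addresses(block_w, block_h, row_len, transposed=False):
    """Addresses touched by the first warp of a block_w x block_h TB."""
    addrs = []
    for t in range(min(WARP_SIZE, block_w * block_h)):
        x, y = t % block_w, t // block_w   # standard CUDA-style indexing
        if transposed:                     # TB-level transposed indexing
            x, y = y, x
        addrs.append(y * row_len + x)
    return addrs

def max_stride(addrs):
    """Largest address jump between consecutive threads of the warp."""
    return max(abs(b - a) for a, b in zip(addrs, addrs[1:]))

# A 32-wide TB walks the fast (x) axis: consecutive addresses, stride 1.
print(max_stride(warp_addresses(32, 8, row_len=512)))                    # 1
# Transposing the same TB walks the slow (y) axis: stride = row_len.
print(max_stride(warp_addresses(32, 8, row_len=512, transposed=True)))   # 512
```

Which mapping is better depends on which texture axis the warp's screen direction falls along for the current viewpoint; selecting the TB shape (and toggling transposition) per viewpoint is what keeps each warp's footprint compact and cache-friendly.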
Journal
- 研究報告ハイパフォーマンスコンピューティング(HPC) 2013 (8), 1-7, 2013-02-14
Details
- CRID: 1570291227981029888
- NII Article ID: 110009536433
- NII Bibliographic ID: AN10463942
- Text language code: en
- Data source type: CiNii Articles