An Acceleration Method for GPU-Based Volume Rendering by Localizing Texture Memory Reference

Abstract

This paper presents a cache-aware method for accelerating texture-based volume rendering on the graphics processing unit (GPU). Because GPUs have a hierarchical architecture in terms of processing and memory units, cache optimization is important to maximize their effective performance for this kind of memory-intensive application. To accomplish this, our method localizes texture memory references according to the location of the viewpoint. The key idea for this localization is to dynamically select the width and height of thread blocks (TBs) such that each warp, which is a group of 32 threads processed simultaneously on the GPU, minimizes the stride of memory access. We also incorporate transposed indexing of threads to perform TB-level cache optimization for specific viewpoints. Furthermore, we maximize the TB size so that the spatial locality can be exploited with fewer active TBs. For relatively large strides, we synchronize threads of the same TB at regular intervals to realize synchronous ray propagation. In experiments using a GeForce GTX 580 card, we find that our cache-aware method doubles the worst-case rendering performance compared with the original implementations provided by the CUDA and OpenCL software development kits (SDKs).
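
The viewpoint-dependent choice of thread-block shape can be illustrated with a minimal CUDA sketch. The heuristic below is an assumption for illustration only: the helper names (chooseBlockDim, raymarchKernel), the specific 32x8 / 8x32 / 16x16 block shapes, and the dominant-axis test are hypothetical and do not reproduce the paper's actual selection algorithm, transposed indexing, or synchronization scheme.

    // Minimal sketch (not the authors' implementation): pick thread-block
    // dimensions from the dominant viewing axis so that the 32 threads of a
    // warp sample nearby texels, i.e., with a small memory-access stride.
    #include <cuda_runtime.h>
    #include <cmath>

    static dim3 chooseBlockDim(float3 viewDir)
    {
        float ax = fabsf(viewDir.x), ay = fabsf(viewDir.y), az = fabsf(viewDir.z);
        if (az >= ax && az >= ay) return dim3(32, 8, 1);  // view roughly along z
        if (ax >= ay)             return dim3(8, 32, 1);  // view roughly along x
        return dim3(16, 16, 1);                           // view roughly along y
    }

    // Simplified ray-marching kernel over a 3D texture; each thread just
    // accumulates samples along its ray (compositing and transfer functions
    // are omitted for brevity).
    __global__ void raymarchKernel(cudaTextureObject_t volume, float* out,
                                   int width, int height,
                                   float3 rayStep, int numSteps)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;

        float3 pos = make_float3((float)x / width, (float)y / height, 0.0f);
        float acc = 0.0f;
        for (int i = 0; i < numSteps; ++i) {
            acc += tex3D<float>(volume, pos.x, pos.y, pos.z);
            pos.x += rayStep.x; pos.y += rayStep.y; pos.z += rayStep.z;
            // Inserting __syncthreads() every few steps here would keep the
            // rays of one TB marching in lockstep, which corresponds to the
            // "synchronous ray propagation" idea mentioned in the abstract.
        }
        out[y * width + x] = acc / numSteps;
    }

    // Hypothetical launch:
    //   dim3 block = chooseBlockDim(viewDir);
    //   dim3 grid((W + block.x - 1) / block.x, (H + block.y - 1) / block.y);
    //   raymarchKernel<<<grid, block>>>(volTex, d_out, W, H, step, 256);

The sketch only captures the general principle: a wide block aligns warps with rows of a slice when rays advance along the volume's depth axis, while a tall block keeps warp accesses compact when the viewing direction is rotated, so that texture-cache lines are reused within a warp.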

Details

  • CRID
    1570291227981029888
  • NII Article ID
    110009536433
  • NII Bibliographic ID
    AN10463942
  • Text Language Code
    en
  • Data Source Type
    • CiNii Articles
