Accelerating UTF-8 String Decoding Using SIMD Instructions

Hiroshi Inoue, Hideaki Komatsu, Toshio Nakatani

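In recent years, UTF-8 encoding, which represents each character in a variable length of 1 to 3 bytes, has been used as a standard representation of text data in many applications such as XML. Meanwhile, many language runtimes, such as the Java virtual machine, use UTF-16 encoding, a fixed-length format of 2 bytes per character, as the internal representation of strings. Consequently, in workloads that handle large amounts of text data, such as web application servers written in Java, converting text data between UTF-8 and UTF-16 can account for a large share of the processing time, and accelerating this conversion is important for improving overall system performance. In this work, we propose a technique that uses SIMD instructions to speed up the decoding of variable-length encoded data, including conversion from UTF-8 to UTF-16. In addition to processing multiple data elements in parallel, the technique reduces the overhead of branch mispredictions at conditional branches, which yields a large performance gain. We implemented the technique using VMX instructions, the SIMD instruction set of the PowerPC architecture, and measured UTF-8 string decoding performance on a variety of text inputs. Compared with an existing method that does not use SIMD instructions, the technique achieved a speedup of more than 10x in simple cases and 2x to 10x on real text data.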

UTF-8 encoding has recently become widely used as a standard format for text data exchange. The Java programming language, however, uses UTF-16 encoding as its internal representation format for text data. As a result, data conversions between UTF-8 and UTF-16 consume a considerable amount of CPU time in workloads that process large amounts of text data, such as web application servers. Hence, accelerating these conversions is important for improving the performance of many applications. In this paper, we present a new technique that uses SIMD instructions to accelerate the decoding of variable-length formats, such as conversion from UTF-8 to UTF-16. The technique achieves higher performance by reducing the overhead of branch mispredictions in addition to exploiting the data parallelism of SIMD instructions. We implemented the technique using VMX instructions of the PowerPC architecture and evaluated its performance in decoding various UTF-8 sequences on a PowerPC 970MP processor. The results show that our technique significantly accelerates UTF-8 decoding compared with the existing method.
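To make the conversion problem concrete, the following is a minimal scalar sketch of UTF-8 to UTF-16 decoding, the kind of byte-at-a-time loop that a SIMD approach would aim to outperform. It is not the authors' implementation and contains no SIMD code; the function name `decode_utf8_to_utf16` is hypothetical. It assumes the 1- to 3-byte sequences mentioned in the abstract, so 4-byte sequences (which would need UTF-16 surrogate pairs) are rejected, and it does not check for overlong encodings or encoded surrogates. Note that the branch taken on each lead byte depends on the data, which is where branch mispredictions can arise on mixed-length text.

```python
def decode_utf8_to_utf16(data: bytes) -> list:
    """Decode 1- to 3-byte UTF-8 sequences into a list of UTF-16 code units.

    Illustrative scalar baseline only: no overlong or surrogate validation,
    and 4-byte sequences are rejected rather than split into surrogate pairs.
    """
    out = []
    i = 0
    n = len(data)

    def continuation(pos: int) -> int:
        # Return the 6 payload bits of a continuation byte (10xxxxxx).
        if pos >= n or (data[pos] & 0xC0) != 0x80:
            raise ValueError("truncated or malformed sequence at byte %d" % pos)
        return data[pos] & 0x3F

    while i < n:
        b0 = data[i]
        if b0 < 0x80:
            # 1-byte sequence: 0xxxxxxx
            out.append(b0)
            i += 1
        elif (b0 & 0xE0) == 0xC0:
            # 2-byte sequence: 110xxxxx 10xxxxxx
            out.append(((b0 & 0x1F) << 6) | continuation(i + 1))
            i += 2
        elif (b0 & 0xF0) == 0xE0:
            # 3-byte sequence: 1110xxxx 10xxxxxx 10xxxxxx
            out.append(((b0 & 0x0F) << 12)
                       | (continuation(i + 1) << 6)
                       | continuation(i + 2))
            i += 3
        else:
            # Stray continuation byte or 4-byte lead byte: out of scope here.
            raise ValueError("unsupported lead byte at byte %d" % i)
    return out


if __name__ == "__main__":
    sample = "aé日"  # one 1-byte, one 2-byte, and one 3-byte character
    print([hex(u) for u in decode_utf8_to_utf16(sample.encode("utf-8"))])
```

For pure ASCII input only the first branch is ever taken, so the branch is easy to predict; text that mixes 1-, 2-, and 3-byte characters switches between the three branches in a data-dependent way.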
