大量頻度分布データのための高速探索アルゴリズム  [in Japanese] An Efficient Search Algorithm for Large Distribution Sets  [in Japanese]

Access this Article

Search this Article

Abstract

分布データは気象,産業,金融をはじめとする様々な分野で発生している.本論文では,与えられた問合せデータに対して,複数の分布データの中から類似データを探索する問題を対象とする.本論文では,分布データの探索手法として D-Search を提案する.D-Search は (1) SVD を利用した KL 情報量の近似計算を高速に行い,(2) 分布データを様々な近似の粒度で探索し,類似度の低い分布を高速に枝刈りする.本研究ではさらに D-Search の拡張版として,(3) 時系列分布データの類似探索手法を提案し,時系列分布データの中から任意の長さの類似部分シーケンスを探索する問題を扱う.様々な実データを用いた実験を行い,D-Search が時系列分布データから正確に部分シーケンスを検出し,そしてナイーブな手法と比較して大幅な性能向上を達成していることを明らかにした.Distribution data naturally arise in countless domains, such as meteorology, biology, geology, industry and economics. However, relatively little attention has been paid to data mining for large distribution sets. Given n distributions of multiple categories and a query distribution Q, we want to find similar clouds (i.e., distributions), to discover patterns, rules and outlier clouds. We propose D-Search, an efficient algorithm for similarity search in large distribution datasets. Our main contributions are (1) approximate KL divergence, which can speed up cloud-similarity computations by using SVD coefficients, (2) multi-step sequential scan, which efficiently prunes a significant number of search candidates and leads to a direct reduction in the search cost. We also introduce an extended version of D-Search: (3) timeseries distribution mining, which uses the D-Search algorithm and finds similar subsequences in time-series distribution datasets. Extensive experiments on real multi-dimensional datasets show D-Search is significantly faster than the naive method.

Distribution data naturally arise in countless domains, such as meteorology, biology, geology, industry and economics. However, relatively little attention has been paid to data mining for large distribution sets. Given n distributions of multiple categories and a query distribution Q, we want to find similar clouds (i.e., distributions), to discover patterns, rules and outlier clouds. We propose D-Search, an efficient algorithm for similarity search in large distribution datasets. Our main contributions are (1) approximate KL divergence, which can speed up cloud-similarity computations by using SVD coefficients, (2) multi-step sequential scan, which efficiently prunes a significant number of search candidates and leads to a direct reduction in the search cost. We also introduce an extended version of D-Search: (3) timeseries distribution mining, which uses the D-Search algorithm and finds similar subsequences in time-series distribution datasets. Extensive experiments on real multi-dimensional datasets show D-Search is significantly faster than the naive method.

Journal

  • 情報処理学会論文誌データベース(TOD)

    情報処理学会論文誌データベース(TOD) 2(3), 29-40, 2009-09-30

    情報処理学会

Keywords

Codes

  • NII Article ID (NAID)
    110007990052
  • NII NACSIS-CAT ID (NCID)
    AA11464847
  • Text Lang
    JPN
  • Article Type
    Article
  • ISSN
    1882-7799
  • NDL Article ID
    024302752
  • NDL Call No.
    YH247-812
  • Data Source
    NDL  NII-ELS  IPSJ 
Page Top