An Indexing Method for Table Structures of HTML Format (original title: WWW空間上に存在する表構造の一索引化手法)

Authors

    • 岩口 義広 IWAGUCHI Yoshihiro
    • Dept. of Information Science & Intelligent Systems, Faculty of Engineering, Tokushima University
    • 鄭珉洙 JUNG Minsoo
    • Dept. of Information Science & Intelligent Systems, Faculty of Engineering, Tokushima University
    • 青江 順一 AOE Jun-ichi
    • Dept. of Information Science & Intelligent Systems, Faculty of Engineering, Tokushima University

Abstract

Internet technology has developed rapidly in recent years, and an enormous number of HTML documents now exist on the WWW. Many of these documents contain tables, which are used to convey structured information concisely. A table carries very useful information, such as the relations between words along its rows and columns and the meanings of those words. Conventional Internet search engines, however, strip the HTML tags that express the table structure and index each item as part of a flat sequence of words, so the relations between items that are explicit in the table cannot be reflected in retrieval. In this paper, we propose a method for indexing the items of HTML tables on the WWW while preserving the relations within the table structure. The method uses a segmentation technique based on a complete binary tree, improved so that it can analyze complex tables, and represents the position of each item in the table as a compact bit string. In this bit string the odd-numbered bits express the row relation of an item and the even-numbered bits express the column relation, so positional relations along rows and columns can be compared quickly. In a comparative experiment on 200 table structures (4,836 items) collected by hand from the WWW, against a method that indexes the row and column coordinates of each item, the index produced by the proposed method was 87% smaller and comparison of the positional relations between items was about 5.4 times faster.
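The abstract does not spell out the exact encoding, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a table padded to 2^depth by 2^depth cells that is halved alternately by row and by column, with one bit recorded per split. The names `encode_position`, `same_row`, `same_column` and the `depth` parameter are hypothetical, and the paper's improved segmentation for complex tables (for example, merged cells) is not modelled.

```python
from typing import Tuple


def encode_position(row: int, col: int, depth: int) -> int:
    """Interleave `depth` bits of row and col, row bit first.

    Counting bit positions from the most significant end starting at 1,
    odd positions carry row bits and even positions carry column bits.
    This layout is an assumption made for illustration.
    """
    code = 0
    for level in range(depth - 1, -1, -1):
        code = (code << 1) | ((row >> level) & 1)  # odd position: row split
        code = (code << 1) | ((col >> level) & 1)  # even position: column split
    return code


def masks(depth: int) -> Tuple[int, int]:
    """Return (row_mask, col_mask) selecting the odd and even positions."""
    row_mask = 0
    for _ in range(depth):
        row_mask = (row_mask << 2) | 0b10
    return row_mask, row_mask >> 1


def same_row(a: int, b: int, depth: int) -> bool:
    """Two items share a row when all their row bits agree."""
    row_mask, _ = masks(depth)
    return (a ^ b) & row_mask == 0


def same_column(a: int, b: int, depth: int) -> bool:
    """Two items share a column when all their column bits agree."""
    _, col_mask = masks(depth)
    return (a ^ b) & col_mask == 0


# Hypothetical 4x4 table (depth = 2): a header cell and a data cell below it.
header = encode_position(0, 2, depth=2)
value = encode_position(3, 2, depth=2)
print(same_column(header, value, depth=2))  # True
print(same_row(header, value, depth=2))     # False
```

Under these assumptions, a row or column check is one XOR and one AND on a single integer, which gives a feel for why comparing bit strings can be faster than comparing separately stored coordinate pairs.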

Published in

  • 情報処理学会研究報告情報学基礎(FI) 2001(20(2000-FI-061)), 159-166, 2001-03-05

    Information Processing Society of Japan

References: 11

Cited by: 4

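Codes

  • NII Article ID (NAID)
    110002934310
  • NII Bibliographic ID (NCID)
    AN10114171
  • Text language code
    JPN
  • Material type
    Technical Report
  • ISSN
    09196072
  • NDL Article ID
    5747975
  • NDL journal classification
    ZM13 (Science and technology -- General science and technology -- Data processing and computers)
  • NDL call number
    Z14-1121
  • Data sources
    CJP bibliography, CJP citations, NDL, NII-ELS, IPSJ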