A Splog Filtering Method Based on Copied String Detection

竹田 隆治, 高須 淳宏

Data from CGM (Consumer Generated Media) such as blogs contain consumers' first-hand experiences and candid opinions, and are becoming increasingly important as an information source for analyzing customer needs and verifying the effects of product promotions. However, blogs also include spam content known as "splogs", which are created to promote product sales or to raise the search ranking of particular web sites, and which adversely affect blog retrieval and analysis. This paper focuses on detecting copied content, a characteristic feature of Japanese splogs, and proposes a filtering method based on it. Japanese splogs are often generated mechanically by copying strings from various documents and concatenating them. We therefore propose an algorithm that uses dynamic programming and a suffix array to efficiently detect strings in each blog that also appear in other documents, together with a splog filtering method based on the proportion of the blog occupied by such strings. We also construct a corpus for evaluating filtering performance, show on that corpus that the proposed method achieves high filtering performance, and analyze its characteristics.
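The idea of measuring how much of a blog consists of strings that also appear elsewhere can be illustrated with a small Python sketch. This is not the authors' algorithm: the abstract mentions dynamic programming combined with a suffix array, whereas the code below uses only a naively built suffix array with binary search. The function names, the `min_len` threshold, and the use of a single reference string in place of a document collection are all assumptions made for illustration.

```python
def build_suffix_array(text):
    # Naive construction by sorting all suffixes; adequate for short texts only.
    return sorted(range(len(text)), key=lambda i: text[i:])


def common_prefix_len(a, b):
    # Number of leading characters shared by a and b.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def longest_match(query, text, sa):
    # Length of the longest prefix of `query` that occurs somewhere in `text`.
    # Binary search finds where `query` would be inserted among the sorted
    # suffixes; the best match is one of the two neighbours of that position.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:] < query:
            lo = mid + 1
        else:
            hi = mid
    best = 0
    for idx in (lo - 1, lo):
        if 0 <= idx < len(sa):
            best = max(best, common_prefix_len(query, text[sa[idx]:]))
    return best


def copied_ratio(blog, reference, min_len=5):
    # Fraction of characters in `blog` covered by substrings of length
    # >= min_len that also occur in `reference` (hypothetical threshold).
    if not blog:
        return 0.0
    sa = build_suffix_array(reference)
    covered = [False] * len(blog)
    for i in range(len(blog)):
        m = longest_match(blog[i:], reference, sa)
        if m >= min_len:
            for j in range(i, i + m):
                covered[j] = True
    return sum(covered) / len(blog)
```

Under these assumptions, a blog would be flagged as a likely splog when `copied_ratio` exceeds some cutoff; how that cutoff is chosen is not stated in the abstract.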
