Semi-supervised Learning for Blog Classification  [in Japanese]

Author(s)

    • 池田 大介 IKEDA Daisuke
      Department of Computational Intelligence and System Science, Tokyo Institute of Technology
    • 奥村 学 OKUMURA Manabu
      Precision and Intelligence Laboratory, Tokyo Institute of Technology

Abstract

Blog classification, for example identifying a blogger's gender or age, has been studied using supervised learning. Labeled blogs are generally expensive to obtain, so it is not always possible to collect enough of them to train an accurate classifier. Unlabeled blogs, by contrast, can be collected easily and in large quantities. In this paper we therefore propose a semi-supervised learning method for blog classification that incorporates unlabeled data into supervised learning. We assume that the entries of a single blog share a common style and content. Based on this assumption, the method solves an auxiliary problem, predicting which blog each entry belongs to, and thereby models the characteristics specific to each blog, such as its writing style and content. Using this information improves accuracy on the target classification problem. We also report experimental results on several classification tasks.
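
The auxiliary problem in the abstract (deciding which blog an entry belongs to, using only unlabeled blogs) can be illustrated with a minimal Python sketch. This is not the authors' model. The term-count profiles, the cosine similarity, the function names, and the toy data are all assumptions chosen for illustration, since the abstract does not specify the features or the learner.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a stand-in for real preprocessing)."""
    return text.lower().split()


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train_auxiliary(unlabeled_blogs):
    """Auxiliary problem (assumed form): one term-count profile per blog."""
    profiles = {}
    for blog_id, entries in unlabeled_blogs.items():
        profile = Counter()
        for entry in entries:
            profile.update(tokenize(entry))
        profiles[blog_id] = profile
    return profiles


def predict_blog(profiles, entry):
    """Answer the auxiliary question: which blog does this entry belong to?"""
    vector = Counter(tokenize(entry))
    return max(sorted(profiles), key=lambda b: cosine(vector, profiles[b]))


def auxiliary_features(profiles, entry):
    """Similarity of an entry to every blog profile, in sorted blog-id order."""
    vector = Counter(tokenize(entry))
    return [cosine(vector, profiles[b]) for b in sorted(profiles)]


# Hypothetical toy data: two unlabeled blogs with made-up identifiers.
unlabeled_blogs = {
    "blog_a": ["lol the game was so fun", "lol cant wait for the next game"],
    "blog_b": ["the quarterly report is due", "meeting about the report today"],
}
profiles = train_auxiliary(unlabeled_blogs)
guess = predict_blog(profiles, "another fun game lol")
features = auxiliary_features(profiles, "another fun game lol")
```

In a setup like this, the vector returned by `auxiliary_features` could be appended to the ordinary features of a labeled blog before training the target classifier (for gender or age, say). Whether that helps depends on the data, and the sketch makes no claim about accuracy.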

Journal

  • IPSJ SIG Notes

    IPSJ SIG Notes 2008(4(2008-NL-183)), 59-66, 2008-01-22

    Information Processing Society of Japan (IPSJ)

Cited by:  1

Codes

  • NII Article ID (NAID)
    110006623475
  • NII NACSIS-CAT ID (NCID)
    AN10115061
  • Text Lang
    JPN
  • Article Type
    Journal Article
  • Data Source
    CJPref, NII-ELS, IPSJ