Proposal of a Semiautomatic Classification Method for Systematization of Large-scale Text Data Based on Machine Learning (大規模テキストデータの分類体系化のための機械学習に基づく半自動分類法の提案) [in Japanese]

Author(s)

    • 下村 良 SHIMOMURA Ryo
      Department of Industrial and Management Systems Engineering, School of Creative Science and Engineering, Waseda University
    • 三川 健太 MIKAWA Kenta
      Department of Industrial and Management Systems Engineering, School of Creative Science and Engineering, Waseda University
    • 後藤 正幸 GOTO Masayuki
      Department of Industrial and Management Systems Engineering, School of Creative Science and Engineering, Waseda University

Abstract

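With the recent spread of information technology, companies have become able to accumulate large volumes of text data. Because a variety of information can potentially be extracted from such data, efficient methods for analyzing it are in demand. Structuring the data is one way to grasp its information efficiently, and various methods have already been proposed; however, because all of the work is done by hand, these methods cannot be applied to large-scale text data containing an enormous number of items. In this study, we therefore propose a method that supports the efficient analysis of large-scale text data by combining manual classification with automatic document classification techniques that can handle large-scale text data. We also apply the method to real data held by a company involved in software development and demonstrate its effectiveness.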

Because computers are now used in every process of every department, many companies store enormous amounts of operational text data in digital form. Such data contains much information that is useful to employees, and using it effectively is important for a company's development; in practice, however, this valuable information often goes unused. The volume of text is sometimes too large to work with directly, and even analysts who spend a great deal of time on it may be unable to analyze all of it. Clustering, that is, grouping items by similarity and naming each group to provide category information, is generally an effective way to grasp the overall tendency of the data and to systematize many ideas. Manual analysis of this kind, however, is feasible only for small volumes of data in which analysts can see every item or idea, so it is difficult to apply to the large-scale text data stored in companies. If clustering and naming by analysts could be extended to enormous amounts of text data, it would be useful for extracting valuable information. In this study, we propose a new method that combines manual clustering with text classification in order to analyze effectively the large-scale digital data stored in a company. The first step of the proposed method is to assign category information by hand to sample data selected randomly from all the text data. The next step is to train classifiers on this sample data and use them to classify the rest of the data automatically. With the proposed method, enormous amounts of text data can be systematized while only a small sample set is analyzed by hand. To verify its effectiveness, the proposed method is applied, as a case study, to large-scale text data stored in a company.
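
The two steps described in the abstract (hand-label a random sample, then train classifiers on it and classify the remaining data) can be illustrated with a minimal sketch. The abstract does not say which classifier or features the authors used, so everything here is an assumption for illustration: a multinomial Naive Bayes classifier with add-one smoothing stands in for the classifier, whitespace splitting stands in for tokenization, and the names `tokenize`, `train_naive_bayes`, `classify`, `semi_automatic_classification`, and `label_by_hand` (a callback representing the human analyst) are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: whitespace tokenization. Japanese text would need a
    # morphological analyzer here instead.
    return text.lower().split()


def train_naive_bayes(labeled_docs):
    """Fit multinomial Naive Bayes counts from (text, category) pairs."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, category in labeled_docs:
        doc_counts[category] += 1
        for token in tokenize(text):
            word_counts[category][token] += 1
            vocab.add(token)
    return doc_counts, word_counts, vocab


def classify(model, text):
    """Return the category with the highest log posterior for `text`."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in sorted(doc_counts):  # sorted so ties break deterministically
        score = math.log(doc_counts[category] / total_docs)
        denom = sum(word_counts[category].values()) + len(vocab)
        for token in tokenize(text):
            if token in vocab:  # skip words never seen in the labeled sample
                # Add-one (Laplace) smoothing avoids log(0).
                score += math.log((word_counts[category][token] + 1) / denom)
        if score > best_score:
            best_category, best_score = category, score
    return best_category


def semi_automatic_classification(documents, sample_size, label_by_hand, seed=0):
    """Step 1: label a random sample by hand. Step 2: classify the rest."""
    rng = random.Random(seed)
    sample_ids = rng.sample(range(len(documents)), sample_size)
    # Step 1: the analyst (here a callback) assigns categories to the sample.
    labels = {i: label_by_hand(documents[i]) for i in sample_ids}
    model = train_naive_bayes([(documents[i], labels[i]) for i in sample_ids])
    # Step 2: every document outside the sample is classified automatically.
    for i, doc in enumerate(documents):
        if i not in labels:
            labels[i] = classify(model, doc)
    return labels
```

In this sketch the manual effort scales with `sample_size` rather than with the size of the whole collection, which is the property the abstract emphasizes. How well the automatically assigned categories match an analyst's judgment depends on the sample and the classifier, and the sketch makes no claim about that.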

Journal

  • Journal of Japan Industrial Management Association

    Journal of Japan Industrial Management Association 65(2), 51-60, 2014

    Japan Industrial Management Association

Codes

  • NII Article ID (NAID)
    130004684284
  • NII NACSIS-CAT ID (NCID)
    AN10561806
  • Text Lang
    JPN
  • ISSN
    1342-2618
  • NDL Article ID
    025698554
  • NDL Call No.
    Z4-298
  • Data Source
    NDL, J-STAGE