Applying Outlier Detection Focused on the Generalization Performance of Topic Models for Multi-Label Classification of Security Reports

長田 侑樹, 瀧田 愼, 古本 啓祐, 白石 善明, 高橋 健志, 毛利 公美, 髙野 泰洋, 森井 昌克

Security reports are labeled differently by each publisher, which makes it hard for an organization to find the reports relevant to it. To extract the desired information from the daily growing volume of security reports, the reports need to be classified and labeled with high accuracy. However, when a corpus contains many documents whose content differs greatly from the rest, clustering accuracy may decrease. In this paper, with the aim of improving document clustering accuracy, we propose treating documents whose content differs from the others as outlier documents and constructing the topic model with those outliers excluded. As a case study, we applied the proposed method to 2,386 security reports published by eight security vendors from 2017 to 2019, and confirmed that taking Perplexity, an evaluation measure for topic models, into account improves the generalization performance of the topic model. We also confirmed that document clustering accuracy improves as generalization performance improves.
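
For readers unfamiliar with the measure, the following is the definition of held-out perplexity commonly used for topic models. The abstract does not state which exact formula the authors used, so this is an assumption given for orientation only; the symbols $D_{\text{test}}$, $M$, $\mathbf{w}_d$, and $N_d$ are introduced here and do not come from the abstract.

```latex
% Assumed standard definition; not taken from the abstract.
% D_test : held-out set of M documents
% w_d    : the words of document d
% N_d    : the number of words in document d
\[
  \mathrm{Perplexity}(D_{\text{test}})
  = \exp\!\left(
      - \frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}
             {\sum_{d=1}^{M} N_d}
    \right)
\]
```

Under this definition, a lower value means the model assigns higher probability to unseen documents, which is the sense in which perplexity is read as a measure of generalization performance.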
