Document Separation between Native English and Nonnative English Using Long POS Strings

  • Yukino Kensei
    Department of Intelligent Systems, Graduate School of Information Science and Electrical Engineering, Kyushu University : Doctoral Program
  • Aoki Sayaka
    Department of Intelligent Systems, Graduate School of Information Science and Electrical Engineering, Kyushu University : Master's Program
  • Tanigawa Ryuji
    Department of Intelligent Systems, Graduate School of Information Science and Electrical Engineering, Kyushu University : Master's Program
  • Tomiura Yoichi
    Department of Intelligent Systems, Faculty of Information Science and Electrical Engineering, Kyushu University

Bibliographic Information

Other Title
  • 長い品詞列を文書特徴とした母語話者英文書・非母語話者英文書の判別
  • ナガイ ヒンシレツ オ ブンショ トクチョウ ト シタ ボゴワシャ エイブンショ ヒボゴワシャ エイブンショ ノ ハンベツ

Abstract

We propose using long, low-frequency part-of-speech (POS) strings to separate native English documents from non-native English documents. Previous work ignored long POS strings because their frequencies in training data are too small to estimate their probabilities reliably. Meanwhile, a study on language identification showed that long, low-frequency byte strings are useful for distinguishing between similar languages. Language identification and native/non-native document separation have some similarities, although POS tags and bytes are different units: for example, long POS strings are more peculiar to one class than short ones. We can therefore expect higher accuracy from using long, low-frequency POS strings. This paper describes experiments showing that the proposed method achieves higher accuracy than previous methods.
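The idea that long POS strings tend to be peculiar to one class can be illustrated with a minimal sketch. This is not the authors' method, which the abstract does not specify: the function names, the length range, the use of class-exclusive strings, and the simple hit-count decision rule are all assumptions made for illustration, and the input is assumed to be documents that have already been POS-tagged.

```python
def pos_ngrams(tags, min_n, max_n):
    """Return the set of contiguous POS strings of length min_n..max_n."""
    grams = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(tags) - n + 1):
            grams.add(tuple(tags[i:i + n]))
    return grams


def peculiar_strings(native_docs, nonnative_docs, min_n=3, max_n=4):
    """Collect POS strings seen in only one class of training documents.

    Assumption: a string is treated as peculiar to a class if it occurs
    at least once in that class and never in the other.
    """
    native, nonnative = set(), set()
    for tags in native_docs:
        native |= pos_ngrams(tags, min_n, max_n)
    for tags in nonnative_docs:
        nonnative |= pos_ngrams(tags, min_n, max_n)
    return native - nonnative, nonnative - native


def classify(tags, only_native, only_nonnative, min_n=3, max_n=4):
    """Label a tagged document by counting class-exclusive POS strings.

    Hypothetical decision rule: the class with more hits wins; a tie
    (including zero hits on both sides) yields "unknown".
    """
    grams = pos_ngrams(tags, min_n, max_n)
    native_hits = len(grams & only_native)
    nonnative_hits = len(grams & only_nonnative)
    if native_hits == nonnative_hits:
        return "unknown"
    return "native" if native_hits > nonnative_hits else "nonnative"
```

With a realistic corpus, most long strings occur only once or twice, which is why probability-based models struggle with them; the set-based rule above sidesteps probability estimation entirely, at the cost of ignoring how strongly each string is associated with its class.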

