Document Separation between Native English and Nonnative English Using Long POS Strings
-
- Yukino Kensei
- Department of Intelligent Systems, Graduate School of Information Science and Electrical Engineering, Kyushu University : Doctoral Program
-
- Aoki Sayaka
- Department of Intelligent Systems, Graduate School of Information Science and Electrical Engineering, Kyushu University : Master's Program
-
- Tanigawa Ryuji
- Department of Intelligent Systems, Graduate School of Information Science and Electrical Engineering, Kyushu University : Master's Program
-
- Tomiura Yoichi
- Department of Intelligent Systems, Faculty of Information Science and Electrical Engineering, Kyushu University
Bibliographic Information
- Other Title
-
- 長い品詞列を文書特徴とした母語話者英文書・非母語話者英文書の判別
- ナガイ ヒンシレツ オ ブンショ トクチョウ ト シタ ボゴワシャ エイブンショ ヒボゴワシャ エイブンショ ノ ハンベツ
Abstract
We propose using long, low-frequency part-of-speech (POS) strings to separate native English documents from non-native English documents. Previous work ignored long POS strings because they occur too rarely in training data for their probabilities to be estimated. However, a study on language identification showed that long, low-frequency byte strings are useful for distinguishing between similar languages. Although POS tags and bytes differ, language identification and the separation of native from non-native English documents share some similarities; for example, long POS strings are more peculiar to one class than short ones. We can therefore expect higher accuracy from using long, low-frequency POS strings. We describe experiments showing that the proposed method achieves higher accuracy than previous methods.
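The idea in the abstract can be illustrated with a minimal sketch. This is not the authors' method, whose details are not given in this record. It assumes documents are already POS-tagged, and the function names (`pos_ngrams`, `train`, `classify`), the length range, and the voting rule are all hypothetical choices made for illustration.

```python
from collections import Counter


def pos_ngrams(tags, n_min, n_max):
    """Yield every contiguous POS string of length n_min..n_max as a tuple."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tags) - n + 1):
            yield tuple(tags[i:i + n])


def train(docs_by_class, n_min=5, n_max=8):
    """For each class, count how many training documents contain each POS string.

    docs_by_class maps a class label to a list of documents, where each
    document is a list of POS tags (assumed to come from an external tagger).
    """
    counts = {}
    for label, docs in docs_by_class.items():
        counter = Counter()
        for tags in docs:
            counter.update(set(pos_ngrams(tags, n_min, n_max)))
        counts[label] = counter
    return counts


def classify(tags, counts, n_min=5, n_max=8):
    """Vote using POS strings that were seen in exactly one class in training.

    Assumed rule for this sketch only: each class-exclusive string casts one
    vote for its class, regardless of how rarely it occurred.
    """
    votes = Counter()
    for gram in set(pos_ngrams(tags, n_min, n_max)):
        owners = [label for label, counter in counts.items() if gram in counter]
        if len(owners) == 1:
            votes[owners[0]] += 1
    return votes.most_common(1)[0][0] if votes else None


# Toy usage with short strings (lengths 3 to 4) so the example stays small.
toy_counts = train(
    {
        "native": [["DT", "NN", "VBZ", "DT", "JJ", "NN"]],
        "nonnative": [["NN", "VBZ", "JJ", "NN", "NN", "IN"]],
    },
    n_min=3,
    n_max=4,
)
toy_label = classify(["PRP", "VBZ", "DT", "JJ", "NN"], toy_counts, n_min=3, n_max=4)
```

In this sketch a string counts as evidence even if it appears in only one training document, which mirrors the abstract's point that low-frequency strings need not be discarded; a real system would likely weight or filter these votes.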
Journal
-
- 九州大学大学院システム情報科学紀要
-
九州大学大学院システム情報科学紀要 11 (2), 115-119, 2006-09-26
Faculty of Information Science and Electrical Engineering, Kyushu University
Details
-
- CRID
- 1390853649773708672
-
- NII Article ID
- 110005207704
-
- NII Book ID
- AN10569524
-
- DOI
- 10.15017/1516865
-
- ISSN
- 21880891
- 13423819
-
- HANDLE
- 2324/1516865
-
- NDL BIB ID
- 8536634
-
- Text Lang
- ja
-
- Data Source
-
- JaLC
- IRDB
- NDL
- CiNii Articles
-
- Abstract License Flag
- Allowed