Language Identification Using Low-frequent Byte-strings (低頻度byte 列を活用した言語識別)

Bibliographic Information

Other Title
  • Teihindo byte-retsu o katsuyō shita gengo shikibetsu
  • Language Identification Using Low-frequent Byte-strings
  • Natural language

Abstract

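This paper proposes using low-frequency byte strings as language features for language identification. In typical language identification, the similarity between the byte-string occurrence tendencies of each language and those of the target document is computed, and the language is chosen by comparing these similarities. Conventional methods take as language features the high-frequency byte strings, whose estimated occurrence probabilities are reliable, and have not exploited the less reliable low-frequency byte strings. However, even when infrequent, long byte strings tend strongly to occur in only one particular language and can be expected to contribute greatly to pinpointing the language. To use low-frequency byte strings, a similarity measure is needed that lets even low-frequency byte strings influence the decision sufficiently and that is robust to fluctuations in the estimated occurrence probabilities. The similarity measure proposed in this paper is based on the size of the intersection of byte-string sets and satisfies both conditions. To demonstrate the effectiveness of the proposed method, identification experiments are conducted on two kinds of language sets. The proposed method achieves higher identification accuracy than conventional methods, and its advantage is especially pronounced when distinguishing similar languages and when identifying small documents.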

This paper proposes a language identification method that uses low-frequency byte strings as language features. A typical method identifies the language of a document by choosing the language whose probability distribution of byte strings is most similar to that of the document. Most previous methods, whose similarity measures are based on byte-string frequencies, do not use low-frequency byte strings because their frequency estimates fluctuate. However, many low-frequency byte strings tend to appear in only one particular language and are therefore effective for language identification. A similarity measure that uses both frequent and low-frequency byte strings should be robust to fluctuation in the estimated probabilities and should be sufficiently influenced by the low-frequency byte strings. The similarity measure used in the proposed method is based on the size of the intersection between the byte strings of each language and those of the target document. Two experiments are reported: one on similar languages and one on dissimilar languages. They show that the proposed method is more accurate than the previous methods and is especially advantageous when identifying similar languages or short target documents.
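
The intersection-based idea described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact form of the paper's similarity measure, so this sketch assumes the simplest variant: the raw count of distinct byte strings shared by a language profile and the target document. The function names (`byte_ngrams`, `build_profiles`, `identify`), the maximum length `max_n`, and the UTF-8 encoding are all assumptions made for illustration, not details taken from the paper.

```python
from typing import Dict, Set


def byte_ngrams(data: bytes, max_n: int = 4) -> Set[bytes]:
    """Return the set of distinct byte strings of length 1..max_n in data.

    Only presence is recorded, not frequency, so a byte string seen once
    counts as much as one seen a thousand times.
    """
    return {
        data[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(data) - n + 1)
    }


def build_profiles(training: Dict[str, str], max_n: int = 4) -> Dict[str, Set[bytes]]:
    """Build one byte-string set per language from training text (assumed UTF-8)."""
    return {
        lang: byte_ngrams(text.encode("utf-8"), max_n)
        for lang, text in training.items()
    }


def identify(document: str, profiles: Dict[str, Set[bytes]], max_n: int = 4) -> str:
    """Pick the language whose profile shares the most byte strings with the document.

    Assumption: similarity is the plain intersection size; the paper's
    actual measure may normalise or weight this differently.
    """
    doc_set = byte_ngrams(document.encode("utf-8"), max_n)
    return max(profiles, key=lambda lang: len(doc_set & profiles[lang]))


if __name__ == "__main__":
    # Tiny hypothetical training texts, far smaller than a real corpus.
    profiles = build_profiles({
        "en": "the quick brown fox jumps over the lazy dog",
        "de": "der schnelle braune fuchs springt über den faulen hund",
    })
    print(identify("the lazy fox", profiles))
    print(identify("der faule hund", profiles))
```

Because the profiles are sets rather than frequency tables, a rare but language-specific long byte string contributes to the score in the same way as a common one, and the score does not depend on an unreliable probability estimate. That is the property the abstract asks of the similarity measure.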
