Construction of an OCR Error Correction System Using Statistical Language Models

Koichi Takeuchi (竹内 孔一), Yuji Matsumoto (松本 裕治)

With the recent spread of the Internet, converting printed text into electronic form with OCR has become an increasingly important task. Previous work on OCR error correction for Japanese includes a study that combines the OCR's character candidates with a morphological analyzer trained on a part-of-speech-tagged corpus. However, word frequency distributions and similar statistics change from domain to domain, so a tagged corpus must be prepared for the same domain as the text being corrected, and this is very costly. In addition, many domains have no electronic text at all that could be used for statistical training.

We therefore first built an OCR error correction system that assumes a large amount of electronic text is available for training, and ran correction experiments on text containing randomly generated character substitution errors. Next, for domains with no electronic text, we built and evaluated a system that is trained on OCR output text, errors included. The system consists of a character trigram model, a stochastic morphological analyzer, and a word trigram model.

The models trained on large texts raised the correct character rate of a 90% correct text to 92.9%, and that of a 95% correct text to 96.4%. For the case where no electronic text is available, we ran correction experiments on actual OCR output and show that models trained on that output are effective at correcting its errors.
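The abstract names a character trigram model as one of the system's components but does not describe how it is built or applied. As a minimal sketch only, the code below shows one common way such a model can rank alternative readings of an OCR'd string. The class name `CharTrigramModel`, the function `best_candidate`, the add-one smoothing, the padding symbols, and the toy training strings are all assumptions made for illustration; they are not taken from the paper, which also combines this component with a morphological analyzer and a word trigram model.

```python
import math
from collections import Counter


class CharTrigramModel:
    """Character trigram model with add-one smoothing (illustrative only)."""

    def __init__(self, texts, pad="^", end="$"):
        self.pad, self.end = pad, end
        self.tri = Counter()   # counts of three-character sequences
        self.ctx = Counter()   # counts of the two-character contexts
        self.vocab = set()
        for text in texts:
            s = pad * 2 + text + end
            self.vocab.update(s)
            for i in range(len(s) - 2):
                self.ctx[s[i:i + 2]] += 1
                self.tri[s[i:i + 3]] += 1

    def log_prob(self, text):
        """Sum of smoothed log P(c_i | c_{i-2}, c_{i-1}) over the string."""
        s = self.pad * 2 + text + self.end
        v = len(self.vocab) + 1  # one extra slot for unseen characters
        total = 0.0
        for i in range(len(s) - 2):
            num = self.tri[s[i:i + 3]] + 1
            den = self.ctx[s[i:i + 2]] + v
            total += math.log(num / den)
        return total


def best_candidate(model, candidates):
    """Return the candidate string the model scores highest."""
    return max(candidates, key=model.log_prob)


# Toy training data standing in for the "large texts" of the abstract.
model = CharTrigramModel(["the cat sat", "the hat", "that cat"])
corrected = best_candidate(model, ["tbe cat", "the cat"])
```

Here the reading containing the substitution error (`tbe`) produces trigrams never seen in training, so it scores lower than the clean reading. A practical system would need far more training text and a search over per-character alternatives rather than a short list of whole-string candidates.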
