Bayesian Variable Order n-gram Language Model Based on Hierarchical Pitman-Yor Processes (階層Pitman-Yor過程に基づく可変長n-gram言語モデル)
- MOCHIHASHI Daichi, ATR Spoken Language Communication Research Laboratories
- SUMITA Eiichiro, ATR Spoken Language Communication Research Laboratories
Abstract
This paper proposes a variable order n-gram language model by extending the hierarchical Pitman-Yor process, a hierarchical generative model of n-gram distributions, so that the order of the hidden Markov process from which each word was generated is inferred automatically and the appropriate context is used for prediction. By introducing a stochastic process on a prediction suffix tree of infinite depth, the model can stochastically discover phrases and learn appropriate context lengths; this makes it possible to train high-order n-grams that were previously infeasible. Beyond language modeling, the method is a variable order generative model that can estimate the order of general Markov models from data. Experiments on standard English and Japanese corpora confirmed the validity and efficiency of the proposed model.
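As an illustrative aside on the model class the abstract describes: in the hierarchical Pitman-Yor language model, each n-gram context keeps a Pitman-Yor "Chinese restaurant" whose base (back-off) distribution is the restaurant of the shorter parent context. The sketch below is a minimal single-restaurant implementation, not the paper's code; the class name `PitmanYorRestaurant` is hypothetical, and `d` (discount) and `theta` (strength) are generic hyperparameters. It shows the two ingredients the paper builds on: a predictive probability that interpolates observed counts with the base distribution, and the seating of customers at tables.

```python
import random
from collections import defaultdict

class PitmanYorRestaurant:
    """Chinese-restaurant-process view of a single Pitman-Yor node.

    In the hierarchical model, `base_prob` would come from the parent
    context's restaurant; here it is passed in explicitly.
    """

    def __init__(self, d=0.5, theta=1.0):
        self.d = d                        # discount parameter (0 <= d < 1)
        self.theta = theta                # strength parameter (theta > -d)
        self.tables = defaultdict(list)   # word -> per-table customer counts
        self.total = 0                    # total customers in the restaurant
        self.num_tables = 0               # total tables across all words

    def prob(self, word, base_prob):
        """Predictive probability of `word`, interpolating with base_prob."""
        if self.total == 0:
            return base_prob
        count = sum(self.tables[word])    # customers eating `word`
        t_w = len(self.tables[word])      # tables serving `word`
        p_old = (count - self.d * t_w) / (self.theta + self.total)
        p_new = (self.theta + self.d * self.num_tables) / (self.theta + self.total)
        return p_old + p_new * base_prob

    def add_customer(self, word, base_prob):
        """Seat a customer: existing table k w.p. proportional to (c_k - d),
        new table w.p. proportional to (theta + d * num_tables) * base_prob."""
        weights = [max(c - self.d, 0.0) for c in self.tables[word]]
        w_new = (self.theta + self.d * self.num_tables) * base_prob
        r = random.uniform(0.0, sum(weights) + w_new)
        for k, w in enumerate(weights):
            r -= w
            if r <= 0:
                self.tables[word][k] += 1
                break
        else:                             # no existing table chosen: open one
            self.tables[word].append(1)
            self.num_tables += 1
        self.total += 1
```

Chaining such restaurants along a suffix tree of contexts, each node backing off to its parent, yields the hierarchical model; the paper's contribution is inferring, per word, how deep in that tree the generating context lies.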
Published in
- 情報処理学会論文誌 (IPSJ Journal) 48(12), 4023-4032, 2007-12-15
- Information Processing Society of Japan
References: showing 1-29 of 29

1. SHANNON, C. E. A mathematical theory of communication. Bell System Technical Journal 27, 379-423, 623-656, 1948. [Cited by 1]
2. KUDO, T. 形態素周辺確率を用いた分かち書きの一般化とその応用 [Generalization of word segmentation using morpheme marginal probabilities and its applications]. 言語処理学会全国大会論文集 (Proc. Annual Meeting of the Association for Natural Language Processing) NLP-2005, 2005. [Cited by 1]
3. STOLCKE, A. Entropy-based Pruning of Back-off Language Models. Proc. DARPA Broadcast News Transcription and Understanding Workshop, 1998. [Cited by 1]
4. PEREIRA, F. Beyond Word N-grams. Proc. 3rd Workshop on Very Large Corpora, 1995. [Cited by 1]
5. COWANS, P. Probabilistic Document Modelling. Ph.D. Thesis, University of Cambridge, 2006. [Cited by 1]
6. TEH, Y. W. A Bayesian Interpretation of Interpolated Kneser-Ney, 2006. [Cited by 1]
7. TEH, Y. W. A Hierarchical Bayesian Language Model based on Pitman-Yor Processes. Proc. COLING/ACL 2006, 2006. [Cited by 1]
8. TEH, Y. W. Hierarchical Dirichlet Processes, 2004. [Cited by 1]
9. GHAHRAMANI, Z. Non-parametric Bayesian Methods. UAI 2005 Tutorial, 2005. [Cited by 1]
10. GILKS, W. R. Markov Chain Monte Carlo in Practice, 1996. [Cited by 1]
11. DAUMÉ, H. III. Fast search for Dirichlet process mixture models. AISTATS 2007, 2007. [Cited by 1]
12. HUANG, C.-R. A Chinese Corpus for Linguistics Research. Proc. COLING 1992, 1992. [Cited by 1]
13. GOODMAN, J. T. A Bit of Progress in Language Modeling, Extended Version, 2001. [Cited by 1]
14. KUDO, T. MeCab: Yet Another Part-of-Speech and Morphological Analyzer. http://mecab.sourceforge.net/ [Cited by 1]
15. STOLCKE, A. SRILM: An Extensible Language Modeling Toolkit. Proc. ICSLP 2002, 2, 901-904, 2002. [Cited by 1]
16. RON, D. The Power of Amnesia. Advances in Neural Information Processing Systems 6, 176-183, 1994. [Cited by 1]
17. LANG, K. NewsWeeder: Learning to filter netnews. Proc. 12th International Conference on Machine Learning, 1995. [Cited by 1]
18. MINKA, T. P. Estimating a Dirichlet distribution. http://research.microsoft.com/〜minka/papers/dirichlet/, 2000. [Cited by 1]
19. PITMAN, J. Combinatorial Stochastic Processes, 2002. [Cited by 1]
20. SIU, M. Variable n-grams and extensions for conversational speech language modeling. IEEE Trans. Speech and Audio Processing 8, 63-75, 2000. [Cited by 3]
21. LEONARDI, F. G. A generalization of the PST algorithm: modeling the sparse nature of protein sequences. Bioinformatics 22(11), 1302-1307, 2006. [Cited by 2]
22. BÜHLMANN, P. Variable Length Markov Chains. The Annals of Statistics 27(2), 480-513, 1999. [Cited by 2]
23. KAWABATA, T. Back-off Method for N-gram Smoothing based on Binomial Posteriori Distribution. ICASSP-96, I, 192-195, 1996. [Cited by 1]
24. KNESER, R. Improved backing-off for m-gram language modeling. Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, 1995, 1, 181-184, 1995. [Cited by 18]
25. PITMAN, J. The Two-Parameter Poisson-Dirichlet Distribution Derived from a Stable Subordinator. Annals of Probability 25(2), 855-900, 1997. [Cited by 4]
26. SLEATOR, D. D. Self-Adjusting Binary Search Trees. JACM 32(3), 652-686, 1985. [Cited by 11]
27. BLEI, D. M. Latent Dirichlet Allocation. Journal of Machine Learning Research 3, 993-1022, 2003. [Cited by 75]
28. CHEN, S. F. An Empirical Study of Smoothing Techniques for Language Modeling. Proc. ACL 1996, 1996. [Cited by 7]
29. WILLEMS, F. M. J. The context-tree weighting method: Basic properties. IEEE Trans. Information Theory IT-41(3), 653-664, 1995. [Cited by 59]