On the Convergence of Stochastic Iterative Dynamic Programming Algorithms

- Tommi Jaakkola, Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139 USA
- Michael I. Jordan, Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139 USA
- Satinder P. Singh, Department of Computer Science, University of Massachusetts, Amherst, MA 01003 USA
Abstract
Recent developments in the area of reinforcement learning have yielded a number of new algorithms for the prediction and control of Markovian environments. These algorithms, including the TD(λ) algorithm of Sutton (1988) and the Q-learning algorithm of Watkins (1989), can be motivated heuristically as approximations to dynamic programming (DP). In this paper we provide a rigorous proof of convergence of these DP-based learning algorithms by relating them to the powerful techniques of stochastic approximation theory via a new convergence theorem. The theorem establishes a general class of convergent algorithms to which both TD(λ) and Q-learning belong.
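For orientation, the Q-learning update analyzed in the paper performs a sampled version of the dynamic-programming value backup. Below is a minimal tabular sketch, assuming a Gymnasium-style discrete environment (`reset`/`step`); the function name `q_learning`, the ε-greedy exploration rule, and the constant step size `alpha` are illustrative assumptions, not the paper's construction. Note that the convergence theorem requires a decaying step-size schedule with Σ αₜ = ∞ and Σ αₜ² < ∞, which a fixed `alpha` does not satisfy.

```python
import numpy as np

def q_learning(env, num_episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning (Watkins, 1989): a stochastic iterative
    approximation to the DP backup
        Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a)).
    Assumes `env` has discrete observation and action spaces and follows
    the Gymnasium API (reset/step) -- an assumption made for illustration.
    """
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(num_episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection: explore with probability epsilon
            if np.random.rand() < epsilon:
                a = env.action_space.sample()
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # terminal transitions contribute no bootstrapped future value
            target = r + (0.0 if terminated else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```

On a small benchmark such as `gymnasium.make("FrozenLake-v1")`, the table Q approaches the optimal action values as episodes accumulate, provided the step sizes are annealed per the stochastic-approximation conditions above.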
Published in

- Neural Computation 6 (6), 1185-1201, November 1994
- MIT Press - Journals
Details
- CRID: 1361981468552361600
- NII Article ID: 30036176680
- ISSN: 1530-888X, 0899-7667
- Data sources: Crossref, CiNii Articles