Efficient Policy Learning in Simulators Using the Reset Function

橋本 大世, 鶴岡 慶雅

Reinforcement learning commonly relies on simulators for policy learning, because an agent can collect data faster and more safely in a simulator than in a real environment. Because reinforcement learning proceeds by trial and error, it generally requires a large amount of data, and training often takes a long time even with a simulator. Improving the sample efficiency of policy learning in simulators is therefore crucial for practical applications of reinforcement learning. Although a large body of work aims to improve sample efficiency, few studies exploit the characteristics specific to learning in a simulator, which leaves room for improvement. In this study, we investigate an approach that makes policy learning more efficient by utilizing the reset function that simulators provide. Specifically, we improve learning efficiency by quickly finding trajectories with high cumulative rewards. To this end, we propose a criterion for selecting the states to reset to and a method for avoiding unnecessary data collection. In experiments, we quantitatively verified the effectiveness of the proposed method on a classical task called CartPole and on the video game tasks Pong and Boxing. In addition, we conducted a qualitative analysis of the behavior of the proposed method.
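To make the idea of a simulator reset function concrete, the following is a minimal sketch and not the method proposed in the paper. The abstract does not state the actual selection criterion, so the rule used here (reset to the stored state with the highest cumulative reward so far) is an assumption chosen for illustration. `ChainSim`, `Snapshot`, and `collect` are hypothetical names, and the environment is a toy one-dimensional walk rather than CartPole, Pong, or Boxing.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """A saved simulator state that an episode can later be resumed from."""
    position: int
    steps: int
    ret: float  # cumulative reward collected up to this state


class ChainSim:
    """Hypothetical toy simulator: a walk on 0, 1, 2, ...; reward = displacement."""

    def __init__(self, horizon=30):
        self.horizon = horizon
        self.reset()

    def reset(self, snapshot=None):
        # Reset to the initial state, or to a previously saved snapshot.
        if snapshot is None:
            self.position, self.steps, self.ret = 0, 0, 0.0
        else:
            self.position = snapshot.position
            self.steps = snapshot.steps
            self.ret = snapshot.ret

    def snapshot(self):
        return Snapshot(self.position, self.steps, self.ret)

    def step(self, action):
        # action is -1 (left) or +1 (right); the walk cannot go below 0.
        new_position = max(0, self.position + action)
        reward = float(new_position - self.position)
        self.position = new_position
        self.steps += 1
        self.ret += reward
        done = self.steps >= self.horizon
        return reward, done


def collect(sim, episodes, use_reset, rng, archive_size=16):
    """Run a random policy and return the best cumulative reward found.

    With use_reset=True, each episode resumes from the archived snapshot
    with the highest cumulative reward (an assumed, illustrative criterion).
    """
    archive = []
    best_return = float("-inf")
    for _ in range(episodes):
        if use_reset and archive:
            sim.reset(max(archive, key=lambda s: s.ret))
        else:
            sim.reset()
        done = False
        while not done:
            _, done = sim.step(rng.choice((-1, 1)))
            if not done:
                # Only non-terminal states are useful as reset targets.
                archive.append(sim.snapshot())
        best_return = max(best_return, sim.ret)
        # Keep the archive small by retaining only the top snapshots.
        archive = sorted(archive, key=lambda s: s.ret, reverse=True)[:archive_size]
    return best_return


if __name__ == "__main__":
    for use_reset in (False, True):
        best = collect(ChainSim(), episodes=50, use_reset=use_reset, rng=random.Random(0))
        print(f"use_reset={use_reset}: best return {best}")
```

The sketch covers only the reset-state selection step; the abstract's second component, avoiding unnecessary data collection, is not modeled here because the source gives no detail on how it works.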
