Infinite-horizon policy-gradient estimation
Baxter, J. & Bartlett, P. L. (2001) Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15, 319-350 .
Gradient-based approaches to direct policy search in reinforcement learning have received much recent attention as a means to solve problems of partial observability and to avoid some of the problems associated with policy degradation in value-function methods. In this paper we introduce GPOMDP, a simulation-based algorithm for generating a biased estimate of the gradient of the average reward in Partially Observable Markov Decision Processes (POMDPs) controlled by parameterized stochastic policies. A similar algorithm was proposed by Kimura, Yamamura, and Kobayashi (1995). The algorithm's chief advantages are that it requires storage of only twice the number of policy parameters, uses one free parameter β ∈ [0,1) (which has a natural interpretation in terms of bias-variance trade-off), and requires no knowledge of the underlying state. We prove convergence of GPOMDP, and show how the correct choice of the parameter β is related to the mixing time of the controlled POMDP. We briefly describe extensions of GPOMDP to controlled Markov chains, continuous state, observation and control spaces, multiple-agents, higher-order derivatives, and a version for training stochastic policies with internal states. In a companion paper (Baxter, Bartlett, & Weaver, 2001) we show how the gradient estimates generated by GPOMDP can be used in both a traditional stochastic gradient algorithm and a conjugate-gradient procedure to find local optima of the average reward. ©2001 AI Access Foundation and Morgan Kaufmann Publishers. All rights reserved.
Impact and interest:
Citation countsare sourced monthly fromand citation databases.
These databases contain citations from different subsets of available publications and different time periods and thus the citation count from each is usually different. Some works are not in either database and no count is displayed. Scopus includes citations from articles published in 1996 onwards, and Web of Science® generally from 1980 onwards.
Citations counts from theindexing service can be viewed at the linked Google Scholar™ search.
|Item Type:||Journal Article|
|Keywords:||Algorithms, Computational methods, Markov processes, Multi agent systems, Problem solving, Random processes, Gradient-based approaches, Policy parameters, Value-function methods, Learning systems, OAVJ|
|Subjects:||Australian and New Zealand Standard Research Classification > MATHEMATICAL SCIENCES (010000) > APPLIED MATHEMATICS (010200)|
Australian and New Zealand Standard Research Classification > INFORMATION AND COMPUTING SCIENCES (080000) > ARTIFICIAL INTELLIGENCE AND IMAGE PROCESSING (080100)
|Divisions:||Past > QUT Faculties & Divisions > Faculty of Science and Technology|
Past > Schools > Mathematical Sciences
|Copyright Owner:||AI Access Foundation, Inc.|
|Deposited On:||12 Aug 2011 12:55|
|Last Modified:||12 Aug 2011 12:55|
Repository Staff Only: item control page