QUT ePrints

Infinite-horizon policy-gradient estimation

Baxter, J., & Bartlett, P. L. (2001). Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15, 319-350.


Abstract

Gradient-based approaches to direct policy search in reinforcement learning have received much recent attention as a means to solve problems of partial observability and to avoid some of the problems associated with policy degradation in value-function methods. In this paper we introduce GPOMDP, a simulation-based algorithm for generating a biased estimate of the gradient of the average reward in Partially Observable Markov Decision Processes (POMDPs) controlled by parameterized stochastic policies. A similar algorithm was proposed by Kimura, Yamamura, and Kobayashi (1995). The algorithm's chief advantages are that it requires storage of only twice the number of policy parameters, uses one free parameter β ∈ [0,1) (which has a natural interpretation in terms of bias-variance trade-off), and requires no knowledge of the underlying state. We prove convergence of GPOMDP, and show how the correct choice of the parameter β is related to the mixing time of the controlled POMDP. We briefly describe extensions of GPOMDP to controlled Markov chains, continuous state, observation and control spaces, multiple agents, higher-order derivatives, and a version for training stochastic policies with internal states. In a companion paper (Baxter, Bartlett, & Weaver, 2001) we show how the gradient estimates generated by GPOMDP can be used in both a traditional stochastic gradient algorithm and a conjugate-gradient procedure to find local optima of the average reward. ©2001 AI Access Foundation and Morgan Kaufmann Publishers. All rights reserved.
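The update the abstract describes (a β-discounted eligibility trace of log-policy gradients, averaged against the observed rewards, with storage of only two parameter-sized vectors) can be sketched as follows. This is a minimal illustration in Python, not the paper's reference implementation; the environment/policy interface and the toy two-armed bandit used to exercise it are assumptions made for the example.

import numpy as np

def gpomdp(sample_action, grad_log_prob, step_env, theta, beta=0.9,
           num_steps=100_000, seed=0):
    # Sketch of the GPOMDP estimator: a beta-discounted eligibility trace z
    # and a running average delta are the only stored vectors, i.e. twice
    # the number of policy parameters, matching the storage claim in the abstract.
    rng = np.random.default_rng(seed)
    z = np.zeros_like(theta)       # eligibility trace of grad log pi
    delta = np.zeros_like(theta)   # running estimate of the average-reward gradient
    obs = 0                        # initial observation (integer-coded in the toy example)
    for t in range(num_steps):
        action = sample_action(obs, theta, rng)
        g = grad_log_prob(obs, action, theta)     # grad_theta log pi(action | obs; theta)
        obs, reward = step_env(obs, action, rng)
        z = beta * z + g                          # beta in [0,1): bias-variance trade-off
        delta += (reward * z - delta) / (t + 1)   # incremental average of reward * trace
    return delta

# Toy usage (hypothetical): a two-armed bandit, i.e. a one-observation POMDP,
# with a softmax policy parameterized by theta.
def sample_action(obs, theta, rng):
    p = np.exp(theta - theta.max()); p /= p.sum()
    return rng.choice(len(theta), p=p)

def grad_log_prob(obs, action, theta):
    p = np.exp(theta - theta.max()); p /= p.sum()
    g = -p
    g[action] += 1.0                # grad of log softmax: e_a - p
    return g

def step_env(obs, action, rng):
    means = np.array([1.0, 0.0])    # arm 0 pays more on average
    return 0, means[action] + rng.normal(scale=0.1)

print(gpomdp(sample_action, grad_log_prob, step_env, theta=np.zeros(2)))

In expectation the returned vector has a positive first component and a negative second component, pushing the softmax parameters toward the better arm; smaller β reduces the variance of the estimate at the cost of additional bias, as the abstract notes.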

Impact and interest:

222 citations in Scopus
149 citations in Web of Science®

Citation counts are sourced monthly from Scopus and Web of Science® citation databases.

These databases cover different subsets of publications and different time periods, so the citation count from each is usually different. Some works are not indexed in either database, in which case no count is displayed. Scopus includes citations from articles published in 1996 onwards, and Web of Science® generally from 1980 onwards.


ID Code: 43933
Item Type: Journal Article
Keywords: Algorithms, Computational methods, Markov processes, Multi agent systems, Problem solving, Random processes, Gradient-based approaches, Policy parameters, Value-function methods, Learning systems, OAVJ
DOI: 10.1613/jair.806
ISSN: 1076-9757
Subjects: Australian and New Zealand Standard Research Classification > MATHEMATICAL SCIENCES (010000) > APPLIED MATHEMATICS (010200)
Australian and New Zealand Standard Research Classification > INFORMATION AND COMPUTING SCIENCES (080000) > ARTIFICIAL INTELLIGENCE AND IMAGE PROCESSING (080100)
Divisions: Past > QUT Faculties & Divisions > Faculty of Science and Technology
Past > Schools > Mathematical Sciences
Copyright Owner: AI Access Foundation, Inc.
Deposited On: 12 Aug 2011 12:55
Last Modified: 12 Aug 2011 12:55

