REGAL: a regularization based algorithm for reinforcement learning in weakly communicating MDPs
Bartlett, Peter L. & Tewari, Ambuj (2009) REGAL: a regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI 2009), McGill University, Montreal.
We provide an algorithm that achieves the optimal regret rate in an unknown weakly communicating Markov Decision Process (MDP). The algorithm proceeds in episodes where, in each episode, it picks a policy using regularization based on the span of the optimal bias vector. For an MDP with S states and A actions whose optimal bias vector has span bounded by H, we show a regret bound of $\tilde{O}(HS\sqrt{AT})$. We also relate the span to various diameter-like quantities associated with the MDP, demonstrating how our results improve on previous regret bounds.
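As a rough illustration of the quantities named in the abstract, the sketch below assumes the usual definition of the span of a vector (its largest entry minus its smallest) and evaluates only the leading term H·S·√(A·T) of the bound, with constants and logarithmic factors dropped. The function names and the example bias vector are hypothetical and are not taken from the paper; this is not the REGAL algorithm itself.

```python
import math


def span(h):
    """Span of a vector: largest entry minus smallest entry."""
    return max(h) - min(h)


def regret_rate(H, S, A, T):
    """Leading term H * S * sqrt(A * T); constants and log factors are omitted."""
    return H * S * math.sqrt(A * T)


# Hypothetical bias vector for a 3-state MDP (illustrative numbers only).
h_star = [0.0, 1.5, 4.0]
H = span(h_star)

# Leading term of the bound at two horizons, for S = 3 states and A = 2 actions.
rate_T = regret_rate(H, S=3, A=2, T=10_000)
rate_4T = regret_rate(H, S=3, A=2, T=40_000)
```

Because the leading term grows as √T, quadrupling the horizon T doubles it, while it scales linearly in the span bound H and in the number of states S.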
|Item Type:||Conference Paper|
|Keywords:||algorithm, optimal regret rate, Markov Decision Process (MDP)|
|Subjects:||Australian and New Zealand Standard Research Classification > INFORMATION AND COMPUTING SCIENCES (080000) > ARTIFICIAL INTELLIGENCE AND IMAGE PROCESSING (080100)|
|Divisions:||Past > QUT Faculties & Divisions > Faculty of Science and Technology; Past > Schools > Mathematical Sciences|
|Copyright Owner:||Copyright 2009 [please consult the authors]|
|Deposited On:||06 Sep 2011 08:28|
|Last Modified:||06 Sep 2011 08:28|