Greedy feature replacement for online value function approximation (translated foreign literature)

 2022-12-10 04:12

Greedy feature replacement for online value function approximation

Abstract: Reinforcement learning (RL) in real-world problems requires function approximation, which depends on selecting appropriate feature representations. Representational expansion techniques can make linear approximators represent value functions more effectively; however, most of these techniques work well only for low dimensional problems. In this paper, we present greedy feature replacement (GFR), a novel online expansion technique for value-based RL algorithms that use binary features. Given a simple initial representation, the feature representation is expanded incrementally: new feature dependencies are added automatically to the current representation, and conjunctive features are used to replace current features greedily. A virtual temporal difference (TD) error is recorded for each conjunctive feature to judge whether the replacement can improve the approximation. Correctness guarantees and a computational complexity analysis are provided for GFR. Experimental results in two domains show that GFR achieves much faster learning and has the capability to handle large-scale problems.

Key words: Reinforcement learning, Function approximation, Feature dependency, Online expansion, Feature replacement

1 Introduction

Complex, high dimensional problems are the main obstacle in the transition of reinforcement learning (RL) from the toy world to real-world applications. Traditional tabular reinforcement learning makes high demands on computing time and storage space, making RL in large spaces a challenge (Tsitsiklis, 1994; Sprague and Ballard, 2003). Using function approximation, however, introduces generalization into RL, effectively reducing the time and space burdens and providing a solution to high dimensional problems (Singh and Yee, 1994; Sutton, 1996).

Linear value function approximation is widely considered an effective method for RL owing to its simple form and complete theoretical basis. Nevertheless, linear approximators using only initial features perform poorly in many applications, as they cannot capture the relationships between features and thus cannot express the value function well. Researchers such as Albus (1971) and Sturtevant and White (2006) have pursued manual approaches that identify the important feature dependencies to improve learning performance; however, discovering feature dependencies automatically is clearly more attractive. Several online automatic expansion techniques have been developed, such as adaptive tile coding (ATC) and sparse distributed memories (SDM) (Ratitch and Precup, 2004; Whiteson et al., 2007), which simplify the learning process. However, these techniques still cannot handle large-scale problems well, because their new features must correspond to the full dimensional state space even when a lower dimensional space is enough.
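To make the limitation concrete, here is a minimal sketch (an illustration, not from the paper): a linear approximator over two independent binary features cannot represent an XOR-shaped value pattern, but appending their conjunction as a third binary feature makes an exact linear fit possible.

```python
# A minimal sketch (illustrative, not from the paper): a linear approximator
# over two independent binary features f1, f2 cannot represent an XOR-shaped
# value pattern, but appending their conjunction f1*f2 as a third binary
# feature makes an exact linear fit possible.

def value(weights, features):
    """Linear value estimate: dot product of weights and binary features."""
    return sum(w * f for w, f in zip(weights, features))

# Target values over the four states (f1, f2): an XOR-like pattern that no
# weight pair (w1, w2) on f1 and f2 alone can reproduce.
targets = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 0.0}

# With the conjunctive feature appended, w = [1, 1, -2] fits exactly.
w = [1.0, 1.0, -2.0]
for (f1, f2), target in targets.items():
    assert value(w, (f1, f2, f1 * f2)) == target
```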

2 Background: Markov decision process

The Markov decision process (MDP) is the theoretical basis of RL (Puterman, 1994; Barto et al., 1995; Kolter and Ng, 2009; Pazis and Lagoudakis, 2009). An MDP is formalized as a tuple M = (S, A, P, R, γ, D), where S is the set of possible states, A is the set of possible actions, P denotes the state transition probabilities, with p(s, a, s′) the probability of moving from state s to state s′ by taking action a, R is the corresponding reward function, with r(s, a) the expected reward for taking action a in state s, γ is a real number between 0 and 1 that serves as the discount factor for future rewards, and D represents the initial state distribution. A policy π: S → A is a mapping from the states to the actions, and π(s) gives the action selected in state s.
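The components of the tuple M = (S, A, P, R, γ, D) and a Bellman backup for a fixed policy π can be sketched as follows (a toy example with assumed numbers, for illustration only):

```python
# A minimal sketch of the MDP tuple M = (S, A, P, R, gamma, D) and a Bellman
# backup for a fixed policy pi (toy numbers chosen here for illustration).

S = ["s0", "s1"]                          # states
A = ["left", "right"]                     # actions
P = {                                     # P[s][a]: next state -> probability
    "s0": {"left": {"s0": 1.0}, "right": {"s1": 1.0}},
    "s1": {"left": {"s0": 1.0}, "right": {"s1": 1.0}},
}
R = {("s0", "right"): 1.0}                # r(s, a); unlisted pairs yield 0
gamma = 0.9                               # discount factor in (0, 1)
pi = {"s0": "right", "s1": "left"}        # deterministic policy pi: S -> A

def bellman_backup(V):
    """One application of the Bellman operator for the fixed policy pi."""
    return {
        s: R.get((s, pi[s]), 0.0)
           + gamma * sum(p * V[s2] for s2, p in P[s][pi[s]].items())
        for s in S
    }

V = {s: 0.0 for s in S}
for _ in range(100):                      # iterate toward the fixed point V_pi
    V = bellman_backup(V)
```

Since the Bellman operator is a γ-contraction, repeated backups converge to the value function of π; here V(s0) = 1 + γ·V(s1) and V(s1) = γ·V(s0).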

3 Greedy feature replacement: method

We employ linear value function approximation with binary features. With a binary feature representation, computing the estimated values is simple and fast; therefore, it has been widely used in RL (Albus, 1971; Sutton, 1996). In a binary representation, the conjunction of two features can represent the dependency between them. By adding feature conjunctions to the representation, nonlinearities can be captured by the linear value function approximator. Therefore, a linear approximator can achieve the effect of a nonlinear one that uses only the initial features (Geramifard et al., 2011). Like iFDD, GFR selects feature conjunctions as new features and incrementally adds them to the feature representation. The feature conjunction selection process is therefore the key point of our algorithm. The GFR algorithm is based on a simple principle: if the feature representation is already perfect, adding new feature conjunctions will not further reduce the approximation error. Conversely, if adding new feature conjunctions can reduce the approximation error, the feature representation still needs to be expanded. Unlike iFDD, GFR can optimize the sequence in which feature conjunctions are discovered and thereby improve the performance of representational expansion. In the GFR algorithm, we expand the feature representation in the form of feature replacement. Using f(s, a) = 1 to indicate that the binary feature f in the current feature set is true (also called 'active') when taking action a in state s, we derive the definition of the term 'feature replacement'.
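The discovery principle above can be sketched as follows. This is a simplified illustration in the spirit of iFDD-style expansion; the error-crediting rule and threshold are assumptions, not the paper's exact procedure:

```python
# An illustrative sketch of the expansion principle (simplified; the
# threshold and error-crediting rule are assumptions, not the paper's exact
# procedure): accumulate a virtual TD error for each candidate conjunction of
# co-active binary features, and add a candidate to the representation once
# its accumulated error indicates the current features are insufficient.
from collections import defaultdict
from itertools import combinations

features = {"f1", "f2", "f3"}            # current binary feature set
virtual_error = defaultdict(float)       # candidate conjunction -> sum |TD error|
THRESHOLD = 1.0                          # discovery threshold (assumed)

def observe(active, td_error):
    """Credit |td_error| to every pairwise conjunction of active features;
    expand the representation when a candidate's total crosses THRESHOLD."""
    discovered = []
    for pair in combinations(sorted(active & features), 2):
        virtual_error[pair] += abs(td_error)
        if pair not in features and virtual_error[pair] >= THRESHOLD:
            features.add(pair)           # the conjunction becomes a feature
            discovered.append(pair)
    return discovered

observe({"f1", "f2"}, td_error=0.6)          # 0.6 < 1.0: nothing added yet
new = observe({"f1", "f2"}, td_error=0.6)    # 1.2 >= 1.0: conjunction added
```

A persistently large accumulated error for a conjunction signals that the current features cannot reduce the approximation error on their own, which is exactly the condition under which the representation should be expanded.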

4 Experiments

We chose two representative domains to compare GFR against other feature representation techniques, including the initial representation, the tabular representation, iFDD, and iFDD+ (a new version of iFDD proposed by Geramifard et al. (2013)). An ε-greedy policy was adopted for exploration, and we used episodic RL in each domain.
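For reference, ε-greedy action selection, a standard exploration policy in value-based RL, can be sketched as follows (the parameter values here are assumptions, not the paper's settings):

```python
# A minimal sketch of epsilon-greedy action selection (illustrative; the
# default epsilon value is an assumption, not the paper's setting).
import random

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """With probability epsilon pick a uniformly random action; otherwise
    pick a greedy one. q_values maps action -> estimated action value."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))
    return max(q_values, key=q_values.get)
```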

5 Conclusions and future work

GFR is a novel online representational expansion technique that can be combined with any online linear value function approximation method using binary features. The algorithm expands the feature representation incrementally by discovering replaced regions and performing feature replacement. GFR can be applied to high dimensional domains, thereby overcoming a limitation of traditional online expansion techniques. Moreover, by identifying replaced regions, GFR can optimize the sequence in which feature conjunctions are discovered, and hence improve the performance of representational expansion. We have proved that GFR combined with TD learning leads to asymptotically optimal approximation results, and we have provided a computational complexity analysis of the algorithm. In addition, experiments on the inverted pendulum and persistent surveillance domains demonstrate the superiority of GFR empirically.

In future work, methods for reducing the standard deviation could be employed to improve the stability of the algorithm. Although GFR adapts well to large spaces and has outstanding average performance, the variance between runs, as with other RL methods, remains significant; ensuring stability is important for the practical application of GFR. Another direction for future improvement is extending GFR to real-world problems. Our success in virtual RL tasks shows that GFR is capable of solving problems with large spaces; however, applying the method to practical problems remains a great challenge. For example, problems with incomplete sensing may prevent even a fully expanded representation from obtaining good approximation results. In such cases, combining GFR with partially observable RL methods may provide a solution and is also worth studying.

References

Albus, J.S., 1971. A theory of cerebellar function. Math. Biosci., 10(1-2):25-61. [doi:10.1016/0025-5564(71)90051-4]

Barto, A.G., Bradtke, S.J., Singh, S.P., 1995. Learning to act using real-time dynamic programming. Artif. Intell., 72(1-2):81-138. [doi:10.1016/0004-3702(94)00011-O]

Buro, M., 1999. From simple features to sophisticated evaluation functions. Proc. 1st Int. Conf. on Computers and Games, p.126-145. [doi:10.1007/3-540-48957-6_8]

de Hauwere, Y.M., Vrancx, P., Nowé, A., 2010. Generalized learning automata for multi-agent reinforcement learning. AI Commun., 23(4):311-324. [doi:10.3233/AIC-2010-0476]

Geramifard, A., Doshi, F., Redding, J., et al., 2011. Online discovery of feature dependencies. Proc. 28th Int. Conf. on Machine Learning, p.881-888.

Geramifard, A., Dann, C., How, J.P., 2013. Off-policy learning combined with automatic feature expansion for solving large MDPs. Proc. 1st Multidisciplinary Conf. on Reinforcement Learning and Decision Making, p.29-33.

Kaelbling, L.P., Littman, M.L., Moore, A.W., 1996. Reinforcement learning: a survey. J. Artif. Intell. Res., 4:237-285. [doi:10.1613/jair.301]

Kolter, J.Z., Ng, A.Y., 2009. Near-Bayesian exploration in polynomial time. Proc. 26th Annual Int. Conf. on Machine Learning, p.513-520. [doi:10.1145/1553374.1553441]
