John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan and Pieter Abbeel Department of Electrical Engineering and Computer Science University of California, Berkeley {joschu,pcmoritz,levine,jordan,pabbeel}@eecs.berkeley.edu
Main idea: use GAE to estimate the advantage function and reduce variance, with a parameter that controls the temporal extent over which a sequence of actions is credited with affecting the reward. This observation suggests an interpretation of Equation (16): reshape the rewards using $V$ to shrink the temporal extent of the response function, and then introduce a "steeper" discount $\gamma\lambda$ to cut off the noise arising from long delays, i.e., ignore terms $\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \delta^V_{t+l}$ where $l \gg 1/(1-\gamma\lambda)$.
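As a quick illustrative calculation (values chosen here only for illustration, not taken from the paper's experiments): with $\gamma = 0.99$ and $\lambda = 0.95$, $\gamma\lambda \approx 0.9405$, so $1/(1-\gamma\lambda) \approx 17$; terms of the response function more than roughly 17 steps in the future are effectively cut off.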
GAE: generalized advantage estimation
Two main challenges: (1) the large number of samples typically required, and (2) the difficulty of obtaining stable and steady improvement. How these are addressed:
The first challenge is addressed by using value functions to substantially reduce the variance of policy gradient estimates, at the cost of some bias, with an exponentially-weighted estimator of the advantage function that is analogous to TD(λ). The second challenge is addressed by using a trust region optimization procedure for both the policy and the value function, which are represented by neural networks. Key points:
The value function is updated after the policy update. The choice $\Psi_t = A^\pi(s_t, a_t)$ yields almost the lowest possible variance, though in practice the advantage function is not known and must be estimated. A parameter $\gamma$ is introduced to reduce variance by downweighting rewards corresponding to delayed effects, at the cost of introducing bias.
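For reference, the $\gamma$-discounted value and advantage functions that this tradeoff is stated in terms of (standard definitions, written out here to make the downweighting concrete):

$$V^{\pi,\gamma}(s_t) := \mathbb{E}\left[\sum_{l=0}^{\infty} \gamma^l r_{t+l}\right], \qquad Q^{\pi,\gamma}(s_t, a_t) := \mathbb{E}\left[\sum_{l=0}^{\infty} \gamma^l r_{t+l} \,\Big|\, a_t\right], \qquad A^{\pi,\gamma} := Q^{\pi,\gamma} - V^{\pi,\gamma}.$$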
$$\hat{g} = \frac{1}{N}\sum_{n=1}^{N}\sum_{t=0}^{\infty} \hat{A}^n_t\, \nabla_\theta \log \pi_\theta(a^n_t \mid s^n_t) \qquad (9)$$

where $n$ indexes over a batch of episodes.
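A minimal sketch of how Equation (9) could be computed, assuming a toy tabular softmax policy instead of the neural-network policy used in the paper; the function name `policy_grad_estimate` and the data layout are illustrative assumptions:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def policy_grad_estimate(theta, trajectories):
    """Monte-Carlo estimate of Eq. (9):
    g_hat = (1/N) * sum_n sum_t A_hat[n, t] * grad_theta log pi_theta(a | s).

    theta:        (num_states, num_actions) logit table for a toy softmax policy
                  (an assumption for this sketch, not the paper's setup).
    trajectories: list of N episodes, each a list of (state, action, advantage) tuples.
    """
    grad = np.zeros_like(theta)
    for episode in trajectories:                 # n = 1, ..., N
        for s, a, adv in episode:                # t = 0, 1, ...
            probs = softmax(theta[s])
            # For a softmax policy, d log pi(a|s) / d logits(s) = one_hot(a) - probs.
            g_logpi = -probs
            g_logpi[a] += 1.0
            grad[s] += adv * g_logpi
    return grad / len(trajectories)
```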
Let $V$ be an approximate value function and define $\delta^V_t = r_t + \gamma V(s_{t+1}) - V(s_t)$, which can be regarded as an estimate of the advantage of action $a_t$.
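A small numpy sketch of this TD residual, under the assumed convention that `values` carries one extra entry for the state after the last reward:

```python
import numpy as np

def td_residuals(rewards, values, gamma):
    """delta_t = r_t + gamma * V(s_{t+1}) - V(s_t) for t = 0, ..., T-1.

    rewards: length-T array of rewards r_0, ..., r_{T-1}.
    values:  length-(T+1) array V(s_0), ..., V(s_T); use 0 for a terminal state.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    return rewards + gamma * values[1:] - values[:-1]
```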
Summing these residuals gives a telescoping sum over the advantages, which leads to the generalized advantage estimator GAE$(\gamma, \lambda)$: $\hat{A}^{\mathrm{GAE}(\gamma,\lambda)}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta^V_{t+l}$. GAE is used to construct a biased estimator of $g^\gamma$ by plugging it into Equation (6). Concrete algorithm:
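A minimal sketch of the GAE$(\gamma, \lambda)$ computation via the standard backward recursion $\hat{A}_t = \delta^V_t + \gamma\lambda\, \hat{A}_{t+1}$; the function name and array conventions are assumptions for illustration, not the paper's code:

```python
import numpy as np

def gae_advantages(rewards, values, gamma, lam):
    """Generalized advantage estimates A_hat_t = sum_l (gamma*lam)^l * delta_{t+l}.

    rewards: length-T array of rewards.
    values:  length-(T+1) array of value estimates (last entry 0 for terminal states).
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    deltas = rewards + gamma * values[1:] - values[:-1]   # delta_t as defined above
    advantages = np.zeros_like(deltas)
    running = 0.0
    # Backward recursion: A_hat_t = delta_t + gamma * lam * A_hat_{t+1}.
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages
```

These advantage estimates are then used in place of $\hat{A}^n_t$ in Equation (9); setting $\lambda = 1$ recovers the higher-variance, lower-bias estimator, while $\lambda = 0$ reduces to the one-step TD residual.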
