Fisher's Blog


Proximal Policy Optimization: Code Implementation

Building on the paper Proximal Policy Optimization Algorithms, one can see that PPO is much easier to implement than TRPO. Compared with a plain Actor-Critic algorithm, the most important change is replacing the objective function with a surrogate objective, and limiting how far each update can move this surrogate. In the actual code implementation, following the paper, we merge the Actor and the Critic so that they share one set of network parameters and are optimized with a single loss function. You can jump straight to the full code.

The project at https://github.com/BlueFisher/RL-PPO-with-Unity contains a more complete and more capable PPO implementation, with the agent environments based on the Unity ML-Agents Toolkit.

PPO

The skeleton of the core PPO class is:

class PPO(object):
    def __init__(self, s_dim, a_bound, c1, c2, epsilon, lr, K):
    def _build_net(self, scope, trainable):
    def get_v(self, s):
    def choose_action(self, s):
    def train(self, s, a, discounted_r):
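As a point of reference, here is a minimal sketch of how this class might be instantiated. The environment and every hyperparameter value below are assumptions made for illustration (the (r + 8) / 8 reward scaling in the main loop later suggests Pendulum-v0, but the post never names the environment):

import gym

env = gym.make('Pendulum-v0')            # assumed environment
s_dim = env.observation_space.shape[0]   # 3 for Pendulum
a_bound = env.action_space.high[0]       # 2.0 for Pendulum

ppo = PPO(s_dim=s_dim, a_bound=a_bound,
          c1=0.5,        # value-loss coefficient (assumed)
          c2=0.01,       # entropy coefficient (assumed)
          epsilon=0.2,   # clipping range (assumed)
          lr=1e-4,       # learning rate (assumed)
          K=10)          # optimization epochs per update (assumed)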

_build_net: building the neural network

with tf.variable_scope(scope):
    l1 = tf.layers.dense(self.s, 100, tf.nn.relu, trainable=trainable)
    mu = tf.layers.dense(l1, 1, tf.nn.tanh, trainable=trainable)
    sigma = tf.layers.dense(l1, 1, tf.nn.softplus, trainable=trainable)

    # the state-value function v and the policy π share the same network parameters
    v = tf.layers.dense(l1, 1, trainable=trainable)

    mu, sigma = mu * self.a_bound, sigma + 1

    norm_dist = tf.distributions.Normal(loc=mu, scale=sigma)

params = tf.global_variables(scope)
return norm_dist, v, params

__init__

First, define a few variables and build the networks to obtain the policy π and the state-value function V:

self.sess = tf.Session()

self.a_bound = a_bound
self.K = K

self.s = tf.placeholder(tf.float32, shape=(None, s_dim), name='s_t')

pi, self.v, params = self._build_net('network', True)
old_pi, old_v, old_params = self._build_net('old_network', False)

Next comes the discounted reward of a sampled trajectory, which is computed in advance in the main function: $r_t + \gamma r_{t+1} + \cdots + \gamma^{T-t+1} r_{T-1} + \gamma^{T-t} V(s_T)$

self.discounted_r = tf.placeholder(tf.float32, shape=(None, 1), name='discounted_r')
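For reference, a minimal standalone version of that backward computation. compute_discounted_r is a hypothetical helper written only for illustration; it mirrors the loop inside simulate() in the main function below:

def compute_discounted_r(rewards, v_s_, gamma):
    # walk the trajectory backwards, bootstrapping from V(s_T)
    discounted = []
    for r in reversed(rewards):
        v_s_ = r + gamma * v_s_
        discounted.append(v_s_)
    return discounted[::-1]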

Construct the advantage estimate: $\hat{A}_t = -V(s_t) + r_t + \gamma r_{t+1} + \cdots + \gamma^{T-t+1} r_{T-1} + \gamma^{T-t} V(s_T)$

advantage = self.discounted_r - old_v

The key surrogate objective: $L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right]$, where $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_\mathrm{old}}(a_t \mid s_t)$ is the probability ratio:

self.a = tf.placeholder(tf.float32, shape=(None, 1), name='a_t')
ratio = pi.prob(self.a) / old_pi.prob(self.a)

L_clip = tf.reduce_mean(tf.minimum(
    ratio * advantage,  # unclipped surrogate objective
    tf.clip_by_value(ratio, 1 - epsilon, 1 + epsilon) * advantage
))
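A quick NumPy illustration of what the clipping does (the numbers are arbitrary and epsilon = 0.2 is an assumed value): when the advantage is positive, the clipped term caps the surrogate at a ratio of 1 + ε, so a single update gains nothing by pushing the new policy far above the old one.

import numpy as np

epsilon = 0.2
advantage = 1.0                      # a positive advantage
for ratio in [0.5, 1.0, 1.5]:
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - epsilon, 1 + epsilon) * advantage
    print(ratio, min(unclipped, clipped))
# 0.5 -> 0.5, 1.0 -> 1.0, 1.5 -> 1.2 (capped at 1 + epsilon)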

Then combine it with the Critic and add an entropy bonus: $L_t^{CLIP+VF+S}(\theta) = \hat{\mathbb{E}}_t\left[L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) + c_2 S[\pi_\theta](s_t)\right]$

L_vf = tf.reduce_mean(tf.square(self.discounted_r - self.v))
S = tf.reduce_mean(pi.entropy())
L = L_clip - c1 * L_vf + c2 * S

This essentially completes the construction of the algorithm; the remaining details can be found in the full code.
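For completeness, here is a hedged sketch of what those remaining pieces might look like: the training op, the op that copies the current network into the frozen old network, and the get_v / choose_action / train methods. It is a minimal reconstruction under the assumptions above (numpy imported as np, Adam as the optimizer), not necessarily the post's exact code:

# still inside __init__: maximize L by minimizing its negation (Adam is an assumed choice)
self.train_op = tf.train.AdamOptimizer(lr).minimize(-L)

# copy the current network's parameters into the frozen old network
self.update_old_op = [tf.assign(old, new) for old, new in zip(old_params, params)]

self.sample_op = tf.squeeze(pi.sample(1), axis=0)  # one sampled action per state
self.sess.run(tf.global_variables_initializer())

def get_v(self, s):
    return self.sess.run(self.v, feed_dict={self.s: s[np.newaxis, :]}).squeeze()

def choose_action(self, s):
    a = self.sess.run(self.sample_op, feed_dict={self.s: s[np.newaxis, :]})
    return np.clip(a[0], -self.a_bound, self.a_bound)

def train(self, s, a, discounted_r):
    self.sess.run(self.update_old_op)  # sync old_pi <- pi before the update epochs
    feed = {self.s: s, self.a: a, self.discounted_r: discounted_r}
    for _ in range(self.K):            # K optimization epochs over the same batch
        self.sess.run(self.train_op, feed_dict=feed)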

Main function

def simulate():
    s = env.reset()
    r_sum = 0
    trans = []
    for step in range(T_TIMESTEPS):
        a = ppo.choose_action(s)
        s_, r, done, _ = env.step(a)
        trans.append([s, a, (r + 8) / 8])  # shift and scale the raw reward
        s = s_
        r_sum += r

    # compute discounted rewards backwards, bootstrapping from V(s_T)
    v_s_ = ppo.get_v(s_)
    for tran in trans[::-1]:
        v_s_ = tran[2] + GAMMA * v_s_
        tran[2] = v_s_

    return r_sum, trans


for i_iteration in range(ITER_MAX):
    futures = [executor.submit(simulate) for _ in range(ACTOR_NUM)]
    concurrent.futures.wait(futures)

    trans_with_discounted_r = []
    r_sums = []
    for f in futures:
        r_sum, trans = f.result()
        r_sums.append(r_sum)
        trans_with_discounted_r += trans

    print(i_iteration, r_sums)

    for i in range(0, len(trans_with_discounted_r), BATCH_SIZE):
        batch = trans_with_discounted_r[i:i + BATCH_SIZE]
        s, a, discounted_r = [np.array(e) for e in zip(*batch)]
        ppo.train(s, a, discounted_r[:, np.newaxis])

As described in the paper, we sample with multiple Actors in parallel. Because the problem here is small, parallelism makes little difference to the sampling time; most of the time is spent in training.
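The main loop above relies on a few things defined elsewhere in the full code: the thread pool executor and the hyperparameter constants, in addition to the env and ppo objects sketched after the class skeleton. A minimal setup sketch, with illustrative assumed values:

import concurrent.futures
import numpy as np

# hyperparameters for the main loop; these values are illustrative assumptions
ITER_MAX = 1000      # training iterations
ACTOR_NUM = 4        # parallel actors per iteration
T_TIMESTEPS = 200    # rollout length per actor
GAMMA = 0.9          # discount factor
BATCH_SIZE = 64      # minibatch size passed to ppo.train

# thread pool used by executor.submit in the loop above
executor = concurrent.futures.ThreadPoolExecutor(max_workers=ACTOR_NUM)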

References

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

https://github.com/MorvanZhou/Reinforcement-learning-with-tensorflow/blob/master/contents/12_Proximal_Policy_Optimization/simply_PPO.py
