def simulate():
    s = env.reset()
    r_sum = 0
    trans = []
    for step in range(T_TIMESTEPS):
        a = ppo.choose_action(s)
        s_, r, done, _ = env.step(a)
        trans.append([s, a, (r + 8) / 8])  # shift and scale the raw reward before storing it
        s = s_
        r_sum += r

    # Bootstrap from the value of the final state, then compute discounted returns backwards.
    v_s_ = ppo.get_v(s_)
    for tran in trans[::-1]:
        v_s_ = tran[2] + GAMMA * v_s_
        tran[2] = v_s_

    return r_sum, trans
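To make the reversed loop concrete, here is a minimal sketch with made-up values for GAMMA, the stored rewards, and the bootstrap value (the numbers are purely illustrative, not taken from the run above). It shows how each transition's reward slot is replaced by its bootstrapped discounted return:

# Illustrative only: dummy rewards and a hypothetical bootstrap value.
GAMMA = 0.9
v_s_ = 1.0                                            # stands in for ppo.get_v(s_)
trans = [[None, None, r] for r in (0.5, -0.2, 0.3)]   # only the reward slot matters here

for tran in trans[::-1]:          # walk backwards from the last transition
    v_s_ = tran[2] + GAMMA * v_s_
    tran[2] = v_s_                # reward slot now holds the discounted return

print([round(t[2], 3) for t in trans])  # [1.292, 0.88, 1.2]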
for i_iteration in range(ITER_MAX):
    # Launch ACTOR_NUM rollouts in parallel and wait for all of them to finish.
    futures = [executor.submit(simulate) for _ in range(ACTOR_NUM)]
    concurrent.futures.wait(futures)

    trans_with_discounted_r = []
    r_sums = []
    for f in futures:
        r_sum, trans = f.result()
        r_sums.append(r_sum)
        trans_with_discounted_r += trans

    print(i_iteration, r_sums)

    # Train on mini-batches of the collected transitions.
    for i in range(0, len(trans_with_discounted_r), BATCH_SIZE):
        batch = trans_with_discounted_r[i:i + BATCH_SIZE]
        s, a, discounted_r = [np.array(e) for e in zip(*batch)]
        ppo.train(s, a, discounted_r[:, np.newaxis])
Following the paper, we use multiple actors to collect samples in parallel. Because the problem here is small, the sampling time differs little across actors; most of the wall-clock time is spent in training.
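The executor object used in the training loop is assumed to be created earlier in the post. A minimal sketch, assuming a thread pool sized to the number of parallel actors (a ProcessPoolExecutor would also work, but requires the environment and model to be picklable):

import concurrent.futures

ACTOR_NUM = 4  # assumed value: one worker per parallel simulate() call
executor = concurrent.futures.ThreadPoolExecutor(max_workers=ACTOR_NUM)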
References
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.