Abstract: Offline reinforcement learning learns action policies from pre-collected expert or other experience data without interacting with the environment. Compared with online reinforcement learning, it offers higher sample efficiency and lower interaction cost and trial-and-error risk. Reinforcement learning typically represents the value of a state-action pair with a Q-value estimation function or Q-value estimation network. Because Q-value estimation errors cannot be corrected in time through interaction with the environment, offline reinforcement learning often suffers from severe extrapolation error and low sample utilization. To address this, a TD-error-based sampling method for offline reinforcement learning is proposed: the temporal-difference (TD) error serves as the priority measure for prioritized sampling, and prioritized sampling is combined with uniform sampling, which improves sampling efficiency and mitigates the out-of-distribution error problem. On top of a double Q-value estimation network, three TD-error measures, obtained by computing the target value from the minimum, the maximum, or a convex combination of the two target Q-networks, are compared in terms of the performance of the resulting algorithms. In addition, an importance-sampling mechanism is used to eliminate the training bias introduced by the preferential sampling of prioritized experience replay. Compared with existing offline reinforcement learning methods that incorporate sampling strategies on the D4RL (Datasets for Deep Data-Driven Reinforcement Learning) benchmark, the proposed method achieves better final performance, data efficiency, and training stability. Two ablation experiments confirm the contribution of each component. The first shows that combining prioritized and uniform sampling is essential to the algorithm's performance: the combined scheme outperforms uniform sampling alone and prioritized sampling alone in sample utilization and policy stability. The second compares the three TD-error measures based on the maximum, the minimum, and a convex combination of the double target Q-values; the variant using the minimum of the two target Q-values achieves the best overall performance and data utilization across tasks, although its policy variance is higher. The proposed method can be combined with any offline reinforcement learning method based on Q-value estimation; it is stable, simple to implement, and highly scalable, which supports the use of reinforcement learning techniques in real-world settings.
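To make the three target constructions mentioned above concrete, the following is a minimal PyTorch-style sketch rather than the paper's code: the function name, the discount factor `gamma`, the mixing coefficient `lam`, and the choice to average the absolute errors of the two online networks are illustrative assumptions; the exact targets are defined by the paper's Eqs. (11)-(13).

```python
import torch

def td_error(q1, q2, q1_next_target, q2_next_target, reward, not_done,
             gamma=0.99, mode="min", lam=0.75):
    """Per-transition TD error used as the sampling priority (illustrative sketch).

    q1, q2:                         Q_phi1(s, a), Q_phi2(s, a) for the batch
    q1_next_target, q2_next_target: target-network values Q_phi1'(s', a'), Q_phi2'(s', a')
    mode:                           "min", "max", or "convex" -- the three target
                                    constructions compared in the paper
    """
    if mode == "min":       # pessimistic target: minimum of the two target networks
        q_next = torch.min(q1_next_target, q2_next_target)
    elif mode == "max":     # optimistic target: maximum of the two target networks
        q_next = torch.max(q1_next_target, q2_next_target)
    else:                   # convex combination of min and max (lam is our assumption)
        q_next = lam * torch.min(q1_next_target, q2_next_target) \
                 + (1.0 - lam) * torch.max(q1_next_target, q2_next_target)

    q_target = reward + gamma * not_done * q_next
    # Average absolute TD error of the two online Q-networks (one possible choice)
    return 0.5 * ((q1 - q_target).abs() + (q2 - q_target).abs())
```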
Key words:
- offline
- reinforcement learning
- sampling strategy
- experience replay buffer
- TD-error
Figure 3. Performance comparison of the proposed CQL_H method with the CQL_PER and CQL_PER_N_return algorithms on three types of data in three environments: (a) Hopper; (b) HalfCheetah; (c) Walker2d
Algorithm 1: TD-error-based sampling method for offline reinforcement learning (CQL version)

Initialization: double Q-value networks $Q_{\varphi_1}$, $Q_{\varphi_2}$; target Q-value networks $Q_{\varphi'_1}$, $Q_{\varphi'_2}$; policy network $\pi_\theta$ with parameter update step size $\eta$; Q-network parameters $\varphi$ with update step size $\zeta$, gradient accumulator $\Delta$, and target-network soft-update coefficient $\tau$; episode length $H$; batch size $N$; initial priority set to 1; maximum number of training steps $T$; maximum number of prioritized-sampling steps $T_p$; standard experience replay buffer $B$; prioritized experience replay buffer $B_p$; transition index $i$.

For each training step $t < T$:
If $t < T_p$ (sample from the prioritized replay buffer $B_p$):
1. Compute the priority sampling probabilities according to Eq. (15) and sample a mini-batch of $N$ transitions from $B_p$
2. Compute the importance-sampling weights $w_i = (N \cdot P(i))^{-\beta} / \max_i w_i$
3. Estimate the target value $Q_{\text{target}}(s_t, a_t)$ according to Eq. (11), (12), or (13)
4. Accumulate the Q-network gradient: $\Delta \leftarrow \Delta + \delta_i \cdot \nabla_\varphi \left[ \left( E_{(s_t, a_t) \sim \mathcal{D}}\left[ Q_\varphi(s_t, a_t) \right] - Q_{\text{target}}(s_t, a_t) \right)^2 \right]$
5. Update the Q-value networks: $\varphi \leftarrow \varphi + \zeta \cdot \Delta$
6. Softly update the target Q-value networks: $\varphi' \leftarrow \tau \varphi + (1 - \tau) \varphi'$
7. Update the policy network according to Eq. (16)
Otherwise (sample from the standard replay buffer $B$):
8. Estimate the target value $Q_{\text{target}}(s_t, a_t)$ according to Eq. (11), (12), or (13)
9. Accumulate the Q-network gradient: $\Delta \leftarrow \Delta + \nabla_\varphi \left[ \left( E_{(s_t, a_t) \sim \mathcal{D}}\left[ Q_\varphi(s_t, a_t) \right] - Q_{\text{target}}(s_t, a_t) \right)^2 \right]$
10. Update the Q-value networks: $\varphi \leftarrow \varphi + \zeta \cdot \Delta$
11. Softly update the target Q-value networks: $\varphi' \leftarrow \tau \varphi + (1 - \tau) \varphi'$
12. Update the policy network according to Eq. (16)
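To make the prioritized branch of Algorithm 1 concrete, the following is a minimal proportional prioritized-replay sketch in Python/NumPy. The class name, the $\alpha$ exponent, and the $\epsilon$ floor on priorities are illustrative assumptions borrowed from the standard prioritized experience replay formulation, not the paper's implementation; the exact sampling rate of Eq. (15) may differ.

```python
import numpy as np

class PrioritizedReplayBuffer:
    """Illustrative proportional prioritized replay: P(i) proportional to priority^alpha."""

    def __init__(self, capacity, alpha=0.6):
        self.capacity, self.alpha = capacity, alpha
        self.data = [None] * capacity
        self.priorities = np.zeros(capacity, dtype=np.float64)
        self.size, self.pos = 0, 0

    def add(self, transition, priority=1.0):
        # New transitions enter with priority 1, as in Algorithm 1.
        self.data[self.pos] = transition
        self.priorities[self.pos] = priority ** self.alpha
        self.pos = (self.pos + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size, beta=0.4):
        p = self.priorities[:self.size]
        probs = p / p.sum()                             # priority sampling rate, cf. Eq. (15)
        idx = np.random.choice(self.size, batch_size, p=probs)
        weights = (batch_size * probs[idx]) ** (-beta)  # w_i = (N * P(i))^(-beta)
        weights /= weights.max()                        # normalized by max_i w_i, as in step 2
        return [self.data[i] for i in idx], idx, weights

    def update_priorities(self, idx, td_errors, eps=1e-6):
        # Refresh priorities with the new |TD error| after each critic update.
        self.priorities[idx] = (np.abs(td_errors) + eps) ** self.alpha
```

During the first $T_p$ training steps the mini-batch is drawn from this buffer and the critic loss is weighted by the returned $w_i$; after $T_p$ steps Algorithm 1 switches to uniform sampling from the standard buffer $B$, which is the combined sampling scheme examined in the ablation study.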
Table 1. D4RL datasets used in the experiments

Task         Dataset                     Samples / $10^4$
Hopper       Hopper-random               1
Hopper       Hopper-medium               1
Hopper       Hopper-medium-expert        2
Halfcheetah  Halfcheetah-random          1
Halfcheetah  Halfcheetah-medium          1
Halfcheetah  Halfcheetah-medium-expert   2
Walker2d     Walker2d-random             1
Walker2d     Walker2d-medium             1
Walker2d     Walker2d-medium-expert      2
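For reference, the datasets in Table 1 can be loaded with the standard D4RL API roughly as follows; the `-v2` version suffix and the random subsampling down to the sizes in Table 1 are our assumptions, since they are not stated here.

```python
import gym
import numpy as np
import d4rl  # registers the D4RL environments with gym on import

TASKS = ["hopper", "halfcheetah", "walker2d"]
LEVELS = ["random", "medium", "medium-expert"]

def load_subset(name, n_samples, seed=0):
    """Load a D4RL dataset and draw a random subset of transitions."""
    env = gym.make(name)
    data = d4rl.qlearning_dataset(env)  # observations/actions/rewards/next_observations/terminals
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(data["rewards"]), size=n_samples, replace=False)
    return {k: v[idx] for k, v in data.items()}

for task in TASKS:
    for level in LEVELS:
        n = 20_000 if level == "medium-expert" else 10_000  # sample counts from Table 1
        subset = load_subset(f"{task}-{level}-v2", n)        # '-v2' suffix is our assumption
        print(task, level, subset["observations"].shape)
```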