• Source journal of The Engineering Index (EI)
  • Chinese core journal
  • Statistical source journal of Chinese scientific and technical papers
  • Source journal of the Chinese Science Citation Database (CSCD)


Sample strategy based on TD-error for offline reinforcement learning

ZHANG Longfei, FENG Yanghe, LIANG Xingxing, LIU Shixuan, CHENG Guangquan, HUANG Jincai

Citation: ZHANG Longfei, FENG Yanghe, LIANG Xingxing, LIU Shixuan, CHENG Guangquan, HUANG Jincai. Sample strategy based on TD-error for offline reinforcement learning[J]. Chinese Journal of Engineering. doi: 10.13374/j.issn2095-9389.2022.10.22.001

doi: 10.13374/j.issn2095-9389.2022.10.22.001
Funding: National Natural Science Foundation of China, General Program (62273352)

    Corresponding author, E-mail: fengyanghe@nudt.edu.cn

  • CLC number: TG142.71
  • Abstract: Offline reinforcement learning learns an action policy from pre-collected expert data or other experience data without interacting with the environment. Compared with online reinforcement learning, it offers higher sample efficiency and lower interaction cost. Reinforcement learning usually represents the value of a state–action pair with a Q-value estimation function or a Q-value estimation network. Because Q-value estimation errors cannot be corrected in time through interaction with the environment, offline reinforcement learning often suffers from severe extrapolation error and low sample utilization. To address this, a TD-error-based sampling method for offline reinforcement learning is proposed: the temporal-difference (TD) error is used as the priority metric for prioritized sampling, and a sampling scheme that combines prioritized sampling with standard sampling improves sampling efficiency while mitigating the out-of-distribution error problem. In addition, building on double Q-value estimation networks, the performance of the algorithms corresponding to three TD-error metrics, derived from different ways of computing the target network value, is compared. Furthermore, an importance-sampling mechanism is used to eliminate the training bias introduced by the preferential sampling of the prioritized experience replay mechanism. Compared with existing methods on the public benchmark D4RL (Datasets for Deep Data-Driven Reinforcement Learning), the proposed sampling method achieves better final performance, data efficiency, and training stability. Ablation experiments show that the combination of prioritized sampling and standard sampling is critical to the algorithm's performance, and that the algorithm using the TD-error priority metric based on the minimum of the two target Q-value estimates performs best on multiple tasks. The proposed TD-error-based sampling method can be combined with any Q-value-estimation-based offline reinforcement learning method and is characterized by stable performance, simple implementation, and strong scalability.
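
    A minimal, self-contained sketch of the sampling mechanism described above is given below (an illustrative reconstruction, not the authors' released code): priorities are the absolute TD errors, sampling probabilities follow the proportional prioritization of prioritized experience replay, and the importance-sampling weights $ {w_i} = {(N \cdot P(i))^{ - \beta }}/{\max _i}{w_i} $ correct the bias introduced by preferential sampling. The class name, the exponents alpha and beta, and the buffer layout are assumptions made for illustration.

    import numpy as np

    class PrioritizedReplay:
        """Minimal proportional prioritized replay keyed by |TD error| (illustrative sketch)."""

        def __init__(self, capacity, alpha=0.6, beta=0.4, eps=1e-6):
            self.capacity, self.alpha, self.beta, self.eps = capacity, alpha, beta, eps
            self.data = []
            self.priorities = np.zeros(capacity, dtype=np.float64)
            self.pos = 0

        def add(self, transition):
            # New transitions receive the current maximum priority (1.0 for an empty buffer).
            max_p = self.priorities[:len(self.data)].max() if self.data else 1.0
            if len(self.data) < self.capacity:
                self.data.append(transition)
            else:
                self.data[self.pos] = transition
            self.priorities[self.pos] = max_p
            self.pos = (self.pos + 1) % self.capacity

        def sample(self, batch_size):
            # Proportional prioritization: P(i) = p_i^alpha / sum_k p_k^alpha.
            p = self.priorities[:len(self.data)] ** self.alpha
            probs = p / p.sum()
            idx = np.random.choice(len(self.data), batch_size, p=probs)
            # Importance-sampling weights, normalized by their maximum.
            w = (len(self.data) * probs[idx]) ** (-self.beta)
            w /= w.max()
            return [self.data[i] for i in idx], idx, w

        def update_priorities(self, idx, td_errors):
            # Priority is |TD error| plus a small constant so no transition has zero probability.
            self.priorities[idx] = np.abs(td_errors) + self.eps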

  • Figure 1. Framework of the proposed method and the detailed architecture of the network: (a) framework of the proposed method; (b) detailed architecture of the network

    Note: MLP denotes multilayer perceptron

    Figure 2. The three DMControl simulation environments used in the experiments: (a) Hopper; (b) HalfCheetah; (c) Walker2d

    Figure 3. Performance comparison of the proposed CQL_H with the CQL_PER and CQL_PER_N_return algorithms on three types of data in three environments: (a) Hopper; (b) HalfCheetah; (c) Walker2d

    Figure 4. Performance comparison of the proposed CQL_H under different sampling schemes on three types of data in three environments: (a) Hopper; (b) HalfCheetah; (c) Walker2d

    Figure 5. Performance comparison of CQL_H with the three TD-error priority metrics: (a) Hopper-medium; (b) HalfCheetah-medium; (c) Walker2d-medium

    Algorithm 1: TD-error-based sampling method for offline reinforcement learning (CQL version)
    Initialize: double Q-value networks $ {Q_{{\varphi _1}}} $ and $ {Q_{{\varphi _2}}} $, double target Q-value networks $ {Q_{\varphi {'_1}}} $ and $ {Q_{\varphi {'_2}}} $, policy network ${\pi _\theta }$ with learning rate $\eta $, learning rate $\zeta $ for the Q-network parameters $\varphi $, soft-update coefficient $\tau $ for the target Q-network parameters, segment length $H$, batch size $N$, initial priority set to 1, maximum number of training steps $T$, maximum number of prioritized-sampling steps ${T_p}$, standard replay buffer $B$, prioritized replay buffer ${B_p}$, accumulated gradient $\Delta $ of the Q-network parameters $\varphi $, and sample index $i$.
    For each training step $t < T$:
      If $t < {T_p}$ (i.e., sample from the prioritized replay buffer):
      1. Compute the priority sampling probabilities according to Eq. (15) and sample a batch of $N$ transitions from the prioritized replay buffer
      2. Compute the importance-sampling weights: ${w_i} = {(N \cdot P(i))^{ - \beta }}/{\max _i}{w_i}$
      3. Estimate the target Q value $ {Q_{{\text{target}}}}({s_t},{a_t}) $ according to Eq. (11), (12), or (13)
      4. Accumulate the Q-network gradient
      $\Delta \leftarrow \Delta + {w_i} \cdot {\nabla _\varphi }\left[ {{{\left( {{E_{({s_t},{a_t}) \sim \mathcal{D}}}\left[ {{Q_\varphi }({s_t},{a_t})} \right] - {Q_{{\text{target}}}}({s_t},{a_t})} \right)}^2}} \right]$
      5. Update the Q-value networks: $\varphi \leftarrow \varphi - \zeta \cdot \Delta $
      6. Softly update the target Q-value networks: $\varphi ' \leftarrow \tau \varphi + (1 - \tau )\varphi '$
      7. Update the policy network according to Eq. (16)
      Else (i.e., sample a batch of $N$ transitions uniformly from the standard replay buffer):
      8. Estimate the target Q value $ {Q_{{\text{target}}}}({s_t},{a_t}) $ according to Eq. (11), (12), or (13)
      9. Accumulate the Q-network gradient
      $\Delta \leftarrow \Delta + {\nabla _\varphi }\left[ {{{\left( {{E_{({s_t},{a_t}) \sim \mathcal{D}}}\left[ {{Q_\varphi }({s_t},{a_t})} \right] - {Q_{{\text{target}}}}({s_t},{a_t})} \right)}^2}} \right]$
      10. Update the Q-value networks: $\varphi \leftarrow \varphi - \zeta \cdot \Delta $
      11. Softly update the target Q-value networks: $\varphi ' \leftarrow \tau \varphi + (1 - \tau )\varphi '$
      12. Update the policy network according to Eq. (16)
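
    For concreteness, the sketch below shows one way the TD-error priority metric based on the minimum of the two target Q-value estimates (the variant reported to perform best) might be computed for a sampled batch. It assumes PyTorch critic modules q1, q1_target, q2_target that take a state–action pair and an actor that maps states to actions; these names are illustrative assumptions, not the authors' code.

    import torch

    def td_error_priority(q1, q1_target, q2_target, actor, batch, gamma=0.99):
        # batch: (state, action, reward, next_state, done) tensors; reward and done are shaped [N, 1].
        s, a, r, s2, done = batch
        with torch.no_grad():
            a2 = actor(s2)                                     # next action from the current policy
            q_next = torch.min(q1_target(s2, a2), q2_target(s2, a2))
            q_target = r + gamma * (1.0 - done) * q_next       # bootstrapped target from the min of the two target critics
            delta = q_target - q1(s, a)                        # TD error of the first critic
        return delta.abs().squeeze(-1)                         # |TD error| used as the refreshed priority

    The returned values would then be fed back into the prioritized buffer (e.g., an update_priorities call as in the earlier sketch), so that later draws favor transitions with large TD errors while the importance weights keep the Q-loss estimate unbiased.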

    Table 1. D4RL datasets used in the experiments

    Task          Dataset                       Samples / $10^4$
    Hopper        Hopper-random                 1
                  Hopper-medium                 1
                  Hopper-medium-expert          2
    HalfCheetah   HalfCheetah-random            1
                  HalfCheetah-medium            1
                  HalfCheetah-medium-expert     2
    Walker2d      Walker2d-random               1
                  Walker2d-medium               1
                  Walker2d-medium-expert        2
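
    The datasets in Table 1 come from the public D4RL benchmark and can be loaded as in the hedged sketch below; the dataset name and version suffix (e.g., -v0 or -v2) depend on the installed d4rl release, so the name used here is only an example.

    import gym
    import d4rl  # importing d4rl registers the offline datasets with gym

    env = gym.make('hopper-medium-v0')   # version suffix is an example; adjust to the installed release
    data = d4rl.qlearning_dataset(env)   # dict with observations, actions, next_observations, rewards, terminals

    print(data['observations'].shape, data['actions'].shape)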

Article history
  • Received: 2022-10-22
  • Available online: 2023-03-28
