• Source journal of The Engineering Index (EI)
  • Chinese core journal
  • Statistical source journal of Chinese scientific and technical papers
  • Source journal of the Chinese Science Citation Database (CSCD)


Sample strategy based on TD-error for offline reinforcement learning

ZHANG Longfei, FENG Yanghe, LIANG Xingxing, LIU Shixuan, CHENG Guangquan, HUANG Jincai

Citation: ZHANG Longfei, FENG Yanghe, LIANG Xingxing, LIU Shixuan, CHENG Guangquan, HUANG Jincai. Sample strategy based on TD-error for offline reinforcement learning[J]. Chinese Journal of Engineering. doi: 10.13374/j.issn2095-9389.2022.10.22.001

doi: 10.13374/j.issn2095-9389.2022.10.22.001
Funding: National Natural Science Foundation of China, General Program (62273352)

    Corresponding author, E-mail: fengyanghe@nudt.edu.cn

  • CLC number: TG142.71
  • Abstract: Offline reinforcement learning learns an action policy from pre-collected expert data or other experience data without interacting with the environment. Compared with online reinforcement learning, it offers higher sample efficiency and lower interaction cost. Reinforcement learning usually represents the value of a state–action pair with a Q-value estimation function or a Q-value estimation network. Because Q-value estimation errors cannot be corrected in time through interaction with the environment, offline reinforcement learning often suffers from severe extrapolation error and low sample utilization. To address this, a TD-error-based sampling method for offline reinforcement learning is proposed: the temporal-difference (TD) error is used as the priority metric for prioritized sampling, and a sampling scheme that combines prioritized sampling with standard sampling improves sampling efficiency while mitigating the out-of-distribution error problem. In addition, building on double Q-value estimation networks, the performance of the algorithms corresponding to three TD-error metrics, derived from different ways of computing the target network value, is compared. Furthermore, an importance-sampling mechanism is used to eliminate the training bias introduced by the preferential sampling of the prioritized experience replay mechanism. Compared with existing methods on the public benchmark D4RL (Datasets for Deep Data-Driven Reinforcement Learning), the proposed sampling method achieves better final performance, data efficiency, and training stability. Ablation experiments show that the combination of prioritized sampling and standard sampling is critical to the algorithm's performance, and that the algorithm using the TD-error priority metric based on the minimum of the two target Q-value estimates performs best on multiple tasks. The proposed TD-error-based sampling method can be combined with any Q-value-estimation-based offline reinforcement learning method and is characterized by stable performance, simple implementation, and strong scalability.
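
    A minimal, self-contained sketch of the sampling mechanism described above is given below (an illustrative reconstruction, not the authors' released code): priorities are the absolute TD errors, sampling probabilities follow the proportional prioritization of prioritized experience replay, and the importance-sampling weights $ {w_i} = {(N \cdot P(i))^{ - \beta }}/{\max _i}{w_i} $ correct the bias introduced by preferential sampling. The class name, the exponents alpha and beta, and the buffer layout are assumptions made for illustration.

    import numpy as np

    class PrioritizedReplay:
        """Minimal proportional prioritized replay keyed by |TD error| (illustrative sketch)."""

        def __init__(self, capacity, alpha=0.6, beta=0.4, eps=1e-6):
            self.capacity, self.alpha, self.beta, self.eps = capacity, alpha, beta, eps
            self.data = []
            self.priorities = np.zeros(capacity, dtype=np.float64)
            self.pos = 0

        def add(self, transition):
            # New transitions receive the current maximum priority (1.0 for an empty buffer).
            max_p = self.priorities[:len(self.data)].max() if self.data else 1.0
            if len(self.data) < self.capacity:
                self.data.append(transition)
            else:
                self.data[self.pos] = transition
            self.priorities[self.pos] = max_p
            self.pos = (self.pos + 1) % self.capacity

        def sample(self, batch_size):
            # Proportional prioritization: P(i) = p_i^alpha / sum_k p_k^alpha.
            p = self.priorities[:len(self.data)] ** self.alpha
            probs = p / p.sum()
            idx = np.random.choice(len(self.data), batch_size, p=probs)
            # Importance-sampling weights, normalized by their maximum.
            w = (len(self.data) * probs[idx]) ** (-self.beta)
            w /= w.max()
            return [self.data[i] for i in idx], idx, w

        def update_priorities(self, idx, td_errors):
            # Priority is |TD error| plus a small constant so no transition has zero probability.
            self.priorities[idx] = np.abs(td_errors) + self.eps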

  • Figure 1. Framework of the proposed method and the detailed architecture of the network: (a) framework of the proposed method; (b) detailed architecture of the network

    Note: MLP denotes multilayer perceptron

    Figure 2. The three DMControl simulation environments used in the experiments: (a) Hopper; (b) HalfCheetah; (c) Walker2d

    Figure 3. Performance comparison of the proposed CQL_H with the CQL_PER and CQL_PER_N_return algorithms on three types of data in three environments: (a) Hopper; (b) HalfCheetah; (c) Walker2d

    Figure 4. Performance comparison of the proposed CQL_H under different sampling schemes on three types of data in three environments: (a) Hopper; (b) HalfCheetah; (c) Walker2d

    Figure 5. Performance comparison of CQL_H with the three TD-error priority metrics: (a) Hopper-medium; (b) HalfCheetah-medium; (c) Walker2d-medium

    Algorithm 1: TD-error-based sampling method for offline reinforcement learning (CQL version)
    Initialize: double Q-value networks $ {Q_{{\varphi _1}}} $ and $ {Q_{{\varphi _2}}} $, double target Q-value networks $ {Q_{\varphi {'_1}}} $ and $ {Q_{\varphi {'_2}}} $, policy network ${\pi _\theta }$ with learning rate $\eta $, learning rate $\zeta $ for the Q-network parameters $\varphi $, soft-update coefficient $\tau $ for the target Q-network parameters, segment length $H$, batch size $N$, initial priority set to 1, maximum number of training steps $T$, maximum number of prioritized-sampling steps ${T_p}$, standard replay buffer $B$, prioritized replay buffer ${B_p}$, accumulated gradient $\Delta $ of the Q-network parameters $\varphi $, and sample index $i$.
    For each training step $t < T$:
      If $t < {T_p}$ (i.e., sample from the prioritized replay buffer):
      1. Compute the priority sampling probabilities according to Eq. (15) and sample a batch of $N$ transitions from the prioritized replay buffer
      2. Compute the importance-sampling weights: ${w_i} = {(N \cdot P(i))^{ - \beta }}/{\max _i}{w_i}$
      3. Estimate the target Q value $ {Q_{{\text{target}}}}({s_t},{a_t}) $ according to Eq. (11), (12), or (13)
      4. Accumulate the Q-network gradient
      $\Delta \leftarrow \Delta + {w_i} \cdot {\nabla _\varphi }\left[ {{{\left( {{E_{({s_t},{a_t}) \sim \mathcal{D}}}\left[ {{Q_\varphi }({s_t},{a_t})} \right] - {Q_{{\text{target}}}}({s_t},{a_t})} \right)}^2}} \right]$
      5. Update the Q-value networks: $\varphi \leftarrow \varphi - \zeta \cdot \Delta $
      6. Softly update the target Q-value networks: $\varphi ' \leftarrow \tau \varphi + (1 - \tau )\varphi '$
      7. Update the policy network according to Eq. (16)
      Else (i.e., sample a batch of $N$ transitions uniformly from the standard replay buffer):
      8. Estimate the target Q value $ {Q_{{\text{target}}}}({s_t},{a_t}) $ according to Eq. (11), (12), or (13)
      9. Accumulate the Q-network gradient
      $\Delta \leftarrow \Delta + {\nabla _\varphi }\left[ {{{\left( {{E_{({s_t},{a_t}) \sim \mathcal{D}}}\left[ {{Q_\varphi }({s_t},{a_t})} \right] - {Q_{{\text{target}}}}({s_t},{a_t})} \right)}^2}} \right]$
      10. Update the Q-value networks: $\varphi \leftarrow \varphi - \zeta \cdot \Delta $
      11. Softly update the target Q-value networks: $\varphi ' \leftarrow \tau \varphi + (1 - \tau )\varphi '$
      12. Update the policy network according to Eq. (16)
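
    For concreteness, the sketch below shows one way the TD-error priority metric based on the minimum of the two target Q-value estimates (the variant reported to perform best) might be computed for a sampled batch. It assumes PyTorch critic modules q1, q1_target, q2_target that take a state–action pair and an actor that maps states to actions; these names are illustrative assumptions, not the authors' code.

    import torch

    def td_error_priority(q1, q1_target, q2_target, actor, batch, gamma=0.99):
        # batch: (state, action, reward, next_state, done) tensors; reward and done are shaped [N, 1].
        s, a, r, s2, done = batch
        with torch.no_grad():
            a2 = actor(s2)                                     # next action from the current policy
            q_next = torch.min(q1_target(s2, a2), q2_target(s2, a2))
            q_target = r + gamma * (1.0 - done) * q_next       # bootstrapped target from the min of the two target critics
            delta = q_target - q1(s, a)                        # TD error of the first critic
        return delta.abs().squeeze(-1)                         # |TD error| used as the refreshed priority

    The returned values would then be fed back into the prioritized buffer (e.g., an update_priorities call as in the earlier sketch), so that later draws favor transitions with large TD errors while the importance weights keep the Q-loss estimate unbiased.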

    Table 1. D4RL datasets used in the experiments

    Task          Dataset                       Samples / $10^4$
    Hopper        Hopper-random                 1
                  Hopper-medium                 1
                  Hopper-medium-expert          2
    HalfCheetah   HalfCheetah-random            1
                  HalfCheetah-medium            1
                  HalfCheetah-medium-expert     2
    Walker2d      Walker2d-random               1
                  Walker2d-medium               1
                  Walker2d-medium-expert        2
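
    The datasets in Table 1 come from the public D4RL benchmark and can be loaded as in the hedged sketch below; the dataset name and version suffix (e.g., -v0 or -v2) depend on the installed d4rl release, so the name used here is only an example.

    import gym
    import d4rl  # importing d4rl registers the offline datasets with gym

    env = gym.make('hopper-medium-v0')   # version suffix is an example; adjust to the installed release
    data = d4rl.qlearning_dataset(env)   # dict with observations, actions, next_observations, rewards, terminals

    print(data['observations'].shape, data['actions'].shape)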

Article history
  • Received: 2022-10-22
  • Available online: 2023-03-28
