Citation: SU Muqing, WANG Yin, PU Ruimin, YU Meng. Cooperative encirclement method for multiple unmanned ground vehicles based on reinforcement learning[J]. Chinese Journal of Engineering. DOI: 10.13374/j.issn2095-9389.2023.09.15.004

Cooperative encirclement method for multiple unmanned ground vehicles based on reinforcement learning

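  • Abstract: This paper studies cooperative encirclement by multiple unmanned ground vehicles (UGVs) and proposes a cooperative encirclement algorithm built on the soft actor–critic (SAC) framework. To address poor coordination among the UGVs, long short-term memory (LSTM) is added to the network structure to provide a memory function, helping each UGV use its sequence of historical observations to make more robust decisions. To address the larger state-space dimension and lower efficiency caused by introducing LSTM, an attention mechanism is introduced: attention weights are computed and selected over the state space so that attention concentrates on the key states relevant to the task. This constrains the state-space dimension, keeps the network stable, enables stable and efficient cooperation among the UGVs, and improves training efficiency. To address sparse rewards in the cooperative encirclement task, a mixed reward function divides the reward into individual and cooperative rewards, so that UGVs receive more frequent reward signals during encirclement. The individual reward motivates each UGV's motion by guiding it toward the target, whereas the cooperative reward motivates the group to complete the encirclement together, further improving convergence speed. Finally, simulations and experiments show that the method converges faster and that, compared with the SAC algorithm, encirclement time is reduced by 15.1% and the success rate is improved by 7.6%.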
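The abstract does not specify how the attention weights over the state space are computed. As a minimal sketch only, the snippet below assumes a standard scaled dot-product attention over per-entity features (for example, encodings of teammates and the target). The function names `softmax` and `attend`, and the split into query, keys, and values, are illustrative assumptions and not the authors' network.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector.

    query:  list[float], assumed to encode the ego vehicle's state
    keys:   list[list[float]], one vector per observed entity (assumption)
    values: list[list[float]], one vector per observed entity (assumption)
    Returns (weights, context): the attention weights and the weighted
    sum of the value vectors, whose size does not grow with entity count.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context
```

Calling `attend(query, keys, values)` with three observed entities returns three weights that sum to 1 and a fixed-size context vector. This is one way an attention step can keep the input to the policy network bounded as more vehicles are added.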

     

    Abstract: Collaborative encirclement by multiple unmanned ground vehicles (UGVs) is a central challenge in multiagent collaborative tasks and a fundamental problem underlying complex missions such as multiagent collaborative search and interception. Although optimization algorithms have produced a rich body of results on collaborative encirclement, they still suffer from poor real-time computational efficiency and weak robustness. Reinforcement learning holds considerable promise for multiagent sequential decision problems. This paper studies the collaborative encirclement of multiple UGVs using deep reinforcement learning. It establishes a kinematic model of the UGVs to describe the collaborative encirclement task, details the encirclement process, and develops escape strategies for the target UGV. It also addresses the challenges that arise as the number of UGVs grows and the environment becomes more complex: algorithmic instability, dimension explosion, and poor convergence. To this end, the paper introduces a collaborative encirclement algorithm based on the soft actor–critic (SAC) framework. To address poor collaboration and weak generalization among multiple UGVs, long short-term memory (LSTM) is incorporated into the network structure to give the UGVs a memory function. This helps them capture and use information from historical observation sequences, process time-series data effectively, make more accurate decisions, collaborate with one another, and improve system stability. To tackle the increased state-space dimension and low training efficiency in collaborative encirclement, an attention mechanism is introduced that computes and selects attention weights over the state space, focusing attention on the key states relevant to the task. This constrains the state-space dimension, ensures network stability, achieves stable and efficient collaboration among multiple UGVs, and improves training efficiency. To address sparse rewards in collaborative encirclement tasks, a mixed reward function is proposed that divides the reward into individual and collaborative rewards. Individual rewards guide UGVs toward the target, incentivizing their motion, whereas collaborative rewards motivate the group of UGVs to accomplish the encirclement task together. As a result, UGVs obtain more frequent reward signals, which speeds up convergence. Simulation and experimental results show that the proposed method converges faster than SAC, with a 15.1% reduction in encirclement time and a 7.6% improvement in success rate. Finally, the improved algorithm is deployed on a UGV platform, and real-world experiments in typical encirclement scenarios validate its feasibility and effectiveness on embedded systems.
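The abstract describes the mixed reward only in words. The sketch below is one hedged way to express it. The individual term is assumed to be the per-step reduction in distance to the target. The collaborative term is assumed to be a shared bonus paid when every pursuer is within a capture radius and no angular gap between neighbouring pursuers exceeds a threshold. The capture test, the parameter names (`capture_radius`, `max_gap`, `team_bonus`), and their default values are illustrative assumptions, not values from the paper.

```python
import math


def individual_reward(prev_dist, curr_dist, scale=1.0):
    """Dense shaping term: positive when the vehicle moved closer to the target."""
    return scale * (prev_dist - curr_dist)


def is_encircled(pursuers, target, capture_radius, max_gap):
    """Hypothetical capture test (assumption, not the paper's criterion):
    every pursuer lies within capture_radius of the target, and no angular
    gap between neighbouring pursuers exceeds max_gap (radians)."""
    if not pursuers:
        return False
    tx, ty = target
    angles = []
    for px, py in pursuers:
        if math.hypot(px - tx, py - ty) > capture_radius:
            return False
        angles.append(math.atan2(py - ty, px - tx))
    angles.sort()
    gaps = [b - a for a, b in zip(angles, angles[1:])]
    gaps.append(2 * math.pi - (angles[-1] - angles[0]))  # wrap-around gap
    return max(gaps) <= max_gap


def mixed_rewards(prev_pursuers, pursuers, target,
                  capture_radius=1.0, max_gap=math.pi, team_bonus=10.0):
    """Per-vehicle reward = individual shaping term + shared team bonus.

    For simplicity the target is treated as stationary within one step."""
    tx, ty = target
    shared = team_bonus if is_encircled(pursuers, target, capture_radius, max_gap) else 0.0
    rewards = []
    for (ox, oy), (nx, ny) in zip(prev_pursuers, pursuers):
        prev_d = math.hypot(ox - tx, oy - ty)
        curr_d = math.hypot(nx - tx, ny - ty)
        rewards.append(individual_reward(prev_d, curr_d) + shared)
    return rewards
```

With this shaping, each vehicle receives a nonzero signal on every step in which its distance to the target changes, while the team bonus arrives only when the formation closes around the target. This mirrors the split between frequent individual feedback and a group-level objective described in the abstract.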

     
