Abstract:
This study addresses rapid multi-agent search and coverage in unknown, dynamic environments by proposing an integrated distributed control architecture that combines reinforcement learning (RL) with model predictive control (MPC). In such environments, agents must explore quickly while contending with time-varying obstacles, incomplete prior maps, and limited onboard sensing, which makes it difficult to guarantee efficiency and safety simultaneously. Conventional model-based coverage controllers depend on accurate environmental models, whereas purely learning-based policies may issue unsafe commands when confronted with unseen dynamics. The objective is to enable the swarm to perform coverage missions safely and efficiently under uncertainty.

First, the proposed architecture employs the multi-agent deep deterministic policy gradient (MADDPG) algorithm to learn a cooperative coverage policy that generates a nominal coverage action for each agent. During training, a centralized critic uses joint observations and actions to stabilize learning, while the trained actors are deployed in a decentralized manner: at execution time, each agent computes its command from its own observation via the learned actor, with no need for a centralized planner. The coverage reward encourages rapid exploration of uncovered regions, coordination that minimizes redundant overlap, and smooth control effort, while penalties discourage collisions and excessive mode switching.

Then, an MPC module incorporating control barrier functions (CBFs) refines the nominal coverage commands to ensure compliance with safety constraints. These constraints are encoded as CBF inequalities that enforce obstacle avoidance, inter-agent separation, and optional workspace boundaries, while the MPC additionally accounts for actuator limits and short-horizon dynamics. At each time step, the controller solves a constrained optimization problem for the safe action closest to the RL-generated command, thereby preserving coverage performance when the risk level is low. This separation lets the learning module focus on coverage behavior, while the MPC-CBF module provides online safety certification without retraining the policy.

Finally, a dynamic weighting fusion mechanism enables agents to adaptively switch among three operational modes (full-domain coverage, dynamic avoidance, and emergency braking) to balance coverage efficiency against safety assurance. Specifically, the fusion weights are computed from predictive risk indicators, such as the minimum forecast distance to obstacles and neighboring agents and the feasibility margin of the CBF constraints. When the risk is low, the RL action dominates; as the risk increases, the MPC-CBF output gains influence; and in imminent-collision scenarios, an emergency braking command is triggered. Because the weights vary continuously, the scheme avoids hard switching and improves robustness.

We evaluated the framework using the dynamic-obstacle avoidance success rate as the primary metric and conducted two ablation studies to quantify the contribution of each component. In scenarios with moving obstacles, the complete framework achieves a higher obstacle-avoidance success rate than two ablated variants, one without the MPC-CBF module and one without the dynamic weighting fusion mechanism. These results underscore the roles of online safety certification and adaptive-mode fusion in achieving robust collision avoidance in unknown, dynamic environments.
In addition, semi-physical experiments confirmed the feasibility and scalability of the proposed method in real-world scenarios. In summary, the proposed RL-MPC integrated architecture offers a practical and extensible solution for safety-critical distributed coverage and rapid search in previously unseen, dynamic environments. The code sketches below illustrate the main components of the architecture under simplifying assumptions.
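To make the training setup concrete, the following is a minimal PyTorch sketch of the MADDPG pattern described above: decentralized actors map each agent's local observation to a bounded nominal command, while a centralized critic scores the joint observation-action pair during training only. Layer sizes, activations, and dimensions are illustrative assumptions; the abstract does not specify network details.

```python
# Hypothetical MADDPG network shapes; all hyperparameters are placeholders.
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Decentralized actor: executed onboard from local observations only."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),  # bounded nominal command
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

class CentralizedCritic(nn.Module):
    """Centralized critic: sees all agents' observations and actions, but is
    used during training only (centralized training, decentralized execution)."""
    def __init__(self, n_agents: int, obs_dim: int, act_dim: int, hidden: int = 128):
        super().__init__()
        joint_dim = n_agents * (obs_dim + act_dim)
        self.net = nn.Sequential(
            nn.Linear(joint_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # joint action-value Q
        )

    def forward(self, all_obs: torch.Tensor, all_acts: torch.Tensor) -> torch.Tensor:
        # all_obs: (batch, n_agents*obs_dim); all_acts: (batch, n_agents*act_dim)
        return self.net(torch.cat([all_obs, all_acts], dim=-1))
```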
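The coverage reward can be sketched as a weighted sum of the terms named above. The grid-cell coverage representation and every coefficient here are hypothetical stand-ins, not the paper's exact shaping.

```python
# Illustrative coverage reward; all weights are assumed, not from the paper.
import numpy as np

def coverage_reward(new_cells, overlap_cells, u, u_prev, collided, switched,
                    w_cov=1.0, w_ovl=0.2, w_eff=0.05, w_col=10.0, w_sw=0.5):
    r = w_cov * new_cells                          # reward newly covered cells
    r -= w_ovl * overlap_cells                     # penalize redundant overlap
    r -= w_eff * float(np.sum((u - u_prev) ** 2))  # encourage smooth control
    r -= w_col * float(collided)                   # collision penalty
    r -= w_sw * float(switched)                    # excessive mode-switching penalty
    return r
```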
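In its simplest single-step form, the MPC-CBF refinement reduces to a quadratic program that projects the RL command onto the safe set. The sketch below, written with cvxpy, assumes single-integrator dynamics x_dot = u, static obstacles, and a quadratic barrier; the full formulation also covers inter-agent separation, workspace boundaries, moving obstacles, and a multi-step horizon.

```python
# One-step CBF safety filter as a QP (simplifying assumptions noted above).
import numpy as np
import cvxpy as cp

def safety_filter(u_rl, x, obstacles, d_safe=0.5, alpha=1.0, u_max=1.0):
    """Return the safe action closest to the RL command u_rl.

    Each obstacle at position p induces a barrier h(x) = ||x - p||^2 - d_safe^2;
    with x_dot = u, safety requires h_dot = 2 (x - p)^T u >= -alpha * h(x).
    """
    u = cp.Variable(2)
    constraints = [cp.norm(u, "inf") <= u_max]         # actuator limits
    for p in obstacles:
        h = float(np.dot(x - p, x - p)) - d_safe ** 2  # barrier value
        constraints.append(2 * (x - p) @ u >= -alpha * h)  # CBF inequality
    prob = cp.Problem(cp.Minimize(cp.sum_squares(u - u_rl)), constraints)
    prob.solve()
    if u.value is None:                                # QP infeasible
        return np.zeros_like(u_rl)                     # fall back to braking
    return u.value
```

When no constraint is active, the minimizer is u_rl itself, so the filter is transparent at low risk and intervenes only near the safety boundary.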
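The dynamic weighting fusion can be illustrated with a smooth, risk-dependent blend of the two commands. The logistic weighting law and all thresholds below are assumptions; the abstract states only that the weights derive from predictive risk indicators such as the minimum forecast distance.

```python
# Hypothetical fusion law; d_brake, d_risk, and k are illustrative parameters.
import numpy as np

def fuse(u_rl, u_safe, d_min, d_brake=0.2, d_risk=1.0, k=8.0):
    """Blend the RL and MPC-CBF actions from the predictive risk indicator
    d_min, the minimum forecast distance to obstacles and neighbors."""
    if d_min < d_brake:                        # imminent collision
        return np.zeros_like(u_rl)             # emergency braking mode
    w = 1.0 / (1.0 + np.exp(-k * (d_risk - d_min)))  # risk weight in (0, 1)
    return (1.0 - w) * u_rl + w * u_safe       # smooth, no hard switching
```

Because w varies continuously with the forecast distance, the controller transitions gradually from the coverage policy to the certified safe action rather than switching modes abruptly.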