With the increasing complexity of task execution and the diverse range of environmental conditions, a single unmanned aerial vehicle (UAV) is often insufficient to meet practical mission requirements. Multi-UAV systems have vast potential for applications in areas such as search and rescue. During the execution of search and rescue missions, UAVs acquire the location of the target to be rescued and subsequently plan a path that avoids obstacles and leads to the target point. Traditional path planning algorithms require prior knowledge of the obstacle distribution in the map, which may be challenging to obtain in real-world missions. To address the reliance of traditional path planning algorithms on prior map information, this paper proposes a reinforcement learning-based approach for collaborative exploration of multiple UAVs in unknown environments. Firstly, considering the characteristics of collaborative exploration tasks and the various constraints of UAV clusters, a Markov decision process (MDP) is employed to establish a game model and task objectives for the UAV cluster. The UAVs must satisfy dynamic and obstacle avoidance constraints during mission execution, with the objective of maximizing the search and rescue success rate. Secondly, a reinforcement learning-based method for collaborative exploration of multiple UAVs is proposed. The Multi-Agent Soft Actor-Critic (MASAC) algorithm is utilized to iteratively train the UAVs' collaborative exploration strategies: the Actor network generates UAV actions, while the Critic network evaluates the quality of these strategies. To enhance the algorithm's generalization capability, training is conducted in randomly generated map environments. To prevent UAVs from becoming trapped by concave obstacles, a breadth-first search algorithm is used to calculate rewards based on the path distance between each UAV and the target, rather than the geometric (straight-line) distance.
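The BFS-based path distance used in the reward can be sketched as follows. This is a minimal illustration on a 2D occupancy grid, not the paper's implementation; the grid encoding (0 = free, 1 = obstacle), 4-connectivity, and function name are assumptions made here for concreteness. The example grid contains a concave (U-shaped) obstacle, where straight-line distance would badly underestimate the true path length:

```python
from collections import deque

def bfs_path_distance(grid, start, goal):
    """Shortest 4-connected path length from start to goal on an
    occupancy grid (0 = free, 1 = obstacle); None if unreachable."""
    rows, cols = len(grid), len(grid[0])
    if grid[start[0]][start[1]] or grid[goal[0]][goal[1]]:
        return None
    dist = {start: 0}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        if (r, c) == goal:
            return dist[(r, c)]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and (nr, nc) not in dist):
                dist[(nr, nc)] = dist[(r, c)] + 1
                queue.append((nr, nc))
    return None

# U-shaped obstacle open at the top; the target cell (2, 2) sits
# inside the concavity.
grid = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
# From (4, 2) the straight-line gap to the target is only 2 cells,
# but the BFS path around the obstacle has length 10.
```

Rewarding progress along this path distance, rather than the geometric distance, avoids local optima in which a UAV hovers against the outside of a concave obstacle while appearing close to the target.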
During the exploration process, each UAV continuously collects map information and shares it with all other UAVs. Each UAV makes its own action decisions based on its local observations and the information shared by the other agents, and the mission is considered successful when multiple UAVs hover above the target. Finally, a virtual simulation platform for algorithm validation is developed using the Unity game engine. The proposed algorithm is implemented in PyTorch, and bidirectional interaction between the Unity environment and the Python algorithm is achieved through the ML-Agents framework. Comparative experiments are conducted on the virtual simulation platform, comparing the proposed method against a non-cooperative single-agent SAC algorithm. The proposed method demonstrates advantages in terms of task success rate, task completion efficiency, and episode rewards, validating the feasibility and effectiveness of the approach.
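The map-sharing and mission-success logic described above can be abstracted as a small sketch. The data structures (a shared map as a dictionary from grid cell to label) and the success predicate (a minimum number of UAVs at the target cell) are illustrative assumptions, not details taken from the paper:

```python
def merge_observations(shared_map, local_scans):
    """Fuse each UAV's newly observed cells into the shared map.
    shared_map and each scan map grid cell -> 'free' or 'obstacle'."""
    for scan in local_scans:
        shared_map.update(scan)
    return shared_map

def mission_success(uav_positions, target, min_uavs=2):
    """Success when at least min_uavs UAVs hover above the target cell."""
    return sum(pos == target for pos in uav_positions) >= min_uavs

# Two UAVs report overlapping scans; the shared map holds the union.
shared = {}
scans = [
    {(0, 0): 'free', (0, 1): 'obstacle'},
    {(1, 0): 'free', (0, 1): 'obstacle'},
]
merge_observations(shared, scans)
```

In the actual system each UAV would feed the fused map, together with its own state, into its Actor network at every decision step; this sketch only captures the information flow.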