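A feature-screening pipeline based on collaboration between a large language model and machine learning models for accurate corrosion inhibitor prediction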

Yang Jingzhi; Liu Diandian; Gong Haiyan; Guo Xin; Jin Yuting; Ma Lingwei; Zhang Dawei; Li Xiaogang

doi:10.13374/j.issn2095-9389.2025.10.21.001

Collaborative feature screen with large language and machine learning model to enhance corrosion inhibitor prediction

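Summary: From industrial and agricultural production to defense technology, material corrosion reaches every sector of the national economy. It seriously threatens the service safety of facilities and equipment, causes enormous economic losses, and poses major risks to human life and health. Metal corrosion inhibitors can change the state of the metal surface and raise the activation energy barrier of the electrochemical reactions, thereby slowing the corrosion rate. Because inhibitors offer low dosage, low cost, and high efficiency, they have become one of the most widely used means of corrosion control. However, inhibitors are diverse, their mechanisms are complex, and their performance is closely tied to environmental factors. Traditional corrosion research methods, such as weight loss tests and electrochemical tests, usually demand considerable labor, materials, and time, which greatly hinders the design and application of high-performance inhibitors. A more efficient technical approach is needed to advance inhibitor research. In recent years, the development of materials genome engineering has been steering corrosion research away from empirical trial and error toward digital and intelligent methods: artificial intelligence can analyze existing data to predict across a vast unexplored space and to probe the latent relationships among material composition, structure, and properties. This work builds a feature-screening pipeline based on collaboration between a large language model (LLM) and machine learning models. Using systematic injection of corrosion knowledge, prompt design, and recursive screening, it narrows 209 feature descriptors down to the 13 most relevant to inhibition performance in CO₂-saturated environments; these descriptors cover molecular physicochemical properties, molecular structural properties, and environmental parameters. After screening, the mean squared error of the model predictions dropped from 121 to 11. Subsequent corrosion experiments verified the prediction accuracy and generalization ability of the model. The inhibitor feature-screening workflow and machine learning model developed here significantly improve the efficiency of developing high-performance inhibitors for CO₂ environments.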

Abstract: Corrosion affects every sector of the national economy, from industrial and agricultural production to defense technology. It poses a serious threat to the safety of equipment in service, leads to substantial economic losses, and presents significant risks to human life and health. Metal corrosion inhibitors can modify the surface characteristics of metals, increase the activation energy barrier of corrosion reactions, affect surface electrochemical behavior, and slow down the corrosion rate. These inhibitors have advantages such as low dosage, low cost, and high efficiency, making them one of the most widely used methods for corrosion control. However, inhibitors are diverse, their mechanisms are complex, and their performance is closely related to environmental factors. Conventional laboratory methods, such as precise weight loss analysis and electrochemical measurements (for example, potentiodynamic polarization and electrochemical impedance spectroscopy), are labor-intensive, time-consuming, and costly, which greatly hinders the design and application of high-performance inhibitors. There is an urgent need for a more efficient approach to advance inhibitor research. A recent paradigm shift driven by advancements in materials genome engineering (MGE) is enabling researchers to move beyond the traditional trial-and-error approach. By integrating high-throughput computational tools with fundamental chemical principles, MGE facilitates a more systematic and intelligent exploration of materials science. At the core of this transformation lies machine learning (ML), which serves as a powerful pattern recognition engine. ML algorithms can analyze vast historical experimental data to predict the performance of novel materials and uncover the often hidden, nonlinear relationships between molecular features and their functional properties. In this study, we developed a novel methodology that synergizes a state-of-the-art large language model (LLM) with a predictive ML framework.
The LLM was employed to systematically parse and extract meaningful molecular features from thousands of unstructured research papers and experimental datasets, specifically focusing on inhibitors used in CO₂-saturated environments. We constructed a comprehensive corrosion inhibitor research dataset by extracting 1152 data entries from 174 peer-reviewed articles on inhibitor development and application in CO₂-saturated environments. These entries contain detailed information on molecular structures, corrosion environment parameters, inhibitor concentrations, experimental temperatures, and inhibition efficiency metrics. Statistical analysis revealed that the target variables in our dataset exhibited relatively uniform distributions without significant skewness or clustering, indicating a balanced data structure that supports robust model training and generalization. Our methodology implements a two-stage feature selection strategy based on a collaborative large-small model pipeline. We first established a domain-specific knowledge framework by injecting corrosion science expertise into the DeepSeek-R1 LLM, enabling systematic analysis of unstructured scientific texts. This LLM-based approach allowed us to efficiently screen an initial set of 204 molecular descriptors down to 50 candidates that demonstrate clear relevance to CO₂ corrosion inhibition mechanisms. We then applied quantitative statistical techniques using a smaller specialized model to further refine the feature set through correlation analysis and recursive feature elimination. This two-phase process reduced the final feature count to 13 non-redundant descriptors that comprehensively captured the interplay between molecular structure, inhibitor concentration, and environmental parameters. With the 13 selected features, the mean squared error of the models fell from 121 to 11.
To validate our approach, we built a gradient boosting model incorporating both the selected molecular features and environmental parameters. We identified five representative molecules and their corresponding corrosion environments for experimental testing. The results demonstrated the good generalization ability of the model, confirming its potential for practical application in corrosion inhibitor design and development.
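
The second, quantitative stage described in the abstract (correlation analysis followed by recursive feature elimination) can be illustrated with a minimal sketch. It is not the authors' implementation: the data are synthetic, the feature names (`conc`, `temp`, `logP`, `logP_dup`, `noise`), the 0.9 correlation threshold, and all function names are hypothetical, and a simple linear model fitted by gradient descent stands in for the gradient boosting model so that the example needs only the Python standard library. In practice a library routine such as scikit-learn's `RFE` would be the more usual choice.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)

def drop_redundant(columns, target, threshold=0.9):
    """Correlation analysis: visit features from most to least correlated
    with the target and keep one only if it is not highly correlated
    with a feature that was already kept."""
    ranked = sorted(columns, key=lambda k: abs(pearson(columns[k], target)),
                    reverse=True)
    kept = []
    for name in ranked:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

def standardize(col):
    """Scale a column to zero mean and unit variance."""
    n = len(col)
    m = sum(col) / n
    s = math.sqrt(sum((v - m) ** 2 for v in col) / n) or 1.0
    return [(v - m) / s for v in col]

def fit_linear(cols, y, steps=500):
    """Least-squares weights found by batch gradient descent on
    standardized columns (stand-in for a gradient boosting model)."""
    n, p = len(y), len(cols)
    y_mean = sum(y) / n
    yc = [v - y_mean for v in y]
    lr = 0.5 / p  # small enough to stay stable for standardized inputs
    w = [0.0] * p
    for _ in range(steps):
        err = [sum(w[j] * cols[j][i] for j in range(p)) - yc[i] for i in range(n)]
        grads = [2.0 / n * sum(err[i] * cols[j][i] for i in range(n)) for j in range(p)]
        w = [w[j] - lr * grads[j] for j in range(p)]
    return w

def recursive_elimination(columns, target, names, n_keep):
    """Recursive feature elimination: refit, drop the feature with the
    smallest absolute weight, and repeat until n_keep features remain."""
    names = list(names)
    while len(names) > n_keep:
        w = fit_linear([standardize(columns[k]) for k in names], target)
        weakest = min(range(len(names)), key=lambda j: abs(w[j]))
        names.pop(weakest)
    return names

# Synthetic demonstration data (not taken from the paper).
rng = random.Random(0)
n = 200
data = {k: [rng.gauss(0, 1) for _ in range(n)]
        for k in ("conc", "temp", "logP", "noise")}
data["logP_dup"] = [v + 0.01 * rng.gauss(0, 1) for v in data["logP"]]  # near copy
y = [3 * c - 2 * t + p + 0.1 * rng.gauss(0, 1)
     for c, t, p in zip(data["conc"], data["temp"], data["logP"])]

survivors = drop_redundant(data, y, threshold=0.9)
selected = recursive_elimination(data, y, survivors, n_keep=3)
print("after correlation filter:", survivors)
print("after recursive elimination:", selected)
```

In this toy setting the correlation filter removes one of the two near-identical `logP` columns, and the elimination loop then discards the uninformative `noise` column. The LLM-based first stage is not modeled here, since it depends on prompts and domain knowledge that the abstract does not spell out.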
