Abstract:
Corrosion affects every sector of the national economy, from industrial and agricultural production to defense technology. It poses a serious threat to the safety of equipment in service, leads to substantial economic losses, and presents significant risks to human life and health. Metal corrosion inhibitors can modify the surface characteristics of metals, increase the activation energy barrier of corrosion reactions, affect surface electrochemical behavior, and slow down the corrosion rate. These inhibitors offer low dosage, low cost, and high efficiency, making them one of the most widely used methods for corrosion control. However, inhibitors are numerous and varied, and their mechanisms are complex and depend strongly on environmental factors. Conventional laboratory methods, such as precise weight-loss analysis and electrochemical measurements (potentiodynamic polarization and electrochemical impedance spectroscopy), are labor-intensive, time-consuming, and costly, which greatly hinders the design and application of high-performance inhibitors. There is an urgent need for a more efficient approach to advance inhibitor research. A recent paradigm shift driven by advancements in materials genome engineering (MGE) is enabling researchers to move beyond the traditional trial-and-error approach. By integrating high-throughput computational tools with fundamental chemical principles, MGE facilitates a more systematic and intelligent exploration of materials science. At the core of this transformation lies machine learning (ML), which serves as a powerful pattern recognition engine. ML algorithms can analyze large volumes of historical experimental data to predict the performance of novel materials and uncover the often hidden, nonlinear relationships between molecular features and their functional properties. In this study, we developed a novel methodology that synergizes a state-of-the-art large language model (LLM) with a predictive ML framework.
The LLM was employed to systematically parse and extract meaningful molecular features from thousands of unstructured research papers and experimental datasets, specifically focusing on inhibitors used in CO₂-saturated environments. We constructed a comprehensive corrosion inhibitor research dataset by extracting 1152 data entries from 174 peer-reviewed articles on inhibitor development and application in CO₂-saturated environments. These entries contain detailed information on molecular structures, corrosion environment parameters, inhibitor concentrations, experimental temperatures, and inhibition efficiency metrics. Statistical analysis revealed that the target variables in our dataset exhibited relatively uniform distributions without significant skewness or clustering, indicating a balanced data structure that supports robust model training and generalization. Our methodology implements a two-stage feature selection strategy based on a collaborative large-small model pipeline. We first established a domain-specific knowledge framework by injecting corrosion science expertise into the DeepSeek-R1 LLM, enabling systematic analysis of unstructured scientific texts. This LLM-based approach allowed us to efficiently screen an initial set of 204 molecular descriptors down to 50 candidates that demonstrate clear relevance to CO₂ corrosion inhibition mechanisms. We then applied quantitative statistical techniques using a smaller specialized model to further refine the feature set through correlation analysis and recursive feature elimination. This two-phase process reduced the final feature count to 13 non-redundant descriptors that comprehensively captured the interplay between molecular structure, inhibitor concentration, and environmental parameters. With the 13 selected features, the models' mean squared error fell from 121 to 11. To validate our approach, we built a gradient boosting model incorporating both the selected molecular features and environmental parameters. We identified five representative molecules and their corresponding corrosion environments for experimental testing. The results demonstrated that the model generalizes well, confirming its potential for practical application in corrosion inhibitor design and development.
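
As an illustration of the correlation-analysis part of the second feature selection stage, the following is a minimal sketch of a greedy redundancy filter. It is not the implementation used in this work. The descriptor names, the toy values, the 0.9 threshold, and the helper names `pearson` and `drop_redundant` are all assumptions made for this example. The recursive feature elimination step that follows the correlation analysis is not shown.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / sqrt(vx * vy)


def drop_redundant(features, target, threshold=0.9):
    """Greedy correlation filter (hypothetical helper).

    features: dict mapping descriptor name -> list of values
    target:   list of inhibition efficiencies
    Descriptors are visited in order of decreasing |r| with the target.
    A descriptor is kept only if its |r| with every descriptor already
    kept does not exceed the threshold.
    """
    ranked = sorted(
        features,
        key=lambda name: abs(pearson(features[name], target)),
        reverse=True,
    )
    kept = []
    for name in ranked:
        if all(abs(pearson(features[name], features[k])) <= threshold
               for k in kept):
            kept.append(name)
    return kept


# Made-up toy values for illustration only; not taken from the dataset.
toy_features = {
    "mol_weight": [100.0, 150.0, 200.0, 250.0, 300.0],
    "heavy_atoms": [7.0, 10.0, 14.0, 17.0, 21.0],  # nearly collinear with mol_weight
    "logp": [1.2, 0.4, 2.5, 0.9, 1.8],
    "conc_ppm": [10.0, 50.0, 20.0, 100.0, 40.0],
}
toy_efficiency = [55.0, 72.0, 68.0, 90.0, 85.0]

selected = drop_redundant(toy_features, toy_efficiency)
print(selected)
```

In this toy input, `mol_weight` and `heavy_atoms` are almost perfectly correlated, so the filter keeps only one of them and leaves the weakly correlated descriptors in place. A model-based elimination step and a gradient boosting regressor would then operate on the surviving columns.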