Differential privacy-preserving random forest algorithm and its application to steel materials
Abstract: Data-driven materials informatics is regarded as the fourth paradigm of materials research and development (R&D); it can greatly reduce R&D costs and shorten the R&D cycle. However, data-driven methods increase the risk of privacy disclosure when materials data are shared and used, exposing sensitive information such as key processes in materials R&D. Privacy-preserving machine learning is therefore a key issue in materials informatics. Mainstream privacy protection methods currently include differential privacy, secure multi-party computation, and federated learning. The differential privacy model provides a rigorous definition and a quantitative metric of privacy protection, and the noise it adds is independent of the data scale: only a small amount of noise is needed to achieve a high level of protection, which considerably improves data usability. Because the random forest is one of the most widely used models in materials informatics, a differential privacy-preserving random forest algorithm (DPRF) is proposed. DPRF introduces the Laplace mechanism and the exponential mechanism of differential privacy into the decision-tree-building process. First, the total privacy budget of the algorithm is set and divided equally among the decision trees. During tree building, the splitting feature of each node is selected at random by the exponential mechanism, and Laplace noise is added to the node sample counts, which realizes differential privacy protection for the random forest. Experiments on steel fatigue property prediction verify the effectiveness of DPRF under both centralized and distributed data storage. With differential privacy added, the coefficient of determination R2 of the predictions exceeds 0.8 for each target property under suitable privacy budgets, which differs little from the original random forest. In the distributed data storage scenario, the prediction R2 of each target property gradually increases as the privacy budget increases. Comparing different maximum tree depths in DPRF shows that the overall prediction accuracy first increases and then decreases as the maximum depth grows, with the best accuracy obtained at a maximum depth of 5. In summary, DPRF maintains good prediction accuracy while achieving differential privacy protection of the random forest. In a distributed, decentralized data environment in particular, DPRF can balance privacy-protection strength and prediction accuracy through parameters such as the privacy budget and tree depth, and therefore has broad application prospects.
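For readers unfamiliar with the two differential-privacy building blocks named above, the following minimal Python sketch (with hypothetical function names; it illustrates the standard mechanisms rather than the authors' implementation) shows how Laplace noise with scale Δf/ε is added to a numeric query and how the exponential mechanism draws a candidate with probability proportional to exp(εq/(2Δq)):

import numpy as np

def laplace_mechanism(value, sensitivity, epsilon, rng=None):
    # Perturb a numeric query result with Laplace noise of scale sensitivity / epsilon.
    rng = np.random.default_rng() if rng is None else rng
    return value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

def exponential_mechanism(candidates, scores, sensitivity, epsilon, rng=None):
    # Draw one candidate with probability proportional to exp(epsilon * score / (2 * sensitivity)).
    rng = np.random.default_rng() if rng is None else rng
    scores = np.asarray(scores, dtype=float)
    weights = np.exp(epsilon * (scores - scores.max()) / (2.0 * sensitivity))  # shift by the maximum for numerical stability
    return candidates[rng.choice(len(candidates), p=weights / weights.sum())]

# Example: release a counting query (sensitivity 1) under a budget of epsilon = 0.5
noisy_count = laplace_mechanism(128, sensitivity=1.0, epsilon=0.5)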
Table 1. Comparative analysis of differential privacy-preserving tree model algorithms
Algorithm | Basic model | Realization mechanism | Task | Data storage
SuLQ-based ID3 | Decision tree | Laplace | Classification | Centralized
DiffP-ID3 | Decision tree | Laplace & exponential | Classification | Centralized
DiffP-C4.5 | Decision tree | Laplace & exponential | Classification | Centralized
DiffPRF | Random forest | Laplace & exponential | Classification | Centralized
DiffPRFs | Random forest | Laplace & exponential | Classification | Centralized
DPRF | Random forest | Laplace & exponential | Regression | Centralized & distributed

Algorithm 1. The differential privacy-preserving DPRF algorithm
Input: training dataset D, feature set F, privacy budget B, number of decision trees T, maximum tree depth d, number of randomly selected features m at each split, and number of nodes N in the distributed case;
Output: a random forest satisfying ε-differential privacy;
Stopping condition: the number of decision trees built reaches T or the privacy budget is exhausted;
Procedure DPRF_fit (D,F,B,T,d,m)
1: Forest={};
2: Divide the total privacy budget equally among the trees; each decision tree receives a budget of $ \varepsilon ' = B/T $;
3: for i=1 to T //build the T trees in turn
4: Draw a data subset Dt from dataset D by sampling with replacement, and randomly select m features from the feature set F;
5: Distribute the privacy budget of the tree across its layers, so that each layer receives $\varepsilon '' = \dfrac{ {\varepsilon '} }{ {d + 1} }$;
6: ε=ε''/2;
7: Treei=BuildTree(Dt,m,ε,d,0); //the tree-building process (BuildTree) is as follows
8: if the current node meets the stopping condition, set it as a leaf node whose value is the mean of the target values of all samples in the leaf, |NDt|=|NDt|+Laplace(1/ε), and return the leaf node;
9: else
10: for each feature f in the m selected features
11: Split the dataset into left and right subsets by each value of the current feature, and record the value with the smallest mean absolute error (MAE) as the split_value of the feature;
12: Split the dataset with split_value and compute the feature score $\exp\left(\dfrac{\varepsilon }{2\Delta q}q\left({D}_{\mathrm{c} },f\right)\right)$;
13: Sum the feature scores of the m features; any feature f is selected as the splitting feature of the current node with probability $\dfrac{\exp\left(\dfrac{\varepsilon }{2\Delta q}q({D}_{\mathrm{c} },f)\right)}{ {\displaystyle\sum }_{1}^{m}\exp\left(\dfrac{\varepsilon }{2\Delta q}q({D}_{\mathrm{c} },f)\right)}$, where $ q({D}_{\mathrm{c}},f) $ is the availability (quality) function and $ \Delta q $ is its sensitivity;
14: Split the data into left and right subsets according to the split_value of the selected feature f, and continue building the tree on each subset;
15: Forest=Forest∪Treei;
16: end for
17: return Forest
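As a concrete illustration of the budget split in steps 5-6 and the private split selection in steps 11-13: with, for instance, B = 3.0 and T = 10 trees of maximum depth d = 5, each tree receives ε' = 0.3, each layer ε'' = 0.3/(5 + 1) = 0.05, and ε = 0.025 is spent on choosing the splitting feature at a node. The Python sketch below (hypothetical helper names, a simplified reading of the pseudocode rather than the authors' code) scores each candidate feature by the negative MAE of its best split, which is one natural choice of the availability function q(D_c, f), draws the splitting feature with the exponential mechanism, and builds leaves with a Laplace-noised sample count as in step 8:

import numpy as np

def best_split(X, y, feature):
    # Try each observed value of the feature as a threshold and keep the one with the
    # smallest mean absolute error of the two children; return the threshold and -MAE,
    # the latter serving as the availability score q(D_c, f) in this sketch.
    best_mae, best_value = np.inf, None
    for value in np.unique(X[:, feature]):
        left, right = y[X[:, feature] <= value], y[X[:, feature] > value]
        if len(left) == 0 or len(right) == 0:
            continue
        mae = (np.abs(left - left.mean()).sum() + np.abs(right - right.mean()).sum()) / len(y)
        if mae < best_mae:
            best_mae, best_value = mae, value
    return best_value, -best_mae

def choose_split_feature(X, y, candidate_features, epsilon, sensitivity, rng):
    # Exponential mechanism over the m candidate features (steps 12-13): feature f is
    # drawn with probability proportional to exp(epsilon * q(D_c, f) / (2 * sensitivity)).
    splits = [best_split(X, y, f) for f in candidate_features]
    scores = np.array([score for _, score in splits])
    weights = np.exp(epsilon * (scores - scores.max()) / (2.0 * sensitivity))
    idx = rng.choice(len(candidate_features), p=weights / weights.sum())
    return candidate_features[idx], splits[idx][0]

def make_leaf(y, epsilon, rng):
    # Step 8: the leaf keeps the mean target value; its sample count is released with
    # Laplace(1/epsilon) noise, since the sensitivity of a counting query is 1.
    return {"value": float(y.mean()), "noisy_count": len(y) + rng.laplace(scale=1.0 / epsilon)}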
Procedure predict (Forest, Dtest)
1: Result={};
2: for d in Dtest
3: sum_predict=0;
4: for tree in Forest
5: Traverse the current tree until a leaf node is reached and obtain the prediction predict_value;
6: sum_predict+=predict_value;
7: res=sum_predict/length(Forest);
8: Result=Result∪res;
9: return Result
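A minimal sketch of the prediction step (assuming, for illustration only, that each internal node is stored as a dict with keys "feature", "split_value", "left" and "right", and each leaf carries a "value"; this data layout is an assumption, not taken from the paper):

def predict_tree(node, x):
    # Route the sample from the root to a leaf and return the leaf's mean target value.
    while "value" not in node:
        node = node["left"] if x[node["feature"]] <= node["split_value"] else node["right"]
    return node["value"]

def predict_forest(forest, X):
    # Average the per-tree predictions, as in the predict procedure above.
    return [sum(predict_tree(tree, x) for tree in forest) / len(forest) for x in X]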
Procedure Distributed_fit (F,B,T,d,m)
1: Forest_Distributed={};
2: Divide the total privacy budget equally among the N nodes; each node receives a budget of E=B/N;
3: for i=1 to N
4: Let Di be the dataset of node i;
5: foresti=DPRF_fit (Di,F,E,T,d,m);
6: Forest_Distributed = Forest_Distributed∪foresti;
7: return Forest_Distributed
Procedure Distributed_Predict(D, Forest_Distributed)
1: Result=0;
2: for i=1 to N
3: r=predict(foresti,D);
4: Result+=r;
5: Result=Result/N;
6: return Result
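In the distributed setting, each of the N data-holding nodes trains its own DPRF on local data with a budget of B/N, and prediction averages the outputs of the N local forests. A minimal orchestration sketch on top of the hypothetical helpers above (DPRF_fit stands for any local implementation of Algorithm 1 and is not code from the paper):

def distributed_fit(local_datasets, F, B, T, d, m):
    # local_datasets: list of (X_i, y_i) pairs held by the N nodes; each node trains a
    # local differentially private forest with a budget of B / N.
    node_budget = B / len(local_datasets)
    return [DPRF_fit(X_i, y_i, F, node_budget, T, d, m) for X_i, y_i in local_datasets]

def distributed_predict(local_forests, X):
    # Average the predictions of the N locally trained forests for every test sample.
    per_node = [predict_forest(forest, X) for forest in local_forests]
    return [sum(values) / len(per_node) for values in zip(*per_node)]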
Table 2. Descriptor information of the NIMS steel fatigue dataset
Feature | Description | Minimum value | Maximum value | Mean value | Standard deviation
NT | Normalizing temperature | 825 | 900 | 865.6 | 17.37
QT | Hardening temperature | 825 | 865 | 846.2 | 9.86
TT | Tempering temperature | 550 | 680 | 605 | 42.4
C | Carbon content | 0.28 | 0.57 | 0.407 | 0.061
Si | Silicon content | 0.16 | 0.35 | 0.258 | 0.034
Mn | Manganese content | 0.37 | 1.3 | 0.849 | 0.294
P | Phosphorus content | 0.007 | 0.031 | 0.016 | 0.005
S | Sulfur content | 0.003 | 0.03 | 0.014 | 0.006
Ni | Nickel content | 0.01 | 2.78 | 0.548 | 0.899
Cr | Chromium content | 0.01 | 1.12 | 0.556 | 0.419
Cu | Copper content | 0.01 | 0.22 | 0.064 | 0.045
Mo | Molybdenum content | 0 | 0.24 | 0.066 | 0.089
RR | Reduction ratio | 420 | 5530 | 971.2 | 601.4
dA | Plastic inclusion | 0 | 0.13 | 0.047 | 0.032
dB | Discontinuous inclusions | 0 | 0.05 | 0.003 | 0.009
dC | Isolated inclusion | 0 | 0.04 | 0.008 | 0.01
Table 3. Predictive results of target properties with random forest and DPRF
Model and privacy budget | R2 (Fatigue) | R2 (Tensile) | R2 (Fracture) | R2 (Hardness)
RF | 0.9059 | 0.9282 | 0.9252 | 0.9193
DPRF (ε=0.1) | 0.6588 | 0.6469 | 0.7588 | 0.6565
DPRF (ε=0.25) | 0.6930 | 0.6906 | 0.7721 | 0.7008
DPRF (ε=0.5) | 0.7704 | 0.7605 | 0.7918 | 0.7593
DPRF (ε=1.0) | 0.8035 | 0.8105 | 0.8219 | 0.8094
DPRF (ε=3.0) | 0.8249 | 0.8270 | 0.8461 | 0.8399
DPRF (ε=10.0) | 0.8527 | 0.8462 | 0.8852 | 0.8641
Table 4. Predictive results of target properties under different privacy budgets
ε | R2 (Fatigue) | R2 (Tensile) | R2 (Fracture) | R2 (Hardness)
0.3 | 0.6153 | 0.6030 | 0.6979 | 0.6139
0.75 | 0.6563 | 0.6748 | 0.7585 | 0.6502
1.5 | 0.7038 | 0.7448 | 0.8082 | 0.7308
2.25 | 0.7615 | 0.7773 | 0.8377 | 0.7618
3.0 | 0.7981 | 0.8025 | 0.8491 | 0.8017
9.0 | 0.8130 | 0.8380 | 0.8677 | 0.8429
Table 5. Predictive results of each target property under different tree depths
d | R2 (Fatigue) | R2 (Tensile) | R2 (Fracture) | R2 (Hardness)
3 | 0.6027 | 0.6113 | 0.6796 | 0.6387
4 | 0.7088 | 0.7061 | 0.7951 | 0.7183
5 | 0.7961 | 0.8025 | 0.8491 | 0.8017
6 | 0.7560 | 0.7605 | 0.8568 | 0.7659
7 | 0.6920 | 0.7427 | 0.8251 | 0.7303
References
[1] Zhou S G, Li F, Tao Y F, et al. Privacy preservation in database applications: A survey. Chin J Comput, 2009, 32(5): 847 doi: 10.3724/SP.J.1016.2009.00847
[2] Sweeney L. k-anonymity: A model for protecting privacy. Int J Uncertain Fuzziness Knowl Based Syst, 2002, 10(5): 557 doi: 10.1142/S0218488502001648
[3] Du W L, Atallah M J. Secure multi-party computation problems and their applications: A review and open problems // Proceedings of the 2001 Workshop on New Security Paradigms. Cloudcroft, 2001: 13
[4] Konečný J, McMahan H B, Yu F X, et al. Federated learning: Strategies for improving communication efficiency [J/OL]. ArXiv Preprint (2017-10-30) [2022-5-29]. https://arxiv.org/abs/1610.05492
[5] Dwork C. Differential privacy // Proceedings of the 33rd International Conference on Automata, Languages and Programming. New York, 2006: 1
[6] Xiong J, Zhang T Y, Shi S Q. Machine learning of mechanical properties of steels. Sci China Technol Sci, 2020, 63(7): 1247 doi: 10.1007/s11431-020-1599-5
[7] Dai M Y, Hu J M. Field-free spin-orbit torque perpendicular magnetization switching in ultrathin nanostructures. Npj Comput Mater, 2020, 6: 78 doi: 10.1038/s41524-020-0347-0
[8] Huber L, Hadian R, Grabowski B, et al. A machine learning approach to model solute grain boundary segregation. Npj Comput Mater, 2018, 4: 64 doi: 10.1038/s41524-018-0122-7
[9] Choudhary K, Garrity K F, Sharma V, et al. High-throughput density functional perturbation theory and machine learning predictions of infrared, piezoelectric, and dielectric responses. Npj Comput Mater, 2020, 6: 64 doi: 10.1038/s41524-020-0337-2
[10] Bartel C J, Trewartha A, Wang Q, et al. A critical examination of compound stability predictions from machine-learned formation energies. Npj Comput Mater, 2020, 6: 97 doi: 10.1038/s41524-020-00362-y
[11] Tang S L, Meng Y, Wang G Q, et al. Extraction of metamorphic minerals by multiscale segmentation combined with random forest. Chin J Eng, 2022, 44(2): 170 doi: 10.3321/j.issn.1001-053X.2022.2.bjkjdxxb202202002
[12] Chen L, Fu D M. Processing and modeling dual-rate sampled data in seawater corrosion monitoring of low alloy steels. Chin J Eng, 2022, 44(1): 95 doi: 10.3321/j.issn.1001-053X.2022.1.bjkjdxxb202201009
[13] Sigmund G, Gharasoo M, Hüffer T, et al. Deep learning neural network approach for predicting the sorption of ionizable and polar organic pollutants to a wide range of carbonaceous materials. Environ Sci Technol, 2020, 54(7): 4583 doi: 10.1021/acs.est.9b06287
[14] Le T D, Noumeir R, Quach H L, et al. Critical temperature prediction for a superconductor: A variational Bayesian neural network approach. IEEE Trans Appl Supercond, 2020, 30(4): 1
[15] Wei M, Wang Q, Ye M, et al. An indirect remaining useful life prediction of lithium-ion batteries based on a NARX dynamic neural network. Chin J Eng, 2022, 44(3): 380 doi: 10.3321/j.issn.1001-053X.2022.3.bjkjdxxb202203007
[16] De Cock M, Dowsley R, Horst C, et al. Efficient and private scoring of decision trees, support vector machines and logistic regression models based on pre-computation. IEEE Trans Dependable Secure Comput, 2019, 16(2): 217 doi: 10.1109/TDSC.2017.2679189
[17] Wu Y C, Cai S F, Xiao X K, et al. Privacy preserving vertical federated learning for tree-based models [J/OL]. ArXiv Preprint (2020-08-14) [2020-05-29]. https://arxiv.org/abs/2008.06170
[18] Liu Y, Liu Y T, Liu Z J, et al. Federated forest. IEEE Trans Big Data, 2022, 8(3): 843 doi: 10.1109/TBDATA.2020.2992755
[19] Cheng K W, Fan T, Jin Y L, et al. SecureBoost: A lossless federated learning framework. IEEE Intell Syst, 2021, 36(6): 87 doi: 10.1109/MIS.2021.3082561
[20] Blum A, Dwork C, McSherry F, et al. Practical privacy: The SuLQ framework // Proceedings of the Twenty-Fourth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. Baltimore, 2005: 128
[21] Friedman A, Schuster A. Data mining with differential privacy // Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Washington, 2010: 493
[22] Patil A, Singh S. Differential private random forest // 2014 International Conference on Advances in Computing, Communications and Informatics (ICACCI). Delhi, 2014: 2623
[23] Mu H R, Ding L P, Song Y N, et al. DiffPRFs: Random forest under differential privacy. J Commun, 2016, 37(9): 175 doi: 10.11959/j.issn.1000-436x.2016169
[24] Breiman L. Random forests. Mach Learn, 2001, 45(1): 5 doi: 10.1023/A:1010933404324
[25] Dwork C, McSherry F, Nissim K, et al. Calibrating noise to sensitivity in private data analysis. J Priv Confidentiality, 2017, 7(3): 17 doi: 10.29012/jpc.v7i3.405
[26] McSherry F, Talwar K. Mechanism design via differential privacy // 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS'07). Providence, 2007: 94
[27] Kairouz P, Oh S, Viswanath P. The composition theorem for differential privacy. IEEE Trans Inf Theory, 2017, 63(6): 4037 doi: 10.1109/TIT.2017.2685505
[28] Agrawal A, Choudhary A. An online tool for predicting fatigue strength of steel alloys based on ensemble data mining. Int J Fatigue, 2018, 113: 389 doi: 10.1016/j.ijfatigue.2018.04.017