Differential privacy-preserving random forest algorithm and its application to steel materials
Abstract: Data-driven materials informatics is regarded as the fourth paradigm of materials research and development (R&D); it can greatly reduce R&D costs and shorten the R&D cycle. However, data-driven methods increase the risk of privacy disclosure when materials data are shared and used, exposing sensitive information such as key processes in materials R&D. Privacy-preserving machine learning is therefore a key issue in materials informatics. Mainstream privacy protection methods currently include differential privacy, secure multi-party computation, and federated learning. The differential privacy model provides a rigorous definition and a quantitative metric of privacy protection, and the noise it adds is independent of the data scale: only a small amount of noise is needed to achieve a high level of protection, which considerably improves data usability. Because the random forest is one of the most widely used models in materials informatics, a differential privacy-preserving random forest algorithm (DPRF) is proposed. DPRF introduces the Laplace mechanism and the exponential mechanism of differential privacy into the decision-tree-building process. First, the total privacy budget of the algorithm is set and divided equally among the decision trees. During tree building, the splitting feature of each node is selected at random by the exponential mechanism, and Laplace noise is added to the node sample counts, which realizes differential privacy protection for the random forest. Experiments on steel fatigue property prediction verify the effectiveness of DPRF under both centralized and distributed data storage. With differential privacy added, the coefficient of determination R2 of the predictions exceeds 0.8 for each target property under suitable privacy budgets, which differs little from the original random forest. In the distributed data storage scenario, the prediction R2 of each target property gradually increases as the privacy budget increases. Comparing different maximum tree depths in DPRF shows that the overall prediction accuracy first increases and then decreases as the maximum depth grows, with the best accuracy obtained at a maximum depth of 5. In summary, DPRF maintains good prediction accuracy while achieving differential privacy protection of the random forest. In a distributed, decentralized data environment in particular, DPRF can balance privacy-protection strength and prediction accuracy through parameters such as the privacy budget and tree depth, and therefore has broad application prospects.
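For readers unfamiliar with the two differential-privacy building blocks named above, the following minimal Python sketch (with hypothetical function names; it illustrates the standard mechanisms rather than the authors' implementation) shows how Laplace noise with scale Δf/ε is added to a numeric query and how the exponential mechanism draws a candidate with probability proportional to exp(εq/(2Δq)):

import numpy as np

def laplace_mechanism(value, sensitivity, epsilon, rng=None):
    # Perturb a numeric query result with Laplace noise of scale sensitivity / epsilon.
    rng = np.random.default_rng() if rng is None else rng
    return value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

def exponential_mechanism(candidates, scores, sensitivity, epsilon, rng=None):
    # Draw one candidate with probability proportional to exp(epsilon * score / (2 * sensitivity)).
    rng = np.random.default_rng() if rng is None else rng
    scores = np.asarray(scores, dtype=float)
    weights = np.exp(epsilon * (scores - scores.max()) / (2.0 * sensitivity))  # shift by the maximum for numerical stability
    return candidates[rng.choice(len(candidates), p=weights / weights.sum())]

# Example: release a counting query (sensitivity 1) under a budget of epsilon = 0.5
noisy_count = laplace_mechanism(128, sensitivity=1.0, epsilon=0.5)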
Table 1. Comparative analysis of differential privacy-preserving tree model algorithms
Algorithm | Basic model | Realization mechanism | Task | Data storage
SuLQ-based ID3 | Decision tree | Laplace | Classification | Centralized
DiffP-ID3 | Decision tree | Laplace & exponential | Classification | Centralized
DiffP-C4.5 | Decision tree | Laplace & exponential | Classification | Centralized
DiffPRF | Random forest | Laplace & exponential | Classification | Centralized
DiffPRFs | Random forest | Laplace & exponential | Classification | Centralized
DPRF | Random forest | Laplace & exponential | Regression | Centralized & distributed

Algorithm 1. The differential privacy-preserving DPRF algorithm
Input: training dataset D, feature set F, privacy budget B, number of decision trees T, maximum tree depth d, number of randomly selected features m at each split, and number of nodes N in the distributed case;
Output: a random forest satisfying ε-differential privacy;
Stopping condition: the number of decision trees built reaches T or the privacy budget is exhausted;
Procedure DPRF_fit (D,F,B,T,d,m)
1: Forest={};
2: Divide the total privacy budget equally among the trees; each decision tree receives a budget of $ \varepsilon ' = B/T $;
3: for i=1 to T //build the T trees in turn
4: Draw a data subset Dt from dataset D by sampling with replacement, and randomly select m features from the feature set F;
5: Distribute the privacy budget of the tree across its layers, so that each layer receives $\varepsilon '' = \dfrac{ {\varepsilon '} }{ {d + 1} }$;
6: ε=ε''/2;
7: Treei=BuildTree(Dt,m,ε,d,0); //the tree-building process (BuildTree) is as follows
8: if the current node meets the stopping condition, set it as a leaf node whose value is the mean of the target values of all samples in the leaf, |NDt|=|NDt|+Laplace(1/ε), and return the leaf node;
9: else
10: for each feature f in the m selected features
11: Split the dataset into left and right subsets by each value of the current feature, and record the value with the smallest mean absolute error (MAE) as the split_value of the feature;
12: Split the dataset with split_value and compute the feature score $\exp\left(\dfrac{\varepsilon }{2\Delta q}q\left({D}_{\mathrm{c} },f\right)\right)$;
13: Sum the feature scores of the m features; any feature f is selected as the splitting feature of the current node with probability $\dfrac{\exp\left(\dfrac{\varepsilon }{2\Delta q}q({D}_{\mathrm{c} },f)\right)}{ {\displaystyle\sum }_{1}^{m}\exp\left(\dfrac{\varepsilon }{2\Delta q}q({D}_{\mathrm{c} },f)\right)}$, where $ q({D}_{\mathrm{c}},f) $ is the availability (quality) function and $ \Delta q $ is its sensitivity;
14: Split the data into left and right subsets according to the split_value of the selected feature f, and continue building the tree on each subset;
15: Forest=Forest∪Treei;
16: end for
17: return Forest
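As a concrete illustration of the budget split in steps 5-6 and the private split selection in steps 11-13: with, for instance, B = 3.0 and T = 10 trees of maximum depth d = 5, each tree receives ε' = 0.3, each layer ε'' = 0.3/(5 + 1) = 0.05, and ε = 0.025 is spent on choosing the splitting feature at a node. The Python sketch below (hypothetical helper names, a simplified reading of the pseudocode rather than the authors' code) scores each candidate feature by the negative MAE of its best split, which is one natural choice of the availability function q(D_c, f), draws the splitting feature with the exponential mechanism, and builds leaves with a Laplace-noised sample count as in step 8:

import numpy as np

def best_split(X, y, feature):
    # Try each observed value of the feature as a threshold and keep the one with the
    # smallest mean absolute error of the two children; return the threshold and -MAE,
    # the latter serving as the availability score q(D_c, f) in this sketch.
    best_mae, best_value = np.inf, None
    for value in np.unique(X[:, feature]):
        left, right = y[X[:, feature] <= value], y[X[:, feature] > value]
        if len(left) == 0 or len(right) == 0:
            continue
        mae = (np.abs(left - left.mean()).sum() + np.abs(right - right.mean()).sum()) / len(y)
        if mae < best_mae:
            best_mae, best_value = mae, value
    return best_value, -best_mae

def choose_split_feature(X, y, candidate_features, epsilon, sensitivity, rng):
    # Exponential mechanism over the m candidate features (steps 12-13): feature f is
    # drawn with probability proportional to exp(epsilon * q(D_c, f) / (2 * sensitivity)).
    splits = [best_split(X, y, f) for f in candidate_features]
    scores = np.array([score for _, score in splits])
    weights = np.exp(epsilon * (scores - scores.max()) / (2.0 * sensitivity))
    idx = rng.choice(len(candidate_features), p=weights / weights.sum())
    return candidate_features[idx], splits[idx][0]

def make_leaf(y, epsilon, rng):
    # Step 8: the leaf keeps the mean target value; its sample count is released with
    # Laplace(1/epsilon) noise, since the sensitivity of a counting query is 1.
    return {"value": float(y.mean()), "noisy_count": len(y) + rng.laplace(scale=1.0 / epsilon)}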
Procedure predict (Forest, Dtest)
1: Result={};
2: for d in Dtest
3: sum_predict=0;
4: for tree in Forest
5: Traverse the current tree until a leaf node is reached and obtain the prediction predict_value;
6: sum_predict+=predict_value;
7: res=sum_predict/length(Forest);
8: Result=Result∪res;
9: return Result
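A minimal sketch of the prediction step (assuming, for illustration only, that each internal node is stored as a dict with keys "feature", "split_value", "left" and "right", and each leaf carries a "value"; this data layout is an assumption, not taken from the paper):

def predict_tree(node, x):
    # Route the sample from the root to a leaf and return the leaf's mean target value.
    while "value" not in node:
        node = node["left"] if x[node["feature"]] <= node["split_value"] else node["right"]
    return node["value"]

def predict_forest(forest, X):
    # Average the per-tree predictions, as in the predict procedure above.
    return [sum(predict_tree(tree, x) for tree in forest) / len(forest) for x in X]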
Procedure Distributed_fit (F,B,T,d,m)
1: Forest_Distributed={};
2: Divide the total privacy budget equally among the N nodes; each node receives a budget of E=B/N;
3: for i=1 to N
4: Let Di be the dataset of node i;
5: foresti=DPRF_fit (Di,F,E,T,d,m);
6: Forest_Distributed = Forest_Distributed∪foresti;
7: return Forest_Distributed
Procedure Distributed_Predict(D, Forest_Distributed)
1: Result=0;
2: for i=1 to N
3: r=predict(foresti,D);
4: Result+=r;
5: Result=Result/N;
6: return Result
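In the distributed setting, each of the N data-holding nodes trains its own DPRF on local data with a budget of B/N, and prediction averages the outputs of the N local forests. A minimal orchestration sketch on top of the hypothetical helpers above (DPRF_fit stands for any local implementation of Algorithm 1 and is not code from the paper):

def distributed_fit(local_datasets, F, B, T, d, m):
    # local_datasets: list of (X_i, y_i) pairs held by the N nodes; each node trains a
    # local differentially private forest with a budget of B / N.
    node_budget = B / len(local_datasets)
    return [DPRF_fit(X_i, y_i, F, node_budget, T, d, m) for X_i, y_i in local_datasets]

def distributed_predict(local_forests, X):
    # Average the predictions of the N locally trained forests for every test sample.
    per_node = [predict_forest(forest, X) for forest in local_forests]
    return [sum(values) / len(per_node) for values in zip(*per_node)]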
Table 2. Descriptor information of the NIMS steel fatigue dataset
Feature | Description | Minimum value | Maximum value | Mean value | Standard deviation
NT | Normalizing temperature | 825 | 900 | 865.6 | 17.37
QT | Hardening temperature | 825 | 865 | 846.2 | 9.86
TT | Tempering temperature | 550 | 680 | 605 | 42.4
C | Carbon content | 0.28 | 0.57 | 0.407 | 0.061
Si | Silicon content | 0.16 | 0.35 | 0.258 | 0.034
Mn | Manganese content | 0.37 | 1.3 | 0.849 | 0.294
P | Phosphorus content | 0.007 | 0.031 | 0.016 | 0.005
S | Sulfur content | 0.003 | 0.03 | 0.014 | 0.006
Ni | Nickel content | 0.01 | 2.78 | 0.548 | 0.899
Cr | Chromium content | 0.01 | 1.12 | 0.556 | 0.419
Cu | Copper content | 0.01 | 0.22 | 0.064 | 0.045
Mo | Molybdenum content | 0 | 0.24 | 0.066 | 0.089
RR | Reduction ratio | 420 | 5530 | 971.2 | 601.4
dA | Plastic inclusion | 0 | 0.13 | 0.047 | 0.032
dB | Discontinuous inclusions | 0 | 0.05 | 0.003 | 0.009
dC | Isolated inclusion | 0 | 0.04 | 0.008 | 0.01
Table 3. Predictive results of target properties with random forest and DPRF
Model and privacy budget | R2 (Fatigue) | R2 (Tensile) | R2 (Fracture) | R2 (Hardness)
RF | 0.9059 | 0.9282 | 0.9252 | 0.9193
DPRF (ε=0.1) | 0.6588 | 0.6469 | 0.7588 | 0.6565
DPRF (ε=0.25) | 0.6930 | 0.6906 | 0.7721 | 0.7008
DPRF (ε=0.5) | 0.7704 | 0.7605 | 0.7918 | 0.7593
DPRF (ε=1.0) | 0.8035 | 0.8105 | 0.8219 | 0.8094
DPRF (ε=3.0) | 0.8249 | 0.8270 | 0.8461 | 0.8399
DPRF (ε=10.0) | 0.8527 | 0.8462 | 0.8852 | 0.8641
Table 4. Predictive results of target properties under different privacy budgets
ε | R2 (Fatigue) | R2 (Tensile) | R2 (Fracture) | R2 (Hardness)
0.3 | 0.6153 | 0.6030 | 0.6979 | 0.6139
0.75 | 0.6563 | 0.6748 | 0.7585 | 0.6502
1.5 | 0.7038 | 0.7448 | 0.8082 | 0.7308
2.25 | 0.7615 | 0.7773 | 0.8377 | 0.7618
3.0 | 0.7981 | 0.8025 | 0.8491 | 0.8017
9.0 | 0.8130 | 0.8380 | 0.8677 | 0.8429
Table 5. Predictive results of each target property under different tree depths
d | R2 (Fatigue) | R2 (Tensile) | R2 (Fracture) | R2 (Hardness)
3 | 0.6027 | 0.6113 | 0.6796 | 0.6387
4 | 0.7088 | 0.7061 | 0.7951 | 0.7183
5 | 0.7961 | 0.8025 | 0.8491 | 0.8017
6 | 0.7560 | 0.7605 | 0.8568 | 0.7659
7 | 0.6920 | 0.7427 | 0.8251 | 0.7303
References
[1] Zhou S G, Li F, Tao Y F, et al. Privacy preservation in database applications: A survey. Chin J Comput, 2009, 32(5): 847 doi: 10.3724/SP.J.1016.2009.00847
[2] Sweeney L. k-anonymity: A model for protecting privacy. Int J Uncertain Fuzziness Knowl Based Syst, 2002, 10(5): 557 doi: 10.1142/S0218488502001648
[3] Du W L, Atallah M J. Secure multi-party computation problems and their applications: A review and open problems // Proceedings of the 2001 Workshop on New Security Paradigms. Cloudcroft, 2001: 13
[4] Konečný J, McMahan H B, Yu F X, et al. Federated learning: Strategies for improving communication efficiency [J/OL]. ArXiv Preprint (2017-10-30) [2022-5-29]. https://arxiv.org/abs/1610.05492
[5] Dwork C. Differential privacy // Proceedings of the 33rd International Conference on Automata, Languages and Programming. New York, 2006: 1
[6] Xiong J, Zhang T Y, Shi S Q. Machine learning of mechanical properties of steels. Sci China Technol Sci, 2020, 63(7): 1247 doi: 10.1007/s11431-020-1599-5
[7] Dai M Y, Hu J M. Field-free spin-orbit torque perpendicular magnetization switching in ultrathin nanostructures. Npj Comput Mater, 2020, 6: 78 doi: 10.1038/s41524-020-0347-0
[8] Huber L, Hadian R, Grabowski B, et al. A machine learning approach to model solute grain boundary segregation. Npj Comput Mater, 2018, 4: 64 doi: 10.1038/s41524-018-0122-7
[9] Choudhary K, Garrity K F, Sharma V, et al. High-throughput density functional perturbation theory and machine learning predictions of infrared, piezoelectric, and dielectric responses. Npj Comput Mater, 2020, 6: 64 doi: 10.1038/s41524-020-0337-2
[10] Bartel C J, Trewartha A, Wang Q, et al. A critical examination of compound stability predictions from machine-learned formation energies. Npj Comput Mater, 2020, 6: 97 doi: 10.1038/s41524-020-00362-y
[11] Tang S L, Meng Y, Wang G Q, et al. Extraction of metamorphic minerals by multiscale segmentation combined with random forest. Chin J Eng, 2022, 44(2): 170 doi: 10.3321/j.issn.1001-053X.2022.2.bjkjdxxb202202002
[12] Chen L, Fu D M. Processing and modeling dual-rate sampled data in seawater corrosion monitoring of low alloy steels. Chin J Eng, 2022, 44(1): 95 doi: 10.3321/j.issn.1001-053X.2022.1.bjkjdxxb202201009
[13] Sigmund G, Gharasoo M, Hüffer T, et al. Deep learning neural network approach for predicting the sorption of ionizable and polar organic pollutants to a wide range of carbonaceous materials. Environ Sci Technol, 2020, 54(7): 4583 doi: 10.1021/acs.est.9b06287
[14] Le T D, Noumeir R, Quach H L, et al. Critical temperature prediction for a superconductor: A variational Bayesian neural network approach. IEEE Trans Appl Supercond, 2020, 30(4): 1
[15] Wei M, Wang Q, Ye M, et al. An indirect remaining useful life prediction of lithium-ion batteries based on a NARX dynamic neural network. Chin J Eng, 2022, 44(3): 380 doi: 10.3321/j.issn.1001-053X.2022.3.bjkjdxxb202203007
[16] De Cock M, Dowsley R, Horst C, et al. Efficient and private scoring of decision trees, support vector machines and logistic regression models based on pre-computation. IEEE Trans Dependable Secure Comput, 2019, 16(2): 217 doi: 10.1109/TDSC.2017.2679189
[17] Wu Y C, Cai S F, Xiao X K, et al. Privacy preserving vertical federated learning for tree-based models [J/OL]. ArXiv Preprint (2020-08-14) [2020-05-29]. https://arxiv.org/abs/2008.06170
[18] Liu Y, Liu Y T, Liu Z J, et al. Federated forest. IEEE Trans Big Data, 2022, 8(3): 843 doi: 10.1109/TBDATA.2020.2992755
[19] Cheng K W, Fan T, Jin Y L, et al. SecureBoost: A lossless federated learning framework. IEEE Intell Syst, 2021, 36(6): 87 doi: 10.1109/MIS.2021.3082561
[20] Blum A, Dwork C, McSherry F, et al. Practical privacy: The SuLQ framework // Proceedings of the Twenty-Fourth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. Baltimore, 2005: 128
[21] Friedman A, Schuster A. Data mining with differential privacy // Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Washington, 2010: 493
[22] Patil A, Singh S. Differential private random forest // 2014 International Conference on Advances in Computing, Communications and Informatics (ICACCI). Delhi, 2014: 2623
[23] Mu H R, Ding L P, Song Y N, et al. DiffPRFs: Random forest under differential privacy. J Commun, 2016, 37(9): 175 doi: 10.11959/j.issn.1000-436x.2016169
[24] Breiman L. Random forests. Mach Learn, 2001, 45(1): 5 doi: 10.1023/A:1010933404324
[25] Dwork C, McSherry F, Nissim K, et al. Calibrating noise to sensitivity in private data analysis. J Priv Confidentiality, 2017, 7(3): 17 doi: 10.29012/jpc.v7i3.405
[26] McSherry F, Talwar K. Mechanism design via differential privacy // 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS'07). Providence, 2007: 94
[27] Kairouz P, Oh S, Viswanath P. The composition theorem for differential privacy. IEEE Trans Inf Theory, 2017, 63(6): 4037 doi: 10.1109/TIT.2017.2685505
[28] Agrawal A, Choudhary A. An online tool for predicting fatigue strength of steel alloys based on ensemble data mining. Int J Fatigue, 2018, 113: 389 doi: 10.1016/j.ijfatigue.2018.04.017