Differential privacy protection random forest algorithm and its application in steel materials

CHEN Xue-hui, FENG Yan, QIAN Quan

Citation: CHEN Xue-hui, FENG Yan, QIAN Quan. Differential privacy protection random forest algorithm and its application in steel materials[J]. Chinese Journal of Engineering. doi: 10.13374/j.issn2095-9389.2022.05.29.002

doi: 10.13374/j.issn2095-9389.2022.05.29.002
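Funding: National Key Research and Development Program of China (2018YFB0704400); Major Science and Technology Special Project of Yunnan Province (202002AB080001-2, 202102AB080019-3); Key Research Project of Zhejiang Lab (2021PE0AC02); Major Project of the Special Development Fund of the Shanghai Zhangjiang National Independent Innovation Demonstration Zone (ZJ2021-ZD-006)
Details
    Corresponding author e-mail: qqian@shu.edu.cn

  • CLC number: TG391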

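  • Abstract: Data-driven materials informatics is regarded as the fourth paradigm of materials research and development, and it can greatly reduce the cost and shorten the cycle of developing new materials. However, when materials data are shared and reused, data-driven methods increase the risk of leaking sensitive information such as key processing routes. Privacy-preserving machine learning is therefore a key problem in materials informatics. This paper proposes a random forest algorithm with differential privacy protection, targeting the random forest model that is widely used in materials informatics. The algorithm distributes the overall privacy budget across the trees and introduces the Laplace mechanism and the exponential mechanism of differential privacy while each decision tree is built: at each split, the exponential mechanism randomly selects the splitting feature, and the Laplace mechanism adds noise to the node counts. The algorithm is validated on steel fatigue property prediction with the data held in centralized storage and in distributed storage. The experiments show that, with differential privacy protection added, the coefficient of determination (R2) of every target property exceeds 0.8 (for privacy budgets ε ≥ 1.0 in Table 3), which differs little from the result of the ordinary random forest. With distributed storage, the predicted R2 of every target property increases as the privacy budget increases. As the maximum tree depth increases, the overall prediction accuracy first rises and then falls, and it is best at a maximum tree depth of 5. Overall, the algorithm keeps a high prediction accuracy while providing differential privacy protection for the random forest. In a distributed network where the data are stored across separate nodes, the strength of privacy protection can be balanced against prediction accuracy through algorithm parameters such as the privacy budget, which gives the algorithm broad application prospects.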

     

  • Figure 1. Framework of the DPRF algorithm

    Figure 2. Scatter diagrams of the true and predicted values of each target of the DPRF algorithm for ε=10.0, d=5: (a) fatigue; (b) tensile; (c) fracture; (d) hardness

    Figure 3. Predictive results of each target property of the DPRF algorithm under different privacy budgets (a) and different maximum tree depths (b)

    Table 1. Comparative analysis of differential-privacy-preserving tree model algorithms

    | Algorithm | Basic model | Realization mechanism | Task | Data storage |
    |---|---|---|---|---|
    | SuLQ-based ID3 | Decision tree | Laplace | Classification | Centralized |
    | DiffP-ID3 | Decision tree | Laplace & Exponential | Classification | Centralized |
    | DiffP-C4.5 | Decision tree | Laplace & Exponential | Classification | Centralized |
    | DiffPRF | Random forest | Laplace & Exponential | Classification | Centralized |
    | DiffPRFs | Random forest | Laplace & Exponential | Classification | Centralized |
    | DPRF | Random forest | Laplace & Exponential | Regression | Centralized & distributed |
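    Algorithm 1. DPRF algorithm based on differential privacy protection
    Input: training dataset D, feature set F, privacy budget B, number of decision trees T, maximum tree depth d, number of random features per split m, number of nodes N when the data are distributed
    Output: a random forest satisfying ε-differential privacy
    Stopping condition: the number of decision trees built reaches T, or the privacy budget is exhausted
    Procedure DPRF_fit(D, F, B, T, d, m)
    1: Forest = {};
    2: Divide the overall privacy budget equally among the trees; each decision tree receives $ \varepsilon ' = B/T $;
    3: for i = 1 to T // build T trees
    4:  Sample D with replacement to obtain the subset D_t, and randomly select m features from F;
    5:  Divide the tree's privacy budget equally among its levels; each level receives $\varepsilon '' = \dfrac{\varepsilon '}{d + 1}$;
    6:  ε = ε''/2;
    7:  Tree_i = BuildTree(D_t, m, ε, d, 0); // steps 8 to 14 describe BuildTree
    8:   if the current node meets the stopping condition: make it a leaf whose value is the mean target value of its samples, perturb the sample count as |N_{D_t}| = |N_{D_t}| + Laplace(1/ε), and return the leaf;
    9:   else
    10:    for each feature f among the m features
    11:     Split the data into left and right subsets at each value of f, and record as split_value the value that minimizes the mean absolute error (MAE);
    12:     Split the data on f at split_value and compute the feature score $\exp\left(\dfrac{\varepsilon}{2\Delta q} q(D_c, f)\right)$;
    13:    Sum the scores of the m features; feature f is selected as the splitting feature of the current node with probability $\dfrac{\exp\left(\dfrac{\varepsilon}{2\Delta q} q(D_c, f)\right)}{\sum_{j=1}^{m} \exp\left(\dfrac{\varepsilon}{2\Delta q} q(D_c, f_j)\right)}$, where $q(D_c, f)$ is the utility function and $\Delta q$ is its sensitivity;
    14:    Split the data into left and right subsets at the split_value of the selected feature f, and continue building the tree on each subset;
    15:  Forest = Forest ∪ {Tree_i};
    16: end for
    17: return Forest
    Procedure predict(Forest, D_test)
    1: Result = {};
    2: for x in D_test
    3:  sum_predict = 0;
    4:  for tree in Forest
    5:   Traverse tree with x until a leaf is reached, and read its prediction predict_value;
    6:   sum_predict += predict_value;
    7:  res = sum_predict / length(Forest);
    8:  Result = Result ∪ {res};
    9: return Result
    Procedure Distributed_fit(F, B, T, d, m, N)
    1: Forest_Distributed = {};
    2: Divide the overall privacy budget equally among the nodes; each node receives E = B/N;
    3: for i = 1 to N
    4:  Let D_i be the dataset held by node i;
    5:  forest_i = DPRF_fit(D_i, F, E, T, d, m);
    6:  Forest_Distributed = Forest_Distributed ∪ {forest_i};
    7: return Forest_Distributed
    Procedure Distributed_Predict(D, Forest_Distributed)
    1: Result = 0;
    2: for i = 1 to N
    3:  r = predict(Forest_Distributed_i, D);
    4:  Result += r;
    5: Result = Result / N;
    6: return Result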
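A minimal Python sketch of the privacy primitives used in steps 2, 5, 6, 8, 12 and 13 of DPRF_fit may help make the budget arithmetic concrete. It is not the authors' implementation: the function names, the use of negative MAE as the utility q, the sensitivity value of 1.0 and the choice T=10 are assumptions made for illustration, and only the Python standard library is used.

```python
import math
import random


def split_budget(total_budget, n_trees, max_depth):
    """Steps 2, 5 and 6: B -> eps' = B/T -> eps'' = eps'/(d+1) -> eps = eps''/2."""
    eps_tree = total_budget / n_trees
    eps_level = eps_tree / (max_depth + 1)
    return eps_level / 2


def exponential_mechanism(scores, eps, sensitivity, rng):
    """Steps 12-13: pick index j with probability proportional to
    exp(eps * q_j / (2 * sensitivity))."""
    top = max(scores)
    # Subtracting the maximum score leaves the probabilities unchanged
    # and keeps exp() from overflowing.
    weights = [math.exp(eps * (q - top) / (2 * sensitivity)) for q in scores]
    return rng.choices(range(len(scores)), weights=weights, k=1)[0]


def laplace_noise(scale, rng):
    """Laplace(0, scale) noise, drawn as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(count, eps, rng):
    """Step 8: perturb a leaf's sample count with Laplace(1/eps) noise."""
    return count + laplace_noise(1.0 / eps, rng)


rng = random.Random(0)
eps = split_budget(total_budget=10.0, n_trees=10, max_depth=5)

# Hypothetical utilities for three candidate features: negative MAE,
# so a lower error gives a higher selection probability (assumption).
scores = [-12.0, -3.5, -7.1]
chosen = exponential_mechanism(scores, eps, sensitivity=1.0, rng=rng)
leaf_size = noisy_count(40, eps, rng)

print(f"per-mechanism budget: {eps:.4f}")
print(f"selected feature index: {chosen}, noisy leaf count: {leaf_size:.2f}")
```

With B=10.0, T=10 and d=5, each mechanism call receives ε = 10/10/6/2 ≈ 0.083, which illustrates how quickly the per-node budget shrinks as trees and depth are added.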

    Table 2. Descriptor information of the NIMS steel fatigue dataset

    | Feature | Description | Minimum value | Maximum value | Mean value | Standard deviation |
    |---|---|---|---|---|---|
    | NT | Normalizing temperature | 825 | 900 | 865.6 | 17.37 |
    | QT | Hardening temperature | 825 | 865 | 846.2 | 9.86 |
    | TT | Tempering temperature | 550 | 680 | 605 | 42.4 |
    | C | Carbon content | 0.28 | 0.57 | 0.407 | 0.061 |
    | Si | Silicon content | 0.16 | 0.35 | 0.258 | 0.034 |
    | Mn | Manganese content | 0.37 | 1.3 | 0.849 | 0.294 |
    | P | Phosphorus content | 0.007 | 0.031 | 0.016 | 0.005 |
    | S | Sulfur content | 0.003 | 0.03 | 0.014 | 0.006 |
    | Ni | Nickel content | 0.01 | 2.78 | 0.548 | 0.899 |
    | Cr | Chromium content | 0.01 | 1.12 | 0.556 | 0.419 |
    | Cu | Copper content | 0.01 | 0.22 | 0.064 | 0.045 |
    | Mo | Molybdenum content | 0 | 0.24 | 0.066 | 0.089 |
    | RR | Reduction ratio | 420 | 5530 | 971.2 | 601.4 |
    | dA | Plastic inclusion | 0 | 0.13 | 0.047 | 0.032 |
    | dB | Discontinuous inclusions | 0 | 0.05 | 0.003 | 0.009 |
    | dC | Isolated inclusion | 0 | 0.04 | 0.008 | 0.01 |

    Table 3. Predictive results of target properties with random forest (RF) and DPRF

    | Model and privacy budget | Fatigue R2 | Tensile R2 | Fracture R2 | Hardness R2 |
    |---|---|---|---|---|
    | RF | 0.9059 | 0.9282 | 0.9252 | 0.9193 |
    | ε=0.1 DPRF | 0.6588 | 0.6469 | 0.7588 | 0.6565 |
    | ε=0.25 DPRF | 0.6930 | 0.6906 | 0.7721 | 0.7008 |
    | ε=0.5 DPRF | 0.7704 | 0.7605 | 0.7918 | 0.7593 |
    | ε=1.0 DPRF | 0.8035 | 0.8105 | 0.8219 | 0.8094 |
    | ε=3.0 DPRF | 0.8249 | 0.8270 | 0.8461 | 0.8399 |
    | ε=10.0 DPRF | 0.8527 | 0.8462 | 0.8852 | 0.8641 |
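The gap between the two models in Table 3 can be checked directly from the tabulated values. The short Python sketch below only transcribes the R2 values from the table and does arithmetic on them; the helper names are hypothetical and nothing here reruns the experiments.

```python
# R2 values transcribed from Table 3; columns are fatigue, tensile, fracture, hardness.
TARGETS = ("fatigue", "tensile", "fracture", "hardness")
RF_R2 = (0.9059, 0.9282, 0.9252, 0.9193)
DPRF_R2 = {
    0.1: (0.6588, 0.6469, 0.7588, 0.6565),
    0.25: (0.6930, 0.6906, 0.7721, 0.7008),
    0.5: (0.7704, 0.7605, 0.7918, 0.7593),
    1.0: (0.8035, 0.8105, 0.8219, 0.8094),
    3.0: (0.8249, 0.8270, 0.8461, 0.8399),
    10.0: (0.8527, 0.8462, 0.8852, 0.8641),
}


def r2_gap(eps):
    """R2 lost relative to the non-private random forest, per target."""
    return {t: round(rf - dp, 4) for t, rf, dp in zip(TARGETS, RF_R2, DPRF_R2[eps])}


def smallest_budget_above(threshold):
    """Smallest tabulated eps at which every target reaches the threshold, or None."""
    for eps in sorted(DPRF_R2):
        if min(DPRF_R2[eps]) >= threshold:
            return eps
    return None


print(smallest_budget_above(0.8))  # 1.0
print(r2_gap(10.0))
```

On these numbers, every target reaches an R2 of 0.8 from ε=1.0 upward, and at ε=10.0 the loss relative to the ordinary random forest is roughly 0.04 to 0.08 depending on the target.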

    Table 4. Predictive results of target properties under different privacy budgets

    | ε | Fatigue R2 | Tensile R2 | Fracture R2 | Hardness R2 |
    |---|---|---|---|---|
    | 0.3 | 0.6153 | 0.6030 | 0.6979 | 0.6139 |
    | 0.75 | 0.6563 | 0.6748 | 0.7585 | 0.6502 |
    | 1.5 | 0.7038 | 0.7448 | 0.8082 | 0.7308 |
    | 2.25 | 0.7615 | 0.7773 | 0.8377 | 0.7618 |
    | 3.0 | 0.7981 | 0.8025 | 0.8491 | 0.8017 |
    | 9.0 | 0.8130 | 0.8380 | 0.8677 | 0.8429 |

    Table 5. Predictive results of each target property under different maximum tree depths

    | d | Fatigue R2 | Tensile R2 | Fracture R2 | Hardness R2 |
    |---|---|---|---|---|
    | 3 | 0.6027 | 0.6113 | 0.6796 | 0.6387 |
    | 4 | 0.7088 | 0.7061 | 0.7951 | 0.7183 |
    | 5 | 0.7961 | 0.8025 | 0.8491 | 0.8017 |
    | 6 | 0.7560 | 0.7605 | 0.8568 | 0.7659 |
    | 7 | 0.6920 | 0.7427 | 0.8251 | 0.7303 |
Publication history
  • Received: 2022-05-29
  • Published online: 2022-07-27
