SecretFlow End-to-End Capabilities for Financial Risk Control#

This tutorial is only available in Chinese.

Last updated: Oct 7, 2023

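Please use SecretFlow v0.8.3 or later for this experiment.

The code below is for demonstration only. Do not use it directly in production.

This experiment shows how to use SecretFlow to develop Logistic Regression and XGB models, both of which are widely used in risk control.

SecretFlow will open up model deployment and online/offline model prediction features later. Stay tuned.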

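Goals#

In this experiment, we use an open-source dataset to train a logistic regression model and an XGB model, both common in financial risk control. The process consists of the following steps:

  • Sample alignment

  • Feature preprocessing

  • Data analysis

  • Model training

  • Model prediction

  • Model evaluation

Please run all steps in order so that the experiment completes successfully.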

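Preparation#

Initialize the SecretFlow framework#

This experiment involves two nodes: alice and bob. In a real business scenario they represent two different entities. Their raw data may not be transferred directly to each other, but it is used jointly to develop a model.

In the code below, we set up a SecretFlow cluster on the two nodes alice and bob, and create three devices:

  • alice: PYU device, responsible for local computation on alice's side. The inputs, the computation, and the results are visible to alice only.

  • bob: PYU device, responsible for local computation on bob's side. The inputs, the computation, and the results are visible to bob only.

  • spu: SPU device, responsible for secure computation between alice and bob. The inputs and results are secret-shared, with alice and bob each holding one share. The computation itself is MPC, executed jointly by the SPU Runtimes of alice and bob.

If some of these concepts, such as the SPU device, are unfamiliar, please refer to this `document <../developer/design/architecture.md>`__.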
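To make "each party holds one share" more concrete, here is a minimal sketch of two-party additive secret sharing in plain Python. It is a toy illustration only: the ring size `MODULUS` and the helper names `share`, `reconstruct`, and `add_shared` are assumptions made for this example, and it does not reproduce the protocol that the SPU actually runs.

```python
import secrets
from typing import Tuple

MODULUS = 2**64  # ring size chosen for this toy example only


def share(value: int) -> Tuple[int, int]:
    """Split value into two additive shares modulo MODULUS."""
    alice_share = secrets.randbelow(MODULUS)      # uniformly random, reveals nothing alone
    bob_share = (value - alice_share) % MODULUS   # chosen so the two shares sum to value
    return alice_share, bob_share


def reconstruct(alice_share: int, bob_share: int) -> int:
    """Recover the value; this requires both shares."""
    return (alice_share + bob_share) % MODULUS


def add_shared(x: Tuple[int, int], y: Tuple[int, int]) -> Tuple[int, int]:
    """Add two shared values; each party only adds the shares it already holds."""
    return ((x[0] + y[0]) % MODULUS, (x[1] + y[1]) % MODULUS)


x = share(2143)
y = share(29)
z = add_shared(x, y)
print(reconstruct(*z))
```

Neither share on its own carries information about the underlying value, which is why the result of an SPU computation stays hidden until both parties agree to reveal it.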

[1]:
import secretflow as sf

# Check the version of your SecretFlow
print('The version of SecretFlow: {}'.format(sf.__version__))

sf.shutdown()
sf.init(['alice', 'bob'], address='local')
alice, bob = sf.PYU('alice'), sf.PYU('bob')
spu = sf.SPU(sf.utils.testing.cluster_def(['alice', 'bob']))
The version of SecretFlow: 1.2.0.dev20231007
2023-10-07 18:03:47,500 INFO worker.py:1538 -- Started a local Ray instance.

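In the log above, you should see that while the spu is being created, an SPURuntime is started on each of alice's and bob's sides and the two connect to each other.

Dataset#

The raw data for this experiment is the Bank Marketing Data Set from UCI. It collects the results of telephone marketing campaigns run by a Portuguese banking institution.

We add a uid column for the private set intersection experiment that follows.

Let us first look at what the dataset contains.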

[2]:
import pandas as pd

# secretflow.utils.simulation.datasets contains mirrors of some popular open datasets.
from secretflow.utils.simulation.datasets import dataset

df = pd.read_csv(dataset('bank_marketing_full'), sep=';')
df['uid'] = df.index + 1

df
[2]:
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y uid
0 58 management married tertiary no 2143 yes no unknown 5 may 261 1 -1 0 unknown no 1
1 44 technician single secondary no 29 yes no unknown 5 may 151 1 -1 0 unknown no 2
2 33 entrepreneur married secondary no 2 yes yes unknown 5 may 76 1 -1 0 unknown no 3
3 47 blue-collar married unknown no 1506 yes no unknown 5 may 92 1 -1 0 unknown no 4
4 33 unknown single unknown no 1 no no unknown 5 may 198 1 -1 0 unknown no 5
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
45206 51 technician married tertiary no 825 no no cellular 17 nov 977 3 -1 0 unknown yes 45207
45207 71 retired divorced primary no 1729 no no cellular 17 nov 456 2 -1 0 unknown yes 45208
45208 72 retired married secondary no 5715 no no cellular 17 nov 1127 5 184 3 success yes 45209
45209 57 blue-collar married secondary no 668 no no telephone 17 nov 508 4 -1 0 unknown no 45210
45210 37 entrepreneur married secondary no 2971 no no cellular 17 nov 361 2 188 11 other no 45211

45211 rows × 18 columns

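The dataset contains 45211 samples, each representing one target customer.

Each sample has 16 features. The table below briefly describes them, together with the uid column we added.

feature | description | values
--- | --- | ---
uid | customer ID | numeric
age | age | numeric
job | type of job | ‘admin.’,‘blue-collar’,‘entrepreneur’,‘housemaid’,‘management’,‘retired’,‘self-employed’,‘services’,‘student’,‘technician’,‘unemployed’,‘unknown’
marital | marital status | ‘divorced’,‘married’,‘single’,‘unknown’
education | education level | ‘tertiary’, ‘secondary’, ‘unknown’, ‘primary’
default | has a bad credit record | ‘no’,‘yes’,‘unknown’
balance | average yearly account balance | numeric
housing | has a housing loan | ‘no’,‘yes’,‘unknown’
loan | has a personal loan | ‘no’,‘yes’,‘unknown’
contact | contact type | ‘cellular’,‘telephone’,‘unknown’
month | month of last contact | ‘jan’, ‘feb’, ‘mar’, …, ‘nov’, ‘dec’
day | day of month of last contact | numeric
duration | duration of last contact | numeric
campaign | number of contacts made during this campaign | numeric
pdays | days elapsed since the last contact | numeric
previous | number of contacts made before this campaign | numeric
poutcome | outcome of the previous campaign | ‘unknown’, ‘failure’, ‘other’, ‘success’

The label y of each sample is the marketing outcome for the target customer (whether the customer subscribed to a term deposit). Its value is ‘yes’ or ‘no’.

We assume the 16 features above are held by two institutions as follows.

  • alice: age, job, marital, education, default, balance, housing, loan

  • bob: contact, day, month, duration, campaign, pdays, previous, poutcome, y

In real business scenarios, the data held by alice and bob may not be aligned. To simulate this, we shuffle the dataset and randomly take 90% of the rows for each party.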

[3]:
import numpy as np

df_alice = df.iloc[:, np.r_[0:8, -1]].sample(frac=0.9)

df_alice
[3]:
age job marital education default balance housing loan uid
29713 55 management single tertiary no -350 no no 29714
37441 49 services married secondary no 2409 yes no 37442
16828 37 entrepreneur married tertiary no 1514 no yes 16829
20613 55 blue-collar married secondary no 388 no no 20614
25582 54 admin. married secondary no 3136 yes no 25583
... ... ... ... ... ... ... ... ... ...
42934 34 technician married secondary no 3000 yes yes 42935
4141 35 technician single secondary no -206 yes no 4142
5478 31 admin. single tertiary no 701 yes no 5479
6257 33 blue-collar married secondary no -241 yes yes 6258
2776 40 blue-collar married secondary no -31 yes no 2777

40690 rows × 9 columns

[4]:
df_bob = df.iloc[:, 8:].sample(frac=0.9)

df_bob
[4]:
contact day month duration campaign pdays previous poutcome y uid
35491 cellular 7 may 600 2 330 17 failure no 35492
371 unknown 6 may 195 1 -1 0 unknown no 372
33382 cellular 20 apr 198 1 -1 0 unknown no 33383
26770 cellular 20 nov 207 1 -1 0 unknown no 26771
33620 cellular 20 apr 20 4 -1 0 unknown no 33621
... ... ... ... ... ... ... ... ... ... ...
28643 cellular 29 jan 94 2 -1 0 unknown no 28644
16449 cellular 23 jul 122 3 -1 0 unknown no 16450
29287 cellular 2 feb 175 2 -1 0 unknown no 29288
33126 telephone 20 apr 73 3 325 3 failure no 33127
34736 cellular 6 may 87 1 -1 0 unknown no 34737

40690 rows × 10 columns

Here we save df_alice and df_bob to files, which serve as the raw inputs of alice and bob.

This completes all preparation for the experiment.

[5]:
import tempfile

_, alice_path = tempfile.mkstemp()
_, bob_path = tempfile.mkstemp()
df_alice.reset_index(drop=True).to_csv(alice_path, index=False)
df_bob.reset_index(drop=True).to_csv(bob_path, index=False)

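Sample alignment (private set intersection)#

The first step is clearly to align the data of the two sides. Private Set Intersection (PSI) is a cryptographic technique that computes the intersection of two sets without revealing any other information. In SecretFlow, the SPU device supports three PSI protocols:

  • ECDH: semi-honest model, based on public-key cryptography. It was originally suited to small datasets, but after optimization in SecretFlow it can handle data on the order of one billion records.

  • KKRT: semi-honest model, based on Cuckoo Hashing and efficient oblivious transfer extension (OT Extension). Suited to large datasets (for example, tens of millions of records).

  • BC22PCG: semi-honest model, based on a pseudorandom correlation generator. Suited to large datasets.

Because our dataset is small, we use ECDH here.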
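The idea behind Diffie-Hellman style PSI can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not SecretFlow's implementation: it uses modular exponentiation over a prime field instead of an elliptic curve, runs both parties in one process, and the names `hash_to_group`, `random_key`, `mask`, and `toy_dh_psi` are hypothetical. The point it illustrates is that masking is commutative, so an item masked by both keys looks the same no matter which party masked it first.

```python
import hashlib
import math
import secrets

# Toy group: integers modulo the Mersenne prime 2**127 - 1.
# Real ECDH-PSI works over an elliptic curve; this modulus is for illustration only.
P = 2**127 - 1


def hash_to_group(item: str) -> int:
    """Map an item to a group element in [1, P - 1]."""
    digest = hashlib.sha256(item.encode()).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1


def random_key() -> int:
    """Pick a secret exponent coprime to P - 1 so that masking is a bijection."""
    while True:
        key = secrets.randbelow(P - 3) + 2
        if math.gcd(key, P - 1) == 1:
            return key


def mask(values, key):
    """Raise every value to the secret key modulo P."""
    return [pow(v, key, P) for v in values]


def toy_dh_psi(alice_items, bob_items):
    a_key, b_key = random_key(), random_key()
    # Each party masks its own hashed items with its own key ...
    alice_once = mask([hash_to_group(x) for x in alice_items], a_key)
    bob_once = mask([hash_to_group(x) for x in bob_items], b_key)
    # ... then masks the values received from the peer with the same key.
    alice_twice = mask(alice_once, b_key)      # done by bob
    bob_twice = set(mask(bob_once, a_key))     # done by alice
    # (h^a)^b == (h^b)^a, so equal items collide after double masking.
    return [x for x, m in zip(alice_items, alice_twice) if m in bob_twice]


print(toy_dh_psi(["1", "2", "3", "5"], ["2", "3", "4"]))
```

In a real deployment only the masked values cross the network, so neither party sees the other's items outside the intersection.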

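Option 1: save the PSI result to files#

In some scenarios, alice and bob want to save the PSI result directly to files and continue with later operations afterwards. In that case, call the psi_csv interface.

In the code below, we specify the key to intersect on, along with the input and output file paths of both parties. For ECDH the two parties play symmetric roles, so receiver has no practical meaning and can be set to either party. The protocol must be set correctly. With sort set to True, the joined result is sorted.

Please read the psi_csv documentation.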
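As a mental model of what ends up in the output files, the following plain Python sketch computes a non-private equivalent: each party keeps only the rows whose key appears on both sides, sorted by key. The helper names `keys_of` and `plaintext_join_filter` are hypothetical, the sample rows are made up, and sorting keys as strings is an assumption of this sketch. Unlike psi_csv, it reads both key sets in the clear.

```python
import csv
import io


def keys_of(csv_text: str, key: str) -> set:
    """Collect the values of the key column."""
    return {row[key] for row in csv.DictReader(io.StringIO(csv_text))}


def plaintext_join_filter(csv_text: str, key: str, common_keys: set) -> str:
    """Keep only rows whose key is in common_keys, sorted by key (as strings)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [row for row in reader if row[key] in common_keys]
    rows.sort(key=lambda row: row[key])
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


alice_csv = "age,uid\n58,1\n44,2\n33,3\n"
bob_csv = "y,uid\nno,3\nyes,2\nno,4\n"

common = keys_of(alice_csv, "uid") & keys_of(bob_csv, "uid")
alice_out = plaintext_join_filter(alice_csv, "uid", common)
bob_out = plaintext_join_filter(bob_csv, "uid", common)
print(alice_out)
print(bob_out)
```

Because both outputs are sorted by the same key, row i of alice's file and row i of bob's file describe the same customer, which is what later vertical training relies on.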

[6]:
_, alice_psi_path = tempfile.mkstemp()
_, bob_psi_path = tempfile.mkstemp()

spu.psi_csv(
    key="uid",
    input_path={alice: alice_path, bob: bob_path},
    output_path={alice: alice_psi_path, bob: bob_psi_path},
    receiver="alice",
    protocol="ECDH_PSI_2PC",
    sort=True,
)
[6]:
[{'party': 'alice', 'original_count': 40690, 'intersection_count': 36634},
 {'party': 'bob', 'original_count': 40690, 'intersection_count': 36634}]
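The reported counts can be sanity-checked with a little arithmetic. Each party independently keeps 90% of the 45211 rows, so a given row survives on both sides with probability of about 0.81. The sketch below computes the expected intersection size and its standard deviation under a hypergeometric model; the variable names are chosen for this example only.

```python
import math

total_rows = 45211
per_party = round(total_rows * 0.9)   # rows kept by each party

# Expected overlap of two independent random subsets of size per_party.
p = per_party / total_rows
expected = per_party * p

# Hypergeometric standard deviation of the overlap size.
std = math.sqrt(
    per_party * p * (1 - p) * (total_rows - per_party) / (total_rows - 1)
)

observed = 36634
print(per_party, round(expected), round(std, 1), observed - round(expected))
```

The expected size is about 36621, so the observed 36634 is well within one standard deviation of what random sampling predicts.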

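Option 2: save the PSI result to a VDataFrame#

VDataFrame is SecretFlow's data structure for vertically partitioned data. We will use it throughout the following tasks.

Because this experiment continues with further operations after PSI, we use data.vertical.read_csv here to run PSI on the raw data and turn the result directly into a VDataFrame.

Please read the data.vertical.read_csv documentation. Many of its parameters are the same as those of psi_csv and are not repeated here.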
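The notion of a vertically partitioned table can be pictured with a small plain Python stand-in. `ToyVerticalFrame` below is a hypothetical class written for this explanation and is not the VDataFrame API: every party owns a different set of columns, all partitions have the same rows in the same order, and the logical table is the concatenation of the columns.

```python
from typing import Dict, List


class ToyVerticalFrame:
    """Hypothetical stand-in for a vertically partitioned table."""

    def __init__(self, partitions: Dict[str, Dict[str, list]]):
        # All columns of all parties must describe the same aligned rows.
        lengths = {len(col) for part in partitions.values() for col in part.values()}
        if len(lengths) > 1:
            raise ValueError("all partitions must have the same number of rows")
        self.partitions = partitions

    @property
    def columns(self) -> List[str]:
        """Column names of the logical table, party by party."""
        return [name for part in self.partitions.values() for name in part]

    def owner_of(self, column: str) -> str:
        """Return the party that physically holds a column."""
        for party, part in self.partitions.items():
            if column in part:
                return party
        raise KeyError(column)


frame = ToyVerticalFrame(
    {
        "alice": {"age": [58, 44], "job": ["management", "technician"]},
        "bob": {"contact": ["unknown", "unknown"], "y": ["no", "no"]},
    }
)
print(frame.columns, frame.owner_of("y"))
```

In the real VDataFrame each partition lives on its owner's PYU device, so an operation on a column runs on the party that holds it.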

[7]:
from secretflow.data.vertical import read_csv as v_read_csv

vdf = v_read_csv(
    {alice: alice_path, bob: bob_path},
    spu=spu,
    keys="uid",
    drop_keys="uid",
    psi_protocl="ECDH_PSI_2PC",
)
vdf.columns
[7]:
['age',
 'job',
 'marital',
 'education',
 'default',
 'balance',
 'housing',
 'loan',
 'contact',
 'day',
 'month',
 'duration',
 'campaign',
 'pdays',
 'previous',
 'poutcome',
 'y']

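More#

What we show here is two-party, single-key PSI. SecretFlow also supports three-party and multi-key PSI. To learn more, you can:

  • Read this document on the PSI capabilities of the SecretFlow SPU.

  • Read this tutorial for usage examples.

Feature preprocessing#

In general, the data used for modeling needs to be preprocessed, and sound preprocessing is critical to training quality.

Before starting, we use stats.table_statistics.table_statistics to get an overview of the features. The full-table statistics module is discussed separately later.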

[8]:
from secretflow.stats.table_statistics import table_statistics

pd.set_option('display.max_rows', None)
data_stats = table_statistics(vdf)
data_stats
[8]:
datatype total_count count(non-NA count) count_na(NA count) na_ratio min max mean var(variance) std(standard deviation) ... moment_2 moment_3 moment_4 central_moment_2 central_moment_3 central_moment_4 sum sum_2 sum_3 sum_4
age int64 36634 36634 0 0.0 18.0 95.0 40.938636 1.125268e+02 10.607867 ... 1.788496e+03 8.324620e+04 4.115919e+06 1.125238e+02 8.144866e+02 4.214078e+04 1499746.0 6.551975e+07 3.049641e+09 1.507826e+11
job object 36634 36634 0 0.0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
marital object 36634 36634 0 0.0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
education object 36634 36634 0 0.0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
default object 36634 36634 0 0.0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
balance int64 36634 36634 0 0.0 -3372.0 98417.0 1366.720424 9.148037e+06 3024.572264 ... 1.101571e+07 2.605110e+11 4.548695e+15 9.147788e+06 2.204507e+11 1.079063e+16 50068436.0 4.035496e+11 9.543560e+15 6.161953e+17
housing object 36634 36634 0 0.0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
loan object 36634 36634 0 0.0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
contact object 36634 36634 0 0.0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
day int64 36634 36634 0 0.0 1.0 31.0 15.798930 6.932877e+01 8.326390 ... 3.189331e+02 7.282283e+03 1.787904e+05 6.932688e+01 5.290034e+01 9.317523e+03 578778.0 1.168379e+07 2.667791e+08 6.549806e+09
month object 36634 36634 0 0.0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
duration int64 36634 36634 0 0.0 0.0 4918.0 257.611563 6.625854e+04 257.407347 ... 1.326205e+05 1.224237e+08 1.824202e+11 6.625673e+04 5.412212e+07 9.586385e+10 9437342.0 4.858418e+09 4.484869e+12 6.682781e+15
campaign int64 36634 36634 0 0.0 1.0 63.0 2.762379 9.535351e+00 3.087936 ... 1.716583e+01 2.437880e+02 5.893892e+03 9.535091e+00 1.436904e+02 3.811396e+03 101197.0 6.288530e+05 8.930929e+06 2.159168e+08
pdays int64 36634 36634 0 0.0 -1.0 871.0 40.435634 1.010338e+04 100.515581 ... 1.173815e+04 3.951940e+06 1.554830e+09 1.010311e+04 2.660249e+06 1.022767e+09 1481319.0 4.300153e+08 1.447754e+11 5.695965e+13
previous int64 36634 36634 0 0.0 0.0 275.0 0.590189 5.856919e+00 2.420107 ... 6.205083e+00 6.336154e+02 1.579490e+05 5.856759e+00 6.230400e+02 1.564658e+05 21621.0 2.273170e+05 2.321187e+07 5.786303e+09
poutcome object 36634 36634 0 0.0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
y object 36634 36634 0 0.0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

17 rows × 26 columns
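Several of the statistics columns are related by simple identities. The sketch below states our reading of the column names (sum_k as the sum of k-th powers, moment_k as sum_k divided by the count, central_moment_2 as the population variance, and var as the sample variance), which is an interpretation rather than something taken from the table_statistics documentation. It is consistent with the age row above, where 1499746.0 / 36634 gives the reported mean. The function `describe` is a hypothetical helper.

```python
import math
import statistics


def describe(values):
    """Compute a few of the table's statistics from raw power sums."""
    n = len(values)
    sum_1 = sum(values)
    sum_2 = sum(v ** 2 for v in values)
    mean = sum_1 / n
    moment_2 = sum_2 / n                       # raw second moment
    central_moment_2 = moment_2 - mean ** 2    # population variance
    var = central_moment_2 * n / (n - 1)       # sample variance
    return {
        "mean": mean,
        "moment_2": moment_2,
        "central_moment_2": central_moment_2,
        "var": var,
        "std": math.sqrt(var),
    }


stats = describe([58, 44, 33, 47, 33])
print(stats)

# Cross-check against the age row of the table above.
age_mean = 1499746.0 / 36634
print(age_mean)
```

The small gap between var and central_moment_2 in the table (for age, 1.125268e+02 versus 1.125238e+02) is the n / (n - 1) factor.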

[9]:
pd.reset_option('display.max_rows')

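Next, we demonstrate the following feature preprocessing capabilities of SecretFlow:

  • Value replacement

  • Missing value filling

  • WOE binning and bin substitution

  • One-hot encoding

  • Standardization

Value replacement#

We first replace values in the following features:

feature | description | values and replacement rules
--- | --- | ---
education | education level | ‘tertiary’ -> 3, ‘secondary’ -> 2, ‘primary’ -> 1, ‘unknown’ -> NaN
default | has a bad credit record | ‘no’ -> 0, ‘yes’ -> 1, ‘unknown’ -> NaN
housing | has a housing loan | ‘no’ -> 0, ‘yes’ -> 1, ‘unknown’ -> NaN
loan | has a personal loan | ‘no’ -> 0, ‘yes’ -> 1, ‘unknown’ -> NaN
month | month of last contact | ‘jan’ -> 1, ‘feb’ -> 2, ‘mar’ -> 3, …, ‘nov’ -> 11, ‘dec’ -> 12
y | label | ‘yes’ -> 1, ‘no’ -> 0

After the replacement, we use sf.reveal to inspect the result. Note that sf.reveal discloses the data in plaintext. In production, its use must be strictly limited and audited.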
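The replacement rules are plain dictionary lookups. The sketch below shows the semantics on an ordinary Python list, with `replace_values` as a hypothetical helper: values that have a rule are mapped, values without a rule are left unchanged, and ‘unknown’ becomes NaN so that it can be handled as a missing value later.

```python
import math

EDUCATION_MAP = {"tertiary": 3, "secondary": 2, "primary": 1, "unknown": math.nan}
YES_NO_MAP = {"no": 0, "yes": 1, "unknown": math.nan}


def replace_values(column, mapping):
    """Return a new list; values without a replacement rule are kept as they are."""
    return [mapping.get(value, value) for value in column]


education = replace_values(["tertiary", "unknown", "primary"], EDUCATION_MAP)
housing = replace_values(["yes", "no", "yes"], YES_NO_MAP)
print(education, housing)
```

A column that contains NaN has to be stored as floating point, which is why education is printed as 3.0 rather than 3 after the replacement.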

[10]:
# Encode education as an ordinal value; 'unknown' becomes NaN (a missing value).
vdf['education'] = vdf['education'].replace(
    {'tertiary': 3, 'secondary': 2, 'primary': 1, 'unknown': np.nan}
)

# Encode the yes/no features as 1/0; 'unknown' becomes NaN.
vdf['default'] = vdf['default'].replace({'no': 0, 'yes': 1, 'unknown': np.nan})

vdf['housing'] = vdf['housing'].replace({'no': 0, 'yes': 1, 'unknown': np.nan})

vdf['loan'] = vdf['loan'].replace({'no': 0, 'yes': 1, 'unknown': np.nan})

vdf['month'] = vdf['month'].replace(
    {
        'jan': 1,
        'feb': 2,
        'mar': 3,
        'apr': 4,
        'may': 5,
        'jun': 6,
        'jul': 7,
        'aug': 8,
        'sep': 9,
        'oct': 10,
        'nov': 11,
        'dec': 12,
    }
)

vdf['y'] = vdf['y'].replace(
    {
        'no': 0,
        'yes': 1,
    }
)

print(sf.reveal(vdf.partitions[alice].data))
print(sf.reveal(vdf.partitions[bob].data))
       age            job  marital  education  default  balance  housing  loan
0       58     management  married        3.0        0     2143        1     0
1       46     management  married        3.0        0      229        1     0
2       33     technician   single        2.0        0       56        0     0
3       42     technician  married        2.0        0     8036        0     0
4       38         admin.  married        1.0        0     1487        0     0
...    ...            ...      ...        ...      ...      ...      ...   ...
36629   38    blue-collar  married        1.0        0     3289        0     0
36630   36  self-employed   single        3.0        0     4844        0     0
36631   49     technician  married        2.0        0      378        0     0
36632   40    blue-collar  married        1.0        0       48        0     0
36633   46       services  married        3.0        0      474        0     0

[36634 rows x 8 columns]
       contact  day  month  duration  campaign  pdays  previous poutcome  y
0      unknown    5      5       261         1     -1         0  unknown  0
1      unknown    5      5       197         1     -1         0  unknown  0
2      unknown    7      5       236         2     -1         0  unknown  0
3      unknown    9      6       948         5     -1         0  unknown  0
4      unknown    9      6       332         2     -1         0  unknown  0
...        ...  ...    ...       ...       ...    ...       ...      ... ..
36629  unknown    9      6       553         2     -1         0  unknown  0
36630  unknown    9      6      1137         3     -1         0  unknown  1
36631  unknown    9      6       189         2     -1         0  unknown  0
36632  unknown    9      6       100         5     -1         0  unknown  0
36633  unknown    9      6       445         2     -1         0  unknown  0

[36634 rows x 9 columns]

Security Discussion#

Value replacement is performed by the PYU device of the data owner, so no data is leaked.
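To make the replacement rules concrete outside of SecretFlow, the following minimal sketch applies the same kind of mapping to a plain Python list. The helper `replace_values` and the sample values are hypothetical and exist only for illustration. Values without a rule pass through unchanged, which mirrors how a pandas-style dictionary `replace` behaves.

```python
import math


# Hypothetical helper for illustration; not part of SecretFlow.
def replace_values(values, rules):
    """Return a new list in which every value found in `rules` is replaced."""
    return [rules.get(value, value) for value in values]


education_rules = {"tertiary": 3, "secondary": 2, "primary": 1, "unknown": math.nan}
education = ["tertiary", "unknown", "primary", "secondary"]

replaced = replace_values(education, education_rules)
print(replaced)  # [3, nan, 1, 2]
```

Mapping 'unknown' to NaN rather than to a number keeps it distinguishable from real categories until a fill strategy is chosen.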

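Missing Value Filling#

Next, we fill in the missing values. Here we fill every column with its mode; other available strategies include the mean and the median.

Other possible approaches include dropping the rows with missing values, or training a model on the complete rows to predict the missing values.

After filling, we use sf.reveal to inspect the result.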

[11]:
vdf["education"] = vdf["education"].fillna(vdf["education"].mode())
vdf["default"] = vdf["default"].fillna(vdf["default"].mode())
vdf["housing"] = vdf["housing"].fillna(vdf["housing"].mode())
vdf["loan"] = vdf["loan"].fillna(vdf["loan"].mode())

print(sf.reveal(vdf.partitions[alice].data))
print(sf.reveal(vdf.partitions[bob].data))
       age            job  marital  education  default  balance  housing  loan
0       58     management  married        3.0        0     2143        1     0
1       46     management  married        3.0        0      229        1     0
2       33     technician   single        2.0        0       56        0     0
3       42     technician  married        2.0        0     8036        0     0
4       38         admin.  married        1.0        0     1487        0     0
...    ...            ...      ...        ...      ...      ...      ...   ...
36629   38    blue-collar  married        1.0        0     3289        0     0
36630   36  self-employed   single        3.0        0     4844        0     0
36631   49     technician  married        2.0        0      378        0     0
36632   40    blue-collar  married        1.0        0       48        0     0
36633   46       services  married        3.0        0      474        0     0

[36634 rows x 8 columns]
       contact  day  month  duration  campaign  pdays  previous poutcome  y
0      unknown    5      5       261         1     -1         0  unknown  0
1      unknown    5      5       197         1     -1         0  unknown  0
2      unknown    7      5       236         2     -1         0  unknown  0
3      unknown    9      6       948         5     -1         0  unknown  0
4      unknown    9      6       332         2     -1         0  unknown  0
...        ...  ...    ...       ...       ...    ...       ...      ... ..
36629  unknown    9      6       553         2     -1         0  unknown  0
36630  unknown    9      6      1137         3     -1         0  unknown  1
36631  unknown    9      6       189         2     -1         0  unknown  0
36632  unknown    9      6       100         5     -1         0  unknown  0
36633  unknown    9      6       445         2     -1         0  unknown  0

[36634 rows x 9 columns]

Security Discussion#

The fill values are computed by the PYU device of the data owner and are then used by that same PYU device in the fill operation, so no data is leaked.
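The three fill strategies named in this section (mode, mean, median) can give noticeably different results on the same column. The sketch below compares them in plain Python using only the standard library. The helper `fill_missing` and the sample column are hypothetical, and this is not how SecretFlow implements `fillna`.

```python
import math
import statistics


# Hypothetical helper for illustration; not part of SecretFlow.
def fill_missing(values, strategy="mode"):
    """Fill NaN entries with a statistic computed from the observed values."""
    observed = [v for v in values if not math.isnan(v)]
    if strategy == "mode":
        fill_value = statistics.mode(observed)
    elif strategy == "mean":
        fill_value = statistics.mean(observed)
    elif strategy == "median":
        fill_value = statistics.median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill_value if math.isnan(v) else v for v in values]


education = [3.0, math.nan, 3.0, 2.0, 1.0, 1.0, 3.0]

filled_mode = fill_missing(education, "mode")
filled_mean = fill_missing(education, "mean")
filled_median = fill_missing(education, "median")
print(filled_mode[1], filled_mean[1], filled_median[1])
```

For an encoded categorical column such as education, the mode is usually the safer choice because it always yields a value that is a valid category, whereas the mean or median may not.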

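WOE Binning#

WOE (weight of evidence) binning replaces continuous values with discrete ones.

One benefit of discretizing a continuous feature is that it mitigates hidden defects in the data and makes model results more stable. For example, extreme values are an important factor affecting model quality: they can push model parameters too high or too low, or mislead the model into learning spurious relationships as important patterns. Discretization effectively weakens the influence of extreme values and outliers.

The 75th percentile of the variable duration is far smaller than its maximum, and its standard deviation is relatively large. We therefore discretize duration.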

[12]:
from secretflow.preprocessing.binning.vert_woe_binning import VertWoeBinning
from secretflow.preprocessing.binning.vert_bin_substitution import VertBinSubstitution

binning = VertWoeBinning(spu)
bin_rules = binning.binning(
    vdf,
    binning_method="chimerge",
    bin_num=4,
    bin_names={alice: [], bob: ["duration"]},
    label_name="y",
)

woe_sub = VertBinSubstitution()
vdf = woe_sub.substitution(vdf, bin_rules)

print(sf.reveal(vdf.partitions[alice].data))
print(sf.reveal(vdf.partitions[bob].data))
INFO:root:Create proxy actor <class 'secretflow.preprocessing.binning.vert_woe_binning_pyu.VertWoeBinningPyuWorker'> with party alice.
INFO:root:Create proxy actor <class 'secretflow.preprocessing.binning.vert_woe_binning_pyu.VertWoeBinningPyuWorker'> with party bob.
(_run pid=3805540) INFO:jax._src.xla_bridge:Unable to initialize backend 'cuda': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
(_run pid=3805540) INFO:jax._src.xla_bridge:Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
(_run pid=3805540) INFO:jax._src.xla_bridge:Unable to initialize backend 'tpu': INVALID_ARGUMENT: TpuPlatform is not available.
(_run pid=3805540) INFO:jax._src.xla_bridge:Unable to initialize backend 'plugin': xla_extension has no attributes named get_plugin_device_client. Compile TensorFlow with //tensorflow/compiler/xla/python:enable_plugin_device set to true (defaults to false) to enable this.
(_run pid=3805540) WARNING:jax._src.xla_bridge:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
(_run pid=3805590) INFO:jax._src.xla_bridge:Unable to initialize backend 'cuda': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
(_run pid=3805590) INFO:jax._src.xla_bridge:Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
(_run pid=3805590) INFO:jax._src.xla_bridge:Unable to initialize backend 'tpu': INVALID_ARGUMENT: TpuPlatform is not available.
(_run pid=3805590) INFO:jax._src.xla_bridge:Unable to initialize backend 'plugin': xla_extension has no attributes named get_plugin_device_client. Compile TensorFlow with //tensorflow/compiler/xla/python:enable_plugin_device set to true (defaults to false) to enable this.
(_run pid=3805590) WARNING:jax._src.xla_bridge:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
INFO:root:Create proxy actor <class 'secretflow.preprocessing.binning.vert_bin_substitution.VertBinSubstitutionPyuWorker'> with party alice.
INFO:root:Create proxy actor <class 'secretflow.preprocessing.binning.vert_bin_substitution.VertBinSubstitutionPyuWorker'> with party bob.
       age            job  marital  education  default  balance  housing  loan
0       58     management  married        3.0        0     2143        1     0
1       46     management  married        3.0        0      229        1     0
2       33     technician   single        2.0        0       56        0     0
3       42     technician  married        2.0        0     8036        0     0
4       38         admin.  married        1.0        0     1487        0     0
...    ...            ...      ...        ...      ...      ...      ...   ...
36629   38    blue-collar  married        1.0        0     3289        0     0
36630   36  self-employed   single        3.0        0     4844        0     0
36631   49     technician  married        2.0        0      378        0     0
36632   40    blue-collar  married        1.0        0       48        0     0
36633   46       services  married        3.0        0      474        0     0

[36634 rows x 8 columns]
       contact  day  month  duration  campaign  pdays  previous poutcome  y
0      unknown    5      5  0.355090         1     -1         0  unknown  0
1      unknown    5      5 -1.136022         1     -1         0  unknown  0
2      unknown    7      5 -0.061611         2     -1         0  unknown  0
3      unknown    9      6  2.328878         5     -1         0  unknown  0
4      unknown    9      6  0.202048         2     -1         0  unknown  0
...        ...  ...    ...       ...       ...    ...       ...      ... ..
36629  unknown    9      6  1.131077         2     -1         0  unknown  0
36630  unknown    9      6  2.328878         3     -1         0  unknown  1
36631  unknown    9      6 -0.394791         2     -1         0  unknown  0
36632  unknown    9      6 -1.708979         5     -1         0  unknown  0
36633  unknown    9      6  0.813627         2     -1         0  unknown  0

[36634 rows x 9 columns]

Security Discussion#

WOE binning uses data from both alice and bob, so the related computation runs on the SPU device to ensure that the raw data is not leaked.
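As a plaintext illustration of what WOE substitution does, the sketch below computes a WOE value per bin and replaces each raw value with the WOE of its bin. Several things here are assumptions rather than SecretFlow behavior: the sample data is made up, the split points are hand-picked instead of derived by ChiMerge, and the sign convention (share of positives over share of negatives) may be the inverse of what a given library uses. It also runs on a single machine, whereas the cell above performs the cross-party part on the SPU.

```python
import math
from bisect import bisect_right

# Hypothetical sample: call durations in seconds and a binary label.
durations = [20, 40, 60, 80, 120, 180, 240, 280, 350, 600, 900, 4918]
labels = [0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1]

# Hand-picked split points (an assumption); ChiMerge would derive them from the data.
split_points = [100, 300]


def bin_index(value):
    """Index of the bin that `value` falls into."""
    return bisect_right(split_points, value)


def woe_per_bin(values, targets):
    """WOE of each bin: ln((pos_i / pos_total) / (neg_i / neg_total)).

    Every bin must contain both classes, otherwise the ratio is undefined.
    """
    n_bins = len(split_points) + 1
    pos = [0] * n_bins
    neg = [0] * n_bins
    for value, target in zip(values, targets):
        if target == 1:
            pos[bin_index(value)] += 1
        else:
            neg[bin_index(value)] += 1
    pos_total, neg_total = sum(pos), sum(neg)
    return [
        math.log((pos[i] / pos_total) / (neg[i] / neg_total)) for i in range(n_bins)
    ]


woe = woe_per_bin(durations, labels)
substituted = [woe[bin_index(v)] for v in durations]
print([round(w, 4) for w in woe])  # [-1.0986, 0.0, 1.0986]
```

Note that the extreme duration 4918 and the moderate duration 350 receive the same substituted value, which is the outlier-damping effect described earlier in this section.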

One-Hot Encoding#

One-hot encoding converts categorical values into numerical ones. Features such as job and marital need to be one-hot encoded.

[13]:
from secretflow.preprocessing.encoder import OneHotEncoder

encoder = OneHotEncoder()
categorical_columns = ["job", "marital", "contact", "month", "day", "poutcome"]

# vdf_hat keeps only the non-categorical columns; it is used for VIF and correlation only.
vdf_hat = vdf.drop(columns=categorical_columns)

# Encode each categorical column and append the resulting indicator columns to vdf.
for column in categorical_columns:
    transformed_df = encoder.fit_transform(vdf[column])
    vdf[transformed_df.columns] = transformed_df

# Drop the original categorical columns now that they are encoded.
vdf = vdf.drop(columns=categorical_columns)

print(sf.reveal(vdf.partitions[alice].data))
print(sf.reveal(vdf.partitions[bob].data))
       age  education  default  balance  housing  loan  job_admin.  \
0       58        3.0        0     2143        1     0         0.0
1       46        3.0        0      229        1     0         0.0
2       33        2.0        0       56        0     0         0.0
3       42        2.0        0     8036        0     0         0.0
4       38        1.0        0     1487        0     0         1.0
...    ...        ...      ...      ...      ...   ...         ...
36629   38        1.0        0     3289        0     0         0.0
36630   36        3.0        0     4844        0     0         0.0
36631   49        2.0        0      378        0     0         0.0
36632   40        1.0        0       48        0     0         0.0
36633   46        3.0        0      474        0     0         0.0

       job_blue-collar  job_entrepreneur  job_housemaid  ...  job_retired  \
0                  0.0               0.0            0.0  ...          0.0
1                  0.0               0.0            0.0  ...          0.0
2                  0.0               0.0            0.0  ...          0.0
3                  0.0               0.0            0.0  ...          0.0
4                  0.0               0.0            0.0  ...          0.0
...                ...               ...            ...  ...          ...
36629              1.0               0.0            0.0  ...          0.0
36630              0.0               0.0            0.0  ...          0.0
36631              0.0               0.0            0.0  ...          0.0
36632              1.0               0.0            0.0  ...          0.0
36633              0.0               0.0            0.0  ...          0.0

       job_self-employed  job_services  job_student  job_technician  \
0                    0.0           0.0          0.0             0.0
1                    0.0           0.0          0.0             0.0
2                    0.0           0.0          0.0             1.0
3                    0.0           0.0          0.0             1.0
4                    0.0           0.0          0.0             0.0
...                  ...           ...          ...             ...
36629                0.0           0.0          0.0             0.0
36630                1.0           0.0          0.0             0.0
36631                0.0           0.0          0.0             1.0
36632                0.0           0.0          0.0             0.0
36633                0.0           1.0          0.0             0.0

       job_unemployed  job_unknown  marital_divorced  marital_married  \
0                 0.0          0.0               0.0              1.0
1                 0.0          0.0               0.0              1.0
2                 0.0          0.0               0.0              0.0
3                 0.0          0.0               0.0              1.0
4                 0.0          0.0               0.0              1.0
...               ...          ...               ...              ...
36629             0.0          0.0               0.0              1.0
36630             0.0          0.0               0.0              0.0
36631             0.0          0.0               0.0              1.0
36632             0.0          0.0               0.0              1.0
36633             0.0          0.0               0.0              1.0

       marital_single
0                 0.0
1                 0.0
2                 1.0
3                 0.0
4                 0.0
...               ...
36629             0.0
36630             1.0
36631             0.0
36632             0.0
36633             0.0

[36634 rows x 21 columns]
       duration  campaign  pdays  previous  y  contact_cellular  \
0      0.355090         1     -1         0  0               0.0
1     -1.136022         1     -1         0  0               0.0
2     -0.061611         2     -1         0  0               0.0
3      2.328878         5     -1         0  0               0.0
4      0.202048         2     -1         0  0               0.0
...         ...       ...    ...       ... ..               ...
36629  1.131077         2     -1         0  0               0.0
36630  2.328878         3     -1         0  1               0.0
36631 -0.394791         2     -1         0  0               0.0
36632 -1.708979         5     -1         0  0               0.0
36633  0.813627         2     -1         0  0               0.0

       contact_telephone  contact_unknown  month_1.0  month_2.0  ...  \
0                    0.0              1.0        0.0        0.0  ...
1                    0.0              1.0        0.0        0.0  ...
2                    0.0              1.0        0.0        0.0  ...
3                    0.0              1.0        0.0        0.0  ...
4                    0.0              1.0        0.0        0.0  ...
...                  ...              ...        ...        ...  ...
36629                0.0              1.0        0.0        0.0  ...
36630                0.0              1.0        0.0        0.0  ...
36631                0.0              1.0        0.0        0.0  ...
36632                0.0              1.0        0.0        0.0  ...
36633                0.0              1.0        0.0        0.0  ...

       day_26.0  day_27.0  day_28.0  day_29.0  day_30.0  day_31.0  \
0           0.0       0.0       0.0       0.0       0.0       0.0
1           0.0       0.0       0.0       0.0       0.0       0.0
2           0.0       0.0       0.0       0.0       0.0       0.0
3           0.0       0.0       0.0       0.0       0.0       0.0
4           0.0       0.0       0.0       0.0       0.0       0.0
...         ...       ...       ...       ...       ...       ...
36629       0.0       0.0       0.0       0.0       0.0       0.0
36630       0.0       0.0       0.0       0.0       0.0       0.0
36631       0.0       0.0       0.0       0.0       0.0       0.0
36632       0.0       0.0       0.0       0.0       0.0       0.0
36633       0.0       0.0       0.0       0.0       0.0       0.0

       poutcome_failure  poutcome_other  poutcome_success  poutcome_unknown
0                   0.0             0.0               0.0               1.0
1                   0.0             0.0               0.0               1.0
2                   0.0             0.0               0.0               1.0
3                   0.0             0.0               0.0               1.0
4                   0.0             0.0               0.0               1.0
...                 ...             ...               ...               ...
36629               0.0             0.0               0.0               1.0
36630               0.0             0.0               0.0               1.0
36631               0.0             0.0               0.0               1.0
36632               0.0             0.0               0.0               1.0
36633               0.0             0.0               0.0               1.0

[36634 rows x 55 columns]

Security Discussion#

One-hot encoding is performed by the PYU device of the data owner, so no data is leaked.
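For readers unfamiliar with the transformation itself, here is a minimal plain-Python sketch of one-hot encoding. The helper `one_hot` and the sample values are hypothetical. The `<feature>_<category>` column naming is chosen to match the column names visible in the output above, and sorting the categories alphabetically is an assumption.

```python
# Hypothetical helper for illustration; not part of SecretFlow.
def one_hot(values, prefix):
    """Return (column names, rows) with one 0/1 indicator column per category."""
    categories = sorted(set(values))
    columns = [f"{prefix}_{category}" for category in categories]
    rows = [
        [1.0 if value == category else 0.0 for category in categories]
        for value in values
    ]
    return columns, rows


marital = ["married", "single", "married", "divorced"]
columns, rows = one_hot(marital, "marital")
print(columns)  # ['marital_divorced', 'marital_married', 'marital_single']
print(rows[0])  # [0.0, 1.0, 0.0]
```

Each row contains exactly one 1.0, so the number of columns grows with the number of distinct categories; this is why the bob partition grows from 9 to 55 columns after encoding month and day.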

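Standardization#

Large differences in scale between features make it harder for the model to converge, so we usually standardize the values first.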

[14]:
from secretflow.preprocessing import StandardScaler

X = vdf.drop(columns=['y'])
y = vdf['y']
scaler = StandardScaler()
X = scaler.fit_transform(X)
vdf[X.columns] = X
print(sf.reveal(vdf.partitions[alice].data))
print(sf.reveal(vdf.partitions[bob].data))
(_run pid=3805540) /home/zoupeicheng.zpc/miniconda3/envs/py38/lib/python3.8/site-packages/sklearn/base.py:443: UserWarning: X has feature names, but StandardScaler was fitted without feature names
(_run pid=3805540)   warnings.warn(
(_run pid=3805590) /home/zoupeicheng.zpc/miniconda3/envs/py38/lib/python3.8/site-packages/sklearn/base.py:443: UserWarning: X has feature names, but StandardScaler was fitted without feature names
(_run pid=3805590)   warnings.warn(
            age  education   default   balance   housing      loan  \
0      1.608391   1.314552 -0.135554  0.256661  0.893944 -0.436771
1      0.477140   1.314552 -0.135554 -0.376164  0.893944 -0.436771
2     -0.748383  -0.217705 -0.135554 -0.433363 -1.118638 -0.436771
3      0.100056  -0.217705 -0.135554  2.205062 -1.118638 -0.436771
4     -0.277028  -1.749962 -0.135554  0.039768 -1.118638 -0.436771
...         ...        ...       ...       ...       ...       ...
36629 -0.277028  -1.749962 -0.135554  0.635563 -1.118638 -0.436771
36630 -0.465570   1.314552 -0.135554  1.149692 -1.118638 -0.436771
36631  0.759952  -0.217705 -0.135554 -0.326900 -1.118638 -0.436771
36632 -0.088486  -1.749962 -0.135554 -0.436008 -1.118638 -0.436771
36633  0.477140   1.314552 -0.135554 -0.295160 -1.118638 -0.436771

       job_admin.  job_blue-collar  job_entrepreneur  job_housemaid  ...  \
0       -0.360094        -0.525105         -0.185291       -0.16795  ...
1       -0.360094        -0.525105         -0.185291       -0.16795  ...
2       -0.360094        -0.525105         -0.185291       -0.16795  ...
3       -0.360094        -0.525105         -0.185291       -0.16795  ...
4        2.777051        -0.525105         -0.185291       -0.16795  ...
...           ...              ...               ...            ...  ...
36629   -0.360094         1.904383         -0.185291       -0.16795  ...
36630   -0.360094        -0.525105         -0.185291       -0.16795  ...
36631   -0.360094        -0.525105         -0.185291       -0.16795  ...
36632   -0.360094         1.904383         -0.185291       -0.16795  ...
36633   -0.360094        -0.525105         -0.185291       -0.16795  ...

       job_retired  job_self-employed  job_services  job_student  \
0        -0.227915          -0.189196     -0.316887    -0.145356
1        -0.227915          -0.189196     -0.316887    -0.145356
2        -0.227915          -0.189196     -0.316887    -0.145356
3        -0.227915          -0.189196     -0.316887    -0.145356
4        -0.227915          -0.189196     -0.316887    -0.145356
...            ...                ...           ...          ...
36629    -0.227915          -0.189196     -0.316887    -0.145356
36630    -0.227915           5.285528     -0.316887    -0.145356
36631    -0.227915          -0.189196     -0.316887    -0.145356
36632    -0.227915          -0.189196     -0.316887    -0.145356
36633    -0.227915          -0.189196      3.155697    -0.145356

       job_technician  job_unemployed  job_unknown  marital_divorced  \
0           -0.449906       -0.172953    -0.080522         -0.362219
1           -0.449906       -0.172953    -0.080522         -0.362219
2            2.222685       -0.172953    -0.080522         -0.362219
3            2.222685       -0.172953    -0.080522         -0.362219
4           -0.449906       -0.172953    -0.080522         -0.362219
...               ...             ...          ...               ...
36629       -0.449906       -0.172953    -0.080522         -0.362219
36630       -0.449906       -0.172953    -0.080522         -0.362219
36631        2.222685       -0.172953    -0.080522         -0.362219
36632       -0.449906       -0.172953    -0.080522         -0.362219
36633       -0.449906       -0.172953    -0.080522         -0.362219

       marital_married  marital_single
0             0.815216       -0.628656
1             0.815216       -0.628656
2            -1.226669        1.590694
3             0.815216       -0.628656
4             0.815216       -0.628656
...                ...             ...
36629         0.815216       -0.628656
36630        -1.226669        1.590694
36631         0.815216       -0.628656
36632         0.815216       -0.628656
36633         0.815216       -0.628656

[36634 rows x 21 columns]
       duration  campaign     pdays  previous  y  contact_cellular  \
0      0.646636 -0.570738 -0.412237 -0.243872  0         -1.359335
1     -0.174118 -0.570738 -0.412237 -0.243872  0         -1.359335
2      0.417271 -0.246893 -0.412237 -0.243872  0         -1.359335
3      1.733068  0.724643 -0.412237 -0.243872  0         -1.359335
4      0.562397 -0.246893 -0.412237 -0.243872  0         -1.359335
...         ...       ...       ...       ... ..               ...
36629  1.073762 -0.246893 -0.412237 -0.243872  0         -1.359335
36630  1.733068  0.076952 -0.412237 -0.243872  1         -1.359335
36631  0.233878 -0.246893 -0.412237 -0.243872  0         -1.359335
36632 -0.489491  0.724643 -0.412237 -0.243872  0         -1.359335
36633  0.899028 -0.246893 -0.412237 -0.243872  0         -1.359335

       contact_telephone  contact_unknown  month_1.0  month_2.0  ...  \
0              -0.261573         1.575748  -0.178158  -0.249263  ...
1              -0.261573         1.575748  -0.178158  -0.249263  ...
2              -0.261573         1.575748  -0.178158  -0.249263  ...
3              -0.261573         1.575748  -0.178158  -0.249263  ...
4              -0.261573         1.575748  -0.178158  -0.249263  ...
...                  ...              ...        ...        ...  ...
36629          -0.261573         1.575748  -0.178158  -0.249263  ...
36630          -0.261573         1.575748  -0.178158  -0.249263  ...
36631          -0.261573         1.575748  -0.178158  -0.249263  ...
36632          -0.261573         1.575748  -0.178158  -0.249263  ...
36633          -0.261573         1.575748  -0.178158  -0.249263  ...

       day_26.0  day_27.0  day_28.0  day_29.0  day_30.0  day_31.0  \
0      -0.15375 -0.160947 -0.204243 -0.200148 -0.188032 -0.120812
1      -0.15375 -0.160947 -0.204243 -0.200148 -0.188032 -0.120812
2      -0.15375 -0.160947 -0.204243 -0.200148 -0.188032 -0.120812
3      -0.15375 -0.160947 -0.204243 -0.200148 -0.188032 -0.120812
4      -0.15375 -0.160947 -0.204243 -0.200148 -0.188032 -0.120812
...         ...       ...       ...       ...       ...       ...
36629  -0.15375 -0.160947 -0.204243 -0.200148 -0.188032 -0.120812
36630  -0.15375 -0.160947 -0.204243 -0.200148 -0.188032 -0.120812
36631  -0.15375 -0.160947 -0.204243 -0.200148 -0.188032 -0.120812
36632  -0.15375 -0.160947 -0.204243 -0.200148 -0.188032 -0.120812
36633  -0.15375 -0.160947 -0.204243 -0.200148 -0.188032 -0.120812

       poutcome_failure  poutcome_other  poutcome_success  poutcome_unknown
0             -0.350446       -0.204025          -0.18733          0.473664
1             -0.350446       -0.204025          -0.18733          0.473664
2             -0.350446       -0.204025          -0.18733          0.473664
3             -0.350446       -0.204025          -0.18733          0.473664
4             -0.350446       -0.204025          -0.18733          0.473664
...                 ...             ...               ...               ...
36629         -0.350446       -0.204025          -0.18733          0.473664
36630         -0.350446       -0.204025          -0.18733          0.473664
36631         -0.350446       -0.204025          -0.18733          0.473664
36632         -0.350446       -0.204025          -0.18733          0.473664
36633         -0.350446       -0.204025          -0.18733          0.473664

[36634 rows x 55 columns]

Security Discussion#

Standardization is performed by the PYU device of the data owner, so no data is leaked.
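Standardization maps each value x to (x - mean) / std. The sketch below shows this in plain Python on made-up balance values. It assumes the population standard deviation (denominator n) is used, which is what scikit-learn's StandardScaler does; whether SecretFlow's wrapper behaves identically in every version is an assumption here.

```python
import statistics


# Hypothetical helper for illustration; not part of SecretFlow.
def standardize(values):
    """Scale values to zero mean and unit population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std, denominator n
    return [(value - mean) / std for value in values]


balance = [2143, 229, 56, 8036, 1487]
scaled = standardize(balance)

population_variance = statistics.pvariance(scaled)  # denominator n
sample_variance = statistics.variance(scaled)  # denominator n - 1
print(round(population_variance, 6), round(sample_variance, 6))  # 1.0 1.25
```

After scaling, the population variance is 1 while the sample variance is n / (n - 1). This may explain why the table statistics later in this tutorial report a variance of 1.000027 rather than exactly 1: with 36634 rows, 36634 / 36633 is approximately 1.000027.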

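More#

SecretFlow supports additional feature preprocessing capabilities; please refer to this document.

At this point, we have completed all feature preprocessing.

The main purpose of this tutorial is to demonstrate the preprocessing capabilities of SecretFlow. Some of the preprocessing choices made here may be debatable; thank you for your understanding.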

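Data Analysis#

Before modeling, we should analyze the data we are using to determine whether any feature preprocessing needs to be repeated.

Below we demonstrate the following data analysis capabilities of SecretFlow:

  • Table statistics

  • Correlation coefficient matrix

  • VIF calculation

Table Statistics#

We provide table_statistics, which is similar to pd.DataFrame.describe, to show basic statistics for all features.

During feature preprocessing, you can call table statistics repeatedly to monitor the effect of each step.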
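To help read the statistics table, the following plain-Python sketch computes a few of the same quantities for a tiny made-up label column. The definitions are assumptions inferred from the column names in the output (for example, `moment_k` as the mean of x^k, `central_moment_k` as the mean of (x - mean)^k, `sum_k` as the sum of x^k, and `var` with an n - 1 denominator); the helper `describe` is hypothetical and is not the SecretFlow implementation.

```python
# Hypothetical helper for illustration; not part of SecretFlow.
def describe(values):
    """Compute describe-like statistics for a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    stats = {
        "count": n,
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "sum": sum(values),
        # Sample variance (denominator n - 1), assumed from the reported values.
        "var": sum((v - mean) ** 2 for v in values) / (n - 1),
    }
    for k in (2, 3, 4):
        power_sum = sum(v ** k for v in values)
        stats[f"sum_{k}"] = power_sum
        stats[f"moment_{k}"] = power_sum / n
        stats[f"central_moment_{k}"] = sum((v - mean) ** k for v in values) / n
    return stats


y = [0, 0, 1, 0, 1]
stats = describe(y)
print(stats["mean"], stats["sum_2"], round(stats["central_moment_2"], 6))
```

For a 0/1 column, every raw moment equals the mean and every power sum equals the count of ones, which is the pattern the y row shows in the output of the cell below.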

[15]:
from secretflow.stats.table_statistics import table_statistics

pd.set_option('display.max_rows', None)
data_stats = table_statistics(vdf)
data_stats
[15]:
datatype total_count count(non-NA count) count_na(NA count) na_ratio min max mean var(variance) std(standard deviation) ... moment_2 moment_3 moment_4 central_moment_2 central_moment_3 central_moment_4 sum sum_2 sum_3 sum_4
age float64 36634 36634 0 0.0 -2.162447 5.096416 -1.877506e-16 1.000027 1.000014 ... 1.000000 0.682366 3.328235 1.000000 0.682366 3.328235 -6.878054e-12 36634.0 2.499780e+04 1.219266e+05
education float64 36634 36634 0 0.0 -1.749962 1.314552 9.309945e-18 1.000027 1.000014 ... 1.000000 -0.152303 2.305096 1.000000 -0.152303 2.305096 3.410605e-13 36634.0 -5.579460e+03 8.444490e+04
default float64 36634 36634 0 0.0 -0.135554 7.377133 4.965304e-17 1.000027 1.000014 ... 1.000000 7.241579 53.440463 1.000000 7.241579 53.440463 1.818989e-12 36634.0 2.652880e+05 1.957738e+06
balance float64 36634 36634 0 0.0 -1.566762 32.087712 -1.396492e-17 1.000027 1.000014 ... 1.000000 7.967780 128.947991 1.000000 7.967780 128.947991 -5.115908e-13 36634.0 2.918916e+05 4.723881e+06
housing float64 36634 36634 0 0.0 -1.118638 0.893944 -3.103315e-17 1.000027 1.000014 ... 1.000000 -0.224695 1.050488 1.000000 -0.224695 1.050488 -1.136868e-12 36634.0 -8.231462e+03 3.848356e+04
loan float64 36634 36634 0 0.0 -0.436771 2.289530 5.779924e-17 1.000027 1.000014 ... 1.000000 1.852760 4.432718 1.000000 1.852760 4.432718 2.117417e-12 36634.0 6.787399e+04 1.623882e+05
job_admin. float64 36634 36634 0 0.0 -0.360094 2.777051 -2.288695e-17 1.000027 1.000014 ... 1.000000 2.416956 6.841677 1.000000 2.416956 6.841677 -8.384404e-13 36634.0 8.854277e+04 2.506380e+05
job_blue-collar float64 36634 36634 0 0.0 -0.525105 1.904383 3.103315e-17 1.000027 1.000014 ... 1.000000 1.379278 2.902408 1.000000 1.379278 2.902408 1.136868e-12 36634.0 5.052848e+04 1.063268e+05
job_entrepreneur float64 36634 36634 0 0.0 -0.185291 5.396911 -9.697859e-18 1.000027 1.000014 ... 1.000000 5.211619 28.160978 1.000000 5.211619 28.160978 -3.552714e-13 36634.0 1.909225e+05 1.031649e+06
job_housemaid float64 36634 36634 0 0.0 -0.167950 5.954136 5.430801e-17 1.000027 1.000014 ... 1.000000 5.786186 34.479949 1.000000 5.786186 34.479949 1.989520e-12 36634.0 2.119711e+05 1.263138e+06
job_management float64 36634 36634 0 0.0 -0.513622 1.946956 5.430801e-17 1.000027 1.000014 ... 1.000000 1.433333 3.054445 1.000000 1.433333 3.054445 1.989520e-12 36634.0 5.250874e+04 1.118965e+05
job_retired float64 36634 36634 0 0.0 -0.227915 4.387592 6.361796e-17 1.000027 1.000014 ... 1.000000 4.159677 18.302913 1.000000 4.159677 18.302913 2.330580e-12 36634.0 1.523856e+05 6.705089e+05
job_self-employed float64 36634 36634 0 0.0 -0.189196 5.285528 -4.267058e-17 1.000027 1.000014 ... 1.000000 5.096332 26.972604 1.000000 5.096332 26.972604 -1.563194e-12 36634.0 1.866990e+05 9.881144e+05
job_services float64 36634 36634 0 0.0 -0.316887 3.155697 6.361796e-17 1.000027 1.000014 ... 1.000000 2.838809 9.058838 1.000000 2.838809 9.058838 2.330580e-12 36634.0 1.039969e+05 3.318615e+05
job_student float64 36634 36634 0 0.0 -0.145356 6.879667 -9.309945e-18 1.000027 1.000014 ... 1.000000 6.734311 46.350944 1.000000 6.734311 46.350944 -3.410605e-13 36634.0 2.467047e+05 1.698020e+06
job_technician float64 36634 36634 0 0.0 -0.449906 2.222685 -1.861989e-17 1.000027 1.000014 ... 1.000000 1.772778 4.142743 1.000000 1.772778 4.142743 -6.821210e-13 36634.0 6.494396e+04 1.517653e+05
job_unemployed float64 36634 36634 0 0.0 -0.172953 5.781907 1.474075e-17 1.000027 1.000014 ... 1.000000 5.608954 32.460364 1.000000 5.608954 32.460364 5.400125e-13 36634.0 2.054784e+05 1.189153e+06
job_unknown float64 36634 36634 0 0.0 -0.080522 12.418889 -1.512866e-17 1.000027 1.000014 ... 1.000000 12.338367 153.235297 1.000000 12.338367 153.235297 -5.542233e-13 36634.0 4.520037e+05 5.613622e+06
marital_divorced float64 36634 36634 0 0.0 -0.362219 2.760760 2.754192e-17 1.000027 1.000014 ... 1.000000 2.398540 6.752996 1.000000 2.398540 6.752996 1.008971e-12 36634.0 8.786813e+04 2.473893e+05
marital_married float64 36634 36634 0 0.0 -1.226669 0.815216 -1.334425e-16 1.000027 1.000014 ... 1.000000 -0.411454 1.169294 1.000000 -0.411454 1.169294 -4.888534e-12 36634.0 -1.507319e+04 4.283592e+04
marital_single float64 36634 36634 0 0.0 -0.628656 1.590694 6.012673e-17 1.000027 1.000014 ... 1.000000 0.962038 1.925516 1.000000 0.962038 1.925516 2.202682e-12 36634.0 3.524328e+04 7.053936e+04
duration float64 36634 36634 0 0.0 -2.809886 1.846421 6.633336e-17 1.000027 1.000014 ... 1.000000 -0.976752 4.099032 1.000000 -0.976752 4.099032 2.430056e-12 36634.0 -3.578234e+04 1.501639e+05
campaign float64 36634 36634 0 0.0 -0.570738 19.507670 4.034309e-17 1.000027 1.000014 ... 1.000000 4.880232 41.921269 1.000000 4.880232 41.921269 1.477929e-12 36634.0 1.787824e+05 1.535744e+06
pdays float64 36634 36634 0 0.0 -0.412237 8.263154 -6.206630e-18 1.000027 1.000014 ... 1.000000 2.619630 10.019985 1.000000 2.619630 10.019985 -2.273737e-13 36634.0 9.596753e+04 3.670721e+05
previous float64 36634 36634 0 0.0 -0.243872 113.389007 -4.034309e-17 1.000027 1.000014 ... 1.000000 43.957189 4561.467798 1.000000 43.957189 4561.467798 -1.477929e-12 36634.0 1.610328e+06 1.671048e+08
y int64 36634 36634 0 0.0 0.000000 1.000000 1.174592e-01 0.103665 0.321971 ... 0.117459 0.117459 0.117459 0.103663 0.079310 0.071425 4.303000e+03 4303.0 4.303000e+03 4.303000e+03
contact_cellular float64 36634 36634 0 0.0 -1.359335 0.735654 -4.965304e-17 1.000027 1.000014 ... 1.000000 -0.623682 1.388979 1.000000 -0.623682 1.388979 -1.818989e-12 36634.0 -2.284795e+04 5.088384e+04
contact_telephone float64 36634 36634 0 0.0 -0.261573 3.823024 5.896298e-17 1.000027 1.000014 ... 1.000000 3.561451 13.683936 1.000000 3.561451 13.683936 2.160050e-12 36634.0 1.304702e+05 5.012973e+05
contact_unknown float64 36634 36634 0 0.0 -0.634619 1.575748 -1.241326e-16 1.000027 1.000014 ... 1.000000 0.941129 1.885723 1.000000 0.941129 1.885723 -4.547474e-12 36634.0 3.447731e+04 6.908158e+04
month_1.0 float64 36634 36634 0 0.0 -0.178158 5.613000 -2.482652e-17 1.000027 1.000014 ... 1.000000 5.434842 30.537508 1.000000 5.434842 30.537508 -9.094947e-13 36634.0 1.991000e+05 1.118711e+06
month_2.0 float64 36634 36634 0 0.0 -0.249263 4.011823 8.378950e-17 1.000027 1.000014 ... 1.000000 3.762560 15.156859 1.000000 3.762560 15.156859 3.069545e-12 36634.0 1.378376e+05 5.552564e+05
month_3.0 float64 36634 36634 0 0.0 -0.103058 9.703260 -5.275635e-17 1.000027 1.000014 ... 1.000000 9.600201 93.163868 1.000000 9.600201 93.163868 -1.932676e-12 36634.0 3.516938e+05 3.412965e+06
month_4.0 float64 36634 36634 0 0.0 -0.265247 3.770074 1.861989e-16 1.000027 1.000014 ... 1.000000 3.504827 13.283811 1.000000 3.504827 13.283811 6.821210e-12 36634.0 1.283958e+05 4.866391e+05
month_5.0 float64 36634 36634 0 0.0 -0.658775 1.517969 -1.241326e-17 1.000027 1.000014 ... 1.000000 0.859194 1.738215 1.000000 0.859194 1.738215 -4.547474e-13 36634.0 3.147572e+04 6.367775e+04
month_6.0 float64 36634 36634 0 0.0 -0.366401 2.729249 -7.447956e-17 1.000027 1.000014 ... 1.000000 2.362848 6.583051 1.000000 2.362848 6.583051 -2.728484e-12 36634.0 8.656057e+04 2.411635e+05
month_7.0 float64 36634 36634 0 0.0 -0.425328 2.351127 -1.551657e-16 1.000027 1.000014 ... 1.000000 1.925799 4.708701 1.000000 1.925799 4.708701 -5.684342e-12 36634.0 7.054972e+04 1.724986e+05
month_8.0 float64 36634 36634 0 0.0 -0.402386 2.485176 -1.861989e-17 1.000027 1.000014 ... 1.000000 2.082791 5.338016 1.000000 2.082791 5.338016 -6.821210e-13 36634.0 7.630095e+04 1.955529e+05
month_9.0 float64 36634 36634 0 0.0 -0.112144 8.917078 -1.706823e-17 1.000027 1.000014 ... 1.000000 8.804934 78.526862 1.000000 8.804934 78.526862 -6.252776e-13 36634.0 3.225600e+05 2.876753e+06
month_10.0 float64 36634 36634 0 0.0 -0.127389 7.849982 -9.309945e-18 1.000027 1.000014 ... 1.000000 7.722593 60.638450 1.000000 7.722593 60.638450 -3.410605e-13 36634.0 2.829095e+05 2.221429e+06
month_11.0 float64 36634 36634 0 0.0 -0.310483 3.220790 5.585967e-17 1.000027 1.000014 ... 1.000000 2.910307 9.469886 1.000000 2.910307 9.469886 2.046363e-12 36634.0 1.066162e+05 3.469198e+05
month_12.0 float64 36634 36634 0 0.0 -0.068280 14.645618 7.758287e-18 1.000027 1.000014 ... 1.000000 14.577338 213.498780 1.000000 14.577338 213.498780 2.842171e-13 36634.0 5.340262e+05 7.821314e+06
day_1.0 float64 36634 36634 0 0.0 -0.086489 11.562172 3.103315e-18 1.000027 1.000014 ... 1.000000 11.475683 132.691304 1.000000 11.475683 132.691304 1.136868e-13 36634.0 4.204002e+05 4.861013e+06
day_2.0 float64 36634 36634 0 0.0 -0.171018 5.847321 -5.585967e-17 1.000027 1.000014 ... 1.000000 5.676302 33.220410 1.000000 5.676302 33.220410 -2.046363e-12 36634.0 2.079457e+05 1.216996e+06
day_3.0 float64 36634 36634 0 0.0 -0.158339 6.315549 0.000000e+00 1.000027 1.000014 ... 1.000000 6.157210 38.911232 1.000000 6.157210 38.911232 0.000000e+00 36634.0 2.255632e+05 1.425474e+06
day_4.0 float64 36634 36634 0 0.0 -0.180429 5.542360 -3.258481e-17 1.000027 1.000014 ... 1.000000 5.361931 29.750303 1.000000 5.361931 29.750303 -1.193712e-12 36634.0 1.964290e+05 1.089873e+06
day_5.0 float64 36634 36634 0 0.0 -0.211179 4.735322 -7.447956e-17 1.000027 1.000014 ... 1.000000 4.524143 21.467870 1.000000 4.524143 21.467870 -2.728484e-12 36634.0 1.657375e+05 7.864540e+05
day_6.0 float64 36634 36634 0 0.0 -0.212024 4.716452 6.827293e-17 1.000027 1.000014 ... 1.000000 4.504429 21.289878 1.000000 4.504429 21.289878 2.501110e-12 36634.0 1.650152e+05 7.799334e+05
day_7.0 float64 36634 36634 0 0.0 -0.203808 4.906588 -4.344641e-17 1.000027 1.000014 ... 1.000000 4.702780 23.116144 1.000000 4.702780 23.116144 -1.591616e-12 36634.0 1.722817e+05 8.468368e+05
day_8.0 float64 36634 36634 0 0.0 -0.204967 4.878830 -8.999613e-17 1.000027 1.000014 ... 1.000000 4.673862 22.844991 1.000000 4.673862 22.844991 -3.296918e-12 36634.0 1.712223e+05 8.369034e+05
day_9.0 float64 36634 36634 0 0.0 -0.189814 5.268311 1.861989e-17 1.000027 1.000014 ... 1.000000 5.078497 26.791131 1.000000 5.078497 26.791131 6.821210e-13 36634.0 1.860457e+05 9.814663e+05
day_10.0 float64 36634 36634 0 0.0 -0.109110 9.165025 -4.034309e-17 1.000027 1.000014 ... 1.000000 9.055914 83.009585 1.000000 9.055914 83.009585 -1.477929e-12 36634.0 3.317544e+05 3.040973e+06
day_11.0 float64 36634 36634 0 0.0 -0.183073 5.462298 -4.344641e-17 1.000027 1.000014 ... 1.000000 5.279225 28.870216 1.000000 5.279225 28.870216 -1.591616e-12 36634.0 1.933991e+05 1.057631e+06
day_12.0 float64 36634 36634 0 0.0 -0.191352 5.225961 -9.309945e-17 1.000027 1.000014 ... 1.000000 5.034608 26.347280 1.000000 5.034608 26.347280 -3.410605e-12 36634.0 1.844378e+05 9.652063e+05
day_13.0 float64 36634 36634 0 0.0 -0.190661 5.244897 4.034309e-17 1.000027 1.000014 ... 1.000000 5.054236 26.545301 1.000000 5.054236 26.545301 1.477929e-12 36634.0 1.851569e+05 9.724606e+05
day_14.0 float64 36634 36634 0 0.0 -0.206696 4.838016 -1.551657e-17 1.000027 1.000014 ... 1.000000 4.631319 22.449119 1.000000 4.631319 22.449119 -5.684342e-13 36634.0 1.696638e+05 8.224010e+05
day_15.0 float64 36634 36634 0 0.0 -0.195980 5.102564 6.827293e-17 1.000027 1.000014 ... 1.000000 4.906584 25.074570 1.000000 4.906584 25.074570 2.501110e-12 36634.0 1.797478e+05 9.185818e+05
day_16.0 float64 36634 36634 0 0.0 -0.178728 5.595097 9.309945e-18 1.000027 1.000014 ... 1.000000 5.416369 30.337058 1.000000 5.416369 30.337058 3.410605e-13 36634.0 1.984233e+05 1.111368e+06
day_17.0 float64 36634 36634 0 0.0 -0.213286 4.688543 8.689282e-17 1.000027 1.000014 ... 1.000000 4.475257 21.027925 1.000000 4.475257 21.027925 3.183231e-12 36634.0 1.639466e+05 7.703370e+05
day_18.0 float64 36634 36634 0 0.0 -0.231079 4.327530 2.482652e-17 1.000027 1.000014 ... 1.000000 4.096451 17.780915 1.000000 4.096451 17.780915 9.094947e-13 36634.0 1.500694e+05 6.513860e+05
day_19.0 float64 36634 36634 0 0.0 -0.201472 4.963478 -3.723978e-17 1.000027 1.000014 ... 1.000000 4.762006 23.676700 1.000000 4.762006 23.676700 -1.364242e-12 36634.0 1.744513e+05 8.673722e+05
day_20.0 float64 36634 36634 0 0.0 -0.256473 3.899047 -4.344641e-17 1.000027 1.000014 ... 1.000000 3.642574 14.268344 1.000000 3.642574 14.268344 -1.591616e-12 36634.0 1.334420e+05 5.227065e+05
day_21.0 float64 36634 36634 0 0.0 -0.214333 4.665638 -6.206630e-17 1.000027 1.000014 ... 1.000000 4.451305 20.814118 1.000000 4.451305 20.814118 -2.273737e-12 36634.0 1.630691e+05 7.625044e+05
day_22.0 float64 36634 36634 0 0.0 -0.142988 6.993574 -6.206630e-18 1.000027 1.000014 ... 1.000000 6.850586 47.930527 1.000000 6.850586 47.930527 -2.273737e-13 36634.0 2.509644e+05 1.755887e+06
day_23.0 float64 36634 36634 0 0.0 -0.146526 6.824707 -3.103315e-17 1.000027 1.000014 ... 1.000000 6.678180 45.598093 1.000000 6.678180 45.598093 -1.136868e-12 36634.0 2.446485e+05 1.670441e+06
day_24.0 float64 36634 36634 0 0.0 -0.100040 9.996005 2.017155e-17 1.000027 1.000014 ... 1.000000 9.895965 98.930118 1.000000 9.895965 98.930118 7.389644e-13 36634.0 3.625288e+05 3.624206e+06
day_25.0 float64 36634 36634 0 0.0 -0.138142 7.238946 -5.896298e-17 1.000027 1.000014 ... 1.000000 7.100804 51.421415 1.000000 7.100804 51.421415 -2.160050e-12 36634.0 2.601308e+05 1.883772e+06
day_26.0 float64 36634 36634 0 0.0 -0.153750 6.504045 -1.861989e-17 1.000027 1.000014 ... 1.000000 6.350294 41.326240 1.000000 6.350294 41.326240 -6.821210e-13 36634.0 2.326367e+05 1.513945e+06
day_27.0 float64 36634 36634 0 0.0 -0.160947 6.213238 2.482652e-17 1.000027 1.000014 ... 1.000000 6.052291 37.630228 1.000000 6.052291 37.630228 9.094947e-13 36634.0 2.217196e+05 1.378546e+06
day_28.0 float64 36634 36634 0 0.0 -0.204243 4.896126 3.103315e-17 1.000027 1.000014 ... 1.000000 4.691883 23.013767 1.000000 4.691883 23.013767 1.136868e-12 36634.0 1.718824e+05 8.430863e+05
day_29.0 float64 36634 36634 0 0.0 -0.200148 4.996313 -1.861989e-17 1.000027 1.000014 ... 1.000000 4.796166 24.003206 1.000000 4.796166 24.003206 -6.821210e-13 36634.0 1.757027e+05 8.793334e+05
day_30.0 float64 36634 36634 0 0.0 -0.188032 5.318249 3.103315e-17 1.000027 1.000014 ... 1.000000 5.130217 27.319129 1.000000 5.130217 27.319129 1.136868e-12 36634.0 1.879404e+05 1.000809e+06
day_31.0 float64 36634 36634 0 0.0 -0.120812 8.277332 3.103315e-17 1.000027 1.000014 ... 1.000000 8.156521 67.528827 1.000000 8.156521 67.528827 1.136868e-12 36634.0 2.988060e+05 2.473851e+06
poutcome_failure float64 36634 36634 0 0.0 -0.350446 2.853507 -3.103315e-17 1.000027 1.000014 ... 1.000000 2.503061 7.265313 1.000000 2.503061 7.265313 -1.136868e-12 36634.0 9.169713e+04 2.661575e+05
poutcome_other float64 36634 36634 0 0.0 -0.204025 4.901349 -7.447956e-17 1.000027 1.000014 ... 1.000000 4.697324 23.064850 1.000000 4.697324 23.064850 -2.728484e-12 36634.0 1.720818e+05 8.449577e+05
poutcome_success float64 36634 36634 0 0.0 -0.187330 5.338162 -3.413646e-17 1.000027 1.000014 ... 1.000000 5.150832 27.531067 1.000000 5.150832 27.531067 -1.250555e-12 36634.0 1.886956e+05 1.008573e+06
poutcome_unknown float64 36634 36634 0 0.0 -2.111202 0.473664 -1.241326e-17 1.000027 1.000014 ... 1.000000 -1.637538 3.681530 1.000000 -1.637538 3.681530 -4.547474e-13 36634.0 -5.998956e+04 1.348692e+05

76 rows × 26 columns

[16]:
pd.reset_option('display.max_rows')

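Security discussion#

Note that table statistics expose aggregate statistics of the data and implicitly involve an sf.reveal, so use them with caution.

Correlation coefficient matrix#

Next, we compute the correlation coefficient matrix between features, and between features and the label.

One-hot encoded columns do not need to take part in the computation of the correlation coefficient matrix.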

[17]:
from secretflow.stats.ss_pearsonr_v import PearsonR

pearson_r_calculator = PearsonR(spu)
corr_matrix = pearson_r_calculator.pearsonr(vdf_hat)

import numpy as np

np.set_printoptions(formatter={'float': lambda x: "{0:0.3f}".format(x)})
corr_matrix
[17]:
array([[1.000, -0.166, -0.020, 0.098, -0.185, -0.014, -0.010, 0.007,
        -0.026, 0.000, 0.023],
       [-0.166, 1.000, -0.010, 0.066, -0.079, -0.025, -0.002, 0.003,
        0.004, 0.023, 0.070],
       [-0.020, -0.010, 1.000, -0.066, -0.005, 0.078, -0.006, 0.019,
        -0.031, -0.016, -0.025],
       [0.098, 0.066, -0.066, 1.000, -0.070, -0.084, 0.017, -0.019,
        0.003, 0.014, 0.054],
       [-0.185, -0.079, -0.005, -0.070, 1.000, 0.042, 0.002, -0.027,
        0.127, 0.039, -0.136],
       [-0.014, -0.025, 0.078, -0.084, 0.042, 1.000, -0.008, 0.011,
        -0.020, -0.008, -0.067],
       [-0.010, -0.002, -0.006, 0.017, 0.002, -0.008, 1.000, -0.181,
        0.018, 0.009, 0.319],
       [0.007, 0.003, 0.019, -0.019, -0.027, 0.011, -0.181, 1.000,
        -0.089, -0.032, -0.070],
       [-0.026, 0.004, -0.031, 0.003, 0.127, -0.020, 0.018, -0.089,
        1.000, 0.440, 0.102],
       [0.000, 0.023, -0.016, 0.014, 0.039, -0.008, 0.009, -0.032, 0.440,
        1.000, 0.087],
       [0.023, 0.070, -0.025, 0.054, -0.136, -0.067, 0.319, -0.070,
        0.102, 0.087, 1.000]], dtype=float32)

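Security discussion#

Computing the correlation coefficient matrix requires data from both alice and bob, so the computation has to run on the SPU device to ensure that the raw data is not leaked.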
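For intuition, the quantity computed here has a simple plaintext counterpart. The sketch below uses plain NumPy on synthetic data (the helper `pearson_matrix` and the array `x` are made up for illustration) and does not call any SecretFlow API; in the tutorial the equivalent arithmetic runs on the SPU over secret-shared inputs.

```python
import numpy as np


def pearson_matrix(x: np.ndarray) -> np.ndarray:
    """Pearson correlation matrix of the columns of x (rows are samples)."""
    # Standardize every column to zero mean and unit (population) variance.
    z = (x - x.mean(axis=0)) / x.std(axis=0)
    # The correlation matrix is the average outer product of the standardized rows.
    return z.T @ z / x.shape[0]


rng = np.random.default_rng(0)
a = rng.normal(size=500)
# Column 1 is strongly correlated with column 0; column 2 is independent noise.
x = np.column_stack([a, 2.0 * a + rng.normal(size=500), rng.normal(size=500)])
corr = pearson_matrix(x)
print(np.round(corr, 3))
```

The matrix is symmetric with ones on the diagonal, which is the same shape of result as the array printed above.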

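VIF calculation#

SecretFlow also supports computing the VIF (variance inflation factor) to test for multicollinearity.

One-hot encoded columns do not need to take part in the VIF computation.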

[18]:
from secretflow.stats.ss_vif_v import VIF

vif_calculator = VIF(spu)
vif_results = vif_calculator.vif(vdf_hat)
print(vdf_hat.columns)
print(vif_results)
['age', 'education', 'default', 'balance', 'housing', 'loan', 'duration', 'campaign', 'pdays', 'previous', 'y']
[1.084 1.053 1.012 1.031 1.093 1.018 1.150 1.043 1.279 1.244 1.168]

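Security discussion#

Computing the VIF requires data from both alice and bob, so the computation has to run on the SPU device to ensure that the raw data is not leaked.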
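As a plaintext reference, the VIF of a variable equals 1 / (1 - R²), where R² comes from regressing that variable on all the others; this is also the corresponding diagonal entry of the inverse correlation matrix. The sketch below uses that identity with NumPy on synthetic data. The helper name `vif_from_corr` is hypothetical, and whether SecretFlow's VIF takes exactly this route internally is not stated here.

```python
import numpy as np


def vif_from_corr(corr: np.ndarray) -> np.ndarray:
    """VIF of each variable: the diagonal of the inverse correlation matrix."""
    return np.diag(np.linalg.inv(corr))


rng = np.random.default_rng(0)
a = rng.normal(size=1000)
b = rng.normal(size=1000)
# The third column is almost an exact linear combination of the first two.
x = np.column_stack([a, b, a + b + 0.1 * rng.normal(size=1000)])
vif = vif_from_corr(np.corrcoef(x, rowvar=False))
print(np.round(vif, 1))
```

Values close to 1, like those printed by the tutorial cell, indicate little collinearity; the synthetic columns here are deliberately collinear and produce much larger values.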

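Model training#

Next, we train a logistic regression model and an XGB model.

Random split#

Before training, we need to split the data into a training set and a test set.

train_x and train_y are the features and labels of the training set; test_x and test_y are the features and labels of the test set.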

[19]:
from secretflow.data.split import train_test_split

random_state = 1234

train_vdf, test_vdf = train_test_split(vdf, train_size=0.8, random_state=random_state)

train_x = train_vdf.drop(columns=['y'])
train_y = train_vdf['y']

test_x = test_vdf.drop(columns=['y'])
test_y = test_vdf['y']

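Security discussion#

During the random split, the parties share a random seed. Each data owner splits its own data locally, and the resulting splits are guaranteed to stay aligned across parties.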
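Why a shared seed keeps the splits aligned can be illustrated without SecretFlow. In the sketch below, `split_indices`, `alice_part` and `bob_part` are hypothetical names, and the permutation-based split is an assumption made for illustration rather than the library's actual implementation.

```python
import numpy as np


def split_indices(n_rows: int, train_size: float, seed: int):
    """Return (train_idx, test_idx), derived deterministically from the seed."""
    perm = np.random.default_rng(seed).permutation(n_rows)
    cut = int(n_rows * train_size)
    return perm[:cut], perm[cut:]


# Hypothetical vertically partitioned data: same rows, different columns.
alice_part = np.arange(10) * 10      # a feature held by alice
bob_part = np.arange(10) * 10 + 1    # a feature held by bob

# Each party runs the split locally, using only the shared seed.
alice_train, alice_test = split_indices(10, 0.8, seed=1234)
bob_train, bob_test = split_indices(10, 0.8, seed=1234)

# Row i on alice's side still lines up with row i on bob's side.
print(alice_part[alice_train])
print(bob_part[bob_train])
```

No row data crosses the party boundary in this scheme; only the seed is shared.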

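PSI (population stability analysis)#

The population stability index (PSI) is an important measure of how far a sample distribution has shifted. It is commonly used to gauge sample stability, for example whether a sample stays stable between two months. As a rule of thumb, a PSI below 0.1 indicates no significant change, a PSI between 0.1 and 0.25 indicates a fairly significant change, and a PSI above 0.25 indicates a drastic change that needs special attention.

Next, we take balance as an example and check whether the distributions of the two samples are close.

Depending on business requirements, PSI analysis can also be carried out during data analysis or feature preprocessing.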
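The PSI formula itself is short: bin both samples with the same split points, then sum (actual% - expected%) * ln(actual% / expected%) over the bins. The sketch below is a plaintext NumPy version with a hypothetical helper `psi`; it is not the psi_eval implementation. An empty bin in either sample makes the corresponding term diverge, which is one way a PSI of inf can arise.

```python
import numpy as np


def psi(expected, actual, split_points):
    """Population stability index over the bins delimited by split_points."""
    e_counts, _ = np.histogram(expected, bins=split_points)
    a_counts, _ = np.histogram(actual, bins=split_points)
    # Ignore bins that are empty in both samples; they carry no information.
    keep = (e_counts + a_counts) > 0
    e_pct = e_counts[keep] / e_counts.sum()
    a_pct = a_counts[keep] / a_counts.sum()
    # A bin that is empty on one side only yields an infinite term.
    with np.errstate(divide="ignore"):
        terms = (a_pct - e_pct) * np.log(a_pct / e_pct)
    return float(np.sum(terms))


rng = np.random.default_rng(0)
base = rng.normal(size=1000)
same = rng.normal(size=1000)
edges = np.linspace(-5.0, 5.0, 4)  # three equal-width bins

psi_same = psi(base, same, edges)           # two draws from one distribution
psi_shifted = psi(base, same + 3.0, edges)  # strongly shifted sample
psi_empty_bin = psi([0.5, 1.5, 2.5], [0.5, 1.5], [0, 1, 2, 3])
print(psi_same, psi_shifted, psi_empty_bin)
```

With only a few wide bins, heavy-tailed features can easily leave a bin empty in the smaller sample, so it may be worth inspecting the per-bin counts before interpreting the score.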

[20]:
stats_df = table_statistics(train_x['balance'])
[21]:
min_val, max_val = stats_df['min'], stats_df['max']
[22]:
from secretflow.stats import psi_eval
from secretflow.stats.core.utils import equal_range
import jax.numpy as jnp

split_points = equal_range(jnp.array([min_val, max_val]), 3)
balance_psi_score = psi_eval(train_x['balance'], test_x['balance'], split_points)

sf.reveal(balance_psi_score)
[22]:
Array(inf, dtype=float32)

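Security discussion#

PSI analysis is a single-party computation, executed on the PYU device of the data owner.

Logistic regression model#

Use ml.linear.ss_sgd.SSRegression to train a logistic regression model on secret-shared data.

Please refer to the relevant API documentation.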

[23]:
from secretflow.ml.linear.ss_sgd import SSRegression

lr_model = SSRegression(spu)
lr_model.fit(
    x=train_x,
    y=train_y,
    epochs=3,
    learning_rate=0.1,
    batch_size=1024,
    sig_type='t1',
    reg_type='logistic',
    penalty='l2',
    l2_norm=0.5,
)
INFO:root:epoch 1 times: 1.159546136856079s
INFO:root:epoch 2 times: 0.9041106700897217s
INFO:root:epoch 3 times: 0.8574354648590088s

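You may wonder why the statement above finishes so quickly. The reason is that statements in SecretFlow are lazily evaluated: in the example above, lr_model.fit is not actually executed until lr_model is really used.

Security discussion#

SSRegression training is based on the SPU device, so the raw data of both parties is protected.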
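To make the hyperparameters above more concrete, here is a plaintext sketch of mini-batch gradient descent for L2-regularized logistic regression in NumPy. It reuses the parameter names epochs, learning_rate, batch_size and l2_norm from the cell above, but the function `fit_logistic_sgd`, the exact scaling of the penalty, and the use of the exact sigmoid are assumptions made for illustration; they are not taken from the SSRegression implementation.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic_sgd(x, y, epochs=3, learning_rate=0.1, batch_size=1024, l2_norm=0.5):
    """Mini-batch gradient descent for L2-regularized logistic regression."""
    n_rows, n_cols = x.shape
    # Append a constant column so that the last weight acts as the bias.
    xb = np.hstack([x, np.ones((n_rows, 1))])
    w = np.zeros(n_cols + 1)
    for _ in range(epochs):
        for start in range(0, n_rows, batch_size):
            bx = xb[start:start + batch_size]
            by = y[start:start + batch_size]
            # Gradient of the average log loss on this batch.
            grad = bx.T @ (sigmoid(bx @ w) - by) / len(by)
            # L2 penalty on the weights only (one common convention; bias excluded).
            grad[:-1] += l2_norm * w[:-1]
            w -= learning_rate * grad
    return w


rng = np.random.default_rng(0)
x = rng.normal(size=(4096, 2))
y = (x @ np.array([2.0, -1.0]) > 0).astype(float)  # synthetic labels
w = fit_logistic_sgd(x, y)
pred = sigmoid(np.hstack([x, np.ones((4096, 1))]) @ w) > 0.5
acc = float(np.mean(pred == (y == 1.0)))
print(w, acc)
```

In the secret-shared setting the same update runs on shares of x, y and w, so neither party sees the other's features or the intermediate weights in the clear.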

XGBoost model#

Use ml.boost.ss_xgb_v.Xgb to train an XGBoost model on secret-shared data.

Please refer to the relevant API documentation.

[24]:
from secretflow.ml.boost.ss_xgb_v import Xgb

xgb = Xgb(spu)
params = {
    'num_boost_round': 3,
    'max_depth': 5,
    'sketch_eps': 0.25,
    'objective': 'logistic',
    'reg_lambda': 0.2,
    'subsample': 1,
    'colsample_by_tree': 1,
    'base_score': 0.5,
}
xgb_model = xgb.train(params=params, dtrain=train_x, label=train_y)
INFO:root:Create proxy actor <class 'secretflow.ml.boost.ss_xgb_v.core.tree_worker.XgbTreeWorker'> with party alice.
INFO:root:Create proxy actor <class 'secretflow.ml.boost.ss_xgb_v.core.tree_worker.XgbTreeWorker'> with party bob.
INFO:root:fragment_count 1
INFO:root:prepare time 0.0977473258972168s
INFO:root:global_setup time 1.0207350254058838s
(_run pid=3806453) [2023-10-07 18:04:08.905] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63
INFO:root:build & infeed bucket_map fragments [0, 0]
INFO:root:build & infeed bucket_map time 0.47269535064697266s
INFO:root:init_pred time 0.01999831199645996s
INFO:root:epoch 0 tree_setup time 0.06561398506164551s
INFO:root:fragment[0, 0] gradient sum time 0.4283578395843506s
INFO:root:level 0 time 0.633394718170166s
INFO:root:fragment[0, 0] gradient sum time 0.7379474639892578s
INFO:root:level 1 time 0.8940839767456055s
INFO:root:fragment[0, 0] gradient sum time 1.374051809310913s
INFO:root:level 2 time 1.6118881702423096s
INFO:root:fragment[0, 0] gradient sum time 2.799717903137207s
INFO:root:level 3 time 3.1707115173339844s
INFO:root:fragment[0, 0] gradient sum time 5.533723831176758s
INFO:root:level 4 time 5.947847843170166s
(XgbTreeWorker pid=3820043) [2023-10-07 18:04:21.826] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63
(XgbTreeWorker pid=3820042) [2023-10-07 18:04:21.809] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63
INFO:root:epoch 0 time 12.676203489303589s
INFO:root:epoch 1 tree_setup time 0.32253217697143555s
INFO:root:fragment[0, 0] gradient sum time 0.6513419151306152s
INFO:root:level 0 time 0.8437290191650391s
INFO:root:fragment[0, 0] gradient sum time 0.7146239280700684s
INFO:root:level 1 time 0.8953475952148438s
INFO:root:fragment[0, 0] gradient sum time 1.4147369861602783s
INFO:root:level 2 time 1.6838665008544922s
INFO:root:fragment[0, 0] gradient sum time 2.8456201553344727s
INFO:root:level 3 time 3.1713719367980957s
INFO:root:fragment[0, 0] gradient sum time 5.684190511703491s
INFO:root:level 4 time 6.1044602394104s
INFO:root:epoch 1 time 13.149580478668213s
INFO:root:epoch 2 tree_setup time 0.3217151165008545s
INFO:root:fragment[0, 0] gradient sum time 0.6628327369689941s
INFO:root:level 0 time 0.8769242763519287s
INFO:root:fragment[0, 0] gradient sum time 0.7444479465484619s
INFO:root:level 1 time 0.9811975955963135s
INFO:root:fragment[0, 0] gradient sum time 1.4514589309692383s
INFO:root:level 2 time 1.7142736911773682s
INFO:root:fragment[0, 0] gradient sum time 2.8057940006256104s
INFO:root:level 3 time 3.1123876571655273s
INFO:root:fragment[0, 0] gradient sum time 5.516232490539551s
INFO:root:level 4 time 5.923535585403442s
INFO:root:epoch 2 time 12.82395052909851s

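Xgb.train executes immediately, so please be patient.

Security discussion#

Xgb training is based on the SPU device, so the raw data of both parties is protected.

Model prediction#

Next, we use the two models we just trained to predict on the test set.

Logistic regression model#

In our scenario bob holds the labels of the dataset, so we reveal the prediction results to bob.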

[25]:
lr_y_hat = lr_model.predict(x=test_x, batch_size=1024, to_pyu=bob)

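Security discussion#

Logistic regression prediction is based on the SPU device, so the raw data of both parties is protected.

When to_pyu is set, the prediction results are revealed to that party; otherwise they remain secret-shared.

XGBoost model#

In our scenario bob holds the labels of the dataset, so we reveal the prediction results to bob.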

[26]:
xgb_y_hat = xgb_model.predict(dtrain=test_x, to_pyu=bob)
INFO:root:Create proxy actor <class 'secretflow.ml.boost.ss_xgb_v.core.tree_worker.XgbTreeWorker'> with party alice.
INFO:root:Create proxy actor <class 'secretflow.ml.boost.ss_xgb_v.core.tree_worker.XgbTreeWorker'> with party bob.
(XgbTreeWorker pid=3821892) [2023-10-07 18:04:49.886] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63
(XgbTreeWorker pid=3821893) [2023-10-07 18:04:49.900] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63

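Security discussion#

XGBoost prediction is based on the SPU device, so the raw data of both parties is protected.

When to_pyu is set, the prediction results are revealed to that party; otherwise they remain secret-shared.

Model evaluation#

Next, we evaluate the models on the test dataset, including:

  • Binary classification evaluation

  • PVA (prediction bias)

  • P-Value

  • Scorecard transformation

Binary classification evaluation#

SecretFlow has built-in support for binary classification evaluation.

BiClassificationEval computes statistics such as AUC, KS, F1 score, lift, gain, precision and recall, and provides equal-frequency and equal-width binned reports (based on the prediction score) as well as a summary report.

The prediction threshold used to evaluate the model differs from bucket to bucket. Threshold-dependent statistics in the summary report take the best value across all buckets.

See the API documentation for details.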

[27]:
from secretflow.stats.biclassification_eval import BiClassificationEval

biclassification_evaluator = BiClassificationEval(
    y_true=test_y, y_score=lr_y_hat, bucket_size=20
)
lr_report = sf.reveal(biclassification_evaluator.get_all_reports())
[28]:
print(f'positive_samples: {lr_report.summary_report.positive_samples}')
print(f'negative_samples: {lr_report.summary_report.negative_samples}')
print(f'total_samples: {lr_report.summary_report.total_samples}')
print(f'auc: {lr_report.summary_report.auc}')
print(f'ks: {lr_report.summary_report.ks}')
print(f'f1_score: {lr_report.summary_report.f1_score}')
positive_samples: 884.0
negative_samples: 6443.0
total_samples: 7327.0
auc: 0.8957031965255737
ks: 0.6429727077484131
f1_score: 0.547467052936554
[29]:
biclassification_evaluator = BiClassificationEval(
    y_true=test_y, y_score=xgb_y_hat, bucket_size=20
)
xgb_report = sf.reveal(biclassification_evaluator.get_all_reports())
[30]:
print(f'positive_samples: {xgb_report.summary_report.positive_samples}')
print(f'negative_samples: {xgb_report.summary_report.negative_samples}')
print(f'total_samples: {xgb_report.summary_report.total_samples}')
print(f'auc: {xgb_report.summary_report.auc}')
print(f'ks: {xgb_report.summary_report.ks}')
print(f'f1_score: {xgb_report.summary_report.f1_score}')
positive_samples: 884.0
negative_samples: 6443.0
total_samples: 7327.0
auc: 0.8601635694503784
ks: 0.5916707515716553
f1_score: 0.4953111708164215
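For readers unfamiliar with the two headline metrics, AUC is the area under the ROC curve and KS is the largest gap between the true positive rate and the false positive rate over all thresholds. The NumPy sketch below computes both on a tiny hand-made example; `auc_and_ks` is a hypothetical helper that assumes no tied scores, and it is not the BiClassificationEval implementation.

```python
import numpy as np


def auc_and_ks(y_true, y_score):
    """AUC and KS statistic from the ROC curve (assumes no tied scores)."""
    order = np.argsort(-y_score, kind="stable")
    y_sorted = y_true[order]
    # Cumulative true/false positive rates as the threshold is lowered.
    tpr = np.concatenate([[0.0], np.cumsum(y_sorted) / y_sorted.sum()])
    fpr = np.concatenate([[0.0], np.cumsum(1 - y_sorted) / (1 - y_sorted).sum()])
    # Trapezoidal area under the (fpr, tpr) curve.
    auc = float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0))
    ks = float(np.max(np.abs(tpr - fpr)))
    return auc, ks


y_true = np.array([1, 0, 1, 1, 0, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.3, 0.1])
auc, ks = auc_and_ks(y_true, y_score)
print(auc, ks)
```

In this example 7 of the 9 positive/negative pairs are ranked correctly, so the AUC is 7/9, and the largest TPR/FPR gap is 2/3.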

Prediction bias#

The result is computed as abs(mean(Actual) - mean(Prediction)); the smaller the value, the better.

[31]:
from secretflow.stats import prediction_bias_eval

prediction_bias = prediction_bias_eval(
    test_y, lr_y_hat, bucket_num=4, absolute=True, bucket_method='equal_width'
)

sf.reveal(prediction_bias)
[31]:
PredictionBiasReport(buckets=[BucketPredictionBiasReport(left_endpoint=0.0, left_closed=True, right_endpoint=0.25, right_closed=False, isna=False, avg_prediction=0.0, avg_label=0.14190296828746796, bias=0.14190296828746796, absolute=True), BucketPredictionBiasReport(left_endpoint=0.25, left_closed=True, right_endpoint=0.5, right_closed=False, isna=True, avg_prediction=0, avg_label=0, bias=0, absolute=True), BucketPredictionBiasReport(left_endpoint=0.5, left_closed=True, right_endpoint=0.75, right_closed=False, isna=True, avg_prediction=0, avg_label=0, bias=0, absolute=True), BucketPredictionBiasReport(left_endpoint=0.75, left_closed=True, right_endpoint=1.0, right_closed=True, isna=False, avg_prediction=1.0, avg_label=0.35504791140556335, bias=0.6449520587921143, absolute=True)])
[32]:
xgb_pva_score = prediction_bias_eval(
    test_y, xgb_y_hat, bucket_num=4, absolute=True, bucket_method='equal_width'
)

sf.reveal(xgb_pva_score)
[32]:
PredictionBiasReport(buckets=[BucketPredictionBiasReport(left_endpoint=0.0, left_closed=True, right_endpoint=0.25, right_closed=False, isna=False, avg_prediction=0.0, avg_label=0.1611528992652893, bias=0.1611528992652893, absolute=True), BucketPredictionBiasReport(left_endpoint=0.25, left_closed=True, right_endpoint=0.5, right_closed=False, isna=True, avg_prediction=0, avg_label=0, bias=0, absolute=True), BucketPredictionBiasReport(left_endpoint=0.5, left_closed=True, right_endpoint=0.75, right_closed=False, isna=True, avg_prediction=0, avg_label=0, bias=0, absolute=True), BucketPredictionBiasReport(left_endpoint=0.75, left_closed=True, right_endpoint=1.0, right_closed=True, isna=False, avg_prediction=1.0, avg_label=0.36208581924438477, bias=0.6379141807556152, absolute=True)])
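The per-bucket calculation can be sketched in plain NumPy. The helper `prediction_bias` below is hypothetical: it uses equal-width buckets over [0, 1] that are left-closed (the last one is also right-closed), mirroring the endpoints shown in the reports above, and it marks empty buckets with None. It is an illustration of the formula, not the prediction_bias_eval implementation.

```python
import numpy as np


def prediction_bias(y_true, y_pred, bucket_num=4):
    """abs(mean(actual) - mean(prediction)) in equal-width buckets over [0, 1]."""
    edges = np.linspace(0.0, 1.0, bucket_num + 1)
    # digitize against the inner edges gives left-closed buckets 0..bucket_num-1;
    # a prediction of exactly 1.0 falls into the last bucket.
    idx = np.clip(np.digitize(y_pred, edges[1:-1]), 0, bucket_num - 1)
    report = []
    for b in range(bucket_num):
        mask = idx == b
        if not mask.any():
            report.append((edges[b], edges[b + 1], None))  # empty bucket
            continue
        bias = abs(float(y_true[mask].mean()) - float(y_pred[mask].mean()))
        report.append((edges[b], edges[b + 1], bias))
    return report


y_true = np.array([0.0, 0.0, 1.0, 1.0, 1.0])
y_pred = np.array([0.1, 0.2, 0.6, 0.8, 0.9])
report = prediction_bias(y_true, y_pred)
for left, right, bias in report:
    print(left, right, bias)
```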

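P-Value#

Both parties can use the p-values to judge whether a parameter is significant, that is, whether the independent variable effectively explains the variation of the dependent variable, and thus decide whether the corresponding explanatory variable should be included in the model.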

[33]:
from secretflow.stats import SSPValue

model = lr_model.save_model()
sspv = SSPValue(spu)
pvalues = sspv.pvalues(test_x, test_y, model)

pvalues
[33]:
array([0.773, 0.324, 0.785, 0.513, 0.011, 0.155, 0.992, 0.991, 0.980,
       0.983, 0.999, 0.957, 0.995, 0.990, 0.939, 0.997, 0.996, 0.993,
       0.999, 0.992, 0.990, 0.000, 0.701, 0.828, 0.891, 0.982, 0.986,
       0.976, 0.953, 0.991, 0.745, 0.969, 0.983, 0.976, 0.973, 0.973,
       0.843, 0.839, 0.986, 0.859, 0.940, 0.993, 0.984, 1.000, 0.988,
       0.992, 0.979, 0.999, 0.993, 0.972, 0.997, 0.979, 0.976, 0.999,
       0.970, 0.995, 0.970, 0.994, 0.964, 0.974, 0.998, 0.994, 0.984,
       0.999, 0.972, 0.998, 0.980, 0.998, 0.983, 0.972, 0.975, 0.976,
       0.993, 0.824, 0.979, 0.000])
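One common way to turn a fitted coefficient into a p-value is the Wald test: divide the coefficient by its standard error to get a z statistic, then take the two-sided tail probability under the standard normal distribution. The sketch below shows only that last step with the standard library. The function `two_sided_p_value` is hypothetical, and whether SSPValue uses exactly this test is an assumption that the text above does not confirm.

```python
import math


def two_sided_p_value(coef: float, std_err: float) -> float:
    """Two-sided p-value of the Wald z statistic coef / std_err."""
    z = coef / std_err
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


p_significant = two_sided_p_value(1.96, 1.0)  # close to the usual 0.05 cut-off
p_null = two_sided_p_value(0.0, 1.0)          # a zero coefficient gives p = 1
print(p_significant, p_null)
```

A small p-value suggests the coefficient is unlikely to be zero, while values near 1 give no evidence that the variable contributes.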

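Scorecard transformation#

Strictly speaking, scorecard transformation is post-processing of the prediction results and is not part of model evaluation.

Let p be the probability that y = 1, and let odds = p / (1 - p). The score scale of a scorecard can be defined by expressing the score as a linear function of the log odds:

Score = A - B log(odds), where A and B are configurable constants. SecretFlow provides a scorecard transformation; see the API documentation for details.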

[34]:
from secretflow.stats import ScoreCard

sc = ScoreCard(20, 600, 20)
score = sc.transform(xgb_y_hat)

sf.reveal(score.partitions[bob])
[34]:
array([[489.410],
       [452.695],
       [453.148],
       ...,
       [453.148],
       [480.906],
       [453.148]])
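The formula Score = A - B log(odds) can be tried out directly. In the sketch below the constants A and B are illustrative values chosen so that doubling the odds lowers the score by 20 points; they are assumptions and are not claimed to be the constants that ScoreCard(20, 600, 20) derives internally, so the numbers will not match the output above.

```python
import math


def score_from_probability(p: float, a: float, b: float) -> float:
    """Score = A - B * log(odds), with odds = p / (1 - p)."""
    odds = p / (1.0 - p)
    return a - b * math.log(odds)


# Illustrative constants: even odds map to 600 points, and every doubling
# of the odds lowers the score by 20 points.
A = 600.0
B = 20.0 / math.log(2.0)

score_even = score_from_probability(0.5, A, B)        # odds = 1
score_double = score_from_probability(2.0 / 3.0, A, B)  # odds = 2
print(score_even, score_double)
```

Because the mapping is monotonic in p, the transformation changes the scale of the predictions but not their ranking.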

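Security discussion#

All of the model evaluation methods above are single-party computations, performed on the PYU device of the label owner.

Wrapping up#

Finally, we need to clean up the temporary files and shut down the SecretFlow cluster.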

[35]:
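import os

# Remove each temporary file independently, so that one missing file
# does not prevent the remaining files from being cleaned up.
for path in (alice_path, alice_psi_path, bob_path, bob_psi_path):
    try:
        os.remove(path)
    except OSError:
        pass

sf.shutdown()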

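Congratulations! You have completed the entire SecretFlow financial risk control full-pipeline experiment.

If you have any suggestions or questions about this experiment, please contact us via GitHub Issues.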