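{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# SecretFlow End-to-End Capabilities for Financial Risk Control\n", "\n", "> This tutorial was originally written in Chinese." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "> Last updated: Oct 7, 2023\n", ">\n", "> Please use SecretFlow v0.8.3 or later for this experiment.\n", ">\n", "> The following code is for demonstration only. Do not use it directly in production." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "This experiment shows how to use SecretFlow to develop Logistic Regression and XGB models, both of which are commonly used in risk control.\n", "\n", "SecretFlow will later open up model deployment and online/offline model prediction. Stay tuned." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Objectives\n", "\n", "In this experiment, we use an open-source dataset to train a logistic regression model and an XGB model, both commonly used in financial risk control. The process consists of the following steps:\n", "\n", "- Sample alignment\n", "- Feature preprocessing\n", "- Data analysis\n", "- Model training\n", "- Model prediction\n", "- Model evaluation\n", "\n", "Please run all steps in order so that the experiment completes successfully." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Preparation" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Initialize the SecretFlow Framework\n", "\n", "This experiment involves two nodes: **alice** and **bob**. In a real business scenario, they represent two different entities. Their raw data may not be transferred to each other directly, yet the raw data of both parties is used jointly to develop a model.\n", "\n", "In the code below, we set up a **SecretFlow Cluster** on the two nodes **alice** and **bob**, and create three devices:\n", "\n", "- alice: PYU device, responsible for local computation on alice's side. The inputs, the computation, and the results are visible only to alice.\n", "- bob: PYU device, responsible for local computation on bob's side. The inputs, the computation, and the results are visible only to bob.\n", "- spu: SPU device, responsible for secure computation between alice and bob. Inputs and results are kept in secret-shared form, with alice and bob each holding one share. The computation is an MPC computation executed jointly by the SPU Runtimes of alice and bob.\n", "\n", "> If some of these concepts, such as the SPU device, are unfamiliar, please refer to this [document](../developer/design/architecture.md).\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The version of SecretFlow: 1.2.0.dev20231007\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2023-10-07 18:03:47,500\tINFO worker.py:1538 -- Started a local Ray instance.\n" ] } ], "source": [ "import secretflow as sf\n", "\n", "# Check the version of your SecretFlow\n", "print('The version of SecretFlow: {}'.format(sf.__version__))\n", "\n", "sf.shutdown()\n", "sf.init(['alice', 'bob'], address='local')\n", "alice, bob = sf.PYU('alice'), sf.PYU('bob')\n", "spu = sf.SPU(sf.utils.testing.cluster_def(['alice', 'bob']))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "In the log above, you should see that while **spu** is created, a **SPURuntime** is started on each of alice's and bob's sides and the two connect to each other." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Dataset\n", "\n", "The raw data for this experiment is the [Bank Marketing Data Set](https://archive.ics.uci.edu/ml/datasets/bank+marketing) from UCI. It collects the results of telephone marketing campaigns run by a Portuguese banking institution.\n", "\n", "We add a **uid** column for use in the private set intersection step that follows.\n", "\n", "Let us first look at what the dataset contains.\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "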
" ], "text/plain": [ " age job marital education default balance housing loan \\\n", "0 58 management married tertiary no 2143 yes no \n", "1 44 technician single secondary no 29 yes no \n", "2 33 entrepreneur married secondary no 2 yes yes \n", "3 47 blue-collar married unknown no 1506 yes no \n", "4 33 unknown single unknown no 1 no no \n", "... ... ... ... ... ... ... ... ... \n", "45206 51 technician married tertiary no 825 no no \n", "45207 71 retired divorced primary no 1729 no no \n", "45208 72 retired married secondary no 5715 no no \n", "45209 57 blue-collar married secondary no 668 no no \n", "45210 37 entrepreneur married secondary no 2971 no no \n", "\n", " contact day month duration campaign pdays previous poutcome \\\n", "0 unknown 5 may 261 1 -1 0 unknown \n", "1 unknown 5 may 151 1 -1 0 unknown \n", "2 unknown 5 may 76 1 -1 0 unknown \n", "3 unknown 5 may 92 1 -1 0 unknown \n", "4 unknown 5 may 198 1 -1 0 unknown \n", "... ... ... ... ... ... ... ... ... \n", "45206 cellular 17 nov 977 3 -1 0 unknown \n", "45207 cellular 17 nov 456 2 -1 0 unknown \n", "45208 cellular 17 nov 1127 5 184 3 success \n", "45209 telephone 17 nov 508 4 -1 0 unknown \n", "45210 cellular 17 nov 361 2 188 11 other \n", "\n", " y uid \n", "0 no 1 \n", "1 no 2 \n", "2 no 3 \n", "3 no 4 \n", "4 no 5 \n", "... ... ... 
\n", "45206 yes 45207 \n", "45207 yes 45208 \n", "45208 yes 45209 \n", "45209 no 45210 \n", "45210 no 45211 \n", "\n", "[45211 rows x 18 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "# secretflow.utils.simulation.datasets contains mirrors of some popular open datasets.\n", "from secretflow.utils.simulation.datasets import dataset\n", "\n", "df = pd.read_csv(dataset('bank_marketing_full'), sep=';')\n", "df['uid'] = df.index + 1\n", "\n", "df" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The dataset contains 45211 samples, each representing one target client.\n", "\n", "Each sample contains 16 features. The table below briefly describes them, together with the uid column we added.\n", "\n", "\n", "| feature | Description | Values |\n", "| :-----| :---- | :---- |\n", "| uid | Client ID | numeric |\n", "| age | Age | numeric |\n", "| job | Job type | 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown' |\n", "| marital | Marital status | 'divorced','married','single','unknown' |\n", "| education | Education level | 'tertiary', 'secondary', 'unknown', 'primary' |\n", "| default | Has credit in default | 'no','yes','unknown' |\n", "| balance | Average yearly balance | numeric |\n", "| housing | Has a housing loan | 'no','yes','unknown' |\n", "| loan | Has a personal loan | 'no','yes','unknown' |\n", "| contact | Contact communication type | 'cellular','telephone','unknown' |\n", "| month | Month of the last contact | 'jan', 'feb', 'mar', ..., 'nov', 'dec' |\n", "| day | Day of the month of the last contact | numeric |\n", "| duration | Duration of the last contact | numeric |\n", "| campaign | Number of contacts made during this campaign | numeric |\n", "| pdays | Days elapsed since the last contact | numeric |\n", "| previous | Number of contacts made before this campaign | numeric |\n", "| poutcome | Outcome of the previous campaign | 'unknown', 'failure', 'other', 'success' | \n", "\n", "\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The label y of each sample indicates the marketing outcome for the target client (whether the client subscribed to a term deposit) and takes the value 'yes' or 'no'.\n", "\n", "We assume the 16 features above are held by two institutions, split as follows.\n", "\n", "- alice: age, job, marital, education, default, balance, housing, loan\n", "- bob: contact, day, month, duration, campaign, pdays, previous, poutcome, y\n" ] }, { 
"attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "In real business scenarios, the data held by alice and bob may not be aligned. To simulate this, each party takes an independent, shuffled random sample of 90% of the dataset." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " age job marital education default balance housing loan \\\n", "29713 55 management single tertiary no -350 no no \n", "37441 49 services married secondary no 2409 yes no \n", "16828 37 entrepreneur married tertiary no 1514 no yes \n", "20613 55 blue-collar married secondary no 388 no no \n", "25582 54 admin. married secondary no 3136 yes no \n", "... ... ... ... ... ... ... ... ... \n", "42934 34 technician married secondary no 3000 yes yes \n", "4141 35 technician single secondary no -206 yes no \n", "5478 31 admin. single tertiary no 701 yes no \n", "6257 33 blue-collar married secondary no -241 yes yes \n", "2776 40 blue-collar married secondary no -31 yes no \n", "\n", " uid \n", "29713 29714 \n", "37441 37442 \n", "16828 16829 \n", "20613 20614 \n", "25582 25583 \n", "... ... \n", "42934 42935 \n", "4141 4142 \n", "5478 5479 \n", "6257 6258 \n", "2776 2777 \n", "\n", "[40690 rows x 9 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "\n", "df_alice = df.iloc[:, np.r_[0:8, -1]].sample(frac=0.9)\n", "\n", "df_alice" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " contact day month duration campaign pdays previous poutcome y \\\n", "35491 cellular 7 may 600 2 330 17 failure no \n", "371 unknown 6 may 195 1 -1 0 unknown no \n", "33382 cellular 20 apr 198 1 -1 0 unknown no \n", "26770 cellular 20 nov 207 1 -1 0 unknown no \n", "33620 cellular 20 apr 20 4 -1 0 unknown no \n", "... ... ... ... ... ... ... ... ... .. \n", "28643 cellular 29 jan 94 2 -1 0 unknown no \n", "16449 cellular 23 jul 122 3 -1 0 unknown no \n", "29287 cellular 2 feb 175 2 -1 0 unknown no \n", "33126 telephone 20 apr 73 3 325 3 failure no \n", "34736 cellular 6 may 87 1 -1 0 unknown no \n", "\n", " uid \n", "35491 35492 \n", "371 372 \n", "33382 33383 \n", "26770 26771 \n", "33620 33621 \n", "... ... \n", "28643 28644 \n", "16449 16450 \n", "29287 29288 \n", "33126 33127 \n", "34736 34737 \n", "\n", "[40690 rows x 10 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_bob = df.iloc[:, 8:].sample(frac=0.9)\n", "\n", "df_bob" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Here we save df_alice and df_bob to files, which serve as the raw inputs of alice and bob.\n", "\n", "This completes all preparation for the experiment." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "import tempfile\n", "\n", "_, alice_path = tempfile.mkstemp()\n", "_, bob_path = tempfile.mkstemp()\n", "df_alice.reset_index(drop=True).to_csv(alice_path, index=False)\n", "df_bob.reset_index(drop=True).to_csv(bob_path, index=False)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Sample Alignment (Private Set Intersection)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The first step is to align the data of the two parties.\n", "[Private Set Intersection](https://en.wikipedia.org/wiki/Private_set_intersection) (PSI) is a cryptographic technique that computes the intersection of two sets without revealing any other information.\n", "In SecretFlow, the SPU device supports three PSI algorithms:\n", "\n", "- [ECDH](https://ieeexplore.ieee.org/document/6234849/): semi-honest model, based on public-key cryptography. It was originally suited to small datasets, but after optimization in SecretFlow it can handle datasets on the order of one billion records.\n", "- [KKRT](https://eprint.iacr.org/2016/799.pdf): semi-honest model, based on Cuckoo Hashing and efficient Oblivious Transfer Extension (OT Extension). Suited to large datasets (for example, tens of millions of records).\n", "- [BC22PCG](https://eprint.iacr.org/2022/334): semi-honest model, based on a pseudorandom correlation generator. Suited to large datasets.\n", "\n", "Because our dataset is small, we use ECDH here.\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Option 1: Save the PSI Result to Files\n", "\n", "In some scenarios, alice and bob may want to save the PSI result directly to files and carry out later operations afterwards. In that case, call the **psi_csv** API.\n", "\n", "In the code below, we specify the key to intersect on, together with the input and output file paths of both parties.\n", "\n", "For ECDH the two parties are symmetric, so receiver has no practical meaning and can be set to either party. The protocol must be set correctly. When sort is set to True, the joined result is sorted.\n", "\n", "> Please read the documentation of psi_csv." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(SPURuntime pid=3812799)\u001b[0m 2023-10-07 18:03:49.999 [info] [default_brpc_retry_policy.cc:DoRetry:52] socket error, sleep=1000000us and retry\n", "\u001b[2m\u001b[36m(SPURuntime pid=3812799)\u001b[0m 2023-10-07 18:03:50.999 [info] [default_brpc_retry_policy.cc:LogHttpDetail:29] cntl ErrorCode '112', http status code '200', response header '', error msg '[E111]Fail to connect Socket{id=0 addr=127.0.0.1:33409} (0x0x3bbdcc0): Connection refused [R1][E112]Not connected to 127.0.0.1:33409 yet, server_id=0'\n", "\u001b[2m\u001b[36m(SPURuntime pid=3812799)\u001b[0m 2023-10-07 18:03:50.999 [info] [default_brpc_retry_policy.cc:DoRetry:75] aggressive retry, sleep=1000000us and retry\n", "\u001b[2m\u001b[36m(SPURuntime pid=3812799)\u001b[0m 2023-10-07 18:03:51.999 [info] [default_brpc_retry_policy.cc:LogHttpDetail:29] cntl ErrorCode '112', http status code '200', response header '', error msg '[E111]Fail to connect Socket{id=0 addr=127.0.0.1:33409} (0x0x3bbdcc0): Connection refused [R1][E112]Not connected to 127.0.0.1:33409 yet, server_id=0 [R2][E112]Not connected to 127.0.0.1:33409 yet, server_id=0'\n", "\u001b[2m\u001b[36m(SPURuntime pid=3812799)\u001b[0m 2023-10-07 18:03:51.999 [info] 
[default_brpc_retry_policy.cc:DoRetry:75] aggressive retry, sleep=1000000us and retry\n", "\u001b[2m\u001b[36m(SPURuntime pid=3812800)\u001b[0m 2023-10-07 18:03:52.009 [info] [default_brpc_retry_policy.cc:DoRetry:69] not retry for reached rcp timeout, ErrorCode '1008', error msg '[E1008]Reached timeout=2000ms @127.0.0.1:41217'\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3812799)\u001b[0m 2023-10-07 18:03:54.012 [info] [bucket_psi.cc:Init:315] bucket size set to 1048576\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3812799)\u001b[0m 2023-10-07 18:03:54.013 [info] [bucket_psi.cc:CheckInput:229] Begin sanity check for input file: /tmp/tmpalym9ko6, precheck_switch:true\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3812800)\u001b[0m 2023-10-07 18:03:54.012 [info] [bucket_psi.cc:Init:315] bucket size set to 1048576\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3812800)\u001b[0m 2023-10-07 18:03:54.012 [info] [bucket_psi.cc:CheckInput:229] Begin sanity check for input file: /tmp/tmpr7dudg52, precheck_switch:true\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3812799)\u001b[0m 2023-10-07 18:03:54.022 [info] [csv_checker.cc:CsvChecker:121] Executing duplicated scripts: LC_ALL=C sort --buffer-size=1G --temporary-directory=/tmp --stable selected-keys.1696673034013188158 | LC_ALL=C uniq -d > duplicate-keys.1696673034013188158\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3812800)\u001b[0m 2023-10-07 18:03:54.023 [info] [csv_checker.cc:CsvChecker:121] Executing duplicated scripts: LC_ALL=C sort --buffer-size=1G --temporary-directory=/tmp --stable selected-keys.1696673034012852841 | LC_ALL=C uniq -d > duplicate-keys.1696673034012852841\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3812799)\u001b[0m 2023-10-07 18:03:54.032 [info] [bucket_psi.cc:CheckInput:246] End sanity check for input file: /tmp/tmpalym9ko6, size=40690\n", 
"\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3812799)\u001b[0m 2023-10-07 18:03:54.032 [info] [bucket_psi.cc:RunPsi:348] Run psi protocol=1, self_items_count=40690\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3812799)\u001b[0m 2023-10-07 18:03:54.032 [info] [cryptor_selector.cc:GetSodiumCryptor:46] Using libSodium\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3812799)\u001b[0m 2023-10-07 18:03:54.032 [info] [cipher_store.cc:DiskCipherStore:33] Disk cache choose num_bins=64\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3812799)\u001b[0m 2023-10-07 18:03:54.036 [info] [thread_pool.cc:ThreadPool:30] Create a fixed thread pool with size 63\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3812799)\u001b[0m 2023-10-07 18:03:54.078 [info] [bucket_psi.cc:operator():385] ECDH progress 10%\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3812799)\u001b[0m 2023-10-07 18:03:54.102 [info] [bucket_psi.cc:operator():385] ECDH progress 20%\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3812799)\u001b[0m 2023-10-07 18:03:54.122 [info] [bucket_psi.cc:operator():385] ECDH progress 30%\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3812800)\u001b[0m 2023-10-07 18:03:54.032 [info] [bucket_psi.cc:CheckInput:246] End sanity check for input file: /tmp/tmpr7dudg52, size=40690\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3812800)\u001b[0m 2023-10-07 18:03:54.033 [info] [bucket_psi.cc:RunPsi:348] Run psi protocol=1, self_items_count=40690\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3812800)\u001b[0m 2023-10-07 18:03:54.033 [info] [cryptor_selector.cc:GetSodiumCryptor:46] Using libSodium\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3812800)\u001b[0m 2023-10-07 18:03:54.033 [info] [cipher_store.cc:DiskCipherStore:33] Disk cache choose num_bins=64\n", 
"\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3812800)\u001b[0m 2023-10-07 18:03:54.036 [info] [thread_pool.cc:ThreadPool:30] Create a fixed thread pool with size 63\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3812800)\u001b[0m 2023-10-07 18:03:54.079 [info] [bucket_psi.cc:operator():385] ECDH progress 10%\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3812800)\u001b[0m 2023-10-07 18:03:54.097 [info] [bucket_psi.cc:operator():385] ECDH progress 20%\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3812800)\u001b[0m 2023-10-07 18:03:54.115 [info] [bucket_psi.cc:operator():385] ECDH progress 30%\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3812799)\u001b[0m 2023-10-07 18:03:54.133 [info] [bucket_psi.cc:operator():385] ECDH progress 40%\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3812799)\u001b[0m 2023-10-07 18:03:54.146 [info] [bucket_psi.cc:operator():385] ECDH progress 50%\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3812799)\u001b[0m 2023-10-07 18:03:54.171 [info] [bucket_psi.cc:operator():385] ECDH progress 60%\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3812799)\u001b[0m 2023-10-07 18:03:54.184 [info] [bucket_psi.cc:operator():385] ECDH progress 70%\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3812799)\u001b[0m 2023-10-07 18:03:54.205 [info] [ecdh_psi.cc:MaskSelf:75] MaskSelf:root--finished, batch_count=10, self_item_count=40690\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3812799)\u001b[0m 2023-10-07 18:03:54.209 [info] [bucket_psi.cc:operator():385] ECDH progress 80%\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3812799)\u001b[0m 2023-10-07 18:03:54.225 [info] [bucket_psi.cc:operator():385] ECDH progress 90%\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3812800)\u001b[0m 2023-10-07 18:03:54.135 
[info] [bucket_psi.cc:operator():385] ECDH progress 40%\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3812800)\u001b[0m 2023-10-07 18:03:54.159 [info] [bucket_psi.cc:operator():385] ECDH progress 50%\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3812800)\u001b[0m 2023-10-07 18:03:54.185 [info] [bucket_psi.cc:operator():385] ECDH progress 60%\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3812800)\u001b[0m 2023-10-07 18:03:54.204 [info] [bucket_psi.cc:operator():385] ECDH progress 70%\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3812800)\u001b[0m 2023-10-07 18:03:54.217 [info] [bucket_psi.cc:operator():385] ECDH progress 80%\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3812800)\u001b[0m 2023-10-07 18:03:54.230 [info] [bucket_psi.cc:operator():385] ECDH progress 90%\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3812800)\u001b[0m 2023-10-07 18:03:54.234 [info] [ecdh_psi.cc:MaskPeer:120] MaskPeer:root--finished, batch_count=10, peer_item_count=40690\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3812799)\u001b[0m 2023-10-07 18:03:54.234 [info] [bucket_psi.cc:operator():385] ECDH progress 100%\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3812799)\u001b[0m 2023-10-07 18:03:54.234 [info] [ecdh_psi.cc:RecvDualMaskedSelf:149] root recv last batch finished, batch_count=10\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3812800)\u001b[0m 2023-10-07 18:03:54.235 [info] [ecdh_psi.cc:MaskSelf:75] MaskSelf:root--finished, batch_count=10, self_item_count=40690\n" ] }, { "data": { "text/plain": [ "[{'party': 'alice', 'original_count': 40690, 'intersection_count': 36634},\n", " {'party': 'bob', 'original_count': 40690, 'intersection_count': 36634}]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "_, alice_psi_path = tempfile.mkstemp()\n", "_, 
bob_psi_path = tempfile.mkstemp()\n", "\n", "spu.psi_csv(\n", "    key=\"uid\",\n", "    input_path={alice: alice_path, bob: bob_path},\n", "    output_path={alice: alice_psi_path, bob: bob_psi_path},\n", "    receiver=\"alice\",\n", "    protocol=\"ECDH_PSI_2PC\",\n", "    sort=True,\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Option 2: Load the PSI Result into a VDataFrame\n", "\n", "VDataFrame is the data structure in SecretFlow for holding vertically partitioned data. We use it throughout the tasks that follow.\n", "\n", "Because this experiment continues with further operations after PSI, we use **data.vertical.read_csv** here to turn the PSI result of the raw data directly into a VDataFrame.\n", "\n", "> Please read the documentation of data.vertical.read_csv. Many of its parameters are the same as those of psi_csv and are not repeated here." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3812799)\u001b[0m 2023-10-07 18:03:54.242 [info] [ecdh_psi.cc:MaskPeer:120] MaskPeer:root--finished, batch_count=10, peer_item_count=40690\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3812799)\u001b[0m 2023-10-07 18:03:54.257 [info] [bucket_psi.cc:ProduceOutput:267] Begin post filtering, indices.size=36634, should_sort=true\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3812799)\u001b[0m 2023-10-07 18:03:54.260 [info] [utils.cc:MultiKeySort:88] Executing sort scripts: tail -n +2 /tmp/tmp-sort-in-1696673034257569380 | LC_ALL=C sort --buffer-size=3G --parallel=8 --temporary-directory=./ --stable --field-separator=, --key=9,9 >>/tmp/tmp-sort-out-1696673034257569380\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3812799)\u001b[0m 2023-10-07 18:03:54.275 [info] [utils.cc:MultiKeySort:90] Finished sort scripts: tail -n +2 /tmp/tmp-sort-in-1696673034257569380 | LC_ALL=C sort --buffer-size=3G --parallel=8 --temporary-directory=./ --stable --field-separator=, --key=9,9 >>/tmp/tmp-sort-out-1696673034257569380, ret=0\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) 
pid=3812799)\u001b[0m 2023-10-07 18:03:54.276 [info] [bucket_psi.cc:ProduceOutput:305] End post filtering, in=/tmp/tmpalym9ko6, out=/tmp/tmplvme8bi7\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3812800)\u001b[0m 2023-10-07 18:03:54.243 [info] [bucket_psi.cc:operator():385] ECDH progress 100%\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3812800)\u001b[0m 2023-10-07 18:03:54.243 [info] [ecdh_psi.cc:RecvDualMaskedSelf:149] root recv last batch finished, batch_count=10\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3812800)\u001b[0m 2023-10-07 18:03:54.258 [info] [bucket_psi.cc:ProduceOutput:267] Begin post filtering, indices.size=36634, should_sort=true\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3812800)\u001b[0m 2023-10-07 18:03:54.261 [info] [utils.cc:MultiKeySort:88] Executing sort scripts: tail -n +2 /tmp/tmp-sort-in-1696673034258490369 | LC_ALL=C sort --buffer-size=3G --parallel=8 --temporary-directory=./ --stable --field-separator=, --key=10,10 >>/tmp/tmp-sort-out-1696673034258490369\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3812800)\u001b[0m 2023-10-07 18:03:54.275 [info] [utils.cc:MultiKeySort:90] Finished sort scripts: tail -n +2 /tmp/tmp-sort-in-1696673034258490369 | LC_ALL=C sort --buffer-size=3G --parallel=8 --temporary-directory=./ --stable --field-separator=, --key=10,10 >>/tmp/tmp-sort-out-1696673034258490369, ret=0\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3812800)\u001b[0m 2023-10-07 18:03:54.275 [info] [bucket_psi.cc:ProduceOutput:305] End post filtering, in=/tmp/tmpr7dudg52, out=/tmp/tmpagzpomeb\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3812799)\u001b[0m 2023-10-07 18:03:54.819 [info] [bucket_psi.cc:Init:315] bucket size set to 1048576\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3812799)\u001b[0m 2023-10-07 18:03:54.819 [info] [bucket_psi.cc:CheckInput:229] Begin 
sanity check for input file: /tmp/tmpalym9ko6, precheck_switch:true\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3812799)\u001b[0m 2023-10-07 18:03:54.829 [info] [csv_checker.cc:CsvChecker:121] Executing duplicated scripts: LC_ALL=C sort --buffer-size=1G --temporary-directory=/tmp --stable selected-keys.1696673034819762103 | LC_ALL=C uniq -d > duplicate-keys.1696673034819762103\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3812799)\u001b[0m 2023-10-07 18:03:54.840 [info] [bucket_psi.cc:CheckInput:246] End sanity check for input file: /tmp/tmpalym9ko6, size=40690\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3812799)\u001b[0m 2023-10-07 18:03:54.840 [info] [bucket_psi.cc:RunPsi:348] Run psi protocol=1, self_items_count=40690\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3812799)\u001b[0m 2023-10-07 18:03:54.840 [info] [cryptor_selector.cc:GetSodiumCryptor:46] Using libSodium\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3812799)\u001b[0m 2023-10-07 18:03:54.840 [info] [cipher_store.cc:DiskCipherStore:33] Disk cache choose num_bins=64\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3812800)\u001b[0m 2023-10-07 18:03:54.819 [info] [bucket_psi.cc:Init:315] bucket size set to 1048576\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3812800)\u001b[0m 2023-10-07 18:03:54.819 [info] [bucket_psi.cc:CheckInput:229] Begin sanity check for input file: /tmp/tmpr7dudg52, precheck_switch:true\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3812800)\u001b[0m 2023-10-07 18:03:54.830 [info] [csv_checker.cc:CsvChecker:121] Executing duplicated scripts: LC_ALL=C sort --buffer-size=1G --temporary-directory=/tmp --stable selected-keys.1696673034819646273 | LC_ALL=C uniq -d > duplicate-keys.1696673034819646273\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3812800)\u001b[0m 2023-10-07 18:03:54.840 
[info] [bucket_psi.cc:CheckInput:246] End sanity check for input file: /tmp/tmpr7dudg52, size=40690\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3812800)\u001b[0m 2023-10-07 18:03:54.841 [info] [bucket_psi.cc:RunPsi:348] Run psi protocol=1, self_items_count=40690\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3812800)\u001b[0m 2023-10-07 18:03:54.841 [info] [cryptor_selector.cc:GetSodiumCryptor:46] Using libSodium\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3812800)\u001b[0m 2023-10-07 18:03:54.841 [info] [cipher_store.cc:DiskCipherStore:33] Disk cache choose num_bins=64\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3812799)\u001b[0m 2023-10-07 18:03:54.871 [info] [bucket_psi.cc:operator():385] ECDH progress 10%\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3812799)\u001b[0m 2023-10-07 18:03:54.891 [info] [bucket_psi.cc:operator():385] ECDH progress 20%\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3812799)\u001b[0m 2023-10-07 18:03:54.915 [info] [bucket_psi.cc:operator():385] ECDH progress 30%\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3812799)\u001b[0m 2023-10-07 18:03:54.934 [info] [bucket_psi.cc:operator():385] ECDH progress 40%\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3812799)\u001b[0m 2023-10-07 18:03:54.955 [info] [bucket_psi.cc:operator():385] ECDH progress 50%\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3812800)\u001b[0m 2023-10-07 18:03:54.884 [info] [bucket_psi.cc:operator():385] ECDH progress 10%\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3812800)\u001b[0m 2023-10-07 18:03:54.905 [info] [bucket_psi.cc:operator():385] ECDH progress 20%\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3812800)\u001b[0m 2023-10-07 18:03:54.921 [info] [bucket_psi.cc:operator():385] ECDH progress 30%\n", 
"\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3812800)\u001b[0m 2023-10-07 18:03:54.942 [info] [bucket_psi.cc:operator():385] ECDH progress 40%\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3812800)\u001b[0m 2023-10-07 18:03:54.957 [info] [bucket_psi.cc:operator():385] ECDH progress 50%\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3812799)\u001b[0m 2023-10-07 18:03:54.992 [info] [bucket_psi.cc:operator():385] ECDH progress 60%\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3812799)\u001b[0m 2023-10-07 18:03:55.003 [info] [bucket_psi.cc:operator():385] ECDH progress 70%\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3812799)\u001b[0m 2023-10-07 18:03:55.017 [info] [bucket_psi.cc:operator():385] ECDH progress 80%\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3812799)\u001b[0m 2023-10-07 18:03:55.029 [info] [bucket_psi.cc:operator():385] ECDH progress 90%\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3812799)\u001b[0m 2023-10-07 18:03:55.031 [info] [ecdh_psi.cc:MaskSelf:75] MaskSelf:root--finished, batch_count=10, self_item_count=40690\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3812799)\u001b[0m 2023-10-07 18:03:55.042 [info] [bucket_psi.cc:operator():385] ECDH progress 100%\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3812799)\u001b[0m 2023-10-07 18:03:55.042 [info] [ecdh_psi.cc:RecvDualMaskedSelf:149] root recv last batch finished, batch_count=10\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3812799)\u001b[0m 2023-10-07 18:03:55.043 [info] [ecdh_psi.cc:MaskPeer:120] MaskPeer:root--finished, batch_count=10, peer_item_count=40690\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3812799)\u001b[0m 2023-10-07 18:03:55.058 [info] [bucket_psi.cc:ProduceOutput:267] Begin post filtering, indices.size=36634, should_sort=true\n", 
"\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3812799)\u001b[0m 2023-10-07 18:03:55.061 [info] [utils.cc:MultiKeySort:88] Executing sort scripts: tail -n +2 /tmp/tmp-sort-in-1696673035058333794 | LC_ALL=C sort --buffer-size=3G --parallel=8 --temporary-directory=./ --stable --field-separator=, --key=9,9 >>/tmp/tmp-sort-out-1696673035058333794\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3812800)\u001b[0m 2023-10-07 18:03:54.990 [info] [bucket_psi.cc:operator():385] ECDH progress 60%\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3812800)\u001b[0m 2023-10-07 18:03:55.007 [info] [bucket_psi.cc:operator():385] ECDH progress 70%\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3812800)\u001b[0m 2023-10-07 18:03:55.022 [info] [bucket_psi.cc:operator():385] ECDH progress 80%\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3812800)\u001b[0m 2023-10-07 18:03:55.023 [info] [ecdh_psi.cc:MaskSelf:75] MaskSelf:root--finished, batch_count=10, self_item_count=40690\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3812800)\u001b[0m 2023-10-07 18:03:55.038 [info] [bucket_psi.cc:operator():385] ECDH progress 90%\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3812800)\u001b[0m 2023-10-07 18:03:55.041 [info] [ecdh_psi.cc:MaskPeer:120] MaskPeer:root--finished, batch_count=10, peer_item_count=40690\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3812800)\u001b[0m 2023-10-07 18:03:55.044 [info] [bucket_psi.cc:operator():385] ECDH progress 100%\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3812800)\u001b[0m 2023-10-07 18:03:55.044 [info] [ecdh_psi.cc:RecvDualMaskedSelf:149] root recv last batch finished, batch_count=10\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3812800)\u001b[0m 2023-10-07 18:03:55.059 [info] [bucket_psi.cc:ProduceOutput:267] Begin post filtering, indices.size=36634, 
should_sort=true\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3812800)\u001b[0m 2023-10-07 18:03:55.062 [info] [utils.cc:MultiKeySort:88] Executing sort scripts: tail -n +2 /tmp/tmp-sort-in-1696673035059340444 | LC_ALL=C sort --buffer-size=3G --parallel=8 --temporary-directory=./ --stable --field-separator=, --key=10,10 >>/tmp/tmp-sort-out-1696673035059340444\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3812799)\u001b[0m 2023-10-07 18:03:55.076 [info] [utils.cc:MultiKeySort:90] Finished sort scripts: tail -n +2 /tmp/tmp-sort-in-1696673035058333794 | LC_ALL=C sort --buffer-size=3G --parallel=8 --temporary-directory=./ --stable --field-separator=, --key=9,9 >>/tmp/tmp-sort-out-1696673035058333794, ret=0\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3812799)\u001b[0m 2023-10-07 18:03:55.076 [info] [bucket_psi.cc:ProduceOutput:305] End post filtering, in=/tmp/tmpalym9ko6, out=/tmp/tmpalym9ko6.psi_output_97135\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3812800)\u001b[0m 2023-10-07 18:03:55.076 [info] [utils.cc:MultiKeySort:90] Finished sort scripts: tail -n +2 /tmp/tmp-sort-in-1696673035059340444 | LC_ALL=C sort --buffer-size=3G --parallel=8 --temporary-directory=./ --stable --field-separator=, --key=10,10 >>/tmp/tmp-sort-out-1696673035059340444, ret=0\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3812800)\u001b[0m 2023-10-07 18:03:55.076 [info] [bucket_psi.cc:ProduceOutput:305] End post filtering, in=/tmp/tmpr7dudg52, out=/tmp/tmpr7dudg52.psi_output_97135\n" ] }, { "data": { "text/plain": [ "['age',\n", " 'job',\n", " 'marital',\n", " 'education',\n", " 'default',\n", " 'balance',\n", " 'housing',\n", " 'loan',\n", " 'contact',\n", " 'day',\n", " 'month',\n", " 'duration',\n", " 'campaign',\n", " 'pdays',\n", " 'previous',\n", " 'poutcome',\n", " 'y']" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from 
secretflow.data.vertical import read_csv as v_read_csv\n", "\n", "vdf = v_read_csv(\n", " {alice: alice_path, bob: bob_path},\n", " spu=spu,\n", " keys=\"uid\",\n", " drop_keys=\"uid\",\n", " psi_protocl=\"ECDH_PSI_2PC\",\n", ")\n", "vdf.columns" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Further Reading\n", "\n", "Here we demonstrated two-party, single-key PSI. SecretFlow also supports three-party and multi-key PSI. To learn more, you can:\n", "\n", "- Read this [document](https://www.secretflow.org.cn/docs/spu/en/development/psi.html) on the PSI capabilities of the SecretFlow SPU.\n", "- Read this [tutorial](./PSI_On_SPU.ipynb) for usage examples." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Feature Preprocessing\n", "\n", "Data used for modeling generally needs to be preprocessed, and sound preprocessing is critical to how well the model trains." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Before starting feature preprocessing, we use **stats.table_statistics.table_statistics** to get an overview of the features. The table statistics module is discussed in its own section later." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(_run pid=3805540)\u001b[0m /home/zoupeicheng.zpc/miniconda3/envs/py38/lib/python3.8/site-packages/secretflow/device/device/pyu.py:154: FutureWarning: The default value of numeric_only in DataFrame.quantile is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.\n", "\u001b[2m\u001b[36m(_run pid=3805540)\u001b[0m return fn(*args, **kwargs)\n", "\u001b[2m\u001b[36m(_run pid=3805540)\u001b[0m /home/zoupeicheng.zpc/miniconda3/envs/py38/lib/python3.8/site-packages/secretflow/device/device/pyu.py:154: FutureWarning: The default value of numeric_only in DataFrame.quantile is deprecated. In a future version, it will default to False. 
Select only valid columns or specify the value of numeric_only to silence this warning.\n", "\u001b[2m\u001b[36m(_run pid=3805540)\u001b[0m return fn(*args, **kwargs)\n", "\u001b[2m\u001b[36m(_run pid=3805590)\u001b[0m /home/zoupeicheng.zpc/miniconda3/envs/py38/lib/python3.8/site-packages/secretflow/device/device/pyu.py:154: FutureWarning: The default value of numeric_only in DataFrame.quantile is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.\n", "\u001b[2m\u001b[36m(_run pid=3805590)\u001b[0m return fn(*args, **kwargs)\n", "\u001b[2m\u001b[36m(_run pid=3805590)\u001b[0m /home/zoupeicheng.zpc/miniconda3/envs/py38/lib/python3.8/site-packages/secretflow/device/device/pyu.py:154: FutureWarning: The default value of numeric_only in DataFrame.quantile is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.\n", "\u001b[2m\u001b[36m(_run pid=3805590)\u001b[0m return fn(*args, **kwargs)\n", "\u001b[2m\u001b[36m(_run pid=3805590)\u001b[0m /home/zoupeicheng.zpc/miniconda3/envs/py38/lib/python3.8/site-packages/secretflow/device/device/pyu.py:154: FutureWarning: The default value of numeric_only in DataFrame.quantile is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.\n", "\u001b[2m\u001b[36m(_run pid=3805590)\u001b[0m return fn(*args, **kwargs)\n", "\u001b[2m\u001b[36m(_run pid=3805590)\u001b[0m /home/zoupeicheng.zpc/miniconda3/envs/py38/lib/python3.8/site-packages/secretflow/device/device/pyu.py:154: FutureWarning: The default value of numeric_only in DataFrame.quantile is deprecated. In a future version, it will default to False. 
Select only valid columns or specify the value of numeric_only to silence this warning.\n", "\u001b[2m\u001b[36m(_run pid=3805590)\u001b[0m return fn(*args, **kwargs)\n" ] }, { "data": { "text/html": [ "
<p>17 rows × 26 columns (the full statistics table is shown in the text/plain output of this cell)</p>\n", "
" ], "text/plain": [ " datatype total_count count(non-NA count) count_na(NA count) \\\n", "age int64 36634 36634 0 \n", "job object 36634 36634 0 \n", "marital object 36634 36634 0 \n", "education object 36634 36634 0 \n", "default object 36634 36634 0 \n", "balance int64 36634 36634 0 \n", "housing object 36634 36634 0 \n", "loan object 36634 36634 0 \n", "contact object 36634 36634 0 \n", "day int64 36634 36634 0 \n", "month object 36634 36634 0 \n", "duration int64 36634 36634 0 \n", "campaign int64 36634 36634 0 \n", "pdays int64 36634 36634 0 \n", "previous int64 36634 36634 0 \n", "poutcome object 36634 36634 0 \n", "y object 36634 36634 0 \n", "\n", " na_ratio min max mean var(variance) \\\n", "age 0.0 18.0 95.0 40.938636 1.125268e+02 \n", "job 0.0 NaN NaN NaN NaN \n", "marital 0.0 NaN NaN NaN NaN \n", "education 0.0 NaN NaN NaN NaN \n", "default 0.0 NaN NaN NaN NaN \n", "balance 0.0 -3372.0 98417.0 1366.720424 9.148037e+06 \n", "housing 0.0 NaN NaN NaN NaN \n", "loan 0.0 NaN NaN NaN NaN \n", "contact 0.0 NaN NaN NaN NaN \n", "day 0.0 1.0 31.0 15.798930 6.932877e+01 \n", "month 0.0 NaN NaN NaN NaN \n", "duration 0.0 0.0 4918.0 257.611563 6.625854e+04 \n", "campaign 0.0 1.0 63.0 2.762379 9.535351e+00 \n", "pdays 0.0 -1.0 871.0 40.435634 1.010338e+04 \n", "previous 0.0 0.0 275.0 0.590189 5.856919e+00 \n", "poutcome 0.0 NaN NaN NaN NaN \n", "y 0.0 NaN NaN NaN NaN \n", "\n", " std(standard deviation) ... moment_2 moment_3 \\\n", "age 10.607867 ... 1.788496e+03 8.324620e+04 \n", "job NaN ... NaN NaN \n", "marital NaN ... NaN NaN \n", "education NaN ... NaN NaN \n", "default NaN ... NaN NaN \n", "balance 3024.572264 ... 1.101571e+07 2.605110e+11 \n", "housing NaN ... NaN NaN \n", "loan NaN ... NaN NaN \n", "contact NaN ... NaN NaN \n", "day 8.326390 ... 3.189331e+02 7.282283e+03 \n", "month NaN ... NaN NaN \n", "duration 257.407347 ... 1.326205e+05 1.224237e+08 \n", "campaign 3.087936 ... 1.716583e+01 2.437880e+02 \n", "pdays 100.515581 ... 
1.173815e+04 3.951940e+06 \n", "previous 2.420107 ... 6.205083e+00 6.336154e+02 \n", "poutcome NaN ... NaN NaN \n", "y NaN ... NaN NaN \n", "\n", " moment_4 central_moment_2 central_moment_3 central_moment_4 \\\n", "age 4.115919e+06 1.125238e+02 8.144866e+02 4.214078e+04 \n", "job NaN NaN NaN NaN \n", "marital NaN NaN NaN NaN \n", "education NaN NaN NaN NaN \n", "default NaN NaN NaN NaN \n", "balance 4.548695e+15 9.147788e+06 2.204507e+11 1.079063e+16 \n", "housing NaN NaN NaN NaN \n", "loan NaN NaN NaN NaN \n", "contact NaN NaN NaN NaN \n", "day 1.787904e+05 6.932688e+01 5.290034e+01 9.317523e+03 \n", "month NaN NaN NaN NaN \n", "duration 1.824202e+11 6.625673e+04 5.412212e+07 9.586385e+10 \n", "campaign 5.893892e+03 9.535091e+00 1.436904e+02 3.811396e+03 \n", "pdays 1.554830e+09 1.010311e+04 2.660249e+06 1.022767e+09 \n", "previous 1.579490e+05 5.856759e+00 6.230400e+02 1.564658e+05 \n", "poutcome NaN NaN NaN NaN \n", "y NaN NaN NaN NaN \n", "\n", " sum sum_2 sum_3 sum_4 \n", "age 1499746.0 6.551975e+07 3.049641e+09 1.507826e+11 \n", "job NaN NaN NaN NaN \n", "marital NaN NaN NaN NaN \n", "education NaN NaN NaN NaN \n", "default NaN NaN NaN NaN \n", "balance 50068436.0 4.035496e+11 9.543560e+15 6.161953e+17 \n", "housing NaN NaN NaN NaN \n", "loan NaN NaN NaN NaN \n", "contact NaN NaN NaN NaN \n", "day 578778.0 1.168379e+07 2.667791e+08 6.549806e+09 \n", "month NaN NaN NaN NaN \n", "duration 9437342.0 4.858418e+09 4.484869e+12 6.682781e+15 \n", "campaign 101197.0 6.288530e+05 8.930929e+06 2.159168e+08 \n", "pdays 1481319.0 4.300153e+08 1.447754e+11 5.695965e+13 \n", "previous 21621.0 2.273170e+05 2.321187e+07 5.786303e+09 \n", "poutcome NaN NaN NaN NaN \n", "y NaN NaN NaN NaN \n", "\n", "[17 rows x 26 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from secretflow.stats.table_statistics import table_statistics\n", "\n", "pd.set_option('display.max_rows', None)\n", "data_stats = table_statistics(vdf)\n", 
"data_stats" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "pd.reset_option('display.max_rows')" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "在接下来,我们将会展示隐语以下特征预处理能力:\n", "\n", "- 值替换\n", "- 缺失值填充\n", "- WOE分组/分箱转换\n", "- one-hot编码\n", "- 标准化" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### 值替换\n", "\n", "我们先对以下特征做值替换:\n", "\n", "| feature | 描述 | 取值和值替换规则 |\n", "| :-----| :---- | :---- |\n", "| education | 教育状况 | 'tertiary' -> 3, 'secondary' -> 2, 'unknown' -> 0, 'primary' -> 1 |\n", "| default | 是否有不良信用记录 | 'no' -> 0,'yes' -> 1,'unknown' -> NaN |\n", "| housing | 是否有房贷 | 'no' -> 0,'yes' -> 1,'unknown' -> NaN |\n", "| loan | 是否有个人贷款 | 'no' -> 0,'yes' -> 1,'unknown' -> NaN |\n", "| month | 上次联系月份 | 'jan' -> 1, 'feb' -> 2, 'mar' -> 3, ..., 'nov' -> 11, 'dec' ->12 |\n", "| y | label | 'yes' -> 1,'no' -> 0 |\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "替换完之后,我们使用 **sf.reveal** 来查看效果,请注意在生产中,**sf.reveal** 将会直接泄露数据,需要严格限制和进行审计。\n", "\n", "> 在生产中,请严格限制**sf.reveal**的使用。" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " age job marital education default balance housing loan\n", "0 58 management married 3.0 0 2143 1 0\n", "1 46 management married 3.0 0 229 1 0\n", "2 33 technician single 2.0 0 56 0 0\n", "3 42 technician married 2.0 0 8036 0 0\n", "4 38 admin. married 1.0 0 1487 0 0\n", "... ... ... ... ... ... ... ... 
...\n", "36629 38 blue-collar married 1.0 0 3289 0 0\n", "36630 36 self-employed single 3.0 0 4844 0 0\n", "36631 49 technician married 2.0 0 378 0 0\n", "36632 40 blue-collar married 1.0 0 48 0 0\n", "36633 46 services married 3.0 0 474 0 0\n", "\n", "[36634 rows x 8 columns]\n", " contact day month duration campaign pdays previous poutcome y\n", "0 unknown 5 5 261 1 -1 0 unknown 0\n", "1 unknown 5 5 197 1 -1 0 unknown 0\n", "2 unknown 7 5 236 2 -1 0 unknown 0\n", "3 unknown 9 6 948 5 -1 0 unknown 0\n", "4 unknown 9 6 332 2 -1 0 unknown 0\n", "... ... ... ... ... ... ... ... ... ..\n", "36629 unknown 9 6 553 2 -1 0 unknown 0\n", "36630 unknown 9 6 1137 3 -1 0 unknown 1\n", "36631 unknown 9 6 189 2 -1 0 unknown 0\n", "36632 unknown 9 6 100 5 -1 0 unknown 0\n", "36633 unknown 9 6 445 2 -1 0 unknown 0\n", "\n", "[36634 rows x 9 columns]\n" ] } ], "source": [ "# Map categorical strings to numbers; 'unknown' becomes NaN and is filled later.\n", "vdf['education'] = vdf['education'].replace(\n", " {'tertiary': 3, 'secondary': 2, 'primary': 1, 'unknown': np.nan}\n", ")\n", "\n", "vdf['default'] = vdf['default'].replace({'no': 0, 'yes': 1, 'unknown': np.nan})\n", "\n", "vdf['housing'] = vdf['housing'].replace({'no': 0, 'yes': 1, 'unknown': np.nan})\n", "\n", "vdf['loan'] = vdf['loan'].replace({'no': 0, 'yes': 1, 'unknown': np.nan})\n", "\n", "vdf['month'] = vdf['month'].replace(\n", " {\n", " 'jan': 1,\n", " 'feb': 2,\n", " 'mar': 3,\n", " 'apr': 4,\n", " 'may': 5,\n", " 'jun': 6,\n", " 'jul': 7,\n", " 'aug': 8,\n", " 'sep': 9,\n", " 'oct': 10,\n", " 'nov': 11,\n", " 'dec': 12,\n", " }\n", ")\n", "\n", "vdf['y'] = vdf['y'].replace(\n", " {\n", " 'no': 0,\n", " 'yes': 1,\n", " }\n", ")\n", "\n", "print(sf.reveal(vdf.partitions[alice].data))\n", "print(sf.reveal(vdf.partitions[bob].data))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Security Discussion\n", "\n", "Value replacement is executed by the PYU device of the data owner, so no data is leaked." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Missing Value Imputation\n", "\n", 
"接下来我们对缺失值进行填充。我们在这里均填充了众数,其他可选的策略还包括平均数、中位数等。\n", "\n", "其他可能的处理方法包括删除缺省的行, 或者可以使用数据完整的行作为训练集,以此来预测缺失值。\n", "\n", "替换完之后,我们使用 **sf.reveal** 来查看效果。" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " age job marital education default balance housing loan\n", "0 58 management married 3.0 0 2143 1 0\n", "1 46 management married 3.0 0 229 1 0\n", "2 33 technician single 2.0 0 56 0 0\n", "3 42 technician married 2.0 0 8036 0 0\n", "4 38 admin. married 1.0 0 1487 0 0\n", "... ... ... ... ... ... ... ... ...\n", "36629 38 blue-collar married 1.0 0 3289 0 0\n", "36630 36 self-employed single 3.0 0 4844 0 0\n", "36631 49 technician married 2.0 0 378 0 0\n", "36632 40 blue-collar married 1.0 0 48 0 0\n", "36633 46 services married 3.0 0 474 0 0\n", "\n", "[36634 rows x 8 columns]\n", " contact day month duration campaign pdays previous poutcome y\n", "0 unknown 5 5 261 1 -1 0 unknown 0\n", "1 unknown 5 5 197 1 -1 0 unknown 0\n", "2 unknown 7 5 236 2 -1 0 unknown 0\n", "3 unknown 9 6 948 5 -1 0 unknown 0\n", "4 unknown 9 6 332 2 -1 0 unknown 0\n", "... ... ... ... ... ... ... ... ... 
..\n", "36629 unknown 9 6 553 2 -1 0 unknown 0\n", "36630 unknown 9 6 1137 3 -1 0 unknown 1\n", "36631 unknown 9 6 189 2 -1 0 unknown 0\n", "36632 unknown 9 6 100 5 -1 0 unknown 0\n", "36633 unknown 9 6 445 2 -1 0 unknown 0\n", "\n", "[36634 rows x 9 columns]\n" ] } ], "source": [ "vdf[\"education\"] = vdf[\"education\"].fillna(vdf[\"education\"].mode())\n", "vdf[\"default\"] = vdf[\"default\"].fillna(vdf[\"default\"].mode())\n", "vdf[\"housing\"] = vdf[\"housing\"].fillna(vdf[\"housing\"].mode())\n", "vdf[\"loan\"] = vdf[\"loan\"].fillna(vdf[\"loan\"].mode())\n", "\n", "print(sf.reveal(vdf.partitions[alice].data))\n", "print(sf.reveal(vdf.partitions[bob].data))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Security Discussion\n", "\n", "The fill values are computed by the PYU device of the data owner and are then applied in the fill operation by that same PYU device, so no data is leaked." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### WOE Binning\n", "\n", "WOE binning discretizes a continuous feature into bins; each value is then replaced by the WOE (weight of evidence) value of its bin.\n", "\n", "One benefit of discretizing continuous features is that it mitigates hidden flaws in the data and makes model results more stable. For example, extreme values are an important factor affecting model quality: they can push model parameters too high or too low, or \"mislead\" the model into learning relationships that do not exist as if they were important patterns. Discretization reduces the influence of extreme values and outliers.\n", "\n", "The 75th percentile of duration is far below its maximum, and its standard deviation is relatively large, so duration needs to be discretized." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:root:Create proxy actor with party alice.\n", "INFO:root:Create proxy actor with party bob.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(_run pid=3805540)\u001b[0m INFO:jax._src.xla_bridge:Unable to initialize backend 'cuda': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'\n", "\u001b[2m\u001b[36m(_run pid=3805540)\u001b[0m INFO:jax._src.xla_bridge:Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'\n", "\u001b[2m\u001b[36m(_run pid=3805540)\u001b[0m INFO:jax._src.xla_bridge:Unable to initialize backend 'tpu': INVALID_ARGUMENT: TpuPlatform is not 
available.\n", "\u001b[2m\u001b[36m(_run pid=3805540)\u001b[0m INFO:jax._src.xla_bridge:Unable to initialize backend 'plugin': xla_extension has no attributes named get_plugin_device_client. Compile TensorFlow with //tensorflow/compiler/xla/python:enable_plugin_device set to true (defaults to false) to enable this.\n", "\u001b[2m\u001b[36m(_run pid=3805540)\u001b[0m WARNING:jax._src.xla_bridge:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n", "\u001b[2m\u001b[36m(_run pid=3805590)\u001b[0m INFO:jax._src.xla_bridge:Unable to initialize backend 'cuda': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'\n", "\u001b[2m\u001b[36m(_run pid=3805590)\u001b[0m INFO:jax._src.xla_bridge:Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'\n", "\u001b[2m\u001b[36m(_run pid=3805590)\u001b[0m INFO:jax._src.xla_bridge:Unable to initialize backend 'tpu': INVALID_ARGUMENT: TpuPlatform is not available.\n", "\u001b[2m\u001b[36m(_run pid=3805590)\u001b[0m INFO:jax._src.xla_bridge:Unable to initialize backend 'plugin': xla_extension has no attributes named get_plugin_device_client. Compile TensorFlow with //tensorflow/compiler/xla/python:enable_plugin_device set to true (defaults to false) to enable this.\n", "\u001b[2m\u001b[36m(_run pid=3805590)\u001b[0m WARNING:jax._src.xla_bridge:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n", "INFO:root:Create proxy actor with party alice.\n", "INFO:root:Create proxy actor with party bob.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ " age job marital education default balance housing loan\n", "0 58 management married 3.0 0 2143 1 0\n", "1 46 management married 3.0 0 229 1 0\n", "2 33 technician single 2.0 0 56 0 0\n", "3 42 technician married 2.0 0 8036 0 0\n", "4 38 admin. married 1.0 0 1487 0 0\n", "... ... ... ... ... ... ... ... 
...\n", "36629 38 blue-collar married 1.0 0 3289 0 0\n", "36630 36 self-employed single 3.0 0 4844 0 0\n", "36631 49 technician married 2.0 0 378 0 0\n", "36632 40 blue-collar married 1.0 0 48 0 0\n", "36633 46 services married 3.0 0 474 0 0\n", "\n", "[36634 rows x 8 columns]\n", " contact day month duration campaign pdays previous poutcome y\n", "0 unknown 5 5 0.355090 1 -1 0 unknown 0\n", "1 unknown 5 5 -1.136022 1 -1 0 unknown 0\n", "2 unknown 7 5 -0.061611 2 -1 0 unknown 0\n", "3 unknown 9 6 2.328878 5 -1 0 unknown 0\n", "4 unknown 9 6 0.202048 2 -1 0 unknown 0\n", "... ... ... ... ... ... ... ... ... ..\n", "36629 unknown 9 6 1.131077 2 -1 0 unknown 0\n", "36630 unknown 9 6 2.328878 3 -1 0 unknown 1\n", "36631 unknown 9 6 -0.394791 2 -1 0 unknown 0\n", "36632 unknown 9 6 -1.708979 5 -1 0 unknown 0\n", "36633 unknown 9 6 0.813627 2 -1 0 unknown 0\n", "\n", "[36634 rows x 9 columns]\n" ] } ], "source": [ "from secretflow.preprocessing.binning.vert_woe_binning import VertWoeBinning\n", "from secretflow.preprocessing.binning.vert_bin_substitution import VertBinSubstitution\n", "\n", "binning = VertWoeBinning(spu)\n", "bin_rules = binning.binning(\n", " vdf,\n", " binning_method=\"chimerge\",\n", " bin_num=4,\n", " bin_names={alice: [], bob: [\"duration\"]},\n", " label_name=\"y\",\n", ")\n", "\n", "woe_sub = VertBinSubstitution()\n", "vdf = woe_sub.substitution(vdf, bin_rules)\n", "\n", "print(sf.reveal(vdf.partitions[alice].data))\n", "print(sf.reveal(vdf.partitions[bob].data))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Security Discussion\n", "\n", "WOE binning needs data from both alice and bob, so the related computation uses the **SPU device** to ensure that the raw data is not leaked." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### One-Hot Encoding" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "One-hot encoding converts categorical features into numeric ones. Features such as job and marital need to be one-hot encoded." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": 
"stdout", "output_type": "stream", "text": [ " age education default balance housing loan job_admin. \\\n", "0 58 3.0 0 2143 1 0 0.0 \n", "1 46 3.0 0 229 1 0 0.0 \n", "2 33 2.0 0 56 0 0 0.0 \n", "3 42 2.0 0 8036 0 0 0.0 \n", "4 38 1.0 0 1487 0 0 1.0 \n", "... ... ... ... ... ... ... ... \n", "36629 38 1.0 0 3289 0 0 0.0 \n", "36630 36 3.0 0 4844 0 0 0.0 \n", "36631 49 2.0 0 378 0 0 0.0 \n", "36632 40 1.0 0 48 0 0 0.0 \n", "36633 46 3.0 0 474 0 0 0.0 \n", "\n", " job_blue-collar job_entrepreneur job_housemaid ... job_retired \\\n", "0 0.0 0.0 0.0 ... 0.0 \n", "1 0.0 0.0 0.0 ... 0.0 \n", "2 0.0 0.0 0.0 ... 0.0 \n", "3 0.0 0.0 0.0 ... 0.0 \n", "4 0.0 0.0 0.0 ... 0.0 \n", "... ... ... ... ... ... \n", "36629 1.0 0.0 0.0 ... 0.0 \n", "36630 0.0 0.0 0.0 ... 0.0 \n", "36631 0.0 0.0 0.0 ... 0.0 \n", "36632 1.0 0.0 0.0 ... 0.0 \n", "36633 0.0 0.0 0.0 ... 0.0 \n", "\n", " job_self-employed job_services job_student job_technician \\\n", "0 0.0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 1.0 \n", "3 0.0 0.0 0.0 1.0 \n", "4 0.0 0.0 0.0 0.0 \n", "... ... ... ... ... \n", "36629 0.0 0.0 0.0 0.0 \n", "36630 1.0 0.0 0.0 0.0 \n", "36631 0.0 0.0 0.0 1.0 \n", "36632 0.0 0.0 0.0 0.0 \n", "36633 0.0 1.0 0.0 0.0 \n", "\n", " job_unemployed job_unknown marital_divorced marital_married \\\n", "0 0.0 0.0 0.0 1.0 \n", "1 0.0 0.0 0.0 1.0 \n", "2 0.0 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 1.0 \n", "4 0.0 0.0 0.0 1.0 \n", "... ... ... ... ... \n", "36629 0.0 0.0 0.0 1.0 \n", "36630 0.0 0.0 0.0 0.0 \n", "36631 0.0 0.0 0.0 1.0 \n", "36632 0.0 0.0 0.0 1.0 \n", "36633 0.0 0.0 0.0 1.0 \n", "\n", " marital_single \n", "0 0.0 \n", "1 0.0 \n", "2 1.0 \n", "3 0.0 \n", "4 0.0 \n", "... ... 
\n", "36629 0.0 \n", "36630 1.0 \n", "36631 0.0 \n", "36632 0.0 \n", "36633 0.0 \n", "\n", "[36634 rows x 21 columns]\n", " duration campaign pdays previous y contact_cellular \\\n", "0 0.355090 1 -1 0 0 0.0 \n", "1 -1.136022 1 -1 0 0 0.0 \n", "2 -0.061611 2 -1 0 0 0.0 \n", "3 2.328878 5 -1 0 0 0.0 \n", "4 0.202048 2 -1 0 0 0.0 \n", "... ... ... ... ... .. ... \n", "36629 1.131077 2 -1 0 0 0.0 \n", "36630 2.328878 3 -1 0 1 0.0 \n", "36631 -0.394791 2 -1 0 0 0.0 \n", "36632 -1.708979 5 -1 0 0 0.0 \n", "36633 0.813627 2 -1 0 0 0.0 \n", "\n", " contact_telephone contact_unknown month_1.0 month_2.0 ... \\\n", "0 0.0 1.0 0.0 0.0 ... \n", "1 0.0 1.0 0.0 0.0 ... \n", "2 0.0 1.0 0.0 0.0 ... \n", "3 0.0 1.0 0.0 0.0 ... \n", "4 0.0 1.0 0.0 0.0 ... \n", "... ... ... ... ... ... \n", "36629 0.0 1.0 0.0 0.0 ... \n", "36630 0.0 1.0 0.0 0.0 ... \n", "36631 0.0 1.0 0.0 0.0 ... \n", "36632 0.0 1.0 0.0 0.0 ... \n", "36633 0.0 1.0 0.0 0.0 ... \n", "\n", " day_26.0 day_27.0 day_28.0 day_29.0 day_30.0 day_31.0 \\\n", "0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 0.0 0.0 0.0 \n", "... ... ... ... ... ... ... \n", "36629 0.0 0.0 0.0 0.0 0.0 0.0 \n", "36630 0.0 0.0 0.0 0.0 0.0 0.0 \n", "36631 0.0 0.0 0.0 0.0 0.0 0.0 \n", "36632 0.0 0.0 0.0 0.0 0.0 0.0 \n", "36633 0.0 0.0 0.0 0.0 0.0 0.0 \n", "\n", " poutcome_failure poutcome_other poutcome_success poutcome_unknown \n", "0 0.0 0.0 0.0 1.0 \n", "1 0.0 0.0 0.0 1.0 \n", "2 0.0 0.0 0.0 1.0 \n", "3 0.0 0.0 0.0 1.0 \n", "4 0.0 0.0 0.0 1.0 \n", "... ... ... ... ... 
\n", "36629 0.0 0.0 0.0 1.0 \n", "36630 0.0 0.0 0.0 1.0 \n", "36631 0.0 0.0 0.0 1.0 \n", "36632 0.0 0.0 0.0 1.0 \n", "36633 0.0 0.0 0.0 1.0 \n", "\n", "[36634 rows x 55 columns]\n" ] } ], "source": [ "from secretflow.preprocessing.encoder import OneHotEncoder\n", "\n", "# Categorical columns to one-hot encode\n", "categorical_cols = [\"job\", \"marital\", \"contact\", \"month\", \"day\", \"poutcome\"]\n", "\n", "encoder = OneHotEncoder()\n", "# vdf_hat (without the categorical columns) is for VIF and correlation only\n", "vdf_hat = vdf.drop(columns=categorical_cols)\n", "\n", "# Encode each column in turn and attach the resulting indicator columns to vdf\n", "for col in categorical_cols:\n", "    transformed_df = encoder.fit_transform(vdf[col])\n", "    vdf[transformed_df.columns] = transformed_df\n", "\n", "# Drop the original categorical columns now that they are encoded\n", "vdf = vdf.drop(columns=categorical_cols)\n", "\n", "print(sf.reveal(vdf.partitions[alice].data))\n", "print(sf.reveal(vdf.partitions[bob].data))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Security discussion\n", "\n", "One-hot encoding is executed by each data owner's PYU device, so no data is leaked." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Standardization\n", "\n", "When feature values differ widely in scale, the model has difficulty converging, so we usually standardize the values first." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(_run pid=3805540)\u001b[0m /home/zoupeicheng.zpc/miniconda3/envs/py38/lib/python3.8/site-packages/sklearn/base.py:443: UserWarning: X has feature names, but StandardScaler was fitted without feature names\n", "\u001b[2m\u001b[36m(_run pid=3805540)\u001b[0m 
warnings.warn(\n", "\u001b[2m\u001b[36m(_run pid=3805590)\u001b[0m /home/zoupeicheng.zpc/miniconda3/envs/py38/lib/python3.8/site-packages/sklearn/base.py:443: UserWarning: X has feature names, but StandardScaler was fitted without feature names\n", "\u001b[2m\u001b[36m(_run pid=3805590)\u001b[0m warnings.warn(\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ " age education default balance housing loan \\\n", "0 1.608391 1.314552 -0.135554 0.256661 0.893944 -0.436771 \n", "1 0.477140 1.314552 -0.135554 -0.376164 0.893944 -0.436771 \n", "2 -0.748383 -0.217705 -0.135554 -0.433363 -1.118638 -0.436771 \n", "3 0.100056 -0.217705 -0.135554 2.205062 -1.118638 -0.436771 \n", "4 -0.277028 -1.749962 -0.135554 0.039768 -1.118638 -0.436771 \n", "... ... ... ... ... ... ... \n", "36629 -0.277028 -1.749962 -0.135554 0.635563 -1.118638 -0.436771 \n", "36630 -0.465570 1.314552 -0.135554 1.149692 -1.118638 -0.436771 \n", "36631 0.759952 -0.217705 -0.135554 -0.326900 -1.118638 -0.436771 \n", "36632 -0.088486 -1.749962 -0.135554 -0.436008 -1.118638 -0.436771 \n", "36633 0.477140 1.314552 -0.135554 -0.295160 -1.118638 -0.436771 \n", "\n", " job_admin. job_blue-collar job_entrepreneur job_housemaid ... \\\n", "0 -0.360094 -0.525105 -0.185291 -0.16795 ... \n", "1 -0.360094 -0.525105 -0.185291 -0.16795 ... \n", "2 -0.360094 -0.525105 -0.185291 -0.16795 ... \n", "3 -0.360094 -0.525105 -0.185291 -0.16795 ... \n", "4 2.777051 -0.525105 -0.185291 -0.16795 ... \n", "... ... ... ... ... ... \n", "36629 -0.360094 1.904383 -0.185291 -0.16795 ... \n", "36630 -0.360094 -0.525105 -0.185291 -0.16795 ... \n", "36631 -0.360094 -0.525105 -0.185291 -0.16795 ... \n", "36632 -0.360094 1.904383 -0.185291 -0.16795 ... \n", "36633 -0.360094 -0.525105 -0.185291 -0.16795 ... 
\n", "\n", " job_retired job_self-employed job_services job_student \\\n", "0 -0.227915 -0.189196 -0.316887 -0.145356 \n", "1 -0.227915 -0.189196 -0.316887 -0.145356 \n", "2 -0.227915 -0.189196 -0.316887 -0.145356 \n", "3 -0.227915 -0.189196 -0.316887 -0.145356 \n", "4 -0.227915 -0.189196 -0.316887 -0.145356 \n", "... ... ... ... ... \n", "36629 -0.227915 -0.189196 -0.316887 -0.145356 \n", "36630 -0.227915 5.285528 -0.316887 -0.145356 \n", "36631 -0.227915 -0.189196 -0.316887 -0.145356 \n", "36632 -0.227915 -0.189196 -0.316887 -0.145356 \n", "36633 -0.227915 -0.189196 3.155697 -0.145356 \n", "\n", " job_technician job_unemployed job_unknown marital_divorced \\\n", "0 -0.449906 -0.172953 -0.080522 -0.362219 \n", "1 -0.449906 -0.172953 -0.080522 -0.362219 \n", "2 2.222685 -0.172953 -0.080522 -0.362219 \n", "3 2.222685 -0.172953 -0.080522 -0.362219 \n", "4 -0.449906 -0.172953 -0.080522 -0.362219 \n", "... ... ... ... ... \n", "36629 -0.449906 -0.172953 -0.080522 -0.362219 \n", "36630 -0.449906 -0.172953 -0.080522 -0.362219 \n", "36631 2.222685 -0.172953 -0.080522 -0.362219 \n", "36632 -0.449906 -0.172953 -0.080522 -0.362219 \n", "36633 -0.449906 -0.172953 -0.080522 -0.362219 \n", "\n", " marital_married marital_single \n", "0 0.815216 -0.628656 \n", "1 0.815216 -0.628656 \n", "2 -1.226669 1.590694 \n", "3 0.815216 -0.628656 \n", "4 0.815216 -0.628656 \n", "... ... ... \n", "36629 0.815216 -0.628656 \n", "36630 -1.226669 1.590694 \n", "36631 0.815216 -0.628656 \n", "36632 0.815216 -0.628656 \n", "36633 0.815216 -0.628656 \n", "\n", "[36634 rows x 21 columns]\n", " duration campaign pdays previous y contact_cellular \\\n", "0 0.646636 -0.570738 -0.412237 -0.243872 0 -1.359335 \n", "1 -0.174118 -0.570738 -0.412237 -0.243872 0 -1.359335 \n", "2 0.417271 -0.246893 -0.412237 -0.243872 0 -1.359335 \n", "3 1.733068 0.724643 -0.412237 -0.243872 0 -1.359335 \n", "4 0.562397 -0.246893 -0.412237 -0.243872 0 -1.359335 \n", "... ... ... ... ... .. ... 
\n", "36629 1.073762 -0.246893 -0.412237 -0.243872 0 -1.359335 \n", "36630 1.733068 0.076952 -0.412237 -0.243872 1 -1.359335 \n", "36631 0.233878 -0.246893 -0.412237 -0.243872 0 -1.359335 \n", "36632 -0.489491 0.724643 -0.412237 -0.243872 0 -1.359335 \n", "36633 0.899028 -0.246893 -0.412237 -0.243872 0 -1.359335 \n", "\n", " contact_telephone contact_unknown month_1.0 month_2.0 ... \\\n", "0 -0.261573 1.575748 -0.178158 -0.249263 ... \n", "1 -0.261573 1.575748 -0.178158 -0.249263 ... \n", "2 -0.261573 1.575748 -0.178158 -0.249263 ... \n", "3 -0.261573 1.575748 -0.178158 -0.249263 ... \n", "4 -0.261573 1.575748 -0.178158 -0.249263 ... \n", "... ... ... ... ... ... \n", "36629 -0.261573 1.575748 -0.178158 -0.249263 ... \n", "36630 -0.261573 1.575748 -0.178158 -0.249263 ... \n", "36631 -0.261573 1.575748 -0.178158 -0.249263 ... \n", "36632 -0.261573 1.575748 -0.178158 -0.249263 ... \n", "36633 -0.261573 1.575748 -0.178158 -0.249263 ... \n", "\n", " day_26.0 day_27.0 day_28.0 day_29.0 day_30.0 day_31.0 \\\n", "0 -0.15375 -0.160947 -0.204243 -0.200148 -0.188032 -0.120812 \n", "1 -0.15375 -0.160947 -0.204243 -0.200148 -0.188032 -0.120812 \n", "2 -0.15375 -0.160947 -0.204243 -0.200148 -0.188032 -0.120812 \n", "3 -0.15375 -0.160947 -0.204243 -0.200148 -0.188032 -0.120812 \n", "4 -0.15375 -0.160947 -0.204243 -0.200148 -0.188032 -0.120812 \n", "... ... ... ... ... ... ... 
\n", "36629 -0.15375 -0.160947 -0.204243 -0.200148 -0.188032 -0.120812 \n", "36630 -0.15375 -0.160947 -0.204243 -0.200148 -0.188032 -0.120812 \n", "36631 -0.15375 -0.160947 -0.204243 -0.200148 -0.188032 -0.120812 \n", "36632 -0.15375 -0.160947 -0.204243 -0.200148 -0.188032 -0.120812 \n", "36633 -0.15375 -0.160947 -0.204243 -0.200148 -0.188032 -0.120812 \n", "\n", " poutcome_failure poutcome_other poutcome_success poutcome_unknown \n", "0 -0.350446 -0.204025 -0.18733 0.473664 \n", "1 -0.350446 -0.204025 -0.18733 0.473664 \n", "2 -0.350446 -0.204025 -0.18733 0.473664 \n", "3 -0.350446 -0.204025 -0.18733 0.473664 \n", "4 -0.350446 -0.204025 -0.18733 0.473664 \n", "... ... ... ... ... \n", "36629 -0.350446 -0.204025 -0.18733 0.473664 \n", "36630 -0.350446 -0.204025 -0.18733 0.473664 \n", "36631 -0.350446 -0.204025 -0.18733 0.473664 \n", "36632 -0.350446 -0.204025 -0.18733 0.473664 \n", "36633 -0.350446 -0.204025 -0.18733 0.473664 \n", "\n", "[36634 rows x 55 columns]\n" ] } ], "source": [ "from secretflow.preprocessing import StandardScaler\n", "\n", "# Standardize every feature column; the label y is left unchanged\n", "X = vdf.drop(columns=['y'])\n", "y = vdf['y']\n", "scaler = StandardScaler()\n", "X = scaler.fit_transform(X)\n", "vdf[X.columns] = X\n", "print(sf.reveal(vdf.partitions[alice].data))\n", "print(sf.reveal(vdf.partitions[bob].data))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Security discussion\n", "\n", "Standardization is executed by each data owner's PYU device, so no data is leaked." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### More\n", "\n", "SecretFlow supports further feature preprocessing capabilities; see this [document](./data_preprocessing_with_data_frame.ipynb)."
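] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "At this point, all feature preprocessing is complete.\n", "\n", "> This tutorial mainly aims to demonstrate SecretFlow's preprocessing capabilities; the choice of preprocessing methods here may be debatable, so please bear with us." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Data analysis" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Before modeling, we should analyze the data we are using to decide whether the feature preprocessing steps need to be repeated.\n", "\n", "Below we demonstrate the following SecretFlow data analysis capabilities:\n", "\n", "- Table statistics\n", "- Correlation coefficient matrix\n", "- VIF calculation\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Table statistics\n", "\n", "We provide a function similar to **pd.DataFrame.describe** that shows basic statistics for every feature.\n", "\n", "> During feature preprocessing, you can call table statistics repeatedly to monitor the effect of each step." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "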
| | datatype | total_count | count(non-NA count) | count_na(NA count) | na_ratio | min | max | mean | var(variance) | std(standard deviation) | ... | moment_2 | moment_3 | moment_4 | central_moment_2 | central_moment_3 | central_moment_4 | sum | sum_2 | sum_3 | sum_4 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| age | float64 | 36634 | 36634 | 0 | 0.0 | -2.162447 | 5.096416 | -1.877506e-16 | 1.000027 | 1.000014 | ... | 1.000000 | 0.682366 | 3.328235 | 1.000000 | 0.682366 | 3.328235 | -6.878054e-12 | 36634.0 | 2.499780e+04 | 1.219266e+05 |
| education | float64 | 36634 | 36634 | 0 | 0.0 | -1.749962 | 1.314552 | 9.309945e-18 | 1.000027 | 1.000014 | ... | 1.000000 | -0.152303 | 2.305096 | 1.000000 | -0.152303 | 2.305096 | 3.410605e-13 | 36634.0 | -5.579460e+03 | 8.444490e+04 |
| default | float64 | 36634 | 36634 | 0 | 0.0 | -0.135554 | 7.377133 | 4.965304e-17 | 1.000027 | 1.000014 | ... | 1.000000 | 7.241579 | 53.440463 | 1.000000 | 7.241579 | 53.440463 | 1.818989e-12 | 36634.0 | 2.652880e+05 | 1.957738e+06 |
| balance | float64 | 36634 | 36634 | 0 | 0.0 | -1.566762 | 32.087712 | -1.396492e-17 | 1.000027 | 1.000014 | ... | 1.000000 | 7.967780 | 128.947991 | 1.000000 | 7.967780 | 128.947991 | -5.115908e-13 | 36634.0 | 2.918916e+05 | 4.723881e+06 |
| housing | float64 | 36634 | 36634 | 0 | 0.0 | -1.118638 | 0.893944 | -3.103315e-17 | 1.000027 | 1.000014 | ... | 1.000000 | -0.224695 | 1.050488 | 1.000000 | -0.224695 | 1.050488 | -1.136868e-12 | 36634.0 | -8.231462e+03 | 3.848356e+04 |
| loan | float64 | 36634 | 36634 | 0 | 0.0 | -0.436771 | 2.289530 | 5.779924e-17 | 1.000027 | 1.000014 | ... | 1.000000 | 1.852760 | 4.432718 | 1.000000 | 1.852760 | 4.432718 | 2.117417e-12 | 36634.0 | 6.787399e+04 | 1.623882e+05 |
| job_admin. | float64 | 36634 | 36634 | 0 | 0.0 | -0.360094 | 2.777051 | -2.288695e-17 | 1.000027 | 1.000014 | ... | 1.000000 | 2.416956 | 6.841677 | 1.000000 | 2.416956 | 6.841677 | -8.384404e-13 | 36634.0 | 8.854277e+04 | 2.506380e+05 |
| job_blue-collar | float64 | 36634 | 36634 | 0 | 0.0 | -0.525105 | 1.904383 | 3.103315e-17 | 1.000027 | 1.000014 | ... | 1.000000 | 1.379278 | 2.902408 | 1.000000 | 1.379278 | 2.902408 | 1.136868e-12 | 36634.0 | 5.052848e+04 | 1.063268e+05 |
| job_entrepreneur | float64 | 36634 | 36634 | 0 | 0.0 | -0.185291 | 5.396911 | -9.697859e-18 | 1.000027 | 1.000014 | ... | 1.000000 | 5.211619 | 28.160978 | 1.000000 | 5.211619 | 28.160978 | -3.552714e-13 | 36634.0 | 1.909225e+05 | 1.031649e+06 |
| job_housemaid | float64 | 36634 | 36634 | 0 | 0.0 | -0.167950 | 5.954136 | 5.430801e-17 | 1.000027 | 1.000014 | ... | 1.000000 | 5.786186 | 34.479949 | 1.000000 | 5.786186 | 34.479949 | 1.989520e-12 | 36634.0 | 2.119711e+05 | 1.263138e+06 |
| job_management | float64 | 36634 | 36634 | 0 | 0.0 | -0.513622 | 1.946956 | 5.430801e-17 | 1.000027 | 1.000014 | ... | 1.000000 | 1.433333 | 3.054445 | 1.000000 | 1.433333 | 3.054445 | 1.989520e-12 | 36634.0 | 5.250874e+04 | 1.118965e+05 |
| job_retired | float64 | 36634 | 36634 | 0 | 0.0 | -0.227915 | 4.387592 | 6.361796e-17 | 1.000027 | 1.000014 | ... | 1.000000 | 4.159677 | 18.302913 | 1.000000 | 4.159677 | 18.302913 | 2.330580e-12 | 36634.0 | 1.523856e+05 | 6.705089e+05 |
| job_self-employed | float64 | 36634 | 36634 | 0 | 0.0 | -0.189196 | 5.285528 | -4.267058e-17 | 1.000027 | 1.000014 | ... | 1.000000 | 5.096332 | 26.972604 | 1.000000 | 5.096332 | 26.972604 | -1.563194e-12 | 36634.0 | 1.866990e+05 | 9.881144e+05 |
| job_services | float64 | 36634 | 36634 | 0 | 0.0 | -0.316887 | 3.155697 | 6.361796e-17 | 1.000027 | 1.000014 | ... | 1.000000 | 2.838809 | 9.058838 | 1.000000 | 2.838809 | 9.058838 | 2.330580e-12 | 36634.0 | 1.039969e+05 | 3.318615e+05 |
| job_student | float64 | 36634 | 36634 | 0 | 0.0 | -0.145356 | 6.879667 | -9.309945e-18 | 1.000027 | 1.000014 | ... | 1.000000 | 6.734311 | 46.350944 | 1.000000 | 6.734311 | 46.350944 | -3.410605e-13 | 36634.0 | 2.467047e+05 | 1.698020e+06 |
| job_technician | float64 | 36634 | 36634 | 0 | 0.0 | -0.449906 | 2.222685 | -1.861989e-17 | 1.000027 | 1.000014 | ... | 1.000000 | 1.772778 | 4.142743 | 1.000000 | 1.772778 | 4.142743 | -6.821210e-13 | 36634.0 | 6.494396e+04 | 1.517653e+05 |
| job_unemployed | float64 | 36634 | 36634 | 0 | 0.0 | -0.172953 | 5.781907 | 1.474075e-17 | 1.000027 | 1.000014 | ... | 1.000000 | 5.608954 | 32.460364 | 1.000000 | 5.608954 | 32.460364 | 5.400125e-13 | 36634.0 | 2.054784e+05 | 1.189153e+06 |
| job_unknown | float64 | 36634 | 36634 | 0 | 0.0 | -0.080522 | 12.418889 | -1.512866e-17 | 1.000027 | 1.000014 | ... | 1.000000 | 12.338367 | 153.235297 | 1.000000 | 12.338367 | 153.235297 | -5.542233e-13 | 36634.0 | 4.520037e+05 | 5.613622e+06 |
| marital_divorced | float64 | 36634 | 36634 | 0 | 0.0 | -0.362219 | 2.760760 | 2.754192e-17 | 1.000027 | 1.000014 | ... | 1.000000 | 2.398540 | 6.752996 | 1.000000 | 2.398540 | 6.752996 | 1.008971e-12 | 36634.0 | 8.786813e+04 | 2.473893e+05 |
| marital_married | float64 | 36634 | 36634 | 0 | 0.0 | -1.226669 | 0.815216 | -1.334425e-16 | 1.000027 | 1.000014 | ... | 1.000000 | -0.411454 | 1.169294 | 1.000000 | -0.411454 | 1.169294 | -4.888534e-12 | 36634.0 | -1.507319e+04 | 4.283592e+04 |
| marital_single | float64 | 36634 | 36634 | 0 | 0.0 | -0.628656 | 1.590694 | 6.012673e-17 | 1.000027 | 1.000014 | ... | 1.000000 | 0.962038 | 1.925516 | 1.000000 | 0.962038 | 1.925516 | 2.202682e-12 | 36634.0 | 3.524328e+04 | 7.053936e+04 |
| duration | float64 | 36634 | 36634 | 0 | 0.0 | -2.809886 | 1.846421 | 6.633336e-17 | 1.000027 | 1.000014 | ... | 1.000000 | -0.976752 | 4.099032 | 1.000000 | -0.976752 | 4.099032 | 2.430056e-12 | 36634.0 | -3.578234e+04 | 1.501639e+05 |
| campaign | float64 | 36634 | 36634 | 0 | 0.0 | -0.570738 | 19.507670 | 4.034309e-17 | 1.000027 | 1.000014 | ... | 1.000000 | 4.880232 | 41.921269 | 1.000000 | 4.880232 | 41.921269 | 1.477929e-12 | 36634.0 | 1.787824e+05 | 1.535744e+06 |
| pdays | float64 | 36634 | 36634 | 0 | 0.0 | -0.412237 | 8.263154 | -6.206630e-18 | 1.000027 | 1.000014 | ... | 1.000000 | 2.619630 | 10.019985 | 1.000000 | 2.619630 | 10.019985 | -2.273737e-13 | 36634.0 | 9.596753e+04 | 3.670721e+05 |
| previous | float64 | 36634 | 36634 | 0 | 0.0 | -0.243872 | 113.389007 | -4.034309e-17 | 1.000027 | 1.000014 | ... | 1.000000 | 43.957189 | 4561.467798 | 1.000000 | 43.957189 | 4561.467798 | -1.477929e-12 | 36634.0 | 1.610328e+06 | 1.671048e+08 |
| y | int64 | 36634 | 36634 | 0 | 0.0 | 0.000000 | 1.000000 | 1.174592e-01 | 0.103665 | 0.321971 | ... | 0.117459 | 0.117459 | 0.117459 | 0.103663 | 0.079310 | 0.071425 | 4.303000e+03 | 4303.0 | 4.303000e+03 | 4.303000e+03 |
| contact_cellular | float64 | 36634 | 36634 | 0 | 0.0 | -1.359335 | 0.735654 | -4.965304e-17 | 1.000027 | 1.000014 | ... | 1.000000 | -0.623682 | 1.388979 | 1.000000 | -0.623682 | 1.388979 | -1.818989e-12 | 36634.0 | -2.284795e+04 | 5.088384e+04 |
| contact_telephone | float64 | 36634 | 36634 | 0 | 0.0 | -0.261573 | 3.823024 | 5.896298e-17 | 1.000027 | 1.000014 | ... | 1.000000 | 3.561451 | 13.683936 | 1.000000 | 3.561451 | 13.683936 | 2.160050e-12 | 36634.0 | 1.304702e+05 | 5.012973e+05 |
| contact_unknown | float64 | 36634 | 36634 | 0 | 0.0 | -0.634619 | 1.575748 | -1.241326e-16 | 1.000027 | 1.000014 | ... | 1.000000 | 0.941129 | 1.885723 | 1.000000 | 0.941129 | 1.885723 | -4.547474e-12 | 36634.0 | 3.447731e+04 | 6.908158e+04 |
| month_1.0 | float64 | 36634 | 36634 | 0 | 0.0 | -0.178158 | 5.613000 | -2.482652e-17 | 1.000027 | 1.000014 | ... | 1.000000 | 5.434842 | 30.537508 | 1.000000 | 5.434842 | 30.537508 | -9.094947e-13 | 36634.0 | 1.991000e+05 | 1.118711e+06 |
| month_2.0 | float64 | 36634 | 36634 | 0 | 0.0 | -0.249263 | 4.011823 | 8.378950e-17 | 1.000027 | 1.000014 | ... | 1.000000 | 3.762560 | 15.156859 | 1.000000 | 3.762560 | 15.156859 | 3.069545e-12 | 36634.0 | 1.378376e+05 | 5.552564e+05 |
| month_3.0 | float64 | 36634 | 36634 | 0 | 0.0 | -0.103058 | 9.703260 | -5.275635e-17 | 1.000027 | 1.000014 | ... | 1.000000 | 9.600201 | 93.163868 | 1.000000 | 9.600201 | 93.163868 | -1.932676e-12 | 36634.0 | 3.516938e+05 | 3.412965e+06 |
| month_4.0 | float64 | 36634 | 36634 | 0 | 0.0 | -0.265247 | 3.770074 | 1.861989e-16 | 1.000027 | 1.000014 | ... | 1.000000 | 3.504827 | 13.283811 | 1.000000 | 3.504827 | 13.283811 | 6.821210e-12 | 36634.0 | 1.283958e+05 | 4.866391e+05 |
| month_5.0 | float64 | 36634 | 36634 | 0 | 0.0 | -0.658775 | 1.517969 | -1.241326e-17 | 1.000027 | 1.000014 | ... | 1.000000 | 0.859194 | 1.738215 | 1.000000 | 0.859194 | 1.738215 | -4.547474e-13 | 36634.0 | 3.147572e+04 | 6.367775e+04 |
| month_6.0 | float64 | 36634 | 36634 | 0 | 0.0 | -0.366401 | 2.729249 | -7.447956e-17 | 1.000027 | 1.000014 | ... | 1.000000 | 2.362848 | 6.583051 | 1.000000 | 2.362848 | 6.583051 | -2.728484e-12 | 36634.0 | 8.656057e+04 | 2.411635e+05 |
| month_7.0 | float64 | 36634 | 36634 | 0 | 0.0 | -0.425328 | 2.351127 | -1.551657e-16 | 1.000027 | 1.000014 | ... | 1.000000 | 1.925799 | 4.708701 | 1.000000 | 1.925799 | 4.708701 | -5.684342e-12 | 36634.0 | 7.054972e+04 | 1.724986e+05 |
| month_8.0 | float64 | 36634 | 36634 | 0 | 0.0 | -0.402386 | 2.485176 | -1.861989e-17 | 1.000027 | 1.000014 | ... | 1.000000 | 2.082791 | 5.338016 | 1.000000 | 2.082791 | 5.338016 | -6.821210e-13 | 36634.0 | 7.630095e+04 | 1.955529e+05 |
| month_9.0 | float64 | 36634 | 36634 | 0 | 0.0 | -0.112144 | 8.917078 | -1.706823e-17 | 1.000027 | 1.000014 | ... | 1.000000 | 8.804934 | 78.526862 | 1.000000 | 8.804934 | 78.526862 | -6.252776e-13 | 36634.0 | 3.225600e+05 | 2.876753e+06 |
| month_10.0 | float64 | 36634 | 36634 | 0 | 0.0 | -0.127389 | 7.849982 | -9.309945e-18 | 1.000027 | 1.000014 | ... | 1.000000 | 7.722593 | 60.638450 | 1.000000 | 7.722593 | 60.638450 | -3.410605e-13 | 36634.0 | 2.829095e+05 | 2.221429e+06 |
| month_11.0 | float64 | 36634 | 36634 | 0 | 0.0 | -0.310483 | 3.220790 | 5.585967e-17 | 1.000027 | 1.000014 | ... | 1.000000 | 2.910307 | 9.469886 | 1.000000 | 2.910307 | 9.469886 | 2.046363e-12 | 36634.0 | 1.066162e+05 | 3.469198e+05 |
| month_12.0 | float64 | 36634 | 36634 | 0 | 0.0 | -0.068280 | 14.645618 | 7.758287e-18 | 1.000027 | 1.000014 | ... | 1.000000 | 14.577338 | 213.498780 | 1.000000 | 14.577338 | 213.498780 | 2.842171e-13 | 36634.0 | 5.340262e+05 | 7.821314e+06 |
| day_1.0 | float64 | 36634 | 36634 | 0 | 0.0 | -0.086489 | 11.562172 | 3.103315e-18 | 1.000027 | 1.000014 | ... | 1.000000 | 11.475683 | 132.691304 | 1.000000 | 11.475683 | 132.691304 | 1.136868e-13 | 36634.0 | 4.204002e+05 | 4.861013e+06 |
| day_2.0 | float64 | 36634 | 36634 | 0 | 0.0 | -0.171018 | 5.847321 | -5.585967e-17 | 1.000027 | 1.000014 | ... | 1.000000 | 5.676302 | 33.220410 | 1.000000 | 5.676302 | 33.220410 | -2.046363e-12 | 36634.0 | 2.079457e+05 | 1.216996e+06 |
| day_3.0 | float64 | 36634 | 36634 | 0 | 0.0 | -0.158339 | 6.315549 | 0.000000e+00 | 1.000027 | 1.000014 | ... | 1.000000 | 6.157210 | 38.911232 | 1.000000 | 6.157210 | 38.911232 | 0.000000e+00 | 36634.0 | 2.255632e+05 | 1.425474e+06 |
| day_4.0 | float64 | 36634 | 36634 | 0 | 0.0 | -0.180429 | 5.542360 | -3.258481e-17 | 1.000027 | 1.000014 | ... | 1.000000 | 5.361931 | 29.750303 | 1.000000 | 5.361931 | 29.750303 | -1.193712e-12 | 36634.0 | 1.964290e+05 | 1.089873e+06 |
| day_5.0 | float64 | 36634 | 36634 | 0 | 0.0 | -0.211179 | 4.735322 | -7.447956e-17 | 1.000027 | 1.000014 | ... | 1.000000 | 4.524143 | 21.467870 | 1.000000 | 4.524143 | 21.467870 | -2.728484e-12 | 36634.0 | 1.657375e+05 | 7.864540e+05 |
| day_6.0 | float64 | 36634 | 36634 | 0 | 0.0 | -0.212024 | 4.716452 | 6.827293e-17 | 1.000027 | 1.000014 | ... | 1.000000 | 4.504429 | 21.289878 | 1.000000 | 4.504429 | 21.289878 | 2.501110e-12 | 36634.0 | 1.650152e+05 | 7.799334e+05 |
| day_7.0 | float64 | 36634 | 36634 | 0 | 0.0 | -0.203808 | 4.906588 | -4.344641e-17 | 1.000027 | 1.000014 | ... | 1.000000 | 4.702780 | 23.116144 | 1.000000 | 4.702780 | 23.116144 | -1.591616e-12 | 36634.0 | 1.722817e+05 | 8.468368e+05 |
| day_8.0 | float64 | 36634 | 36634 | 0 | 0.0 | -0.204967 | 4.878830 | -8.999613e-17 | 1.000027 | 1.000014 | ... | 1.000000 | 4.673862 | 22.844991 | 1.000000 | 4.673862 | 22.844991 | -3.296918e-12 | 36634.0 | 1.712223e+05 | 8.369034e+05 |
| day_9.0 | float64 | 36634 | 36634 | 0 | 0.0 | -0.189814 | 5.268311 | 1.861989e-17 | 1.000027 | 1.000014 | ... | 1.000000 | 5.078497 | 26.791131 | 1.000000 | 5.078497 | 26.791131 | 6.821210e-13 | 36634.0 | 1.860457e+05 | 9.814663e+05 |
| day_10.0 | float64 | 36634 | 36634 | 0 | 0.0 | -0.109110 | 9.165025 | -4.034309e-17 | 1.000027 | 1.000014 | ... | 1.000000 | 9.055914 | 83.009585 | 1.000000 | 9.055914 | 83.009585 | -1.477929e-12 | 36634.0 | 3.317544e+05 | 3.040973e+06 |
| day_11.0 | float64 | 36634 | 36634 | 0 | 0.0 | -0.183073 | 5.462298 | -4.344641e-17 | 1.000027 | 1.000014 | ... | 1.000000 | 5.279225 | 28.870216 | 1.000000 | 5.279225 | 28.870216 | -1.591616e-12 | 36634.0 | 1.933991e+05 | 1.057631e+06 |
| day_12.0 | float64 | 36634 | 36634 | 0 | 0.0 | -0.191352 | 5.225961 | -9.309945e-17 | 1.000027 | 1.000014 | ... | 1.000000 | 5.034608 | 26.347280 | 1.000000 | 5.034608 | 26.347280 | -3.410605e-12 | 36634.0 | 1.844378e+05 | 9.652063e+05 |
| day_13.0 | float64 | 36634 | 36634 | 0 | 0.0 | -0.190661 | 5.244897 | 4.034309e-17 | 1.000027 | 1.000014 | ... | 1.000000 | 5.054236 | 26.545301 | 1.000000 | 5.054236 | 26.545301 | 1.477929e-12 | 36634.0 | 1.851569e+05 | 9.724606e+05 |
| day_14.0 | float64 | 36634 | 36634 | 0 | 0.0 | -0.206696 | 4.838016 | -1.551657e-17 | 1.000027 | 1.000014 | ... | 1.000000 | 4.631319 | 22.449119 | 1.000000 | 4.631319 | 22.449119 | -5.684342e-13 | 36634.0 | 1.696638e+05 | 8.224010e+05 |
| day_15.0 | float64 | 36634 | 36634 | 0 | 0.0 | -0.195980 | 5.102564 | 6.827293e-17 | 1.000027 | 1.000014 | ... | 1.000000 | 4.906584 | 25.074570 | 1.000000 | 4.906584 | 25.074570 | 2.501110e-12 | 36634.0 | 1.797478e+05 | 9.185818e+05 |
| day_16.0 | float64 | 36634 | 36634 | 0 | 0.0 | -0.178728 | 5.595097 | 9.309945e-18 | 1.000027 | 1.000014 | ... | 1.000000 | 5.416369 | 30.337058 | 1.000000 | 5.416369 | 30.337058 | 3.410605e-13 | 36634.0 | 1.984233e+05 | 1.111368e+06 |
| day_17.0 | float64 | 36634 | 36634 | 0 | 0.0 | -0.213286 | 4.688543 | 8.689282e-17 | 1.000027 | 1.000014 | ... | 1.000000 | 4.475257 | 21.027925 | 1.000000 | 4.475257 | 21.027925 | 3.183231e-12 | 36634.0 | 1.639466e+05 | 7.703370e+05 |
| day_18.0 | float64 | 36634 | 36634 | 0 | 0.0 | -0.231079 | 4.327530 | 2.482652e-17 | 1.000027 | 1.000014 | ... | 1.000000 | 4.096451 | 17.780915 | 1.000000 | 4.096451 | 17.780915 | 9.094947e-13 | 36634.0 | 1.500694e+05 | 6.513860e+05 |
| day_19.0 | float64 | 36634 | 36634 | 0 | 0.0 | -0.201472 | 4.963478 | -3.723978e-17 | 1.000027 | 1.000014 | ... | 1.000000 | 4.762006 | 23.676700 | 1.000000 | 4.762006 | 23.676700 | -1.364242e-12 | 36634.0 | 1.744513e+05 | 8.673722e+05 |
| day_20.0 | float64 | 36634 | 36634 | 0 | 0.0 | -0.256473 | 3.899047 | -4.344641e-17 | 1.000027 | 1.000014 | ... | 1.000000 | 3.642574 | 14.268344 | 1.000000 | 3.642574 | 14.268344 | -1.591616e-12 | 36634.0 | 1.334420e+05 | 5.227065e+05 |
| day_21.0 | float64 | 36634 | 36634 | 0 | 0.0 | -0.214333 | 4.665638 | -6.206630e-17 | 1.000027 | 1.000014 | ... | 1.000000 | 4.451305 | 20.814118 | 1.000000 | 4.451305 | 20.814118 | -2.273737e-12 | 36634.0 | 1.630691e+05 | 7.625044e+05 |
| day_22.0 | float64 | 36634 | 36634 | 0 | 0.0 | -0.142988 | 6.993574 | -6.206630e-18 | 1.000027 | 1.000014 | ... | 1.000000 | 6.850586 | 47.930527 | 1.000000 | 6.850586 | 47.930527 | -2.273737e-13 | 36634.0 | 2.509644e+05 | 1.755887e+06 |
| day_23.0 | float64 | 36634 | 36634 | 0 | 0.0 | -0.146526 | 6.824707 | -3.103315e-17 | 1.000027 | 1.000014 | ... | 1.000000 | 6.678180 | 45.598093 | 1.000000 | 6.678180 | 45.598093 | -1.136868e-12 | 36634.0 | 2.446485e+05 | 1.670441e+06 |
| day_24.0 | float64 | 36634 | 36634 | 0 | 0.0 | -0.100040 | 9.996005 | 2.017155e-17 | 1.000027 | 1.000014 | ... | 1.000000 | 9.895965 | 98.930118 | 1.000000 | 9.895965 | 98.930118 | 7.389644e-13 | 36634.0 | 3.625288e+05 | 3.624206e+06 |
| day_25.0 | float64 | 36634 | 36634 | 0 | 0.0 | -0.138142 | 7.238946 | -5.896298e-17 | 1.000027 | 1.000014 | ... | 1.000000 | 7.100804 | 51.421415 | 1.000000 | 7.100804 | 51.421415 | -2.160050e-12 | 36634.0 | 2.601308e+05 | 1.883772e+06 |
day_26.0float64366343663400.0-0.1537506.504045-1.861989e-171.0000271.000014...1.0000006.35029441.3262401.0000006.35029441.326240-6.821210e-1336634.02.326367e+051.513945e+06
day_27.0float64366343663400.0-0.1609476.2132382.482652e-171.0000271.000014...1.0000006.05229137.6302281.0000006.05229137.6302289.094947e-1336634.02.217196e+051.378546e+06
day_28.0float64366343663400.0-0.2042434.8961263.103315e-171.0000271.000014...1.0000004.69188323.0137671.0000004.69188323.0137671.136868e-1236634.01.718824e+058.430863e+05
day_29.0float64366343663400.0-0.2001484.996313-1.861989e-171.0000271.000014...1.0000004.79616624.0032061.0000004.79616624.003206-6.821210e-1336634.01.757027e+058.793334e+05
day_30.0float64366343663400.0-0.1880325.3182493.103315e-171.0000271.000014...1.0000005.13021727.3191291.0000005.13021727.3191291.136868e-1236634.01.879404e+051.000809e+06
day_31.0float64366343663400.0-0.1208128.2773323.103315e-171.0000271.000014...1.0000008.15652167.5288271.0000008.15652167.5288271.136868e-1236634.02.988060e+052.473851e+06
poutcome_failurefloat64366343663400.0-0.3504462.853507-3.103315e-171.0000271.000014...1.0000002.5030617.2653131.0000002.5030617.265313-1.136868e-1236634.09.169713e+042.661575e+05
poutcome_otherfloat64366343663400.0-0.2040254.901349-7.447956e-171.0000271.000014...1.0000004.69732423.0648501.0000004.69732423.064850-2.728484e-1236634.01.720818e+058.449577e+05
poutcome_successfloat64366343663400.0-0.1873305.338162-3.413646e-171.0000271.000014...1.0000005.15083227.5310671.0000005.15083227.531067-1.250555e-1236634.01.886956e+051.008573e+06
poutcome_unknownfloat64366343663400.0-2.1112020.473664-1.241326e-171.0000271.000014...1.000000-1.6375383.6815301.000000-1.6375383.681530-4.547474e-1336634.0-5.998956e+041.348692e+05
\n", "

76 rows × 26 columns

\n", "
" ], "text/plain": [ " datatype total_count count(non-NA count) \\\n", "age float64 36634 36634 \n", "education float64 36634 36634 \n", "default float64 36634 36634 \n", "balance float64 36634 36634 \n", "housing float64 36634 36634 \n", "loan float64 36634 36634 \n", "job_admin. float64 36634 36634 \n", "job_blue-collar float64 36634 36634 \n", "job_entrepreneur float64 36634 36634 \n", "job_housemaid float64 36634 36634 \n", "job_management float64 36634 36634 \n", "job_retired float64 36634 36634 \n", "job_self-employed float64 36634 36634 \n", "job_services float64 36634 36634 \n", "job_student float64 36634 36634 \n", "job_technician float64 36634 36634 \n", "job_unemployed float64 36634 36634 \n", "job_unknown float64 36634 36634 \n", "marital_divorced float64 36634 36634 \n", "marital_married float64 36634 36634 \n", "marital_single float64 36634 36634 \n", "duration float64 36634 36634 \n", "campaign float64 36634 36634 \n", "pdays float64 36634 36634 \n", "previous float64 36634 36634 \n", "y int64 36634 36634 \n", "contact_cellular float64 36634 36634 \n", "contact_telephone float64 36634 36634 \n", "contact_unknown float64 36634 36634 \n", "month_1.0 float64 36634 36634 \n", "month_2.0 float64 36634 36634 \n", "month_3.0 float64 36634 36634 \n", "month_4.0 float64 36634 36634 \n", "month_5.0 float64 36634 36634 \n", "month_6.0 float64 36634 36634 \n", "month_7.0 float64 36634 36634 \n", "month_8.0 float64 36634 36634 \n", "month_9.0 float64 36634 36634 \n", "month_10.0 float64 36634 36634 \n", "month_11.0 float64 36634 36634 \n", "month_12.0 float64 36634 36634 \n", "day_1.0 float64 36634 36634 \n", "day_2.0 float64 36634 36634 \n", "day_3.0 float64 36634 36634 \n", "day_4.0 float64 36634 36634 \n", "day_5.0 float64 36634 36634 \n", "day_6.0 float64 36634 36634 \n", "day_7.0 float64 36634 36634 \n", "day_8.0 float64 36634 36634 \n", "day_9.0 float64 36634 36634 \n", "day_10.0 float64 36634 36634 \n", "day_11.0 float64 36634 36634 \n", "day_12.0 float64 
36634 36634 \n", "day_13.0 float64 36634 36634 \n", "day_14.0 float64 36634 36634 \n", "day_15.0 float64 36634 36634 \n", "day_16.0 float64 36634 36634 \n", "day_17.0 float64 36634 36634 \n", "day_18.0 float64 36634 36634 \n", "day_19.0 float64 36634 36634 \n", "day_20.0 float64 36634 36634 \n", "day_21.0 float64 36634 36634 \n", "day_22.0 float64 36634 36634 \n", "day_23.0 float64 36634 36634 \n", "day_24.0 float64 36634 36634 \n", "day_25.0 float64 36634 36634 \n", "day_26.0 float64 36634 36634 \n", "day_27.0 float64 36634 36634 \n", "day_28.0 float64 36634 36634 \n", "day_29.0 float64 36634 36634 \n", "day_30.0 float64 36634 36634 \n", "day_31.0 float64 36634 36634 \n", "poutcome_failure float64 36634 36634 \n", "poutcome_other float64 36634 36634 \n", "poutcome_success float64 36634 36634 \n", "poutcome_unknown float64 36634 36634 \n", "\n", " count_na(NA count) na_ratio min max \\\n", "age 0 0.0 -2.162447 5.096416 \n", "education 0 0.0 -1.749962 1.314552 \n", "default 0 0.0 -0.135554 7.377133 \n", "balance 0 0.0 -1.566762 32.087712 \n", "housing 0 0.0 -1.118638 0.893944 \n", "loan 0 0.0 -0.436771 2.289530 \n", "job_admin. 
0 0.0 -0.360094 2.777051 \n", "job_blue-collar 0 0.0 -0.525105 1.904383 \n", "job_entrepreneur 0 0.0 -0.185291 5.396911 \n", "job_housemaid 0 0.0 -0.167950 5.954136 \n", "job_management 0 0.0 -0.513622 1.946956 \n", "job_retired 0 0.0 -0.227915 4.387592 \n", "job_self-employed 0 0.0 -0.189196 5.285528 \n", "job_services 0 0.0 -0.316887 3.155697 \n", "job_student 0 0.0 -0.145356 6.879667 \n", "job_technician 0 0.0 -0.449906 2.222685 \n", "job_unemployed 0 0.0 -0.172953 5.781907 \n", "job_unknown 0 0.0 -0.080522 12.418889 \n", "marital_divorced 0 0.0 -0.362219 2.760760 \n", "marital_married 0 0.0 -1.226669 0.815216 \n", "marital_single 0 0.0 -0.628656 1.590694 \n", "duration 0 0.0 -2.809886 1.846421 \n", "campaign 0 0.0 -0.570738 19.507670 \n", "pdays 0 0.0 -0.412237 8.263154 \n", "previous 0 0.0 -0.243872 113.389007 \n", "y 0 0.0 0.000000 1.000000 \n", "contact_cellular 0 0.0 -1.359335 0.735654 \n", "contact_telephone 0 0.0 -0.261573 3.823024 \n", "contact_unknown 0 0.0 -0.634619 1.575748 \n", "month_1.0 0 0.0 -0.178158 5.613000 \n", "month_2.0 0 0.0 -0.249263 4.011823 \n", "month_3.0 0 0.0 -0.103058 9.703260 \n", "month_4.0 0 0.0 -0.265247 3.770074 \n", "month_5.0 0 0.0 -0.658775 1.517969 \n", "month_6.0 0 0.0 -0.366401 2.729249 \n", "month_7.0 0 0.0 -0.425328 2.351127 \n", "month_8.0 0 0.0 -0.402386 2.485176 \n", "month_9.0 0 0.0 -0.112144 8.917078 \n", "month_10.0 0 0.0 -0.127389 7.849982 \n", "month_11.0 0 0.0 -0.310483 3.220790 \n", "month_12.0 0 0.0 -0.068280 14.645618 \n", "day_1.0 0 0.0 -0.086489 11.562172 \n", "day_2.0 0 0.0 -0.171018 5.847321 \n", "day_3.0 0 0.0 -0.158339 6.315549 \n", "day_4.0 0 0.0 -0.180429 5.542360 \n", "day_5.0 0 0.0 -0.211179 4.735322 \n", "day_6.0 0 0.0 -0.212024 4.716452 \n", "day_7.0 0 0.0 -0.203808 4.906588 \n", "day_8.0 0 0.0 -0.204967 4.878830 \n", "day_9.0 0 0.0 -0.189814 5.268311 \n", "day_10.0 0 0.0 -0.109110 9.165025 \n", "day_11.0 0 0.0 -0.183073 5.462298 \n", "day_12.0 0 0.0 -0.191352 5.225961 \n", "day_13.0 0 0.0 
-0.190661 5.244897 \n", "day_14.0 0 0.0 -0.206696 4.838016 \n", "day_15.0 0 0.0 -0.195980 5.102564 \n", "day_16.0 0 0.0 -0.178728 5.595097 \n", "day_17.0 0 0.0 -0.213286 4.688543 \n", "day_18.0 0 0.0 -0.231079 4.327530 \n", "day_19.0 0 0.0 -0.201472 4.963478 \n", "day_20.0 0 0.0 -0.256473 3.899047 \n", "day_21.0 0 0.0 -0.214333 4.665638 \n", "day_22.0 0 0.0 -0.142988 6.993574 \n", "day_23.0 0 0.0 -0.146526 6.824707 \n", "day_24.0 0 0.0 -0.100040 9.996005 \n", "day_25.0 0 0.0 -0.138142 7.238946 \n", "day_26.0 0 0.0 -0.153750 6.504045 \n", "day_27.0 0 0.0 -0.160947 6.213238 \n", "day_28.0 0 0.0 -0.204243 4.896126 \n", "day_29.0 0 0.0 -0.200148 4.996313 \n", "day_30.0 0 0.0 -0.188032 5.318249 \n", "day_31.0 0 0.0 -0.120812 8.277332 \n", "poutcome_failure 0 0.0 -0.350446 2.853507 \n", "poutcome_other 0 0.0 -0.204025 4.901349 \n", "poutcome_success 0 0.0 -0.187330 5.338162 \n", "poutcome_unknown 0 0.0 -2.111202 0.473664 \n", "\n", " mean var(variance) std(standard deviation) ... \\\n", "age -1.877506e-16 1.000027 1.000014 ... \n", "education 9.309945e-18 1.000027 1.000014 ... \n", "default 4.965304e-17 1.000027 1.000014 ... \n", "balance -1.396492e-17 1.000027 1.000014 ... \n", "housing -3.103315e-17 1.000027 1.000014 ... \n", "loan 5.779924e-17 1.000027 1.000014 ... \n", "job_admin. -2.288695e-17 1.000027 1.000014 ... \n", "job_blue-collar 3.103315e-17 1.000027 1.000014 ... \n", "job_entrepreneur -9.697859e-18 1.000027 1.000014 ... \n", "job_housemaid 5.430801e-17 1.000027 1.000014 ... \n", "job_management 5.430801e-17 1.000027 1.000014 ... \n", "job_retired 6.361796e-17 1.000027 1.000014 ... \n", "job_self-employed -4.267058e-17 1.000027 1.000014 ... \n", "job_services 6.361796e-17 1.000027 1.000014 ... \n", "job_student -9.309945e-18 1.000027 1.000014 ... \n", "job_technician -1.861989e-17 1.000027 1.000014 ... \n", "job_unemployed 1.474075e-17 1.000027 1.000014 ... \n", "job_unknown -1.512866e-17 1.000027 1.000014 ... 
\n", "marital_divorced 2.754192e-17 1.000027 1.000014 ... \n", "marital_married -1.334425e-16 1.000027 1.000014 ... \n", "marital_single 6.012673e-17 1.000027 1.000014 ... \n", "duration 6.633336e-17 1.000027 1.000014 ... \n", "campaign 4.034309e-17 1.000027 1.000014 ... \n", "pdays -6.206630e-18 1.000027 1.000014 ... \n", "previous -4.034309e-17 1.000027 1.000014 ... \n", "y 1.174592e-01 0.103665 0.321971 ... \n", "contact_cellular -4.965304e-17 1.000027 1.000014 ... \n", "contact_telephone 5.896298e-17 1.000027 1.000014 ... \n", "contact_unknown -1.241326e-16 1.000027 1.000014 ... \n", "month_1.0 -2.482652e-17 1.000027 1.000014 ... \n", "month_2.0 8.378950e-17 1.000027 1.000014 ... \n", "month_3.0 -5.275635e-17 1.000027 1.000014 ... \n", "month_4.0 1.861989e-16 1.000027 1.000014 ... \n", "month_5.0 -1.241326e-17 1.000027 1.000014 ... \n", "month_6.0 -7.447956e-17 1.000027 1.000014 ... \n", "month_7.0 -1.551657e-16 1.000027 1.000014 ... \n", "month_8.0 -1.861989e-17 1.000027 1.000014 ... \n", "month_9.0 -1.706823e-17 1.000027 1.000014 ... \n", "month_10.0 -9.309945e-18 1.000027 1.000014 ... \n", "month_11.0 5.585967e-17 1.000027 1.000014 ... \n", "month_12.0 7.758287e-18 1.000027 1.000014 ... \n", "day_1.0 3.103315e-18 1.000027 1.000014 ... \n", "day_2.0 -5.585967e-17 1.000027 1.000014 ... \n", "day_3.0 0.000000e+00 1.000027 1.000014 ... \n", "day_4.0 -3.258481e-17 1.000027 1.000014 ... \n", "day_5.0 -7.447956e-17 1.000027 1.000014 ... \n", "day_6.0 6.827293e-17 1.000027 1.000014 ... \n", "day_7.0 -4.344641e-17 1.000027 1.000014 ... \n", "day_8.0 -8.999613e-17 1.000027 1.000014 ... \n", "day_9.0 1.861989e-17 1.000027 1.000014 ... \n", "day_10.0 -4.034309e-17 1.000027 1.000014 ... \n", "day_11.0 -4.344641e-17 1.000027 1.000014 ... \n", "day_12.0 -9.309945e-17 1.000027 1.000014 ... \n", "day_13.0 4.034309e-17 1.000027 1.000014 ... \n", "day_14.0 -1.551657e-17 1.000027 1.000014 ... \n", "day_15.0 6.827293e-17 1.000027 1.000014 ... 
\n", "day_16.0 9.309945e-18 1.000027 1.000014 ... \n", "day_17.0 8.689282e-17 1.000027 1.000014 ... \n", "day_18.0 2.482652e-17 1.000027 1.000014 ... \n", "day_19.0 -3.723978e-17 1.000027 1.000014 ... \n", "day_20.0 -4.344641e-17 1.000027 1.000014 ... \n", "day_21.0 -6.206630e-17 1.000027 1.000014 ... \n", "day_22.0 -6.206630e-18 1.000027 1.000014 ... \n", "day_23.0 -3.103315e-17 1.000027 1.000014 ... \n", "day_24.0 2.017155e-17 1.000027 1.000014 ... \n", "day_25.0 -5.896298e-17 1.000027 1.000014 ... \n", "day_26.0 -1.861989e-17 1.000027 1.000014 ... \n", "day_27.0 2.482652e-17 1.000027 1.000014 ... \n", "day_28.0 3.103315e-17 1.000027 1.000014 ... \n", "day_29.0 -1.861989e-17 1.000027 1.000014 ... \n", "day_30.0 3.103315e-17 1.000027 1.000014 ... \n", "day_31.0 3.103315e-17 1.000027 1.000014 ... \n", "poutcome_failure -3.103315e-17 1.000027 1.000014 ... \n", "poutcome_other -7.447956e-17 1.000027 1.000014 ... \n", "poutcome_success -3.413646e-17 1.000027 1.000014 ... \n", "poutcome_unknown -1.241326e-17 1.000027 1.000014 ... \n", "\n", " moment_2 moment_3 moment_4 central_moment_2 \\\n", "age 1.000000 0.682366 3.328235 1.000000 \n", "education 1.000000 -0.152303 2.305096 1.000000 \n", "default 1.000000 7.241579 53.440463 1.000000 \n", "balance 1.000000 7.967780 128.947991 1.000000 \n", "housing 1.000000 -0.224695 1.050488 1.000000 \n", "loan 1.000000 1.852760 4.432718 1.000000 \n", "job_admin. 
1.000000 2.416956 6.841677 1.000000 \n", "job_blue-collar 1.000000 1.379278 2.902408 1.000000 \n", "job_entrepreneur 1.000000 5.211619 28.160978 1.000000 \n", "job_housemaid 1.000000 5.786186 34.479949 1.000000 \n", "job_management 1.000000 1.433333 3.054445 1.000000 \n", "job_retired 1.000000 4.159677 18.302913 1.000000 \n", "job_self-employed 1.000000 5.096332 26.972604 1.000000 \n", "job_services 1.000000 2.838809 9.058838 1.000000 \n", "job_student 1.000000 6.734311 46.350944 1.000000 \n", "job_technician 1.000000 1.772778 4.142743 1.000000 \n", "job_unemployed 1.000000 5.608954 32.460364 1.000000 \n", "job_unknown 1.000000 12.338367 153.235297 1.000000 \n", "marital_divorced 1.000000 2.398540 6.752996 1.000000 \n", "marital_married 1.000000 -0.411454 1.169294 1.000000 \n", "marital_single 1.000000 0.962038 1.925516 1.000000 \n", "duration 1.000000 -0.976752 4.099032 1.000000 \n", "campaign 1.000000 4.880232 41.921269 1.000000 \n", "pdays 1.000000 2.619630 10.019985 1.000000 \n", "previous 1.000000 43.957189 4561.467798 1.000000 \n", "y 0.117459 0.117459 0.117459 0.103663 \n", "contact_cellular 1.000000 -0.623682 1.388979 1.000000 \n", "contact_telephone 1.000000 3.561451 13.683936 1.000000 \n", "contact_unknown 1.000000 0.941129 1.885723 1.000000 \n", "month_1.0 1.000000 5.434842 30.537508 1.000000 \n", "month_2.0 1.000000 3.762560 15.156859 1.000000 \n", "month_3.0 1.000000 9.600201 93.163868 1.000000 \n", "month_4.0 1.000000 3.504827 13.283811 1.000000 \n", "month_5.0 1.000000 0.859194 1.738215 1.000000 \n", "month_6.0 1.000000 2.362848 6.583051 1.000000 \n", "month_7.0 1.000000 1.925799 4.708701 1.000000 \n", "month_8.0 1.000000 2.082791 5.338016 1.000000 \n", "month_9.0 1.000000 8.804934 78.526862 1.000000 \n", "month_10.0 1.000000 7.722593 60.638450 1.000000 \n", "month_11.0 1.000000 2.910307 9.469886 1.000000 \n", "month_12.0 1.000000 14.577338 213.498780 1.000000 \n", "day_1.0 1.000000 11.475683 132.691304 1.000000 \n", "day_2.0 1.000000 5.676302 
33.220410 1.000000 \n", "day_3.0 1.000000 6.157210 38.911232 1.000000 \n", "day_4.0 1.000000 5.361931 29.750303 1.000000 \n", "day_5.0 1.000000 4.524143 21.467870 1.000000 \n", "day_6.0 1.000000 4.504429 21.289878 1.000000 \n", "day_7.0 1.000000 4.702780 23.116144 1.000000 \n", "day_8.0 1.000000 4.673862 22.844991 1.000000 \n", "day_9.0 1.000000 5.078497 26.791131 1.000000 \n", "day_10.0 1.000000 9.055914 83.009585 1.000000 \n", "day_11.0 1.000000 5.279225 28.870216 1.000000 \n", "day_12.0 1.000000 5.034608 26.347280 1.000000 \n", "day_13.0 1.000000 5.054236 26.545301 1.000000 \n", "day_14.0 1.000000 4.631319 22.449119 1.000000 \n", "day_15.0 1.000000 4.906584 25.074570 1.000000 \n", "day_16.0 1.000000 5.416369 30.337058 1.000000 \n", "day_17.0 1.000000 4.475257 21.027925 1.000000 \n", "day_18.0 1.000000 4.096451 17.780915 1.000000 \n", "day_19.0 1.000000 4.762006 23.676700 1.000000 \n", "day_20.0 1.000000 3.642574 14.268344 1.000000 \n", "day_21.0 1.000000 4.451305 20.814118 1.000000 \n", "day_22.0 1.000000 6.850586 47.930527 1.000000 \n", "day_23.0 1.000000 6.678180 45.598093 1.000000 \n", "day_24.0 1.000000 9.895965 98.930118 1.000000 \n", "day_25.0 1.000000 7.100804 51.421415 1.000000 \n", "day_26.0 1.000000 6.350294 41.326240 1.000000 \n", "day_27.0 1.000000 6.052291 37.630228 1.000000 \n", "day_28.0 1.000000 4.691883 23.013767 1.000000 \n", "day_29.0 1.000000 4.796166 24.003206 1.000000 \n", "day_30.0 1.000000 5.130217 27.319129 1.000000 \n", "day_31.0 1.000000 8.156521 67.528827 1.000000 \n", "poutcome_failure 1.000000 2.503061 7.265313 1.000000 \n", "poutcome_other 1.000000 4.697324 23.064850 1.000000 \n", "poutcome_success 1.000000 5.150832 27.531067 1.000000 \n", "poutcome_unknown 1.000000 -1.637538 3.681530 1.000000 \n", "\n", " central_moment_3 central_moment_4 sum sum_2 \\\n", "age 0.682366 3.328235 -6.878054e-12 36634.0 \n", "education -0.152303 2.305096 3.410605e-13 36634.0 \n", "default 7.241579 53.440463 1.818989e-12 36634.0 \n", "balance 7.967780 
128.947991 -5.115908e-13 36634.0 \n", "housing -0.224695 1.050488 -1.136868e-12 36634.0 \n", "loan 1.852760 4.432718 2.117417e-12 36634.0 \n", "job_admin. 2.416956 6.841677 -8.384404e-13 36634.0 \n", "job_blue-collar 1.379278 2.902408 1.136868e-12 36634.0 \n", "job_entrepreneur 5.211619 28.160978 -3.552714e-13 36634.0 \n", "job_housemaid 5.786186 34.479949 1.989520e-12 36634.0 \n", "job_management 1.433333 3.054445 1.989520e-12 36634.0 \n", "job_retired 4.159677 18.302913 2.330580e-12 36634.0 \n", "job_self-employed 5.096332 26.972604 -1.563194e-12 36634.0 \n", "job_services 2.838809 9.058838 2.330580e-12 36634.0 \n", "job_student 6.734311 46.350944 -3.410605e-13 36634.0 \n", "job_technician 1.772778 4.142743 -6.821210e-13 36634.0 \n", "job_unemployed 5.608954 32.460364 5.400125e-13 36634.0 \n", "job_unknown 12.338367 153.235297 -5.542233e-13 36634.0 \n", "marital_divorced 2.398540 6.752996 1.008971e-12 36634.0 \n", "marital_married -0.411454 1.169294 -4.888534e-12 36634.0 \n", "marital_single 0.962038 1.925516 2.202682e-12 36634.0 \n", "duration -0.976752 4.099032 2.430056e-12 36634.0 \n", "campaign 4.880232 41.921269 1.477929e-12 36634.0 \n", "pdays 2.619630 10.019985 -2.273737e-13 36634.0 \n", "previous 43.957189 4561.467798 -1.477929e-12 36634.0 \n", "y 0.079310 0.071425 4.303000e+03 4303.0 \n", "contact_cellular -0.623682 1.388979 -1.818989e-12 36634.0 \n", "contact_telephone 3.561451 13.683936 2.160050e-12 36634.0 \n", "contact_unknown 0.941129 1.885723 -4.547474e-12 36634.0 \n", "month_1.0 5.434842 30.537508 -9.094947e-13 36634.0 \n", "month_2.0 3.762560 15.156859 3.069545e-12 36634.0 \n", "month_3.0 9.600201 93.163868 -1.932676e-12 36634.0 \n", "month_4.0 3.504827 13.283811 6.821210e-12 36634.0 \n", "month_5.0 0.859194 1.738215 -4.547474e-13 36634.0 \n", "month_6.0 2.362848 6.583051 -2.728484e-12 36634.0 \n", "month_7.0 1.925799 4.708701 -5.684342e-12 36634.0 \n", "month_8.0 2.082791 5.338016 -6.821210e-13 36634.0 \n", "month_9.0 8.804934 78.526862 
-6.252776e-13 36634.0 \n", "month_10.0 7.722593 60.638450 -3.410605e-13 36634.0 \n", "month_11.0 2.910307 9.469886 2.046363e-12 36634.0 \n", "month_12.0 14.577338 213.498780 2.842171e-13 36634.0 \n", "day_1.0 11.475683 132.691304 1.136868e-13 36634.0 \n", "day_2.0 5.676302 33.220410 -2.046363e-12 36634.0 \n", "day_3.0 6.157210 38.911232 0.000000e+00 36634.0 \n", "day_4.0 5.361931 29.750303 -1.193712e-12 36634.0 \n", "day_5.0 4.524143 21.467870 -2.728484e-12 36634.0 \n", "day_6.0 4.504429 21.289878 2.501110e-12 36634.0 \n", "day_7.0 4.702780 23.116144 -1.591616e-12 36634.0 \n", "day_8.0 4.673862 22.844991 -3.296918e-12 36634.0 \n", "day_9.0 5.078497 26.791131 6.821210e-13 36634.0 \n", "day_10.0 9.055914 83.009585 -1.477929e-12 36634.0 \n", "day_11.0 5.279225 28.870216 -1.591616e-12 36634.0 \n", "day_12.0 5.034608 26.347280 -3.410605e-12 36634.0 \n", "day_13.0 5.054236 26.545301 1.477929e-12 36634.0 \n", "day_14.0 4.631319 22.449119 -5.684342e-13 36634.0 \n", "day_15.0 4.906584 25.074570 2.501110e-12 36634.0 \n", "day_16.0 5.416369 30.337058 3.410605e-13 36634.0 \n", "day_17.0 4.475257 21.027925 3.183231e-12 36634.0 \n", "day_18.0 4.096451 17.780915 9.094947e-13 36634.0 \n", "day_19.0 4.762006 23.676700 -1.364242e-12 36634.0 \n", "day_20.0 3.642574 14.268344 -1.591616e-12 36634.0 \n", "day_21.0 4.451305 20.814118 -2.273737e-12 36634.0 \n", "day_22.0 6.850586 47.930527 -2.273737e-13 36634.0 \n", "day_23.0 6.678180 45.598093 -1.136868e-12 36634.0 \n", "day_24.0 9.895965 98.930118 7.389644e-13 36634.0 \n", "day_25.0 7.100804 51.421415 -2.160050e-12 36634.0 \n", "day_26.0 6.350294 41.326240 -6.821210e-13 36634.0 \n", "day_27.0 6.052291 37.630228 9.094947e-13 36634.0 \n", "day_28.0 4.691883 23.013767 1.136868e-12 36634.0 \n", "day_29.0 4.796166 24.003206 -6.821210e-13 36634.0 \n", "day_30.0 5.130217 27.319129 1.136868e-12 36634.0 \n", "day_31.0 8.156521 67.528827 1.136868e-12 36634.0 \n", "poutcome_failure 2.503061 7.265313 -1.136868e-12 36634.0 \n", "poutcome_other 
4.697324 23.064850 -2.728484e-12 36634.0 \n", "poutcome_success 5.150832 27.531067 -1.250555e-12 36634.0 \n", "poutcome_unknown -1.637538 3.681530 -4.547474e-13 36634.0 \n", "\n", " sum_3 sum_4 \n", "age 2.499780e+04 1.219266e+05 \n", "education -5.579460e+03 8.444490e+04 \n", "default 2.652880e+05 1.957738e+06 \n", "balance 2.918916e+05 4.723881e+06 \n", "housing -8.231462e+03 3.848356e+04 \n", "loan 6.787399e+04 1.623882e+05 \n", "job_admin. 8.854277e+04 2.506380e+05 \n", "job_blue-collar 5.052848e+04 1.063268e+05 \n", "job_entrepreneur 1.909225e+05 1.031649e+06 \n", "job_housemaid 2.119711e+05 1.263138e+06 \n", "job_management 5.250874e+04 1.118965e+05 \n", "job_retired 1.523856e+05 6.705089e+05 \n", "job_self-employed 1.866990e+05 9.881144e+05 \n", "job_services 1.039969e+05 3.318615e+05 \n", "job_student 2.467047e+05 1.698020e+06 \n", "job_technician 6.494396e+04 1.517653e+05 \n", "job_unemployed 2.054784e+05 1.189153e+06 \n", "job_unknown 4.520037e+05 5.613622e+06 \n", "marital_divorced 8.786813e+04 2.473893e+05 \n", "marital_married -1.507319e+04 4.283592e+04 \n", "marital_single 3.524328e+04 7.053936e+04 \n", "duration -3.578234e+04 1.501639e+05 \n", "campaign 1.787824e+05 1.535744e+06 \n", "pdays 9.596753e+04 3.670721e+05 \n", "previous 1.610328e+06 1.671048e+08 \n", "y 4.303000e+03 4.303000e+03 \n", "contact_cellular -2.284795e+04 5.088384e+04 \n", "contact_telephone 1.304702e+05 5.012973e+05 \n", "contact_unknown 3.447731e+04 6.908158e+04 \n", "month_1.0 1.991000e+05 1.118711e+06 \n", "month_2.0 1.378376e+05 5.552564e+05 \n", "month_3.0 3.516938e+05 3.412965e+06 \n", "month_4.0 1.283958e+05 4.866391e+05 \n", "month_5.0 3.147572e+04 6.367775e+04 \n", "month_6.0 8.656057e+04 2.411635e+05 \n", "month_7.0 7.054972e+04 1.724986e+05 \n", "month_8.0 7.630095e+04 1.955529e+05 \n", "month_9.0 3.225600e+05 2.876753e+06 \n", "month_10.0 2.829095e+05 2.221429e+06 \n", "month_11.0 1.066162e+05 3.469198e+05 \n", "month_12.0 5.340262e+05 7.821314e+06 \n", "day_1.0 
4.204002e+05 4.861013e+06 \n", "day_2.0 2.079457e+05 1.216996e+06 \n", "day_3.0 2.255632e+05 1.425474e+06 \n", "day_4.0 1.964290e+05 1.089873e+06 \n", "day_5.0 1.657375e+05 7.864540e+05 \n", "day_6.0 1.650152e+05 7.799334e+05 \n", "day_7.0 1.722817e+05 8.468368e+05 \n", "day_8.0 1.712223e+05 8.369034e+05 \n", "day_9.0 1.860457e+05 9.814663e+05 \n", "day_10.0 3.317544e+05 3.040973e+06 \n", "day_11.0 1.933991e+05 1.057631e+06 \n", "day_12.0 1.844378e+05 9.652063e+05 \n", "day_13.0 1.851569e+05 9.724606e+05 \n", "day_14.0 1.696638e+05 8.224010e+05 \n", "day_15.0 1.797478e+05 9.185818e+05 \n", "day_16.0 1.984233e+05 1.111368e+06 \n", "day_17.0 1.639466e+05 7.703370e+05 \n", "day_18.0 1.500694e+05 6.513860e+05 \n", "day_19.0 1.744513e+05 8.673722e+05 \n", "day_20.0 1.334420e+05 5.227065e+05 \n", "day_21.0 1.630691e+05 7.625044e+05 \n", "day_22.0 2.509644e+05 1.755887e+06 \n", "day_23.0 2.446485e+05 1.670441e+06 \n", "day_24.0 3.625288e+05 3.624206e+06 \n", "day_25.0 2.601308e+05 1.883772e+06 \n", "day_26.0 2.326367e+05 1.513945e+06 \n", "day_27.0 2.217196e+05 1.378546e+06 \n", "day_28.0 1.718824e+05 8.430863e+05 \n", "day_29.0 1.757027e+05 8.793334e+05 \n", "day_30.0 1.879404e+05 1.000809e+06 \n", "day_31.0 2.988060e+05 2.473851e+06 \n", "poutcome_failure 9.169713e+04 2.661575e+05 \n", "poutcome_other 1.720818e+05 8.449577e+05 \n", "poutcome_success 1.886956e+05 1.008573e+06 \n", "poutcome_unknown -5.998956e+04 1.348692e+05 \n", "\n", "[76 rows x 26 columns]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from secretflow.stats.table_statistics import table_statistics\n", "\n", "pd.set_option('display.max_rows', None)\n", "data_stats = table_statistics(vdf)\n", "data_stats" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "pd.reset_option('display.max_rows')" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Security discussion\n", "\n", 
"请注意,全表统计会暴露数据整体统计结果,其背后实际上蕴含了**sf.reveal**,请谨慎使用。" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### 相关系数矩阵\n", "\n", "我们接下来计算特征和特征之间,特征和标签之间的相关系数矩阵。\n", "\n", "> 计算相关系数矩阵时,one-hot编码各列无需参与计算。" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(_run pid=3805540)\u001b[0m /home/zoupeicheng.zpc/miniconda3/envs/py38/lib/python3.8/site-packages/sklearn/base.py:443: UserWarning: X has feature names, but StandardScaler was fitted without feature names\n", "\u001b[2m\u001b[36m(_run pid=3805540)\u001b[0m warnings.warn(\n", "\u001b[2m\u001b[36m(_run pid=3805590)\u001b[0m /home/zoupeicheng.zpc/miniconda3/envs/py38/lib/python3.8/site-packages/sklearn/base.py:443: UserWarning: X has feature names, but StandardScaler was fitted without feature names\n", "\u001b[2m\u001b[36m(_run pid=3805590)\u001b[0m warnings.warn(\n", "\u001b[2m\u001b[36m(_run pid=3807059)\u001b[0m INFO:jax._src.xla_bridge:Unable to initialize backend 'cuda': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'\n", "\u001b[2m\u001b[36m(_run pid=3807059)\u001b[0m INFO:jax._src.xla_bridge:Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'\n", "\u001b[2m\u001b[36m(_run pid=3807059)\u001b[0m INFO:jax._src.xla_bridge:Unable to initialize backend 'tpu': INVALID_ARGUMENT: TpuPlatform is not available.\n", "\u001b[2m\u001b[36m(_run pid=3807059)\u001b[0m INFO:jax._src.xla_bridge:Unable to initialize backend 'plugin': xla_extension has no attributes named get_plugin_device_client. Compile TensorFlow with //tensorflow/compiler/xla/python:enable_plugin_device set to true (defaults to false) to enable this.\n", "\u001b[2m\u001b[36m(_run pid=3807059)\u001b[0m WARNING:jax._src.xla_bridge:No GPU/TPU found, falling back to CPU. 
(Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(_run pid=3805590)\u001b[0m [2023-10-07 18:04:03.441] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63\n", "\u001b[2m\u001b[36m(_run pid=3807059)\u001b[0m [2023-10-07 18:04:03.419] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63\n" ] }, { "data": { "text/plain": [ "array([[1.000, -0.166, -0.020, 0.098, -0.185, -0.014, -0.010, 0.007,\n", " -0.026, 0.000, 0.023],\n", " [-0.166, 1.000, -0.010, 0.066, -0.079, -0.025, -0.002, 0.003,\n", " 0.004, 0.023, 0.070],\n", " [-0.020, -0.010, 1.000, -0.066, -0.005, 0.078, -0.006, 0.019,\n", " -0.031, -0.016, -0.025],\n", " [0.098, 0.066, -0.066, 1.000, -0.070, -0.084, 0.017, -0.019,\n", " 0.003, 0.014, 0.054],\n", " [-0.185, -0.079, -0.005, -0.070, 1.000, 0.042, 0.002, -0.027,\n", " 0.127, 0.039, -0.136],\n", " [-0.014, -0.025, 0.078, -0.084, 0.042, 1.000, -0.008, 0.011,\n", " -0.020, -0.008, -0.067],\n", " [-0.010, -0.002, -0.006, 0.017, 0.002, -0.008, 1.000, -0.181,\n", " 0.018, 0.009, 0.319],\n", " [0.007, 0.003, 0.019, -0.019, -0.027, 0.011, -0.181, 1.000,\n", " -0.089, -0.032, -0.070],\n", " [-0.026, 0.004, -0.031, 0.003, 0.127, -0.020, 0.018, -0.089,\n", " 1.000, 0.440, 0.102],\n", " [0.000, 0.023, -0.016, 0.014, 0.039, -0.008, 0.009, -0.032, 0.440,\n", " 1.000, 0.087],\n", " [0.023, 0.070, -0.025, 0.054, -0.136, -0.067, 0.319, -0.070,\n", " 0.102, 0.087, 1.000]], dtype=float32)" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from secretflow.stats.ss_pearsonr_v import PearsonR\n", "\n", "pearson_r_calculator = PearsonR(spu)\n", "corr_matrix = pearson_r_calculator.pearsonr(vdf_hat)\n", "\n", "import numpy as np\n", "\n", "np.set_printoptions(formatter={'float': lambda x: \"{0:0.3f}\".format(x)})\n", "corr_matrix" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Security discussion\n", 
"\n", "相关系数矩阵的计算需要利用alice和bob两边的数据,因此相关的计算需要使用**SPU device**确保原始数据不被泄露。" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### VIF指标计算\n", "\n", "隐语还支持VIF的计算来进行多重共线性检验。\n", "\n", "> 计算VIF指标时,one-hot编码各列无需参与计算。" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(_run pid=3805590)\u001b[0m /home/zoupeicheng.zpc/miniconda3/envs/py38/lib/python3.8/site-packages/sklearn/base.py:443: UserWarning: X has feature names, but StandardScaler was fitted without feature names\n", "\u001b[2m\u001b[36m(_run pid=3805590)\u001b[0m warnings.warn(\n", "\u001b[2m\u001b[36m(_run pid=3807059)\u001b[0m /home/zoupeicheng.zpc/miniconda3/envs/py38/lib/python3.8/site-packages/sklearn/base.py:443: UserWarning: X has feature names, but StandardScaler was fitted without feature names\n", "\u001b[2m\u001b[36m(_run pid=3807059)\u001b[0m warnings.warn(\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "['age', 'education', 'default', 'balance', 'housing', 'loan', 'duration', 'campaign', 'pdays', 'previous', 'y']\n", "[1.084 1.053 1.012 1.031 1.093 1.018 1.150 1.043 1.279 1.244 1.168]\n" ] } ], "source": [ "from secretflow.stats.ss_vif_v import VIF\n", "\n", "vif_calculator = VIF(spu)\n", "vif_results = vif_calculator.vif(vdf_hat)\n", "print(vdf_hat.columns)\n", "print(vif_results)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### 安全性讨论\n", "\n", "VIF指标的计算需要利用alice和bob两边的数据,因此相关的计算需要使用**SPU device**确保原始数据不被泄露。" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## 模型训练\n", "\n", "接下来,我们将会分别训练一个逻辑回归模型和一个XGB模型。\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### 随机分割\n", "\n", "在训练之前,我们需要将数据分割为训练集和验证集。\n", "\n", "其中train_x和train_y为训练集的特征和标签。test_x和test_y为训练集的特征和标签。\n" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": 
[ "from secretflow.data.split import train_test_split\n", "\n", "random_state = 1234\n", "\n", "train_vdf, test_vdf = train_test_split(vdf, train_size=0.8, random_state=random_state)\n", "\n", "train_x = train_vdf.drop(columns=['y'])\n", "train_y = train_vdf['y']\n", "\n", "test_x = test_vdf.drop(columns=['y'])\n", "test_y = test_vdf['y']" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Security Discussion\n", "\n", "During the random split, the parties share a random seed and each data owner splits its own data locally, which ensures that the resulting partitions remain aligned." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### PSI (Population Stability Index)\n", "\n", "The population stability index (PSI) is an important measure of how much a sample distribution has shifted. It is commonly used to assess sample stability, for example whether the samples are stable between two months. As a rule of thumb, a PSI below 0.1 indicates no significant change, a PSI between 0.1 and 0.25 indicates a fairly significant change, and a PSI above 0.25 indicates a drastic change that needs special attention.\n", "\n", "Next, we take `balance` as an example and check whether the sample distributions of the two splits are close.\n", "\n", "> Depending on business needs, PSI analysis can also be performed during data analysis or feature preprocessing.\n" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "stats_df = table_statistics(train_x['balance'])" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "min_val, max_val = stats_df['min'], stats_df['max']" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:jax._src.xla_bridge:Unable to initialize backend 'cuda': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'\n", "INFO:jax._src.xla_bridge:Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'\n", "INFO:jax._src.xla_bridge:Unable to initialize backend 'tpu': INVALID_ARGUMENT: TpuPlatform is not available.\n", "INFO:jax._src.xla_bridge:Unable to initialize backend 'plugin': xla_extension has no attributes named get_plugin_device_client. Compile TensorFlow with //tensorflow/compiler/xla/python:enable_plugin_device set to true (defaults to false) to enable this.\n", "WARNING:jax._src.xla_bridge:No GPU/TPU found, falling back to CPU. 
(Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n" ] }, { "data": { "text/plain": [ "Array(inf, dtype=float32)" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from secretflow.stats import psi_eval\n", "from secretflow.stats.core.utils import equal_range\n", "import jax.numpy as jnp\n", "\n", "split_points = equal_range(jnp.array([min_val, max_val]), 3)\n", "balance_psi_score = psi_eval(train_x['balance'], test_x['balance'], split_points)\n", "\n", "sf.reveal(balance_psi_score)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Security Discussion\n", "\n", "PSI analysis is a single-party computation, executed on the PYU device of the data owner." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Logistic Regression Model\n", "\n", "Use **ml.linear.ss_sgd.SSRegression** to train a logistic regression model on secret-shared data.\n", "\n", "See the corresponding API documentation for details.\n" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:root:epoch 1 times: 1.159546136856079s\n", "INFO:root:epoch 2 times: 0.9041106700897217s\n", "INFO:root:epoch 3 times: 0.8574354648590088s\n" ] } ], "source": [ "from secretflow.ml.linear.ss_sgd import SSRegression\n", "\n", "lr_model = SSRegression(spu)\n", "lr_model.fit(\n", "    x=train_x,\n", "    y=train_y,\n", "    epochs=3,\n", "    learning_rate=0.1,\n", "    batch_size=1024,\n", "    sig_type='t1',\n", "    reg_type='logistic',\n", "    penalty='l2',\n", "    l2_norm=0.5,\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "You may wonder why the statement above returns so quickly. The reason is that statements in SecretFlow are lazily evaluated: in the example above, **lr_model.fit** is not actually executed until `lr_model` is used." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Security Discussion\n", "\n", "SSRegression training runs on the SPU device, so the raw data of both parties is protected." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### XGBoost Model" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Use 
**ml.boost.ss_xgb_v.Xgb** to train an XGBoost model on secret-shared data.\n", "\n", "See the corresponding API documentation for details." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:root:Create proxy actor with party alice.\n", "INFO:root:Create proxy actor with party bob.\n", "INFO:root:fragment_count 1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:root:prepare time 0.0977473258972168s\n", "\u001b[2m\u001b[36m(_run pid=3806453)\u001b[0m INFO:jax._src.xla_bridge:Unable to initialize backend 'cuda': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'\n", "\u001b[2m\u001b[36m(_run pid=3806453)\u001b[0m INFO:jax._src.xla_bridge:Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'\n", "\u001b[2m\u001b[36m(_run pid=3806453)\u001b[0m INFO:jax._src.xla_bridge:Unable to initialize backend 'tpu': INVALID_ARGUMENT: TpuPlatform is not available.\n", "\u001b[2m\u001b[36m(_run pid=3806453)\u001b[0m INFO:jax._src.xla_bridge:Unable to initialize backend 'plugin': xla_extension has no attributes named get_plugin_device_client. Compile TensorFlow with //tensorflow/compiler/xla/python:enable_plugin_device set to true (defaults to false) to enable this.\n", "\u001b[2m\u001b[36m(_run pid=3806453)\u001b[0m WARNING:jax._src.xla_bridge:No GPU/TPU found, falling back to CPU. 
(Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n", "INFO:root:global_setup time 1.0207350254058838s\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(_run pid=3806453)\u001b[0m [2023-10-07 18:04:08.905] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:root:build & infeed bucket_map fragments [0, 0]\n", "INFO:root:build & infeed bucket_map time 0.47269535064697266s\n", "INFO:root:init_pred time 0.01999831199645996s\n", "INFO:root:epoch 0 tree_setup time 0.06561398506164551s\n", "INFO:root:fragment[0, 0] gradient sum time 0.4283578395843506s\n", "INFO:root:level 0 time 0.633394718170166s\n", "INFO:root:fragment[0, 0] gradient sum time 0.7379474639892578s\n", "INFO:root:level 1 time 0.8940839767456055s\n", "INFO:root:fragment[0, 0] gradient sum time 1.374051809310913s\n", "INFO:root:level 2 time 1.6118881702423096s\n", "INFO:root:fragment[0, 0] gradient sum time 2.799717903137207s\n", "INFO:root:level 3 time 3.1707115173339844s\n", "INFO:root:fragment[0, 0] gradient sum time 5.533723831176758s\n", "INFO:root:level 4 time 5.947847843170166s\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(XgbTreeWorker pid=3820043)\u001b[0m [2023-10-07 18:04:21.826] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63\n", "\u001b[2m\u001b[36m(XgbTreeWorker pid=3820042)\u001b[0m [2023-10-07 18:04:21.809] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:root:epoch 0 time 12.676203489303589s\n", "INFO:root:epoch 1 tree_setup time 0.32253217697143555s\n", "INFO:root:fragment[0, 0] gradient sum time 0.6513419151306152s\n", "INFO:root:level 0 time 0.8437290191650391s\n", "INFO:root:fragment[0, 0] gradient sum time 0.7146239280700684s\n", "INFO:root:level 1 time 0.8953475952148438s\n", "INFO:root:fragment[0, 0] gradient sum time 
1.4147369861602783s\n", "INFO:root:level 2 time 1.6838665008544922s\n", "INFO:root:fragment[0, 0] gradient sum time 2.8456201553344727s\n", "INFO:root:level 3 time 3.1713719367980957s\n", "INFO:root:fragment[0, 0] gradient sum time 5.684190511703491s\n", "INFO:root:level 4 time 6.1044602394104s\n", "INFO:root:epoch 1 time 13.149580478668213s\n", "INFO:root:epoch 2 tree_setup time 0.3217151165008545s\n", "INFO:root:fragment[0, 0] gradient sum time 0.6628327369689941s\n", "INFO:root:level 0 time 0.8769242763519287s\n", "INFO:root:fragment[0, 0] gradient sum time 0.7444479465484619s\n", "INFO:root:level 1 time 0.9811975955963135s\n", "INFO:root:fragment[0, 0] gradient sum time 1.4514589309692383s\n", "INFO:root:level 2 time 1.7142736911773682s\n", "INFO:root:fragment[0, 0] gradient sum time 2.8057940006256104s\n", "INFO:root:level 3 time 3.1123876571655273s\n", "INFO:root:fragment[0, 0] gradient sum time 5.516232490539551s\n", "INFO:root:level 4 time 5.923535585403442s\n", "INFO:root:epoch 2 time 12.82395052909851s\n" ] } ], "source": [ "from secretflow.ml.boost.ss_xgb_v import Xgb\n", "\n", "xgb = Xgb(spu)\n", "params = {\n", "    'num_boost_round': 3,\n", "    'max_depth': 5,\n", "    'sketch_eps': 0.25,\n", "    'objective': 'logistic',\n", "    'reg_lambda': 0.2,\n", "    'subsample': 1,\n", "    'colsample_by_tree': 1,\n", "    'base_score': 0.5,\n", "}\n", "xgb_model = xgb.train(params=params, dtrain=train_x, label=train_y)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Xgb.train executes immediately; please wait patiently." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Security Discussion\n", "\n", "Xgb training runs on the SPU device, so the raw data of both parties is protected." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Model Prediction\n", "\n", "Next, we use the two models we just trained to predict on the test set." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Logistic Regression Model\n", "\n", "In our scenario bob holds the labels of the dataset, so we **reveal** the predictions to bob."
] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "lr_y_hat = lr_model.predict(x=test_x, batch_size=1024, to_pyu=bob)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Security Discussion\n", "\n", "Logistic regression prediction runs on the SPU device, so the raw data of both parties is protected.\n", "\n", "When **to_pyu** is set, the predictions are revealed to that party; otherwise they remain secret-shared." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### XGBoost Model\n", "\n", "In our scenario bob holds the labels of the dataset, so we **reveal** the predictions to bob." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:root:Create proxy actor with party alice.\n", "INFO:root:Create proxy actor with party bob.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(XgbTreeWorker pid=3821892)\u001b[0m [2023-10-07 18:04:49.886] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63\n", "\u001b[2m\u001b[36m(XgbTreeWorker pid=3821893)\u001b[0m [2023-10-07 18:04:49.900] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63\n" ] } ], "source": [ "xgb_y_hat = xgb_model.predict(dtrain=test_x, to_pyu=bob)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Security Discussion\n", "\n", "XGBoost prediction runs on the SPU device, so the raw data of both parties is protected.\n", "\n", "When **to_pyu** is set, the predictions are revealed to that party; otherwise they remain secret-shared." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Model Evaluation\n", "\n", "Next, we evaluate the models on the test set, including:\n", "\n", "- Binary classification evaluation\n", "- Prediction bias (PVA)\n", "- P-Value\n", "- Scorecard transformation" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Binary Classification Evaluation\n", "\n", "SecretFlow has built-in support for binary classification evaluation.\n", "\n", "`BiClassificationEval` computes statistics such as `AUC`, `KS`, `F1 Score`, `Lift`, `Gain`, `Precision` and `Recall`, and provides equal-frequency and equal-width bucket reports (based on the prediction score) as well as a summary report.\n", "\n", "Each bucket evaluates the model's predictions with a different `threshold`. Statistics in the summary report that depend on the `threshold` take the best value across all buckets.\n", "\n", "See the API documentation for details." ] }, { "cell_type": "code", 
"execution_count": 27, "metadata": {}, "outputs": [], "source": [ "from secretflow.stats.biclassification_eval import BiClassificationEval\n", "\n", "biclassification_evaluator = BiClassificationEval(\n", " y_true=test_y, y_score=lr_y_hat, bucket_size=20\n", ")\n", "lr_report = sf.reveal(biclassification_evaluator.get_all_reports())" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "positive_samples: 884.0\n", "negative_samples: 6443.0\n", "total_samples: 7327.0\n", "auc: 0.8957031965255737\n", "ks: 0.6429727077484131\n", "f1_score: 0.547467052936554\n" ] } ], "source": [ "print(f'positive_samples: {lr_report.summary_report.positive_samples}')\n", "print(f'negative_samples: {lr_report.summary_report.negative_samples}')\n", "print(f'total_samples: {lr_report.summary_report.total_samples}')\n", "print(f'auc: {lr_report.summary_report.auc}')\n", "print(f'ks: {lr_report.summary_report.ks}')\n", "print(f'f1_score: {lr_report.summary_report.f1_score}')" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "biclassification_evaluator = BiClassificationEval(\n", " y_true=test_y, y_score=xgb_y_hat, bucket_size=20\n", ")\n", "xgb_report = sf.reveal(biclassification_evaluator.get_all_reports())" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "positive_samples: 884.0\n", "negative_samples: 6443.0\n", "total_samples: 7327.0\n", "auc: 0.8601635694503784\n", "ks: 0.5916707515716553\n", "f1_score: 0.4953111708164215\n" ] } ], "source": [ "print(f'positive_samples: {xgb_report.summary_report.positive_samples}')\n", "print(f'negative_samples: {xgb_report.summary_report.negative_samples}')\n", "print(f'total_samples: {xgb_report.summary_report.total_samples}')\n", "print(f'auc: {xgb_report.summary_report.auc}')\n", "print(f'ks: {xgb_report.summary_report.ks}')\n", 
"print(f'f1_score: {xgb_report.summary_report.f1_score}')" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### 预测偏差\n", "\n", "结果由`abs(mean(Acutal) - mean(Prediction))`计算获得, 值越小越好。" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "PredictionBiasReport(buckets=[BucketPredictionBiasReport(left_endpoint=0.0, left_closed=True, right_endpoint=0.25, right_closed=False, isna=False, avg_prediction=0.0, avg_label=0.14190296828746796, bias=0.14190296828746796, absolute=True), BucketPredictionBiasReport(left_endpoint=0.25, left_closed=True, right_endpoint=0.5, right_closed=False, isna=True, avg_prediction=0, avg_label=0, bias=0, absolute=True), BucketPredictionBiasReport(left_endpoint=0.5, left_closed=True, right_endpoint=0.75, right_closed=False, isna=True, avg_prediction=0, avg_label=0, bias=0, absolute=True), BucketPredictionBiasReport(left_endpoint=0.75, left_closed=True, right_endpoint=1.0, right_closed=True, isna=False, avg_prediction=1.0, avg_label=0.35504791140556335, bias=0.6449520587921143, absolute=True)])" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from secretflow.stats import prediction_bias_eval\n", "\n", "prediction_bias = prediction_bias_eval(\n", " test_y, lr_y_hat, bucket_num=4, absolute=True, bucket_method='equal_width'\n", ")\n", "\n", "sf.reveal(prediction_bias)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "PredictionBiasReport(buckets=[BucketPredictionBiasReport(left_endpoint=0.0, left_closed=True, right_endpoint=0.25, right_closed=False, isna=False, avg_prediction=0.0, avg_label=0.1611528992652893, bias=0.1611528992652893, absolute=True), BucketPredictionBiasReport(left_endpoint=0.25, left_closed=True, right_endpoint=0.5, right_closed=False, isna=True, avg_prediction=0, avg_label=0, bias=0, absolute=True), 
BucketPredictionBiasReport(left_endpoint=0.5, left_closed=True, right_endpoint=0.75, right_closed=False, isna=True, avg_prediction=0, avg_label=0, bias=0, absolute=True), BucketPredictionBiasReport(left_endpoint=0.75, left_closed=True, right_endpoint=1.0, right_closed=True, isna=False, avg_prediction=1.0, avg_label=0.36208581924438477, bias=0.6379141807556152, absolute=True)])" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "xgb_pva_score = prediction_bias_eval(\n", "    test_y, xgb_y_hat, bucket_num=4, absolute=True, bucket_method='equal_width'\n", ")\n", "\n", "sf.reveal(xgb_pva_score)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### P-Value\n", "\n", "The two parties can use the p-values to judge whether a parameter is significant, that is, whether the independent variable effectively explains the variation of the dependent variable, and thereby decide whether the corresponding explanatory variable should be included in the model.\n" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0.773, 0.324, 0.785, 0.513, 0.011, 0.155, 0.992, 0.991, 0.980,\n", " 0.983, 0.999, 0.957, 0.995, 0.990, 0.939, 0.997, 0.996, 0.993,\n", " 0.999, 0.992, 0.990, 0.000, 0.701, 0.828, 0.891, 0.982, 0.986,\n", " 0.976, 0.953, 0.991, 0.745, 0.969, 0.983, 0.976, 0.973, 0.973,\n", " 0.843, 0.839, 0.986, 0.859, 0.940, 0.993, 0.984, 1.000, 0.988,\n", " 0.992, 0.979, 0.999, 0.993, 0.972, 0.997, 0.979, 0.976, 0.999,\n", " 0.970, 0.995, 0.970, 0.994, 0.964, 0.974, 0.998, 0.994, 0.984,\n", " 0.999, 0.972, 0.998, 0.980, 0.998, 0.983, 0.972, 0.975, 0.976,\n", " 0.993, 0.824, 0.979, 0.000])" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from secretflow.stats import SSPValue\n", "\n", "model = lr_model.save_model()\n", "sspv = SSPValue(spu)\n", "pvalues = sspv.pvalues(test_x, test_y, model)\n", "\n", "pvalues" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Scorecard Transformation\n", "\n", "> Strictly speaking, scorecard transformation is post-processing of the predictions and is not part of model evaluation.\n", "\n", "\n", "Let `p` be the probability that `y = 1` and let `odds = p / (1 - p)`. 
The score scale of a scorecard can be defined by expressing the score as a linear function of the log-odds, namely:\n", "\n", "`Score = A - B log(odds)`, where A and B are configurable constants. SecretFlow provides a scorecard transformation; see the API documentation for details." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[489.410],\n", " [452.695],\n", " [453.148],\n", " ...,\n", " [453.148],\n", " [480.906],\n", " [453.148]])" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from secretflow.stats import ScoreCard\n", "\n", "sc = ScoreCard(20, 600, 20)\n", "score = sc.transform(xgb_y_hat)\n", "\n", "sf.reveal(score.partitions[bob])" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Security Discussion\n", "\n", "Except for the P-Value computation, which runs on the SPU device through `SSPValue(spu)`, all of the model evaluation methods above are single-party computations executed on the PYU device of the label owner." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Wrapping Up\n", "\n", "Finally, we clean up the temporary files and shut down the SecretFlow cluster." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "# Remove each temporary file independently so that one missing file does not skip the rest\n", "for path in (alice_path, alice_psi_path, bob_path, bob_psi_path):\n", "    try:\n", "        os.remove(path)\n", "    except OSError:\n", "        pass\n", "\n", "sf.shutdown()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Congratulations! You have completed the full SecretFlow financial risk control pipeline tutorial.\n", "\n", "If you have any suggestions or questions about this tutorial, please contact us via [GitHub Issues](https://github.com/secretflow/secretflow/issues)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.17" }, "vscode": { "interpreter": { "hash": "02db3bab010a384e41503da74327ad4dd04080832919be62bcff46931ddfd4bc" } } }, "nbformat": 4, "nbformat_minor": 2 }