{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "a07356b3-744a-4319-9fec-cd62f37fa865", "metadata": {}, "source": [ "# Data Preprocessing with DataFrame" ] }, { "attachments": {}, "cell_type": "markdown", "id": "7d72fa78", "metadata": {}, "source": [ ">The following codes are demos only. It's **NOT for production** due to system security concerns, please **DO NOT** use it directly in production." ] }, { "attachments": {}, "cell_type": "markdown", "id": "e89e9dcf", "metadata": {}, "source": [ "It is recommended to use [jupyter](https://jupyter.org/) to run this tutorial." ] }, { "attachments": {}, "cell_type": "markdown", "id": "8ff05a38-2211-4240-a9db-0d79c813ab99", "metadata": {}, "source": [ "Secretflow provides a variety of preprocessing tools to process data." ] }, { "attachments": {}, "cell_type": "markdown", "id": "25ec1569-f9f7-4f27-90b8-a6c7feab28e2", "metadata": { "tags": [] }, "source": [ "## Preparation\n", "\n", "Initialize secretflow and create two parties alice and bob." ] }, { "attachments": {}, "cell_type": "markdown", "id": "83e1596d-8ca1-40ae-9681-7254c563ff7e", "metadata": {}, "source": [ "> 💡 Before using preprocessing, you may need to get to know secretflow's [DataFrame](../user_guide/preprocessing/DataFrame.ipynb)." ] }, { "cell_type": "code", "execution_count": 1, "id": "9ad74320-2c3a-4c86-aea4-6688d96d2230", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The version of SecretFlow: 1.2.0.dev20231007\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2023-10-07 18:02:54,076\tINFO worker.py:1538 -- Started a local Ray instance.\n" ] } ], "source": [ "import secretflow as sf\n", "\n", "# Check the version of your SecretFlow\n", "print('The version of SecretFlow: {}'.format(sf.__version__))\n", "\n", "# In case you have a running secretflow runtime already.\n", "sf.shutdown()\n", "\n", "sf.init(['alice', 'bob'], address='local')\n", "alice = sf.PYU('alice')\n", "bob = sf.PYU('bob')" ] }, { "attachments": {}, "cell_type": "markdown", "id": "94c83c7b-417a-4772-9de1-2efc589cd89f", "metadata": { "tags": [] }, "source": [ "## Data Preparation" ] }, { "attachments": {}, "cell_type": "markdown", "id": "86168ad6-2fe0-4410-b59c-fd65cbe8ea9b", "metadata": {}, "source": [ "Here we use [iris](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html) as example data." ] }, { "cell_type": "code", "execution_count": 2, "id": "9d7d70b8-2d12-40c0-891e-d42cbd567cab", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal length (cm)sepal width (cm)petal length (cm)petal width (cm)target
05.13.51.40.2setosa
14.9NaN1.40.2setosa
24.73.21.30.2setosa
34.63.11.50.2setosa
45.03.61.40.2setosa
..................
1456.73.05.22.3virginica
1466.32.55.01.9virginica
1476.53.05.22.0virginica
1486.23.45.42.3virginica
1495.93.05.11.8virginica
\n", "

150 rows × 5 columns

\n", "
" ], "text/plain": [ " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \\\n", "0 5.1 3.5 1.4 0.2 \n", "1 4.9 NaN 1.4 0.2 \n", "2 4.7 3.2 1.3 0.2 \n", "3 4.6 3.1 1.5 0.2 \n", "4 5.0 3.6 1.4 0.2 \n", ".. ... ... ... ... \n", "145 6.7 3.0 5.2 2.3 \n", "146 6.3 2.5 5.0 1.9 \n", "147 6.5 3.0 5.2 2.0 \n", "148 6.2 3.4 5.4 2.3 \n", "149 5.9 3.0 5.1 1.8 \n", "\n", " target \n", "0 setosa \n", "1 setosa \n", "2 setosa \n", "3 setosa \n", "4 setosa \n", ".. ... \n", "145 virginica \n", "146 virginica \n", "147 virginica \n", "148 virginica \n", "149 virginica \n", "\n", "[150 rows x 5 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "from sklearn.datasets import load_iris\n", "\n", "iris = load_iris(as_frame=True)\n", "data = pd.concat([iris.data, iris.target], axis=1)\n", "\n", "# In order to facilitate the subsequent display,\n", "# here we first set some data to None.\n", "data.iloc[1, 1] = None\n", "data.iloc[100, 1] = None\n", "\n", "# Restore target to its original name.\n", "data['target'] = data['target'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})\n", "data" ] }, { "attachments": {}, "cell_type": "markdown", "id": "e109ddee-3998-4c1a-8bb6-797cfe9a31f0", "metadata": {}, "source": [ "Create a vertical partitioned DataFrame." ] }, { "cell_type": "code", "execution_count": 3, "id": "90206eca-f690-412d-9c09-e70e6830f6ac", "metadata": {}, "outputs": [], "source": [ "import tempfile\n", "from secretflow.data.vertical import read_csv as v_read_csv\n", "\n", "# Vertical partitioning.\n", "v_alice, v_bob = data.iloc[:, :2], data.iloc[:, 2:]\n", "\n", "# Save to temprary files.\n", "_, alice_path = tempfile.mkstemp()\n", "_, bob_path = tempfile.mkstemp()\n", "v_alice.to_csv(alice_path, index=False)\n", "v_bob.to_csv(bob_path, index=False)\n", "\n", "\n", "df = v_read_csv({alice: alice_path, bob: bob_path})" ] }, { "attachments": {}, "cell_type": "markdown", "id": "338872d7", "metadata": {}, "source": [ "You can also create a horizontal partitioned DataFrame, which works the same with vertical partitioning for subsequent steps." ] }, { "cell_type": "code", "execution_count": 4, "id": "2fd99e07", "metadata": {}, "outputs": [], "source": [ "# from secretflow.data.horizontal import read_csv as h_read_csv\n", "# from secretflow.security.aggregation import PlainAggregator\n", "# from secretflow.security.compare import PlainComparator\n", "\n", "# # Horizontal partitioning.\n", "# h_alice, h_bob = data.iloc[:70, :], data.iloc[70:, :]\n", "\n", "# # Save to temorary files.\n", "# _, h_alice_path = tempfile.mkstemp()\n", "# _, h_bob_path = tempfile.mkstemp()\n", "# h_alice.to_csv(h_alice_path, index=False)\n", "# h_bob.to_csv(h_bob_path, index=False)\n", "\n", "# df = h_read_csv(\n", "# {alice: h_alice_path, bob: h_bob_path},\n", "# aggregator=PlainAggregator(alice),\n", "# comparator=PlainComparator(alice),\n", "# )" ] }, { "attachments": {}, "cell_type": "markdown", "id": "b346a56c-6a5b-4b59-8c36-99344febdb51", "metadata": { "tags": [] }, "source": [ "## Preprocessing\n", "\n", "Secretflow provides missing value filling, standardization, categorical features encoding, discretization .etc, which are similar to sklearn's preprocessing." ] }, { "attachments": {}, "cell_type": "markdown", "id": "c22e1dc7-508c-487b-8434-fb46f2cd0bcf", "metadata": { "tags": [] }, "source": [ "### Missing value filling\n", "\n", "DataFrame provides the fillna method, which can fill in missing values in the same way as pandas." ] }, { "cell_type": "code", "execution_count": 5, "id": "f53eb206-7ba5-44ab-9b70-713af21ae28c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "148" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Before filling, the sepal width (cm) is missing in two positions.\n", "df.count()['sepal width (cm)']" ] }, { "cell_type": "code", "execution_count": 6, "id": "b8c94841-73b3-4eb8-913d-af2f29c15ee2", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "150" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Fill sepal width (cm) with 10.\n", "df.fillna(value={'sepal width (cm)': 10}).count()['sepal width (cm)']" ] }, { "attachments": {}, "cell_type": "markdown", "id": "fd909a7a-0cde-4cdc-b8a6-bbb09a909773", "metadata": { "tags": [] }, "source": [ "### Standardization\n", "#### Scaling features to a range\n", "\n", "Secretflow provides `MinMaxScaler` for scaling features to lie between a given minimum and maximum value. The input and output of MinMaxScaler are both `DataFrame`.\n", "\n", "Here is an exmaple to scale `sepal length (cm)` to the [0, 1] range." ] }, { "cell_type": "code", "execution_count": 7, "id": "51043015-d6e8-4461-b205-eeb3a966fe83", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Min: sepal length (cm) 0.0\n", "dtype: float64\n", "Max: sepal length (cm) 1.0\n", "dtype: float64\n" ] } ], "source": [ "from secretflow.preprocessing import MinMaxScaler\n", "\n", "scaler = MinMaxScaler()\n", "\n", "scaled_sepal_len = scaler.fit_transform(df['sepal length (cm)'])\n", "\n", "print('Min: ', scaled_sepal_len.min())\n", "print('Max: ', scaled_sepal_len.max())" ] }, { "attachments": {}, "cell_type": "markdown", "id": "5a006c19", "metadata": {}, "source": [ "#### Variance scaling\n", "Secretflow provides `StandardScaler` for variance scaling. The input and output of StandardScaler are both DataFrames.\n", "\n", "Here is an exmaple to scale `sepal length (cm)` to unit variance." ] }, { "cell_type": "code", "execution_count": 8, "id": "cabf7b70", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Min: sepal length (cm) -1.870024\n", "dtype: float64\n", "Max: sepal length (cm) 2.492019\n", "dtype: float64\n" ] } ], "source": [ "from secretflow.preprocessing import StandardScaler\n", "\n", "scaler = StandardScaler()\n", "\n", "scaled_sepal_len = scaler.fit_transform(df['sepal length (cm)'])\n", "\n", "print('Min: ', scaled_sepal_len.min())\n", "print('Max: ', scaled_sepal_len.max())" ] }, { "attachments": {}, "cell_type": "markdown", "id": "4d48a464-10df-4a86-82a3-e51fb7d10579", "metadata": { "tags": [] }, "source": [ "### Encoding categorical features\n", "\n", "#### OneHot encoding\n", "Secretflow provides `OneHotEncoder` for OneHot encoding. The input and output of OneHotEncoder are `DataFrame`.\n", "\n", "Here is an example to encode target with onehot." ] }, { "cell_type": "code", "execution_count": 9, "id": "fe2023a4-6e6b-4ed8-892a-78df980e7ded", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Columns: ['target_setosa', 'target_versicolor', 'target_virginica']\n", "Min: \n", " target_setosa 0.0\n", "target_versicolor 0.0\n", "target_virginica 0.0\n", "dtype: float64\n", "Max: \n", " target_setosa 1.0\n", "target_versicolor 1.0\n", "target_virginica 1.0\n", "dtype: float64\n" ] } ], "source": [ "from secretflow.preprocessing import OneHotEncoder\n", "\n", "onehot_encoder = OneHotEncoder()\n", "onehot_target = onehot_encoder.fit_transform(df['target'])\n", "\n", "print('Columns: ', onehot_target.columns)\n", "print('Min: \\n', onehot_target.min())\n", "print('Max: \\n', onehot_target.max())" ] }, { "attachments": {}, "cell_type": "markdown", "id": "3d2434aa-9f1f-40ed-b513-78c179e24635", "metadata": { "tags": [] }, "source": [ "#### Label encoding\n", "\n", "secretflow provides `LabelEncoder` for encoding target labels with value between 0 and n_classes-1. The input and output of LabelEncoder are `DataFrame`.\n", "\n", "Here is an example to encode target to [0, n_classes-1]." ] }, { "cell_type": "code", "execution_count": 10, "id": "3c55619d-c12a-444d-b3a0-ab05d71dacc4", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Columns: ['target']\n", "Min: \n", " target 0\n", "dtype: int64\n", "Max: \n", " target 2\n", "dtype: int64\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(_run pid=3770865)\u001b[0m /home/zoupeicheng.zpc/miniconda3/envs/py38/lib/python3.8/site-packages/sklearn/base.py:443: UserWarning: X has feature names, but MinMaxScaler was fitted without feature names\n", "\u001b[2m\u001b[36m(_run pid=3770865)\u001b[0m warnings.warn(\n", "\u001b[2m\u001b[36m(_run pid=3770865)\u001b[0m /home/zoupeicheng.zpc/miniconda3/envs/py38/lib/python3.8/site-packages/sklearn/base.py:443: UserWarning: X has feature names, but StandardScaler was fitted without feature names\n", "\u001b[2m\u001b[36m(_run pid=3770865)\u001b[0m warnings.warn(\n" ] } ], "source": [ "from secretflow.preprocessing import LabelEncoder\n", "\n", "label_encoder = LabelEncoder()\n", "encoded_label = label_encoder.fit_transform(df['target'])\n", "\n", "print('Columns: ', encoded_label.columns)\n", "print('Min: \\n', encoded_label.min())\n", "print('Max: \\n', encoded_label.max())" ] }, { "attachments": {}, "cell_type": "markdown", "id": "00d70482", "metadata": {}, "source": [ "### Discretization\n", "\n", "SecretFlow provides `KBinsDiscretizer` for partitioning continuous features into discrete values. The input and output of KBinsDiscretizer are both `DataFrame`.\n", "\n", "Here is an example to partition `petal length (cm)` to 5 bins." ] }, { "cell_type": "code", "execution_count": 11, "id": "a3408341", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Min: \n", " petal length (cm) 0.0\n", "dtype: float64\n", "Max: \n", " petal length (cm) 4.0\n", "dtype: float64\n" ] } ], "source": [ "from secretflow.preprocessing import KBinsDiscretizer\n", "\n", "estimator = KBinsDiscretizer(n_bins=5)\n", "binned_petal_len = estimator.fit_transform(df['petal length (cm)'])\n", "\n", "print('Min: \\n', binned_petal_len.min())\n", "print('Max: \\n', binned_petal_len.max())" ] }, { "attachments": {}, "cell_type": "markdown", "id": "343704ae", "metadata": {}, "source": [ "#### WOE encoding\n", "\n", "secretflow provides `VertWoeBinning` to bin the features into buckets by quantile or chimerge method, and calculate the woe value and iv value in each bucket. And `VertBinSubstitution` can substitute the features with the woe value.\n", "\n", "Here is an example to encode features to woe." ] }, { "cell_type": "code", "execution_count": 12, "id": "16e22ed1", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "orig ds in alice:\n", " x1 x2 x3\n", "0 -0.514226 0.730010 -0.730391\n", "1 -0.725537 0.482244 -0.823223\n", "2 0.608353 -0.071102 -0.775098\n", "3 -0.686642 0.160470 0.914477\n", "4 -0.198111 0.212909 0.950474\n", "... ... ... ...\n", "9995 -0.367246 -0.296454 0.558596\n", "9996 0.010913 0.629268 -0.384093\n", "9997 -0.238097 0.904069 -0.344859\n", "9998 0.453686 -0.375173 0.899238\n", "9999 -0.776015 -0.772112 0.012110\n", "\n", "[10000 rows x 3 columns]\n", "orig ds in bob:\n", " x18 x19 x20 y\n", "0 0.810261 0.048303 0.937679 1\n", "1 0.312728 0.526637 0.589773 1\n", "2 0.039087 -0.753417 0.516735 0\n", "3 -0.855979 0.250944 0.979465 1\n", "4 -0.238805 0.243109 -0.121446 1\n", "... ... ... ... ..\n", "9995 -0.847253 0.069960 0.786748 1\n", "9996 -0.502486 -0.076290 -0.604832 1\n", "9997 -0.424209 0.434947 0.998955 1\n", "9998 0.914291 -0.473056 0.616257 1\n", "9999 -0.602927 -0.021368 0.885519 0\n", "\n", "[10000 rows x 4 columns]\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(_run pid=3770865)\u001b[0m /home/zoupeicheng.zpc/miniconda3/envs/py38/lib/python3.8/site-packages/sklearn/base.py:443: UserWarning: X has feature names, but KBinsDiscretizer was fitted without feature names\n", "\u001b[2m\u001b[36m(_run pid=3770865)\u001b[0m warnings.warn(\n", "INFO:root:Create proxy actor with party alice.\n", "INFO:root:Create proxy actor with party bob.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(SPURuntime pid=3776105)\u001b[0m 2023-10-07 18:02:57.469 [info] [default_brpc_retry_policy.cc:DoRetry:52] socket error, sleep=1000000us and retry\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(_run pid=3770127)\u001b[0m INFO:jax._src.xla_bridge:Unable to initialize backend 'cuda': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'\n", "\u001b[2m\u001b[36m(_run pid=3770127)\u001b[0m INFO:jax._src.xla_bridge:Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'\n", "\u001b[2m\u001b[36m(_run pid=3770127)\u001b[0m INFO:jax._src.xla_bridge:Unable to initialize backend 'tpu': INVALID_ARGUMENT: TpuPlatform is not available.\n", "\u001b[2m\u001b[36m(_run pid=3770127)\u001b[0m INFO:jax._src.xla_bridge:Unable to initialize backend 'plugin': xla_extension has no attributes named get_plugin_device_client. Compile TensorFlow with //tensorflow/compiler/xla/python:enable_plugin_device set to true (defaults to false) to enable this.\n", "\u001b[2m\u001b[36m(_run pid=3770127)\u001b[0m WARNING:jax._src.xla_bridge:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(_run pid=3770127)\u001b[0m [2023-10-07 18:02:57.670] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63\n", "\u001b[2m\u001b[36m(SPURuntime pid=3776105)\u001b[0m 2023-10-07 18:02:58.469 [info] [default_brpc_retry_policy.cc:LogHttpDetail:29] cntl ErrorCode '112', http status code '200', response header '', error msg '[E111]Fail to connect Socket{id=0 addr=127.0.0.1:51263} (0x0x32ed440): Connection refused [R1][E112]Not connected to 127.0.0.1:51263 yet, server_id=0'\n", "\u001b[2m\u001b[36m(SPURuntime pid=3776105)\u001b[0m 2023-10-07 18:02:58.469 [info] [default_brpc_retry_policy.cc:DoRetry:75] aggressive retry, sleep=1000000us and retry\n", "\u001b[2m\u001b[36m(SPURuntime pid=3776107)\u001b[0m 2023-10-07 18:02:59.479 [info] [default_brpc_retry_policy.cc:DoRetry:69] not retry for reached rcp timeout, ErrorCode '1008', error msg '[E1008]Reached timeout=2000ms @127.0.0.1:44205'\n", "\u001b[2m\u001b[36m(SPURuntime pid=3776105)\u001b[0m 2023-10-07 18:02:59.469 [info] [default_brpc_retry_policy.cc:LogHttpDetail:29] cntl ErrorCode '112', http status code '200', response header '', error msg '[E111]Fail to connect Socket{id=0 addr=127.0.0.1:51263} (0x0x32ed440): Connection refused [R1][E112]Not connected to 127.0.0.1:51263 yet, server_id=0 [R2][E112]Not connected to 127.0.0.1:51263 yet, server_id=0'\n", "\u001b[2m\u001b[36m(SPURuntime pid=3776105)\u001b[0m 2023-10-07 18:02:59.470 [info] [default_brpc_retry_policy.cc:DoRetry:75] aggressive retry, sleep=1000000us and retry\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:root:Create proxy actor with party alice.\n", "INFO:root:Create proxy actor with party bob.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3776107)\u001b[0m 2023-10-07 18:03:01.500 [info] [thread_pool.cc:ThreadPool:30] Create a fixed thread pool with size 63\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3776105)\u001b[0m 2023-10-07 18:03:01.503 [info] [thread_pool.cc:ThreadPool:30] Create a fixed thread pool with size 63\n", "bin_rules for alice:\n", " {'variables': [{'name': 'x1', 'type': 'numeric', 'split_points': [-0.6048731088638305, -0.2093792676925656, 0.1864844083786014, 0.59245548248291], 'filling_values': [0.13818949789069251, 0.1043626580338657, 0.012473718947119546, -0.08312553911263658, -0.16055365315128886], 'total_counts': [2000, 2000, 2000, 2000, 2000], 'else_filling_value': -0.7620562438001163, 'else_counts': 0}, {'name': 'x2', 'type': 'numeric', 'split_points': [-0.6180597543716427, -0.21352910995483343, 0.18739376068115243, 0.5941788196563724], 'filling_values': [-0.5795513521445242, -0.17800092651085536, 0.02175062133493428, 0.32061945260518093, 0.5508555713857505], 'total_counts': [2000, 2000, 2000, 2000, 2000], 'else_filling_value': -0.7620562438001163, 'else_counts': 0}, {'name': 'x3', 'type': 'numeric', 'split_points': [-0.5902724504470824, -0.19980529546737677, 0.2072824716567998, 0.6102998018264773], 'filling_values': [-0.5371125119817587, -0.25762552591997334, -0.022037294110497735, 0.3445721198562295, 0.6304998785437507], 'total_counts': [2000, 2000, 2000, 2000, 2000], 'else_filling_value': -0.7620562438001163, 'else_counts': 0}]}\n", "bin_rules for bob:\n", " {'variables': [{'name': 'x18', 'type': 'numeric', 'split_points': [-0.595701837539673, -0.18646149635314926, 0.20281808376312258, 0.5969645977020259], 'filling_values': [0.7644870924575128, 0.3796894156855692, 0.09717493242210018, -0.3856750302449858, -0.6258460389655672], 'total_counts': [2000, 2000, 2000, 2000, 2000], 'else_filling_value': -0.7620562438001163, 'else_counts': 0}, {'name': 'x19', 'type': 'numeric', 'split_points': [-0.5988080263137814, -0.2046342611312865, 0.1958462238311768, 0.6044608354568479], 'filling_values': [-0.24268812281101115, -0.18886157950622262, 0.061543825157264156, 0.15773711862524092, 0.24528753075504478], 'total_counts': [2000, 2000, 2000, 2000, 2000], 'else_filling_value': -0.7620562438001163, 'else_counts': 0}, {'name': 'x20', 'type': 'numeric', 'split_points': [-0.6013513207435608, -0.2053116083145139, 0.19144065380096467, 0.5987063169479374], 'filling_values': [1.1083043875152403, 0.5598579367731444, 0.15773711862524092, -0.4618210945247346, -0.9083164208649596], 'total_counts': [2000, 2000, 2000, 2000, 2000], 'else_filling_value': -0.7620562438001163, 'else_counts': 0}], 'feature_iv_info': [{'name': 'x18', 'ivs': (0.0984553650514661, 0.026672043182215357, 0.0018543743703697353, 0.031572379289703925, 0.08527363286990977), 'else_iv': 6.385923806287768e-05, 'feature_iv': 0.24382779476366487}, {'name': 'x19', 'ivs': (0.012260322787780317, 0.0073647296376464335, 0.0007489127774465146, 0.0048277504221347, 0.011464529974064627), 'else_iv': 6.385923806287768e-05, 'feature_iv': 0.03666624559907259}, {'name': 'x20', 'ivs': (0.1887116758575296, 0.055586120695474514, 0.0048277504221347, 0.04568212645465593, 0.18279475271010598), 'else_iv': 6.385923806287768e-05, 'feature_iv': 0.4776024261399007}, {'name': 'x1', 'ivs': (0.003719895277030594, 0.0021358508878795324, 3.104792224414912e-05, 0.001402356358949689, 0.005298781871642081), 'else_iv': 6.385923806287768e-05, 'feature_iv': 0.012587932317746046}, {'name': 'x2', 'ivs': (0.07282166470764677, 0.006530977061248972, 9.424153452104671e-05, 0.019271267842100412, 0.05393057944296773), 'else_iv': 6.385923806287768e-05, 'feature_iv': 0.15264873058848494}, {'name': 'x3', 'ivs': (0.062290057806557865, 0.013846189451494859, 9.751520288052276e-05, 0.022140413942886045, 0.06928418076454097), 'else_iv': 6.385923806287768e-05, 'feature_iv': 0.16765835716836025}]}\n", "substituted ds in alice:\n", " x1 x2 x3\n", "0 0.104363 0.550856 -0.537113\n", "1 0.138189 0.320619 -0.537113\n", "2 -0.160554 0.021751 -0.537113\n", "3 0.138189 0.021751 0.630500\n", "4 0.012474 0.320619 0.630500\n", "... ... ... ...\n", "9995 0.104363 -0.178001 0.344572\n", "9996 0.012474 0.550856 -0.257626\n", "9997 0.104363 0.550856 -0.257626\n", "9998 -0.083126 -0.178001 0.630500\n", "9999 0.138189 -0.579551 -0.022037\n", "\n", "[10000 rows x 3 columns]\n", "substituted ds in bob:\n", " x18 x19 x20 y\n", "0 -0.625846 0.061544 -0.908316 1\n", "1 -0.385675 0.157737 -0.461821 1\n", "2 0.097175 -0.242688 -0.461821 0\n", "3 0.764487 0.157737 -0.908316 1\n", "4 0.379689 0.157737 0.157737 1\n", "... ... ... ... ..\n", "9995 0.764487 0.061544 -0.908316 1\n", "9996 0.379689 0.061544 1.108304 1\n", "9997 0.379689 0.157737 -0.908316 1\n", "9998 -0.625846 -0.188862 -0.908316 1\n", "9999 0.764487 0.061544 -0.908316 0\n", "\n", "[10000 rows x 4 columns]\n" ] } ], "source": [ "# woe binning use SPU or HEU device to protect label\n", "spu = sf.SPU(sf.utils.testing.cluster_def(['alice', 'bob']))\n", "\n", "# Only support binary classification label dataset for now.\n", "# use linear dataset as example\n", "from secretflow.utils.simulation.datasets import load_linear\n", "\n", "vdf = load_linear(parts={alice: (1, 4), bob: (18, 22)})\n", "print(f\"orig ds in alice:\\n {sf.reveal(vdf.partitions[alice].data)}\")\n", "print(f\"orig ds in bob:\\n {sf.reveal(vdf.partitions[bob].data)}\")\n", "\n", "from secretflow.preprocessing.binning.vert_woe_binning import VertWoeBinning\n", "\n", "binning = VertWoeBinning(spu)\n", "bin_rules = binning.binning(\n", " vdf,\n", " binning_method=\"quantile\",\n", " bin_num=5,\n", " bin_names={alice: [\"x1\", \"x2\", \"x3\"], bob: [\"x18\", \"x19\", \"x20\"]},\n", " label_name=\"y\",\n", ")\n", "\n", "print(f\"bin_rules for alice:\\n {sf.reveal(bin_rules[alice])}\")\n", "print(f\"bin_rules for bob:\\n {sf.reveal(bin_rules[bob])}\")\n", "\n", "from secretflow.preprocessing.binning.vert_bin_substitution import VertBinSubstitution\n", "\n", "woe_sub = VertBinSubstitution()\n", "sub_data = woe_sub.substitution(vdf, bin_rules)\n", "\n", "print(f\"substituted ds in alice:\\n {sf.reveal(sub_data.partitions[alice].data)}\")\n", "print(f\"substituted ds in bob:\\n {sf.reveal(sub_data.partitions[bob].data)}\")" ] }, { "attachments": {}, "cell_type": "markdown", "id": "5dd4ddd5", "metadata": {}, "source": [ "## Ending" ] }, { "cell_type": "code", "execution_count": 13, "id": "831cc32e", "metadata": {}, "outputs": [], "source": [ "# Clean up temporary files\n", "\n", "import os\n", "\n", "try:\n", " os.remove(alice_path)\n", " os.remove(bob_path)\n", "except OSError:\n", " pass" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.8.13 ('py3.8')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.17" }, "vscode": { "interpreter": { "hash": "66d1547304beaba725027c44e57cc46fc747862fe9496520910412a3187eb35f" } } }, "nbformat": 4, "nbformat_minor": 5 }