{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Weight Of Evidence encoding" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ ">The following code is for demonstration only. It is **NOT for production** because of system security concerns; please **DO NOT** use it directly in production." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "It is recommended to use [jupyter](https://jupyter.org/) to run this tutorial." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Binning creates buckets of independent variables based on ranking methods. Binning helps us convert continuous variables into categorical ones.\n", "\n", "WOE binning implements binning of numeric variables and factors with respect to a dichotomous target variable.\n", "\n", "```\n", "bin_total = bin_positives + bin_negatives\n", "total_labels = total_positives + total_negatives\n", "bin_woe = log((bin_positives / total_positives) / (bin_negatives / total_negatives))\n", "bin_iv = ((bin_positives / total_positives) - (bin_negatives / total_negatives)) * bin_woe\n", "```\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Currently we provide WOE encoding for vertically partitioned datasets.\n", "\n", "Let's first load a sample dataset." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import secretflow as sf\n", "from secretflow.data.vertical import VDataFrame\n", "from secretflow.utils.simulation.datasets import load_linear" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2023-10-07 18:03:18,152\tINFO worker.py:1538 -- Started a local Ray instance.\n" ] } ], "source": [ "sf.shutdown()\n", "sf.init(['alice', 'bob'], address='local')\n", "alice, bob = sf.PYU('alice'), sf.PYU('bob')\n", "# similarly for woe in heu\n", "spu = sf.SPU(sf.utils.testing.cluster_def(['alice', 'bob']))" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "parts = {\n", " bob: (1, 11),\n", " alice: (11, 22),\n", "}\n", "vdf = load_linear(parts=parts)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "label_data = vdf['y']\n", "y = sf.reveal(label_data.partitions[alice].data).values" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Now, we are ready to perform WOE binning and substitution." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:root:Create proxy actor with party alice.\n", "INFO:root:Create proxy actor with party bob.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(SPURuntime pid=3788331)\u001b[0m 2023-10-07 18:03:20.845 [info] [default_brpc_retry_policy.cc:DoRetry:52] socket error, sleep=1000000us and retry\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(_run pid=3781429)\u001b[0m INFO:jax._src.xla_bridge:Unable to initialize backend 'cuda': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'\n", "\u001b[2m\u001b[36m(_run pid=3781429)\u001b[0m INFO:jax._src.xla_bridge:Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'\n", "\u001b[2m\u001b[36m(_run pid=3781429)\u001b[0m INFO:jax._src.xla_bridge:Unable to initialize backend 'tpu': INVALID_ARGUMENT: TpuPlatform is not available.\n", "\u001b[2m\u001b[36m(_run pid=3781429)\u001b[0m INFO:jax._src.xla_bridge:Unable to initialize backend 'plugin': xla_extension has no attributes named get_plugin_device_client. Compile TensorFlow with //tensorflow/compiler/xla/python:enable_plugin_device set to true (defaults to false) to enable this.\n", "\u001b[2m\u001b[36m(_run pid=3781429)\u001b[0m WARNING:jax._src.xla_bridge:No GPU/TPU found, falling back to CPU. 
(Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(_run pid=3781429)\u001b[0m [2023-10-07 18:03:21.594] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63\n", "\u001b[2m\u001b[36m(SPURuntime pid=3788331)\u001b[0m 2023-10-07 18:03:21.846 [info] [default_brpc_retry_policy.cc:LogHttpDetail:29] cntl ErrorCode '112', http status code '200', response header '', error msg '[E111]Fail to connect Socket{id=0 addr=127.0.0.1:58305} (0x0x48efdc0): Connection refused [R1][E112]Not connected to 127.0.0.1:58305 yet, server_id=0'\n", "\u001b[2m\u001b[36m(SPURuntime pid=3788331)\u001b[0m 2023-10-07 18:03:21.846 [info] [default_brpc_retry_policy.cc:DoRetry:75] aggressive retry, sleep=1000000us and retry\n", "\u001b[2m\u001b[36m(SPURuntime pid=3788331)\u001b[0m 2023-10-07 18:03:22.846 [info] [default_brpc_retry_policy.cc:LogHttpDetail:29] cntl ErrorCode '112', http status code '200', response header '', error msg '[E111]Fail to connect Socket{id=0 addr=127.0.0.1:58305} (0x0x48efdc0): Connection refused [R1][E112]Not connected to 127.0.0.1:58305 yet, server_id=0 [R2][E112]Not connected to 127.0.0.1:58305 yet, server_id=0'\n", "\u001b[2m\u001b[36m(SPURuntime pid=3788331)\u001b[0m 2023-10-07 18:03:22.846 [info] [default_brpc_retry_policy.cc:DoRetry:75] aggressive retry, sleep=1000000us and retry\n", "\u001b[2m\u001b[36m(SPURuntime pid=3788330)\u001b[0m 2023-10-07 18:03:22.849 [info] [default_brpc_retry_policy.cc:DoRetry:69] not retry for reached rcp timeout, ErrorCode '1008', error msg '[E1008]Reached timeout=2000ms @127.0.0.1:34077'\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:root:Create proxy actor with party bob.\n", "INFO:root:Create proxy actor with party alice.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=bob) pid=3788331)\u001b[0m 2023-10-07 18:03:25.092 [info] 
[thread_pool.cc:ThreadPool:30] Create a fixed thread pool with size 63\n", "\u001b[2m\u001b[36m(SPURuntime(device_id=None, party=alice) pid=3788330)\u001b[0m 2023-10-07 18:03:25.107 [info] [thread_pool.cc:ThreadPool:30] Create a fixed thread pool with size 63\n", " x11 x12 x13 x14 x15 x16 x17 \\\n", "0 0.241531 -0.705729 -0.020094 -0.493792 0.851992 0.035219 -0.796096 \n", "1 -0.402727 0.115744 0.468149 -0.735388 0.386395 0.712798 0.239583 \n", "2 0.872675 -0.559321 0.390246 -0.079604 0.225594 -0.639674 0.279511 \n", "3 -0.644718 -0.409382 0.141747 -0.682014 0.314084 -0.802476 0.348878 \n", "4 -0.949669 -0.940787 -0.951708 0.232566 0.272346 0.124419 0.853226 \n", "... ... ... ... ... ... ... ... \n", "9995 -0.031331 -0.078700 -0.020636 -0.493792 0.210120 -0.288943 -0.262945 \n", "9996 0.047039 0.965614 -0.921435 -0.251231 0.205778 0.155392 0.922683 \n", "9997 0.269438 -0.115586 0.928880 0.513349 0.269042 -0.331772 0.520971 \n", "9998 0.999325 0.433372 -0.805999 0.232566 0.072405 0.973399 -0.123470 \n", "9999 -0.203443 0.772931 -0.146181 -0.305728 0.274590 0.803816 -0.312047 \n", "\n", " x18 x19 x20 y \n", "0 0.810261 0.048303 0.937679 1 \n", "1 0.312728 0.526637 0.589773 1 \n", "2 0.039087 -0.753417 0.516735 0 \n", "3 -0.855979 0.250944 0.979465 1 \n", "4 -0.238805 0.243109 -0.121446 1 \n", "... ... ... ... .. \n", "9995 -0.847253 0.069960 0.786748 1 \n", "9996 -0.502486 -0.076290 -0.604832 1 \n", "9997 -0.424209 0.434947 0.998955 1 \n", "9998 0.914291 -0.473056 0.616257 1 \n", "9999 -0.602927 -0.021368 0.885519 0 \n", "\n", "[10000 rows x 11 columns]\n", " x1 x2 x3 x4 x5 x6 x7 \\\n", "0 -0.514226 0.730010 -0.730391 0.970483 0.038063 -0.800808 -0.082006 \n", "1 -0.725537 0.482244 -0.823223 0.202119 0.484290 -0.139781 -0.341216 \n", "2 0.608353 -0.071102 -0.775098 -0.391496 0.224127 0.082370 -0.341216 \n", "3 -0.686642 0.160470 0.914477 -0.269052 0.224127 -0.547841 -0.341216 \n", "4 -0.198111 0.212909 0.950474 0.775259 -0.590206 -0.840528 -0.742056 \n", "... ... ... 
... ... ... ... ... \n", "9995 -0.367246 -0.296454 0.558596 -0.403504 0.542892 0.000142 -0.341216 \n", "9996 0.010913 0.629268 -0.384093 -0.552787 0.542892 -0.100838 0.071673 \n", "9997 -0.238097 0.904069 -0.344859 -0.687887 -0.103900 0.223052 -0.286633 \n", "9998 0.453686 -0.375173 0.899238 0.908135 -0.590206 0.524051 0.347251 \n", "9999 -0.776015 -0.772112 0.012110 -0.898067 0.182405 -0.500491 0.557853 \n", "\n", " x8 x9 x10 \n", "0 -0.499206 -0.750112 -0.910640 \n", "1 -0.652901 0.438065 0.830206 \n", "2 -0.183506 -0.783842 -0.729929 \n", "3 -0.269405 -0.974268 -0.800515 \n", "4 0.800389 0.185542 0.183614 \n", "... ... ... ... \n", "9995 -0.470127 -0.247682 -0.552526 \n", "9996 0.592903 -0.577123 -0.811461 \n", "9997 -0.172245 0.713149 -0.184585 \n", "9998 -0.558997 0.610076 -0.862191 \n", "9999 -0.275658 -0.250420 0.518420 \n", "\n", "[10000 rows x 10 columns]\n" ] } ], "source": [ "from secretflow.preprocessing.binning.vert_woe_binning import VertWoeBinning\n", "from secretflow.preprocessing.binning.vert_bin_substitution import VertBinSubstitution\n", "\n", "binning = VertWoeBinning(spu)\n", "bin_rules = binning.binning(\n", " vdf,\n", " binning_method=\"chimerge\",\n", " bin_num=4,\n", " bin_names={alice: ['x14'], bob: [\"x5\", \"x7\"]},\n", " label_name=\"y\",\n", ")\n", "\n", "woe_sub = VertBinSubstitution()\n", "vdf = woe_sub.substitution(vdf, bin_rules)\n", "\n", "# this is for demo only, be careful with reveal\n", "print(sf.reveal(vdf.partitions[alice].data))\n", "print(sf.reveal(vdf.partitions[bob].data))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Sometimes we may need the iv values. Releasing bin ivs will potentially leak label information according to issue https://github.com/secretflow/secretflow/issues/565.\n", "Currently, we choose to save bin iv values in label holders device. It is up to label holder's choice to\n", "\n", "1. share no iv information\n", "2. 
share some chosen iv information\n", "\n", "We will demonstrate how to share the feature ivs." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Recall that `bin_rules` (the WOE rules returned by `binning.binning`) is a dictionary `{PYU: PYUObject}`, where each `PYUObject` itself is a dictionary of the following type:\n", "```\n", "{\n", " \"variables\":[\n", " {\n", " \"name\": str, # feature name\n", " \"type\": str, # \"string\" for a discrete feature, \"numeric\" for a continuous one\n", " \"categories\": list[str], # categories of a discrete feature\n", " \"split_points\": list[float], # split points of left-open, right-closed intervals\n", " \"total_counts\": list[int], # total sample count in each bin\n", " \"else_counts\": int, # count of np.nan samples\n", " \"filling_values\": list[float], # woe value for each bin\n", " \"else_filling_value\": float, # woe value for np.nan samples\n", " },\n", " # ... other features\n", " ],\n", " # label holder's PYUObject only\n", " # warning: giving bin_ivs to another party will leak the positive samples in each bin.\n", " # it is up to the label holder whether to give feature iv, bin ivs, or all info to workers.\n", " # for more information, look at: https://github.com/secretflow/secretflow/issues/565\n", "\n", " # in the following comments, by safe we mean label distribution info is not leaked.\n", " \"feature_iv_info\" :[\n", " {\n", " \"name\": str, # feature name\n", " \"ivs\": list[float], # iv value for each bin, not safe to share with workers in any case.\n", " \"else_iv\": float, # iv for nan values, may be shared with workers\n", " \"feature_iv\": float, # sum of bin_ivs, safe to share with workers when bin num > 2.\n", " }\n", " ]\n", "}\n", "```" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# alice is label holder\n", "dict_pyu_object = bin_rules[alice]\n", "\n", "\n", "def extract_name_and_feature_iv(list_of_feature_iv_info):\n", " return [(d[\"name\"], d[\"feature_iv\"]) for d in 
list_of_feature_iv_info]\n", "\n", "\n", "feature_ivs = alice(\n", " lambda dict_pyu_object: extract_name_and_feature_iv(\n", " dict_pyu_object[\"feature_iv_info\"]\n", " )\n", ")(dict_pyu_object)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('x14', 0.43219177635839423), ('x5', 0.37848298069087766), ('x7', 0)]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# we can give the feature_ivs to bob\n", "feature_ivs.to(bob)\n", "# and/or we can reveal it to inspect the values\n", "sf.reveal(feature_ivs)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "feature_iv_info = sf.reveal(feature_ivs)\n", "df = pd.DataFrame.from_records(feature_iv_info, columns=['feature', 'iv'])" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "How should we interpret the feature iv?\n", "\n", "- Less than 0.02 -> Not useful for prediction\n", "- 0.02 to 0.1 -> Weak predictive power\n", "- 0.1 to 0.3 -> Medium predictive power\n", "- 0.3 to 0.5 -> Strong predictive power\n", "- Greater than 0.5 -> Suspicious predictive power\n", "\n", "Let's select the top 2 features by iv." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " feature iv\n", "0 x14 0.432192\n", "1 x5 0.378483\n" ] } ], "source": [ "print(df.sort_values('iv', ascending=False).head(2))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Congratulations!\n", "In this tutorial we have learnt how to\n", "\n", "1. do WOE encoding\n", "2. share some iv information with other parties\n", "3. 
use feature iv for feature selection\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": "py38", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.17" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }